In machine learning, we often use the L2 norm to keep the weight vector from becoming too "big" according to that norm, and thus to generalize better from the training dataset. However, it is also possible to use norms that are close to L2, that is, norms in which each component is weighted differently according to some criterion. For instance: "put a higher norm weight on components of the weight vector that appear rarely", or other ideas of this kind. Are there any articles/studies about this, i.e. using a tweaked L2 penalty and analyzing the results?
analysis of different L2 norms for regularization
303 Views Asked by Bumbble Comm
There is 1 best solution below
Suppose you want to minimize a function $f$, but you know that the system is ill-conditioned. Then you might use a regularization function to make the problem more stable. So instead of minimizing $f$, you solve \begin{align} \text{minimize} \hspace{8pt} f(x) + \lambda \, R(x) \end{align} where $R$ is a regularization function and $\lambda$ is the regularization parameter.
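As a concrete sketch (with hypothetical data), take $f(x) = \|Ax - b\|_2^2$ and $R(x) = \|x\|_2^2$; this choice has the well-known closed form $x = (A^\top A + \lambda I)^{-1} A^\top b$ (Tikhonov regularization / ridge regression):

```python
import numpy as np

# Minimal sketch with hypothetical data: f(x) = ||Ax - b||^2, R(x) = ||x||^2.
# The regularized minimizer has the closed form
#   x = (A^T A + lambda * I)^{-1} A^T b   (Tikhonov / ridge regression).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
lam = 0.1  # regularization parameter

x_reg = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]  # unregularized least squares
```

For any $\lambda > 0$ the penalty shrinks the solution, so $\|x_{\text{reg}}\|_2 < \|x_{\text{ls}}\|_2$; that shrinkage is what stabilizes an ill-conditioned problem.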
You want to put as much information into your regularization as you can. When you use $R=\|\cdot\|_2$ (the L2 norm) as the regularization function, what you're saying is that you know all of the values should be small. How small? That's what the regularization parameter is for. If $\lambda$ is large, then you're saying that you want all of the values to be very small. If $\lambda$ is small, then you're saying that you want all of the values to be only somewhat small.
Using the $L_2$ norm as a regularization function in this way gives you the optimal estimate when the noise in the system is Gaussian. Bayesian probabilists relate this to the prior distribution of your variable, which quantifies what you know about the variable before solving the problem.
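To spell out that Bayesian correspondence: if the observations are $b = Ax + \varepsilon$ with Gaussian noise $\varepsilon \sim \mathcal N(0, \sigma^2 I)$ and you place a Gaussian prior $x \sim \mathcal N(0, \tau^2 I)$, then the maximum a posteriori (MAP) estimate is exactly the L2-regularized solution:
\begin{align}
\hat x_{\text{MAP}} &= \arg\min_x \left[ -\log p(b \mid x) - \log p(x) \right] \\
&= \arg\min_x \left[ \frac{1}{2\sigma^2}\|Ax - b\|_2^2 + \frac{1}{2\tau^2}\|x\|_2^2 \right] \\
&= \arg\min_x \left[ \|Ax - b\|_2^2 + \lambda \|x\|_2^2 \right], \qquad \lambda = \frac{\sigma^2}{\tau^2}.
\end{align}
A tighter prior (smaller $\tau$) corresponds to a larger $\lambda$, matching the discussion above.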
But what if you know more? What if you know that the values should be small, but the first value should be 10 times smaller than the last? In that case, you use a weighted norm: weight the first component by 10 while all the other components are weighted by 1. This takes advantage of the additional information you have.
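A sketch of that weighted penalty (data and weights here are hypothetical): take $R(x) = x^\top W x = \sum_i w_i x_i^2$ with diagonal $W$, which gives the closed form $x = (A^\top A + \lambda W)^{-1} A^\top b$. To make the effect easy to see, the design below is made orthonormal ($A^\top A = I$), so each component shrinks independently as $x_i = (A^\top b)_i / (1 + \lambda w_i)$:

```python
import numpy as np

# Weighted L2 penalty R(x) = sum_i w_i * x_i^2 = x^T W x, with closed form
#   x = (A^T A + lambda * W)^{-1} A^T b.
# Hypothetical data; QR makes the columns orthonormal so that the shrinkage
# factor for component i is exactly 1 / (1 + lambda * w_i).
rng = np.random.default_rng(1)
A, _ = np.linalg.qr(rng.normal(size=(50, 4)))    # orthonormal columns
b = A @ np.ones(4) + 0.1 * rng.normal(size=50)   # true coefficients all equal to 1

w = np.array([10.0, 1.0, 1.0, 1.0])  # penalize the first component 10x harder
lam = 5.0
x = np.linalg.solve(A.T @ A + lam * np.diag(w), A.T @ b)
```

Even though all four true coefficients are equal, the heavier weight pulls the first estimated coefficient much closer to zero than the others.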
What if you know that the values should be small, but occasionally there will be large values? Then you'd set $R=\|\cdot\|_1$ (the L1 norm), which corresponds to the optimal estimate when the noise is Laplacian.
Boyd and Vandenberghe have a great discussion of this in their book Convex Optimization, which you can get on the internet for free. If you have access to papers, an example of how a weighted norm can be used to good effect can be seen in this paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=7134771. (If you don't have easy access to papers, don't worry about this article. It wouldn't be that useful, just an example.)