Machine learning - Cost function for nonlinear functions

The cost function is a measure of the 'cost' of a prediction, i.e., how far the predicted value is from the actual value. In linear regression this can be measured using the mean squared error (MSE); for the logistic function it is done using the log-likelihood. Is this always the case for nonlinear functions that map the inputs and weights to a probability?
All right. First, since you are new to these concepts, I would recommend keeping a good reference book at hand, something like Pattern Recognition and Machine Learning by Bishop.
Now the answer to your question is too big to cover entirely, so I point you to the book again. But I'll try to explain it a bit.
Let us take the classification problem (with only two classes). If you were given a computer able to solve such things automatically, your first idea for a penalty would not be some fancy loss function, but something like $$L_d(h)=\sum_{(x,y)}\mathbb{I}_{\{h(x)\neq y\}}$$ where $h(\cdot)$ is your program, the pairs $(x, y)$ are your labelled examples, and $\mathbb{I}$ is the indicator function.
(You are basically enumerating every point in your data and penalising the program whenever it is incorrect.) The problem is that we cannot dream of optimising this directly on the very large datasets we usually work with, so we take a few approaches to make the penalty more amenable.
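In code, this intuitive penalty is just a counting loop. A minimal Python sketch (the data and the threshold predictor are made-up examples):

```python
# Empirical 0-1 loss: one penalty for every sample the program gets wrong.
def zero_one_loss(h, data):
    """h: a predictor mapping x to a label; data: iterable of (x, y) pairs."""
    return sum(1 for x, y in data if h(x) != y)

# Toy example: a threshold classifier on scalar inputs (hypothetical data).
data = [(-2.0, 0), (-0.5, 0), (0.3, 1), (1.7, 1), (-0.1, 1)]
h = lambda x: int(x > 0)

print(zero_one_loss(h, data))  # -> 1 (only the point (-0.1, 1) is misclassified)
```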
Method 1: Optimisation approach
You need to come up with a loss function over which we can actually optimise $h(\cdot)$. It turns out we have efficient algorithms for this only when the loss is convex. So any convex loss function that upper-bounds and approximates $L_d(h)$ would work, assuming $h$ is linear or convex. (Upper bounding is just a sufficient condition, and we can make do without it.)
You might have heard of the sigmoid function. It gives a very good smooth approximation of $L_d$, but that approximation is not convex. Here the log function can be used to make it convex! (Try it out.)
So, under some conditions, the log is a tool that turns a non-convex approximation into a convex one.
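To make the "try it out" concrete, here is a small numerical check (a sketch with symbols of my own choosing, writing $\sigma$ for the sigmoid of a margin $z$): $1-\sigma(z)$ approximates the 0-1 indicator but is non-convex, while $-\log\sigma(z)$, the logistic loss, is convex.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 2001)
dz = z[1] - z[0]

# 1 - sigma(z): smooth approximation of the 0-1 indicator, but non-convex.
surrogate = 1.0 - sigmoid(z)
# -log sigma(z): the logistic loss you get after applying the log trick.
logistic = -np.log(sigmoid(z))

# Numerical second derivatives; convexity <=> second derivative >= 0 everywhere.
d2_surrogate = np.gradient(np.gradient(surrogate, dz), dz)
d2_logistic = np.gradient(np.gradient(logistic, dz), dz)

print("1 - sigma(z) convex?", (d2_surrogate >= -1e-8).all())  # False
print("-log sigma(z) convex?", (d2_logistic >= -1e-8).all())  # True
```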
Method 2: Probabilistic approach
For any dataset, you assume the labels are generated by a function drawn from some distribution, and you want your function $h(\cdot)$ to parametrise that distribution best. For example, assume the classification process is a Bernoulli trial whose probability is a function of the input, $f(x)$. Your $h(x)$ wants to find/approximate $f(x)$, since we don't know it. Given this, we write the likelihood $$L(\{(x_i, y_i)\}, h) = \mathbb{P}(\{(x_i, y_i)\} \mid f=h),$$ where $\{(x_i, y_i)\}$ is the dataset of samples.
And we try to maximise it (I won't explain this completely here because it is quite vast, and I refer you to the book by Bishop). The MLE, or maximum likelihood estimator, maximises the likelihood, i.e., it finds $\arg\max_{h} L(\{(x_i, y_i)\}, h)$. So the computer, while optimising the loss, is actually computing the MLE.
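As a sketch of the Bernoulli example (the predictions below are made up): if $h(x)$ outputs a probability, the likelihood of the dataset is $\prod_i h(x_i)^{y_i}(1-h(x_i))^{1-y_i}$, and comparing log-likelihoods is exactly comparing (negative) cross-entropy losses.

```python
import numpy as np

def log_likelihood(p, y):
    """Bernoulli log-likelihood of labels y under predicted probabilities p = h(x)."""
    p, y = np.asarray(p), np.asarray(y)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical predictions from two candidate functions h on the same labels.
y = np.array([1, 0, 1, 1, 0])
h_good = np.array([0.9, 0.2, 0.8, 0.7, 0.1])  # close to the labels
h_bad = np.array([0.5, 0.5, 0.5, 0.5, 0.5])   # uninformative

# The MLE picks the h with the larger log-likelihood
# (equivalently, the smaller cross-entropy loss).
print(log_likelihood(h_good, y))  # ~ -1.01
print(log_likelihood(h_bad, y))   # ~ -3.47
```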
Why MLE? I'll leave that for you to explore. But in short, it is consistent and asymptotically efficient: in the large-sample limit it attains the lowest variance achievable by an unbiased estimator (the Cramér-Rao bound).
But the thing is, finding the arg max of the probability function or of its log makes no difference, since log is increasing. So log here is used as a tool to simplify the MLE-based loss. For example, in a Gaussian model your $f$ is usually a function of the mean and variance, so $h$ needs to approximate the mean and variance. Taking the log removes the exponential part and gives us easy functions (like the L2 norm) to optimise over.
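A quick numerical check of that Gaussian case (toy numbers of my own, variance assumed known): the negative log-likelihood is the scaled squared error plus a constant, so maximising the likelihood over the mean is minimising the L2 term.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=100)  # toy samples: true mean 2.0, variance 1.0
mu, sigma2 = 1.5, 1.0               # candidate mean from h; variance fixed/known

# Negative log-likelihood computed straight from the Gaussian density.
density = np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
nll_from_density = -np.sum(np.log(density))

# Taking the log removes the exponential: squared error plus a constant.
l2_form = np.sum((y - mu) ** 2) / (2 * sigma2) \
    + 0.5 * len(y) * np.log(2 * np.pi * sigma2)

print(np.isclose(nll_from_density, l2_form))  # True
```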
There are many other analogies (information-theoretic ones, for example), but the main point is that the loss should be convex and should approximate the intuitive penalty we defined earlier.
Also note that there are losses that don't use the log, like the hinge loss, the Dice loss, etc. And I think I wrote too much, oops.