I'm trying to do a variant of linear regression. To set the context, we normally define a cost function as something like this: $$E=\frac{1}{n} \sum_{i=1}^n \left( h(x_i) - y_i\right)^2$$ Where: $$h(x_i) = \sum_{j=1}^m w_j x_{i,j} = \mathbf{w}^T\mathbf{x}_i$$ Now, following various standard textbooks, we can derive the gradient with respect to $\mathbf{w}$, set the gradient to zero, solve for $\mathbf{w}$, and ultimately derive this formula for linear regression: $$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
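To make the setup concrete, here is a minimal sketch of that closed form in NumPy. The data here is purely illustrative (noiseless targets, made-up dimensions) just to show that the formula recovers the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))         # n = 50 samples, m = 3 features (illustrative)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                       # noiseless targets for the demo

# w = (X^T X)^{-1} X^T y, solved without forming the explicit inverse
w = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w, true_w))       # True: recovers the true weights
```

In practice `np.linalg.lstsq(X, y, rcond=None)` is the numerically safer route, since it avoids forming $\mathbf{X}^T\mathbf{X}$ at all.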
The problem I'm having is that the scale of my $\mathbf{x}$ and $\mathbf{y}$ values is pretty wide, from very small values to very large ones. What linear regression normally does is minimize the squared difference between the predictions $\hat{y}_i = \mathbf{w}^T\mathbf{x}_i$ and the data set values $y_i$. This is a problem for me, because the difference needs to be small for small $y$'s but can be large for large $y$'s. In other words, for larger $y$ values I can tolerate more error.
What actually would work better for me is if I could get the ratio of $h(x)$ and $y$ to be as close as possible to 1. Something like this: $$E=\frac{1}{n} \sum_{i=1}^n \left( \frac{h(x_i)}{y_i}-1\right)^2$$
Questions:
- Is there already such a thing? After some time googling for something like this, I've come up empty, but I'd be amazed if someone hadn't already invented something like this hundreds of years ago. If it already exists, what's it called?
- I get stuck trying to turn this into matrix form. The first obstacle, which I've never dealt with before, is the element-wise vector division; I don't even know a notation for it. How would one get past this in the derivation?
- I could do this using gradient descent. The on-paper derivation (avoiding vector notation) is not a big deal. Besides, I could use sympy or any other algebra library and have the library do the partial derivatives for me. But: Is there a closed form solution for this?
- Maybe this isn't the best way to achieve what I want. I think that making the ratio of computed $\mathbf{\hat{y}}$ and data set $\mathbf{y}$ close to 1 seems most appropriate for my data set, but is there a more sensible approach that would achieve the same goals that someone smarter has already thought of?
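One hedged sketch of the proposed cost in code: since $\left(\frac{h(x_i)}{y_i}-1\right)^2 = \frac{(h(x_i)-y_i)^2}{y_i^2}$, the objective is algebraically a weighted least-squares problem with weights $1/y_i^2$, which does admit a closed form $\mathbf{w} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{y}$ with $\mathbf{W} = \operatorname{diag}(1/y_i^2)$, assuming no $y_i$ is zero. The data below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(1.0, 10.0, size=(100, 2))
true_w = np.array([3.0, 0.7])
y = X @ true_w * rng.normal(1.0, 0.05, size=100)   # multiplicative noise

# weighted least squares with weights 1/y_i^2
W = np.diag(1.0 / y**2)
w_rel = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

def relative_cost(w):
    """Mean squared relative error: (1/n) * sum((h(x_i)/y_i - 1)^2)."""
    return np.mean((X @ w / y - 1.0) ** 2)

# the weighted solution minimizes the relative cost exactly,
# so plain OLS can never beat it on that objective
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(relative_cost(w_rel) <= relative_cost(w_ols) + 1e-12)
```

Because the weighted solution is the exact minimizer of the relative cost, it is guaranteed to score at least as well as ordinary least squares on that objective.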
Note that:
$\begin{align} \ln (1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \dotsb \approx x \quad \text{for small } x, \end{align}$
so that fitting a log-log line is approximately minimizing the relative deviations (as long as the errors are small).
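A minimal sketch of that idea, assuming for illustration a single positive feature with a power-law relationship and small multiplicative errors (all made up here):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 1000.0, size=200)
# power law with small multiplicative (i.e. relative) noise
y = 5.0 * x**1.3 * np.exp(rng.normal(0.0, 0.02, size=200))

# straight-line fit in log space: log y ~ b + a * log x
a, b = np.polyfit(np.log(x), np.log(y), 1)
y_hat = np.exp(b) * x**a

# relative deviations stay uniformly small across three decades of scale,
# whereas a plain linear fit would let the small-y errors blow up
rel_dev = np.abs(y_hat / y - 1.0)
print(rel_dev.max() < 0.2)
```

The point is that a residual in log space is exactly $\ln(\hat{y}/y)$, which by the expansion above is approximately the relative deviation $\hat{y}/y - 1$ when the errors are small.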