How to find a line that divides a data set with the minimum or maximum variance


I study computer science, and many machine learning algorithms are based on the variance of a set. I was wondering: are there known methods for finding a line (or, in $n$ dimensions, a hyperplane) that divides the set into two subsets with minimum or maximum variance?

Best answer:

From a statistics standpoint, this would be called a linear model. Although by itself it doesn't solve this exact problem, with some variation we can adapt it to this problem as well.

First, sort your data into ascending order. I'll call the elements in this ascending order $X_1, X_2, \ \dots \ X_n$.

Now you can plot your data as the points $(i, X_i)_{i \in \{1, \dots, n\}}$ in the plane. The linear model would compute the least-squares line on this plane, meaning the sum of squared deviations from this line is minimal. This is not exactly what we're trying to do, since most of the data might end up on one side of that line and only a few points on the other. So let's try to write down equations that satisfy your criteria.
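For contrast, here is what the plain linear model described above would compute: the ordinary least-squares line on the $(i, X_i)$ plane, via the standard closed-form formulas. This is a minimal sketch; the function name `least_squares_line` is my own.

```python
def least_squares_line(X):
    """Ordinary least-squares fit of y = a*x + b to the points
    (1, X[0]), (2, X[1]), ..., (n, X[n-1])."""
    n = len(X)
    xs = range(1, n + 1)                 # the index axis i = 1..n
    mean_x = sum(xs) / n
    mean_y = sum(X) / n
    cov = sum((i - mean_x) * (x - mean_y) for i, x in zip(xs, X))
    var = sum((i - mean_x) ** 2 for i in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

print(least_squares_line([1, 2, 3]))  # data lies on y = x, prints (1.0, 0.0)
```

Note that nothing here constrains how many points fall on each side of the line, which is exactly the shortcoming the rest of the answer addresses.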

Let's call the axis with the $(i)_{i \in \{1, \dots, n\}}$ values the $x$ axis, and the axis with the $(X_i)_{i \in \{1, \dots, n\}}$ values the $y$ axis. Now we can set up some variables: $a \in \Bbb{R}^+, b \in \Bbb{R}$. So the (yet unknown) best-fitting linear equation is $y = ax+b$.

Now we can set up a criterion for this line to satisfy:

$$\Bigl\lvert\sum_{i \,:\, X_i \ge ai+b}1 \;-\; \sum_{j \,:\, X_j \le aj+b}1\Bigr\rvert \le 1$$

So the number of points above the line and the number below it are the same, up to a deviation of one: one extra point above or below the line is allowed, since this is unavoidable when the number of points is odd.
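The balance criterion can be checked in a few lines of Python (a minimal sketch; the function name `balance` is my own, and point $i$ is compared against $ai + b$ as in the formula above):

```python
def balance(X, a, b):
    """Absolute difference between the number of points on or above
    the line y = a*x + b and the number on or below it.

    X is the data sorted ascending; point i is plotted at (i, X[i-1])
    using the 1-based index i from the text. Points exactly on the
    line count for both sides, matching the criterion.
    """
    above = sum(1 for i, x in enumerate(X, start=1) if x >= a * i + b)
    below = sum(1 for i, x in enumerate(X, start=1) if x <= a * i + b)
    return abs(above - below)

X = [1, 2, 4, 7]             # already sorted
print(balance(X, 0.0, 3.5))  # 2 points above, 2 below: prints 0
```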

And our objective function is:

$$\min\Bigl(\sum_{i=1}^{n}( X_i - (ai+b))^2 \Bigr)$$

That is, we minimize the sum of squared differences between our line and our data points. (Replace $\min$ with $\max$ if you want to maximize, although the solution would probably look quite different.)
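The objective function translates directly into code (a minimal sketch; the function name `objective` is my own):

```python
def objective(X, a, b):
    """Sum of squared vertical residuals between the sorted data,
    plotted at (i, X[i-1]) with 1-based index i, and the line
    y = a*x + b. This is the objective above without the min sign."""
    return sum((x - (a * i + b)) ** 2 for i, x in enumerate(X, start=1))

X = [1, 2, 3]
print(objective(X, 1.0, 0.0))  # data lies exactly on y = x, prints 0.0
```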

Now, to solve this optimization model, I'll use a binary approximation algorithm for the values of $a$ and $b$. I highly recommend drawing this on a piece of paper, or else you might get lost.

  1. Calculate the median of your data. If you have an odd number of points, this is an exact data point. If you have an even number of points, it is the average of the two middle values. $$\text{In the first case:}\ M = \Bigl(\frac{n+1}{2},X_{\frac{n+1}{2}}\Bigr), \\ \text{in the second case:}\ M = \Bigl(\frac{n+1}{2}, \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2}\Bigr)$$

  2. Then, using $M$, we'll draw 2 starting lines (2 guesses for $y = ax+b$; we want the optimal line to lie somewhere between them). $$\text{First starting line:}\ y=M_y \ \text{(the horizontal line through } M \text{, where } M_y \text{ is the second coordinate of } M \text{; let's name it }L_1\text{)}$$

  3. We'll define the second starting line ($L_2$) by 2 points. Its first point is also $M$; its second point is whichever of 2 candidates gives the larger $a$ (larger gradient). (Compute both candidate lines, write down their equations, and choose the one whose $a$ term is larger. If their $a$ terms are equal, they give the same line.) The candidates are: $$(1,X_1-1) \text{, or } (n, X_n+1).$$ This gives us 2 lines ($L_1$ and $L_2$) that bracket our data points from 2 directions. We'll now rotate these lines around $M$ and see which rotation makes our objective function lower. This is where our binary approximation starts.

  4. Calculate the line whose gradient ($a$) is the average of $L_1$'s and $L_2$'s gradients, and that goes through $M$. Call this line $L_3$.

  5. Calculate the objective function for $L_3$, without the $\min$ sign. If it is $0$, we've found our optimal line. If not, continue.

  6. Calculate the line whose gradient ($a$) is the average of $L_1$'s and $L_3$'s and goes through $M$. Call it $L_4$. Then calculate the line whose gradient ($a$) is the average of $L_2$'s and $L_3$'s and goes through $M$. Call it $L_5$.

  7. Compare the objective function (without the $\min$ sign) for $L_4$ and $L_5$. If either is $0$, that's the optimal line. If not, check whether $L_4$'s objective function is lower than $L_5$'s. If so, discard $L_2$ and $L_5$, and set the values of $L_2$ to be $L_3$, and $L_3$ to be $L_4$ ($L_1$ stays the same). If $L_4$'s objective function is greater than or equal to $L_5$'s, discard $L_1$ and $L_4$, and set the values of $L_1$ to be $L_3$, and $L_3$ to be $L_5$ ($L_2$ stays the same).

  8. Go back to Step 5 with your new $L_1, L_2, L_3$ lines. Once you've reached Step 8 (for example) 100 times, call $L_3$ the solution. (It should be extremely close to the true optimum: $L_3$ is then a line that divides the dataset into 2 parts with very nearly minimal variance.)
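The whole procedure above can be sketched in Python. This is a minimal sketch under my own naming (`median_point`, `objective_through_M`, and `fit_line` are hypothetical helper names), assuming at least 2 data points; since every candidate line passes through $M$, it is enough to search over the slope $a$, with $b = M_y - a\,M_x$.

```python
def median_point(X):
    """Step 1: the point M on the (i, X_i) plane (1-based index i)."""
    n = len(X)
    if n % 2 == 1:                       # odd n: an exact data point
        return ((n + 1) / 2, X[(n + 1) // 2 - 1])
    # even n: average of the two middle values
    return ((n + 1) / 2, (X[n // 2 - 1] + X[n // 2]) / 2)

def objective_through_M(X, M, a):
    """Objective (no min sign) for the line of slope a through M."""
    b = M[1] - a * M[0]
    return sum((x - (a * i + b)) ** 2 for i, x in enumerate(X, start=1))

def fit_line(X, iterations=100):
    """Steps 2-8: binary approximation on the slope, between the
    horizontal line L1 (a = 0) and the steeper bracketing line L2."""
    M = median_point(X)
    n = len(X)
    a1 = 0.0                             # L1: horizontal line y = M_y
    # L2: steeper of the lines through (1, X_1 - 1) and (n, X_n + 1)
    candidates = [(M[1] - (X[0] - 1)) / (M[0] - 1),
                  ((X[-1] + 1) - M[1]) / (n - M[0])]
    a2 = max(candidates)
    for _ in range(iterations):
        a3 = (a1 + a2) / 2               # L3: midpoint slope
        a4 = (a1 + a3) / 2               # L4: between L1 and L3
        a5 = (a2 + a3) / 2               # L5: between L2 and L3
        if objective_through_M(X, M, a4) < objective_through_M(X, M, a5):
            a2 = a3                      # keep the lower half (step 7)
        else:
            a1 = a3                      # keep the upper half (step 7)
    a = (a1 + a2) / 2
    return a, M[1] - a * M[0]

a, b = fit_line([1, 2, 3, 4, 5])
print(a, b)  # data lies on y = x, so slope is near 1, intercept near 0
```

Since the objective is a convex quadratic in $a$ for lines through $M$, this interval-halving comparison of $L_4$ and $L_5$ reliably shrinks the bracket around the minimizer, which is why 100 iterations land extremely close to the optimum.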