Determine which of $N$ points is not on $\sin(ax + b)$, where $a$ and $b$ are unknown.

98 Views Asked by At

Suppose $N$ points ($(x_1,y_1), (x_2,y_2), ... (x_N,y_N)$) are given from a curve $y=\sin(ax+b)$ where $a, b$ values are unknown. Before giving these $N$ points to you, $y$ coordinate of one point is randomly tampered so that it does not lie on the curve. Write a program to determine which point among $N$ points is NOT on the sinusoidal curve, whose $a$, $b$ values are unknown.

Any logic on how to approach this question, would be highly appreciate would be very thankful!

3

There are 3 best solutions below

1
On BEST ANSWER

HINT :

Consider the point $P_1(x_1\:,\:y_1)$ and a point $P_k(x_k\:,\:y_k)\quad$ in $k>1$. $$\begin{cases} ax_1+b=\sin^{-1}(y_1)\\ ax_k+b=\sin^{-1}(y_k) \end{cases}$$ Solve the linear system for $a_k=a$ and $b_k=b$. See the note $(*)$ below.

Repeat from $k=2$ to $k=n$.

If all points where exactly on the sinusoid, all $a_k$ would be equal one to the others and all $b_k$ would be equal one to the others. This is not the case since one point is outside the sinusoid.

They are two possibility :

  • You detect the point among $P_2\:...\:P_n$ which gives a different result from the others : This is the point outside the sinusoid.

  • All the successive results are different one from the others. Then the point $P_1$ is not on the sinusoid.

(*) Note : The process of comparison will be a bit sophisticated, taking account that the function $\sin^{-1}$ is multivaluated (in extended sens). I let you manage this small difficulty.

1
On

Note that $\sin^{-1}(y) = b + ax$, namely, if your data has no noise, you can literally see which observation is an outlier. Formally, you can run a linear regression (OLS) on $$ \sin^{-1}(y_i) = b + ax_i + \epsilon_i, $$ and then check the Cook's distance, the outlier will have the highest value $D_i$.

0
On

'Almost' any arbitrary set of points can be approximated arbitrarily well by a sine wave if one is willing to take high enough frequencies (details here). However, assuming that the original curve was 'nice' and not too high in frequency, nonlinear optimization would probably be able to converge on the intended $a$ and $b,$ and you could then identify the tampered point as the one with a large remaining residual. In the optimization, I suggest minimizing the residual 1-norm instead of the 2-norm to make your fit less sensitive to the single tampered point.