Probability of a non-random sample to represent initial Poisson distribution


I'm looking for the correct way to approach the following signal-versus-noise problem. I am searching for arbitrary structures in seemingly random input data, akin to night vision with a noisy sensor. Say I have an area $S$. The null hypothesis is that it is filled with pure noise: points following a Poisson distribution, with $\lambda$ equal to the average point density. In some cases it may also contain additional points concentrated in a limited sub-area $S'\subset S$. For simplicity, say I am looking for a splat of points that is itself Poisson-distributed, so it cannot be told apart from the surrounding noise by geometric considerations, but whose density differs from the average.

The problem has two parts: I need a way to select sub-areas that contain clusters of points, and I need a way to discriminate between a weak signal and a noise fluctuation. I have a coarse solution for both, but I am struggling to solve the second part for the refined algorithm.

I'm dealing with a finite data set of $N$ points distributed across a finite area $S$, and I can find candidate sub-areas with a probabilistic approach: an algorithm partitions $S$ into a grid of $K$ adjacent cells of fixed area $S'=S/K$, counts how many dots fall in each cell ($n_i$), and then calculates the probability $q_i$ of having $n_i$ or more dots in each surveyed cell, given that the trials are independent (the cells do not intersect) and that the counts obey a binomial distribution that can be approximated by a Poisson distribution with parameter $\lambda'\rightarrow\lambda$. As the next step, using the Bernoulli-trials approach, I can assess the overall probability ($P_i$) of finding ${K \choose 1}$ cells with $n_i$ or more dots in them. After that I can use a Z-score for the Poisson distribution, with a 95% confidence interval, to get something similar to the $3\sigma$ rule and decide which cells do not fit the random distribution, thus classifying them as signal versus surrounding noise with a given confidence. This works as a first proof of concept, but it locates clusters only coarsely, although it does give a probability assessment.
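To make the grid step concrete, here is a minimal Python sketch of it, not my actual implementation. The function names (`poisson_tail`, `grid_counts`, `flag_cells`), the square domain, and the threshold `alpha` are hypothetical choices for illustration. The "overall probability" is read here as the chance that at least one of the $K$ cells reaches $n_i$ by noise alone, which is one possible interpretation of the Bernoulli-trials step.

```python
import math


def poisson_tail(n, mu):
    """P(X >= n) for X ~ Poisson(mu), computed as 1 minus the lower tail.

    Note: for very small tails the subtraction loses precision and may return 0.0.
    """
    term = math.exp(-mu)  # P(X = 0)
    cdf = 0.0
    for k in range(n):
        cdf += term
        term *= mu / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cdf)


def grid_counts(points, side, cells_per_side):
    """Count points in each cell of a square grid covering [0, side)^2."""
    counts = [0] * (cells_per_side ** 2)
    for x, y in points:
        i = min(int(x / side * cells_per_side), cells_per_side - 1)
        j = min(int(y / side * cells_per_side), cells_per_side - 1)
        counts[i * cells_per_side + j] += 1
    return counts


def flag_cells(counts, alpha=0.05):
    """Return (cell index, n_i, q_i, P_i) for cells unlikely under pure noise.

    q_i is the single-cell tail probability; P_i = 1 - (1 - q_i)^K is the
    probability that at least one of K independent cells reaches n_i
    (an assumed reading of the Bernoulli-trials step).
    """
    K = len(counts)
    mu = sum(counts) / K  # expected count per cell under the null hypothesis
    flagged = []
    for idx, n in enumerate(counts):
        q = poisson_tail(n, mu)
        p_any = 1.0 - (1.0 - q) ** K
        if p_any < alpha:
            flagged.append((idx, n, q, p_any))
    return flagged
```

For example, `flag_cells(grid_counts(points, side=10.0, cells_per_side=10))` would list the cells of a 10 by 10 grid whose counts are too high to attribute to noise at the chosen `alpha`.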

A more robust approach involves building a density map and using the DBSCAN clustering algorithm to select a sub-area $S'$. However, because the arbitrary $S'$ is now selected non-randomly, based on that density map, in contrast to the first approach, I can't simply assume $S'$ has the same distribution $\lambda'\rightarrow\lambda$ and calculate a Z-score for it as I did previously. The probability of finding a denser sub-area within a larger area obviously depends on $S$ and $N$, which need to be taken into account somehow. Since there are infinitely many possible dissections of the area into two sub-areas, the binomial approach is not applicable for calculating the overall probability here. There is, however, a constraint on sub-area selection imposed by the algorithm, which involves triangulation: each part must contain at least three dots and have a nonzero area: $S' \cup S'' = S,\ S' \cap S'' = \varnothing,\ S''>0,\ S'>0,\ N'' \ge 3,\ N' \ge 3,\ N=N'+N'',\ \lambda'=N'/S',\ \lambda''=N''/S''$. This removes the boundary cases with zero and infinite $\lambda'$, but that's it. Another formulation of the question: can the probability of the apparent $\lambda'$ of this biased sample deviating from the initial $\lambda$ be characterized, and if so, how?
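To illustrate the selection effect I mean, here is a small Python sketch under simplifying assumptions: instead of DBSCAN it picks the densest cell of a grid as a stand-in for "a sub-area chosen because it looks dense", and it scatters a fixed $N$ points uniformly on the unit square as the null model. The function names and parameters are hypothetical. Simulating the null many times gives an empirical distribution of the selected region's count, against which an observed value could be compared; this is only one possible way to account for the dependence on $S$ and $N$, not a derived answer.

```python
import random


def max_cell_count(n_points, cells_per_side, rng):
    """Scatter n_points uniformly on the unit square; return the largest cell count."""
    counts = [0] * (cells_per_side ** 2)
    for _ in range(n_points):
        i = min(int(rng.random() * cells_per_side), cells_per_side - 1)
        j = min(int(rng.random() * cells_per_side), cells_per_side - 1)
        counts[i * cells_per_side + j] += 1
    return max(counts)


def null_max_distribution(n_points, cells_per_side, n_sims=500, seed=0):
    """Sorted largest-cell counts from n_sims pure-noise simulations."""
    rng = random.Random(seed)
    return sorted(
        max_cell_count(n_points, cells_per_side, rng) for _ in range(n_sims)
    )


def empirical_p_value(observed_max, null_maxima):
    """Fraction of null simulations whose densest cell is at least as full.

    The +1 terms keep the estimate away from exactly zero.
    """
    hits = sum(1 for m in null_maxima if m >= observed_max)
    return (hits + 1) / (len(null_maxima) + 1)
```

With $N = 2000$ points and a 10 by 10 grid the mean count per cell is 20, yet the densest cell in each pure-noise simulation sits well above that, which is exactly why a per-cell Z-score computed against $\lambda$ overstates the significance of a region that was selected for being dense.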

Maybe I need to use a probability density for $\lambda$ together with some kind of Bayesian inference and the beta function, or maybe I need an entirely different approach that my initial coarse attempt, described above, has obscured from me. Please advise.