The problem: I have repeated observations of the ordering of about 25-30 objects, and I have maybe 1000 sets of these objects. I wish to determine which sets of objects are ordered and which are not, but there are a couple complications.
First, there are missing values in each observation. For each observation, about 60% of the objects are observed (on average, but this varies between sets).
Second, I would like to account for some noise: even if a small number of objects swap order or are in the wrong place on a given day, as long as most of the objects are in the same order most of the time, the set should still be labeled as ordered.
What I've done (and some problems with it): So far, I have created a variance-like measure (M) based on a “true” rank. The “true” rank is constructed by comparing each object i to each object j and computing the percent of observations for which rank(i)$<$rank(j). From this, I take a vector of row averages (averaging the percent of times object i comes before every other object j), and I rank these averages. The object whose rank is truly “first” should have a high average, and so on. I then compare this “true” rank to the observed rank in the following way:
$M=\frac{1}{D}\sum_{d=1}^D \frac{1}{n_d^*}\sum_{i=1}^n ( rank_{id}-\overline{rank}_{id} )^2$
where $D$ is the number of observations $d$, $i$ indexes the objects, and $\overline{rank}_{id}$ is the "true" rank for observation $d$. It is specific to the observation, because the "true" rank is adjusted for which objects are present in that observation (since some are missing from each observation). Similarly, $n_d^*$ is the number of objects observed that day (observation). I then use some threshold value as a cutoff point.
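The construction above can be sketched in code. This is a minimal sketch, assuming the data sit in a $D\times n$ array of ranks with `NaN` marking a missing object and every pair of objects co-occurring at least once; the array layout and the function name are my own choices, not from the question:

```python
import numpy as np

def disorder_measure(obs):
    """Variance-like measure M from a D x n array of ranks.

    Rows are observations, columns are objects; np.nan marks an
    object missing from that observation.
    """
    obs = np.asarray(obs, dtype=float)
    D, n = obs.shape

    # P[i, j] = fraction of observations (with both i and j present)
    # in which object i comes before object j.
    P = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            both = ~np.isnan(obs[:, i]) & ~np.isnan(obs[:, j])
            if both.any():
                P[i, j] = np.mean(obs[both, i] < obs[both, j])

    # "True" rank: rank the row averages; the highest average is first.
    avg = np.nanmean(P, axis=1)
    true_rank = np.argsort(np.argsort(-avg)) + 1

    # Per observation: restrict the true ranks to the objects present,
    # re-rank them, and compare with the observed ranks.
    total = 0.0
    for d in range(D):
        present = ~np.isnan(obs[d])
        n_d = present.sum()
        if n_d < 2:
            continue
        tr_local = np.argsort(np.argsort(true_rank[present])) + 1
        ob_local = np.argsort(np.argsort(obs[d, present])) + 1
        total += np.sum((ob_local - tr_local) ** 2) / n_d
    return total / D
```

On perfectly ordered data (even with missing entries) this returns 0, and it grows as objects get shuffled.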
This seems to work, and I've used several thresholds, but the (main) issue is that it is not invariant to set size. That is, sets with higher missing rates appear to have a lower measure (M) than those with lower missing rates, simply because the potential difference in values is greater. Sets with higher missing rates are observably different in other areas, and so this method of dividing the data fails a balancing test. The second issue is that determining the threshold is a bit ad hoc, but I could probably live with that if the other issue was not so glaring.
Other Thoughts: I've started looking at clustering, with the hope that I could test H0: single cluster vs H1: more than one cluster. This is in part because I want to do some hierarchical clustering with the unordered sets once I have determined which they are.
There seem to be a few ways to test for the number of clusters, but (particularly in hierarchical clustering) I don't see many ways to conclude that there is a single cluster. It looks like the gap statistic might be able to do this in more of a k-means context (as well as some others in the k-means context), but testing for the existence of clusters with one type of clustering algorithm and then creating clusters later in the process via another seems strange to me.
I'm sure there are other (possibly obvious) ways to think about an ordering problem like this, and am open to suggestions outside of these areas I've been thinking about.
This is just to record my train of thought and to construct an $M$ that is at least as "intuitive" as the original one but has the property that dropping a random set of objects from an observation won't affect it too much.
Let us have a permutation $\pi$ that deviates from some fixed order (say, the standard $1,2,\dots n$ one) in some places. The original quantity that was suggested to measure the "disorder" was essentially $\frac 1n\sum_{k=1}^n (\pi(k)-k)^2$ and the problem seemed to be that the proposed estimator for it from a randomly chosen subset of varying cardinality about $n/2$ was biased with the bias depending on the cardinality.
There were 2 things I didn't like. One was about the measure itself. It was that the permutation $(n,1,2,\dots,n-1)$ had an $n$ times larger measure of disorder (about $n^2$) than the permutation $(2,1,4,3,\dots, n,n-1)$ (about $n$). One could argue that one big displacement and $n-1$ small ones is worse than $n$ small ones, but "30 times worse" seemed like overkill. So I took the liberty to reduce the power with which to average to $1$ and consider $\frac 1n\sum_{k=1}^n |\pi(k)-k|$ instead.
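To make the comparison concrete, here is a quick numerical check of the two averaging powers on those two permutations at $n=30$ (permutations are 0-based lists here, which is just an implementation choice):

```python
def disorder(pi, power):
    """(1/n) * sum |pi(k) - k|**power over a 0-based permutation."""
    n = len(pi)
    return sum(abs(p - k) ** power for k, p in enumerate(pi)) / n

n = 30
big_shift = [n - 1] + list(range(n - 1))                       # (n,1,2,...,n-1)
adjacent = [k + 1 if k % 2 == 0 else k - 1 for k in range(n)]  # (2,1,4,3,...)

# Squared measure: 29.0 vs 1.0 (about n times larger).
# Absolute measure: 58/30 vs 1.0 (about twice as large).
```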
The second thing was the attempt to discern the actual order before computing the deviations. That introduces some extra uncertainty, because the order estimator based on the ranking may be reasonably correct when the ranks are really different, but can become rather unreliable when the ranks are close. So I wanted an estimator that does not rely on knowledge of the true order.
The first thing I recalled was that $\sum_{k=1}^n |\pi(k)-k|$ is pretty much the same as the total number of inversions in $\pi$. One inequality is obvious: every displaced number participates in at least as many inversions as its displacement. The other inequality is a bit subtler but it also holds with an absolute coefficient independent of $n$. Thus, we can just try to estimate the expected number of inversions per one set element.
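A small sketch to check this two-sided comparability numerically; the two bounds together are the Diaconis–Graham inequality $I\le\sum_k|\pi(k)-k|\le 2I$, where $I$ is the inversion count (the 0-based list representation is my choice):

```python
import random

def inversions(pi):
    """Number of pairs k < l with pi[k] > pi[l]."""
    n = len(pi)
    return sum(1 for k in range(n) for l in range(k + 1, n) if pi[k] > pi[l])

def displacement(pi):
    """Total displacement sum |pi(k) - k|."""
    return sum(abs(p - k) for k, p in enumerate(pi))

# Sanity check of the two-sided bound on random permutations.
random.seed(0)
for _ in range(100):
    pi = list(range(25))
    random.shuffle(pi)
    assert inversions(pi) <= displacement(pi) <= 2 * inversions(pi)
```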
How to count the number of inversions? That's easy: just consider all pairs $i<j$ and count the number of observations in which this pair of objects appears in the wrong order and then sum over $i,j$. The only complication is that, since we do not know the true order, we do not know which order is right and which is wrong. However, if the pair $i,j$ appeared in $B_{i,j}$ observations $A_{i,j}$ times in one order and $B_{i,j}-A_{i,j}$ times in the other one, we definitely have the lower bound $C_{i,j}=\min(A_{i,j},B_{i,j}-A_{i,j})$ for the number of wrong orders and if we believe that the disorder is not too high, it won't be a big stretch of reality to assume that most of the time this lower bound will be the truth (if the disorder is high, then $C_{i,j}$ would be comparable with $B_{i,j}$, so the threshold most likely won't be passed anyway).
This leads to the quantity $S=\sum_{i<j}C_{i,j}$, but we should normalize it somehow. The normalization I chose is "per observed pair", i.e., the denominator is $B=\sum_{i<j}B_{i,j}$, so the meaning is, roughly speaking, the chance for a random observed pair of objects to be swapped. If you want the normalization per element, it should be multiplied by $n-1$ when comparing different $n$'s. Which normalization makes more sense depends on your passing criteria (one can argue that when $n$ gets larger, proportionally higher deviations should be allowed (which would be my idea), but one can also argue that they shouldn't; in a sense, comparing the levels of disorder for different $n$'s is a bit of an art and should be discussed separately).
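A minimal sketch of this first version $M=S/B$, assuming each observation is simply the list of observed objects in their observed order (missing objects are just absent; the representation and function name are mine):

```python
import itertools

def pairwise_disorder(obs):
    """First-attempt M = S/B: the empirical chance that a random
    observed pair of objects appears in its minority order.

    obs: list of observations; each observation is a list of object
    labels in the observed order.
    """
    A = {}  # A[(i, j)] = number of observations with i before j (i < j)
    B = {}  # B[(i, j)] = number of observations containing both i and j
    for order in obs:
        pos = {o: k for k, o in enumerate(order)}
        for i, j in itertools.combinations(sorted(pos), 2):
            B[(i, j)] = B.get((i, j), 0) + 1
            if pos[i] < pos[j]:
                A[(i, j)] = A.get((i, j), 0) + 1
    S = sum(min(A.get(p, 0), B[p] - A.get(p, 0)) for p in B)
    return S / sum(B.values())
```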
Thus we are tempted to put $M=S/B$ and to finish the story. Unfortunately, it still shows the tendency to noticeably go down when you remove objects from observations (to be exact, if you have some noisy set order with random selection and just lower all probabilities of being observed proportionally, it goes down). Let us try to understand why.
We will suppose that for each pair $(i,j)$, being included in an observation does not depend on being swapped, and that swaps in different observations are independent (that is another assumption that may or may not hold; it can be tested from the data, though that doesn't seem necessary, because if it really mattered, you would have observed it in the test you ran removing random objects from existing observations).
So my model is just that for each fixed pair $(i,j)$ you have some probability $p$ (I'll drop the indices $i,j$ for brevity) of appearance and some probability $q$ of being swapped from the true order (I do not assume that these are the same for every pair!). Then $B$ has the binomial distribution $Bin(D,p)$, so its expected value is $\mathcal E(B)=Dp$, and once $B$ is fixed, we are trying to use $C$ as an estimator for the expected number of swaps $qB$.
However our estimator $C$ is biased: we actually have $$ \mathcal E(C)=\sum_{k=0}^B {B\choose k}q^k(1-q)^{B-k}\min(k,B-k)<qB $$ This bias disappears when $B\to\infty$, but it is quite noticeable for small values of $B$ (if you have nothing better to do, you can graph the ratio $\mathcal E(C)/(qB)$ as a function of $B$ for various $q$) and it increases in size when $B$ goes down. So when the probability of appearance goes down, the average bias increases and a lower value of $M$ is observed.
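If you do want to graph that ratio, the exact expectation is a short computation (the function name is mine):

```python
from math import comb

def ec_over_qb(B, q):
    """Exact ratio E(C)/(qB) for C = min(k, B - k), k ~ Bin(B, q)."""
    ec = sum(comb(B, k) * q**k * (1 - q) ** (B - k) * min(k, B - k)
             for k in range(B + 1))
    return ec / (q * B)
```

For small $q$ the ratio is $1-q$ at $B=2,3$ and climbs back toward $1$ as $B$ grows, which is exactly the $B$-dependent bias described above.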
We don't mind the bias per se (just view it as measuring some other quantity) but we certainly want to make it independent of $B$.
When $B=0,1$, $C=0$, so these values carry no information whatsoever, which means that we cannot fix the bias there. Fortunately, $B=0$ makes a $0$ contribution to everything, so we do not need to worry about it. But $B=1$ should just be excluded by force, so we are led to redefining the denominator in the formula for $M$ to $$ \sum_{i<j, B_{i,j}>1}B_{i,j}\,. $$ When $B=2,3$, we have $\mathcal E(C)=Bq(1-q)$, so we would like to maintain this ratio throughout the higher values. This is possible if instead of $\min(k,B-k)$, we write $$ B{B\choose k}^{-1}{B-2\choose k-1}=\frac{k(B-k)}{B-1} $$ mimicking a shorter binomial formula. Note that it differs from $\min(k,B-k)$ by at most a factor of two, so the physical meaning is more or less preserved.
Thus, $S$ gets redefined as $\sum_{i<j,\, B_{i,j}>1}\frac{A_{i,j}(B_{i,j}-A_{i,j})}{B_{i,j}-1}$
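Putting the pieces together, here is a sketch of the corrected estimator, again assuming each observation is the list of observed objects in their observed order (names are mine):

```python
import itertools

def pairwise_disorder_corrected(obs):
    """Bias-corrected M: min(A, B - A) is replaced by A(B - A)/(B - 1),
    whose expectation is B*q*(1-q) for every B > 1, so the ratio
    E(estimate)/(qB) = 1 - q no longer depends on B.  Pairs observed
    at most once are dropped from both numerator and denominator.
    """
    A, B = {}, {}
    for order in obs:
        pos = {o: k for k, o in enumerate(order)}
        for i, j in itertools.combinations(sorted(pos), 2):
            B[(i, j)] = B.get((i, j), 0) + 1
            if pos[i] < pos[j]:
                A[(i, j)] = A.get((i, j), 0) + 1
    S = sum(A.get(p, 0) * (B[p] - A.get(p, 0)) / (B[p] - 1)
            for p in B if B[p] > 1)
    denom = sum(b for b in B.values() if b > 1)
    return S / denom
```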
The only real influence on the expected value of $M$ now comes from dropping the $B=1$ terms, so we need to look at how big a contribution to $\mathcal E(B)$ they make. It is just $Dp(1-p)^{D-1}\approx Dp e^{-Dp}$, i.e., an $e^{-Dp}$ fraction of the whole expectation $Dp$. With your $D=175$ and the individual object appearance rate at least $1/4$ (so the pair appearance probability $p$ is at least about $1/16$ and $Dp\gtrsim 11$), this is less than $e^{-10}$, so we should not bother about it too much.
I would say more, but with MathJax the typing becomes progressively more unpleasant, so I'll stop here. I hope the derivation is more or less clear now, but feel free to ask questions.