What does it mean for data in $((T\times \mathbb{R})^k)^i$ to be close to data in $((T\times \mathbb{R})^k)^j$?


Let $T \subset \mathbb{R}$ be compact (think $[0,1]$). Suppose you have a dataset in $((T\times \mathbb{R})^k)^j$, that is, $j$ collections of $k$ pairs of (time, measurement), where the $k$ pairs are in time order. Let's call this dataset $X$. Suppose we have a second dataset with the same structure, but with $i$ collections of $k$ pairs instead. What does it mean for these datasets to be close? I need a single number that tells me how close they are.

If you want a more concrete example: suppose you have $j=55$ subjects who participated in a study. They were given alcohol and their BAC was measured about every 5 minutes for 2 hours (so at 4 minutes 5 seconds, 11 minutes 2 seconds, and so on), giving $k=25$ measurements per subject. The study was repeated on a different group of $i=95$ people. Suppose we want to claim that the two groups' data were similar. How would you compare their data? My goal is to say that if they are very close, then we can draw certain conclusions, so I am interested in a measure that ensures they are close.

I was thinking of something along these lines: for each individual, interpolate the data so that each person has a function from time to BAC. Shift the functions in time so that the peaks align (if one person starts drinking much later than another, I want to account for this). Average the functions within each group, and let $X(t)$ and $Y(t)$ be the averaged, shifted, interpolated measurements from the first and second group respectively. Then consider $$\max_{t\in T}|X(t)-Y(t)|.$$
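To make the procedure concrete, here is a minimal Python sketch of it. Several choices are my own assumptions rather than part of the setup above: linear interpolation via `np.interp`, alignment at each subject's largest measurement, a common grid of time offsets relative to the peak, and approximating the maximum over $T$ by a maximum over that grid. The function names `align_and_average` and `sup_distance` are hypothetical.

```python
import numpy as np

def align_and_average(times, values, grid):
    """Average peak-aligned, linearly interpolated curves for one group.

    times, values: arrays of shape (n_subjects, k), times increasing per row.
    grid: common grid of time offsets relative to each subject's peak.
    """
    curves = []
    for t, v in zip(times, values):
        t_peak = t[np.argmax(v)]  # time of the largest measurement
        # Shift so the peak sits at offset 0, then interpolate linearly.
        # np.interp holds the end values constant outside the observed range.
        curves.append(np.interp(grid, t - t_peak, v))
    return np.mean(curves, axis=0)

def sup_distance(times_x, values_x, times_y, values_y, grid):
    """Grid approximation of max_t |X(t) - Y(t)| for the two group means."""
    X = align_and_average(times_x, values_x, grid)
    Y = align_and_average(times_y, values_y, grid)
    return float(np.max(np.abs(X - Y)))
```

With this sketch, two groups whose curves differ only by a time shift come out at distance (numerically) zero, while a constant vertical offset between the groups shows up as that offset.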

If this is small, then the groups are "close". This "metric" for closeness is not ideal, though, since a group in which half the people had 0 BAC the whole time and the other half had 0.2 BAC the whole time would be considered close to a group in which everyone had 0.1 BAC. Similarly, a measure of deviation alone wouldn't account for the mean. Is there something I could measure that would guarantee closeness of the data? One problem I have is that there is deviation within each single dataset, and I'm not sure how to account for that.
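The counterexample can be checked numerically. This is only an illustrative sketch: the group sizes (10 people each), the 25 time points, and the use of the pointwise standard deviation as the "measure of deviation" are assumptions of mine.

```python
import numpy as np

# Group A: half the subjects at 0 BAC, half at 0.2 BAC, constant in time.
group_a = np.vstack([np.zeros((5, 25)), np.full((5, 25), 0.2)])
# Group B: every subject at 0.1 BAC, constant in time.
group_b = np.full((10, 25), 0.1)

# Sup-norm distance between the group means: essentially zero.
mean_gap = float(np.max(np.abs(group_a.mean(axis=0) - group_b.mean(axis=0))))

# Largest pointwise standard deviation within each group.
spread_a = float(group_a.std(axis=0).max())  # about 0.1
spread_b = float(group_b.std(axis=0).max())  # about 0
```

The mean-based distance cannot tell the two groups apart, while the within-group spread differs by about 0.1, which is exactly the within-dataset deviation I don't know how to fold into a single number.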

I think the Z test might be closer to what I'm looking for, and I'm going to look into that more. Any help is appreciated. Thanks.