Does "Big Data" Have a Ramsey Theory Problem?


I'm erring on the side of conservatism by asking here rather than on MO, as it is possible this is a complex question.

"Big Data" is the Silicon Valley term for the issues surrounding the huge amounts of data being produced by the global IT structure. Advanced mathematics is starting to pay attention to this, with very early thoughts on topological approaches. For example, see the Wiki here.

But one obvious way to think about patterns in Big Data is as edge-colored complete graphs: let the vertices be your data points, let the edges represent relations between them, and let each color encode a specific relation (these relations being the objective of Big Data visualization), with some neutral color representing "no relation."

By the very definitions of Big Data and Ramsey theory, this virtually guarantees the existence of monochromatic complete subgraphs, which may be nothing but spurious relations that must exist simply because of Ramsey's theorem.
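To see the guarantee at its smallest scale: $R(3,3) = 6$, so every 2-coloring of the 15 edges of $K_6$ contains a monochromatic triangle. The case is small enough to verify by brute force (a sketch in Python; the graph and coloring encoding are just one convenient choice):

```python
from itertools import combinations, product

# Every 2-coloring of the edges of K6 contains a monochromatic triangle,
# because R(3,3) = 6. We check all 2^15 colorings of K6's 15 edges.
n = 6
edges = list(combinations(range(n), 2))       # the 15 edges of K6
triangles = list(combinations(range(n), 3))   # the 20 vertex triples

def has_mono_triangle(coloring):
    """True if this edge coloring (a tuple of 15 colors) contains a
    triangle whose three edges all share one color."""
    color = dict(zip(edges, coloring))
    return any(
        color[(a, b)] == color[(a, c)] == color[(b, c)]
        for a, b, c in triangles
    )

# Exhaustive check over all 32768 two-colorings:
assert all(has_mono_triangle(c) for c in product((0, 1), repeat=len(edges)))
print("Every 2-coloring of K6 contains a monochromatic triangle.")
```

On $K_5$ the analogous check fails (there is a triangle-free 2-coloring), which is exactly why the Ramsey number is 6 and not 5.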

I am NOT a graph theorist in any way. So what I am asking specifically is this:

Are there other techniques that can be overlaid onto a graph-theoretic approach to add information about whether the monochromatic subgraphs are real "signal" rather than Ramsey noise? Or am I misunderstanding, and the Ramsey structures are not actually noise?

3 Answers

Best Answer (score 2)

As requested, I'll post the comment above as an answer:

The OP is right that there are inevitably patterns in large data sets, and in fact often ones of the sort that we want to find. Here are a couple of very common examples.

In statistics, traditionally you do "hypothesis testing," where you try to find evidence for or against the hypothesis that a parameter has a particular value, for example that the mean height of American males is 5'11''. You do this by measuring the mean height of a sample and then seeing whether it is "significantly" different from 5'11''. The problem is that if your sample is big enough, the difference is almost always "significant": the standard error shrinks as the sample grows, so even a trivially small true difference eventually registers as statistically significant.
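The numbers below are made up for illustration, but they show the mechanism: for a fixed true difference, the z-statistic grows like $\sqrt{n}$, so a 0.1-inch discrepancy that is typically invisible at $n = 100$ becomes "highly significant" at $n = 100{,}000$:

```python
import math
import random

random.seed(0)

H0_MEAN = 71.0     # hypothesized mean height in inches (5'11")
TRUE_MEAN = 71.1   # true mean: only 0.1 inch away (made-up numbers)
SIGMA = 3.0        # assumed population standard deviation

def z_statistic(n):
    """z-statistic for testing mean == H0_MEAN on a simulated sample."""
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(n)]
    mean = sum(sample) / n
    return (mean - H0_MEAN) / (SIGMA / math.sqrt(n))

# Expected z is 0.1 * sqrt(n) / 3: about 0.33 at n = 100 (typically
# "not significant") versus about 10.5 at n = 100,000 ("significant").
print(f"n = 100:      z = {z_statistic(100):6.2f}")
print(f"n = 100,000:  z = {z_statistic(100_000):6.2f}")
```

Whether a 0.1-inch difference in mean height *matters* is a separate question from whether it is detectable, which is the point of the answer.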

Another example is finance, where people called technical analysts look for support and resistance patterns. A support pattern is where a price repeatedly falls to a certain level (say \$20) and then bounces back up again; this is taken as evidence that people are buying whenever the price reaches \$20. However, such patterns also appear very commonly in random walks, so it is often unclear whether they represent anything real about the financial situation.
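A toy simulation makes the point. The definition of a "support level" below is ad hoc (the walk returns to its overall minimum at least three times), and the walk length and threshold are arbitrary choices, but a sizeable fraction of purely random walks show the pattern:

```python
import random

random.seed(1)

def shows_support(steps=250):
    """Simulate a +/-1 random walk and report whether it 'bounces' off
    its overall minimum at least 3 times (our ad-hoc 'support level')."""
    price, path = 100, []
    for _ in range(steps):
        price += random.choice((-1, 1))
        path.append(price)
    floor = min(path)
    return path.count(floor) >= 3

trials = 2000
rate = sum(shows_support() for _ in range(trials)) / trials
print(f"{rate:.0%} of purely random walks 'bounce' off a support level")
```

The price series here has no memory at all, yet a chartist scanning many such series would regularly find convincing-looking floors.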

It is said that "If you torture the data for long enough, it will confess." Some people use the term Data Mining in a pejorative sense to refer to finding patterns which aren't really there, in the sense that all sufficiently large data sets contain such patterns just by chance.

One of our main defences against this problem is to split the data randomly into subsets and look for patterns in one subset, then see whether they generalize to the other subsets. In machine learning, people say "training" instead of "fitting a statistical model" or "looking for patterns" and then talk about "validation". The idea is that the validation should show you how your model is likely to perform on unseen data. If your model learns a spurious pattern, then you hope that this will show up as a poor performance on the validation data. Such a model is said to be over-fitted.
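A toy illustration of the idea: "train" on pure noise by memorizing it. Everything below (features, labels, the 50/50 split) is synthetic; the point is that training accuracy is near-perfect while held-out accuracy stays near chance, which is exactly how a validation split exposes an over-fitted model:

```python
import random

random.seed(2)

# 2000 examples of pure noise: 20 random binary features, a random label.
data = [(tuple(random.choice((0, 1)) for _ in range(20)),
         random.choice((0, 1)))
        for _ in range(2000)]
train, valid = data[:1000], data[1000:]

# The "model" is a lookup table that memorizes the training set.
memory = {x: y for x, y in train}

def predict(x):
    return memory.get(x, 0)   # unseen feature vectors: default guess 0

train_acc = sum(predict(x) == y for x, y in train) / len(train)
valid_acc = sum(predict(x) == y for x, y in valid) / len(valid)
print(f"train accuracy: {train_acc:.2f}")   # near-perfect memorization
print(f"valid accuracy: {valid_acc:.2f}")   # near chance: it was noise
```

Since the labels carry no information about the features, no generalization is possible, and the gap between the two accuracies is the signature of over-fitting.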

Splitting the whole data set into $k$ subsets, fitting a model to $k-1$ of them and validating it on the remaining one, and doing this $k$ times, once for each subset, is called $k$-fold cross-validation and is a common way of estimating how your model will perform on new data. This is why the Stackexchange statistics site is called Cross Validated.
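A minimal sketch of the index bookkeeping behind $k$-fold cross-validation (plain Python; no shuffling here, though real implementations usually shuffle the data first):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, valid_indices) pairs for k-fold
    cross-validation: each fold serves as the validation set once."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Early folds absorb the remainder, one extra sample each.
        stop = start + fold_size + (1 if fold < remainder else 0)
        valid = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, valid
        start = stop

# 10 samples, 3 folds: validation folds of sizes 4, 3, 3 that together
# cover every sample exactly once.
for train, valid in k_fold_splits(10, 3):
    print(valid)   # [0, 1, 2, 3] then [4, 5, 6] then [7, 8, 9]
```

Averaging the $k$ validation scores gives a less noisy performance estimate than a single split, at the cost of fitting the model $k$ times.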

The motivation for this procedure is that we really want to see how our patterns generalize to unseen data. If they are real patterns, they should show up in unseen data as well. But if we don't have any unseen data yet, we just pretend that some of the data we have got is unseen, and use it to validate the model. And it makes perfect sense, because usually our reason for wanting to fit models/find patterns/learn is to make predictions about new data that we haven't seen yet.

Answer (score 7)

Ramsey's theorem has little to say about real data sets: they contain much bigger structures and anti-structures than the theorem predicts, and the theorem only guarantees that at least one (relatively small) monochromatic subgraph must exist.

Giant monochromatic anti-structures (such as tens of millions of people on earth, none of whom are friends with each other) are there because the graph is sparse. Monochromatic structures far larger than anything that might have been forced by Ramsey's theorem (had there not been anti-structure) are there because of the small-world phenomenon, where people related to you are likely to be related to each other.
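This can be made quantitative. By the Erdős–Szekeres bound $R(k,k) \le 4^{k-1}$, a 2-colored complete graph on $n$ vertices is only *guaranteed* a monochromatic clique of size roughly $\frac{1}{2}\log_2 n$, which is tiny next to the structures described above:

```python
def guaranteed_clique_size(n):
    """Largest k such that R(k, k) <= 4^(k-1) <= n, i.e. the monochromatic
    clique size Ramsey's theorem forces in any 2-coloring of K_n
    (via the Erdos-Szekeres bound R(k,k) <= 4^(k-1))."""
    k = 1
    while 4 ** k <= n:
        k += 1
    return k

for n in (10**3, 10**6, 10**9):
    print(f"n = {n:>13,}: guaranteed clique size >= {guaranteed_clique_size(n)}")
# n = 10^3 -> 5, n = 10^6 -> 10, n = 10^9 -> 15
```

So even a billion-vertex data set is only Ramsey-guaranteed a monochromatic clique of 15 vertices, while real social networks contain monochromatic structures with millions of vertices: those come from the data, not from the theorem.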

Answer (score 0)

I don't know if "Big Data" has a Ramsey subgraph problem, but many, many big data practitioners have an "applied Ramsey Theory" problem.

The fundamental theorem of applied Ramsey Theory is:

Theorem: For any algorithm, if you look hard enough, and define terms loosely enough, you can find data that the algorithm fits perfectly.

This, of course, has a well-known corollary.

Corollary: For any algorithm for which you seek funding, there exists a data set that can be declared interesting enough to fill a proposal.

This phenomenon is particularly noticeable in many big data applications; semantic search is a wonderful example. Often, you find researchers who see an interesting problem but don't have the sophistication to tackle it (because the needed sophistication may not even exist yet). So they create frameworks, and then create problems on those frameworks, while eliding any discussion of the degree to which the framework matches reality.

Of course, they're very good at creating problems within frameworks and then solving them, but it is not yet clear that they've actually manifested progress towards any true mathematical understanding.


That said, a foundation of mathematics is understanding simpler cases of the ones that are interesting, so it may be that what's being done in the "big data" world is indeed very interesting. But I haven't seen any wonderful breakthroughs yet, despite seeing many wonderful solutions to problems that don't actually exist.