I am a Java programmer who has reached the limits of brute computer power. My relational database (and non relational databases) is not producing results quick enough and I have hit a bottleneck in the software so I am turning to math possibly improve my algorithm to my problem. I have data, lots of data! I am collecting weather data for 30 days and each day contains just over 100,000 variables. The variables can reoccur on the other days or not. In total there are 12 billion different variables which can occur. So on each day only 100,000 of these 12 billion occur, but again these can reoccur on any other day or not. I am trying to see if say varA occurs varH occured 20 times in a 30 day period, or when varB occurs varZ occured 12 times in those 30 days. So currently my approach is running a nested loop through each day and counting each entry in a master count table.
Ex Master count
varA-varA : count
varA-varB : count
varA-varC : count
.
.
.
varZ-VarX : count
varZ-VarY : count
varZ-VarZ : count
I have been thinking of different ways to solve this. One approach I thought of is to break each day into a set, then extract the similarties between them somehow. Is there a method to count combinations in sets quickly? Is there already a mathematical method which can do what I need?
Since you only keep $30$ days of history and at most $10^5$ event occurs each day.
Out of $12\times 10^9$ events, at most $3\times 10^6 = 30 \times 10^5$ events will occur.
A way to deal with this is build a list of events that actually occur and for each event that occur, record the set of days where the event occur. $30$ is a small enough number, you can store the history of an event as the bits on an unsigned 32-bit integer. On that integer, a bit is set if and only if the event appear on corresponding date.
In pseudo code, you can build the list incrementally.
If you choose to represent the history as an unsigned 32 bit integer, the determination of the count of common occurrence of two events is easy. You look up the two numbers corresponds to the history of the two events, perform a bit-wise AND on the two numbers and then count the number of set bits in the result.
The memory requirement of storing $3\times 10^6$ numbers is of the order of ten megabytes. This is very small for today's computer capacity. With the right package/language, you can pull everything in memory. Once everything is in memory, you can perform you comparison/analysis there and everything should be very fast.
Please be warned the major bottleneck of this approach is the part searching for an entry on the list hist. You should use a package/library that can perform the search quickly.
EDIT
As an example, let's say we only keep $8$ days of history and we label the dates and the bits of a byte using a number between $0$ and $7$ (inclusive). Let's say
To represent event A, we set the $i^{th}$ bit of a byte to $1$ if event A happens on day $i$. Since event $A$ happens on $0, 1, 2, 4, 7$, the $0^{th}$, $1^{st}$, $2^{nd}$ $4^{th}$ and $7^{th}$ bit of corresponding byte will be set. Numerically, the corresponding number is $2^7 + 2^4 + 2^2 + 2^1 + 2^0 = 151$.
Similarly, the number corresponds to event $B$ is $2^6 + 2^4 + 2^2 + 2^0 = 85$
When you ask a computer to compute the bitwise AND of these two bytes. The $i^{th}$ of the resulting byte will be set if and and only the $i^{th}$ bit on both bytes are set. As you can see, this is precisely those dates when both event A and B happen.
In this particular example, the $0, 2, 4$ bits will be set and the bitwise AND return the number $2^4 + 2^2 + 2^0 = 21$. Since the resulting number $21$ has $3$ set bits, the number of days when both event A and B happen is $3$.
$$\newcommand{\mybox}[2][4pt,border: 1px solid]{\bbox[#1]{#2}} \begin{array}{rccccccccl} \tt{bit} & 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0\\ A = &\mybox{1}&\mybox{0}&\mybox{0}&\mybox{1}& \mybox{0}&\mybox{1}&\mybox{1}& \mybox{1}& = 2^7 + 2^4 + 2^2 + 2^1 + 2^0 = 151 \\ B = &\mybox{0}&\mybox{1}&\mybox{0}&\mybox{1}& \mybox{0}&\mybox{1}&\mybox{0}& \mybox{1}& = 2^6 + 2^4 + 2^2 + 2^0 = 85\\ A{\&}B = &\mybox{0}&\mybox{0}&\mybox{0}&\mybox{1}& \mybox{0}&\mybox{1}&\mybox{0}& \mybox{1}& = 2^4 + 2^2 + 2^0 = 21 \end{array}$$
As a side note, not only the logical AND operation between two events can be represented by a bitwise AND (
&in Java) between corresponding numbers. The logical OR and logical eXclusive OR operations can be represented by bitwise OR (|in Java) and bitwise XOR (^in Java) between corresponding numbers.I just look up the documentation for Java, there is a
bitCount()method which count the number of set bits for Java integers.