This is definitely a case of "not even sure how to ask the question," but I am wondering if there is math available to solve a problem I have.
I have several years of daily weather observations (high temp., low temp., avg. temp., avg. humidity, avg. wind speed, total rainfall, avg. air pressure, total solar radiation, etc.). For any given day, I have records for each of these observations.
Here's an example set of the data (there are thousands of rows like these):
| Date | High Temp. | Low Temp. | Avg. Temp | Avg. Humidity | Avg. Wind | Rain | Avg. Press. | Solar Radiation |
|---|---|---|---|---|---|---|---|---|
| 2022-03-09 | 46.0 | 35.1 | 40.32 | 60.71 | 3.71 | 0.00 | 30.22 | 46281 |
| 2022-03-10 | 45.7 | 30.2 | 39.76 | 51.22 | 1.21 | 0.00 | 30.32 | 47012 |
| 2022-03-11 | 49.6 | 39.0 | 44.29 | 59.17 | 1.58 | 0.00 | 30.24 | 47745 |
| 2022-03-12 | 55.2 | 42.1 | 46.58 | 67.13 | 2.03 | 0.04 | 29.86 | 48477 |
| 2022-03-13 | 52.2 | 42.3 | 46.91 | 86.91 | 3.43 | 0.07 | 29.87 | 49219 |
| 2022-03-14 | 48.4 | 43.2 | 44.75 | 93.99 | 3.00 | 0.49 | 30.06 | 49958 |
| 2022-03-15 | 51.6 | 43.7 | 46.11 | 91.60 | 4.61 | 0.26 | 30.04 | 50699 |
| 2022-03-16 | 54.0 | 42.6 | 46.42 | 80.62 | 3.18 | 0.00 | 30.28 | 51442 |
| 2022-03-17 | 46.9 | 42.3 | 44.03 | 87.85 | 1.47 | 0.00 | 30.14 | 52185 |
I am hoping to use these data points to find days that are "most similar" to any other given day. So, out of the thousands of days, which day was most similar, weather-wise, based on the daily observations? Using the table example above, how would you determine which day was closest/most similar in weather observations to 2022-03-13?
I'm not even sure of the right terminology to use when asking this question; this is a new concept for me. If I had to describe in the words I know, I'd be saying that I'm seeking "a calculation for similarity based on multiple daily data points," but even that is probably off somehow. I'm not sure if I need to create an "index" for each day or not, but that doesn't seem right. Any help out there in steering me towards better knowledge would be sincerely appreciated. Thank you.
I was able to find a solution to this issue and would like to offer it here as a suggestion for others, and to solicit feedback if there are any other ideas or better ways of handling it.
I used a "Gower Distance" calculation (using the math on this page). This method seems to help with the issues of some data being single digit integers, while other data in the same row is in the five-digit range. This normalizes the differences and averages them.
I looped through the entire dataset doing an absolute difference between the test value and all other values, then summed those differences and divided by the number of measurements, preserving the distance number in a new column. Which ever item had the lowest score was the closest data.
For example, if the comparison date was 2022-03-09, I did this for 2022-03-10:
Resulting in
I did that same calculation on each row, which resulted in Gower Scores like this:
March 10 was the day most similar to March 9, while March 15 is the least similar.
In lots of random sampling, this seems to be highly accurate for my purposes. I've even incorporated additional weighting for certain measures that are underrepresented.
For the future I am likely to turn this into a MySQL function of some kind, to reduce the programming overhead on my page.