Weather
For years I have had a conspiracy theory that it rains more on weekends. I did some online searching and it seems like a lot of people think this is true, but the eggheads online claim it’s purely a psychological phenomenon. So I wrote some Python code that uses data from weather.gov and calculates how much precipitation each day of the week gets relative to the average day (for example, a score of 0.2 means that day gets 20% more than the average day).
Here’s how the web scraping part of this worked. (If you do web scrape this yourself, be gentle and only make a few requests a minute; it’s pretty nice of them to provide this data for free. Well, actually we pay for it in taxes, so it’d be rude of them not to, but still…)
import json
from urllib.parse import urlencode

import requests

area = "DCAthr"  # DCA; it seems like most airport codes work here too
start_year = 2020  # "por" is the code for the first possible year
end_year = 2024
params = {
    "elems": [
        {
            "interval": "dly",
            "duration": 1,
            "name": "pcpn",
            "smry": {"reduce": "mean"},
            "groupby": ["year", "1-1", "12-31"],
        },
    ],
    "smry_only": 1,  # top-level parameter (it can't live inside the "elems" list)
    "sid": area + " 9",
    "sDate": f"{start_year}",
    "eDate": f"{end_year}-12-31",
    "meta": ["name", "state", "valid_daterange", "sids"],
}
params_json = json.dumps(params)
encoded_params = urlencode({"params": params_json, "output": "json"})
headers = {
    "Accept": "application/json",
    "Content-Type": "application/x-www-form-urlencoded",
}
response = requests.post(
    "https://data.rcc-acis.org/StnData",
    headers=headers,
    data=encoded_params,
)
data = response.json()
From there it’s pretty easy to get Python to do the grunt work of analyzing the data. For the graphs here I just based the calculations on whether each day got rain or not, but you could also look at how much rain each day gets.
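That grunt work can be sketched roughly like this (not the exact code used for the graphs; the `[date, value]` row shape and the "T"/"M" trace/missing markers are assumptions about the ACIS response format):

```python
from collections import Counter
from datetime import date

def weekday_rain_scores(rows):
    """Given rows of [iso_date, precipitation] pairs, return each weekday's
    (0=Monday) rainy-day count relative to the average weekday: a score of
    0.2 means that weekday had 20% more rainy days than the average day."""
    rainy = Counter()
    for day_str, value in rows:
        wd = date.fromisoformat(day_str).weekday()
        if value == "M":      # assumed missing-data marker: skip
            continue
        if value == "T" or float(value) > 0:  # assumed trace marker counts as rain
            rainy[wd] += 1
    avg = sum(rainy.values()) / 7
    return {wd: rainy[wd] / avg - 1 for wd in range(7)}

# Tiny fabricated example: four weeks where it rains only on Saturdays.
rows = [
    (date(2024, 1, d).isoformat(),
     "0.5" if date(2024, 1, d).weekday() == 5 else "0.0")
    for d in range(1, 29)
]
scores = weekday_rain_scores(rows)
```

In the fabricated example, Saturday ends up with a score of 6.0 (seven times the average of 4/7 rainy days per weekday) and every other day lands at -1.0.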
I checked four years of data from the Washington, DC area, around the timeframe when I felt there had been more rain on weekends. The results were pretty surprising.
Aha! So there is a conspiracy! Or at least, there is something humans do on a weekly schedule that affects rainfall?
But does this pattern hold over longer time frames? Let’s check the data from 1980-2024.
Ok, it still looks weird, but there is no longer a clear trend pointing to more rain on weekends. Maybe the pattern is purely random. What happens if we just randomize all the data and then map it?
So even if the data were purely random, it would still look like certain days follow a pattern. No conspiracy here, then. However, over the past few years in DC there really has been more rain on weekends than on weekdays; it just seems that nothing humans do impacts the process much.
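A minimal version of that randomization test, as a sketch: scatter the same number of rainy days uniformly at random across the whole period, then recompute each weekday’s deviation from the mean.

```python
import random
from collections import Counter

def shuffled_scores(n_days=1461, n_rainy=469, seed=0):
    """Scatter n_rainy rainy days uniformly over n_days (day 0 = Monday)
    and return each weekday's rainy-day count relative to the average."""
    rng = random.Random(seed)
    rainy_days = rng.sample(range(n_days), n_rainy)
    counts = Counter(d % 7 for d in rainy_days)
    avg = n_rainy / 7
    return {wd: counts[wd] / avg - 1 for wd in range(7)}

scores = shuffled_scores()
# Even with purely random rain, individual weekdays land several
# percent above or below the mean.
```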
Let’s see what the math says about this. I’m not at all an expert at statistics, so this next part might be a bit suspect.
The distribution of which days get rain and which ones don’t is binomial, so we can directly calculate the variance we should expect to see. Given the 7 days of the week and \(N\) total rainy days, if the rain is randomly distributed (a suspect assumption, but it should be fine for our purposes), the expected variance of any one weekday’s rainy-day count is:
$$ N (\frac 1 7) (\frac 6 7) $$Making the standard deviation:
$$ \sqrt {N (\frac 6 {49})} $$But what we actually care about is not the variance itself, it’s how the spread of each day compares to the average value. We don’t care if one day of the week gets 17 more rainy days than another day of the week, we care that it gets 5 percent more rainy days than the average day of the week. This means the value we care about is the standard deviation divided by the expected value (this is also called the coefficient of variation). The expected value is just:
$$ \frac N 7 $$Making the coefficient of variation:
$$ (\frac 7 N) \sqrt {N (\frac 6 {49})} = \sqrt {\frac 6 N} $$As \(N\) goes to infinity, this value gets very small, which makes sense: given, say, a trillion days of data, we would expect to see less variation between the percentages of which days got rain than we would with, say, one week of data.
From 2021 to 2024, as used in the first graph, there were roughly 469 rainy days. This leads to a coefficient of variation of 0.113, meaning the standard deviation is about 11% of the mean. This tracks closely with the actual data in the first graph, where the values seem to range from -15% to 20%, with most days having a magnitude less than 11%.
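That number is quick to sanity-check by plugging 469 into the formulas above:

```python
import math

N = 469                        # rainy days in the window
std = math.sqrt(N * 6 / 49)    # standard deviation of one weekday's count
mean = N / 7                   # expected rainy days per weekday
cv = std / mean                # coefficient of variation, equals sqrt(6/N)
# cv ≈ 0.113
```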
This shows an important lesson in data analysis: it isn’t enough just to analyze data and find trends in it; researchers should also always consider the effect that random chance has on their results. If I roll 10 dice and record their values, there will (the vast majority of the time) be some weak trend in the data, even though in reality it is purely random. That doesn’t mean the trend is wrong. In the case of the rain data, for example, there really was more rain on weekends (if we count Friday as a weekend day…); it just doesn’t tell us anything useful other than “weather is random”.
Most researchers today check for this using the p-value, which basically tells you the chances of seeing the results the researchers saw if the thing they were trying to prove was totally irrelevant and the results were only due to random chance. There are some issues with this approach too, however: if thousands of researchers all study something, even if the effect they are trying to prove is totally made up, by pure chance some will likely observe an effect, and correctly claim that the odds of their results being due to random chance are small.
Imagine rolling a die 10 times and getting all 1s. You might reasonably assume the die was weighted, given that the odds of that happening randomly are roughly one in 60 million. But if 100 million researchers all rolled that same die, chances are good that at least one of them will roll all 1s, and if that researcher fails to read the reports of all the others who didn’t, they might incorrectly conclude the die isn’t fair.
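The arithmetic behind that, using the 100 million figure from the example:

```python
p = (1 / 6) ** 10  # odds one fair die shows ten 1s in a row: ~1 in 60 million
researchers = 100_000_000

# Probability that at least one researcher sees ten 1s in a row:
at_least_one = 1 - (1 - p) ** researchers
# at_least_one ≈ 0.81, i.e. more likely than not
```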
To make matters worse, it isn’t always so simple to know what the odds of something happening by chance even are. What were the odds of America winning the revolutionary war? What about the odds that humans build a Mars base by 2100? Not everything is a trial that can be run dozens of times.
To anyone interested, The Art of Uncertainty is a great read that covers this sort of thing in a lot more detail.