Large Scale Data Collection

OmCheeto · Apr 21, 2024

I've been entertaining myself since my retirement with science and maths problems.
My latest endeavor's plot seems a bit too linear, and I was curious whether others have had similar experiences with ongoing national-level data collection.

The U.S. CDC posts death data for Flu, Pneumonia, and Covid-19 on a weekly basis.
Each week, the posted data for all weeks changes.
Most annoying is that I didn't start tracking the reporting delay until about 4 weeks ago.
Most interesting is that, as I mentioned above, the reporting-delay data, graphed logarithmically, is exquisitely linear.

[Attached image: plot of the logarithmic linearity, screenshot dated 2024-04-21]

Raw data:

Deaths per week, as posted in the latest release and in each of the four preceding weekly releases:

Lag week	Graphical week	Week	Week ending	Flu (latest)	Pneum (latest)	C-19 (latest)	∑ (latest)	Flu (1 wk ago)	Pneum (1 wk ago)	C-19 (1 wk ago)	∑ (1 wk ago)	Flu (2 wks ago)	Pneum (2 wks ago)	C-19 (2 wks ago)	∑ (2 wks ago)	Flu (3 wks ago)	Pneum (3 wks ago)	C-19 (3 wks ago)	∑ (3 wks ago)	Flu (4 wks ago)	Pneum (4 wks ago)	C-19 (4 wks ago)	∑ (4 wks ago)
0	63	11	3/16/2024	303	3459	982	4744	292	3343	961	4596	263	3041	887	4191	216	2527	755	3498	118	1186	356	1660
1	62	10	3/9/2024	384	3644	1047	5075	379	3592	1030	5001	368	3487	1002	4857	342	3193	933	4468	285	2497	748	3530
2	61	9	3/2/2024	382	3868	1221	5471	381	3848	1215	5444	374	3797	1198	5369	365	3698	1164	5227	320	3292	1036	4648
3	60	8	2/24/2024	399	3817	1271	5487	397	3789	1269	5455	394	3744	1255	5393	388	3686	1234	5308	366	3512	1190	5068

Multipliers for week 11 (lag week 0): latest posted count divided by the count posted N weeks earlier:

Category	mult (4 wks)	mult (3 wks)	mult (2 wks)	mult (1 wk)
∑	2.8578	1.3562	1.1319	1.0322
Flu	2.5678	1.4028	1.1521	1.0377
Pneum	2.9165	1.3688	1.1375	1.0347
C-19	2.7584	1.3007	1.1071	1.0219

FactChecker · Apr 21, 2024

Those numbers get tangled up in politics. The CDC just collects the data from the states. I'm not clear on what you are collecting, but I looked at delayed COVID-19 death reporting from Florida (and some other states) for a couple of years. It varies from state to state. Florida always reported very low numbers for the more recent days compared to other states, but the cumulative totals remained high. Florida's recent day reports never came close to adding up to the totals. The effect was that Florida always looked as though they were handling COVID very well and had driven the death rates comparatively very low. But if you look at the cumulative totals, you can't help but notice that they actually did poorly. They are currently the seventh worst state per capita. On a day-by-day basis, Florida would report less than one tenth the deaths of states like Texas and California, but Florida's cumulative totals did not reflect that. I once tried to determine how long Florida delayed before adding the deaths. I could not determine a pattern in when they adjusted the past data.

I never noticed anything similar for the other states. If you are analyzing death report delays, you might want to exclude the Florida COVID-19 numbers or treat them separately. I did not look at flu or pneumonia.

jedishrfu · Apr 21, 2024

In 2020 there was a news item about a Florida COVID data scientist who was fired for releasing timely statistics on the disease:

https://www.cnn.com/2022/05/27/us/florida-report-coronavirus-numbers-manipulated/index.html

While they found no impropriety, how Florida reports is still suspect.

hutchphd · Apr 21, 2024

I am shocked. And the governor an erstwhile candidate to be POTUS. Shocked, I say..

FactChecker · Apr 21, 2024

jedishrfu said:

While they found no impropriety, how Florida reports is still suspect.

Yes. For well over a year, any time you looked at the latest week or two (or month) of Florida's COVID-19 data, it looked like COVID had been drastically reduced in Florida and they were doing much better than other states. IMO, that was a political trick. As the deaths were slowly added in to the old daily numbers, the cumulative total told the truth. In fact, Florida is the 7th worst state in COVID-19 per-capita deaths.

OmCheeto · Apr 22, 2024

FactChecker said:

I'm not clear on what you are collecting

Data, on a weekly basis.

But that's not the problem. The problem is that the weekly updates change all values back to the beginning of data collection for the current flu season.

The solution I was looking for was: given a value posted on a Friday, what will the ultimate correct value be, when all is said and done?

And that's when, after my analysis, I noticed the nearly perfect linear logarithmic solution.

FactChecker · Apr 22, 2024

OmCheeto said:

Data, on a weekly basis.

But that's not the problem. The problem is that the weekly updates change all values back to the beginning of data collection for the current flu season.

The solution I was looking for was: given a value posted on a Friday, what will the ultimate correct value be, when all is said and done?

Determining the "ultimate correct value" may be asking too much. When data is being recorded, there are delays and several errors that are made. Some are found and corrected. Others are not. I do not know if the CDC ever corrects whatever data source you are using. Do they only accept revised data from states? I do know that later analysis tends to estimate that there is a large percentage of under-reporting, as much as 25%. One problem with the records system is that there is often an entry for "Immediate Cause of Death" and another for "Underlying Cause". How those are initially interpreted and recorded versus later interpreted and used can be a problem.

OmCheeto · Apr 23, 2024

FactChecker said:

Determining the "ultimate correct value" may be asking too much. When data is being recorded, there are delays and several errors that are made. Some are found and corrected. Others are not. I do not know if the CDC ever corrects whatever data source you are using. Do they only accept revised data from states? I do know that later analysis tends to estimate that there is a large percentage of under-reporting, as much as 25%. One problem with the records system is that there is often an entry for "Immediate Cause of Death" and another for "Underlying Cause". How those are initially interpreted and recorded versus later interpreted and used can be a problem.

How about "ultimate best fit"?

Looking back now at my OP, I'm fairly certain that a year from now I will have no idea what the hell I was talking about. So I'll focus on the pneumonia data and try to explain what I'm seeing:

The CDC posts this data every Friday.

On 3/22, the CDC posted that 1186 people died of pneumonia during week 11.
On 3/29, the CDC posted that 2527 people died of pneumonia during week 11.
On 4/5, the CDC posted that 3041 people died of pneumonia during week 11.
On 4/12, the CDC posted that 3343 people died of pneumonia during week 11.
On 4/19, the CDC posted that 3459 people died of pneumonia during week 11.

Plotting the posted deaths by date yields a curve.

Being that I'm somewhat mathematically illiterate, I asked my spreadsheet for polynomial and logarithmic fits. I was not impressed.
So I evaluated the change from week to week on a log scale and, as I mentioned, it came out very linear.

week lag	0	1	2	3	4
deaths	3,459	3,343	3041	2527	1186
multiplier		2.9	1.37	1.14	1.035
%		191.7%	36.9%	13.7%	3.5%
log10(%)		0.28	-0.43	-0.86	-1.46

My multiplier here is a bit cattywampus, as it goes backwards.
The multiplier under week lag 1 is the deaths posted at lag 0 divided by the deaths posted at lag 4,
and
the multiplier under week lag 2 is the deaths posted at lag 0 divided by the deaths posted at lag 3.
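For anyone who wants to check the table without a spreadsheet, here is a minimal Python sketch. It assumes the 4/19 posting stands in for the "final" value, and the variable names (`posted`, `rows`) are made up for illustration. The `label` column follows the backwards numbering described above, so label 1 pairs the latest count with the oldest posting.

```python
import math

# Week 11 pneumonia deaths as posted on successive Fridays (figures from this post).
posted = {"3/22": 1186, "3/29": 2527, "4/5": 3041, "4/12": 3343, "4/19": 3459}

counts = list(posted.values())   # oldest posting first
latest = counts[-1]              # assumption: treat the 4/19 posting as the reference value

rows = []
for label, earlier in enumerate(counts[:-1], start=1):
    multiplier = latest / earlier          # latest count divided by an earlier posting
    fraction = multiplier - 1.0            # fractional change still to come
    rows.append((label, multiplier, 100.0 * fraction, math.log10(fraction)))

print("label  multiplier  %change  log10(fraction)")
for label, multiplier, percent, log_fraction in rows:
    print(f"{label:5d}  {multiplier:10.3f}  {percent:6.1f}%  {log_fraction:+.2f}")
```

Note that the 'log10(%)' row is the log of the fractional change (multiplier minus 1), not of the percentage figure itself.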

The multiplier vs lag yielded another wonky curve, so I converted the multiplier to % change and ... voilà!

The 'week lag' vs 'log10(%)' curve was exquisite.

50 states feeding data from a million doctors through a myriad of number crunchers.
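To put a number on how straight the 'week lag' vs 'log10(%)' line is, an ordinary least-squares fit can be done by hand. This is a rough sketch, not a proper statistical treatment: it again assumes the 4/19 posting is the reference value, and with only four points any goodness-of-fit figure should be taken with a grain of salt.

```python
import math

# Week 11 pneumonia postings, oldest first (3/22 through 4/19), from this post.
counts = [1186, 2527, 3041, 3343, 3459]
latest = counts[-1]   # assumption: stands in for the final count

x = [1, 2, 3, 4]                                         # lag labels as in the table
y = [math.log10(latest / c - 1.0) for c in counts[:-1]]  # log10 of fractional change

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products for simple linear regression.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
r_squared = sxy ** 2 / (sxx * syy)   # squared correlation coefficient

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.4f}")
```

A negative slope means the remaining shortfall shrinks by a roughly constant factor with each weekly posting.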

FactChecker · Apr 23, 2024

I think that your result makes sense.
Suppose that each week, the errors remaining that week are a certain proportion of the remaining errors from the prior week. That is a reasonable assumption to make for your problem. The errors are the deaths that week which were not included in the death count.
Let ##e_n## denote the errors remaining after week ##n##. Let ##p## be the fractional portion of errors that will remain after one week. Start with ##e_0## = the deaths that were not reported on week 0. Then
##e_1 = p e_0##
##e_2 = p e_1 = p (p e_0) = p^2 e_0##
##e_3 = p e_2 = p (p^2 e_0) = p^3 e_0##
...
##e_n = p e_{n-1} = p (p^{n-1} e_0) = p^n e_0##

Taking the logarithm gives you ##\log( e_n) = n \log(p) +\log(e_0)##, which is linear in ##n## with a slope ##\log(p)##.
Since ##p \lt 1##, ##\log(p) \lt 0##.
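The recurrence above is easy to check numerically. In this small Python sketch the starting shortfall `e0` and the carry-over fraction `p` are hypothetical values chosen only for illustration, not estimates from the CDC data.

```python
import math

def remaining_errors(e0, p, weeks):
    """Return [e_0, e_1, ..., e_weeks] with e_n = p**n * e0."""
    return [e0 * p ** n for n in range(weeks + 1)]

e0 = 2000.0   # hypothetical: deaths not yet reported at week 0
p = 0.4       # hypothetical: fraction of errors still remaining after each week

errors = remaining_errors(e0, p, 5)
logs = [math.log10(e) for e in errors]

# On a log scale, each weekly step should equal log10(p), a constant negative number.
steps = [later - earlier for earlier, later in zip(logs, logs[1:])]

for n, (e, lg) in enumerate(zip(errors, logs)):
    print(f"n={n}  e_n={e:10.3f}  log10(e_n)={lg:.4f}")
print("step per week:", [round(s, 4) for s in steps], " log10(p) =", round(math.log10(p), 4))
```

Going the other way, a slope ##s## fitted to real data on a log10 scale would suggest ##p \approx 10^s##, provided the constant-proportion assumption holds.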
