My Data Runneth Over

Counting is not analysis

Oct 17, 2025

If I say “Jane has 5 ideas on how to improve the company, and Dave has 6”, it would also be true if I said, “Dave has 20% more ideas on how to improve the company than Jane”.

True in a very limited sort of way – and certainly not very helpful. Some ideas are great, and some are bad enough that not having them would actually be better.

This gets even more noticeable at scale. If I said, “Jane has 5 ideas on how to improve the company, and Dave has 24,950,” you’d probably wonder if Dave needed a break from ChatGPT. The sheer volume would be a strong indicator that he was doing something to mass produce his ideas.

Just because you can count two things doesn’t mean they’re comparable. But sometimes maybe we forget that?

Study results show that total data volume continues to grow with 5.9 million datapoints now collected on average per phase III protocol, up 11% annually since 2020. […]
Between 2012 and 2025 the volume of data collected per protocol has increased 15.4% annually.
Figure 3. Growth in overall average data volume collected per phase III protocol
2012: 929,203
2020: 3,560,201
2025: 5,957,122

This from a new TransCelerate/Tufts CSDD study¹ that’s been getting a lot of airtime lately, especially as the authors make the fall industry conference circuit.
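For anyone who wants to check the quoted growth rates against the three Figure 3 values, here is a minimal sketch. It assumes "annually" means a compound annual growth rate, which the excerpt doesn't actually say. The names `annual_growth`, `growth_between`, and `volume` are made up for illustration and don't come from the study.

```python
def annual_growth(start_value, end_value, years):
    """Compound annual growth rate between two values (assumed definition)."""
    return (end_value / start_value) ** (1 / years) - 1


# Average datapoints per phase III protocol, as quoted from Figure 3
volume = {2012: 929_203, 2020: 3_560_201, 2025: 5_957_122}


def growth_between(first_year, last_year):
    """Annualized growth between two of the quoted years."""
    return annual_growth(
        volume[first_year], volume[last_year], last_year - first_year
    )


# Print the annualized rate for each pair of quoted years
for first, last in [(2012, 2025), (2020, 2025), (2012, 2020)]:
    print(f"{first}-{last}: {growth_between(first, last):.1%} per year")
```

Under that assumption, the 2012 to 2025 rate comes out at about 15.4% and the 2020 to 2025 rate rounds to 11%, so the quoted percentages are at least consistent with the quoted totals. That is a check of the arithmetic only. It says nothing about whether the three totals were measured the same way.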

Here’s the first problem: datapoints are like ideas. Measuring the maximum diameter of an index tumor on an MRI is one datapoint. Slapping a Fitbit on someone and downloading their daily activity logs for 3 months will produce many hundreds, if not thousands, of datapoints.

The second problem is that the cost of obtaining and storing data keeps shrinking. I remember the first trial I worked on that utilized a portable EKG. It was easier to use and cheaper than previous models. Did the sponsor stuff that savings into the mattress? Absolutely not – the FDA requested that they take more EKG readings. And I’d argue that that was unequivocally better for the trial. I’ve never heard anyone make the case that we have too much safety data.

Are trials getting more complex? Sure!

Is that an unsustainable data Tower of Babel or a savvy use-every-part-of-the-buffalo strategy? What a great question! I don’t think this study helps us answer it in any way, unfortunately.

Postscript: The study paper is in preprint, so maybe it will get better after peer review? But there are so many eyebrow-raising things in it, from a seemingly-wonky definition of “non-essential” to really suspect methodological moves, and even citations that link to other research that absolutely does not support the statement made. Ultimately, what bothers me the most is that this seems to have started with an interesting question, involved a lot of tedious work from a lot of probably-very-smart people, and ended with such an underwhelming result.

¹ Well, sort of? The data from 2012 and 2020 were not from this study, and it’s extremely unclear that they even used the same methodology.

First Patient In
