The epistemological challenges of automating A/B testing, or how will AI do science
Once upon a time we had the great rationalist/empiricist debate: where does knowledge come from? Then came the scientific method. It didn’t settle the debate, but it made it irrelevant. We have models about the world (or theories, or hypotheses; it doesn’t matter which word we use, the difference between the three is sociological). It doesn’t matter where they come from, either. The scientific method offers a set of tools and best practices to test the models. You make observations: in deliberate experiments, in observatories, or even accidentally. The goal is not to validate the models but to falsify them. Once you do, you refine the model, or throw it out and come up with something new.
How do you falsify a model? In the good old days we had binary outcomes: “the speed of light is invariant to the observer, so something is wrong with the aether model”. But we soon ran out of those slam-dunk, black-and-white, background-free observations. For testing ambiguous, inherently fluctuating phenomena, we invented modern statistics. We needed it: if we went with Rutherford (“if your experiment needs statistics, you ought to have done a better experiment”), science would be over. So we add a model of the fluctuations to the model we are testing. We make observations, often repeatedly, until the law of large numbers kicks in. The goal is to show that the original model is statistically untenable: the probability, under the fluctuation model, of seeing what we observed if the model were true is less than 0.05, 0.01, or 0.0000001 (pick your favorite threshold). The method is fine; it works as it should. If you set the bar high, as particle physicists do, you can be pretty sure that, e.g., we are not living in a Higgs-boson-less universe (even though no individual Higgs boson was ever observed). If you don’t, you will sometimes fail: “signals” can “fade”, indicating false positives. Sometimes you make mistakes in modeling the fluctuations or your own observatory. But none of that shakes the foundations of the method. What does shake it is automated hypothesis generation.
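To make the mechanics concrete, here is a minimal sketch of such a test with a made-up observation. The fluctuation model is a fair coin (the binomial distribution), the model under test is “the coin is fair”, and the numbers (60 heads in 100 flips) are purely illustrative:

```python
from math import comb

def binom_pvalue(k, n, p=0.5):
    """One-sided p-value: the probability, under the fluctuation model
    Binomial(n, p), of seeing at least k successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical observation: 60 heads in 100 flips of a supposedly fair coin.
p = binom_pvalue(60, 100)
print(f"p = {p:.4f}")  # ~0.028: "untenable" at a 0.05 threshold, fine at 0.01
```

The same number is read as a refutation or as noise depending solely on where you set the threshold, which is exactly why the choice of threshold carries so much weight below.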
The whole machinery assumes that the hypothesis has a high prior probability of going through (of being valid). This paper explains it beautifully. In a nutshell: by construction, if I churn out random false models, a certain fraction of them will pass the test as false positives. If I then publish only the models that passed the threshold, in the extreme case all of them may be false. How to defend against this? Well, 1) don’t churn out random hypotheses, and/or 2) set the threshold high enough that the expected number of false positives (the number of hypotheses you will ever test times the probability of a false positive) is much smaller than one. Physicists obey both principles. Drug testing and social science, not so much.
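The extreme case is easy to simulate. Below, every hypothesis is false by construction (every “coin” is fair), yet at a 0.05 threshold about 5% of them pass; if only those get published, the published literature is 100% false positives. The numbers of hypotheses and flips are arbitrary choices for the sketch:

```python
import random
from statistics import NormalDist

random.seed(0)
ALPHA = 0.05          # pick your favorite threshold
N_HYPOTHESES = 1000   # how many random models we churn out
N_FLIPS = 400         # observations per experiment

def p_value(heads, n=N_FLIPS):
    """Two-sided p-value for 'the coin is fair', normal approximation."""
    z = abs(heads - n / 2) / (n / 4) ** 0.5
    return 2 * (1 - NormalDist().cdf(z))

# All hypotheses are null: the coins really are fair.
published = sum(
    p_value(sum(random.random() < 0.5 for _ in range(N_FLIPS))) < ALPHA
    for _ in range(N_HYPOTHESES)
)
print(published)  # roughly ALPHA * N_HYPOTHESES, i.e. about 50 -- all false
```

Defense 2) in this arithmetic: with the particle physicists’ five-sigma threshold (ALPHA of roughly 3e-7), the expected number of false positives over those 1000 tests drops to about 0.0003, far below one.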
Automating the discovery process is around the corner. Interestingly, the issue discussed here is rarely raised (I challenge you to find it among the hits for “automating drug discovery”). With the advances of AI, it is not unimaginable that we will soon have artificial agents looking for scientific discoveries. To make this happen, we need to include hypothesis generation in both the epistemology and the algorithmic practice.
Now, if you think this is some kind of esoteric, ivory-tower problem for fuzzy-haired scientists, you are badly mistaken. The entire IT industry is hit hard by it and is working hard on it. The scientific method has made a roaring appearance in internet-based businesses under the name of A/B testing. Polls and focus groups were always part of the marketing toolkit; what changed the game is the speed and low price of these experiments, made possible by the internet. In a sense, the whole lean/agile startupping thing is a way to solve the organizational issues of running controlled, repeated business experiments. Automating the whole process using AI is now in sight. There are several issues to solve, among which an epistemology that includes hypothesis generation is arguably the least known and least addressed.
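An A/B test is the same statistical machinery in business clothing: two variants, a fluctuation model, a threshold. A minimal sketch with a standard two-proportion z-test (the traffic and conversion numbers are invented for illustration):

```python
from statistics import NormalDist

def ab_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test: is variant B's conversion rate
    distinguishable from variant A's, given the fluctuations?"""
    pa, pb = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (pb - pa) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p, p < alpha

# Hypothetical experiment: 10,000 users per arm, 3.0% vs 3.6% conversion.
p, significant = ab_test(300, 10_000, 360, 10_000)
print(f"p = {p:.3f}, ship variant B: {significant}")
```

Note that nothing in this function knows how many variants were tried before these two. A product team (or an AI agent) that generates variants cheaply and tests them at alpha = 0.05 is running the random-hypothesis mill from the previous section, which is precisely the epistemological gap.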