How a Cup of Tea Laid the Foundations for Modern Statistical Analysis
NEWS | 30 March 2025
In the early 1920s, a trio of scientists sat down for a break at Rothamsted agricultural research station in Hertfordshire, UK. One of them, a statistician by the name of Ronald Fisher, poured a cup of tea, then offered it to his colleague Muriel Bristol, an algae specialist who would later have the alga C. muriella named after her. Bristol refused, as she liked to put the milk in before the tea. Fisher was skeptical. Surely it didn’t matter? Yes, she said, it did. A cup with milk poured first tasted better.

“Let’s test her,” chipped in the third scientist, who also happened to be Bristol’s fiancé. That raised the question of how to assess her tasting abilities. They would need to make sure she was given both types of tea, so she could make a fair comparison. They settled on pouring several cups, some tea-then-milk and others milk-then-tea, then getting her to try them one at a time.

But there were still a couple of problems. Bristol might try to anticipate the sequence they’d chosen, which meant cups needed to arrive in a genuinely random order. And even if the ordering was random, she might get a few correct by chance. So there would need to be enough cups to make lucky guessing sufficiently unlikely.

Fisher realized that if they gave her six cups—three with milk first and three with milk second—there were 20 different ways they could be randomly ordered. Therefore, if she simply guessed, one time in 20 she’d get all six correct. What about using eight cups instead? In this situation, Fisher calculated there were 70 possible combinations, meaning there was a one in 70—or 1.4 percent—probability she’d get the sequence right by sheer luck.

This was the experiment they decided to run with Bristol. They poured eight cups, four of each type, and got her to test them in a random order. She named the four she preferred, and the four she disliked, then they compared her conclusions with the true pattern. She’d got all eight correct.
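Fisher’s counts are just binomial coefficients: with six cups, the number of ways to pick which three are milk-first is C(6, 3) = 20; with eight cups it is C(8, 4) = 70. As a quick sketch, the arithmetic can be checked in a few lines of Python:

```python
from math import comb

# Six cups, three of each type: C(6, 3) ways to arrange them
print(comb(6, 3))       # 20

# Eight cups, four of each type: C(8, 4) possible arrangements
print(comb(8, 4))       # 70

# Chance of a perfect score by pure guessing with eight cups
print(1 / comb(8, 4))   # ~0.014, i.e. about 1.4 percent
```

Each arrangement is equally likely under random ordering, so a pure guesser names the correct set of four milk-first cups with probability 1/70.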
The reason for Bristol’s success was ultimately down to chemistry. In 2008, the Royal Society of Chemistry reported that tea-then-milk gives the milk a more burnt flavour. “If milk is poured into hot tea, individual drops separate from the bulk of the milk and come into contact with the high temperatures of the tea for enough time for significant denaturation to occur,” they noted. “This is much less likely to happen if hot water is added to the milk.”

Fisher later described the tea-tasting experiment in a 1935 book titled simply The Design of Experiments. Among other things, the book summarized the crucial techniques they’d pioneered in that Rothamsted tea room. One was the importance of randomization; it wouldn’t have been a rigorous test of Bristol’s ability if the ordering of the cups was somehow predictable.

Another was how to arrive at a scientific conclusion. Fisher’s basic statistical recipe was simple: start with an initial theory—he called it the “null hypothesis”—then test it against data. In the Rothamsted tea room, Fisher’s null hypothesis had been that Bristol couldn’t tell the difference between tea-then-milk and milk-then-tea. Her success in the resulting experiment gave Fisher good reason to discard his null hypothesis.

But what if she’d only got seven out of eight correct? Or six, or five? Would that mean the null hypothesis was correct and she couldn’t tell the difference at all? According to Fisher, the answer was no. “It should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation,” he later wrote.
“Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” If Bristol had got one or two wrong, it didn’t necessarily mean she had zero ability to distinguish milk order. It just meant the experiment hadn’t provided strong enough evidence to reject Fisher’s initial view that it made no difference.

If Fisher wanted experiments to challenge null hypotheses, he needed to decide where to set the line. Statistical findings have traditionally been deemed “significant” if the probability of obtaining a result that extreme by chance (i.e. the p-value) is less than 5 percent. But why did a p-value of 5 percent become such a popular threshold?
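This logic—what is the probability of a result at least this extreme under the null hypothesis?—is the heart of what became Fisher’s exact test. As a minimal sketch (the function name and interface here are illustrative, not Fisher’s own), the one-sided p-value for the tea experiment can be computed by summing over all guessing outcomes at least as good as the one observed:

```python
from math import comb

def tea_p_value(correct_milk_first: int, per_group: int = 4) -> float:
    """One-sided p-value under the null hypothesis of pure guessing:
    the probability of identifying at least `correct_milk_first` of the
    milk-first cups, given a forced choice of `per_group` cups per type."""
    total = comb(2 * per_group, per_group)  # all equally likely selections
    tail = sum(
        comb(per_group, k) * comb(per_group, per_group - k)
        for k in range(correct_milk_first, per_group + 1)
    )
    return tail / total

print(tea_p_value(4))  # perfect score: 1/70 ~ 0.014, below the 5 percent line
print(tea_p_value(3))  # one cup swapped (6 of 8 right): 17/70 ~ 0.24
```

Note that in this forced-choice design a single misidentification swaps two cups, so “seven out of eight” is impossible; the next-best outcome is six of eight, and its p-value of roughly 0.24 falls well short of significance—exactly the situation where Fisher would decline to reject the null hypothesis.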
Author: Will Knight.