Two things happened recently. I was thinking about better ways to teach Bayesian thinking with minimal math in my upcoming class on historical reasoning (which starts in two days; if you want in, you can register now!). And I just finished reading an advance copy of the first proper academic review of my book Proving History, which is the main textbook I use for my course. That review, by fellow Bayesian philosopher of history Aviezer Tucker, will appear in the February 2016 issue of the academic journal History and Theory.
The review is an interesting mix of observations. I’ll comment on it in more detail when it is officially released. The abstract of his review is available, but it’s not a wholly accurate description of its content. In fact the review is mostly positive, and when critical, Tucker proposes what he thinks would be improvements. He’s uncertain whether a Bayesian approach will solve disagreements in Jesus studies, and he mentions some possible barriers to that that weren’t directly addressed in Proving History, but he is certain Bayesian methods do need to be employed there. The question becomes how best to do that. He makes some suggestions, which actually anticipate some aspects of how I did indeed end up arguing in Proving History’s sequel, On the Historicity of Jesus (which Tucker hasn’t yet read, so he was unaware of that, but it’s nice to see he comes to similar conclusions about how to examine the evidence). He takes no side in the debate over the conclusion.
Both events converged. Tucker’s review reminded me of some ways to better discuss and teach Bayesian thinking. In truth, Everyone Is a Bayesian. They might think they’re not. Or they don’t know they are. But they are. Any time you make a correct argument to a conclusion in empirical matters, your argument is Bayesian, whether you realize it or not. It’s better to realize it. So you can model it correctly. And thus check that all its premises are sound; and be aware of where you might still be going wrong; and grasp all the ways someone could validly change your mind.
Bayesian Reasoning in a Nutshell
Here is just a basic guide for simple Bayesian reasoning about history…
- Rule 1: Estimate probabilities as far against your assumptions as you can reasonably believe them to be.
I discuss this method in detail in Proving History (index, “a fortiori, method of”). But in short, what it means is this: You can’t be certain of any probability. But you know enough to know what that probably can’t be. Or can’t reasonably be. You may not know what the odds are of a meteor from outer space vaporizing your house tomorrow. But you certainly know it’s a lot less than 1 in 1000. Otherwise it’d have happened by now. To lots of people you know. If it’s important to test probabilities closer to what they actually are, or what you think they are, by all means do so. And engage whatever research and data gathering is needed to get a better estimate. There actually are some data-based stats on destructive meteor strike frequencies you can track down and average per house-lot of area, for example. But most of the time, that effort simply won’t be necessary. You can see where a conclusion is going just from an initial a fortiori estimate. And any greater precision will just make the probability stronger still (either lower than it already is, or higher, depending on which possibility you are testing).
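As a rough sanity check on the meteor example, here is a minimal sketch of that a fortiori reasoning. The number of households you know and the number of years you have known them are hypothetical placeholders, not data. The point is only that a daily risk as high as 1 in 1000 would predict far more vaporized houses than anyone has ever seen.

```python
from fractions import Fraction

def expected_strikes(daily_risk, households, years):
    """Expected number of house-destroying strikes among the households you know."""
    house_days = households * years * 365  # ignoring leap days for a napkin estimate
    return daily_risk * house_days

# Hypothetical inputs: you know 100 households and have known them for 10 years.
claimed_risk = Fraction(1, 1000)
strikes = expected_strikes(claimed_risk, households=100, years=10)
print(strikes)  # 365
```

If the true count among people you know is zero, the real risk must be far below the claimed one, which is all the a fortiori estimate needs.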
- Rule 2: Estimate the prior probability of the conclusion you are testing.
Based on past cases like the one you are looking at, what has usually been the case? And how often? When you get emails from relatives with some astonishing unsourced claim about the President of the United States being a member of a secret Devil Cult, what usually is the case with things like that? They are usually made up urban legends. To an astonishingly high frequency even. Some few might end up being true. But this is what you are estimating, when you are estimating a prior: How often do claims like the one you are testing turn out to be true? How often does something else turn out to be the case instead? From estimating that you get a probability.
In history, we are testing causal claims. There are two kinds of historical claims. There are historical claims that ask “What happened?” And those amount to asking “What caused the evidence we have?” Was the claim about the President of the United States being a member of a secret Devil Cult caused by someone actually discovering good evidence that the President of the United States was a member of a secret Devil Cult? Or was it caused by someone making that up? And there are historical claims that ask “Why did that happen?” And those also amount to asking “What caused the evidence we have?” but in this case we’re talking about a different body of evidence, since we are looking not at the historical event itself, but at other historical events leading up to it. But that just makes it the same question again. And in every case, “What usually happens in such cases?” is where you get your prior. If you don’t know (if there is literally no relevantly similar case at all), then the principle of indifference prevails, and the prior in the simplest binary cases will just be 50/50.
Again, I say a lot more about this in Proving History. But this is something we already always do in all aspects of our lives. We make assumptions about what usually does or doesn’t cause the things we see, and adjust our certainty accordingly. It’s also done constantly by historians, even when they don’t realize it. Every time they dismiss a theory because it’s “implausible” or say things like “we don’t know for sure what happened in that case, but this is what usually happened in such cases,” they are reasoning about prior probability.
- Rule 3: Prior probabilities are always relative probabilities. Because the prior is an estimate of the frequency of a claimed cause of the evidence relative to all other things that could have caused the same evidence.
In other words, the prior is a measure of how frequently something is the cause of the evidence we are looking at relative to all other causes of that same evidence. And the sum of the individual prior probabilities of every possible cause of the evidence must equal 100%, since we know the evidence exists and therefore something caused it, and there are no other possible things to have caused it but those.
This means, for example, that the prior probability of someone having gotten rich by winning the lottery is not the probability of winning the lottery. Rather, it is the relative frequency with which rich people got rich that way, as opposed to some other way. So if half of all rich people got rich by winning the lottery, then the prior probability that a rich person won the lottery is fully 50%. Regardless of how improbable winning lotteries is. Always think in these terms. So, for instance, if there are only two ways to get rich, each with a 1 in 1000 chance of occurring, and someone is rich, then method A is 1000 to 1 against and method B is 1000 to 1 against, but these balance out. As each is equally likely, the probability of having gotten rich by method A is simply 50%.
It’s too easy to get seduced by the unlikeliness of every possible explanation of a certain observation, and conclude they are all impossible. But that’s not how it works. What we want to know is the relative likeliness among all those explanations. So, for example, someone stealing the body of Jesus and someone else hallucinating seeing him alive again is, like a lottery, highly unlikely. But it’s still millions of times more likely than a space ghost magically regenerating corpse flesh. So if those were the only possibilities (they aren’t, but just for the sake of illustration), then the prior probability someone stole the body of Jesus and someone else hallucinated seeing him alive again is actually very nearly 100%. Because if that is, say, 2,000,000 times more likely than the alternative, then the ratio of the priors is 2,000,000/1. And since the priors must sum to 1 (because they exhaust all possible causes of the evidence), it follows that the prior probability of the “amazing conjunction of theft & hallucination” hypothesis is more than 99.99995% (and the prior probability of the space ghost theory is a dwindling 0.000049999975%). In other words, it doesn’t matter how unlikely the “amazing conjunction of theft & hallucination” hypothesis is. It only matters how likely it is relative to alternatives.
This is an important lesson in logic that understanding Bayesian reasoning teaches us.
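The arithmetic behind rule 3 can be sketched in a few lines. This is only an illustration: the function name is made up for the example, and the weights are the relative frequencies used in the text above (two equally rare ways of getting rich, and a 2,000,000 to 1 ratio for the two resurrection explanations).

```python
from fractions import Fraction

def priors_from_relative_weights(weights):
    """Turn relative frequencies of competing causes into priors that sum to 1."""
    total = sum(weights.values())
    return {cause: Fraction(w, total) for cause, w in weights.items()}

lottery = priors_from_relative_weights({"method A": 1, "method B": 1})
body = priors_from_relative_weights(
    {"theft and hallucination": 2_000_000, "space ghost": 1}
)

print(lottery["method A"])                            # 1/2
print(float(body["theft and hallucination"]) * 100)   # a little over 99.99995 (percent)
print(float(body["space ghost"]) * 100)               # roughly 0.00005 (percent)
```

Notice that the absolute rarity of each cause never enters the calculation. Only the ratio between the causes does.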
- Rule 4: Estimate the probability (also known as the “likelihood”) of all the evidence as a whole if the claim you are testing is true.
Literally assume the claim is true. Then ask, “How likely then is all this evidence?” You must mean all the evidence when you do that. You can’t leave any out—if it will make the claim more or less likely than alternative explanations (alternative causes) of the same evidence. And there are different ways to figure this probability (discussed in Proving History). But the question always comes down to this: Is the evidence pretty much or exactly what we’d expect? (All things considered.) Or is it in some ways not quite what we’d expect? If it’s at all unexpected, if there is anything unexpected about it, then it’s less likely. And you have to estimate that.
This is in fact what we always do anyway, in every aspect of life. And it’s what historians constantly are doing. When they say the evidence perfectly fits a hypothesis, they mean it would have had a very high likelihood (a very high probability) if that hypothesis is true. Whereas when historians say the evidence fits a hypothesis poorly, they mean it’s not very probable that the evidence would look like that, if the hypothesis were true. And this is what you mean, every time you have ever said that in your life, too.
- Rule 5: Estimate the probability (also known as the “likelihood”) of all the same evidence if the claim you are testing is false. Which always means: if some other explanation is true.
Because you can only know whether a claim is true, by comparing it against other competing claims. This is true in estimating the prior probability, since that is always a relative probability (per rule 3). It is also true here. There are always at least two likelihoods you have to estimate before you can know if some claim is probably true or not. The first is the likelihood on the claim you are testing (rule 4). The other is the likelihood of all that same evidence on an alternative theory—the best alternative, at the very least; but every good alternative should be considered. A bad alternative, BTW, is one that either (A) makes the evidence we have extremely unlikely (and does not have a correspondingly remarkably high prior probability) or (B) has an extremely small prior probability (and does not have a correspondingly remarkably higher likelihood than every other competing hypothesis).
Since the evidence we have has to have been caused by something, such as the event you are claiming happened, the most likely alternative has to be some other event that could have produced the same evidence. If someone says “Joe must have had syphilis because he was observed to be suffering from dementia in his later years,” they are implicitly assuming no other causes of dementia are at all as likely as syphilis (which is not a sound assumption; there are many other common causes of dementia). They are also implicitly assuming there can be no other causes of the observed symptoms of dementia than having dementia—when, in fact, pretending to have dementia is an alternative that has to be accounted for (and there are many other possibilities as well).
So here, you are doing the same thing you did in rule 4. Except now you are “literally assuming” some other cause is true, and then asking “How likely then is all this evidence?” All the same principles apply as were discussed under rule 4. And this again is something you already do all the time; and that historians do constantly. Although, not as often as they should. One of the most common logical fails in history writing is failing to complete this step of reasoning, and assuming that because the evidence we have is exactly what we expect on hypothesis A, that therefore we’ve proved hypothesis A (that it is the most likely explanation of that evidence). No. Because hypothesis B might explain all the same evidence just as well. Or better. The evidence we have may in fact be exactly what we expect on B as well! So taking alternatives into account, and doing it seriously, is a fundamental requirement of all sound reasoning about evidence. You can’t use a straw man here, either. If you aren’t comparing your hypothesis to the best alternative, then your logic will be invalid.
- Rule 6: The ratio between those likelihoods (generated by following rules 4 and 5) is how strongly the evidence supports the claim you are testing. This is called the likelihood ratio.
Whenever historians talk about a body of evidence or a particular item of evidence being weak or strong, or weighing a lot or a little, or anything like that, they mean by “weak” that this evidence is just as expected or almost as expected on many different hypotheses, and therefore doesn’t weigh very much in favor of one of those hypotheses over those others; and they mean by “strong” that this evidence is not very expected at all on any other hypothesis but the hypothesis it is supporting.
Thus, ironically, what you are looking for when you are looking for strong evidence for a claim—when you are looking for “really good” evidence—is evidence that’s extremely improbable … on any other explanation than the one you are testing. We already expect good evidence will fit the hypothesis. That is, that it will be just what we expect to see, if that hypothesis is true. But that’s not enough. Because as we just noted under rule 5, the evidence might fit other hypotheses equally well. And if that’s the case, then it isn’t good evidence after all. So the key step is this last one, where we look at the ratio of likelihoods among all credible explanations of the same evidence.
- The odds on a Claim Being True = The Prior Odds times the Likelihood Ratio
The easiest way to think all this through on a napkin, as it were, is to use that formula, which is called the Odds Form of Bayes’ Theorem. It doesn’t let you see all the moving parts in the engine compartment, as it were. But if you just want to do a quick figuring, or if you already know how the engine works, then this is a handy way to do it.
The prior odds on a claim being true equal the ratio of priors (found through rules 2 and 3). So, for example, if one person gets rich by winning the lottery for every hundred other rich people (who get rich some other way), then the prior odds on a rich person having won the lottery equal 1/100. We can convert that to two prior probabilities that sum to 100%. But that’s next level. For now, just think, if it’s usually a hundred times more likely to have gotten rich some other way than winning the lottery, then the prior odds on having won the lottery are 1 to 100 (for anyone we observe to be rich).
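For anyone curious about that "next level" step, here is a small sketch of converting odds into two probabilities that sum to 100%. The function name is hypothetical, chosen just for this illustration.

```python
from fractions import Fraction

def odds_to_probabilities(odds_for, odds_against):
    """Convert odds of 'for : against' into two probabilities that sum to 1."""
    total = odds_for + odds_against
    return Fraction(odds_for, total), Fraction(odds_against, total)

# One lottery winner for every hundred rich people who got rich some other way.
p_lottery, p_other = odds_to_probabilities(1, 100)
print(p_lottery, p_other)  # 1/101 100/101
```

So odds of 1 to 100 correspond to a prior probability of 1/101, a shade under 1%, because the one lottery winner is counted alongside the hundred others.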
The likelihood ratio is then the ratio of the two likelihoods (generated in rules 4 and 5). So, for example, if hypothesis A explains the evidence just as well as hypothesis B, then the likelihood ratio will be 1/1, in other words 50/50, because the likelihood of the evidence is the same on both hypotheses. The evidence then argues for neither hypothesis. But if the evidence is a hundred times more likely on A than on B, and A and B exhaust all possible causes of that evidence, then the likelihood ratio is 100/1. So, if we have really good evidence that Joe Rich won the lottery, evidence that’s a hundred times less likely to exist if he didn’t (and instead got rich some other way), then we get:
Prior Odds × Likelihood Ratio = 1/100 × 100/1 = 100/100 = 1/1
So with that evidence, it’s just as likely that Joe got rich by winning the lottery as that he got rich some other way. It’s 50/50. To get more certain than that, you need better evidence than that. For example, evidence that’s a thousand times less likely to exist if Joe didn’t win the lottery has this effect:
Prior Odds × Likelihood Ratio = 1/100 × 1000/1 = 1000/100 = 10/1
Then the odds Joe got rich by winning the lottery are ten to one. That means it’s ten times more likely he won the lottery, than anything else.
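The two Joe Rich calculations can be reproduced with a short sketch of the Odds Form. The function names are assumptions made for the illustration. The numbers are the ones used in the text.

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes' Theorem: posterior odds = prior odds x likelihood ratio."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds in favor (as a single fraction) into a probability."""
    return odds / (1 + odds)

prior = Fraction(1, 100)  # one lottery winner per hundred other rich people

for ratio in (Fraction(100, 1), Fraction(1000, 1)):
    odds = posterior_odds(prior, ratio)
    print(odds, odds_to_probability(odds))
# prints "1 1/2" and then "10 10/11"
```

Odds of 1/1 are a probability of 1/2, and odds of 10/1 are a probability of 10/11, or about 91%.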
Once you realize how to do this simple napkin math, you can analyze all kinds of questions, such as about how good the evidence has to be to reach a certain level of certainty in a claim, or about what it even means for evidence to be “good.” It also helps understand what a prior probability is (through the idea of the “prior odds” on a claim being true, something gamblers are always calculating for anything and everything), and how it affects the amount of evidence we need to believe a claim. You’ll start to get a sense, in other words, for the whole logic of evidence.
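One such question, how good the evidence has to be to reach a given level of certainty, can be sketched by running the same formula backwards. The 99% target here is a hypothetical choice for illustration. The prior odds are the lottery figure from the example above.

```python
from fractions import Fraction

def required_likelihood_ratio(prior_odds, target_probability):
    """Likelihood ratio needed to move the prior odds up to a target probability."""
    target_odds = target_probability / (1 - target_probability)
    return target_odds / prior_odds

prior = Fraction(1, 100)    # prior odds that a rich person won the lottery
target = Fraction(99, 100)  # hypothetical goal: 99% confidence
print(required_likelihood_ratio(prior, target))  # 9900
```

In other words, with prior odds of 1 to 100 against you, the evidence would need to be 9,900 times more likely on your claim than on the alternative before you could be 99% sure.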
I’ve Said It Before
In my reply to Chris Guest’s remarks at TAM, one of the points I made was:
Guest is first bothered by not knowing where I get my estimates from [in a historical analysis]. But … they are just measures of what I mean by “unlikely,” “very unlikely,” and similar judgments. My argument is that “assigning higher likelihoods to any of these would be defying all objective reason,” … which is a challenge to anyone who would provide an objective reason to believe them more likely. In other words, when historians ask how much [a certain piece of] evidence weighs [for or against a conclusion], they have to do something like this. And whether they do it using cheat words like “it’s very unlikely that” or numbers that can be more astutely questioned makes no difference. The cheats just conceal the numbers anyway (e.g., no one says “it’s very unlikely that” and means the odds are 1:1). So an honest historian should pop the hood and let you see what she means.
This is why it is still important to learn at least the basics of probability mathematics (which really requires nothing more than sixth grade math, and some people think I’m joking but I’m actually serious when I say Math Doesn’t Suck will get you up to speed on that, even if all you do is read the sections on fractions and percentages). You really should start thinking about what you mean when you make colloquial assertions of probability and frequency throughout your life. “It happens all the time”; “that’s abnormal”; “that would be weird”; “I can imagine that”; “that’s not what I expected to happen”; and on and on; these are all mathematical statements of probability. You just don’t realize these are expressions of mathematical quantities (frequencies, specifically, and thus probabilities). But never asking what numbers they correspond to doesn’t make you a better thinker or communicator. It makes you a worse one.
Likewise, in Understanding Bayesian History, one of the points I made was:
Historians are testing two competing hypotheses: that a claim is true vs. the claim is fabricated (or in error etc.), but to a historian that means the actual hypotheses being tested are “the event happened vs. a mistake/fabrication happened,” which gives us the causal model “the claim exists because the event happened vs. the claim exists because a mistake/fabrication happened.” In this model, b contains the background evidence relating to context (who is making this claim, where, to what end, what kind of claim is it, etc.), which gives us a reference class that gives us a ratio of how often such claims typically turn out to be true, vs. fabricated (etc.), which historians can better estimate because they’ve been dealing with this kind of data for years. We can then introduce additional indicators that distinguish this claim from those others, to update our priors. And we can do that anywhere in the chain of indicators. So you can start with a really general reference class, or a really narrow one—and which you should prefer depends on the best data you have for building a prior, which historians rarely have any control over, so they need more flexibility in deciding that (I discuss this extensively in chapter 6 [of Proving History], pp. 229-56).
History is about testing competing claims about what caused the evidence we now have. Which is the first thing to consider. And that also means the second thing to consider is that we often have a lot of prior information about what usually is the cause of certain kinds of evidence. And that information can’t be ignored. It has to be factored in—and it factors in at the prior probability.
Thus, for example, when we look at literature like the Gospels (sacred stories written about revered holy men by religious fanatics, featuring endless wonders), we find that most other cases of that kind of writing (in fact, all of them) are highly unreliable historically. Just peruse the Medieval Hagiography literature for the closest parallels to the Gospels…though even just all the forty or so Gospels illustrate the statistical point. They therefore cannot be treated like personal memoirs or rationalist histories or sourced biographies of the era. Indeed, even secular biographies in antiquity are highly unreliable. As I show in On the Historicity of Jesus (p. 219, w. note 168), experts have demonstrated that “most biographies of philosophers and poets in antiquity were of this type: inventions passed off as facts,” and that fabrication was the norm even for them, even biographies of verifiably historical people.
A third point to close with is that there is still a difference between knowing a claim is fabricated and not knowing whether it is. This has caused much confusion. If I say we cannot tell whether a particular story is true or fabricated, that it’s a 50/50 toss up in the absence of any other corroboration, I am not saying the story is fabricated. I’m saying we do not know whether it is true or fabricated, that either is as likely as the other on present information. This is a conclusion most difficult for people who suffer from ambiguity intolerance. They can’t compute the idea of “not knowing.” It either has to be true or fabricated. So if I say we can’t say it’s probably true, because we don’t know that it is more probably true than fabricated, they cannot comprehend me as saying anything other than “it’s fabricated.” If you catch people in that error early you can call their attention to it. Bayesian reasoning helps you realize this is a thing to watch for.
Concluding Observations for the Future
In If You Learn Nothing Else about Bayes’ Theorem, Let It Be This, I listed two things you should definitely get from Bayesian reasoning: theories cannot be argued in isolation; and prior assumptions matter. The first point is that all probabilities are relative probabilities. The probability that something you believe is true is relative to the probability that something else is true instead. So you cannot know how likely your belief is, if you don’t know how likely its alternatives are. All hypotheses have to be compared with other hypotheses. You can’t just rack up evidence for “your” hypothesis and conclude it’s true. You have to test your hypothesis against others, both on the measure of how well they explain the evidence (because they might both perform equally well…or something else might perform better than yours!), and on the measure of prior probability.
The conclusion that we really are all Bayesians now is a growing realization across disciplines. And that realization is linked to both of those features: hypotheses cannot be argued for in isolation from alternatives; and prior knowledge cannot be ignored. Legal scholar Enrique Guerra-Pujol of Puerto Rico (whose own research often touches on Bayesian reasoning in the law) just recently wrote a brief post on this point at Prior Probability, in positive reaction to an article by F.D. Flam, The Odds, Continually Updated, which summarizes how, in fact…
Now Bayesian statistics are rippling through everything from physics to cancer research, ecology to psychology. Enthusiasts say they are allowing scientists to solve problems that would have been considered impossible just 20 years ago. And lately, they have been thrust into an intense debate over the reliability of research results.
Flam summarizes how frequentism fails to accurately describe reality, and that major scientific errors are being corrected by bringing problems back to a Bayesian framework where they should have started. I’ve often used the example myself of fraud: frequentism is completely helpless to detect fraud. It simply assumes fraud has a zero probability. Because frequentism ignores prior probability. But you can’t do that and claim to be making valid inferences.
Guerra-Pujol zooms right in on the same point made by Flam, in a sentence that truly captures why Bayesian reasoning is the superior framework for all empirical reasoning:
Bayesian calculations go straight for the probability of the hypothesis, factoring in not just the data from the coin-toss experiment but any other relevant information—including whether you’ve previously seen your friend use a weighted coin.
This captures the whole point about the importance of priors. That point is not simply that you have to account for the known frequency of fraud in a given case (though you do), but that ignoring prior information in general is simply not going to get you to an accurate inference about what’s happening. If you know your friend loves using weighted coins, that’s not data you can ignore. Likewise, if you know most remarkably successful drug trials end up being invalidated later, and not just because of fraud but even more because of the statistical anomaly that so many drug trials are being conducted that remarkable anomalies are inevitable by chance alone (a hidden example of the multiple comparisons fallacy), you have to take that into account. Mere frequentism will deceive you here. Similar points have exposed serious flaws across the sciences, where Bayesian analysis exposes the inadequacies of traditional frequentism time and again (I only recently mentioned just one instance, that this is happening in the field of psychology). Nate Silver has been trying to explain this to people for ages now. Alex Birkett has recently composed an article on this as well.
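The weighted coin point can be made concrete with a small sketch. Every number in it is hypothetical and chosen only for illustration: 8 heads in 10 tosses, a trick coin that lands heads 80% of the time, and two different sets of prior odds depending on what you already know about your friend.

```python
from fractions import Fraction

def likelihood_ratio(heads, tails, p_weighted, p_fair=Fraction(1, 2)):
    """How much more likely the observed tosses are with a weighted coin than a fair one."""
    weighted = p_weighted**heads * (1 - p_weighted)**tails
    fair = p_fair**heads * (1 - p_fair)**tails
    return weighted / fair  # the binomial coefficient cancels, so it is omitted

def posterior_probability(prior_odds, ratio):
    """Apply the odds form, then convert the resulting odds to a probability."""
    odds = prior_odds * ratio
    return odds / (1 + odds)

# Hypothetical data: 8 heads, 2 tails; hypothetical trick coin: 80% heads.
ratio = likelihood_ratio(heads=8, tails=2, p_weighted=Fraction(4, 5))

trusting = posterior_probability(Fraction(1, 99), ratio)    # no history of trick coins
suspicious = posterior_probability(Fraction(1, 1), ratio)   # you've seen the weighted coin before
print(float(ratio), float(trusting), float(suspicious))
```

The same ten tosses leave the weighted coin hypothesis at roughly 6% with the first prior and roughly 87% with the second. The data did not change. Only the prior information did.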
Another famous example of ignoring priors (also mentioned by Flam and Guerra-Pujol) is the famous Monty Hall Problem, which I love not only because my coming to understand it was a eureka moment for me, but also because it’s connected to one of the most jaw dropping examples of sexism in the sciences, when a woman pwned the entire male mathematics establishment, and they ate their foot over it. But the point here is that those men (actual PhDs in mathematics) who tried to mansplain to Marilyn vos Savant why she was wrong, failed to realize that conclusions about probability must take into account prior information. In the Monty Hall problem, that included the subtle fact that Monty is giving you information when he opens one of the doors for you. He’s telling you which door he didn’t open. And that changes everything.
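Here is a minimal sketch of the Monty Hall problem worked as a Bayesian update. It assumes the standard version of the puzzle, which the text above does not spell out: you pick door 1, Monty always opens a door with a goat behind it, never opens your door, and chooses at random when he has two goat doors available.

```python
from fractions import Fraction

# You pick door 1 and Monty opens door 3, revealing a goat.
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Probability that Monty opens door 3, given where the car actually is.
likelihood = {
    1: Fraction(1, 2),  # car behind your door: Monty picks door 2 or 3 at random
    2: Fraction(1),     # car behind door 2: door 3 is the only goat door he can open
    3: Fraction(0),     # car behind door 3: he would never reveal the car
}

evidence = sum(prior[d] * likelihood[d] for d in prior)
posterior = {d: prior[d] * likelihood[d] / evidence for d in prior}
print(posterior[1], posterior[2])  # 1/3 2/3
```

The likelihoods are where Monty's information enters. His choice of door is twice as expected if the car is behind door 2 as it is if the car is behind your door, so switching doubles your chances.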
Priors thus can’t be ignored. And anyone who thinks they can “do probability” without them, is not talking about the real world anymore. Yes, you have to justify your priors. They do have to come from actual prior information. Not just your unaccountable guessing. As the tagline reads at Guerra-Pujol’s blog Prior Probability, “Hey, where did you get your priors?” And a lot of pseudologic proceeds from inventing priors out of nowhere (theists do it all the time). But real logic still requires you to come up with a prior probability. And that prior had better be informed. I discuss the challenges of how to determine priors from many competing collections of data in Proving History (especially in chapter six).
I may publish some of my other materials on Bayesian teaching in future. But those who want to get started will find help at Bayes’ Theorem: Lust for Glory. For those who want to delve deeper and see what’s really going on under the hood of Bayesian historical theory, some of the heavier philosophical issues of Bayesian epistemology of history I tackle in responding to the Tim Hendrix review, which supplements Tucker’s as one of the few done by an actual expert in mathematics. You might also benefit from my article on Two Bayesian Fallacies, which shows how a philosopher stumbled badly in understanding how Bayesian reasoning works, and seeing how that’s a stumble can be educational when you want to understand what correct Bayesian reasoning is.