At INR3 in Kamloops I spoke on applying Bayesian logic to the study of Jesus along with the same principles we apply to dead religions (so as to avoid the “don’t offend the Christians” reaction to controversial claims…claims that would not be controversial if Jesus was not the object of worship of billions of loud, influential people). In Q&A philosopher Louise Antony challenged my application of Bayes’ Theorem to historical reasoning with a series of technical complaints, especially two fallacies commonly voiced by opponents of Bayesianism. I was running out of time (and there was one more questioner to get to) so I explained that I answered all her stated objections in my book Proving History (and I do, at considerable length).
But I thought it might be worth talking about those two fallacies specifically here, in case others run into the same arguments and need to know what’s fishy about them. The first was the claim that prior probabilities (the first pair of premises in Bayesian logic) are wholly arbitrary and therefore Bayesian arguments are useless because you can just put any number in that you want for the priors and therefore get any conclusion you want. The second was the claim that the probabilities of the evidence (the second pair of premises in Bayesian logic) are always 1 (i.e. 100%) and therefore Bayes’ Theorem cannot determine anything at all as to which competing hypothesis is true.
Never mind that if Antony was right about these points, then all Bayesian arguments and all Bayesian conclusions in scientific and industrial and public policy research would be bogus, not just the theorem’s application to history, and thousands of scientists and mathematicians would have to be engaged in a conspiracy to conceal this from the public, like some cabal of astrologers. Put that obvious objection aside. Because, though also true, it doesn’t educate. It’s important to understand why thousands of scientists and mathematicians reject claims like hers. Because such claims are mathematically illiterate. And philosophers should not want to be mathematically illiterate…especially if they want to issue critiques of mathematics.
1. The Fallacy of Arbitrary Priors
As to the first claim, no, you can’t put just any prior in. You have to justify your priors. You might only be able to justify priors with uncomfortably wide margins of error, or you might not be able to justify any prior better than 50/50, for example, but these limitations do not always obtain, and in any event such priors are still not arbitrary. They are compelled by the state of the background evidence, or b. Because all the premises in Bayes’ Theorem (all the terms in the equation) are conditional on b. This should be evident from the mathematical notation:

P(h|e.b) = [P(h|b) x P(e|h.b)] / ( [P(h|b) x P(e|h.b)] + [P(~h|b) x P(e|~h.b)] )
See how the letter “b” is in every single term in the equation? The vertical bar separating the two halves of each parenthesized term is the mathematical symbol for conditional probability. Thus P(h|b), which is the prior probability of the hypothesis being true (as opposed to P(~h|b), which is the prior probability of the hypothesis being false), means the probability (P) of the hypothesis (h) conditional on our background knowledge (b). To insert just any number here would be to simply disregard all the information in b. Which is cheating, not honest Bayesian reasoning. The mathematical notation requires your input to reflect b. That is why b is there. Your use of the equation is invalid if you ignore it.
I discuss how b constrains your input at the stage of assigning priors in Proving History, pp. 70-71, 79-85, 110-14, 229-80. But there is an even more fundamental point here, which is that b is simply always a previous e (hence often b is called old evidence and e is called new evidence), and therefore every prior probability is in fact the posterior probability of a previous run of the equation (whether actually or hypothetically). This is why e and b can be demarcated any way you want (as long as they don’t overlap), and the equation will always come out the same (if it doesn’t, you’ve done something wrong).
So you could hypothetically start at a state of zero empirical knowledge, where b contains only a priori knowledge (e.g. logic and mathematics) and your priors are therefore 50/50 (when there is literally no evidence yet favoring h over ~h, or vice versa, that logically entails P(h|b) = P(~h|b) = 0.5). Then you start adding evidence into e one item at a time: add the first item, run the equation, and see how the prior (starting at 0.5) gets adjusted by that item of evidence (if at all). The posterior probability that results becomes the prior probability in the next run of the equation, when you add one more item of evidence into e (the previous item having been folded into b by the run you just completed: what was e is now part of b, and in the new run e contains the new item instead). And so on, until you’ve entered all known evidence (all relevant human knowledge). It’s tedious to do it this way, and thus usually unnecessary. But you could do it this way. And your result will be no different (and again, if it is, you’ve done something wrong).
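That stepwise procedure is easy to sketch in code. The following is a toy illustration of my own (not from Proving History; the likelihood pairs are invented numbers), showing the posterior of each run becoming the prior of the next:

```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """One run of Bayes' Theorem: P(h|e.b) from the prior P(h|b) and the two likelihoods."""
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1 - prior) * p_e_given_not_h)

# Each tuple is (P(e|h.b), P(e|~h.b)) for one item of evidence (invented numbers).
evidence = [(0.9, 0.3), (0.6, 0.2), (0.4, 0.7)]

prior = 0.5  # zero empirical knowledge: nothing yet favors h over ~h
for p_eh, p_enh in evidence:
    prior = update(prior, p_eh, p_enh)  # each posterior becomes the next prior

print(round(prior, 4))  # → 0.8372
```

Notably, feeding the evidence items in any other order yields the same final number, which is the point made above: however you demarcate e and b, the equation comes out the same.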
This is called the method of iteration in Proving History (check the index, “iteration, method of”). But the point is that priors are not mysterious values pulled out of thin air; they are simply the posterior probabilities that result from prior ratios of likelihoods. And that means they are constrained. They are constrained by those ratios of likelihoods (the likelihood ratios of all prior evidence). You thus can’t just input any number you want. Your input must reflect what this iteration from zero empirical knowledge would produce (if you had the inordinate time to sit down and complete it). It therefore must reflect background evidence. Starting with a ballpark prior (ideally from some reference class that gives you a defensible ratio) is just an easier way to do it. But in no way should this conceal the fact that this ballpark estimate must be derivable from an iterated run of all prior evidence from an initial neutral prior of 50%. Thus, b constrains P(h|b).
Priors are therefore not arbitrary–at least in the hands of any honest Bayesian. Abuses of logic by the unscrupulous or the incompetent do not serve to challenge the validity of logic, and neither do abuses of Bayes’ Theorem challenge its validity. Priors may still be subjective, but that’s not the same thing as arbitrary–and perhaps the fallacy here derives from confusing the two. On why they should not be confused, check the index of Proving History, “subjective priors, problem of.”
2. The Fallacy of Foregone Probability
I can only assume what Antony meant when she said the probability of the evidence is always 1 is this weird notion I’ve heard from several anti-Bayesians who don’t understand Bayes’ Theorem or even the concept of epistemic probability altogether: that the probability of an event after it has occurred is 100% (as otherwise it wouldn’t have happened), therefore the probability of all events is ultimately 100%, and therefore always 1. This is easier to understand on an assumption of causal determinism, where there was always only ever one outcome, and our assigning it a probability is just a consequence of our inevitable ignorance of all the conditions. But often it’s framed as “since you are observing e, the existence of e is well nigh 100% certain, regardless of what caused it,” therefore P(e|h) and P(e|~h) are always 1, because e is observed and therefore certainly exists, whether h is true or not.
(I have heard and read these arguments several times before, but cannot at present locate an example online; if anyone knows of one, do please post its URL in comments.)
Antony might have meant, instead, that hypotheses can always be gerrymandered so that the probability of the evidence is 1, but that would not mean P(e|h) is always 1, but only that it can always be forced to equal 1 with some elaboration (thus, only with the elaboration can it do so). And even then, such a tactic cannot ignore the mathematical consequences of such gerrymandering to the prior probability. The more you gerrymander a theory to raise its consequent probability, the more you lower that theory’s prior probability, often to no net gain. This is the basic logic of Ockham’s Razor, and I discuss the principle of gerrymandering underlying it in Proving History, pp. 80-81 (see also in the index, “gerrymandering (a theory)”). In essence, you can’t gerrymander P(e|h) to equal 1 without “paying for it” with a reduction in P(h|b). So you would just be moving probabilities around in the equation and not actually getting any better result. Hopefully Antony was aware of that and thus not making this argument.
But if Antony meant the first argument (or some variant of it), then that would mean she does not understand the conditional nature of Bayesian probabilities. She would also be confusing a physical with an epistemic probability. Bayes’ Theorem can operate with physical probabilities without running afoul of the “foregone probability” conundrum, since even though (e.g.) on causal determinism the outcome of a die roll is foregone, it still produces frequencies of outcome, and that is what physical probability measures (so even then the alleged problem does not arise, except for omniscient beings perhaps, but they would have no need of Bayes’ Theorem because they already know everything: Proving History, p. 332, n. 43).
But Bayes’ Theorem is usually employed with epistemic probabilities, i.e. the probability that some belief about the world is true (which can always be any value less than 1, even when the physical probability is 1). See Stephen Law’s remarks on the distinction between epistemic and “objective” probability, or what I call physical probability in Proving History, pp. 23-26 (I have concluded that calling the latter an “objective” probability is confusing and should be abandoned: ibid., p. 297, n. 4; I also argue all epistemic probabilities are estimates of physical probabilities, but only through the logic of Bayesian reasoning itself: ibid., pp. 265-80; BTW, a better term for those uncomfortable with presuming physicalism is “existential” probability: see David Hawkins, “Existential and Epistemic Probability,” Philosophy of Science 10.4 [Oct. 1943]: 255-61).
But the key element to understand here is that the probabilities in a Bayesian equation are all conditional probabilities. That is why the term for evidential likelihood reads P(e|h.b) and not P(e). Even in short forms of the equation, where you find P(e) in the denominator (which is really P(e|b); often mathematicians drop elements that appear in every term, like b, since they already know those elements are there and don’t need to be reminded of it, although laymen often won’t know that so I tend to avoid that kind of abbreviated notation myself), in that case P(e) is only standing-in for the long-form denominator of [P(h) x P(e|h)] + [P(~h) x P(e|~h)] in accordance with the principle of total probability. So it’s still a conditional probability: what is being measured is the probability of the evidence (e) conditional on the hypothesis being true (h) in one case and conditional on the hypothesis being false (~h) in the other case. The notation for the one is P(e|h); for the other, P(e|~h); for both combined, P(e).
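The stand-in role of P(e) is easy to verify numerically. A minimal sketch (all figures invented purely for illustration):

```python
p_h = 0.3              # P(h|b): prior probability of the hypothesis (invented)
p_e_given_h = 0.8      # P(e|h.b): likelihood of the evidence if h is true
p_e_given_not_h = 0.1  # P(e|~h.b): likelihood of the evidence if h is false

# Principle of total probability: P(e) is the weighted sum over h and ~h,
# i.e. the long-form denominator [P(h) x P(e|h)] + [P(~h) x P(e|~h)].
p_e = p_h * p_e_given_h + (1 - p_h) * p_e_given_not_h

# Short-form Bayes' Theorem, P(h|e) = P(h) x P(e|h) / P(e):
posterior = p_h * p_e_given_h / p_e

print(round(p_e, 2), round(posterior, 4))  # → 0.31 0.7742
```

So even the abbreviated notation is measuring conditional probabilities: P(e) here is just the two conditional likelihoods averaged by their priors, not some free-standing unconditional fact about e.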
Thus, the probabilities being asked for at this stage (the probabilities you must enter as premises) are the probability of the evidence if the hypothesis is true and the probability of the evidence if the hypothesis is false. As conditional probabilities, these are probabilities conditional on h being true or false. The difference between those two probabilities is precisely the extent to which the evidence e supports one hypothesis over the other (if indeed it does). The actual existential probability of the evidence is completely irrelevant to this, except insofar as what we know about the existential probabilities informs our epistemic probabilities (see Proving History, pp. 265-80). But even then it cannot be the case that the existential probability is always 1 (it certainly can be 1, but not always). For if h is true and ~h is false (existentially, i.e. unknown to us), the existential probability of the observed evidence on ~h is not at all likely to be the same as the existential probability on h, as if every possible past would have resulted in exactly the same future, a notion that would make a mockery of the whole of physics. Imagine mixing any two chemicals in a glass and, no matter which two chemicals you mix, always only ever ending up with a glass of water; now imagine everything worked that way, such that causation was wholly irrelevant to the course of history. That’s what it would mean if the probability of the evidence were always 1.
Hence we are not asking what the probability is of history having turned out as it did (which on causal determinism is always 1, i.e. 100%) or what the probability is of e existing when we observe e (which is typically quite close to 100%). We are asking what the probability is of history having turned out as it did if certain causes were in place (the causes we are hypothesizing), and what that probability would be if those causes were not in place (and some other set of causes were in place instead). One of these is necessarily a counterfactual. Since either h or ~h must be true, not both, the probability of the evidence on one of those two possibilities cannot be the probability of history having turned out as it actually did, because one of them didn’t even happen, and thus was not even involved in history having turned out as it actually did (which, incidentally, is what we are trying to find out with Bayes’ Theorem: whether h or ~h is what actually happened).
What goes into Bayes’ Theorem, therefore, is not the probability that an event e occurred given that we observe e, which would be P(e|o) where o = an observation of e, a probability that is usually close to 100% (barring hallucination and such). Rather, what goes into Bayes’ Theorem is the probability of e occurring given the occurrence of h, hence P(e|h) where h = a hypothesized system of events prior to e, a probability that is often not even close to 100%. And we must also put into the equation P(e|~h) where ~h = any other possible system of events prior to e except h, a probability that will only be the same as P(e|h) when e is equally expected regardless of whether h is true or ~h is true instead. In that case, e is not evidence for either h or ~h, because it’s just as likely to have appeared on either possibility. But if, say, e rarely results from a system of causes like h (say, only 1 in 100 times), yet often results from some other system of causes (say, on any other system of causes, e will result 80 in 100 times), then P(e|h) = 0.01 and P(e|~h) = 0.80, neither of which is 1.
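To complete that example with actual numbers (assuming, purely for illustration, an even 50/50 prior, which the text above does not specify):

```python
p_h = 0.5               # P(h|b): an even prior, assumed only for illustration
p_e_given_h = 0.01      # e results from causes like h only 1 in 100 times
p_e_given_not_h = 0.80  # e results from any other system of causes 80 in 100 times

posterior = (p_h * p_e_given_h) / (
    p_h * p_e_given_h + (1 - p_h) * p_e_given_not_h
)
print(round(posterior, 4))  # → 0.0123
```

Observing e here drives the probability of h down to about 1.2%, precisely because the two likelihoods differ so much, and neither of them is 1.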
There are other occasions where a consequent probability can fall out as 1 that don’t relate to this fallacy. For example, a consequent probability can become 1 after factoring out coefficients of contingency, mathematically reducing the remaining consequent to 1 (Proving History, pp. 215-19, with pp. 77-79; and check the index for “coefficient of contingency”). And often a value of simply 1 will be substituted for an approximation to 1 merely out of convenience, since the difference between the precise number and the approximate number is so small it won’t even show up in the math at the resolution you are working with (e.g., Proving History, pp. 85-88, 221-22, etc.), especially in history, where the margins of error are often so wide that they wash out any such small differences in estimates, rendering those differences moot to the outcome.
Concluding with an Example
Medical diagnosis affords many examples of how conditional probability works here, and likewise the background-evidence dependency of prior probability. Suppose we have a test for cancer that detects cancer 95% of the time (meaning 5% of the time, when there is cancer, it misses it, producing a negative result) and that gives a false positive 1% of the time (meaning that 1 out of every 100 times that the test is taken by someone without cancer, the test falsely reports they have cancer anyway). And suppose we know from abundant data that 2% of the population has cancer. We take the test, and it reports a positive result. What is the probability that we have cancer?
Here “positive result” (POS) is the evidence (e) and “we have cancer” is the hypothesis (h). The frequency data (those 95% and 1% and 2% figures just cited) derive from our background knowledge (b), e.g. millions of prior tests, and prior data on cancer rates in the population we belong to. Bayes’ Theorem then tells us (I will here leave b out of the notation, but remember it is in every term):
P(h|POS) = [P(h) x P(POS|h)] / ( [P(h) x P(POS|h)] + [P(~h) x P(POS|~h)] )
= [0.02 x 0.95] / ( [0.02 x 0.95] + [0.98 x 0.01] )
= 0.019 / (0.019 + 0.0098)
= 0.019 / 0.0288
≈ 0.66 (rounded)
A positive result from this test thus gives us a mere 2 in 3 chance of actually having cancer (despite this test being “95% accurate” or even “99% accurate,” depending on whether you are referring to its false-negative rate or its false-positive rate). Note that it would make no sense to say that the probability of the evidence (observing a positive result from the test) is always 100% no matter what. That would amount to saying that the test always gives a positive result whether there is cancer or not. Which would mean the test was completely useless. You may as well test for cancer using tea leaves or astrology.
So obviously it makes no sense to say that the probability of the evidence is always 100% in Bayesian reasoning. Certainly it is not. What we want to know in this case, for example, is the probability that a positive result is caused by something other than cancer (and is thus a false positive), and a test that does that 1 in 100 times gives us that probability: it’s 0.01 (or 1%). That is not 1. Even if causal determinism is true, it’s still not 1. Because this is not the probability that the test came out positive given that we observe the test having come out positive; it’s the probability of the test coming out positive if we do not have cancer. So, too, for the other consequent probability, P(e|h), which is the probability of observing e (a positive result) if we do indeed have cancer…which from prior observations we know is 0.95 (or 95%). That is also not 1.
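The whole diagnostic calculation above can be checked in a few lines (a direct transcription of the stated figures):

```python
p_cancer = 0.02               # P(h|b): 2% of the population has cancer
p_pos_given_cancer = 0.95     # P(e|h.b): detection rate (5% false negatives)
p_pos_given_no_cancer = 0.01  # P(e|~h.b): false-positive rate

numerator = p_cancer * p_pos_given_cancer
p_cancer_given_pos = numerator / (
    numerator + (1 - p_cancer) * p_pos_given_no_cancer
)
print(round(p_cancer_given_pos, 2))  # → 0.66
```

Note that if the probability of the evidence really were always 1, we would have to set both likelihoods to 1 here, and the posterior would then just equal the prior (0.02): the test would tell us nothing at all.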
Likewise, notice that we can’t just insert any prior probability we want here, either. What we insert has to derive from prior evidence, namely prior observations of the test giving false negatives and false positives and prior observations of how many people tend to have cancer at any given time. All those prior observations constitute background knowledge (b) that constrains the prior probability, in this case to a very narrow range (if we have a lot of data, that will be a tiny margin of error around a value of 2%).
It is here that one can question whether we can use this tool on much scarcer data, as in the case of history. In Proving History I prove that we can, as long as we accept correspondingly large margins of error and degrees of uncertainty.
In the case of priors, we might not have thousands of data points to work with–maybe only ten, let’s say–but there are ways to mathematically work with that. Likewise if we have millions of data points and could never systematically enumerate them (e.g. general knowledge of human behavior) and thus have to make a best guess from what we can observe: there is a way to do this, mathematically, that accounts for the degree of error it introduces. And it might be less clear what reference class we should start from, and how we estimate data frequencies in that class might often be subjective, but the latter only introduces uncertainties that we can again define mathematically, and the former ultimately won’t matter as long as we leave no evidence out of the sum of b and e. Since by iteration, no matter what we start with in b (no matter what reference class we draw a ratio from for our initial priors), we will always end up with the same final result once we’ve put everything else in e (as we are obligated to do).
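As one concrete illustration of such methods (my example here, not necessarily the specific technique Proving History has in mind), Laplace’s rule of succession gives a defensible frequency estimate from as few as ten data points, with its smoothing acting as a built-in hedge against overconfidence:

```python
def rule_of_succession(successes, trials):
    """Laplace's rule of succession: estimate a frequency from k successes in
    n trials as (k + 1) / (n + 2), so tiny samples never yield an
    overconfident estimate of exactly 0% or 100%."""
    return (successes + 1) / (trials + 2)

# Ten data points, seven exhibiting the feature of interest (invented numbers):
print(round(rule_of_succession(7, 10), 2))   # → 0.67, rather than a raw 0.7

# Even a sample with zero hits never forces a probability of exactly 0:
print(round(rule_of_succession(0, 10), 3))   # → 0.083
```

The small gap between the smoothed and raw estimates is exactly the kind of mathematically definable uncertainty described above: it shrinks as the data grow, and widens honestly when the data are scarce.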
In the case of consequents, too little or too much data can be dealt with mathematically in the same way as for priors. Likewise, the expectation of outcomes given our hypothesis, and then given other alternative explanations of the same evidence, might be subjective, but that again only introduces uncertainties that we can again define mathematically. Historians can then debate the basis for whatever value you assign to either of the two consequent probabilities (also known as the likelihoods). If it has no basis, then the assignment is unwarranted. Otherwise, it has whatever warrant its basis affords. And when we argue a fortiori, using a fortiori premises (Proving History, pp. 85-88), we can easily reach agreement among all honest and competent observers on an a fortiori result (Proving History, pp. 88-93, 208-14).
And that is actually what historians already do anyway. They just aren’t aware of the mathematical formalism that justifies what they are doing, or that exposes when they are doing it wrong (i.e. illogically or fallaciously).
Thus, it’s time to retire the fallacies of arbitrary priors and foregone likelihoods. They do not constitute a valid critique of Bayesianism in general or of Bayesianism in history.