What scientists misunderstand about statistical evidence and how to fix it

Statistics is an essential tool in science, helping us answer important questions using data. However, the concept of “statistical evidence” remains difficult to define. Professor Michael Evans of the University of Toronto explores this thorny issue in a recent paper published in Encyclopedia in 2024.
The field of statistics deals with situations where there is some quantity of interest whose value is unknown, data has been collected, and it is believed that the data contain evidence about the unknown value. Statistical theory is supposed to use the data to answer two main questions: (i) provide a reasonable estimate of the quantity of interest, together with a measure of the accuracy of that estimate, and (ii) assess whether there is evidence for or against a hypothesized value of the quantity of interest, together with a measure of the strength of that evidence. For example, it would certainly be of interest to estimate the proportion of people infected with COVID-19 who will develop severe disease, or to know whether there is evidence for or against the hypothesized existence of dark matter based on measurements made by the Webb telescope.
As discussed in the article, there are two major themes in how these issues are addressed: evidential approaches and decision-theoretic approaches. The focus of the evidential approach is to ensure that any statistical methods used are explicitly based on the evidence in the data. In contrast, decision theory aims to minimize expected losses, using penalties assigned to incorrect conclusions. For scientific applications, however, it is argued that prioritizing the evidence in the data is best aligned with the fundamental goal of science, which is to determine truth. Professor Evans’ article places him firmly in the evidential camp.
As the paper observes, a fundamental issue confronts evidential approaches: most statistical analyses appeal to the concept of statistical evidence, with phrases such as “the evidence in the data shows” or “based on the evidence we conclude”, yet it has long been recognized that the concept itself has never been satisfactorily defined, or at least has never been given a definition that is universally agreed upon.
The fundamental question for the evidential approach is then: how should statistical evidence be defined? After all, how can one claim that a particular approach is evidence-based without stating explicitly what statistical evidence means? Professor Evans’ article reviews the many attempts that have been made over the years to address this problem.
There are several well-known statistical methods that are used as expressions of statistical evidence. Many people are familiar with the use of p-values for question (ii). There are well-known problems with p-values as a measure of statistical evidence, some of which are reviewed in the article. For example, a cutoff alpha has to be chosen to determine when a p-value is small enough to indicate evidence against the hypothesis, and there is no natural choice for alpha. Furthermore, a p-value can never provide evidence that a hypothesis is correct. The concept of a confidence interval is closely related to the p-value and therefore suffers from similar flaws.
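To make the role of the cutoff concrete, here is a minimal sketch, not taken from the paper, of how a two-sided binomial p-value might be computed and then compared with a cutoff; the data (60 heads in 100 coin flips) and the cutoff of 0.05 are purely illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the paper): a two-sided binomial
# p-value for H0: theta = 0.5, followed by an arbitrary cutoff alpha.
from math import comb

def binomial_pvalue(k: int, n: int, theta0: float) -> float:
    """Two-sided p-value: total probability, under H0, of outcomes that are
    no more probable than the observed count k."""
    probs = [comb(n, i) * theta0**i * (1 - theta0)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    return sum(p for p in probs if p <= observed + 1e-12)

p = binomial_pvalue(60, 100, 0.5)   # assumed data: 60 heads in 100 flips
alpha = 0.05                        # conventional but arbitrary cutoff
print(f"p-value = {p:.4f}: read as evidence against H0 only if below alpha = {alpha}")
# A small p-value is read as evidence against H0, but no p-value, however
# large, is ever read as evidence that H0 is true.
```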
Allan Birnbaum made a substantial attempt in the 1960s and 1970s to establish the concept of statistical evidence as central to the field of statistics, and his work is discussed in the article. This led to the discovery of many interesting relationships between principles accepted by many statisticians, such as the likelihood, sufficiency and conditionality principles. Birnbaum did not succeed in fully characterizing what statistical evidence means, but his work points to another well-known division in statistics: frequentism versus Bayesianism. Birnbaum sought a definition of statistical evidence within frequentism, and both p-values and confidence intervals are frequentist in nature. Frequentists imagine that the statistical problem under study is repeated independently many times, and then look for statistical procedures that perform well across such sequences of repetitions.
In contrast, Bayesians want inference to rely solely on the observed data, without reference to such imaginary sequences. One cost of the Bayesian approach is that the analyst needs to provide a prior probability distribution for the quantity of interest that reflects their beliefs about its true value. Bayesians then update those beliefs after seeing the data, as represented by the posterior probability distribution of the quantity of interest. It is the comparison of prior and posterior beliefs that leads to an intuitively clear definition of statistical evidence, the principle of evidence: if the posterior probability of a particular value is greater than its prior probability, there is evidence that it is the true value; if the posterior probability is less than the prior probability, there is evidence that it is not the true value. It is the evidence in the data that changes beliefs, and the principle of evidence captures this precisely.
As explained in Professor Evans’ paper, further ingredients are required beyond the principle of evidence. In order to estimate the quantity of interest and measure the strength of the evidence, it is necessary to rank the possible values, and a natural way to do this is through the relative belief ratio: the ratio of the posterior probability of a value to its prior probability. When the ratio is greater than 1, there is evidence in favour of the value, and the larger the ratio, the stronger that evidence. Conversely, when the ratio is less than 1, there is evidence against the value.
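As a concrete illustration, here is a minimal sketch (my own construction, not code from the paper) of the relative belief ratio for a binomial proportion; the uniform prior on a grid of values and the assumed data of 70 successes in 100 trials are chosen purely for illustration.

```python
# Minimal sketch (illustrative assumptions): the relative belief ratio for a
# binomial proportion theta, with a uniform prior over a grid of candidate values.
from math import comb

n, k = 100, 70                                  # assumed data: 70 successes in 100 trials
grid = [i / 100 for i in range(1, 100)]         # candidate values 0.01, ..., 0.99
prior = [1 / len(grid)] * len(grid)             # uniform prior beliefs

likelihood = [comb(n, k) * t**k * (1 - t)**(n - k) for t in grid]
evidence = sum(p * L for p, L in zip(prior, likelihood))
posterior = [p * L / evidence for p, L in zip(prior, likelihood)]

# Relative belief ratio: posterior probability divided by prior probability.
rb = [post / pri for post, pri in zip(posterior, prior)]

i_half = grid.index(0.5)
i_best = max(range(len(grid)), key=lambda i: rb[i])
print(f"RB(0.50) = {rb[i_half]:.3f}  (< 1, so evidence against theta = 0.5)")
print(f"Estimate: theta = {grid[i_best]:.2f}, RB = {rb[i_best]:.1f}  (> 1, so evidence in favour)")
```

In this toy example, the value of theta that maximizes the ratio serves as the estimate, and the size of the ratio indicates how strong the evidence is.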
The article discusses much more, including how to deal with the subjectivity inherent in statistical methods, for example through model checking and checking for prior-data conflict. Perhaps most surprisingly, however, the evidential approach through relative belief leads to a resolution of the divide between frequentism and Bayesianism. Part of the story is that the reliability of any inference should be assessed, and that is what frequentism does. This arises in the relative belief approach through the prior probabilities of obtaining evidence against a value when that value is true, or evidence in favour of a value when it is false. In the end, the inferences are Bayesian, in that they reflect beliefs and rest on a clear definition of statistical evidence, while assessing the reliability of those inferences is frequentist. Both play a key role in the application of statistics to scientific problems.
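As a rough sketch of what such a reliability check could look like, again under the illustrative binomial assumptions used above and not taken from the paper, one can compute the probability of obtaining evidence against a value when that value is actually true; a small probability indicates a reliable inference.

```python
# Rough sketch (illustrative assumptions): the probability, when theta0 is
# actually the true value, of observing data that yields evidence against
# theta0, i.e. a relative belief ratio of at most 1.
from math import comb

n, theta0 = 100, 0.5
grid = [i / 100 for i in range(1, 100)]
prior = [1 / len(grid)] * len(grid)

def rb(theta: float, k: int) -> float:
    """Relative belief ratio of theta after observing k successes in n trials
    (posterior/prior simplifies to likelihood/evidence)."""
    like = [comb(n, k) * t**k * (1 - t)**(n - k) for t in grid]
    evidence = sum(p * L for p, L in zip(prior, like))
    return like[grid.index(theta)] / evidence

prob_misleading = sum(
    comb(n, k) * theta0**k * (1 - theta0)**(n - k)   # probability of k when theta0 is true
    for k in range(n + 1)
    if rb(theta0, k) <= 1                            # data giving evidence against theta0
)
print(f"P(evidence against theta0 = {theta0} when it is true) = {prob_misleading:.3f}")
```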
As the world becomes ever more reliant on data-driven insights, understanding what constitutes sound statistical evidence becomes increasingly important. Professor Evans’ research provides a thoughtful foundation for addressing this pressing question.
Journal reference
Evans, M. “The concept, historical roots and current developments of statistical evidence”. Encyclopedia 2024, 4, 1201–1216. DOI: https://doi.org/10.3390/encyclopedia4030078
About the author
Michael Evans is a professor of statistics at the University of Toronto. He received his Ph.D. in 1977 from the University of Toronto, where he has been employed ever since, with research leaves spent at Stanford University and Carnegie Mellon University. He is a Fellow of the American Statistical Association. He served as Chair of the Department of Statistics from 1992 to 1997 and as Interim Chair in 2022-23, and as President of the Statistical Society of Canada in 2013-2014. He has held various editorial positions: Associate Editor of JASA Theory and Methods from 1991 to 2005, Associate Editor of the Canadian Journal of Statistics from 1999 to 2006 and from 2017 to the present, Associate Editor of Bayesian Analysis from 2005 to 2015 and Editor from 2015 to 2021, and currently Subject Editor of the online journal FACETS and Associate Editor of The New England Journal of Statistics in Data Science.
Michael Evans’s research involves multivariate statistical methods, computational statistics, and the foundations of statistics. His current research focuses on developing a theory of inference called relative belief, which is based on a precise definition of how statistical evidence is to be measured. In addition, his research involves developing tools to address criticisms of statistical methodology concerning its inherent subjectivity. He has authored or co-authored numerous research papers, as well as the books Approximating Integrals via Monte Carlo and Deterministic Methods (with T. Swartz), published by Oxford University Press in 2000; Probability and Statistics: The Science of Uncertainty (with J. Rosenthal), published by W. H. Freeman in 2004 and 2010; and Measuring Statistical Evidence Using Relative Belief, published by CRC Press/Chapman and Hall in 2015.