

## The Bayes factor and its implementation in JASP: A practical primer

HU Chuan-Peng 1,2, KONG Xiang-Zhen 3, Eric-Jan WAGENMAKERS 4, Alexander LY 4,5, PENG Kaiping 1

1 Department of Psychology, School of Social Science, Tsinghua University, Beijing 100084, China

2 Neuroimaging Center, Johannes Gutenberg University Medical Center, 55131 Mainz, Germany

3 Language and Genetics Department, Max Planck Institute for Psycholinguistics, 6500 AH Nijmegen, The Netherlands

4 Department of Psychological Methods, University of Amsterdam, 1018 VZ Amsterdam, The Netherlands

5 Centrum Wiskunde & Informatica, 1090 GB Amsterdam, The Netherlands

Abstract

Statistical inference plays a critical role in modern scientific research. However, the dominant method of statistical inference in science, null hypothesis significance testing (NHST), is often misunderstood and misused, which contributes to unreproducible findings. To address this issue, researchers have proposed the Bayes factor as an alternative to NHST. The Bayes factor is a principled Bayesian tool for model selection and hypothesis testing; it quantifies the relative strength of evidence that the data provide for the null hypothesis H0 versus the alternative hypothesis H1. Compared with NHST, the Bayes factor has the following advantages: it weighs the evidence for H0 and for H1, it can be used to support H0, it is not “violently biased” against H0, it allows the evidence to be monitored as the data accumulate, and it does not depend on the sampling plan. Importantly, the recently developed open-source software JASP makes Bayes factor analyses accessible to most researchers in psychology, as we demonstrate here for the t-test. Given these advantages, adopting the Bayes factor will improve psychological researchers’ statistical inferences. Nevertheless, to make analyses more reproducible, researchers should keep their data and analyses transparent and open.

Keywords: Bayes factor; Bayesian statistics; Frequentist; NHST; JASP

HU Chuan-Peng, KONG Xiang-Zhen, Eric-Jan WAGENMAKERS, Alexander LY, PENG Kaiping. The Bayes factor and its implementation in JASP: A practical primer. Advances in Psychological Science, 2018, 26(6), 951-965. doi:10.3724/SP.J.1042.2018.00951

### 1.1 A brief introduction to Bayesian statistics

$p(A \cap B) = p(A \mid B) \times p(B) = p(B \mid A) \times p(A)$ (1)

$p(A \mid B) = \frac{p(A \cap B)}{p(B)} = \frac{p(B \mid A) \times p(A)}{p(B)}$ (2)

$p(H_0 \mid data) = \frac{p(data \mid H_0) \times p(H_0)}{p(data)}$ (3)

Here p(H0 | data) is the probability that model H0 is correct after updating on the data, i.e., the posterior probability; p(H0) is the probability that H0 is correct before the data are observed, i.e., the prior probability; and p(data | H0) is the probability of the observed data under model H0, i.e., the marginal likelihood. From this perspective, the main function of a round of data collection (an experiment) in Bayesian statistics is to help us update the credibility of our theoretical models.

$p(H_1 \mid data) = \frac{p(data \mid H_1) \times p(H_1)}{p(data)}$ (4)

$\frac{p(H_1 \mid data)}{p(H_0 \mid data)} = \frac{p(data \mid H_1)}{p(data \mid H_0)} \times \frac{p(H_1)}{p(H_0)}$ (5)

$\mathrm{BF}_{10} = \frac{p(data \mid H_1)}{p(data \mid H_0)}$ (6)
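Equations 5 and 6 say that the Bayes factor is the factor by which the data shift the prior odds into the posterior odds. A minimal numerical sketch (the prior probability and the Bayes factor below are made-up values, used only for illustration):

```python
# Hypothetical illustration of Equations 5-6: the Bayes factor multiplies
# the prior odds to give the posterior odds.
prior_h1 = 0.5                      # hypothetical prior probability of H1
prior_h0 = 1 - prior_h1             # prior probability of H0
bf10 = 4.0                          # hypothetical Bayes factor in favor of H1

prior_odds = prior_h1 / prior_h0
posterior_odds = bf10 * prior_odds                      # Equation 5
posterior_h1 = posterior_odds / (1 + posterior_odds)    # convert odds back to a probability

print(f"posterior odds = {posterior_odds:.2f}, p(H1|data) = {posterior_h1:.2f}")
# With equal prior odds, BF10 = 4 yields posterior odds of 4:1, i.e. p(H1|data) = 0.8
```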

Compared with NHST, the Bayes factor has the following advantages:

1. It weighs the evidence for H0 and for H1 simultaneously.
2. It can provide evidence in favor of H0.
3. It is not “violently” biased against H0.
4. The strength of evidence can be monitored as the data accumulate.
5. It does not depend on unknown or nonexistent sampling plans.

Building on Jeffreys (1961), Wagenmakers, Love, et al. (2017) proposed a rough classification of what different magnitudes of the Bayes factor mean (see Table 2). This classification is only an approximate guide, however, and should not be applied rigidly; researchers need to judge the meaning of a Bayes factor in the context of the specific study.

Table 2. A rough classification of the evidence expressed by the Bayes factor BF10.

| BF10 | Interpretation |
| --- | --- |
| > 100 | Extreme evidence for H1 |
| 30 ~ 100 | Very strong evidence for H1 |
| 10 ~ 30 | Strong evidence for H1 |
| 3 ~ 10 | Moderate evidence for H1 |
| 1 ~ 3 | Anecdotal (weak) evidence for H1 |
| 1 | No evidence |
| 1/3 ~ 1 | Anecdotal (weak) evidence for H0 |
| 1/10 ~ 1/3 | Moderate evidence for H0 |
| 1/30 ~ 1/10 | Strong evidence for H0 |
| 1/100 ~ 1/30 | Very strong evidence for H0 |
| < 1/100 | Extreme evidence for H0 |
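For analyses scripted outside JASP, it can be convenient to attach these verbal labels automatically. The helper below is a minimal sketch (the function name and cut-points simply transcribe Table 2; it is not part of JASP):

```python
def describe_bf10(bf10: float) -> str:
    """Map a Bayes factor BF10 onto the rough verbal labels of Table 2.
    The cut-points are guidelines, not strict decision thresholds."""
    if bf10 == 1:
        return "no evidence"
    labels = [(100, "extreme evidence for H1"),
              (30, "very strong evidence for H1"),
              (10, "strong evidence for H1"),
              (3, "moderate evidence for H1"),
              (1, "anecdotal evidence for H1"),
              (1 / 3, "anecdotal evidence for H0"),
              (1 / 10, "moderate evidence for H0"),
              (1 / 30, "strong evidence for H0"),
              (1 / 100, "very strong evidence for H0")]
    for cut, label in labels:
        if bf10 > cut:
            return label
    return "extreme evidence for H0"

print(describe_bf10(10.76))      # strong evidence for H1
print(describe_bf10(1 / 10.76))  # strong evidence for H0
```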

### 1.2 The default prior for the alternative hypothesis

$\delta \sim \mathrm{Cauchy}(x_0 = 0,\ \gamma = 1)$
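A quick way to get a feel for this default prior is to inspect it numerically. The snippet below is only an illustration (scipy is an assumption, not part of the original article): it shows that the Cauchy(0, 1) prior on the standardized effect size is centered on zero, yet its heavy tails still place half of the prior mass on effects larger than 1 in absolute value.

```python
from scipy import stats

# Default prior on the standardized effect size delta under H1: Cauchy(x0 = 0, gamma = 1)
prior = stats.cauchy(loc=0, scale=1)

print(round(prior.pdf(0), 3))        # density peaks at delta = 0 (about 0.318)
print(prior.sf(1) + prior.cdf(-1))   # P(|delta| > 1) = 0.5: heavy tails
print(prior.interval(0.5))           # central 50% of the prior mass: (-1.0, 1.0)
```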

### 3.1 A brief introduction to JASP

JASP is a free, open-source statistical program. It performs its computations with R packages, but using JASP does not require installing R. JASP’s long-term goal is to make state-of-the-art statistical techniques, and Bayes factors in particular, available to everyone through free software.

JASP was developed against the background of the replication crisis in psychology, and its design philosophy is as follows. First, it is open source and free, because openness should be an essential element of scientific research. Second, it is inclusive: it offers both Bayesian analyses and NHST analyses, and for the NHST analyses it adds output of effect sizes and their confidence intervals (Cumming, 2014). Third, it is simple: the core program contains only the most commonly used analyses, while more advanced statistical methods can be added through plug-in modules. Fourth, it has a friendly graphical interface: for example, the output updates in real time as the user selects variables, and tables are rendered in APA format. JASP also uses progressive output: the default output is the most concise, and researchers can request additional output themselves. In addition, to make it easy to share and publish an analysis, JASP saves the input data and the output in a single file with the .jasp extension, in which every result remains linked to the analysis and the variables that produced it. This integrated file is compatible with the Open Science Framework (OSF, https://osf.io/), so that both data and results can be made publicly available.

### 3.2 Running Bayes factor analyses in JASP and interpreting the results

In the second experiment of Topolinski and Sparenberg (2012), one group of participants turned a kitchen clock clockwise while another group turned it counterclockwise; participants then completed a questionnaire measuring openness to experience. Their data suggested that participants who had turned the clock clockwise reported greater openness to experience than those who had turned it counterclockwise (Topolinski & Sparenberg, 2012; but see Francis, 2013). Wagenmakers et al. (2015) replicated this study with a preregistered design, fixing the stopping rule before data collection began: data collection would stop once the Bayes factor in favor of either hypothesis reached 10, or once 50 participants per condition had been collected. The preregistration also specified the default prior for a one-sided t-test, a Cauchy distribution with γ = 1; for the one-sided test the prior places mass only on positive effects, so the alternative hypothesis is H+ : Cauchy(0, 1).
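To make such a stopping rule concrete, here is a rough, hypothetical simulation of a sequential design (this is not the authors’ analysis code; the simulated data, the bf10_two_sample_t helper, and the use of the two-sided default prior for simplicity are all assumptions): after every participant the default Bayes factor is recomputed, and sampling stops once BF10 ≥ 10, BF01 ≥ 10, or 50 participants per condition have been tested.

```python
import numpy as np
from scipy import stats, integrate

def bf10_two_sample_t(t, n1, n2, r=1.0):
    """Default Bayes factor BF10 for an independent-samples t-test:
    delta ~ Cauchy(0, r) under H1, delta = 0 under H0.
    BF10 is the ratio of the marginal likelihoods of the observed t statistic."""
    df = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)          # effective sample size for two groups
    integrand = lambda d: stats.nct.pdf(t, df, d * np.sqrt(n_eff)) * stats.cauchy.pdf(d, 0, r)
    m1, _ = integrate.quad(integrand, -np.inf, np.inf)   # marginal likelihood under H1
    return m1 / stats.t.pdf(t, df)                       # likelihood under H0

rng = np.random.default_rng(2015)
clockwise, counter = [], []              # hypothetical openness scores, one per participant

while len(clockwise) < 50:               # hard cap: 50 participants per condition
    clockwise.append(rng.normal(0.0, 1.0))
    counter.append(rng.normal(0.0, 1.0))
    if len(clockwise) < 5:
        continue                         # wait for a minimum sample before testing
    t_stat, _ = stats.ttest_ind(clockwise, counter)
    bf10 = bf10_two_sample_t(t_stat, len(clockwise), len(counter))
    if bf10 >= 10 or bf10 <= 1 / 10:     # BF10 >= 10 or BF01 >= 10: evidence threshold reached
        break

print(f"Stopped at n = {len(clockwise)} per condition, BF10 = {bf10:.2f}")
```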

$\gamma = \frac{1}{2}\sqrt{2} \approx 0.707$

JASP uses this same prior for the one-sided t-test as well. A smaller γ means that H1 resembles H0 more closely: the two hypotheses make similar predictions about the observed data, so it becomes harder to obtain strong evidence in favor of H0.
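The sketch below illustrates the one-sided default test (again an illustration under stated assumptions, not JASP’s internal implementation): the prior on δ is a Cauchy distribution truncated to positive values, and the loop at the end shows how shrinking the scale r (the γ above) makes H+ and H0 harder to tell apart, so that even data consistent with H0 yield weaker evidence in its favor.

```python
import numpy as np
from scipy import stats, integrate

def bf_plus_0(t, n1, n2, r=np.sqrt(2) / 2):
    """One-sided default Bayes factor BF+0: under H+ the effect size delta follows
    a Cauchy(0, r) prior truncated to positive values; under H0, delta = 0.
    A rough sketch; JASP's own implementation may differ numerically."""
    df = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)

    def integrand(delta):
        # truncating to delta > 0 doubles the prior density to keep it normalized
        return stats.nct.pdf(t, df, delta * np.sqrt(n_eff)) * 2 * stats.cauchy.pdf(delta, 0, r)

    m_plus, _ = integrate.quad(integrand, 0, np.inf)     # marginal likelihood under H+
    return m_plus / stats.t.pdf(t, df)                   # likelihood under H0

# With data close to H0's prediction (small t), shrinking r weakens the evidence for H0:
for r in (1.0, np.sqrt(2) / 2, 0.2):
    bf0_plus = 1 / bf_plus_0(0.3, 50, 50, r)
    print(f"r = {r:.3f}: BF0+ = {bf0_plus:.2f}")
```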

### 3.3 How to report Bayes factor results

“The Bayes factor was BF01 = 10.76, which means that the observed data are 10.76 times more likely under the null hypothesis (which assumes no effect) than under the alternative hypothesis (which assumes an effect). According to the classification proposed by Jeffreys (1961), this is strong evidence for the null hypothesis, i.e., people who turned the clock hands clockwise and people who turned them counterclockwise did not differ in their openness-to-experience (NEO) scores.”

### 4.2 Prospects for the application of the Bayes factor

The development of JASP has made the computation and interpretation of Bayes factors far more convenient: even researchers without a strong programming background can run Bayes factor analyses in JASP. This is likely to encourage wider use of the Bayes factor. Moreover, JASP itself is developing rapidly; the depth and breadth of its functionality keep expanding, and new methods and standards will continue to be integrated into the software, which may help researchers conduct their work more rigorously.

## References

Bahadur, R. R., & Bickel, P. J. (2009). An optimality property of Bayes' test statistics. Lecture Notes-Monograph Series, 57, 18-30.

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533, 452-454.

Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531-533.

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425.

Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101(4), 716-719.

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6-10.

Berger, J. O., & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76(2), 159-165.

Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 2(3), 317-335.

Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward, CA: Institute of Mathematical Statistics.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., … Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1-32.

Chambers, C. D., Feredoes, E., Muthukumaraswamy, S. D., & Etchells, P. J. (2014). Instead of “playing the game” it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond. AIMS Neuroscience, 1(1), 4-17.

Chen, X., Lu, B., & Yan, C.-G. (2018). Reproducibility of R-fMRI metrics on the impact of different strategies for multiple comparison correction and sample sizes. Human Brain Mapping, 39(1), 300-318.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

Depaoli, S., & van de Schoot, R. (2017). Improving transparency and replication in Bayesian statistics: The WAMBS-Checklist. Psychological Methods, 22(2), 240-261.

Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. London, UK: Palgrave Macmillan.

Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6(3), 274-290.

Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5, 781.

Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., … Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68-82.

Edwards, W. (1965). Tactical note on the relation between scientific and statistical hypotheses. Psychological Bulletin, 63(6), 400-402.

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3), 193-242.

Etz, A. (in press). Introduction to the concept of likelihood and its applications. Advances in Methods and Practices in Psychological Science.

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153-169.

Gallistel, C. R. (2009). The importance of proving the null. Psychological Review, 116(2), 439-453.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., … Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337-350.

Gronau, Q. F., & Wagenmakers, E.-J. (2017). Bayesian evidence accumulation in experimental mathematics: A case study of four irrational numbers. Experimental Mathematics, 1-10.

Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12(3), 179-185.

Hoijtink, H. (2011). Informative hypotheses: Theory and practice for behavioral and social scientists. Boca Raton, FL: Chapman & Hall/CRC.

Hoijtink, H., van Kooten, P., & Hulsker, K. (2016). Why Bayesian psychologists should change the way they use the Bayes factor. Multivariate Behavioral Research, 51(1), 2-10.

JASP Team. (2017). JASP (Version 0.8.2) [Computer software].

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Mathematical Proceedings of the Cambridge Philosophical Society, 31(2), 203-222.

Jeffreys, H. (1938). Significance tests when several degrees of freedom arise simultaneously. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 165(921), 161-198.

Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, UK: Oxford University Press.

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences of the United States of America, 110(48), 19313-19317.

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196-217.

Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Jr., Bahník, Š., Bernstein, M. J., … Nosek, B. A. (2014). Investigating variation in replicability: A “many labs” replication project. Social Psychology, 45(3), 142-152.

Klugkist, I., Laudy, O., & Hoijtink, H. (2005). Inequality constrained analysis of variance: A Bayesian approach. Psychological Methods, 10(4), 477-493.

Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). San Diego, CA: Academic Press/Elsevier.

Kruschke, J. K., & Liddell, T. M. (2017a). Bayesian data analysis for newcomers. Psychonomic Bulletin & Review, 1-23.

Kruschke, J. K., & Liddell, T. M. (2017b). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 1-29.

Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355-362.

Lindley, D. V. (1993). The analysis of experimental data: The appreciation of tea and wine. Teaching Statistics, 15(1), 22-25.

Lindsay, D. S. (2015). Replication in psychological science. Psychological Science, 26(12), 1827-1832.

Lunn, D., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28(25), 3049-3067.

Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2017). Replication Bayes factors from evidence updating.

Ly, A., Marsman, M., & Wagenmakers, E.-J. (2018). Analytic posteriors for Pearson’s correlation coefficient. Statistica Neerlandica, 72, 4-13.

Ly, A., Verhagen, J., & Wagenmakers, E.-J. (2016a). An evaluation of alternative methods for testing hypotheses, from the perspective of Harold Jeffreys. Journal of Mathematical Psychology, 72, 43-55.

Ly, A., Verhagen, J., & Wagenmakers, E.-J. (2016b). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology, 72, 19-32.

Marsman, M., & Wagenmakers, E.-J. (2017a). Bayesian benefits with JASP. European Journal of Developmental Psychology, 14(5), 545-555.

Marsman, M., & Wagenmakers, E.-J. (2017b). Three insights from a Bayesian interpretation of the one-sided P value. Educational and Psychological Measurement, 77(3), 529-539.

Masson, M. E. J. (2011). A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behavior Research Methods, 43(3), 679-690.

Matzke, D., Nieuwenhuis, S., van Rijn, H., Slagter, H. A., van der Molen, M. W., & Wagenmakers, E.-J. (2015). The effect of horizontal eye movements on free recall: A preregistered adversarial collaboration. Journal of Experimental Psychology: General, 144(1), e1-e15.

Miller, G. (2011). ESP paper rekindles discussion about statistics. Science, 331(6015), 272-273.

Morey R. D., Hoekstra R., Rouder J. N., Lee M. D., & Wagenmakers E.-J . ( 2016).

The fallacy of placing confidence in confidence intervals

Psychonomic Bulletin & Review, 23( 1), 103-123.

URL     PMID:26450628

Interval estimates – estimates of parameters that include an allowance for sampling uncertainty – have long been touted as a key component of statistical analyses. There are several kinds of interval estimates, but the most popular are confidence intervals (CIs): intervals that contain the true parameter value in some known proportion of repeated samples, on average. The width of confidence intervals is thought to index the precision of an estimate; CIs are thought to be a guide to which parameter values are plausible or reasonable; and the confidence coefficient of the interval (e.g., 95 %) is thought to index the plausibility that the true parameter is included in the interval. We show in a number of examples that CIs do not necessarily have any of these properties, and can lead to unjustified or arbitrary inferences. For this reason, we caution against relying upon confidence interval theory to justify interval estimates, and suggest that other theories of interval estimation should be used instead.

Morey,R. D., & Rouder, J. N . ( 2011).

Bayes factor approaches for testing interval null hypotheses

Psychological Methods, 16( 4), 406-419.

URL     PMID:21787084

Psychological theories are statements of constraint. The role of hypothesis testing in psychology is to test whether specific theoretical constraints hold in data. Bayesian statistics is well suited to the task of finding supporting evidence for constraint, because it allows for comparing evidence for 2 hypotheses against each another. One issue in hypothesis testing is that constraints may hold only approximately rather than exactly, and the reason for small deviations may be trivial or uninteresting. In the large-sample limit, these uninteresting, small deviations lead to the rejection of a useful constraint. In this article, we develop several Bayes factor 1-sample tests for the assessment of approximate equality and ordinal constraints. In these tests, the null hypothesis covers a small interval of non-0 but negligible effect sizes around 0. These Bayes factors are alternatives to previously developed Bayes factors, which do not allow for interval null hypotheses, and may especially prove useful to researchers who use statistical equivalence testing. To facilitate adoption of these Bayes factor tests, we provide easy-to-use software.

Mulder J., Klugkist I., van de Schoot R., Meeus W. H. J., Selfhout M., & Hoijtink H . ( 2009).

Bayesian model selection of informative hypotheses for repeated measurements

Journal of Mathematical Psychology, 53( 6), 530-546.

When analyzing repeated measurements data, researchers often have expectations about the relations between the measurement means. The expectations can often be formalized using equality and inequality constraints between (i) the measurement means over time, (ii) the measurement means between groups, (iii) the means adjusted for time-invariant covariates, and (iv) the means adjusted for time-varying covariates. The result is a set of informative hypotheses. In this paper, the Bayes factor is used to determine which hypothesis receives most support from the data. A pivotal element in the Bayesian framework is the specification of the prior. To avoid subjective prior specification, training data in combination with restrictions on the measurement means are used to obtain so-called constrained posterior priors. A simulation study and an empirical example from developmental psychology show that this prior results in Bayes factors with desirable properties.

Munafò M. R., Nosek B. A., Bishop D. V. M., Button K. S., Chambers C. D., Percie du Sert N., … Ioannidis, J. P. A.(2017).

A manifesto for reproducible science

Nature Human Behaviour, 1( 1), 0021.

Improving the reliability and efficiency of scientific research will increase the credibility of the published scientific literature and accelerate discovery. Here we argue for the adoption of measures to optimize key elements of the scientific process: methods, reporting and dissemination, reproducibility, evaluation and incentives. There is some evidence from both simulations and empirical studies supporting the likely effectiveness of these measures, but their broad adoption by researchers, institutions, funders and journals will require iterative evaluation and improvement. We discuss the goals of these measures, and how they can be implemented, in the hope that this will facilitate action toward improving the transparency, reproducibility and efficiency of scientific research.

Nosek B. A., Alter G., Banks G. C., Borsboom D., Bowman S. D., Breckler S. J., … Yarkoni T . ( 2015).

Promoting an open research culture

Science, 348( 6242), 1422-1425.

Nosek B. A., Spies J. R., & Motyl M . ( 2012).

Scientific Utopia: II. Restructuring incentives and practices to promote truth over publishability

Perspectives on Psychological Science, 7( 6), 615-631.

Open Science Collaboration. ( 2015).

Estimating the reproducibility of psychological science

Science, 349(6251), aac4716.

Plummer, M.(2003).

JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling

Paper presented at the Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003).

Poldrack R. A., Baker C. I., Durnez J., Gorgolewski K. J., Matthews P. M., Munafò M. R., … Yarkoni T . ( 2017).

Scanning the horizon: Towards transparent and reproducible neuroimaging research

Nature Reviews Neuroscience, 18( 2), 115-126.

URL     PMID:28053326

Functional neuroimaging techniques have transformed our ability to probe the neurobiological basis of behaviour and are increasingly being applied by the wider neuroscience community. However, concerns have recently been raised that the conclusions that are drawn from some human neuroimaging studies are either spurious or not generalizable. Problems such as low statistical power, flexibility in data analysis, software errors and a lack of direct replication apply to many fields, but perhaps particularly to functional MRI. Here, we discuss these problems, outline current and suggested best practices, and describe how we think the field should evolve to produce the most meaningful and reliable answers to neuroscientific questions.

Poldrack,R. A., & Gorgolewski, K. J . ( 2017).

OpenfMRI: Open sharing of task fMRI data

NeuroImage, 144, 259-261.

URL     PMID:4669234

OpenfMRI is a repository for the open sharing of task-based fMRI data. Here we outline its goals, architecture, and current status of the repository, as well as outlining future plans for the project.

Rouder,J. N . ( 2014).

Optional stopping: No problem for Bayesians

Psychonomic Bulletin & Review, 21( 2), 301-308.

URL     PMID:24659049

Abstract Optional stopping refers to the practice of peeking at data and then, based on the results, deciding whether or not to continue an experiment. In the context of ordinary significance-testing analysis, optional stopping is discouraged, because it necessarily leads to increased type I error rates over nominal values. This article addresses whether optional stopping is problematic for Bayesian inference with Bayes factors. Statisticians who developed Bayesian methods thought not, but this wisdom has been challenged by recent simulation results of Yu, Sprenger, Thomas, and Dougherty (2013) and Sanborn and Hills (2013). In this article, I show through simulation that the interpretation of Bayesian quantities does not depend on the stopping rule. Researchers using Bayesian methods may employ optional stopping in their own research and may provide Bayesian analysis of secondary data regardless of the employed stopping rule. I emphasize here the proper interpretation of Bayesian quantities as measures of subjective belief on theoretical positions, the difference between frequentist and Bayesian interpretations, and the difficulty of using frequentist intuition to conceptualize the Bayesian approach.

Rouder,J. N., & Morey, R. D . ( 2011).

A Bayes factor meta-analysis of Bem’s ESP claim

Psychonomic Bulletin & Review, 18( 4), 682-689.

URL     PMID:21573926

Abstract In recent years, statisticians and psychologists have provided the critique that p-values do not capture the evidence afforded by data and are, consequently, ill suited for analysis in scientific endeavors. The issue is particular salient in the assessment of the recent evidence provided for ESP by Bem (2011) in the mainstream Journal of Personality and Social Psychology. Wagenmakers, Wetzels, Borsboom, and van der Maas (Journal of Personality and Social Psychology, 100, 426-432, 2011) have provided an alternative Bayes factor assessment of Bem's data, but their assessment was limited to examining each experiment in isolation. We show here that the variant of the Bayes factor employed by Wagenmakers et al. is inappropriate for making assessments across multiple experiments, and cannot be used to gain an accurate assessment of the total evidence in Bem's data. We develop a meta-analytic Bayes factor that describes how researchers should update their prior beliefs about the odds of hypotheses in light of data across several experiments. We find that the evidence that people can feel the future with neutral and erotic stimuli to be slight, with Bayes factors of 3.23 and 1.57, respectively. There is some evidence, however, for the hypothesis that people can feel the future with emotionally valenced nonerotic stimuli, with a Bayes factor of about 40. Although this value is certainly noteworthy, we believe it is orders of magnitude lower than what is required to overcome appropriate skepticism of ESP.

Rouder J. N., Morey R. D., Speckman P. L., & Province J. M . ( 2012).

Default Bayes factors for ANOVA designs

Journal of Mathematical Psychology, 56( 5), 356-374.

Rouder J. N., Morey R. D., Verhagen J., Swagman A. R., & Wagenmakers E.-J . ( 2017).

Bayesian analysis of factorial designs

Psychological Methods, 22( 2), 304-321.

URL     PMID:27280448

Abstract This article provides a Bayes factor approach to multiway analysis of variance (ANOVA) that allows researchers to state graded evidence for effects or invariances as determined by the data. ANOVA is conceptualized as a hierarchical model where levels are clustered within factors. The development is comprehensive in that it includes Bayes factors for fixed and random effects and for within-subjects, between-subjects, and mixed designs. Different model construction and comparison strategies are discussed, and an example is provided. We show how Bayes factors may be computed with BayesFactor package in R and with the JASP statistical package. (PsycINFO Database Record

Rouder J. N., Speckman P. L., Sun D. C., Morey R. D., & Iverson G . ( 2009).

Bayesian t tests for accepting and rejecting the null hypothesis

Psychonomic Bulletin & Review, 16( 2), 225-237.

Salsburg, D.(2001).

The lady tasting tea: How statistics revolutionized science in the twentieth century

New York, NY: W. H. Freeman and Company.

Salvatier J., Wiecki T. V., & Fonnesbeck C . ( 2016).

Probabilistic programming in Python using PyMC3

Peer J Computer Science, 2, e55.

Probabilistic programming (PP) allows flexible specification of Bayesian statistical models in code. PyMC3 is a new, open-source PP framework with an intutive and readable, yet powerful, syntax that is close to the natural syntax statisticians use to describe models. It features next-generation Markov chain Monte Carlo (MCMC) sampling algorithms such as the No-U-Turn Sampler (NUTS; Hoffman, 2014), a self-tuning variant of Hamiltonian Monte Carlo (HMC; Duane, 1987). Probabilistic programming in Python confers a number of advantages including multi-platform compatibility, an expressive yet clean and readable syntax, easy integration with other scientific libraries, and extensibility via C, C++, Fortran or Cython. These features make it relatively straightforward to write and use custom statistical distributions, samplers and transformation functions, as required by Bayesian analysis.

Schervish,M. J . ( 1996).

P values: What they are and what they are not

The American Statistician, 50( 3), 203-206.

P values (or significance probabilities) have been used in place of hypothesis tests as a means of giving more information about the relationship between the data and the hypothesis than does a simple reject/do not reject decision. Virtually all elementary statistics texts cover the calculation of P values for one-sided and point-null hypotheses concerning the mean of a sample from a normal distribution. There is, however, a third case that is intermediate to the one-sided and point-null cases, namely the interval hypothesis, that receives no coverage in elementary texts. We show that P values are continuous functions of the hypothesis for fixed data. This allows a unified treatment of all three types of hypothesis testing problems. It also leads to the discovery that a common informal use of P values as measures of support or evidence for hypotheses has serious logical flaws.

Schlaifer, R.,& Raiffa, H.(1961).

Applied statistical decision theory

Boston: Harvard University.

Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods, 22(2), 322–339.

Scott, J. G., & Berger, J. O. (2006). An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136(7), 2144–2162.

Scott, J. G., & Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5), 2587–2619.

Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62–71.

Stephens, M., & Balding, D. J. (2009). Bayesian statistical methods for genetic association studies. Nature Reviews Genetics, 10(10), 681–690.

Stulp, G., Buunk, A. P., Verhulst, S., & Pollet, T. V. (2013). Tall claims? Sense and nonsense about the importance of height of US presidents. The Leadership Quarterly, 24(1), 159–171.

Topolinski, S., & Sparenberg, P. (2012). Turning the hands of time. Social Psychological and Personality Science, 3(3), 308–314.

van de Schoot, R., Winter, S., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017). A systematic review of Bayesian papers in psychology: The last 25 years. Psychological Methods, 22(2), 217–239.

Vanpaemel, W. (2010). Prior sensitivity in theory testing: An apologia for the Bayes factor. Journal of Mathematical Psychology, 54(6), 491–498.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.

Wagenmakers, E.-J., Beek, T. F., Rotteveel, M., Gierholz, A., Matzke, D., Steingroever, H., … Pinto, Y. (2015). Turning the hands of time again: A purely confirmatory replication study and a Bayesian analysis. Frontiers in Psychology, 6, 494.

Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60(3), 158–189.

Wagenmakers, E.-J., Love, J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., … van Doorn, J. (2017). Bayesian inference for psychology. Part II: Example applications with JASP. Psychonomic Bulletin & Review, 1–19.

Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., … Morey, R. D. (2017). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychonomic Bulletin & Review, 1–23.

Wagenmakers, E.-J., Verhagen, J., Ly, A., Matzke, D., Steingroever, H., Rouder, J. N., & Morey, R. D. (2017). The need for Bayesian hypothesis testing in psychological science. In S. O. Lilienfeld & I. D. Waldman (Eds.), Psychological science under scrutiny (pp. 123–138). Chichester: John Wiley & Sons.

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432.

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638.

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133.

Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6(3), 291–298.

Zhu, J., Chen, J. F., Hu, W. B., & Zhang, B. (2017). Big learning with Bayesian methods. National Science Review, 4(4), 627–651.

Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance. Ann Arbor: University of Michigan Press.

Zuo, X.-N., Anderson, J. S., Bellec, P., Birn, R. M., Biswal, B. B., Blautzik, J., … Milham, M. P. (2014). An open science resource for establishing reliability and reproducibility in functional connectomics. Scientific Data, 1, 140049.

Zuo, X.-N., & Xing, X.-X. (2014). Test-retest reliabilities of resting-state FMRI measurements in human brain functional connectomics: A systems neuroscience perspective. Neuroscience & Biobehavioral Reviews, 45, 100–118.
