Interpreting nonsignificant results: A quantitative investigation based on 500 Chinese psychological studies

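Nonsignificant results (e.g., p > 0.05) are very common in psychological research and are easily misinterpreted as evidence for accepting the null hypothesis. Such misinterpretation may lead to erroneous inferences in matched-group studies or to overlooking real effects masked by nonsignificant results from small samples. However, no empirical study in China has yet examined the prevalence of nonsignificant results or how they are interpreted. The present study surveyed 500 empirical studies in Chinese-language psychology, counted how often negative statements related to nonsignificant results appeared in their abstracts, judged and tallied the accuracy of the inferences based on those negative statements, and used Bayes factors to re-evaluate the nonsignificant results that reported t values. The results showed that 36% of abstracts mentioned nonsignificant results, containing 236 negative statements in total. Of these negative statements, 41% misinterpreted the nonsignificant results (e.g., as supporting the null hypothesis). Bayes factor analysis of the studies reporting t values showed that only 5.1% of nonsignificant results provided strong evidence for the null hypothesis (BF01 > 10). Compared with a previous survey of international psychology journals (32% of abstracts contained negative statements; 72% of negative statements misinterpreted nonsignificant results), Chinese psychology journals reported nonsignificant results more often and misinterpreted them less often. Nevertheless, researchers in China still need to strengthen their understanding of nonsignificant results and to promote statistical methods suited to evaluating them.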

Key words: nonsignificant results, null-hypothesis significance testing, Bayes factors, meta-research


Background: The p-value is the most widely used statistical index for inference in science. A p-value greater than 0.05, i.e., a nonsignificant result, however, cannot distinguish between two cases: absence of evidence and evidence of absence. Unfortunately, researchers in psychological science may not interpret p-values correctly, which leads to incorrect inferences. For example, Aczel et al. (2018), after surveying 412 empirical studies published in Psychonomic Bulletin & Review, Journal of Experimental Psychology: General, and Psychological Science, found that about 72% of nonsignificant results were misinterpreted as evidence in favor of the null hypothesis. Misinterpretations of nonsignificant results may have severe consequences. One such consequence is missing potentially meaningful effects. In addition, in matched-group clinical trials, misinterpreting nonsignificant results may produce falsely “matched” groups, thus threatening the validity of interventions. So far, how nonsignificant results are interpreted in the Chinese psychological literature is unknown. Here we surveyed 500 empirical papers published in five mainstream Chinese psychological journals to address the following questions: (1) how often are nonsignificant results reported; (2) how do researchers interpret nonsignificant results in these published studies; (3) if researchers interpreted nonsignificant results as “evidence for absence,” do the empirical data provide enough evidence for null effects?
Method: Following our pre-registration, we first randomly selected 500 empirical papers from all papers published in 2017 and 2018 in five mainstream Chinese psychological journals (Acta Psychologica Sinica, Psychological Science, Chinese Journal of Clinical Psychology, Psychological Development and Education, Psychological and Behavioral Studies). Second, we screened the abstracts of the selected articles to check whether they contained negative statements. For studies whose abstracts contained negative statements, we searched for the nonsignificant statistics in their results sections and checked whether the corresponding interpretations were correct. More specifically, all such statements were classified into four categories (Correct-frequentist, Incorrect-frequentist: whole population, Incorrect-frequentist: current sample, Difficult to judge). Finally, we calculated Bayes factors based on the available t values and sample sizes associated with those nonsignificant results. The Bayes factors help us estimate to what extent those results provided evidence for the absence of effects (i.e., the way researchers incorrectly interpreted nonsignificant results).
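To illustrate the last step of the method, the following is a minimal sketch of how a Bayes factor BF01 can be obtained from a reported t value and sample size. It assumes one common choice, the JZS Bayes factor with a Cauchy prior on the standardized effect size; the abstract does not state which Bayes factor or prior the authors used, so the function name `jzs_bf01`, the prior scale `r`, and the numerical integration scheme are all illustrative assumptions rather than the authors' procedure.

```python
import math


def jzs_bf01(t, n1, n2=None, r=math.sqrt(2) / 2, steps=4000):
    """Approximate the JZS Bayes factor BF01 (null over alternative) from a t value.

    One-sample or paired test: pass n1 only.
    Independent-samples test: pass both n1 and n2.
    r is the scale of the Cauchy prior on effect size (an assumed default).
    """
    if n2 is None:
        n_eff, df = n1, n1 - 1
    else:
        n_eff, df = n1 * n2 / (n1 + n2), n1 + n2 - 2

    # Marginal likelihood under H0 (up to a constant shared with H1).
    null_like = (1 + t * t / df) ** (-(df + 1) / 2)

    # Under H1 we integrate over g in (0, inf). The substitution
    # g = u / (1 - u) maps the range onto (0, 1), where the integrand
    # is smooth and bounded.
    def integrand(u):
        if u == 0.0:
            return 0.0  # limit of the integrand as g -> 0
        a = (1 - u) + n_eff * u  # equals (1 + n_eff * g) * (1 - u)
        like = (1 + t * t * (1 - u) / (a * df)) ** (-(df + 1) / 2)
        prior = r / math.sqrt(2 * math.pi) * math.exp(-r * r * (1 - u) / (2 * u))
        return a ** -0.5 * u ** -1.5 * like * prior

    # Composite Simpson's rule on [0, 1]; steps must be even.
    h = 1.0 / steps
    total = integrand(0.0) + integrand(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    alt_like = total * h / 3

    return null_like / alt_like


if __name__ == "__main__":
    # Hypothetical nonsignificant result: t = 0.5 from two groups of 30.
    bf01 = jzs_bf01(0.5, 30, 30)
    label = "strong" if bf01 > 10 else "not strong"
    print(f"BF01 = {bf01:.2f} ({label} evidence for the null)")
```

Under this sketch, a nonsignificant t value would count as strong evidence for the absence of an effect only when the returned BF01 exceeds 10, the threshold used in the abstract. The t value and group sizes in the usage example are hypothetical and are not taken from the surveyed papers.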
Results: Our survey revealed that: (1) out of 500 empirical papers, 36% of the abstracts (n = 180) contained negative statements; (2) there were 236 negative statements associated with nonsignificant statistics in those selected studies, and 41% of these 236 negative statements misinterpreted nonsignificant results, i.e., the authors inferred that the results provided evidence for the absence of effects; (3) Bayes factor analyses based on the available t values and sample sizes found that only 5.1% (n = 2) of the nonsignificant results provided strong evidence for the absence of effects (BF01 > 10). Compared with the results of Aczel et al. (2018), we found that empirical papers published in Chinese journals contained more negative statements (36% vs. 32%), and that researchers made fewer misinterpretations of nonsignificant results (41% vs. 72%). It is worth noting, however, that some interpretations of nonsignificant results in the Chinese context are ambiguous and therefore hard to categorize. More specifically, many statements corresponding to nonsignificant results took the form “there is no significant difference between condition A and condition B”. Such statements can be understood either as “the difference is not statistically significant”, which is correct, or as “there is no difference”, which is incorrect. The percentage of misinterpretations of nonsignificant results rose to 64% if we adopted the second reading of these statements, compared with 41% under the first reading.
Conclusion: Our results suggest that Chinese researchers need to improve their understanding of nonsignificant results and use more appropriate statistical methods to extract information from nonsignificant results. In addition, more precise wording should be used in the Chinese context.

Key words: nonsignificant results, null-hypothesis significance testing, Bayes factors, meta-research