The null hypothesis significance test and the dichotomization of the p-value: Errare Humanum Est
DOI:
https://doi.org/10.17843/rpmesp.2024.414.14285
Keywords:
Statistical Analysis, Hypothesis-Testing, Biostatistics, Epidemiology and Biostatistics, statistics & numerical data
Abstract
Decision-making in healthcare is complex and must be based on the best scientific evidence. Information derived from the statistical analysis of data is crucial to this process, and that analysis can be conducted from either a frequentist or a Bayesian perspective. Within the frequentist framework, the null hypothesis significance test (NHST), together with its p-value, is one of the most widely used techniques across disciplines. However, NHST has been criticized from several academic standpoints and has come to be regarded as one of the causes of the so-called replicability crisis in science. In this review article, we give a brief historical account of its development, summarize the underlying methods, describe some controversies and limitations, address its misuse and misinterpretation, and close with considerations on its scope and some reflections in the context of biomedical research.
License
Copyright (c) 2024 Edward Mezones-Holguin, Ali Al-kassab-Córdova, Percy Soto-Becerra, Sonia Hernández-Díaz, Jay S. Kaufman
This work is licensed under a Creative Commons Attribution 4.0 International License.