Year: 2020 | Volume: 15 | Issue: 1 | Page: 39-45
Pitfalls in statistical analysis – A Reviewers' perspective
Sakir Ahmed1, Aadhaar Dhooria2
1 Department of Clinical Immunology and Rheumatology, Kalinga Institute of Medical Sciences, KIIT University, Bhubaneswar, Odisha, India
2 Department of Rheumatology, Santokba Durlabhji Memorial Hospital, Jaipur, Rajasthan, India
Date of Web Publication: 30-Mar-2020
Dr. Sakir Ahmed
Department of Clinical Immunology and Rheumatology, Kalinga Institute of Medical Sciences, KIIT University, Bhubaneswar - 751 024, Odisha
Source of Support: None, Conflict of Interest: None
Statistics are a quintessential part of scientific manuscripts. Few journals are free of statistics-related errors. Errors can occur in the reporting and presentation of data, in choosing the appropriate or the most powerful statistical test, in the misinterpretation or overinterpretation of statistics, and in ignoring tests of normality. The statistical software used, one-tailed versus two-tailed tests, and the exclusion or inclusion of outliers can all influence outcomes and should be explicitly mentioned. This review presents the corresponding nonparametric tests for common parametric tests, popular misinterpretations of the P value, and usual nuances in data reporting. The importance of distinguishing clinical significance from statistical significance using confidence intervals, number needed to treat, and minimal clinically important difference is highlighted. The problem of multiple comparisons may lead to false interpretations, especially in p-hacking, when nonsignificant comparisons are concealed. The review also touches upon a few advanced topics such as heteroscedasticity and multicollinearity in multivariate analyses. Journals have various strategies to minimize inaccuracies, but it is invaluable for authors and reviewers to have a good grasp of statistical concepts. Furthermore, it is imperative for the reader to understand these concepts to properly interpret studies and judge the validity of the conclusions independently.
Keywords: Biostatistics, common errors, manuscript writing, peer review, reviewer
How to cite this article: Ahmed S, Dhooria A. Pitfalls in statistical analysis – A Reviewers' perspective. Indian J Rheumatol 2020;15:39-45
Introduction
Measurements are required to study the natural world scientifically, and statistics is the discipline concerned with the proper elucidation of such measurements. The collection of data and its analysis to extract “intelligence” is the core function of statistics. Statistics continues to play an ever-increasing and indispensable role in biomedical research. However, the published literature is not free from errors involving statistics. These errors may relate to the presentation of data, the application of appropriate statistical tests, or their interpretation.
One classical example of the misinformation that can arise out of improper use of statistics is the dead salmon experiment. The authors used a dead salmon and showed how its brain would “light up” on a positron emission tomography (PET) scan when “shown” certain photographs. This was an intentional experiment to show how software applying the wrong statistics can produce incorrect and even impossible output (explained later in this manuscript). However, most statistical errors are unintentional.
Esteemed journals such as those of the Nature and British Medical Journal (BMJ) groups have had their share of errors. Leading Chinese medical journals have been shown to carry errors, such as improper handling of multiple correlation testing, over the last decade. Indian journals too, especially those reporting negative studies, grossly underreport power calculations, sample size calculations, sampling methods, adjustments, or even the names of the statistical tests used. However, the majority of statistical errors in the literature involve basic concepts and may be easily avoided. Currently, the gatekeepers against statistical errors are the reviewers and editors of journals. However, the peer-review process has its own imperfections. One is that statistical expertise is not uniform across authors or reviewers. Another may be the easy availability of numerous statistical software packages and their extensive use even by unversed researchers.
In our capacity as reviewers for a few journals, and with experience in handling manuscripts at the editorial level, we have come across several such errors. Though such errors percolate the literature in all disciplines, data on such errors in rheumatology manuscripts are limited. In the present manuscript, we share some of the common pitfalls and errors in statistical analysis, interpretation, and reporting.
Strategy
A SCOPUS search (that includes MEDLINE and PubMed Central) was carried out for “statistics” and “statistical errors” limited to medical journals as per recommendations for biomedical reviews. Titles and abstracts were visually screened to identify articles of interest. In addition, further articles were mined from the bibliographies of these articles, wherever necessary.
General Errors
This section contains basic concepts. It is important for the author, reader, and reviewer to be aware of these general errors for the sake of scientific propriety.
Parametric versus nonparametric data
Parametric tests are more powerful in the sense that they require smaller sample sizes or smaller effect sizes to pick up significant differences. They are often better known. However, these tests are based on certain assumptions about the distribution of the data. Nonparametric tests are to be used when these assumptions of parametric tests are not met by the data.
One of the basic assumptions for parametric tests is that the data follow a known “distribution,” such as the “normal” (also called Gaussian) distribution [Figure 1]. Parametric data have to be continuous – either interval or ratio; thus, discrete data cannot be parametric. The central limit theorem implies that, as the sample size increases, the distribution of the sample mean tends toward normality. Thus, with large numbers, it makes more sense to use parametric tests, as these are more powerful. Nonparametric tests are around 95% as powerful as parametric tests and are better suited for smaller numbers.
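The central limit theorem mentioned above can be illustrated with a minimal sketch in Python (standard library only; the distribution and sample sizes are illustrative): even when individual observations come from a strongly right-skewed distribution, the means of repeated samples cluster around the population mean in a near-normal fashion.

```python
import random
import statistics

random.seed(42)  # for reproducibility

# Draw 1000 samples of size 50 from a right-skewed exponential
# distribution whose population mean is 1. Single observations are far
# from normally distributed, but the sample means are not.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(50))
    for _ in range(1000)
]

# The sample means concentrate near 1 with spread close to 1/sqrt(50).
print(round(statistics.mean(sample_means), 2))   # close to 1.0
print(round(statistics.stdev(sample_means), 2))  # close to 0.14
```

Plotting a histogram of `sample_means` would show the familiar bell shape, despite the skewness of the underlying data.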
Figure 1: The normal or Gaussian distribution (a) has a bell-shaped, laterally symmetrical curve. Real-life data may have a flattened peak, that is, platykurtic (b), or a very pointed peak, that is, leptokurtic (c). Similarly, the symmetry may be lost when too many observations are in the lower segments, that is, positive or right skewed (d), or when too many observations are in the higher segments, that is, negative or left skewed (e)
Before applying other tests, it is always better to check the data for normality. The Shapiro–Wilk test is possibly the most powerful test for ascertaining normality and is frequently used. However, no statistical test is infallible. The authors should always prepare a histogram of their data and visually inspect (“eyeball”) it to see whether it follows a normal distribution (the bell curve; [Figure 1]). If the number of samples is more than 30 and these conditions are met, parametric tests can be safely used. If not, the authors can use the nonparametric equivalent of the test [Table 1].
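Formal tests such as Shapiro–Wilk require a statistics package (e.g., `scipy.stats.shapiro`), but a crude symmetry check can accompany the recommended “eyeballing.” The sketch below (standard library only; `skewness` is an illustrative helper, and the datasets are toy examples) computes the sample skewness, which is near zero for symmetric data and clearly positive for right-skewed data.

```python
import statistics

def skewness(data):
    """Sample skewness: ~0 for symmetric data, >0 for right-skewed data."""
    n = len(data)
    mean = statistics.mean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # bell-like histogram
right_skewed = [1, 1, 1, 2, 2, 3, 10]    # long right tail

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # clearly positive
```

A histogram should still be drawn; a single summary number is no substitute for looking at the data.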
Table 1: Corresponding parametric and nonparametric tests. Note that these are only for continuous data. For discrete data, nonparametric tests or tests of proportion (Fisher's exact test, χ2 test) are used
One-tailed and two-tailed tests
Often, submitted articles do not mention whether one-tailed or two-tailed tests were used. One-tailed tests examine whether the central tendency of a particular sample A lies in one specified direction relative to that of another sample B (say, smaller). Two-tailed tests inspect whether the central tendency of sample A is different (either smaller or larger) from that of sample B. Authors must take care to state explicitly whether they have used one-tailed or two-tailed tests. If not specified, it is presumed that two-tailed tests have been used. While it may seem logical that the P value of a two-tailed test would be double that of a one-tailed test, this holds true only for perfectly symmetric distributions such as the normal.
A two-tailed test is valid when a difference in either direction is at stake, whereas a one-tailed test is appropriate for ascertaining superiority or inferiority alone. Let us take an example. The height of boys in a high school is compared with the height of girls in the same school. To test whether the average height of boys differs from the average height of girls requires a two-tailed test. Here, the alternative hypothesis (“average heights are unequal”) has two possibilities: “average height of boys is less than that of girls” and “average height of boys is more than that of girls” (the possibilities lie in opposite directions; hence, there are two “tails”).
However, to conclude that boys are taller than girls requires a one-tailed test. Here, the null hypothesis (“average height of boys is not more than that of girls”) occupies only one direction (both “average heights are equal” and “average height of girls is more than that of boys” lie on the same side, that is, within the same “tail”), leaving a single tail for the alternative.
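The doubling relationship for symmetric distributions can be sketched for a simple z-test using only the standard library (`z = 1.96` is a hypothetical test statistic, not from the height example):

```python
from statistics import NormalDist

def one_tailed_p(z):
    """P(Z >= z): probability in the single upper tail."""
    return 1 - NormalDist().cdf(z)

def two_tailed_p(z):
    """P(|Z| >= |z|): probability in both tails combined."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

z = 1.96  # hypothetical standardized test statistic
print(round(one_tailed_p(z), 3))  # 0.025
print(round(two_tailed_p(z), 3))  # 0.05
```

The same observed statistic that is “significant” one-tailed at P = 0.025 sits exactly at the conventional two-tailed threshold of 0.05, which is why the choice of tails must be declared before the analysis, not after.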
Confidence intervals
To find a “parameter” in a population, we take a sample from the population. This sample gives us an “estimate” of the “parameter.” The confidence interval (CI) provides a range of values in which the unknown population parameter being studied is likely to be contained. Let us try to understand this with an example. We wish to find out the mean weight of neonates in Delhi. As the population is too large, a random sample of 500 neonates is taken. The mean weight is calculated as 3000 g with a standard deviation of 250 g. The standard error of the mean is 250/√500 ≈ 11.2 g, so the 95% CI is approximately 2978–3022 g (mean ± 1.96 × standard error). It means that there is a very high (95%) confidence that the actual mean weight of neonates in Delhi lies between 2978 and 3022 g.
Thus, the CI is much more informative than the mean value (3000 g) of the sample. CIs should be reported wherever possible, especially while measuring effect sizes in trials. At the same time, it is also important to understand that for a given single value, it is an “all or none” phenomenon, that is, it is either correct or false and not 95% correct.
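A common slip in such examples is to compute mean ± 1.96 × SD (here, roughly 2500–3500 g), which describes where ~95% of individual neonates' weights fall, rather than the CI of the mean, which uses the standard error SD/√n and narrows as the sample grows. A minimal sketch using the figures above:

```python
import math

mean, sd, n = 3000, 250, 500  # values from the neonate example

# 95% CI of the mean: based on the standard error, not the raw SD
se = sd / math.sqrt(n)
ci_low = mean - 1.96 * se
ci_high = mean + 1.96 * se
print(round(se, 1))                    # 11.2
print(round(ci_low), round(ci_high))   # 2978 3022

# Contrast: mean +/- 1.96*SD is a reference range for individuals,
# not a statement about the precision of the estimated mean.
ref_low, ref_high = mean - 1.96 * sd, mean + 1.96 * sd
print(round(ref_low), round(ref_high))  # 2510 3490
```

Confusing the two intervals makes the estimate look far less precise than it actually is.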
Interpretation of P values
No single statistical parameter has been misinterpreted as frequently as the P value. A P value provides the probability of obtaining the observed test result (or one more extreme) given that the null (or “test”) hypothesis is true. Authors tend to attribute much more to the P value than what it actually represents. [Table 2] lists what the P value does not represent. Overemphasis on P values often leads to multiple comparisons and then selective reporting of only significant values, a practice known as p-hacking. P-hacking is an ethical issue if done knowingly, and a major nuisance even if unintentional. It is widely prevalent in biomedical journals.
Let us consider an example: many studies record liver function tests (LFTs) and complete blood counts of patients at least once. Suppose a researcher looked at six components of the LFT and correlated each with the cell counts of lymphocytes, neutrophils, monocytes, and eosinophils. (S)he has made 6 × 4 = 24 comparisons. If the level of significance is P < 0.05, in one out of 20 (5%) cases, (s)he can be expected to get a “false-positive” low P value. Suppose (s)he found that the monocyte count correlated with the gamma-glutamyl transferase (GGT) level and wrote an entire manuscript based on the possible relationship of GGT and monocyte count: (s)he would be committing p-hacking. (S)he has made 24 comparisons, but in the manuscript, (s)he shows only the one comparison that (s)he knows gives P < 0.05. The reader is unaware that multiple comparisons have been made and will not look for a Bonferroni (or similar) correction of the level of significance.
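The scale of the problem in this 24-comparison scenario can be quantified directly. Assuming the comparisons are independent and all null hypotheses are true, the chance of at least one spurious “significant” result is far higher than 5%:

```python
alpha, n_tests = 0.05, 24  # the LFT-versus-cell-count example above

# Probability that at least one of 24 truly null comparisons comes out
# "significant" purely by chance (assuming independent tests).
family_wise_error = 1 - (1 - alpha) ** n_tests
print(round(family_wise_error, 2))  # 0.71

# Bonferroni-corrected threshold each individual comparison should meet.
bonferroni_alpha = alpha / n_tests
print(round(bonferroni_alpha, 4))  # 0.0021
```

In other words, there is roughly a 71% chance of finding at least one “publishable” correlation in this design even when nothing real is going on.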
Another type of p-hacking is leaving out a few samples or fabricating a few new samples (say by duplication) such that a significance level is reached.
Clinical significance versus statistical significance
Statistical significance does not necessarily imply clinical significance. For example, the use of a particular intervention may reduce the incidence of a particular disease, say, by 75%, a result that was statistically significant in a large sampled cohort. However, if the initial incidence of the disease was 0.04% of the population, and the use of the intervention brings it down to 0.01%, is it justified to apply it? Here, the number needed to treat (NNT) may provide a better answer.
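The NNT for this example follows directly from the absolute risk reduction (NNT = 1/ARR, conventionally rounded up). A short sketch with the figures from the text:

```python
import math

baseline_risk = 0.0004  # 0.04% incidence without the intervention
treated_risk = 0.0001   # 0.01% incidence with the intervention

arr = baseline_risk - treated_risk  # absolute risk reduction
nnt = math.ceil(1 / arr)            # number needed to treat, rounded up
print(nnt)  # 3334

relative_risk_reduction = arr / baseline_risk
print(round(relative_risk_reduction, 2))  # 0.75
```

The impressive-sounding 75% relative reduction corresponds to treating over 3000 people to prevent a single case, which puts the clinical significance in perspective.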
Effect size (measured as odds ratio or relative risk) with CIs helps to assess clinical significance. Again, effect sizes are subject to interpretation and thus, bias. One way to minimize such bias is to have predefined “minimal clinically important differences.”
Reporting Errors
Reporting errors are often found in manuscripts sent for review. Commonly, some manuscripts provide means and standard deviations for nonparametric data, while others provide both means and medians. As mentioned in [Table 1], the mean is not a reliable measure of central tendency, and the standard deviation is not an appropriate measure of variance, in the case of nonparametric data. For such data, the median (with interquartile range) is more appropriate.
Information such as age or sex distribution can easily be mentioned in one or two sentences. Elaborate tables are rarely necessary, and pie charts add little beyond what is already provided in textual form. [Table 3] summarizes the common inaccuracies in the presentation of data.
Table 3: Errors in presentations, duplications, and omissions of data in manuscripts
Use of non-SI units or notations
For scientific work, it is always ideal to use the International System of Units (SI) and its notations. One exception is the reporting of blood pressure (mm of mercury is accepted). However, some authors use abbreviations such as “gm” for “gram,” while the standard SI notation is “g”.
Decimal places and error level
There is some debate regarding the optimal number of decimal places. The Cochrane Handbook states that odds ratios, risk ratios, and standardized mean differences should be quoted to two decimal places. This becomes irrelevant when the odds ratio is very large and difficult to follow when it is very small. The problem arises because the number of significant digits is mistaken for the number of decimal places. As per the convention of scientific notation, all numbers should be presented in the form a × 10^b where 1 ≤ a < 10. If this is followed, the number of significant digits is one more than the number of decimal places. However, medical articles often do not use scientific notation. Thus, a value of 0.003 may have three decimal places but only one significant digit. On the other hand, a value such as 58923423.434, in which three decimal places are reported, carries little meaningful information in those decimal places.
A reasonable practice is to report three or four significant digits (as per the primary data), and not limit ourselves to decimal places. If the initial measurement was accurate to a unit place (say weighing machine has a minimum measure of 1 kg), it makes little sense to report an average (average weight is 7.851 kg) to the third decimal. The best way around is to have a weighing scale with a precision of 10 g. However, this may not always be possible, and hence the number of digits reported should be a balance between the number of significant digits and the precision of the measurement taken.
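Significant-digit formatting, as opposed to fixed decimal places, is easy to automate. A minimal sketch (the helper `to_sig` is illustrative, using Python's built-in `g` format, which keeps significant digits rather than decimals):

```python
def to_sig(x, digits=3):
    """Format x to a fixed number of significant digits."""
    return f"{x:.{digits}g}"

print(to_sig(0.003))         # '0.003'    (one significant digit shown fully)
print(to_sig(7.851))         # '7.85'     (three significant digits)
print(to_sig(58923423.434))  # '5.89e+07' (scientific notation for large values)
```

Note how the same rule handles very small and very large values gracefully, where a fixed number of decimal places would not.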
Statistical software often reports values to three decimal places, and authors may copy this exact value into the manuscript. One commonly encountered inaccuracy in manuscripts is “P = 0.000.” This is an output of specific statistical programs, but what it actually implies is “P < 0.001” – the program rounds off at the third decimal place. Technically speaking, the P value can rarely be zero, and such output should always be reported as “P < 0.001.”
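The “P = 0.000” problem is simple to guard against in any analysis pipeline. A small illustrative helper (hypothetical, not from any particular package):

```python
def format_p(p):
    """Report P values as journals expect, never as 'P = 0.000'."""
    if p < 0.001:
        return "P < 0.001"
    return f"P = {p:.3f}"

print(format_p(0.0004))  # 'P < 0.001'
print(format_p(0.03))    # 'P = 0.030'
```

Routing every reported P value through one such function removes the error at the source instead of relying on copy-editing.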
Reporting of software used
Nowadays, most authors report the software used for statistical tests. This is required because different software apply different methods of calculation for the same statistical test. This is particularly true for nonparametric tests.
Special Errors
The list of statistical errors is endless, and no review can be exhaustive. We want to share specific errors that we often encounter in our role as reviewers. However, it should be kept in mind that this list is limited to what we perceive as common.
Outliers: To exclude or not
Identification of outliers is a difficult task. Understanding outliers requires “eyeballing” or visualization of data. When nonparametric tests are used, outliers are less likely to cause major alterations in the results. It may be best to perform two sets of tests: one with outliers included, and the other with the outliers excluded, report both and let the reader decide which is more relevant. Another alternative is normalization of the data (using techniques such as log-transformations or Pareto scaling), which minimizes the influence of outliers.
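The robustness of nonparametric summaries mentioned above is easy to demonstrate: a single extreme value drags the mean far from the bulk of the data while barely moving the median. A toy sketch (values illustrative):

```python
import statistics

without_outlier = [2, 3, 3, 4, 4]
with_outlier = [2, 3, 3, 4, 100]  # one extreme observation

print(statistics.mean(without_outlier), statistics.median(without_outlier))
# 3.2 3
print(statistics.mean(with_outlier), statistics.median(with_outlier))
# 22.4 3
```

The mean jumps from 3.2 to 22.4, while the median stays at 3 — which is why rank-based tests and median-based summaries are less sensitive to outliers.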
Correlations: Concept of multiple testing
If the null hypothesis for a test is true and sampling is repeated 1000 times, probability dictates that a P value less than 0.05 will be found around 50 times. Similarly, if 1000 tests are done on the same sample and the null hypothesis holds true for all of them, about 50 tests will still show P values <0.05. This is known as the multiple testing or multiple comparison problem. In fields like imaging, where the comparisons can run into millions, the multiple comparison problem may lead to absurd conclusions. One example of this was shown in the experiment involving PET scanning of a dead salmon. Without corrections for multiple testing, the software erroneously implied that the dead salmon's brain lit up when specific images were projected.
The classical correction suggested is the Bonferroni: for n tests done, the corrected level of significance should be P < (0.05/n). The Bonferroni correction is very conservative, and it is hard to reach significance when n is large. Other corrections suggested are Dunnett's test for comparing experimental conditions against a common control and the Tukey or Duncan tests for comparisons across all pairs. For very large numbers of comparisons, as in genomics, transcriptomics, or metabolomics studies, procedures that control the false discovery rate are preferred over strict familywise error rate control.
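The Bonferroni rule, and the slightly less conservative step-down Holm procedure (which also controls the familywise error rate), can be sketched in a few lines. The P values below are hypothetical, chosen to show Holm retaining findings that Bonferroni discards:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p < alpha/n (very conservative)."""
    n = len(p_values)
    return [p < alpha / n for p in p_values]

def holm(p_values, alpha=0.05):
    """Step-down Holm correction: compare the k-th smallest p-value
    against alpha/(n - k), stopping at the first failure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (n - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p = [0.04, 0.01, 0.02]  # hypothetical p-values from three comparisons
print(bonferroni(p))  # [False, True, False]  (only p < 0.0167 survives)
print(holm(p))        # [True, True, True]
```

Holm is uniformly more powerful than Bonferroni while providing the same familywise guarantee, which is why many statisticians recommend it as the default.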
Reliability and agreement between observers and between tests
The use of correlations to explore inter- or intra-observer agreement, or for the comparison of laboratory techniques, must be avoided. Consistency or agreement between two sets of qualitative observations is best assessed by Cohen's kappa. The use of Cohen's kappa for continuous data is also fallacious; the intraclass correlation coefficient is the preferred statistic for such observations.
Multivariate analysis
Multivariate analysis is often required for various studies. For normal data with the dependent variable being continuous (say, a score or a ratio), linear regression is to be used. For discrete dependent variables (yes or no; diseased or not diseased; success or failure; likely, neutral, or not likely), logistic regression is appropriate.
The outcomes of a regression model depend on all the variables input initially. However, some authors report only the significantly related variables at the end of the analysis without enumerating the variables put into the model. This should be avoided. A common error pertains to the use of an excessive number of variables to build a regression model for a small sample size (e.g., 12 variables for a sample size of 50). A simple guide is to ensure that there are at least 10 samples for each variable included in the model (for 12 variables, the model should have a sample size of at least 120).
Linear regression is based on a set of assumptions. Many authors conduct linear regression without ascertaining whether these assumptions are met. One essential requirement for linear regression is homoscedasticity, that is, a constant variance of the residuals across the range of the independent variables; heteroscedasticity, its opposite, can arise, for example, from the presence of subgroups within the sample that differ from each other. Another essential requirement is the absence of multicollinearity. Multicollinearity is the presence of “independent” variables that can be derived from each other: for example, a regression that has DAS28, CDAI, swollen joint count, tender joint count, erythrocyte sedimentation rate, and physician and patient global assessments as independent variables has multicollinearity (DAS28 and CDAI can be derived from the other variables). This weakens the model, and the chances of reaching statistical significance are diminished.
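Multicollinearity of the kind described above is easy to see: a variable derived from another correlates perfectly with it. A toy illustration (variable names and values hypothetical; `pearson_r` is computed from first principles with the standard library):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# A composite score that is a linear function of a raw joint count:
# entering both as "independent" variables creates multicollinearity.
swollen_joints = [2, 5, 1, 8, 4, 6]
composite_score = [2 * s + 3 for s in swollen_joints]  # derived variable

print(round(pearson_r(swollen_joints, composite_score), 3))  # 1.0
```

In practice, pairwise correlations (or the variance inflation factor, available in most statistical packages) should be checked before both variables are entered into the same model.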
Limitations of the Manuscript
This manuscript has not dealt with sample size calculations, matching, randomization, missing values, and other such statistical methods, as they are part of “planning a study.” Further reading for the interested reader should include adjusted analyses such as ANCOVA, rank-based comparison of multiple groups (such as the Kruskal–Wallis test with post hoc Dunn tests), use of the receiver operating characteristic curve with the Youden index, and quality control of data analysis.
Conclusion
As journals and editors have become more aware of these concerns, standard recommendations for reporting statistics have been published. The Nature group has adopted statistical reporting guidelines, and the BMJ group now has statistical reviewers. Authors, as well as reviewers, need to be well acquainted with basic statistics. Reviewers especially need to be aware of their own limitations, and journals should endeavor to have facilities for independent statistical review. Finally, the reader acquainted with these concepts is armed to make independent interpretations of studies without being misdirected, intentionally or unintentionally.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References
Bennett CM, Wolford GL, Miller MB. The principled control of false positives in neuroimaging. Soc Cogn Affect Neurosci 2009;4:417-22.
García-Berthou E, Alcaraz C. Incongruence between test statistics and P values in medical papers. BMC Med Res Methodol 2004;4:13.
Wu S, Jin Z, Wei X, Gao Q, Lu J, Ma X, et al. Misuse of statistical methods in 10 leading Chinese medical journals in 1998 and 2008. ScientificWorldJournal 2011;11:2106-14.
Charan J, Saxena D. Reporting of various methodological and statistical parameters in negative studies published in prominent Indian Medical Journals: A systematic review. J Postgrad Med 2014;60:362-5.
George SL. Statistics in medical journals: A survey of current policies and proposals for editors. Med Pediatr Oncol 1985;13:109-12.
Manchikanti L, Kaye AD, Boswell MV, Hirsch JA. Medical journal peer review: Process and bias. Pain Physician 2015;18:E1-14.
Tsiamalou P, Brotis A. Biostatistics as a tool for medical research: What are we doing wrong? Mediterr J Rheumatol 2019;30:196-200.
Benlidayi IC. Statistical accuracy in rheumatology research. Mediterr J Rheumatol 2019;30:207-15.
Gasparyan AY, Ayvazyan L, Blackmore H, Kitas GD. Writing a narrative biomedical review: Considerations for authors, peer reviewers, and editors. Rheumatol Int 2011;31:1409-17.
Kandane-Rathnayake RK, Enticott JC, Phillips LE. Data distribution: Normal or abnormal? What to do about it. Transfusion (Paris) 2013;53:701-2.
Altman DG, Bland JM. Statistics notes: The normal distribution. BMJ 1995;310:298.
Fahr A. Nonparametric Analysis. Int Encycl Commun Res Methods 2017. p. 1-6. doi: 10.1002/9781118901731.iecrm0168.
Chin R, Lee BY, editors. Analysis of data. In: Principles and Practice of Clinical Trial Medicine. Ch. 15. New York: Academic Press; 2008. p. 325-59.
Schucany WR, Ng HK. Preliminary goodness-of-fit tests for normality do not validate the one-sample Student t. Commun Stat Theory Methods 2006;35:2275-86.
Lakens D. Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Soc Psychol Personal Sci 2017;8:355-62.
Goodman S. A dirty dozen: Twelve P value misconceptions. Semin Hematol 2008;45:135-40.
Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur J Epidemiol 2016;31:337-50.
Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biol 2015;13:e1002106.
Schober P, Bossers SM, Schwarte LA. Statistical significance versus clinical importance of observed effect sizes: What do P values and confidence intervals really represent? Anesth Analg 2018;126:1068-72.
Fleischmann M, Vaughan B. Commentary: Statistical significance and clinical significance-A call to consider patient reported outcome measures, effect size, confidence interval and minimal clinically important difference (MCID). J Bodyw Mov Ther 2019;23:690-4.
Arifin WN, Sarimah A, Norsa'adah B, Najib Majdi Y, Siti-Azrin AH, Kamarul Imran M, et al. Reporting statistical results in medical journals. Malays J Med Sci 2016;23:1-7.
Cole TJ. Too many digits: The presentation of numerical data. Arch Dis Child 2015;100:608-9.
Cole TJ. Setting number of decimal places for reporting risk ratios: Rule of four. BMJ 2015;350:h1845.
Zink RC, Castro-Schilo L, Ding J. Understanding the influence of individual variables contributing to multivariate outliers in assessments of data quality. Pharm Stat 2018;17:846-53.
Välikangas T, Suomi T, Elo LL. A systematic evaluation of normalization methods in quantitative label-free proteomics. Brief Bioinform 2018;19:1.
Lee S, Lee DK. What is the proper way to apply the multiple comparison test? Korean J Anesthesiol 2018;71:353-60.
Glickman ME, Rao SR, Schultz MR. False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies. J Clin Epidemiol 2014;67:850-7.
McHugh ML. Interrater reliability: The kappa statistic. Biochem Med 2012;22:276-82.
Konishi S. Testing hypotheses about interclass correlations from familial data. Biometrics 1985;41:167-76.
Ernst AF, Albers CJ. Regression assumptions in clinical psychology research practice-a systematic review of common misconceptions. PeerJ 2017;5:e3323.
Wang B, Ogburn EL, Rosenblum M. Analysis of covariance in randomized trials: More precision and valid confidence intervals, without model assumptions. Biometrics 2019;75:1391-400.
Chan Y, Walmsley RP. Learning and understanding the Kruskal-Wallis one-way analysis-of-variance-by-ranks test for differences among three or more independent groups. Phys Ther 1997;77:1755-62.
Fluss R, Faraggi D, Reiser B. Estimation of the Youden Index and its associated cutoff point. Biom J 2005;47:458-72.
Lang TA, Altman DG. Basic statistical reporting for articles published in biomedical journals: The “Statistical Analyses and Methods in the Published Literature” or the SAMPL Guidelines. Int J Nurs Stud 2015;52:5-9.
Clayton MK. How should we achieve high-quality reporting of statistics in scientific journals? A commentary on “Guidelines for reporting statistics in journals published by the American Physiological Society”. Adv Physiol Educ 2007;31:302-4.