It is interesting going through a promotions review[1]. The committee looks at your metrics, noting how many papers you have published and how many are read. This may not be the best metric. I was in the group that did the NZ Mental Health Survey, at a cost of around 12 million, and those papers have been cited about 300 times. Some of my letters, or a quick theoretical comment, have more citations than papers I spent years on.
Besides, there is a sense that if you don’t find anything significant you cannot publish the paper. This is a moral hazard: it leads people to trawl data for significance, which statisticians call p-hacking.
Pooling p-values across all disciplines, there was strong evidence for “evidential value”; that is, researchers appear to be predominantly studying phenomena with nonzero effect sizes, as shown by the strong right skew of the p-curve for p-values found in both the Results (binomial glm: estimated proportion of p-values in the upper bin (0.025 ≤ p < 0.05) (lower CI, upper CI) = 0.257 (0.254, 0.259), p < 0.001, n = 14 disciplines) and the Abstracts (binomial glm: estimated proportion of p-values in the upper bin (0.025 ≤ p < 0.05) (lower CI, upper CI) = 0.262 (0.257, 0.267), p < 0.001, n = 10 disciplines). We found significant evidential value in every discipline represented in our text-mining data, irrespective of whether we tested the p-values from the Results or Abstracts. Based on the net trend across all disciplines, however, there was also strong evidence for p-hacking in both the Results (binomial glm: estimated proportion of p-values in the upper bin (0.045 < p < 0.05) (lower CI) = 0.546 (0.536), p < 0.001, n = 14 disciplines) and the Abstracts (binomial glm: estimated proportion of p-values in the upper bin (0.045 < p < 0.05) (lower CI) = 0.537 (0.518), p < 0.001, n = 10 disciplines). In most disciplines, there were more p-values in the upper than the lower bin; and when we look at the p-values text-mined from Results sections in every discipline where we had good statistical power (i.e., Health and Medical Sciences, Biological Sciences, and Multidisciplinary), this difference was statistically significant.
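Their binned test is straightforward to play with. Here is a minimal sketch in R, not their code: it uses made-up p-values and a plain binomial test where they fitted a binomial GLM, but the logic of the two bins is the same.

```r
# A minimal sketch, not the authors' code: a crude version of the binned
# p-curve tests described above, with a simple binomial test standing in
# for their binomial GLM. 'pvals' is a hypothetical, made-up vector of
# text-mined p-values.
set.seed(1)
pvals <- runif(5000, 0, 0.06)   # placeholder p-values for illustration

# Evidential value: among significant p-values, a deficit in the upper
# half (0.025 <= p < 0.05) indicates the right-skewed p-curve expected
# when real, nonzero effects are being studied.
sig        <- pvals[pvals < 0.05]
upper_half <- sum(sig >= 0.025)
binom.test(upper_half, length(sig), p = 0.5, alternative = "less")

# p-hacking: just below 0.05, an excess in the upper bin
# (0.045 < p < 0.05) relative to the lower bin (0.04 < p <= 0.045)
# is the signature tested for.
window    <- sig[sig > 0.04]
upper_bin <- sum(window > 0.045)
binom.test(upper_bin, length(window), p = 0.5, alternative = "greater")
```

On these uniform placeholder values neither test should fire; the point is only the shape of the comparison.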
Bad practice also includes how data is presented. We often use graphs: the first thing I do with data is draw scatter and box plots so I can work out how the data is distributed (most likely, in my research, it will not be a normal distribution), and then I end up having very long discussions with my local statisticians about the correct way of analysing it. The choice of test matters. The way graphs are drawn matters.
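For what it is worth, that first look usually amounts to something like this in R (a rough sketch with invented, skewed data, not anything from my own studies):

```r
# A minimal sketch of the kind of first-look plots described above,
# using a made-up, skewed variable 'y' grouped by a hypothetical 'treatment'.
set.seed(42)
d <- data.frame(
  treatment = rep(c("control", "drug"), each = 30),
  y         = c(rlnorm(30, 0, 0.5), rlnorm(30, 0.4, 0.5))  # log-normal, not normal
)

# Scatter (strip) plot and box plot side by side: both show each group's
# spread and skew before any test is chosen.
par(mfrow = c(1, 2))
stripchart(y ~ treatment, data = d, vertical = TRUE, method = "jitter",
           pch = 16, main = "Univariate scatterplot")
boxplot(y ~ treatment, data = d, main = "Box plot")

# A quick check of the normality assumption before reaching for a t-test.
shapiro.test(d$y[d$treatment == "control"])
```

Base R is enough for this; the point is to see the spread and skew before arguing about which test to use.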
In addition to showing data for key findings, figures are important because they give authors the opportunity to display a large amount of data very quickly. However, most figures provided little more information than a table. Bar graphs were the most commonly used figures for presenting continuous data: 85.6% of papers included at least one bar graph. Most of these papers used bar graphs that showed mean ± SE (77.6%), rather than mean ± SD (15.3%). Line graphs and point and error bar plots were also common, and most showed mean ± SE. Figures that provide detailed information about the distribution of the data were seldom used: 13.4% of articles included at least one univariate scatterplot, 5.3% included at least one box plot, and 8.0% included at least one histogram. The journals that we examined publish research conducted by investigators in many fields; therefore, it is likely that investigators in other disciplines follow similar practices. The overuse of bar graphs and other figures that do not provide information about the distribution of the data has also been documented in psychology and medicine.
Our data show that most bar and line graphs present mean ± SE. The authors’ figure illustrates that presenting the same data as mean ± SE, mean ± SD, or in a univariate scatterplot can leave the reader with very different impressions. While the scatterplot prompts the reader to critically evaluate the authors’ analysis and interpretation of the data, the bar graphs discourage the reader from thinking about these issues by masking distributional information.
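The point is easy to reproduce. A rough R sketch, with invented numbers rather than their data, shows how a bar of mean ± SE can make two very different groups look alike:

```r
# A rough sketch (not the paper's figure code) of how a bar of mean +/- SE
# can hide what a univariate scatterplot reveals: two hypothetical groups
# with similar means but very different distributions.
set.seed(7)
a <- rnorm(20, mean = 10, sd = 1)                   # roughly symmetric
b <- c(rnorm(17, mean = 9, sd = 0.5), 16, 17, 18)   # similar mean, driven by outliers

mean_se <- function(x) c(mean = mean(x), se = sd(x) / sqrt(length(x)))
m <- sapply(list(A = a, B = b), mean_se)

par(mfrow = c(1, 2))
# Bar graph with SE error bars: two similar-looking means, little hint
# of what is driving group B.
bp <- barplot(m["mean", ], ylim = c(0, 20), main = "Mean +/- SE")
arrows(bp, m["mean", ] - m["se", ], bp, m["mean", ] + m["se", ],
       angle = 90, code = 3, length = 0.05)
# Univariate scatterplot: the outliers in group B are immediately visible.
stripchart(list(A = a, B = b), vertical = TRUE, method = "jitter",
           pch = 16, main = "Univariate scatterplot")
```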
To be fair, this paper has been widely read and mentioned. Publons has a post-publication comment in which the commenter notes that there is nothing particularly new in the analysis (the novelty is that the review is systematic) and that the paper has been widely noticed.
The reception of this paper has been quite astounding, with over 39,000 visits on the PLOS Biology website since 22 April. Astonishingly, my tweet on this paper (https://twitter.com/GaetanBurgio/status/590958042444800001) has attracted > 300 retweets, > 300 favourites and over 30,000 impressions. Additionally, today a comment on this paper was featured in Nature News and Comment (http://www.nature.com/news/bar-graphs-criticized-for-misrepresenting-data-1.17383), which started trending strongly on Twitter. The response to this paper underlines how widespread bad habits in statistics and data representation are. I would like to take this unexpected opportunity to share a summary of my discussions on Twitter about this paper and my personal take on this story. Hopefully we can start an interesting and fruitful discussion on this forum. For once, it won’t be on data manipulation and paper retraction!
I would like to make two general comments.
Firstly, bad statistics and bad habits are widespread throughout science, especially in the biological sciences. They undermine the reproducibility of data and experiments, which leads to a waste of public funds and of time spent reproducing experiments. This takes various forms, among them small sample sizes, p-hacking (http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106) and cherry-picking of data. How many times have I reviewed papers with ridiculously low sample sizes or cherry-picked data? I don’t really think I need to convince the readers of this discussion forum of this epidemic habit.
Secondly, the level of statistical knowledge among researchers in the biological sciences is often really poor: how many times have I seen students, postdoctoral researchers or even PIs who are just able to perform a t-test in Excel and have virtually no knowledge of basic statistics? We could discuss this topic endlessly. One specific issue I have come across is that the teaching of statistics is often boring or unattractive to students. Some would disagree with this, but we can discuss it here.
More specifically on this paper now.
Some would argue that this paper is basically a revisit of Anscombe’s quartet (http://en.wikipedia.org/wiki/Anscombe’s_quartet), which is probably true. However, it is always very good and refreshing to see someone speaking out and trying to address the issue of data misrepresentation.
The issues we face here relate to moral hazard as much as to statistical ignorance. In the discussion there is mention of publishing your analysis source code (generally in R). I can write that, but between analyses I tend to get rusty. The need to find significance drives many to present data in the most favourable way: we have all had papers rejected.
Publication bias is skewing the field. We all have unpublished studies in our bottom drawer.
We need to be more honest. The current system is too converged. I would rather do honest science and accept that I won’t get promoted than join an elite whose members lie to each other.
_____
1. The committee said my h-index was six. That is using Web of Science; it is higher now. On Google Scholar it is 20. It did not matter: no one in the department got promoted.