Reproducibility
In a series of papers, we replicate lab experiments in the social sciences. We also study how accurately replicability can be predicted, both by peer scientists in prediction markets and by machine learning algorithms.
Replications
In “Evaluating Replicability of Laboratory Experiments in Economics” we replicate 18 studies in experimental economics published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014. We follow a carefully designed, pre-specified procedure and find a significant effect in the same direction as the original study in 11 of the 18 experiments.
“Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015” is our second replication project, in which we study 21 high-impact experiments published in Nature and Science during 2010–2015. This time we use even larger samples, on average about five times as many subjects as the original experiments. Of the 21 experiments, 13 replicate, and the replication effect sizes are on average about half of the original ones. Both false positives (where there is no true effect) and inflated effect sizes among true positives contribute to this pattern: for true positives, the relative replication effect size is estimated at 71% on average.
Predictability
Replications are very expensive, and the work is hard and often not very rewarding. We therefore also study how well replication outcomes can be predicted. If a predictive mechanism is accurate enough, we can use it to decide which studies to replicate and which results to trust, without necessarily having to conduct replications. A journal could, for example, use this information to decide whether a paper should be replicated before it is published.
Both replication studies included prediction markets, in which experienced psychologists and experimental economists were given the opportunity to bet on the outcomes of our replications before they were conducted. The aggregated beliefs produced by the markets are very accurate, as explained in, for example, The Atlantic.
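As a rough illustration of how such market forecasts can be scored (the prices and outcomes below are made up, not data from our studies), one can treat each contract’s final market price as the crowd’s probability that the study replicates and compare it with the realized outcome:

```python
# Illustrative sketch with hypothetical numbers, not the actual market data:
# score the market's implied probabilities against the realized replication outcomes.
import numpy as np

market_prices = np.array([0.80, 0.35, 0.60, 0.20, 0.90])  # hypothetical final prices
replicated = np.array([1, 0, 1, 0, 1])                     # hypothetical outcomes (1 = replicated)

brier = np.mean((market_prices - replicated) ** 2)         # mean squared error; lower is better
hit_rate = np.mean((market_prices > 0.5) == replicated)    # share of studies called correctly

print(f"Brier score: {brier:.3f}, hit rate: {hit_rate:.0%}")
```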
Prediction markets can be used to evaluate whole departments as well as individual research papers. In Munafo et al. (2015) we study how the outcome of the 2014 Research Excellence Framework (REF) evaluation of UK chemistry departments could be predicted in a market whose traders were faculty members at the participating institutions. We show that prediction markets can be a useful complement to costly large-scale quality evaluations.
While cheaper than running actual replications, markets require many traders and a large transaction volume to function efficiently, and using market makers to clear trades can be quite costly. In comparison, using a statistical model is almost free. In “Predicting the replicability of social science lab experiments” we use machine learning to predict replication outcomes and explore which experimental features drive replicability. The model’s pre-registered predictions of the replicability of the Nature and Science papers are only slightly less accurate than those produced by the prediction market.
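A minimal sketch of that modelling setup (with entirely hypothetical features and placeholder data, not the published model) could look like this:

```python
# Sketch: predict binary replication from simple study-level features and
# evaluate with cross-validated AUC. Feature names and data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 60  # pretend we have 60 original studies

# Hypothetical study-level features: original p-value, log sample size,
# original effect size, and whether the effect is a two-variable interaction.
X = np.column_stack([
    rng.uniform(0.001, 0.05, n),
    rng.normal(4.5, 1.0, n),
    rng.uniform(0.1, 0.8, n),
    rng.integers(0, 2, n),
])
y = rng.integers(0, 2, n)  # replicated or not (random placeholder labels)

model = make_pipeline(StandardScaler(), LogisticRegression())
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
# With random placeholder labels the AUC hovers around 0.5; with real study
# features the same pipeline is what would produce an informative score.
print(f"Cross-validated AUC: {auc.mean():.2f}")
```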
Talks and activities
I strive to conduct research that is open and reproducible and try to help others do the same. I am a BITSS Catalyst and occasionally give talks on methods and theory related to reproducibility and on how we can upgrade the scientific process. Slides from these talks are available at https://adamaltmejd.se/slides/.
Publications
-
Predicting the replicability of social science lab experiments
PLOS One, 2019
with Anna Dreber, Eskil Forsell, Gideon Nave, Juergen Huber, Magnus Johannesson, Michael Kirchler, Taisuke Imai, Teck Ho, and Colin Camerer
We measure how accurately replication of experimental results can be predicted by black-box statistical models. With data from four large-scale replication projects in experimental psychology and economics, and techniques from machine learning, we train predictive models and study which variables drive predictable replication. The models predict binary replication with a cross-validated accuracy rate of 70% (AUC of 0.77) and estimates of relative effect sizes with a Spearman ρ of 0.38. The accuracy level is similar to market-aggregated beliefs of peer scientists (Camerer et al., 2016; Dreber et al., 2015). The predictive power is validated in a pre-registered out-of-sample test of the outcome of Camerer et al. (2018), where 71% (AUC of 0.73) of replications are predicted correctly and effect size correlations amount to ρ = 0.25. Basic features such as the sample and effect sizes in original papers, and whether reported effects are single-variable main effects or two-variable interactions, are predictive of successful replication. The models presented in this paper are simple tools to produce cheap, prognostic replicability metrics. These models could be useful in institutionalizing the process of evaluation of new findings and guiding resources to those direct replications that are likely to be most informative.
-
Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015
Nature Human Behaviour, 2018
with Colin F. Camerer, Anna Dreber, Felix Holzmeister, Teck Ho, Juergen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A. Nosek, Thomas Pfeiffer, Nick Buttrick, Taizan Chan, Yiling Chen, Eskil Forsell, Anup Gampa, Emma Heikensten, Lily Hummer, Taisuke Imai, Siri Isaksson, Dylan Manfredi, Julia Rose, Eric-Jan Wagenmakers, and Hang Wu
Being able to replicate scientific findings is crucial for scientific progress. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015. The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications. The replications are high powered, with sample sizes on average about five times higher than in the original studies. We find a significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size. Replicability varies between 12 (57%) and 14 (67%) studies for complementary replicability indicators. Consistent with these results, the estimated true-positive rate is 67% in a Bayesian analysis. The relative effect size of true positives is estimated to be 71%, suggesting that both false positives and inflated effect sizes of true positives contribute to imperfect reproducibility. Furthermore, we find that peer beliefs of replicability are strongly related to replicability, suggesting that the research community could predict which results would replicate and that failures to replicate were not the result of chance alone.
-
Evaluating Replicability of Laboratory Experiments in Economics
Science, 2016
with Colin F. Camerer, Anna Dreber, Eskil Forsell, Teck Ho, Juergen Huber, Magnus Johannesson, Michael Kirchler, Johan Almenberg, Adam Altmejd, Taizan Chan, Emma Heikensten, Felix Holzmeister, Taisuke Imai, Siri Isaksson, Gideon Nave, Thomas Pfeiffer, Michael Razen, and Hang Wu
The replicability of some scientific findings has recently been called into question. To contribute data about replicability in economics, we replicated 18 studies published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014. All of these replications followed predefined analysis plans that were made publicly available beforehand, and they all have a statistical power of at least 90% to detect the original effect size at the 5% significance level. We found a significant effect in the same direction as in the original study for 11 replications (61%); on average, the replicated effect size is 66% of the original. The replicability rate varies between 67% and 78% for four additional replicability indicators, including a prediction market measure of peer beliefs.
-
Using Prediction Markets to Forecast Research Evaluations
Royal Society Open Science, 2015
with Marcus Munafo, Thomas Pfeiffer, Adam Altmejd, Emma Heikensten, Johan Almenberg, Alexander Bird, Yiling Chen, Brad Wilson, Magnus Johannesson, and Anna Dreber
The 2014 Research Excellence Framework (REF2014) was conducted to assess the quality of research carried out at higher education institutions in the UK over a 6 year period. However, the process was criticized for being expensive and bureaucratic, and it was argued that similar information could be obtained more simply from various existing metrics. We were interested in whether a prediction market on the outcome of REF2014 for 33 chemistry departments in the UK would provide information similar to that obtained during the REF2014 process. Prediction markets have become increasingly popular as a means of capturing what is colloquially known as the ‘wisdom of crowds’, and enable individuals to trade ‘bets’ on whether a specific outcome will occur or not. These have been shown to be successful at predicting various outcomes in a number of domains (e.g. sport, entertainment and politics), but have rarely been tested against outcomes based on expert judgements such as those that formed the basis of REF2014.