In a series of papers, we replicate lab experiments in the social sciences. We also study how accurately replicability can be predicted, both by peer scientists in prediction markets and by machine learning algorithms.
In “Evaluating Replicability of Laboratory Experiments in Economics” we replicated 18 studies in experimental economics published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014. Following a carefully designed, pre-registered procedure, we find a significant effect in the same direction as the original in 11 of the 18 experiments (61%).
“Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015” is our second replication project: we study 21 high-impact experiments published in Nature and Science during 2010-2015, this time with even larger samples, up to five times as many subjects as the original experiments. Of the 21 experiments, 13 replicate, and the average replication effect size is about half that of the originals. Both false positives (where there is no true effect) and inflated effect sizes among true positives contribute to this shrinkage; for true positives, the replication effect size is on average 71% of the original.
Replications are expensive, and the work is hard and often not very rewarding. We therefore also study how well replication can be predicted. If a predictive mechanism is accurate enough, we can use it to decide which studies to actually replicate and which results to trust, without necessarily conducting replications. A journal could use this information to decide whether, for example, a paper should be replicated before it can be published.
Both replication studies included prediction markets, in which experienced psychologists and experimental economists were given the opportunity to bet on the outcomes of our replications before they were conducted. The aggregated beliefs produced by the markets are very accurate, as covered in, e.g., The Atlantic.
Prediction markets can be used to evaluate whole departments just as well as individual research papers. In Munafo et al. (2015) we study how the outcome of the 2014 Research Excellence Framework (REF) evaluation of UK chemistry departments could be predicted in a market where the traders were faculty members at the participating institutions. We show that prediction markets can be a useful complement to costly large-scale quality evaluations.
While cheaper than running actual replications, markets require many traders and substantial transaction volume to function efficiently, and using market makers to clear trades can end up being quite costly. In comparison, using a statistical model is almost free. In “Predicting Replication” we use machine learning to predict replication outcomes and explore which experimental features drive replicability. The model’s pre-registered predictions of the replicability of the Nature and Science papers are only slightly worse than those produced by the prediction market.
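The papers above do not pin down a specific market-maker mechanism, but Hanson’s logarithmic market scoring rule (LMSR) is a standard automated market maker and illustrates why subsidizing a market costs money: with liquidity parameter b and n outcomes, the maker’s worst-case loss is bounded by b·ln(n), so deeper liquidity must literally be paid for. A minimal sketch (illustrative only, not the mechanism used in our studies):

```python
import math

def lmsr_cost(q, b=100.0):
    """Hanson's LMSR cost function C(q) = b * ln(sum_i exp(q_i / b))."""
    return b * math.log(sum(math.exp(qi / b) for qi in q))

def lmsr_price(q, i, b=100.0):
    """Instantaneous price of outcome i: exp(q_i/b) / sum_j exp(q_j/b)."""
    z = sum(math.exp(qj / b) for qj in q)
    return math.exp(q[i] / b) / z

# Binary market on one replication: shares on "replicates" vs "fails".
q = [0.0, 0.0]
print(f"{lmsr_price(q, 0):.2f}")  # no trades yet -> 0.50

# A trader buys 50 "replicates" shares and pays the cost difference.
paid = lmsr_cost([50.0, 0.0]) - lmsr_cost(q)
q = [50.0, 0.0]
print(f"{lmsr_price(q, 0):.2f}")  # price rises -> 0.62
```

The price of each outcome behaves like an aggregate probability estimate, which is exactly what our studies read off the markets as peer beliefs.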
Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015
Nature Human Behaviour, 2018
with Colin F. Camerer et al.
Being able to replicate scientific findings is crucial for scientific progress. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015. The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications. The replications are high-powered, with sample sizes on average about five times higher than in the original studies. We find a significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size. Replicability varies between 12 (57%) and 14 (67%) studies for complementary replicability indicators. Consistent with these results, the estimated true-positive rate is 67% in a Bayesian analysis. The relative effect size of true positives is estimated to be 71%, suggesting that both false positives and inflated effect sizes of true positives contribute to imperfect reproducibility. Furthermore, we find that peer beliefs of replicability are strongly related to replicability, suggesting that the research community could predict which results would replicate and that failures to replicate were not the result of chance alone.
Evaluating Replicability of Laboratory Experiments in Economics
Science, 2016
with Colin F. Camerer et al.
The replicability of some scientific findings has recently been called into question. To contribute data about replicability in economics, we replicated 18 studies published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014. All of these replications followed predefined analysis plans that were made publicly available beforehand, and they all have a statistical power of at least 90% to detect the original effect size at the 5% significance level. We found a significant effect in the same direction as in the original study for 11 replications (61%); on average, the replicated effect size is 66% of the original. The replicability rate varies between 67% and 78% for four additional replicability indicators, including a prediction market measure of peer beliefs.
Using Prediction Markets to Forecast Research Evaluations
Royal Society Open Science, 2015
with Marcus Munafo et al.
The 2014 Research Excellence Framework (REF2014) was conducted to assess the quality of research carried out at higher education institutions in the UK over a six-year period. However, the process was criticized for being expensive and bureaucratic, and it was argued that similar information could be obtained more simply from various existing metrics. We were interested in whether a prediction market on the outcome of REF2014 for 33 chemistry departments in the UK would provide information similar to that obtained during the REF2014 process. Prediction markets have become increasingly popular as a means of capturing what is colloquially known as the ‘wisdom of crowds’, and enable individuals to trade ‘bets’ on whether a specific outcome will occur or not. They have been shown to successfully predict outcomes in a number of domains (e.g. sport, entertainment and politics), but have rarely been tested against outcomes based on expert judgements such as those that formed the basis of REF2014.
Predicting Replication
Work in Progress
with Anna Dreber et al.
We measure how accurately replication of experimental results can be predicted by a black-box statistical model. With data from four large-scale replication projects in experimental psychology and economics, and techniques from machine learning, we train a predictive model and study which variables drive predictable replication. The model predicts binary replication with a cross-validated accuracy rate of 70% (AUC of 0.79) and relative effect size with a Spearman ρ of 0.38. The accuracy level is similar to the market-aggregated beliefs of peer scientists (Camerer et al., 2016; Dreber et al., 2015). The predictive power is validated in a pre-registered out-of-sample test of the outcome of Camerer et al. (2018b), where 71% (AUC of 0.73) of replications are predicted correctly and effect size correlations amount to ρ = 0.25. Basic features such as the sample and effect sizes in original papers, and whether reported effects are single-variable main effects or two-variable interactions, are predictive of successful replication. The models presented in this paper are simple tools to produce cheap, prognostic replicability metrics. These models could be useful in institutionalizing the process of evaluation of new findings and guiding resources to those direct replications that are likely to be most informative.
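To make the pipeline concrete, here is a minimal, self-contained sketch of this kind of analysis: a logistic regression over paper-level features (original sample size, effect size, and whether the effect is a two-variable interaction), scored with k-fold cross-validated accuracy and AUC. The data are synthetic stand-ins generated from a made-up rule, and the from-scratch model is an assumption for illustration, not the pre-registered model or data from the paper.

```python
import math
import random

def sigmoid(z):
    if z > 30: return 1.0
    if z < -30: return 0.0
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain batch gradient-descent logistic regression (weights + bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)

def auc(scores, labels):
    """Probability a random positive outranks a random negative (Mann-Whitney)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in data: log sample size, effect size, interaction flag.
random.seed(1)
X, y = [], []
for _ in range(200):
    n_log = random.uniform(3, 7)        # log of original sample size
    effect = random.uniform(0.05, 0.8)  # original effect size
    interaction = random.choice([0, 1]) # two-variable interaction?
    # Made-up rule: larger samples/effects replicate; interactions hurt.
    logit = -2.0 + 0.4 * n_log + 3.0 * effect - 1.5 * interaction
    X.append([n_log, effect, interaction])
    y.append(1 if random.random() < sigmoid(logit) else 0)

# 5-fold cross-validation.
k, fold = 5, len(X) // 5
accs, aucs = [], []
for i in range(k):
    lo, hi = i * fold, (i + 1) * fold
    w, b = fit_logistic(X[:lo] + X[hi:], y[:lo] + y[hi:])
    scores = [predict(w, b, xi) for xi in X[lo:hi]]
    preds = [1 if s >= 0.5 else 0 for s in scores]
    accs.append(sum(p == t for p, t in zip(preds, y[lo:hi])) / fold)
    aucs.append(auc(scores, y[lo:hi]))

print(f"CV accuracy: {sum(accs)/k:.2f}, CV AUC: {sum(aucs)/k:.2f}")
```

In practice one would swap the synthetic generator for coded features of real original studies and hold out a pre-registered sample, as in the out-of-sample test described above.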