Open Science & Reproducibility

Adam Altmejd

Nov 17, 2019

  1. The replication crisis
  2. The problem
  3. What to do about it

Part 1

The Replication Crisis

Carney 2010

Original: Carney, Cuddy, and Yap (2010). Replication: Ranehill et al. (2015).

Eyes & Generosity

Original: Bateson, Nettle, and Roberts (2006). Replication: Northover et al. (2017).

Currency Priming

Original: Caruso et al. (2013). Replication: Klein et al. (2014).


Macbeth Effect

Original: Zhong and Liljenquist (2006). Replication: Earp et al. (2014).


Facial Feedback

Original: Strack, Martin, and Stepper (1988). Replication: Wagenmakers et al. (2016).

Science/Nature Replications


Camerer et al. (2018)

Replication Heterogeneity


Klein et al. (2018)

P-Hacking in Economics

Brodeur 2016

Brodeur et al. (2016)

Power in Economics

  • 64,076 estimates from 6,700 studies
  • Median power 18%
  • 80% of effects exaggerated (typically by a factor of 2)
Ioannidis, Stanley, and Doucouliagos (2017)

IV is not reliable

  • 1,359 IV regressions from 31 papers in AEA journals

Young 2019

Young (2019)

Why Should we Care?

For science:

“non-reproducible single occurrences are of no significance to science”

For our careers:

“I have spent nearly a decade working on the concept of ego depletion […] The problem is that ego depletion might not even be a thing.”


Popper (2005), Inzlicht (2016)


Wakefield 1998

The Goal

Produce reproducible research that can be replicated easily.

  • Reproduction: Re-analyze with push of a button.
  • Direct Replication: internal validity
  • Conceptual Replication: external validity

Part 2

The Problem

Munafo 2017

Munafò et al. (2017)
  1. Publication Bias
  2. Forking (p-hacking)
  3. Low Statistical Power
  4. Scientific Misconduct

Publication Bias

Published research is not representative.

  • Hard to assess knowledge
  • Incentive to p-hack

The File Drawer

Franco 2014

Franco, Malhotra, and Simonovits (2014)

What is a P-value?

Under the null, how likely is data at least as extreme as what we observed?

It says nothing about whether the assumptions or the hypothesis itself are valid.

Nuzzo 2014

Nuzzo (2014)
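
A minimal sketch of what a p-value measures, using a permutation test in base R. The data are simulated and every name and number here is hypothetical; the point is only that the p-value is the share of null-world results at least as extreme as the observed one.

```r
# Hypothetical data: outcomes for a treated and a control group.
set.seed(1)
treated <- rnorm(30, mean = 0.5)
control <- rnorm(30, mean = 0)

observed <- mean(treated) - mean(control)

# Under the null of no effect, group labels are exchangeable, so we
# shuffle the pooled outcomes many times and recompute the difference.
pooled <- c(treated, control)
n_treated <- length(treated)
perm_diffs <- replicate(5000, {
  shuffled <- sample(pooled)
  mean(shuffled[seq_len(n_treated)]) - mean(shuffled[-seq_len(n_treated)])
})

# Two-sided p-value: share of shuffled differences that are at least
# as extreme as the observed difference.
p_value <- mean(abs(perm_diffs) >= abs(observed))
p_value
```

The result depends on the exchangeability assumption and on the chosen test statistic; it does not say how likely the hypothesis is to be true.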

Forking, P-hacking, Fishing, Researcher Degrees of Freedom

XKCD Explains P-Hacking

xkcd comic part 1

xkcd comic part 2

xkcd comic part 3

Bohannon chocolate scam

Forking Paths

Even with a pre-stated hypothesis and no conscious p-hacking, each design choice is a fork in the path towards a finished paper.

If observed significance influences these choices, p-values are meaningless.

Gelman and Loken (2013)
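
The cost of forking can be illustrated with a small simulation. This is a sketch under assumed settings (five candidate outcomes, no true effect, all names hypothetical): a researcher who reports whichever outcome looks best rejects a true null far more often than the nominal 5%.

```r
set.seed(42)

# One simulated study with no true effect: a binary treatment and
# n_outcomes unrelated outcome measures.
simulate_study <- function(n = 100, n_outcomes = 5) {
  treatment <- rep(c(0, 1), length.out = n)
  p_values <- vapply(seq_len(n_outcomes), function(j) {
    outcome <- rnorm(n)  # pure noise, independent of treatment
    t.test(outcome[treatment == 1], outcome[treatment == 0])$p.value
  }, numeric(1))
  # "prespecified" always tests the first outcome;
  # "best_fork" picks the smallest p-value after seeing the data.
  c(prespecified = p_values[1], best_fork = min(p_values))
}

results <- replicate(2000, simulate_study())

# Share of studies declared significant at the 5% level on each path.
false_positive_rate <- rowMeans(results < 0.05)
false_positive_rate
```

With five independent outcomes the data-dependent path should reject in roughly 1 - 0.95^5, about 23%, of null studies, while the pre-specified path stays near 5%.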

Forking Checklist

Wicherts 2016

Wicherts et al. (2016)

Low Statistical Power

  • Even with registry data, subgroups can get small
  • Empirical Economics: median power 18%
  • Type-M/S errors:

Gelman 2014

Gelman and Carlin (2014), Ioannidis, Stanley, and Doucouliagos (2017)
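
Type S and Type M errors can be approximated by simulation, in the spirit of Gelman and Carlin (2014). The true effect and standard error below are assumed values chosen only to mimic a low-powered design; this is a sketch, not the authors' code.

```r
set.seed(123)

# Assumed values for illustration only: a small true effect measured
# with a large standard error, i.e. a low-powered design.
true_effect <- 0.1
std_error <- 0.25
n_sims <- 100000

# Draw hypothetical replications of the estimate and flag the ones
# that would be significant at the 5% level (two-sided).
estimates <- rnorm(n_sims, mean = true_effect, sd = std_error)
significant <- abs(estimates) > qnorm(0.975) * std_error

power <- mean(significant)
# Type S: share of significant estimates with the wrong sign.
type_s <- mean(estimates[significant] < 0)
# Type M: average exaggeration ratio among significant estimates.
type_m <- mean(abs(estimates[significant])) / true_effect

round(c(power = power, type_s = type_s, type_m = type_m), 2)
```

In this setup an estimate must exceed about 0.49 in absolute value to be significant, so every significant estimate overstates the assumed true effect of 0.1 by nearly a factor of five or more.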

Scientific Misconduct

  • Uncommon, but not as rare as you think
  • Hard to fix, need strong norms

John 2012

John, Loewenstein, and Prelec (2012)

Part 3

What can you do?

Study Registration

Public demonstration of precedence

Pre-registration — Before accessing data

  • combats file-drawer
  • protects against forking
  • fends off reviewer no. 2

OSF Registration

Pre-Analysis Plan

Pick a path through the garden and stick to it.

Focus: sample selection, estimation

Also: motivation, literature

Could be: your paper before results
(maybe extension of grant proposal)

Register a Pre-Analysis Plan

  • Ideal: the registration itself produces the final results;
    this rarely works in empirical research.
  • Vague PAP is better than nothing.
  • Consider putting X% of data in lock-box.
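
One way to set up a lock-box, sketched in base R. The data set, the 30% share, and the seed are all assumptions for illustration; the idea is to explore freely on one part and touch the lock-box only once the analysis is final.

```r
set.seed(2019)  # fixed seed so the split can be reproduced exactly

# Hypothetical data set standing in for the real study data.
study_data <- data.frame(id = 1:1000, outcome = rnorm(1000))

lockbox_share <- 0.3  # assumed value for the "X%" to set aside
n_lockbox <- round(lockbox_share * nrow(study_data))
lockbox_rows <- sample(nrow(study_data), size = n_lockbox)

lockbox <- study_data[lockbox_rows, ]       # untouched until the analysis is final
exploration <- study_data[-lockbox_rows, ]  # free to explore

# Sanity check: the two parts are disjoint and cover the full sample.
stopifnot(
  nrow(lockbox) + nrow(exploration) == nrow(study_data),
  length(intersect(lockbox$id, exploration$id)) == 0
)
```

Recording the seed and the split code in the registration makes the split itself verifiable.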

PAP Example 1

Altmejd (2017)

PAP Example 2

Altmejd (2017)

PAP Example 3

Altmejd (2017)

Pre-analysis Plans: A Stocktaking

  • 195 development PAPs (AEA and EGAP registries, through 2016)
  • PAPs specify:
    • Clear hypothesis (90%)
    • More than 5 hypotheses (58%)
    • Statistical models (68%)
    • Attrition (25%)
    • Outliers (8%)
Ofosu and Posner (2019)

In 93 papers:

  • 25% of PAP hypotheses not reported
  • 18% additional tests
    (82% do not mention PAP adherence)
  • Why? No policing equilibrium.
Ofosu and Posner (2019)

“I’ve gotten an absurd number of requests for sensitivity analyses for strictly pre-specified empirical work. The existing norm appears to keep me from looking for unexpected results while providing no protection from readers or reviewers who want to dig through the data trying to kill off empirical results they don’t agree with.”

Ofosu and Posner (2019)

Specification Curve

If you cannot pre-pick one analysis, study all.

Simonsohn, Simmons, and Nelson (2015)
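
A minimal sketch of the descriptive part of a specification curve in base R. The data are simulated, and the set of "reasonable specifications" is assumed to be every subset of three hypothetical control variables; a real application would also vary samples, outcome definitions, and functional forms.

```r
set.seed(7)

# Hypothetical data: outcome y, variable of interest x, three optional controls.
n <- 500
dat <- data.frame(x = rnorm(n), c1 = rnorm(n), c2 = rnorm(n), c3 = rnorm(n))
dat$y <- 0.2 * dat$x + 0.5 * dat$c1 + rnorm(n)

controls <- c("c1", "c2", "c3")
# Every subset of controls is one specification (2^3 = 8 in total).
inclusion <- expand.grid(rep(list(c(FALSE, TRUE)), length(controls)))

# Estimate one specification and return the coefficient on x with its SE.
estimate_spec <- function(include) {
  rhs <- c("x", controls[as.logical(include)])
  fit <- lm(reformulate(rhs, response = "y"), data = dat)
  coef(summary(fit))["x", c("Estimate", "Std. Error")]
}

specs <- t(apply(inclusion, 1, estimate_spec))

# Sort by estimate to obtain the curve, then plot with 95% intervals.
specs <- specs[order(specs[, "Estimate"]), ]
lower <- specs[, "Estimate"] - 1.96 * specs[, "Std. Error"]
upper <- specs[, "Estimate"] + 1.96 * specs[, "Std. Error"]
plot(specs[, "Estimate"], ylim = range(lower, upper), pch = 19,
     xlab = "Specification (sorted)", ylab = "Estimated coefficient on x")
segments(seq_len(nrow(specs)), lower, seq_len(nrow(specs)), upper)
abline(h = 0, lty = 2)
```

Reporting the whole curve shows readers how much the conclusion depends on choices that could reasonably have gone another way.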

Birth Order

Self-Reported Intellect

Specification curve of birth order and self-reported intellect

Rohrer, Egloff, and Schmukle (2017)

Birth Order

Positive Reciprocity

Specification curve of birth order and positive reciprocity

Rohrer, Egloff, and Schmukle (2017)

Another curve

What is the signaling value of GPA?

Specification curve of Sievertsen

A Reproducibility-Enhancing Research Workflow

Data Management

  • Never edit raw input data directly
  • Keep processed data in separate folder
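
The two rules above can be sketched as follows. The folder layout and file names are hypothetical, and a temporary directory stands in for the project root: raw files are only ever read, and everything derived from them is written elsewhere by code.

```r
# Hypothetical project layout inside a temporary directory.
project <- file.path(tempdir(), "project")
raw_dir <- file.path(project, "data", "raw")
processed_dir <- file.path(project, "data", "processed")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw input file as delivered by a data provider.
raw_file <- file.path(raw_dir, "survey.csv")
write.csv(data.frame(id = 1:3, income = c(100, NA, 250)),
          raw_file, row.names = FALSE)

# All cleaning happens in code; the raw file is read but never modified.
raw <- read.csv(raw_file)
clean <- raw[!is.na(raw$income), ]

# Derived data go to the separate processed folder.
clean_file <- file.path(processed_dir, "survey_clean.csv")
write.csv(clean, clean_file, row.names = FALSE)
```

Because the processed file can always be regenerated from the raw file and the script, only the raw data and the code need to be archived.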

Reproducible Code

  • Self-documenting code: naming, structure, formatting
  • Each code file has separate purpose
  • Use comments when needed
  • Use functions/programs, never write same code twice
  • Literate programming in RMarkdown


rowmeans <- function(x, y) {
  # Calculate pairwise means of numeric input vectors.
  # Input: two vectors of equal length; "x", "y"
  # Output: one vector "out" of means.
  if (length(x) != length(y)) stop("x, y have unequal lengths")
  # Create empty vector to fill with row means.
  out <- vector("numeric", length = length(x))
  for (i in seq_along(x)) {
    # For each row, calculate the mean of x and y and store it in out.
    out[i] <- mean(c(x[i], y[i]))
  }
  out
}

Version Control

  • Annotated history of changes
  • Documents progress in research
  • Keep track of latest version
  • Collaborate efficiently, reconcile conflicts
  • Works best with plain-text files; changes in binary formats such as Word cannot be compared line by line

Version Control in GitHub

  • Git is an open-source VCS; GitHub hosts Git repositories
  • One repository per project
  • Changes tracked through commits
  • Link to OSF, freeze for registration

GitHub Desktop

Data Sharing

  • Without data, reproduction is impossible
  • Never compromise privacy
  • Alternatives: retrieval protocol, data store

Institutional Change

  • use results-blind review,
  • require study registration,
  • require data publication,
  • require reproducible code.


Thank you!


Altmejd, Adam. 2017. “Pre Analysis Plan: Effects of Education Choice.” October 3, 2017.

Bateson, Melissa, Daniel Nettle, and Gilbert Roberts. 2006. “Cues of Being Watched Enhance Cooperation in a Real-World Setting.” Biology Letters 2 (3): 412–14.

Brodeur, Abel, Mathias Lé, Marc Sangnier, and Yanos Zylberberg. 2016. “Star Wars: The Empirics Strike Back.” American Economic Journal: Applied Economics 8 (1): 1–32.

Camerer, Colin F., Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, et al. 2018. “Evaluating the Replicability of Social Science Experiments in Nature and Science Between 2010 and 2015.” Nature Human Behaviour 2 (9): 637–44.

Carney, Dana R., Amy J. C. Cuddy, and Andy J. Yap. 2010. “Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance.” Psychological Science 21 (10): 1363–8.

Caruso, Eugene M., Kathleen D. Vohs, Brittani Baxter, and Adam Waytz. 2013. “Mere Exposure to Money Increases Endorsement of Free-Market Systems and Social Inequality.” Journal of Experimental Psychology: General 142 (2): 301–6.

Christensen, Garret. 2018. “Manual of Best Practices in Transparent Social Science Research.”

Christensen, Garret, Jeremy Freese, and Edward Miguel. 2019. Transparent and Reproducible Social Science Research: How to Do Open Science. 1st ed. University of California Press Books.

Coffman, Lucas C., and Muriel Niederle. 2015. “Pre-Analysis Plans Have Limited Upside, Especially Where Replications Are Feasible.” Journal of Economic Perspectives 29 (3): 81–98.

Earp, Brian D., Jim A. C. Everett, Elizabeth N. Madva, and J. Kiley Hamlin. 2014. “Out, Damned Spot: Can the ‘Macbeth Effect’ Be Replicated?” Basic and Applied Social Psychology 36 (1): 91–98.

Franco, Annie, Neil Malhotra, and Gabor Simonovits. 2014. “Publication Bias in the Social Sciences: Unlocking the File Drawer.” Science 345 (6203): 1502–5.

Gelman, Andrew, and John Carlin. 2014. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6): 641–51.

Gelman, Andrew, and Eric Loken. 2013. “The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No ‘Fishing Expedition’ or ‘P-Hacking’ and the Research Hypothesis Was Posited Ahead of Time.”

Gentzkow, Matthew, and Jesse M. Shapiro. 2014. “Code and Data for the Social Sciences: A Practitioner’s Guide.”

Inzlicht, Michael. 2016. “Reckoning with the Past.” Michael Inzlicht. February 29, 2016.

Ioannidis, John P. A., T. D. Stanley, and Hristos Doucouliagos. 2017. “The Power of Bias in Economics Research.” The Economic Journal 127 (605): F236–F265.

John, Leslie K., George Loewenstein, and Drazen Prelec. 2012. “Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling.” Psychological Science 23 (5): 524–32.

Kidwell, Mallory C., Ljiljana B. Lazarević, Erica Baranski, Tom E. Hardwicke, Sarah Piechowski, Lina-Sophia Falkenberg, Curtis Kennett, et al. 2016. “Badges to Acknowledge Open Practices: A Simple, Low-Cost, Effective Method for Increasing Transparency.” PLOS Biology 14 (5): e1002456.

Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams, Štěpán Bahník, Michael J. Bernstein, Konrad Bocian, et al. 2014. “Investigating Variation in Replicability: A ‘Many Labs’ Replication Project.” Social Psychology 45 (3): 142–52.

Klein, Richard A., Michelangelo Vianello, Fred Hasselman, Byron G. Adams, Reginald B. Adams, Sinan Alper, Mark Aveyard, et al. 2018. “Many Labs 2: Investigating Variation in Replicability Across Samples and Settings.” Advances in Methods and Practices in Psychological Science 1 (4): 443–90.

Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (1): 0021.

Northover, Stefanie B., William C. Pedersen, Adam B. Cohen, and Paul W. Andrews. 2017. “Artificial Surveillance Cues Do Not Increase Generosity: Two Meta-Analyses.” Evolution and Human Behavior 38 (1): 144–53.

Nuzzo, Regina. 2014. “Scientific Method: Statistical Errors.” Nature News 506 (7487): 150.

Ofosu, George K., and Daniel N. Posner. 2019. “Pre-Analysis Plans: A Stocktaking.”

Popper, Karl. 2005. The Logic of Scientific Discovery. Routledge.

Ranehill, Eva, Anna Dreber, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, and Roberto A. Weber. 2015. “Assessing the Robustness of Power Posing: No Effect on Hormones and Risk Tolerance in a Large Sample of Men and Women.” Psychological Science 26 (5): 653–56.

Rohrer, Julia M., Boris Egloff, and Stefan C. Schmukle. 2017. “Probing Birth-Order Effects on Narrow Traits Using Specification-Curve Analysis.” Psychological Science 28 (12): 1821–32.

Simonsohn, Uri, Joseph P. Simmons, and Leif D. Nelson. 2015. “Specification Curve: Descriptive and Inferential Statistics on All Reasonable Specifications.” SSRN Scholarly Paper ID 2694998. Rochester, NY: Social Science Research Network.

Steegen, Sara, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. 2016. “Increasing Transparency Through a Multiverse Analysis.” Perspectives on Psychological Science 11 (5): 702–12.

Strack, Fritz, Leonard L. Martin, and Sabine Stepper. 1988. “Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis.” Journal of Personality and Social Psychology 54 (5): 768–77.

Wagenmakers, E.-J., T. Beek, L. Dijkhoff, Q. F. Gronau, A. Acosta, R. B. Adams, D. N. Albohn, et al. 2016. “Registered Replication Report: Strack, Martin, & Stepper (1988).” Perspectives on Psychological Science 11 (6): 917–28.

Wicherts, Jelte M., Coosje L. S. Veldkamp, Hilde E. M. Augusteijn, Marjan Bakker, Robbie C. M. van Aert, and Marcel A. L. M. van Assen. 2016. “Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid P-Hacking.” Frontiers in Psychology 7.

Young, Alwyn. 2019. “Consistency Without Inference: Instrumental Variables in Practical Application.”

Zhong, Chen-Bo, and Katie Liljenquist. 2006. “Washing Away Your Sins: Threatened Morality and Physical Cleansing.” Science 313 (5792): 1451–2.