I created a project that illustrates the use of programming code to perform empirical analysis from beginning to end: from database retrieval, through cleaning, manipulating and analyzing the data, to compiling the write-up and displaying the results. It consists of the original data, a couple of Stata .do files, and one LaTeX file (plus BibTeX), all organized into folders according to the TIER Protocol, a template for replication documentation. I follow Gentzkow and Shapiro and “automate everything”, including using batch files to execute the programs in the correct order and compiling the LaTeX document into a final PDF. All of the files are in this GitHub repository. If you would like to see the whole project in action, watch this video.

Running one program to do all of the analysis and to produce the write-up of that analysis is the idea behind R Markdown, a tool that is very popular among R users. Many Stata fans wish that Stata had markdown capability. However, existing efforts to replicate R Markdown in Stata fall short (see review here). Using batch files to stitch together Stata and LaTeX code replicates some of the functionality of R Markdown. High-profile researchers such as Gentzkow and Shapiro do research this way: it is reproducible and, apparently, very efficient in the long run.
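
To make the "stitching" idea concrete, here is a minimal sketch of a driver that runs each stage in a fixed order and stops at the first failure. It is written in Python rather than as a batch file, purely for illustration; it is not the script used in the project. The step labels and the `run_pipeline` helper are hypothetical, and the commands are harmless placeholders. In a real project each entry would presumably be a call to Stata in batch mode on a .do file, followed by the LaTeX and BibTeX compilation.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run each (label, command) pair in order and stop at the first failure.

    Returns the list of labels that completed successfully.
    """
    completed = []
    for label, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Halting here keeps later stages from running on stale or
            # incomplete output from an earlier stage.
            raise RuntimeError(
                f"step '{label}' failed with exit code {result.returncode}"
            )
        completed.append(label)
    return completed


# Placeholder commands (assumptions, not the project's real calls) so the
# sketch runs anywhere Python is installed.
demo_steps = [
    ("process data", [sys.executable, "-c", "print('cleaning and merging')"]),
    ("run analysis", [sys.executable, "-c", "print('estimating')"]),
    ("compile write-up", [sys.executable, "-c", "print('building pdf')"]),
]
```

Calling `run_pipeline(demo_steps)` runs the three placeholder stages in sequence. The point of the design is that the order of execution is itself recorded in code, so nobody has to remember which program to run first.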

The value of my project is that it provides a working example that can be used for teaching purposes. It deals with tasks commonly encountered by empirical researchers, such as importing, cleaning, reshaping and merging data (and dealing with mismatched keys). I show how these tasks can be accomplished using code rather than by hand.
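
As an illustration of the mismatched-keys problem, here is a small sketch, again in Python rather than Stata and not taken from the project. Two toy tables spell country names differently; an explicit crosswalk harmonizes the names before merging, and the merge reports unmatched rows instead of dropping them silently (similar in spirit to the match report produced by Stata's `merge`). All values, the `CROSSWALK` mapping, and the helper names are made up for the example.

```python
# Toy tables with made-up values; these are NOT data from the project.
growth = [
    {"country": "US", "year": 2000, "growth": 1.0},
    {"country": "Korea, Rep.", "year": 2000, "growth": 2.0},
    {"country": "New Zealand", "year": 2000, "growth": 3.0},
]
debt = [
    {"country": "United States", "year": 2000, "debt_ratio": 10.0},
    {"country": "South Korea", "year": 2000, "debt_ratio": 20.0},
]

# Hypothetical crosswalk: maps each spelling variant to one canonical name.
CROSSWALK = {"US": "United States", "Korea, Rep.": "South Korea"}


def harmonize(rows):
    """Return copies of the rows with country names mapped to canonical form."""
    return [
        {**row, "country": CROSSWALK.get(row["country"], row["country"])}
        for row in rows
    ]


def merge_on_keys(left, right, keys=("country", "year")):
    """One-to-one merge that returns (matched, left_only, right_only)."""
    right_index = {tuple(row[k] for k in keys): row for row in right}
    matched, left_only, used = [], [], set()
    for row in left:
        key = tuple(row[k] for k in keys)
        if key in right_index:
            matched.append({**row, **right_index[key]})
            used.add(key)
        else:
            left_only.append(row)
    right_only = [row for key, row in right_index.items() if key not in used]
    return matched, left_only, right_only


matched, left_only, right_only = merge_on_keys(harmonize(growth), harmonize(debt))
```

Without the `harmonize` step none of the rows would match, because no country name is spelled the same way in both tables. Keeping the crosswalk in code documents every renaming decision, which is exactly what gets lost when names are fixed by hand in a spreadsheet.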

The actual empirical content is motivated by the work of Herndon et al. (2014), who famously discovered a “spreadsheet error” in the work of Reinhart and Rogoff (2010). Given that this episode highlights the importance of careful data manipulation and thorough documentation, it is a fitting application for a workflow that uses code from beginning to end.