Preface

About This Book

Supervising undergraduate, masters-level, and doctoral research students has shown me just how many of the skills that I take for granted in my day-to-day work were never taught in a course, but acquired through years of painful trial-and-error. You've probably heard that "the only way to learn how to do research is by doing research." Indeed: classroom exercises are always somewhat artificial, and there is no substitute for getting your hands dirty with a problem that really matters to you. But trial-and-error is a slow and clumsy way to gain proficiency, and throwing students in at the deep end is neither a recipe for academic success nor for mental well-being. The goal of this book is to put some structure around the process through which students learn to do empirical work in economics, building a strong foundation for later self-directed reading and research.

The book is divided into a number a number of lessons, each designed to take between between 90 minutes and 3 hours to complete. Broadly speaking, the material is a mix of data science, applied econometrics, and research skills. In keeping with the Swiss Army Knife logo, the idea is to teach you lots of little things that will come in handy later. While the topics covered below are something of a miscellany, there are strong connections between the lessons. For best results, complete them in order.

A key theme that runs throughout the lessons is the importance of reproducible research using open-source tools. Reproducible research is about creating a clean and fully-documented path from raw data to final results, making errors less likely to occur and easier to find when they do. It also allows other researchers, or our future selves, to build on past work, expanding the sum total of knowledge. Of course I can only replicate your research if I can run your code, and this is why open-source software is so important. Fortunately there are many fantastic open-source programming languages to choose from. This book uses R, the lingua franca of statistics and an increasingly popular choice among economists.

Pre-requisites

This book does not assume advanced knowledge of programming, mathematics, or econometrics, but it does have some pre-requisites. My target audience is first-year graduate students and final-year undergraduates in economics. At Oxford, I use this book to teach a first-year master's level course on Empirical Research Methods that comes after students have completed 16 weeks of basic statistics and econometrics. I assume that you've taken an econometrics course that uses matrix notation and that you have basic familiarity with R programming. If you need to brush up on econometrics, I recommend Marno Verbeek's Guide to Modern Econometrics. I've linked to the third edition because it is particularly inexpensive to buy a used copy, but any edition will do. At a more advanced level, Bruce Hansen's two volume series Econometrics is both excellent and free to download online. If you haven't used R before or feel the need for a bit of review, I suggest reading Hands-On Programming with R. It's free, short, and will get you up to speed quickly.

Good Coding Style

I aim to follow the tidyverse style guide throughout this book and you should too. In theory, any coding style that is clear and consistent is absolutely fine. In practice, it's almost impossible for a novice programmer to figure out what "clear and consistent" should mean. Pick a standard, learn it, and after you're more experienced you can always make a conscious decision to deviate from it if you so choose.

R Markdown

This book is written entirely in R Markdown, a simple markup language that allows you to combine R code and results with formatted text, LaTeX mathematical expressions and more. Writing documents in R Markdown allows you to output your results in a wide variety of formats, including web pages, pdfs, slides, and even Microsoft Word if you absolutely must. R Markdown is a key ingredient in reproducible research. For this reason, I suggest that you use it to take notes as you work through the lessons that follow. This book will not teach you R Markdown directly, but there are many helpful resources online that will. I suggest starting here. Another way to learn about R Markdown is by viewing the underlying source used to generate this book! You can do so on github. Files in the main directory that end in .Rmd are the ones you want. What you're reading right now is called index.Rmd, and the first lesson is 01-hot-hand.Rmd, for example.

Why not Stata?

Given that much of the material discussed below falls under the broad category of "applied microeconometrics" you may wonder why I chose R rather than Stata. Indeed, Stata is easy-to-use, and makes it relatively painless to implement "textbook" microeconometric methods.1 So why don't I like Stata? Before beginning my polemic I should be absolutely clear that Stata users are not bad people: hate the sin, love the sinner. Here begins the sermon.

First, Stata is expensive. The price for a Business single-user Stata license is $765 per year.2 If you want support for multicore computing, the price is even higher: an 8-core version of Stata costs $1,395 annually. There is no discount for Government or nonprofits, but as an Oxford faculty member, I can obtain an 8-core version of Stata for the low price of $595 per year, or around 9% of my annual research allowance. In contrast, the tools that we will learn in this book, mainly R and C++, are completely free. This is particularly important in the modern world of high-performance cluster computing. If you're considering running your code on a multicore machine on Amazon, Google, or Microsoft cloud servers, you don't want to pay a software license fee for every core that you use.

Second, Stata is almost comically behind the times. Let's see what's new in Stata version 16, released in February 2020.3 At the top of the list is the LASSO, a wildly popular technique for high-dimensional regression. Rob Tibshirani developed this method in a seminal paper from 1996, so it only took 24 years for it to be incorporated into Stata.4 Fortunately, Tibshirani and his co-authors made it easy for Stata, by releasing open-source software to implement the LASSO and related methods in R over a decade ago.5 Next on the list of new Stata features is linear programming, a technique that came to prominence in the late 1940s.6 Stata 16 also has the ability to call "any Python package"--something you can do for free in R using reticulate or in Python itself for that matter--and "truly reproducible reporting." Reproducible reporting is incredibly valuable, and it's something that we'll cover in detail below. It's also been available in R, completely free of charge, since at least 2002.7 I suppose we shouldn't expect too much of a statistical computing package that only added support for matrix programming in 2005, a full 20 years after Stata version 1.0.8

Third, Stata is a black box. Because the underlying source code is kept secret, there's no way for a Stata user to know for certain what's happening under the hood. A few years ago I tried to determine precisely what instrument set Stata was using in its implementation of a well-known dynamic panel estimator. The documentation was vague, so I resorted to reverse-engineering the Stata results by trial-and-error in R. I never did get the results to match perfectly. In contrast, if you're not sure what a particular R function or package is doing, you can simply read the source code and find out.

Fourth, and most importantly, Stata makes it hard to share with others. If I don't own a copy of Stata, I can't replicate your work. Even if I do own a copy of Stata, I still may not be able to so do: Stata's proprietary binary data formats are updated fairly regularly and do not maintain backwards compatibility. Datafiles created in Stata version 16, for example, cannot be opened in Stata 13. Indeed, depending on the number of variables included in your dataset, Stata 16 files cannot necessarily be opened even in Stata 15. Fortunately, as we'll see below, intrepid open-source programmers have developed free software to unlock data from Stata's proprietary and ever-changing binary formats.

Why not Matlab, Julia, or Python?

Unlike Stata, Matlab is a bona fide programming language and a fairly capable one at that. Nevertheless, my other critiques of Stata from above still apply: Matlab is extremely expensive, and it's not open source. In contrast, I have nothing bad to say about Python and Julia: they're great languages and you should consider learning one or both of them!9 In the end I decided to choose one language and R struck me as the best choice at this moment in time for the particular set of topics I wanted to cover. In five or ten years time who knows: I may re-write this book in Julia! But for the moment, R has a decided advantage in terms of maturity, a huge collection of useful packages, and a large and extremely supportive user community.

You can help make this book better!

I'm writing this book "in the open" for two reasons. First, I want to make it easy for others to borrow and adapt my material if they find it useful. Second, I want you to help me make this book better! Do you have comments or suggestions? Have you spotted typos or other errors? Is there a topic that I don't cover but you think I should? If so then please send me feedback! The best way to point out a typo or suggest a correction is by sending me a pull request. For other comments, you can either email me by removing any references to giant extinct lizards from or contact me on twitter @economictricks.

Now let's get started!


  1. Arguably, Stata is too easy to learn precisely because of the incentives faced by a software developer with monopoly power: see Hal Varian's paper: Economic Incentives in Software Design.↩︎

  2. These figures were accurate as of March 2021. For the latest prices, see https://www.stata.com/order/dl/.↩︎

  3. https://www.stata.com/new-in-stata/↩︎

  4. Tibshirani (1996) - Regression Shrinkage and Selection via the Lasso↩︎

  5. Friedman et al (2010) - Regularization Paths for Generalized Linear Models via Coordinate Descent↩︎

  6. For a history of linear programming, see Dantzig (1983). To be completely fair, the linear programming algorithm implemented in Stata 16 was only developed in 1992, a lag of merely 28 years.↩︎

  7. Reproducible reporting in R started with sweave. These days we have a fantastic successor package called knitr, which I cover below.↩︎

  8. The "Mata" programming language was added in Stata 9: https://www.stata.com/stata9/. For a timeline of Stata versions, see https://www.stata.com/support/faqs/resources/history-of-stata/.↩︎

  9. A good resources aimed at economists is https://quantecon.org, available in both Python and Julia versions.↩︎