CausalImpact: A new open-source package for estimating causal effects in time series
September 10th, 2014 | Published in Google Open Source
How can we measure the number of additional clicks or sales that an AdWords campaign generated? How can we estimate the impact of a new feature on app downloads? How do we compare the effectiveness of publicity across countries?
In principle, all of these questions can be answered through causal inference.
In practice, estimating a causal effect accurately is hard, especially when a randomised experiment is not available. One approach we've been developing at Google is based on Bayesian structural time-series models. We use these models to construct a synthetic control — what would have happened to our outcome metric in the absence of the intervention. This approach makes it possible to estimate the causal effect that can be attributed to the intervention, as well as its evolution over time.
We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.
Today, we're excited to announce the release of CausalImpact, an open-source R package that makes causal analyses simple and fast. With its release, all of our advertisers and users will be able to use the same powerful methods for estimating causal effects that we've been using ourselves.
Our main motivation behind creating the package has been to find a better way of measuring the impact of ad campaigns on outcomes. However, the CausalImpact package could be used for many other applications involving causal inference. Examples include problems found in economics, epidemiology, or the political and social sciences.
How the package works
The CausalImpact R package implements a Bayesian approach to estimating the causal effect of a designed intervention on a time series. Given a response time series (e.g., clicks) and a set of control time series (e.g., clicks in non-affected markets, clicks on other sites, or Google Trends data), the package constructs a Bayesian structural time-series model with a built-in spike-and-slab prior for automatic variable selection. This model is then used to predict the counterfactual, i.e., how the response metric would have evolved after the intervention if the intervention had not occurred.
As with all methods in causal inference, valid conclusions require us to check for any given situation whether key model assumptions are fulfilled. In the case of CausalImpact, we are looking for a set of control time series which are predictive of the outcome time series in the pre-intervention period. In addition, the control time series must not themselves have been affected by the intervention. For details, see Brodersen et al. (2014).
A simple example
The figure below shows an application of the R package. Based on the observed data before the intervention (black) and a control time series (not shown), the model has computed what would have happened after the intervention at time point 70 in the absence of the intervention (blue).
The difference between the actual observed data and the prediction during the post-intervention period is an estimate of the causal effect of the intervention. The first panel shows the observed and predicted response on the original scale. The second panel shows the difference between the two, i.e., the causal effect for each point in time. The third panel shows the individual causal effects added up in time.
The script used to create the above figure is shown in the left part of the window below. Using package defaults means our analysis boils down to just a single line of code: a call to the function CausalImpact() in line 10. The right-hand side of the window shows the resulting numeric output. For details on how to customize the model, see the documentation.
How to get started
The best place to start is the package documentation. The package is hosted on Github and can be installed using:
install.packages("devtools")
library(devtools)
devtools::install_github("google/CausalImpact")
library(CausalImpact)
By Kay H. Brodersen, Google
In principle, all of these questions can be answered through causal inference.
In practice, estimating a causal effect accurately is hard, especially when a randomised experiment is not available. One approach we've been developing at Google is based on Bayesian structural time-series models. We use these models to construct a synthetic control — what would have happened to our outcome metric in the absence of the intervention. This approach makes it possible to estimate the causal effect that can be attributed to the intervention, as well as its evolution over time.
We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.
Today, we're excited to announce the release of CausalImpact, an open-source R package that makes causal analyses simple and fast. With its release, all of our advertisers and users will be able to use the same powerful methods for estimating causal effects that we've been using ourselves.
Our main motivation behind creating the package has been to find a better way of measuring the impact of ad campaigns on outcomes. However, the CausalImpact package could be used for many other applications involving causal inference. Examples include problems found in economics, epidemiology, or the political and social sciences.
How the package works
The CausalImpact R package implements a Bayesian approach to estimating the causal effect of a designed intervention on a time series. Given a response time series (e.g., clicks) and a set of control time series (e.g., clicks in non-affected markets, clicks on other sites, or Google Trends data), the package constructs a Bayesian structural time-series model with a built-in spike-and-slab prior for automatic variable selection. This model is then used to predict the counterfactual, i.e., how the response metric would have evolved after the intervention if the intervention had not occurred.
As with all methods in causal inference, valid conclusions require us to check for any given situation whether key model assumptions are fulfilled. In the case of CausalImpact, we are looking for a set of control time series which are predictive of the outcome time series in the pre-intervention period. In addition, the control time series must not themselves have been affected by the intervention. For details, see Brodersen et al. (2014).
A simple example
The figure below shows an application of the R package. Based on the observed data before the intervention (black) and a control time series (not shown), the model has computed what would have happened after the intervention at time point 70 in the absence of the intervention (blue).
The difference between the actual observed data and the prediction during the post-intervention period is an estimate of the causal effect of the intervention. The first panel shows the observed and predicted response on the original scale. The second panel shows the difference between the two, i.e., the causal effect for each point in time. The third panel shows the individual causal effects added up in time.
The best place to start is the package documentation. The package is hosted on Github and can be installed using:
install.packages("devtools")
library(devtools)
devtools::install_github("google/CausalImpact")
library(CausalImpact)
By Kay H. Brodersen, Google