Causal Impact to Infer the Effect of an Intervention
Just a few months after I (re-)joined the Data Team at TVLK, I learned a new concept called Causal Impact. Statisticians and analysts in various domains use it to understand the impact of an event or action (i.e. an intervention) on an outcome they're interested in. Why is this important in so many domains, including business?
In business, it's common to observe correlations between factors or variables, whether positive or negative. The problem is that we don't know whether a correlation also implies causation between the variables we're observing. Causation is harder to establish than correlation because causation has a direction. We can say A correlates positively with B, and that means the same thing as saying B correlates positively with A. However, "A causes B" is not the same as "B causes A". With causation, we also have to figure out the direction, or in simpler terms, which one causes which.
In other situations, we want to implement something that we know will affect several variables, and we'd like to measure how big the impact is. Since we're planning to put a change or new treatment in place, one way to estimate the impact is to plan a randomised experiment in which we randomly split our population (or part of it) into two groups: control and treatment. The control group does not experience the intervention, while the treatment group does. In short, we run the experiment for a specified period of time, and at the end we perform a statistical test (e.g. a t-test) on our metric of interest to see whether the difference in outcome between the two groups is statistically significant.
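To make the experiment idea concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the data is simulated, the function name `welch_t` is made up, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t(control, treatment):
    """Return (difference in means, Welch t statistic, approximate two-sided p-value)."""
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances (Welch).
    se = math.sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    t_stat = diff / se
    # Assumption: normal approximation instead of the exact t distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return diff, t_stat, p_value


rng = random.Random(42)

# Randomly split a hypothetical population of 2000 users into two groups.
users = list(range(2000))
rng.shuffle(users)
control_ids, treatment_ids = users[:1000], users[1000:]

# Simulated metric of interest: the treatment group gets an uplift of 1.0.
control = [rng.gauss(10.0, 2.0) for _ in control_ids]
treatment = [rng.gauss(11.0, 2.0) for _ in treatment_ids]

diff, t_stat, p_value = welch_t(control, treatment)
print(f"difference={diff:.3f}, t={t_stat:.2f}, p={p_value:.4f}")
```

In a real analysis you would probably reach for a statistics library with an exact t-test rather than this approximation, but the logic is the same: compare the two group means relative to their noise.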
However, we don't always have the bandwidth to run an experiment. For example, the marketing or product team may demand that an intervention (e.g. a marketing campaign or a new app feature) be released soon to production, where it is open to the whole audience. In that case we can't run an experiment, because there isn't enough time to design it (e.g. to perform a power analysis) before the release. In other situations, the experiment can't be conducted simply because the team doesn't have the resources to implement a splitting function in the production backend. Another common problem arises when the team notices that something suddenly changed in the past and wants to find out which action actually caused the change. Usually, numerous actions or events have occurred, any of which could have caused the trend to change.
This is where Causal Impact comes in handy. In short, we compare the actual observed trend with the predicted trend that would have happened if the intervention had not taken place (i.e. the counterfactual or synthetic control), and measure the difference between the two. The concept is quite simple and intuitive, but the implementation is quite complex. Unlike in an experiment, we only observe one outcome for the population in any given period, since everyone receives the same treatment during that period. That is why we need a prediction model to estimate the potential outcome that would have occurred if the treatment had not changed, or if a different kind of treatment had been implemented instead. Having a point estimate is not enough, since we need to estimate all the missing potential outcomes.
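To make "measure the difference" a bit more concrete, here is one way to write it down. The notation is my own assumption, not taken from any specific package: y_t is the observed outcome at time t, \hat{y}_t^{(0)} is the model's prediction of the outcome without the intervention, and the post-period runs from t = T_0 + 1 to T.

```latex
\hat{\tau}_t = y_t - \hat{y}_t^{(0)}, \qquad t = T_0 + 1, \dots, T
\qquad\text{and}\qquad
\hat{\tau}_{\text{cum}} = \sum_{t = T_0 + 1}^{T} \hat{\tau}_t
```

The first quantity is the pointwise effect and the second is the cumulative effect over the post-period. Because the prediction is uncertain, each of them is better thought of as a distribution than as a single number.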
The most powerful ingredient for predicting what could have happened is a set of other time series that are related to our outcome variable (i.e. covariates) but that we assume are not affected by the treatment or intervention. This ingredient is also the hardest to get, since we need a good understanding of the overall metrics (domain-specific knowledge is really useful here).
I won't go deeper into the technical side, but what the causal impact model does is basically:
- Draw posterior samples from the graphical model (from the moment the intervention happened onwards, i.e. the post-period)
- Build a model based on the pre-period (e.g. model B)
- Use model B to predict what should have happened in the post-period
- Draw samples from model B
- Get a posterior distribution of our causal effect
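The general flow of these steps can be sketched in a few lines of Python. To be clear, this is not the Bayesian structural time-series model itself: as a simplifying assumption I swap in ordinary least squares on a single covariate, and a residual bootstrap stands in for posterior sampling (it ignores the uncertainty in the fitted coefficients). The data is simulated, and the names `fit_line` and `estimate_effect` are made up for illustration.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def estimate_effect(covariate, outcome, n_pre, n_draws=1000, seed=0):
    """Return (mean effect per period, (lower, upper) 95% interval)."""
    rng = random.Random(seed)
    x_pre, y_pre = covariate[:n_pre], outcome[:n_pre]
    x_post, y_post = covariate[n_pre:], outcome[n_pre:]

    # Build the model on the pre-period only.
    a, b = fit_line(x_pre, y_pre)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x_pre, y_pre)]

    draws = []
    for _ in range(n_draws):
        # One simulated counterfactual path: prediction plus resampled pre-period noise.
        counterfactual = [a + b * xi + rng.choice(residuals) for xi in x_post]
        draws.append(mean([yo - cf for yo, cf in zip(y_post, counterfactual)]))

    draws.sort()
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws) - 1]
    return mean(draws), (lower, upper)


# Simulated example: 70 pre-period points, 30 post-period points, true lift of 5.
rng = random.Random(1)
n_pre, n_post, true_lift = 70, 30, 5.0
covariate = [50 + 0.1 * t + rng.gauss(0, 1) for t in range(n_pre + n_post)]
outcome = [10 + 1.2 * x + rng.gauss(0, 1) for x in covariate]
outcome = [y + true_lift if t >= n_pre else y for t, y in enumerate(outcome)]

effect, (low, high) = estimate_effect(covariate, outcome, n_pre)
print(f"estimated effect per period: {effect:.2f} (95% interval {low:.2f} to {high:.2f})")
```

The key design choice is that the model never sees post-period outcomes while it is being fitted. It only sees the covariate in the post-period, which is why the covariate must not be affected by the intervention.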
Google has an open-source R package, CausalImpact, which is built on a Bayesian structural time-series model. It also comes with a simple tutorial that uses one covariate to predict the outcome. Again, real-world problems will require a handful of covariates to get a better estimate of the causal effect.