Understanding Experiment Stats

Causal uses the Sequential Probability Ratio Test (SPRT) to evaluate experiments. Specifically, we use a test from the family of mixture SPRTs; you can find an example implementation in the mixtureSPRT R package.

A sequential test has a major advantage over non-sequential tests (like a t-test): its results can be interpreted at any time after the test starts. You do not have to wait for a set time period to elapse before the results are valid.

With non-sequential tests (e.g. standard t-tests), you must pick the length of the experiment up front and wait until that time passes before reading the results. If you don't, you fall prey to the multiple comparisons problem: every time you look at a running experiment and consider halting it, you are making another comparison. Look often enough, and a lucky bounce in the noise will eventually pass for a real effect.
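
To see the problem concretely, here is a small simulation (a sketch for illustration, not anything Causal ships). It runs many A/A experiments where both variants are identical, peeks at a standard t-test every 50 observations, and stops at the first significant result. The realized false positive rate comes out far above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_max, n_sims = 0.05, 1000, 2000
false_positives = 0

for _ in range(n_sims):
    # A/A experiment: both "variants" draw from the same distribution,
    # so any significant result is a false positive.
    a = rng.normal(size=n_max)
    b = rng.normal(size=n_max)
    # Peek every 50 observations and stop at the first p < alpha.
    for n in range(50, n_max + 1, 50):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
            false_positives += 1
            break

print(f"false positive rate with peeking: {false_positives / n_sims:.2f}")
# Typically well above the nominal 0.05 -- each peek is another comparison.
```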

With Causal's sequential test procedure, there is no need to wait. As soon as the test has enough data, Causal will mark the experiment significant and you can roll out the result. Since you are not waiting around for unnecessary data, you'll learn and iterate faster.
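
For contrast, here is a minimal sketch of a mixture SPRT in the simplest setting: i.i.d. normal observations with known standard deviation and a normal mixing distribution over the effect size. The closed form below follows from integrating the likelihood ratio against that mixture. It illustrates the technique; it is not Causal's implementation.

```python
import numpy as np

def mixture_sprt_stat(x, sigma, tau):
    """Mixture SPRT statistic for H0: mean = 0 vs H1: mean ~ N(0, tau^2).

    Assumes i.i.d. normal observations x with known standard deviation
    sigma. Integrating the likelihood ratio over the N(0, tau^2) mixing
    distribution gives the closed form below.
    """
    n = len(x)
    s = np.sum(x)  # running sum of observations
    # Lambda_n = (1 + n*tau^2/sigma^2)^(-1/2)
    #            * exp(s^2 * tau^2 / (2*sigma^2*(sigma^2 + n*tau^2)))
    log_lam = (-0.5 * np.log1p(n * tau**2 / sigma**2)
               + s**2 * tau**2 / (2 * sigma**2 * (sigma**2 + n * tau**2)))
    return np.exp(log_lam)

# Reject H0 the first time the statistic exceeds 1/alpha; by Ville's
# (maximal) inequality this controls the type I error at alpha no matter
# when you stop -- which is what makes continuous monitoring safe.
alpha = 0.05
rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=200)  # true effect of 0.3
for n in range(1, len(x) + 1):
    if mixture_sprt_stat(x[:n], sigma=1.0, tau=1.0) >= 1 / alpha:
        print(f"significant after {n} observations")
        break
else:
    print("no significant result within 200 observations")
```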

Multiple Variants

There's another way an experiment can involve multiple comparisons: when it has more than one variant to test. Say, for example, that you have an experiment with 100 variants, and you accept a variant when there is only a 5% chance of a false positive (p-value = 0.05). You'd expect about 5 of those variants to show an effect even if there were nothing but noise underneath.

Causal controls the false discovery rate to account for these comparisons. When Causal says a variant is significant, you can trust the result no matter how many variants you add to the experiment.
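
The Benjamini-Hochberg step-up procedure is the standard way to control the false discovery rate. The sketch below shows how it works; the docs don't say this is exactly the procedure Causal uses.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of discoveries at false discovery rate q.

    Standard Benjamini-Hochberg step-up procedure, shown as a common
    way to control the FDR, not necessarily Causal's exact method.
    """
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Largest k with p_(k) <= (k/m) * q; everything up to k is a discovery.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        discoveries[order[: k + 1]] = True
    return discoveries

# 100 variant p-values, e.g. from an experiment where most variants do nothing.
rng = np.random.default_rng(2)
p_values = np.concatenate([rng.uniform(size=97), [1e-5, 1e-4, 2e-3]])
print(f"{benjamini_hochberg(p_values).sum()} variants pass at FDR 0.05")
```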

Primary vs Guardrail Metrics

Causal allows you to add a primary metric and several guardrail metrics to an experiment. The intent is that the primary metric drives the rollout decision, while the guardrail metrics act as possible blockers to that rollout.

For example, say the primary metric is click-through rate and one of the guardrail metrics is revenue. In other words: "I'd like to roll out an experiment if the click-through rate is significantly up, as long as it does not hurt revenue."

Since a guardrail metric can only prevent a rollout, it does not increase the number of comparisons you are making for the experiment. That is why Causal does not apply a multiple comparisons adjustment to guardrail metrics.

However, if you decide to roll out a variant when any of the metrics is up (primary or guardrail), you are no longer accounting for the multiple comparisons, and your decision is not statistically sound. This issue exists with all testing methodologies, so if you are doing this with standard t-tests today, be aware.
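
Putting the two previous paragraphs together, the sound decision rule looks like the sketch below. The MetricResult shape and should_roll_out function are hypothetical names for illustration, not Causal's API.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    # Hypothetical result shape for illustration; not Causal's API.
    name: str
    significant: bool   # did the sequential test reach a conclusion?
    lift: float         # estimated effect, positive = improvement

def should_roll_out(primary: MetricResult,
                    guardrails: list[MetricResult]) -> bool:
    """Roll out only if the primary metric is significantly up and no
    guardrail metric is significantly down. Guardrails can only block,
    so they add no extra chances of a false positive rollout."""
    if not (primary.significant and primary.lift > 0):
        return False
    return not any(g.significant and g.lift < 0 for g in guardrails)

ctr = MetricResult("click-through rate", significant=True, lift=0.02)
revenue = MetricResult("revenue", significant=False, lift=-0.001)
print(should_roll_out(ctr, guardrails=[revenue]))  # True: no significant drop
```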

Anticipated Experiment Length

The previous section explained that SPRTs do not need to know the length of an experiment in advance, since you can look at the results at any time. However, you'll notice that the experiment configuration screen has a field for "anticipated experiment length." That value is used to tune the test to detect the smallest possible effect within that time frame.¹

Just pick a value around the same duration as other experiments you've run. You do not need to wait for this time to pass before rolling out the experiment; the stats are valid no matter what value you choose.

¹ Specifically, it is used to calculate a good value for the shape of the mixture of alternative hypotheses.
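
As a concrete illustration of that footnote, the sketch below reuses the closed form from the earlier mixture SPRT example and grid-searches the mixing standard deviation tau that minimizes the smallest detectable effect at the anticipated sample size. This is a heuristic stand-in for the tuning, not Causal's actual formula.

```python
import numpy as np

def log_mixture_stat(s, n, sigma, tau):
    # Same closed form as the earlier mixture SPRT sketch, with running sum s.
    return (-0.5 * np.log1p(n * tau**2 / sigma**2)
            + s**2 * tau**2 / (2 * sigma**2 * (sigma**2 + n * tau**2)))

def smallest_detectable_effect(n, sigma, tau, alpha=0.05):
    """Smallest effect theta whose *expected* statistic crosses 1/alpha
    by sample size n (a crude proxy for power; illustration only)."""
    for theta in np.linspace(0.0, 1.0, 2001):
        # Under effect theta the expected running sum at n is theta * n.
        if log_mixture_stat(theta * n, n, sigma, tau) >= np.log(1 / alpha):
            return theta
    return np.inf

# Anticipated experiment length, expressed as a sample size.
n_anticipated, sigma = 2000, 1.0
taus = np.linspace(0.01, 2.0, 200)
effects = [smallest_detectable_effect(n_anticipated, sigma, t) for t in taus]
best = taus[int(np.argmin(effects))]
print(f"mixing sd tuned to the anticipated length: tau ~ {best:.3f}")
```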