Guide: When To Trust A/B Test Results
I wrote this doc to share how I think about A/B testing on Facebook and other ad platforms. It’s organized by the kind of data I’m trying to collect with each test.
Level I: Tests at the ad-set level that you don't want to generalize
How to test: If a test is just between ads within an ad-set, and I’m not looking to generalize learnings across the account (like establishing that headline X is universally better than headline Y), you can usually just trust the ad platform (Facebook, Google, etc.) to decide which ad is better. The platforms do this by shifting spend toward or away from each ad based on your conversion goal, which for me is usually cost per lead or cost per purchase. Sometimes I’ll duplicate an ad to give it another chance if I think the platform called it too early.
I recommend this (as opposed to more stringent requirements for each ad test) for two reasons:
- Momentum and performance history matter a lot to ad-set performance on Facebook, so it’s generally best to let Facebook do its thing without too many interruptions.
- When allocating spend, Facebook considers other variables in the background besides the conversion goal, such as CTR, ad relevance score, and x-out rate, all of which will impact delivery going forward.
When to use this: Anywhere the platform has an effective way to bid per conversion, rather than just spending on raw impressions. When you bid per conversion, the platform is already testing the ads against each other anyway. In practice this means Facebook, AdWords (depending on the bid model), and, to a lesser extent, other ad networks like LinkedIn.
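To build intuition for what conversion-based spend allocation looks like, here is a toy sketch of a Thompson-sampling bandit choosing between two ads. This is purely illustrative: Facebook’s actual delivery system is proprietary and weighs far more signals than conversion rate (see the reasons above), and the conversion rates and impression counts here are made up.

```python
import random

# Toy illustration only: a Thompson-sampling bandit that shifts traffic toward
# the ad with the better observed conversion rate. Facebook's real delivery
# system is proprietary and considers far more signals than this.

# Hypothetical "true" conversion rates per impression -- unknown to the bandit.
TRUE_CVR = {"ad_A": 0.010, "ad_B": 0.013}

# Beta(1, 1) priors: track conversions and non-converting impressions per ad.
stats = {ad: {"conversions": 0, "misses": 0} for ad in TRUE_CVR}

def pick_ad():
    """Sample a plausible conversion rate for each ad and serve the higher draw."""
    draws = {
        ad: random.betavariate(s["conversions"] + 1, s["misses"] + 1)
        for ad, s in stats.items()
    }
    return max(draws, key=draws.get)

random.seed(42)
for _ in range(50_000):  # 50k impressions of made-up traffic
    ad = pick_ad()
    if random.random() < TRUE_CVR[ad]:
        stats[ad]["conversions"] += 1
    else:
        stats[ad]["misses"] += 1

for ad, s in stats.items():
    served = s["conversions"] + s["misses"]
    print(f"{ad}: served {served} impressions, {s['conversions']} conversions")
# The better-converting ad ends up with most of the impressions, which is the
# behavior this level of testing simply trusts.
```

The point isn’t the specific algorithm; it’s that the platform is already running a continuous test like this on your behalf, so layering a manual significance gate on top of it mostly just interrupts delivery.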
Level II: Ad tests you want to generalize
This is for ad tests where you want to generalize results across the account rather than just an individual ad-set. For instance, if you want to know whether headline X is generally better than headline Y across all similar ads, use this method.
For this, you’ll need to actually consider statistical significance mathematically. The best way (IMO) to test significance with the lowest possible sample size for online A/B testing is a sequential A/B test.
Evan Miller has a great website and tool for this, but here it is in short:
- Use Evan Miller’s tool to pre-determine how many conversions you’ll need for the test. Switch the significance setting α to 10% and keep the other settings the same. This sets up a test that will only detect whether one version is at least 20% better than the other, which is about the best you can do for online tests at our conversion volumes. If you’re willing to invest in longer tests, decrease the minimum detectable effect.
- Run the test until either the treatment pulls ahead of the control by the designated number of conversions (the treatment wins), or you hit the required total number of conversions without that happening (call the test inconclusive). A sketch of this stopping rule is below.
You absolutely cannot call the test early, or give the test more time, with this methodology. It’s only effective when used as described.
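To make the stopping rule concrete, here is a minimal sketch of the decision loop. It assumes you have already taken two numbers from Evan Miller’s calculator for your chosen α and minimum detectable effect: the total conversions to collect and the lead the treatment must build to win. The parameter names (`total_conversions`, `winning_lead`) and the numbers in the usage comment are mine, not the calculator’s.

```python
from typing import Iterable

def sequential_ab_test(conversions: Iterable[str],
                       total_conversions: int,
                       winning_lead: int) -> str:
    """Evaluate a stream of conversions, each labeled "treatment" or "control".

    total_conversions and winning_lead come from the sequential-testing
    calculator for your chosen significance level and minimum detectable
    effect -- they are NOT computed here.
    """
    treatment = control = 0
    for arm in conversions:
        if arm == "treatment":
            treatment += 1
        else:
            control += 1
        # Stop as soon as the treatment is far enough ahead.
        if treatment - control >= winning_lead:
            return "treatment wins"
        # Stop once we've seen the planned number of conversions with no winner.
        if treatment + control >= total_conversions:
            return "inconclusive"
    return "inconclusive"  # ran out of data before reaching either boundary

# Hypothetical usage with made-up numbers from the calculator:
# result = sequential_ab_test(stream, total_conversions=300, winning_lead=35)
```

Notice that the function never peeks at a p-value mid-stream; the only exits are the two boundaries you committed to up front, which is exactly why you can’t call the test early or extend it.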
Level III: Tests you absolutely need to be correct
Use this when A/B testing funnels on your site, or on nurture emails everyone in your funnel receives, or in large direct mail sends.
It’s the same process as above, but set the significance setting α to 5% and minimum detectable effect to 5%.
You absolutely cannot call the test early, or give the test more time, with this methodology. It’s only effective when used as described.
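To see why Level III tests are so much more expensive, here is a rough fixed-horizon sample-size calculation for comparing two conversion rates. This is the standard two-proportion approximation, not the exact number the sequential calculator gives you, and the 2% baseline conversion rate is a made-up example; it’s only here to show how tightening α and the minimum detectable effect blows up the required sample.

```python
from statistics import NormalDist

def approx_sample_size_per_arm(baseline_rate: float,
                               relative_mde: float,
                               alpha: float,
                               power: float = 0.8) -> int:
    """Rough fixed-horizon sample size per arm for a two-proportion z-test.

    Classic approximation:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    It will differ somewhat from the sequential calculator's answer.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2) + 1

# Made-up 2% baseline conversion rate for illustration.
level_2 = approx_sample_size_per_arm(0.02, relative_mde=0.20, alpha=0.10)
level_3 = approx_sample_size_per_arm(0.02, relative_mde=0.05, alpha=0.05)
print(f"Level II-style test (alpha=10%, MDE=20%): ~{level_2:,} visitors per arm")
print(f"Level III-style test (alpha=5%, MDE=5%):  ~{level_3:,} visitors per arm")
```

Dropping the minimum detectable effect from 20% to 5% alone multiplies the required sample by roughly 16× (the effect size enters the denominator squared), before the tighter α adds more on top. That’s why this level is reserved for site funnels, funnel-wide nurture emails, and large direct mail sends, where you actually have the volume.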