boron1006 6 years ago

I would be shocked if it were as low as 57%. As an intern, I found that the analysts in charge of A/B tests often didn't have a background in science or running experiments, and didn't really care. There were a couple of data analytics teams in the company, and I think a lot of the developers didn't like my team because we were seen as "fussier" than the other one. We required people to preregister hypotheses, and run experiments for predetermined amounts of time.

I don't think the tech environment is very conducive to running experiments. Everything moves too fast; by the time you figure out that the results someone gave you are BS, they've already been promoted three times and work as a director at a different company.

I work in science now, and although people still p-hack like hell, there's at least some sort of shame about it. There's a long-term cost too: I've met a couple of researchers who have spent years trying to replicate a finding they got early in their career through suspicious means.

  • pimmen 6 years ago

    I've seen that culture too. I don't want to throw my colleagues under the bus, but the other teams at my company that have done A/B testing to compare design choices don't use proper controls or pre-register hypotheses. E.g., when they want to divide users into two groups, they often divide them based on some classification (gender, age, geography, etc.) rather than dividing them up completely randomly.

    In my team we try to be methodical. I'm just a lowly engineer, but one of my teammates is a statistician and another is a PhD student. We know we need to pre-register the hypothesis and the significance level we're testing at first, divide the groups randomly, and run the experiment for a set time period.

    We've gotten a lot more negative results and been proven wrong in our guesses more often than the other teams. For some reason I take pride in that.

  • thomasfedb 6 years ago

    It's no fun being labelled as 'fussy'. As a researcher I often work with doctors who ask for analyses with p-values on all sorts of inappropriate data sets; pushing back and telling them "you cannot draw any valid inference from this" can be quite hard.

  • tedsanders 6 years ago

    Running experiments for preset lengths is a mistake. If an effect is strong, it will show up earlier, and in that case you want to be able to switch earlier. If you are running a trial of drug A vs. a control, and drug A kills 100% of the first 100 patients who take it while the control kills 0, you end the trial immediately. You don't continue to give it to 900 more patients just because you pre-registered to treat 1,000 on the assumption that the effect would be small. This is one reason I think Bayesian approaches are better than frequentist approaches for A/B testing.

    • yichijin 6 years ago

      Hey all, statistician from Optimizely chiming in here. Just wanted to point out that this is exactly the right point.

      I wanted to add one detail--there actually are ways to do early stopping while staying within a frequentist approach. For example, most clinical trial methods are not Bayesian but rather are fixed-horizon tests that have the allowable amount of Type 1 error "spread out" amongst the multiple looks that are planned in advance.

      At Optimizely we essentially have a continuous version of this that does in fact allow for multiple looks with rigorous control of Type 1 error. As tedsanders mentions, the key upside is that if you start an experiment with a larger-than-expected lift, you can terminate it early. Then over many repeated experiments, you gain a lot in terms of average time to significance.

      The dissonance in this discussion mostly stems from the fact that this paper (which we actually collaborated on!) uses data from 2014, before we rolled out this new Stats Engine.

      For more, I would encourage a look at our paper: http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-w...
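
      To make the "spreading out" idea concrete, here is a toy group-sequential simulation -- a plain Bonferroni-style split of alpha across planned looks, purely illustrative and much cruder than the Stats Engine described above:

          # Toy simulation: "peek" at a t-test after each batch of data under the null
          # (no true difference). Naive peeking at alpha=0.05 inflates the false
          # positive rate; spreading alpha across the planned looks keeps it controlled.
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(0)
          n_sims, n_per_look, n_looks, alpha = 2000, 200, 5, 0.05

          def false_positive_rate(per_look_alpha):
              hits = 0
              for _ in range(n_sims):
                  a = np.empty(0)
                  b = np.empty(0)
                  for _ in range(n_looks):
                      a = np.concatenate([a, rng.normal(0, 1, n_per_look)])
                      b = np.concatenate([b, rng.normal(0, 1, n_per_look)])
                      if stats.ttest_ind(a, b).pvalue < per_look_alpha:
                          hits += 1          # declared a "winner" that doesn't exist
                          break
              return hits / n_sims

          print("peek at every look, alpha=0.05 :", false_positive_rate(alpha))            # inflated, roughly 0.14
          print("alpha spread over looks (0.01) :", false_positive_rate(alpha / n_looks))  # at or below 0.05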

      • smallnamespace 6 years ago

        What's the tradeoff vs. just taking a direct Bayesian approach?

        In fact, why use an inferential framework at all (estimating some sort of probability and using it to guide action), rather than directly using a policy learning framework, e.g. modeling this as Q-learning or multi-armed bandit problem?

        If at the end of the day you have some objective function (e.g. 'making money'), some known space of actions (e.g. move this widget up the page, change the color, engage with user this way), and a reasonable way to associate those two, then isn't the company literally doing reinforcement learning over time?

        It seems one benefit of a reinforcement learning framework is it maintains a set of actions that will still be explored in the future without forcing you to prematurely 'choose' whether A or B is actually better—if A is better in reality, then it will be explored more and more often and B will progressively become downweighted over time.
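
        For a sense of what that looks like in code, here is a minimal Thompson-sampling bandit over two variants -- made-up conversion rates, not tied to any particular product or library:

            # Each iteration = one visitor. Sample a plausible conversion rate for each
            # arm from its Beta posterior and show whichever arm looks best right now.
            import numpy as np

            rng = np.random.default_rng(42)
            true_rates = {"A": 0.10, "B": 0.12}     # unknown to the algorithm
            wins = {"A": 0, "B": 0}
            losses = {"A": 0, "B": 0}

            for _ in range(20_000):
                draws = {arm: rng.beta(wins[arm] + 1, losses[arm] + 1) for arm in true_rates}
                arm = max(draws, key=draws.get)
                if rng.random() < true_rates[arm]:
                    wins[arm] += 1
                else:
                    losses[arm] += 1

            print({arm: wins[arm] + losses[arm] for arm in true_rates})
            # traffic drifts toward B over time, while A still gets occasional exploration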

        • srean 6 years ago

          > If at the end of the day you have some objective function

          That "If" often evaluates to false.

          There are tough judgement calls involved in selecting the metric the org wants to optimize. It is very rare that business management commits to a clear quantitative goal. The reasons are many -- weasel room is important politically, selecting a metric that captures both short-term and long-term goals is difficult, there is a lot of uncertainty in the costs because of uncertainty over how overhead should be billed, etc.

          This is fairly common. Typically, in these situations it's the PMs who make the final call. There, the goal of the experiment is to glean as much knowledge as possible and present it to the PM. If that comes at the cost of exposing some customers to bad choices, so be it -- in other words, explore at the cost of losses in the opportunity to exploit.

        • dmichulke 6 years ago

          > why use an inferential framework at all

          Probably because of the maintenance cost of the code that was only explored but never exploited.

        • alexgmcm 6 years ago

          A policy learning approach is better imho - but getting people to switch to using a multi-armed bandit when they are used to AB testing can be difficult.

          People don't seem to trust the system to make the right decisions even though you can do simulations and have the mathematics to show it is correct.

    • boron1006 6 years ago

      Agree that bayesian approaches would be better here, but disagree that running experiments for preset lengths is a mistake.

      In a realistic scenario for us, bayesian and frequentist approaches will probably converge to a point that's close enough for a company that runs a website (i.e., we weren't killing anyone by leaving our experiment running). We also weren't getting massive fluctuations in effect size.

      The cost of the Bayesian approach, in terms of learning a new system of statistics, programming everything up from scratch, and interpreting results, probably wouldn't be worth the efficiency gains. If we were building an A/B testing program from scratch, then I probably would go that route.

      • gbrown 6 years ago

        Adaptive designs and early stopping rules are really good when:

        1. Data collection is expensive (time or money)

        2. Keeping with the status quo in the presence of new evidence is problematic (withholding a promising new drug)

        3. Continuing with the experiment in the presence of new evidence is problematic (the new drug is hurting people)

        Absent one of those features, it's probably not worth the added complexity.

    • aalleavitch 6 years ago
      • tedsanders 6 years ago

        >With Bayesian experiment design you can stop your experiment at any time and make perfectly valid inferences. Given the real-time nature of web experiments, Bayesian design seems like the way forward.

        Completely agree with Evan Miller here, thank you for sharing the link.

        • rpier001 6 years ago

          Except that the quote reflects a common misunderstanding. The problem of optional stopping is mostly a function of making decisions over multiple looks. A Bayesian approach that makes decisions over multiple looks has similar issues. This can be mitigated by strong priors, but typically to an unknown degree and at some cost to 'power'. How does this claim arise, then? Because a 'true' Bayesian approach makes no decisions/inferences - it just describes the current state of knowledge. If you are only describing the distribution of the posterior, then you can 'stop at any time and be valid'; otherwise, 'you got your Neyman-Pearson in my Bayesian analysis'.

        • stochastic_monk 6 years ago

          Except for your example of patients being catastrophically harmed. In that case, false negatives and false positives are not equally undesirable.

          • tedsanders 6 years ago

            Even in web testing there isn't an equality between false positives and false negatives.

            Suppose you're Facebook and you decide to test a new landing page on 1 million users. You roll out the test and notice that after 10,000 users, the new page is killing engagement. Whoops, turns out it had a bug and isn't even loading. Even though no medical patients are dying, this is still a very negative outcome for Facebook. Obviously they shouldn't test on 990,000 more users before fixing the bug, but that's what slavish adherence to pre-registered trial lengths would tell you to do, because it's 'cheating' to notice that there's a problem after the first 10,000 users.

            Some of you might say: sure, for extenuating circumstances like a bug you can break procedure. But in that case I think the logic slides down a slippery slope. What if, in the example above, instead of a bug we just had a feature performing terribly? In either case, the right move is to end the test early, since you don't need all the data points to measure a strong signal.

          • tomrod 6 years ago

            Yes, and that use case/failure mode should be remembered if the product being pushed overlaps that domain. Is that a large market outside of medical?

    • YeGoblynQueenne 6 years ago

      >> If an effect is strong, it will show up earlier.

      But if an effect shows up early, it is not necessarily strong. I think that's the point of running an experiment for a predetermined length - so you know you didn't get (un)lucky and hit a clump of results at the start of the experiment that would have averaged out later on.

      Obviously, if a trial drug is killing your patients at a surprising rate, you need to stop the experiment. In fact, I believe experiments are also sometimes stopped on ethical grounds when a drug is found to heal the treatment group at a high rate, either so that the control group can benefit as well, or just because it is hard to justify giving only half of your patients a life-saving drug and a placebo to the rest.

      But those are ethical considerations - not practical ones. Such experiments are cut short without complete confidence in the results, when there is the merest hint of ethical issues down the line. At least that's my understanding.

    • paulddraper 6 years ago

      Agreed. The point of most A/B testing is not to find the best approach per se, but to have the greatest number of successes.

      If failures essentially don't matter (e.g. number of bacteria killed in a petri dish), sure, use a frequentist p-value.

      If failures do matter (e.g. number of patients killed), use a Bayesian multi-armed bandit.

      Wrote a blog post on it: https://www.lucidchart.com/blog/the-fatal-flaw-of-ab-tests-p...

    • gbrown 6 years ago

      Adaptive designs are great, but especially in low risk areas like business A/B testing, I'll settle for more people understanding that statistical models are not magic black box truth detectors, and that p-values lose their interpretation in the presence of exploratory practices.

      The world would probably be a better place if we taught introductory statistics from a Bayesian perspective, but people get pretty set on their ways.

  • baybal2 6 years ago

    Well, think of value of any analytics in eCommerce setting:

    1. You see a random SKU spiking for n months in a row - good, keep stocking it, maybe even spend a few AdWords vouchers on it. BUT the fact that nobody in the company readily knows the reason for such a spike is already an indication that the business fails at understanding its market.

    2. Never in my career have I seen any of these "coffee divination" level ideas from analysts be "life changing" for a company. I've been through a number of companies spending money on an "algorithmic optimisation service" for their clients' banner ads. Yes, the "optimised" banners did score progressively more clicks over time, yet a single full-time designer hired alongside the "optimiser" company could score more clicks and purchases - without any input from any kind of data analysis.

    • tomrod 6 years ago

      [1] is what I do a lot of discovery on in my own work. The complexity the digital domain adds to the sales funnel, as well as the increased reach to new populations, makes simple A/B testing inadequate for accurate causal analysis. I think BigCorp has real scale advantages here.

      [2] I've only seen massive product movement when the analysts and the idea folks work together. I've never seen an idea wonk hit the bullseye while going against simple economic theory/reasoning about their customer target.

aisscott 6 years ago

Hi, I am one of the authors. We found that people p-hack with traditional t-tests. Most A/B tests were run this way in the past and some still are. The paper is using Optimizely data (from 2014) before Optimizely introduced new testing in 2015 designed to solve the issues we found in the paper.

If you want to know how Optimizely prevents p-hacking check out the math behind Optimizely’s current testing here: https://www.optimizely.com/resources/stats-engine-whitepaper...

  • boron1006 6 years ago

    I'm curious about the wording "effects are truly null". I was always under the impression that you can never really "accept the null", but rather only "fail to reject" it.

    • gwern 6 years ago

      In your standard NHST test, sure. But you can do different models. If I'm reading OP right, what they do is a mixture model, in which effects are assumed to come either from a zero-mean distribution or a positive-mean distribution with unknown probability _P_/1-_P_ and then you fit the collection of >2k effect sizes to find out what value of _P_ best fits the dataset as a whole. Apparently the best fit assumes that ~70% of effects are actually ~0.

      This can also be done nicely with a Bayesian mixture model or a spike-and-slab multilevel model, and that is what is done in "What works in e-commerce - a meta-analysis of 6700 online experiments", Brown & Jones 2017 http://www.qubit.com/sites/default/files/pdf/qubit_meta_anal... (although they don't formulate it in terms of a sharp null but ask the more relevant 'probability of a >0 beneficial effect', which for some kinds of A/B test has a very low prior - like only 15% for 'back to top' A/B tests).
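
      A rough sketch of that kind of mixture fit, on fake effect-size z-scores rather than either paper's data, just to show the mechanics:

          # Model observed z-scores as a mix of a null (zero-mean, unit-variance)
          # component and a non-null component, and estimate the mixing weight P.
          import numpy as np
          from scipy import stats, optimize

          rng = np.random.default_rng(1)
          z = np.concatenate([rng.normal(0, 1, 1400),       # truly-null experiments
                              rng.normal(2.0, 1.2, 600)])   # experiments with a real lift

          def neg_log_lik(params):
              p_null, mu, sigma = params
              lik = (p_null * stats.norm.pdf(z, 0, 1) +
                     (1 - p_null) * stats.norm.pdf(z, mu, sigma))
              return -np.sum(np.log(lik))

          fit = optimize.minimize(neg_log_lik, x0=[0.5, 1.0, 1.5],
                                  bounds=[(0.01, 0.99), (-5, 5), (0.5, 5)])
          print("estimated share of truly-null effects:", round(fit.x[0], 2))   # roughly 0.7 here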

      • boron1006 6 years ago

        Thanks, that was very helpful.

    • windows_tips 6 years ago

      The "null" hypothesis would typically be that "there is no effect".

      All that you can really do is prove it wrong, by measuring an effect when there "should", by the hypothesis, be none.

      Due to what is known as the "problem of induction", it's not sufficient to accept a hypothesis because you appeared to not measure an effect in the past, as that says nothing about whether an effect will occur the next time a measurement is made.

      The p-value is the "chance" of measuring an effect at least as large as the one observed, given that no effect actually exists.
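
      As a toy illustration of that definition: how often would a fair coin (the "no effect" hypothesis) produce a result at least as lopsided as 60 heads in 100 flips?

          from scipy import stats

          # two-sided p-value for 60/100 heads under a fair coin
          print(stats.binomtest(60, n=100, p=0.5).pvalue)   # ~0.057: still plausible under "no effect"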

nanis 6 years ago

Once at a programming conference, I was talking with a very senior developer at a well known company. He was going on and on about their A/B testing efforts.

I asked how they decided how long they would run an experiment for. The answer was "until we get a significant result."

I was shocked then, but now I am used to getting these kinds of responses from developers ... That and a belief that false positives are not a thing.

  • yichijin 6 years ago

    Hi, Jimmy from Optimizely here. The practice you describe is actually perfectly fine, so long as you're not using a method designed to be checked at a single point in time.

    Take a look at clinical trials. Often in clinical trials there are multiple phases, where early stopping is desirable in case the drug has higher-than-expected efficacy (or more-harmful-than-expected side effects).

    The types of tests conducted in clinical trials explicitly allow for multiple looks while maintaining correct control of the Type 1 error rate. At Optimizely we essentially have a version of this where the monitoring can be conducted continuously, with rigorous control of Type 1 error.

    Check out this paper for more details: http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-w...

    • generallyfalse 6 years ago

      Caveat emptor: I am reading the first pages of the article you link to. On page 1519 they say 1-alpha is the desired significance level. This is wrong; perhaps they mean that alpha is the significance level and 1-alpha is the desired confidence level. In step 3 they say: "Preferred test statistics are the ones that can control Type I errors." But that is wrong too; the Type I error rate is a parameter you fix, so it is not a property of the test statistic. Later, giving examples of uniformly most powerful test statistics, they require data following a normal distribution, but in web data the distribution can be a mixture of normals whose means depend on the hour. So perhaps the examples are not realistic in the web setting. To be continued.

    • nanis 6 years ago

      > Hi, Jimmy from Optimizely here. The practice you describe is actually perfectly fine, so long as ...

      Lotsa things are OK so long as you are doing X and Y etc.

      Take a look at a portion of the clinical trial[1] guidance from the FDA. Note specifically the basic Stats guidance:

          6.9.1 A description of the statistical methods to
            be employed, including timing of any planned
            interim analysis(ses).

          6.9.2 The number of subjects planned to be
            enrolled. In multicenter trials, the numbers of
            enrolled subjects projected for each trial site
            should be specified. Reason for choice of
            sample size, including reflections on (or
            calculations of) the power of the trial and
            clinical justification.

      I don't think it's recommended practice anywhere to start collecting data, do a simple t-test after each observation, and declare a significant difference as soon as p < 5%.

      Of course, if every other patient is suffering serious consequences, or becoming miraculously well on the second day of the trial, you stop. In those cases, you generally don't need a statistical test to tell you that your a priori evaluation of the drug or intervention was wrong.

      I fail to see what is so vital about some web site A/B test that one cannot be bothered to think ahead about what defines an observational unit, how many of those one might need to detect an improvement, and wait until after that sample has been attained to test (and, if the web site doesn't get enough visitors to fulfill your sample size requirement for that particular test, that is a different problem entirely).

      [1]: https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegula...

    • Jabbles 6 years ago

      Presumably using your method takes longer/requires more samples than a method that only checks once?

      • srean 6 years ago

        I haven't looked at the KDD paper, but in general it is the other way round. With sequential hypothesis testing you can expect to need less data on average.

        • computerphage 6 years ago

          That's highly counter-intuitive to me. Can you try to motivate why that's the case?

          My intuition is that any sequential (which I translated to online) technique could also be used in a non-sequential context. By that reasoning, there's no way a sequential technique could do better; at best it could be the same.

          • srean 6 years ago

            This is 1940s stuff. Checkout Wald.

            Short answer: in sequential testing you can ask at intermediate stages whether a satisfactory confidence has been reached. If yes, you are done; if not, you can continue. On average you will hit a 'yes' sooner. For non-sequential testing you cannot do this if you care about correctness (*). So the sample size needs to be pessimistic for non-sequential protocols, and then you are bound to that commitment.

            (*) If your method ensures correctness even after inspection at intermediate stages, then it's a sequential method by definition. There is some confusion in the literature about Bayesian vs. sequential. They are orthogonal concepts. Both Bayesian and frequentist tests of hypotheses can be sequential.
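
            A minimal sketch of Wald's SPRT with illustrative numbers (Bernoulli outcomes, p0=0.5 vs p1=0.6, alpha=beta=0.05) shows the "less data on average" point:

                import math, random

                p0, p1 = 0.5, 0.6
                alpha, beta = 0.05, 0.05
                upper = math.log((1 - beta) / alpha)    # cross this: accept H1
                lower = math.log(beta / (1 - alpha))    # cross this: accept H0

                def sprt(true_p, max_n=100_000):
                    llr, n = 0.0, 0
                    while lower < llr < upper and n < max_n:
                        success = random.random() < true_p
                        llr += math.log(p1 / p0) if success else math.log((1 - p1) / (1 - p0))
                        n += 1
                    return ("H1" if llr >= upper else "H0"), n

                random.seed(0)
                samples = [sprt(0.6)[1] for _ in range(500)]
                print("average samples used:", sum(samples) / len(samples))
                # roughly 130-150 on average, versus the ~265 that a comparable
                # fixed-sample design (normal approximation, same alpha/beta)
                # must commit to up front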

  • IshKebab 6 years ago

    It's perfectly fine to run an experiment until you get a significant result. You just have to do the maths differently - that's what most people don't know.

    • nanis 6 years ago

      People use classical methods because they are easier to understand than Bayesian ones. When using classical methods, the least one can do is fix the sample size before the experiment, and not peek until the experiment is over.

      That is easier than explaining Bayesian methods to people who cannot handle classical Stats.

  • p10_user 6 years ago

    It’s done in academia too. Along with “remove the outlier data that messes up our p values”

    • glup 6 years ago

      The difference is that the people in academia know that it is statistically unsound and choose to act unethically. I think the problem in A/B testing is that a lot of developers don't know it is unsound.

      • carlmr 6 years ago

        This is one of the cases where self-taught developers are usually not as good as those with a "proper" education. At uni you'll learn a lot of tangentially related stuff, like p-hacking and design of experiments, which a lot of people won't pick up when self-taught.

    • whyever 6 years ago

      It might make sense to remove systematic outliers if you know they are from non-statistical effects.

paraschopra 6 years ago

Hi, founder of VWO here. We revamped our testing engine to a Bayesian one in 2015 to prevent the ‘peeking problem’ of frequentist approaches. You can read about our approach here: https://vwo.com/blog/smartstats-testing-for-truth/

cle 6 years ago

Traditional A/B testing has very poor ergonomics. Experimenters are usually put in awkward conflict-of-interest situations that create multiple strong incentives not to perform rigorous, disciplined, valid experiments.

Null hypothesis significance testing is fundamentally misaligned with business needs and is not a good tool for businesses. This is true in many fields of science as well, but at least they have some mechanisms that try to ensure that experiments are unbiased. Businesses often don't have the same internal and external incentives that lead to those mechanisms, and so NHST is abused even more.

  • havkom 6 years ago

    Well, I have had well-run experiments which showed that the “currently internally hyped” way of doing things was completely inferior to the “old boring” and inexpensive way. The project leader was extremely enthusiastic about doing the experiment to prove the superiority of the new hyped way. When he got the results, though, it was clear that this was not to be talked about and the result would not be presented to his superiors.

    • tzahola 6 years ago

      “We’re a data-driven organization! [as long as the data fits our agenda]”

      It’s one of my favorite methodologies, next to “agile waterfall” and “holacracy with managers, middle-managers and minibosses”.

raverbashing 6 years ago

People are talking as if A/B tests were like testing a revolutionary new drug or making a big discovery. But they aren't.

Assuming 'B' is the new option, there are 3 possibilities: A is better than B, A is equivalent to B, or A is worse than B.

If your p-hacked experiment tells you to change from A to B when the null hypothesis was actually true, you aren't much worse off than you were in the first place. And if your long-term metrics are in place, then you can get a better measure for your experiment.

Not to mention experimental failures caused by unaccounted-for variables.

  • dahdum 6 years ago

    A large percentage of the experiments I've run were intended only to test whether the variant was worse than the control; we didn't care much how much better the variant might be.

    Usually these were positive consumer-facing features that we were concerned might negatively affect conversion. The switch to Bayesian testing made that kind of experiment a lot easier to run.

  • danieltillett 6 years ago

    Yes, not to mention that for nearly all A/B tests the size of the effect is minimal anyway. I have found that if the effect is large you don't need statistics, and if it is small it doesn't matter.

yichijin 6 years ago

Hi all. Jimmy, statistician from Optimizely chiming in.

We were excited to collaborate with the authors on this study. Keep in mind the data used in this analysis is from 2014, before we introduced sequential testing and FDR correction specifically to address this p-hacking issue. I expect these results are in line with any platform using fixed-horizon frequentist methods.

Check out this paper for more details: http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-w...

gingerlime 6 years ago

I created an open source A/B test framework[0], which also uses Bayesian analysis on the dashboard. IANAS(tatistician), but from what I understand it’s still better to plan the check point in advance, rather than stop when reaching significance.

A couple of articles worth reading [1] [2] (I can't exactly vouch for their validity, but they seem to make some good, well-thought-out arguments).

[0] https://github.com/Alephbet/gimel

[1] http://varianceexplained.org/r/bayesian-ab-testing/

[2] http://blog.analytics-toolkit.com/2017/the-bane-of-ab-testin...

emodendroket 6 years ago

Isn't randomly looking for a pattern and then slapping a hypothesis on it post facto a form of "p-hacking"? Because that's a completely commonplace and unremarkable practice in technology.

  • Engineering-MD 6 years ago

    I would say it is. By not being hypothesis driven, you are producing too many degrees of freedom by de facto testing every comparison.

    This is a huge problem in science too. I have regularly been told to just see what happens and come up with hypotheses after, or others have been unable to say what their hypotheses actually are.

    Scientists seem less and less likely to be trained in statistics, and in the scientific process. Technical knowledge is important, but understanding of the scientific process is much more important.

    • emodendroket 6 years ago

      There was a scandal because it turned out that was essentially what Brian Wansink was up to with his studies of food; that's kind of what put me on to this line of thinking.

  • andreareina 6 years ago

    I think that unnecessarily discounts exploratory work. There's nothing wrong with forming a hypothesis after seeing a pattern in data. But remember that it's just a hypothesis -- an unconfirmed guess. After the hypothesis is formulated, then an experiment can be designed to test it, see if its predictions hold up.

    • emodendroket 6 years ago

      I don't know about you, but the "exploratory" work is, in my experience, the start and end of it.

geoprofi 6 years ago

My very recent meta-analysis of 115 A/B tests reveals that a large proportion are highly suspect for p-hacking: http://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-t...

Going the Bayesian way, as suggested in some comments, is no solution at all, as I am not aware of an accepted Bayesian approach to dealing with the issue:

http://blog.analytics-toolkit.com/2017/bayesian-ab-testing-n...

(feel free to run sims, if you do not trust the logic ;-)) as well as on a more general level:

http://blog.analytics-toolkit.com/2017/5-reasons-bayesian-ab...
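
For anyone who does want to run a sim: here is a quick one (toy numbers, the naive stopping rule -- not any particular vendor's method) where both variants have the same true conversion rate, yet peeking at P(B > A) after every batch still declares plenty of "winners":

    import numpy as np

    rng = np.random.default_rng(7)
    true_rate, batch, n_batches, n_sims = 0.10, 500, 40, 400
    false_wins = 0

    for _ in range(n_sims):
        conv = {"A": 0, "B": 0}
        seen = {"A": 0, "B": 0}
        for _ in range(n_batches):
            for arm in conv:
                conv[arm] += rng.binomial(batch, true_rate)
                seen[arm] += batch
            # posterior P(B > A) under uniform Beta priors, via Monte Carlo
            a = rng.beta(conv["A"] + 1, seen["A"] - conv["A"] + 1, 2000)
            b = rng.beta(conv["B"] + 1, seen["B"] - conv["B"] + 1, 2000)
            if (b > a).mean() > 0.95:
                false_wins += 1          # declared a winner in an A/A test
                break

    print("share of A/A tests declaring a winner:", false_wins / n_sims)
    # well above 5%, even though the analysis at each look is Bayesian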

baybal2 6 years ago

Seconding this. The number of e-commerce companies that, to my memory, have A/B tested themselves into bankruptcy approaches 20.

My take on this: even in cases where such testing was done by disciplined statisticians (which is not the case at least 9 times out of 10; a math or CS PhD is not a professional statistician by any stretch), the value of advice drawn from that data is marginal at best.

As eCommerce is the bread and butter of the cheap electronics industry, I have seen time and time again that "science driven" outfits lose out to others. Not so much because their quality of decision making was demonstrably inferior, but because their obsession with "statistical tasseography" drained their resources and shifted their focus away from things of obvious importance.

dahdum 6 years ago

After listening to Optimizely reps give a talk about their success with a client (who was present), I suspect the support reps encourage these false positives. They presented a few tests as fantastic wins, when they all had basic flaws (like comparing a cold audience against one self-selected for interest). Maybe that was just one bad apple (doubtful)... but it was a large client and someone they felt should represent the company as a speaker.

Concerns from the audience were dismissed and referred to follow up after the talk. Never thought the same of Optimizely after that.

babl-yc 6 years ago

How can A/B testing tools be improved to prevent p-value hacking?

Could it be as simple as declaring your test duration before starting the experiment, and having the tool add an asterisk to your results if you stop the experiment early?

User23 6 years ago

I had the enjoyable experience of sitting at a tech conference and listening to the others in my group tell one of my friends that he had no idea what he was talking about when he said they weren't designing a proper experiment.

I was the only one there who knew he's a particle physicist.

The OP is horrifyingly right.

raphaelrk 6 years ago

Optimizely being for large enterprises, curious how people do A/B tests at their respective startups. Do most roll their own? How do you make sure your science is sound?

  • t3scrote 6 years ago

    We set audience criteria where the user account must be created after the test launches; from there it's a 50/50 split control/treatment experience (based on user id). The metric we are optimizing for is almost always conversion rate. We will turn the experiment off early if the treatment group is having really poor numbers; otherwise, once about 4000 accounts have been entered into the experiment we plug the numbers into a Bayesian calculator and call it a winner if there is a 90%+ probability that the treatment beats the control. https://www.abtestguide.com/bayesian/
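
    For reference, what a calculator like that computes is roughly this (made-up counts, uniform Beta priors):

        import numpy as np

        rng = np.random.default_rng(3)
        control_conv, control_n = 180, 2000    # hypothetical: 9.0% conversion
        treat_conv, treat_n = 212, 2000        # hypothetical: 10.6% conversion

        control_post = rng.beta(control_conv + 1, control_n - control_conv + 1, 200_000)
        treat_post = rng.beta(treat_conv + 1, treat_n - treat_conv + 1, 200_000)

        print("P(treatment beats control):", (treat_post > control_post).mean())
        print("expected relative lift:", (treat_post / control_post - 1).mean())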

    • joshuamorton 6 years ago

      Why so high?

      Bayesian results aren't p-values; a 60-70% probability that the treatment beats the control is just that, not a p-value of .4 or .3 (which would say nothing).

    • tzahola 6 years ago

      So one in ten of your findings is bogus.

      • t3scrote 6 years ago

        But isn’t that better than blindly introducing changes without testing them at all?

nvahalik 6 years ago

ELI5... what’s a p-hack?

  • azernik 6 years ago

    Even better than examples and explanations, learn by doing: https://projects.fivethirtyeight.com/p-hacking/

    "You’re a social scientist with a hunch: The U.S. economy is affected by whether Republicans or Democrats are in office. Try to show that a connection exists, using real data going back to 1948. For your results to be publishable in an academic journal, you’ll need to prove that they are “statistically significant” by achieving a low enough p-value."

  • danielvf 6 years ago

    “the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect. This is done by performing many statistical tests on the data and only paying attention to those that come back with significant results, instead of stating a single hypothesis about an underlying effect before the analysis and then conducting a single test for it.“

    https://en.m.wikipedia.org/wiki/Data_dredging

    And here’s a great example of real life P-Hacking to get a catchy article about the health benefits of chocolate:

    https://io9.gizmodo.com/i-fooled-millions-into-thinking-choc...

    • vitus 6 years ago

      FiveThirtyEight actually has a pretty good demo of p-hacking that demonstrates how one underlying dataset can be used to derive any desired conclusion(s) by deciding which factors to include / exclude.

      https://projects.fivethirtyeight.com/p-hacking/

    • cryptonector 6 years ago

      Torturing the data until it sings.

  • delecti 6 years ago

    In statistics, a p-value [1] is the probability of getting a result at least as extreme as yours by chance if there's not actually an effect from the thing you're testing. A common threshold to use is .05, which means that you'd expect the given result less than 5% of the time by chance. Typically this indicates a relationship, but if you simultaneously test enough things that aren't causally linked, you'd expect some to have p < 0.05. A good example is this xkcd [2]. In the example, green jelly beans aren't actually related to acne, but just by chance the results seem to indicate they are.

    [1] https://en.m.wikipedia.org/wiki/P-value [2] https://xkcd.com/882/
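
    The xkcd scenario is easy to reproduce in a quick simulation (toy numbers; none of the twenty "colours" has any real effect):

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(11)
        significant = []
        for colour in range(20):
            acne_with = rng.binomial(1, 0.3, 500)      # same true acne rate...
            acne_without = rng.binomial(1, 0.3, 500)   # ...in both groups
            if stats.ttest_ind(acne_with, acne_without).pvalue < 0.05:
                significant.append(colour)

        print("colours 'linked' to acne purely by chance:", significant)
        # about one in twenty shows up on average, exactly like the green jelly bean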

  • laurieg 6 years ago

    Doing the same experiment again and again until you get the answer you want.

    • drdrey 6 years ago

      In this case though I think they mean stopping the experiment early if there seems to be a positive result rather than experimenting again

  • cepth 6 years ago

    If you’ve taken some statistics or econometrics, you’ve probably heard of “significance levels” and “p-values”. For some reason, academia chose 0.05 as the threshold for “meaningful” or “significant” results.

    Generally, a 0.05 p-value means that you would observe a result at least as extreme as yours in 5% of experiments due to random sampling error alone, if there were no real effect. I.e. if I tested “is X correlated with cancer”, and my null hypothesis is “X isn’t correlated with cancer”, a 0.05 p-value would meet the threshold to reject that null hypothesis. Generally, a lower p-value means a more statistically significant result.

    The problem is that 0.05 seems to be much too high of a p-value. I.e. clever experimental design and cherry picking can generate many results that are statistically significant at that level. Many academics advocate for moving to a 0.01 or even 0.001 significance threshold.

    Recently, in some academic fields, there’s been widespread concern that many research studies were p-hacked. See for example, this paper that blew up last year in the finance community, because it suggests a significant number of finance papers, including some seminal ones, had p-hacked results: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3017677.

    The counter-argument is that for certain scientific fields, you may never be able to reach a p-value threshold of 0.001. This means the vast majority of research couldn’t be published in journals, academics wouldn’t be able to get promoted etc.

    • hammock 6 years ago

      This is wrong. P-hacking has nothing to do with the p-value threshold being too lenient.

      • Fomite 6 years ago

        This. I can p-hack in my field, if I wanted to, up to a p-value of arbitrary strictness, given enough time.

        • cepth 6 years ago

          > This is wrong. P hacking has nothing to do with the p being too lenient.

          > This. I can p-hack in my field, if I wanted to, up to a p-value of arbitrary strictness, given enough time.

          I'm not a practicing scientist/academic, so I want to be careful here. But, I think both of you are being a little uncharitable/pedantic.

          P-hacking is one contributor to the broader reproducibility crisis. Lowering the p-value to address the lack of reproducibility is not something that I made up. Yes, lowering the p-value threshold does not eliminate the motivations/techniques that are necessary for p-hacking, but it can make it a lot harder, and a lot less worthwhile. If you work in academia, and it takes you much longer to now cherrypick a sample to meet a much lower p-cutoff, it seems to follow that we would see less of it.

          This is an excerpt from a paper soon to be published in Nature: https://imai.princeton.edu/research/files/significance.pdf. The key quote: 'We have diverse views about how best to improve reproducibility,and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P values. However, changing the P value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance.'

          With regards to the comment that you could p-hack up to any strictness, I'm not sure this is correct. If you accept the proposal laid out in that Nature paper, to lower the threshold to P<0.005, or if we go even lower to P<0.001 I don't believe that you'd be able to p-hack in any practical way. Yes, you could cherry pick a tiny sample, but any peer reviewer or colleague of yours is going to ask questions about the sample.

          • Fomite 6 years ago

            I'm not being nitpicky - they are components of a problem of reproducibility, but orthogonal to each other. Bad UI design and a poor backend are both reasons "X website sucks!" but that doesn't mean they're the same.

            A perfectly designed, un-p-hacked study should still perhaps be held to a stricter p-value criteria than 0.05.

            And I am correct - because I've done it. I'm presently working on a paper where, because I primarily work with simulations, I can translate minute and meaningless differences into arbitrarily small p-values. And I used "arbitrarily" for a good reason - my personal record is the smallest value R can express.

            Ironically, this isn't because I have a tiny sample, but because I can make tremendously large ones. All of this is because nowhere in the calculation of a p-value is the question "Does this difference matter?"
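
            A toy version of that phenomenon (simulated data, nothing to do with the actual paper):

                import numpy as np
                from scipy import stats

                rng = np.random.default_rng(5)
                for n in (1_000, 100_000, 10_000_000):
                    a = rng.normal(0.00, 1, n)
                    b = rng.normal(0.01, 1, n)    # one hundredth of an SD: practically nothing
                    print(n, stats.ttest_ind(a, b).pvalue)
                # the p-value falls toward zero as n grows, even though the
                # difference stays trivially small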

            • cepth 6 years ago

              First off, I don’t have any experience with publishing based off the results of simulations. My (short) time in writing papers centered around economics research with observational datasets.

              I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way. My colleagues would’ve asked why the gigantic dataset we purchased had 1/4 of its observations thrown out etc.

              Your claim that the threshold and the practice of p-hacking are orthogonal (independent?) is still puzzling to me. I think a better analogy would be trying to game something like your PageSpeed score. In order to get a higher score, you skimp on UX so the page loads faster, and cut out backend functionality because you want fewer HTTP requests. Making it harder to achieve a high PageSpeed score forces you at some point to evaluate the tradeoffs of chasing that score.

              I have two questions for you:

              1) Would it take you more time to p-hack a lower threshold, or do all your results yield you a ~0.0000 p-value?

              2) In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc. wouldn’t address what you say you’re able to do. What can be done to fix it?

              • Fomite 6 years ago

                "I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way. My colleagues would’ve asked why the gigantic dataset we purchased had 1/4 of its observations thrown out etc."

                This is only true if you haven't collected your own data, and the size of the original sample is known - and that you used all of it. I would suggest that a fixed, known sample size is a relatively rare outcome for many fields.

                "Your claim that the threshold and the practice of p-hacking are orthogonal (independent?) is still puzzling to me."

                The suggestion is that they're unrelated. Changing to, say, p = 0.005 will impact studies that aren't p-hacked, and does not make evidence p-hacking-proof. It potentially makes things more difficult, but not in a predictable and field-agnostic fashion.

                "1) Would it take you more time to p-hack a lower threshold, or do all your results yield you a ~0.0000 p-value?"

                It might take me more time - but I could also write a script that does the analysis in place and simply stops when I meet the criterion. The question is whether it will take me meaningfully more time - "run it over the weekend instead of overnight" isn't a meaningful obstacle.

                "In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc. wouldn’t address what you say you’re able to do. What can be done to fix it?"

                My preference is to move past a reliance on significance testing and report effect sizes and measures of precision at the very least. If one must report a p-value, I'd also require the reporting of the minimum detectable effect size that could be obtained by your sample.

                Pre-announcing sample size would...just be a huge pain in the ass, generally.
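
                For what it's worth, the minimum detectable effect for a two-arm conversion test is a one-liner with the usual normal-approximation formula (the 10% baseline and sample size below are made up):

                    from scipy.stats import norm

                    def mde(n_per_arm, base_rate, alpha=0.05, power=0.8):
                        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
                        se = (2 * base_rate * (1 - base_rate) / n_per_arm) ** 0.5
                        return z * se   # smallest absolute lift the design can reliably detect

                    print(mde(10_000, 0.10))   # ~0.012, i.e. lifts below ~1.2 points are under the radar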

              • joshuamorton 6 years ago

                Not the above poster, but...

                >I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way

                Correct, but the most common methods of p-hacking involve changing the dataset size, either by repeating the experiment until the desired result is achieved (a la xkcd [0]), or by removing a large part of the dataset due to a seemingly-legitimate excuse (like the fivethirtyeight demo that has been linked already).

                Pre-announcing your dataset size is pre-announcing your sample size. If you pre-announce your dataset, p-hacking is not possible. This is true. But most research doesn't use a public dataset that is pre-decided.

                >Would it take you more time to p-hack a lower threshold

                Yes.

                >In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc.

                This doesn't follow.

                [0]: https://xkcd.com/882/

                • cepth 6 years ago

                  Sorry if the second question was unclear. My point was that for simulation based research, it doesn’t seem that pre-announcing your sample size would do much for preventing p-hacking.

                  E.g. if I say “I will do 10000 runs of my simulation”, what’s to prevent me from doing those runs multiple times, and selecting the one that gives me the desired p-value? For observational research, there’s obviously a physical limit to how many subjects you can observe etc. Would still love an answer from the grandparent comment.

                  • joshuamorton 6 years ago

                    I believe that's where the original post's

                    >given enough time.

                    comes in.

                    One nice thing about simulation based research is that it is often (more) reproducible, so a simulation can be run 10000 times, but then the paper might be expected to report how often the simulation succeeded. In other words, you can increase the simulation size to make p-hacking infeasible

                    Note that in practice, pre-announcing your sample size doesn't prevent p-hacking unless your sample size is == to a known sample. If you say "our sample size will be X", but you can collect 2 or 3x X data even, you can almost certainly p-hack.

                    Not to mention that I'm unaware of any field where people actually pre-announce their sample sizes. Does this happen on professor's web pages and I'm unaware, or as footnotes in prior papers?

                    • cepth 6 years ago

                      Again, academia/research is not my profession. But, some cool efforts in this area include osf.io, which is trying to be the Arxiv or Github of preregistration for scientific studies.

                      The best preregistration plans will typically include a declared sample or population to observe (http://datacolada.org/64), or at least clear cut criteria for which participants or observations you will exclude.

                      I think for the type of economics/finance research I’m most familiar with, you often implicitly announce your sample when securing funding for a research proposal. E.g. if I’m trying to see if pursuing a momentum strategy with S&P 500 stocks is profitable (a la AQR’s work), it’s pretty obvious what the sample ought to be. This is partly why that meta study I linked to earlier was able to sniff out potential signs of p-hacking.

          • hammock 6 years ago

            The parent asked very straightforwardly what is p hacking and you replied with a red herring. If I'm being pedantic, you're being unhelpful.

            • cepth 6 years ago

              > clever experimental design and cherry picking can generate many results that are statistically significant at that level.

              I’m not sure what’s incorrect about this statement? If you disagree with the “fix” to the problem that is most familiar to me, that’s fine. It’s one of many approaches.

              But, at what point did I mislead the parent as to what p-hacking is? What’s your definition?

              • joshuamorton 6 years ago

                p-hacking has nothing to do with any specific significance level. You can p-hack at a significance of .5 or .05 or .000005. A better definition would be

                >p-hacking is a set of related techniques, whereby clever experimental design and cherry picking of data can generate results that falsely appear statistically significant.

                There are a few important differences here:

                1. The effect is not statistically significant. In fact often, there is no effect at all.

                2. There is no mention of a specific significance level.

                Those are both important.

          • User23 6 years ago

            If you've got the money, you can always just increase your sample size until significance is achieved.

    • smichel17 6 years ago

      The problem with 0.05 isn't how lenient it is, but rather the fact that a default exists at all.

      A p-value should be chosen (before running the experiment!) based on how confident the researcher wants to be in their result.

      For my high school statistics final project, I did an experiment to test whether a stupid prank/joke was funny. Had a pretty terrible experimental design (tons of bias) and tiny sample size (<10). Chose a p-value of 0.8 and ended up with a significant result (it was more amusing than our control). And that was fine, because (A) it was not a very important experiment, and (B) my report acknowledged all of this instead of trying to sweep it under the covers and pretend like I had a strong conclusion.

      That would be wildly inappropriate if I were QA testing a new model of airbag or medication. But I wasn't, and I'm not going to use the results for anything other than sharing this anecdote, so it was fine.

      Similarly, I'd say in some A/B testing scenarios it's okay to use a lower standard of proof (though p-hacking is definitely not okay). Especially if you're just using the test as one piece of information to help you decide on the final design. The problem is when people do bad stats and then use the result as an excuse to throw out their human judgment.

      • cepth 6 years ago

        If you chose a p-value threshold of 0.8 and your result came in around there, then results at least that extreme would be expected about 80% of the time even if the null hypothesis were true. So in any case, you do not have a strong conclusion.

        I agree that over reliance on a single metric, like a p-value, gets us Goodhart’s Law type problems.

        In econometrics, and really any other statistics adjacent field, if you’ve correctly estimated your standard errors, and are using something like https://en.m.wikipedia.org/wiki/Newey%E2%80%93West_estimator where appropriate, there is nothing wrong with using a p-value as a general approximation of significance.