Archive FM

Data Skeptic

[MINI] Experimental Design

Duration:
15m
Broadcast on:
11 Jul 2014
Audio Format:
other

This episode loosely explores the topic of Experimental Design, including hypothesis testing, the importance of statistical tests, and both an everyday and a business example.

(upbeat music)
- Welcome to the Data Skeptic Podcast mini-episodes. I'm here as always with my co-host and wife, Linda.
- Hello.
- Our topic today is experimental design. What does that mean to you?
- I imagine that someone is designing something, there's a product of that design, and they're experimenting with it.
- That's awfully close to what I mean when I say experimental design. By the way, I thought you were gonna pun about modern art. But anyway, it's the design of the experiment itself, with the goal of measuring something that you wanna understand. So you have an idea or a theory or a hypothesis. Maybe it's that on your e-commerce website, if you make the shopping cart bigger, people will understand better where it is and buy more things. You wanna set up an experimental design to test this theory. So you would have some group of people that see the alternate way, and some people that see the old way, and that's your test and your control group. And then from that, you can gather some data and measure revenue generated. So can you think of any instance in your life, either professionally or just in your everyday life, when you've set up an experimental design?
- At my work, we do a lot of A/B testing. We just call it A/B testing. I don't say... experiential? Did you say the word "experiential"?
- Experimental design.
- Oh, experimental design. So how do you use the phrase? "I set up an experimental design"?
- That's not quite how I would say it, although that's not wrong. I would say the experimental design is an A/B test, which would mean we're gonna have an A group and a B group.
- Okay. Yeah, that's exactly what we do. I just started a new job at a corporate retail headquarters, and it's e-commerce, so we're trying to sell things online. We have a lot of customers, and we will often set up A/B tests, meaning we set up one test and send 50% of the users to one experiment. We send the other 50% to another experiment where only one thing is slightly different, and we see if there are different results.
- So when you say you send 50% to one experiment and 50% to another, does that mean you've introduced two things, or you made one group see a change and another group see the same old thing?
- I mean, it just depends what it is. Because if you wanna make a change, like, let's say, change the font, then you have the test group, which is what we call whoever sees the changed thing. They'll see the new font, and the control group would see the old font. But if it's an idea like, we wanna see if 50% off or buy one, get one free works better, that's something totally new, and then it's two experimental groups.
- So how have you approached deciding the percentage of people to send to each group, how long to run a test for, and how to interpret the results?
- Well, that is a question, and I don't really... I don't really go about it necessarily in a scientific way.
- Okay, let's just talk about one question at a time.
- All right.
- So the first one you said is...
- How do you decide what percentage of people?
- You said 50/50, which is fine, but maybe if you're doing something risky, you might wanna limit the potential damage you're doing.
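A minimal sketch, in Python, of the deterministic 50/50 split described here; the `user_id` format and the hashing scheme are illustrative assumptions, not any particular platform's implementation:

```python
import hashlib

def assign_group(user_id: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically bucket a user into 'test' or 'control'.

    Hashing the id (instead of flipping a coin per visit) keeps each
    returning user in the same group for the life of the experiment.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "test" if bucket <= treatment_fraction else "control"

print(assign_group("user-12345"))                           # 50/50 split
print(assign_group("user-12345", treatment_fraction=0.01))  # riskier change, smaller slice
```

Lowering `treatment_fraction` is one way to limit the potential damage of a risky change, at the cost of waiting longer to collect the same amount of data.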
- Yeah, I mean, it just depends what it is. If we're just changing a color, it's probably not that risky, or so we've seen. If there's something risky, then we'll only test a little small test group. We'll say, you know, one percent, or maybe we'll only have it up for a short amount of time. Like, once it reaches its number, turn it off or something, just because we wanna evaluate what the results were before we move forward. So I mean, there's that. It just depends: what are we doing? What are the risks involved? How expensive is it?
- Right.
- How long does it take to set up? How much would it cost to turn it back?
- Right.
- To revert it back to the original. And then you said, for how long? I mean, it depends, 'cause for example, on retail, we have certain pages that get a lot of hits, like a lot of unique visitors. So if we ran an experiment on that page, we don't have to run it that long, 'cause then we get, you know, 10,000 visitors in one day, maybe more.
- That's true.
- But if you run it on a different page, one that's a little further in and deeper, where people have to look to find it, you just don't get enough clicks. So you don't have enough data. So part of it is looking at our estimated traffic for that time span.
- Mm-hmm. And what do you consider to be enough data?
- I mean, generally, I'm not a scientist, so I try and keep my experiments simple, like A and B. And if it's just a simple thing, like different colors, then for the clicks I try to say a thousand is probably a good number. But I mean, that's just a general aim. There's nothing scientific, 'cause we just want enough visitors to get a percentage of data. But, you know, that would also be region-specific. That would be, say, a thousand just for the US. Other regions, we'd have to see how many visitors we get, 'cause a thousand in Brazil might take a really long time for our retail site.
- Sure. And actually that's quite fair. I mean, as you said, a thousand is not a rigorously derived number, but one need not be a statistician to recognize that it's a large sample versus, let's say, 10 people, which is a small sample. Once you have your data, do you go and do a statistical test, or do you eyeball it?
- What is a statistical test?
- For example, a t-test, or an analysis of variance, a chi-square test, or McNemar's, M-C-N-E-M-A-R, McNemar's test, or something like that.
- Mm, I vaguely remember the... did you say t-test?
- Yeah, the t-test is very common.
- I feel like you mentioned that earlier in another episode. So I don't really remember exactly what that test was.
- So the t-test is for looking at two distributions that are normally distributed, which is the first criterion. And if it's not that, the test is not actually a very good one. But assuming that they pass muster in testing for normality, then the t-test helps you decide if those two groupings were drawn from different distributions, which at the end of the day means that they had different means. So when I asked you about experiments, I had a backup plan in case you didn't have an example, but you actually had a great example from industry that you've applied in your professional life. The fake example I was gonna give you is the experiments I've seen you doing with cornbread.
- Ah, yes, I make a lot of cornbread because I love cornbread. And I try to make it the healthiest cornbread I can. And number two, the tastiest. And number three, I don't like dry cornbread. So there's a few things that I'm looking for in my cornbread.
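A minimal sketch of the t-test workflow the host describes above, with simulated order values; the numbers and group sizes are invented for illustration. First check that each group looks roughly normal, then test whether the means differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=50.0, scale=12.0, size=1000)  # saw the old font
test = rng.normal(loc=51.0, scale=12.0, size=1000)     # saw the new font

# First criterion: each sample should look roughly normal.
for name, sample in [("control", control), ("test", test)]:
    _, p_norm = stats.shapiro(sample)
    print(f"{name}: Shapiro-Wilk p = {p_norm:.3f}")  # small p => not normal

# Then ask whether the two groups plausibly share the same mean.
t_stat, p_value = stats.ttest_ind(control, test, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p => means likely differ
```

`equal_var=False` selects Welch's variant, a common default since it doesn't assume the two groups have equal variance.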
- So you have three objective variables: moisture, taste, and healthiness. Is that correct?
- Yes.
- Aha, so it's a three-dimensional problem, or a three-dimensional output. How many input variables do you have?
- You mean, what are my ingredients?
- Yes, and to what degree do you vary them?
- What do I put in the cornbread?
- Yeah.
- I put white flour, whole wheat flour, cornmeal, and then anything else after that. You could put a little salt. And then if I wanna flavor it, that's what I vary.
- Which of the variables are you varying, then?
- I vary all of them.
- All of them, got it. In terms of what percentage they make up of the recipe.
- Yeah, if I change one, then I change the other.
- Right. Okay, so this is actually an interesting experiment. It suffers from one problem, though, because you know what goes into the recipe, and then you are the judge of the output. Is that accurate?
- Yes.
- So a desirable property of a good experiment might be a double-blinded output, in which either you would give the recipe to independent people to cook it and then judge for yourself, or you would do all the cooking and hand your outputs to judges to evaluate on your three criteria.
- Well, I feed it to you. (laughs)
- So it is double-blind.
- So then I ask you what you think. I also feed it to Yoshi the bird, but she has no preference.
- Are my preferences aligned with your objective variables of healthiness, tastiness, and moisture?
- I don't know, I never asked.
- Not much of an experimenter, then.
- But I just ask generally, "Did you like this better or worse than the other one?"
- Uh-huh.
- I just ask a very simple question.
- Well, you face one challenge, then, that we sometimes see in experiments: when you have more than one objective variable, it actually can kind of become an optimization problem. So we can see that while you've done some interesting experiments, they've, for you personally, been only single-blind, which is okay. It just means that since you knew the inputs, you maybe could have been biased about the output. For example, healthiness is an important criterion for you. So if, going into it, you know that you really added a lot of shortening or a lot of sugar, your expectation for healthiness is low, even though your taste might be through the roof, right? Or your moisture might be through the roof, with the shortening especially.
- Yeah.
- So a desirable property, when you can get it, which is not always possible, is to double-blind an experiment. So it's something to consider when doing experimental design. You do seem to do this in both your personal and professional life.
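A minimal sketch of how the cornbread judging could be blinded; the recipe labels and taste scores below are hypothetical:

```python
import random

recipes = {"A": "extra shortening", "B": "whole wheat, low fat"}  # hypothetical batches

# Present each batch under a neutral label so the judge can't tie a
# sample back to its ingredients.
codes = list(recipes)
random.shuffle(codes)
blinded = {f"sample {i + 1}": code for i, code in enumerate(codes)}

# The judge rates samples by label only; unblind afterwards.
scores = {"sample 1": 7, "sample 2": 9}  # hypothetical taste ratings
for label, score in scores.items():
    code = blinded[label]
    print(f"{label} = recipe {code} ({recipes[code]}): taste {score}/10")
```

Blinding the cook as well, say by having someone else relabel the pans, is what would make the tasting double-blind in the sense discussed above.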
- Yep. And then at work, I would say everyone really values data, and some kind of methodical... everyone really values a methodical approach to collecting data. So in that case, I usually come home and I ask your opinion on some kind of data question.
- Yeah, I'm still waiting for my checks from all those consultations.
- Well, we never did them, and they'll never come. Would you say that your experimenting, both at work and in the kitchen, has brought you to higher-quality outputs? Have you learned from your experiments, in other words?
- Sometimes I have. I mean, obviously, you know, for example, at work, I never ran my test through the... what's it called?
- A t-test.
- All those tests.
- Yeah.
- What are they called?
- Those are just statistical tests.
- Statistical tests.
- Yes.
- To what, validate data?
- The purpose of any statistical test is to establish whether or not you want to accept or reject the null hypothesis. You might remember our p-values episode. So the idea being, in the case of e-commerce, we made some change to our website, and our hope is that it will increase sales, or maybe increase the amount of each sale. So the question is: did it affect it? Was the average sale price per green shopping cart higher than per blue shopping cart? You can measure that. That's where your data comes in. And the statistical test will help you establish if you found a meaningful result. So let's say people who had the green shopping cart spend an average of $50.00, and people who had the blue shopping cart spend an average of $50.10. Well, $50.10 is more, but is it statistically significant, or is it due to chance? A statistical test will help you establish whether or not the result was due to chance. But it won't tell you anything about the effect size; that you kind of have to determine for yourself.
- Yeah, so now that I know, next time I should run a test, then subject it to some kind of statistical analysis.
- Do you think your adventures in the kitchen have improved as a result of all your hypothesis testing?
- Well, I think I'm a good cook, so...
- Do you want to share your recipe with our listeners?
- I don't think listeners really want to know a healthy cornbread recipe.
- Wow.
- Well, I fed it to my sister, Kim, and she just went on and said, "You know what would taste even better?" And she said all these things, and I was like, "That's not what I want."
- Yeah, I recall.
- So I just don't think people would like my cornbread.
- There was this famous case study in which some statistics guy came in and helped a spaghetti company. They wanted to ask the question of what's the ideal spaghetti sauce, and this person proposed that actually there is no one ideal; there are clusters of preferences. I think he proposed three clusters. I don't remember them all, but one was chunky, one was watery, and one was something else. So for the subset, or cohort, if you will, of people that prefer healthy cornbread, perhaps you have a recipe for them.
- I don't think people want healthy cornbread. At least that appeared to be the case when Kim took a bite of my cornbread. She did not seem to like it as much as if I had just dumped in a ton of fat and butter.
- Fair enough. Well, maybe we can do an informal survey. If any listeners think they would like the healthy cornbread recipe, which you're treating as this very proprietary thing right now, they can leave us a review on iTunes. Please make it a five-star review, even if you're disappointed you're not getting the cornbread recipe.
- Or you could just tweet about it, and then we'll see how many tweets.
- Well, that's right, @DataSkeptic on Twitter. And I answer it, but I'll tell Linda what's going on. Or I will from now on.
- Okay, well, if listeners make a stink about my cornbread, we should post it.
- All right, I'm gonna get a botnet set up to raise a ruckus about this. Thanks a lot. We'll see you next time.
(upbeat music)
(gentle music)
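A final sketch of the green-versus-blue cart example from the discussion, with invented figures: given enough visitors, even a 10-cent difference can register as statistically significant, yet the effect size remains negligible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000  # visitors per cart color (invented)
green = rng.normal(50.00, 10.0, size=n)  # green-cart order values
blue = rng.normal(50.10, 10.0, size=n)   # blue-cart order values

# Significance: was the 10-cent gap plausibly due to chance?
t_stat, p_value = stats.ttest_ind(blue, green, equal_var=False)

# Effect size (Cohen's d): the gap measured in standard deviations.
pooled_sd = np.sqrt((green.var(ddof=1) + blue.var(ddof=1)) / 2)
d = (blue.mean() - green.mean()) / pooled_sd

print(f"p = {p_value:.4f}")    # likely below 0.05: "significant"
print(f"Cohen's d = {d:.4f}")  # around 0.01: a negligible effect
```

Whether a 10-cent lift justifies shipping the blue cart is a judgment about effect size, which, as noted in the discussion, the statistical test itself won't make for you.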