Archive FM

Data Skeptic

[MINI] The Central Limit Theorem

Duration:
13m
Broadcast on:
16 Oct 2015
Audio Format:
other

The central limit theorem is an important statistical result which states that the mean of a large enough set of independent trials is approximately normally distributed. This episode explores how it might be used to determine whether an Amazon parrot like Yoshi produces more or less waste than an African grey, under the assumption that the individual distributions are not normal.

[music] Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes, just like this one. Our topic for today is the central limit theorem. So Linda, I want to ask you a few things about different animals. You've never had cats, right? Nope. But I have. You took care of my cat one time. When you were gone, I tried to pet your cat. Yeah, and what happened? He ran away. Well, you tried to do more than just pet him. Well, I picked him up. Well, there you go. There's your problem. Clearly, you're not a cat person. You might not be aware of how much cats like fried chicken. Did you know that? No. Oh, if you want a cat to love you, give it some fried chicken. They can smell it from a mile away, too. Do you think there's similar stuff that our bird Yoshi has? She likes seeds. She has a couple of different types of seeds she likes more than others. Well, we just give her safflower seeds. Yeah, we give her peanuts, too, and sunflower seeds. I don't think a peanut is technically a seed. Or is it? It's a nut. It's not a nut. Is it a legume? Yeah, I think it's like something else. Yeah, we're really not taxonomists. But my point is, what if you give her those little pellet things that we bought? So far, she hasn't eaten much. Right. She doesn't like those. So there's certain things for certain pets. If you show it to them, they're like, oh, let me get that. All right. There's a similar thing for statisticians. They like the normal distribution. Have we talked about that enough that you know what it is? Well, the bell curve is a normal distribution, where the expectation is that most of the results rest in the middle. And we describe it in two parameters: its mean and its standard deviation. So why do you think statisticians like this so much? Probably because it's easy. You know, in a way, that's actually true. You can compute things about it very easily. Or things have been pre-computed about it already.
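The point about the normal distribution being easy to compute with can be sketched with Python's standard library; the two parameters (mean and standard deviation) and the height numbers here are made up purely for illustration:

```python
from statistics import NormalDist

# A normal distribution is fully described by two parameters:
# its mean and its standard deviation.
heights = NormalDist(mu=170, sigma=10)  # hypothetical: heights in cm

# Because it is so well studied, answers are one call away:
# what fraction of values fall below 180?
below_180 = heights.cdf(180)

# About 68% of values fall within one standard deviation of the mean.
within_one_sd = heights.cdf(180) - heights.cdf(160)

print(round(below_180, 3))      # 0.841
print(round(within_one_sd, 3))  # 0.683
```

This is the sense in which things are "pre-computed": the cumulative probabilities of the normal distribution are tabulated once and reused everywhere.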
So you can work with it. And I could probably do an hours-long podcast about all the ways in which one loves the normal distribution. We'll set that to the side, this being a mini-episode, and just talk about one particular reason why people love it. And that's because of the central limit theorem, actually, that a lot of properties will converge to the normal distribution, which is a very well-studied distribution. We know tons of things about it. We have tons of side benefits and theorems that derive from it. So if you have normally distributed data, you're on a good launchpad to leverage a ton of work that already exists. And you can compute things about your data very efficiently and track changes in your data very efficiently. And tons and tons of good reasons. So we love a normal distribution, right? I guess. And wouldn't it be great if all of our data was normally distributed then, right? If we love it so much. Yeah, I don't think that is the case. That's true. Data is not necessarily normally distributed. And later on, we'll do many episodes where we talk about how to test for normality. But today, I want to talk about this thing called the central limit theorem, which says that as you take larger and larger samples of a population, the means of those samples will become approximately normally distributed. So you're averaging? Yeah. The average of averages, kind of, in a way. The average of averages. Yeah, it will always become normal. It should be a normal distribution. So let's break this down a little bit more. I don't think I'm really cutting through it. Perhaps to start, we should talk about what sampling is. Can you give me your definition? I assume you just ask people something and they respond. And in a larger setting, like let's say you wanted to take measurements about, let's say, the snacks that our bird Yoshi likes, compared to the snacks that other birds eat, right? We're going to have a bird play date pretty soon. I don't know. It keeps being delayed, we'll see.
But we're hoping, what's the other bird's name? Smokey. We're hoping Smokey, who's an African grey, will come and play with Yoshi. And they're probably going to like different things, right? Do you think Smokey's going to like the safflower seeds? Yes. I think all birds like seeds. Yeah. But they're a little small, you know, compared to the size of Smokey. What about the pellets? Will Smokey like those? No idea. No idea, right? Less information, wider variance. So I'd ask you to think about this since we're on the topic of treats. Imagine if we wanted to know more about what Yoshi actually consumed versus wasted. If you could just for a moment describe what that ratio might look like and how it is a bird produces so much waste when they eat. I think she consumes a third of all the food we give her. A third of the mass of the food we give her, and what happens to the rest? The rest, she just throws away, flings away. But it varies, right? Yeah. Varies by the food, by the day. Probably not normally distributed, I would imagine. If on every day we took a measurement of how much waste she produced, that's going to vary. What are some of the variables? Well, what we fed her, whether it's seeds or pellets, or she likes noodles. Oh yeah, she doesn't mess around with wasted noodles. Maybe her mood for the day. Does she eat consistently day to day? No, I think some days she seems hungrier than others. Yeah, very much so. So I bet if we measured the waste she produced of the food she did eat on a daily basis and we plotted it, it would not be normally distributed. Do you think that's a fair assumption? Yeah. And what if every bird owner in the Los Angeles area did the same for their birds? They'd all have these crazy distributions of waste production, wouldn't you think? Yeah. So what if we wanted to do some comparisons? We don't have a good statistical model, like a parameterized model of those distributions. So what are we to do? How can we manage this situation?
What should we do? Well, let's simplify the question. Let's say we wanted to know if African greys produce more waste than lilac-crowned Amazons. Ooh, what's your hypothesis there? Well, just to be clear, they're both parrots. Yeah. And my hypothesis is it depends on the food and depends on the day. Even if we had these good measurements, the distributions would be a little bit complex. Picture for a moment what that distribution looks like between zero and... she can't possibly produce a pound of waste on a daily basis, can she? Only if we gave her a pound of food, maybe. Maybe if she knocked it on the ground immediately. Yeah. Yeah, so maybe between zero and a pound, or like a quarter of a pound, something reasonable. If you plotted the histogram of how much waste she produced, it would not be normally distributed. We kind of agreed about that. It would probably be a bunch of ups and downs, peaks and stuff, kind of like a roller coaster, wouldn't you think? Maybe. Also, I mean, I think it's important to call out that some birds have like a second stomach. Oh, tell me more about this. So they can store food in a pouch. So they may eat a lot one day, but then the next day they could probably just eat from their pouch. Yeah, so that contributes to why this distribution will not be normal. It'll be this kind of crazy shape. Because on days when they've got a full pouch, they'll eat less. So there's going to be kind of a mode there. And on days where there's an empty pouch, there'll be a mode on the high end. If we had a bunch of birds and we measured this, we'd get this crazy distribution. But we also have what's called a sample mean. So if we take the mean of all those distributions. So they all have their own mean, right? But it could be in the middle of that roller coaster with a bunch of peaks. So it's not necessarily super descriptive.
But if you took the mean of all the means, somehow, almost magically, that becomes normally distributed via the central limit theorem. Is that as shocking to you, Linda, as it was to me when I first learned it? I don't know. I'm not really working with data that much, so I don't know. The way this made sense to me really intuitively was, I thought of like, all right, let's plot the waste produced on a daily basis, make a histogram for like 100 birds. And let's draw those all out on the same screen or the same piece of paper, whatever, together in rows. So they're all on top of one another. And then kind of mark where the mean is on the range between zero and a pound. And then let's wipe out all of the histograms and just let those means all fall into one stack. What do you think that stack would look like? So wait, I don't understand, start over. Can you picture if we made a histogram of the waste Yoshi produced on a daily basis? What are the axes? The x-axis is like bins of how many, I don't know, ounces or grams of waste. And the y-axis is how many days the measurement was in that bin. Isn't time usually on the x-axis? In some graphs; here we're going to talk about how we're just aggregating these measurements, though. Like let's say we weighed you on a daily basis and you made a graph of it. You just took all those measurements and put them in a pile. Like you have some mean weight, so on average, whatever that weight is. But some days you're a little bit above, some days you're a little bit less, right? Yeah. Same idea here. We just plot all of the weights of the waste our bird produced on a daily basis. It would probably produce this sort of roller coaster distribution. But it has a mean, right? If you took the average of all those? Yeah. So imagine the mean, and then let's say we plotted that, we marked the mean. And then we went to a bird meetup and hundreds of other Amazon owners showed up and they had the same graph.
And I just went around and collected the mean value from all of their plots. And then I aggregated those means and I plotted those. Or think of it as if we stacked them all on top of one another. And then we took away the histograms and we just looked at where all the mean values fall on top of one another. Are you following my visual on this audio podcast? A little bit, yeah. All right. So imagine like, yeah, we had those all stacked up and then we let the means all kind of fall down. So when two were the same, they would land on top of one another. That distribution would be normally distributed as the number of samples goes to infinity. That's pretty amazing, right? The sample means are always normally distributed. I guess so, because the bigger the bird, probably the more waste they produce, but the bigger birds are more rare, since most pet birds are small. So how do people use this in the real world? Like how have you used this? In one case, I was working on a problem where we were measuring a bunch of independent locations and the data was kind of wild and peaked and erratic. But we had a lot of observations at each place. And then we could easily calculate the mean of each location. And because of the central limit theorem, we could talk about the whole population. Because the sample means were all normally distributed. So I could talk about the variance and z-scores and fun stuff like that. So that means you need to have a lot of data to use this, huh? Yeah, as the number of samples goes to infinity is when the central limit theorem kicks in. And there's a ton of good research, because obviously nothing ever goes to infinity, right? But there's a ton of good research about how big of a sample size you need before things get really epsilon-close to the normal distribution. So those are more advanced topics that perhaps we'll leave for another day. But surprisingly quickly, a lot of data gets to take advantage of the central limit theorem.
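The stack-the-means picture described above is easy to simulate. This is a minimal sketch, assuming a made-up two-mode "full pouch / empty pouch" waste distribution rather than any real bird measurements:

```python
import random
import statistics

random.seed(0)

# Hypothetical daily waste (in ounces) for one bird: a deliberately
# non-normal, "roller coaster" distribution with two modes -- low-waste
# days (full pouch) and high-waste days (empty pouch).
def one_day():
    if random.random() < 0.5:
        return random.uniform(0.0, 1.0)  # full pouch: little waste
    return random.uniform(3.0, 4.0)      # empty pouch: lots of waste

# Each "bird owner" records 100 days and reports the mean of their plot.
sample_means = [
    statistics.mean(one_day() for _ in range(100))
    for _ in range(2000)
]

# The raw measurements are bimodal, but the reported means pile up
# symmetrically around the true mean (2.0 here), as the central limit
# theorem predicts: roughly 68% land within one standard deviation,
# just as a normal distribution would give.
mu = statistics.mean(sample_means)
sd = statistics.stdev(sample_means)
within = sum(abs(m - mu) <= sd for m in sample_means) / len(sample_means)
print(round(mu, 2), round(within, 2))
```

Plotting a histogram of `sample_means` would show the familiar bell shape, even though no individual bird's histogram looks anything like one.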
I guess maybe I should add why it's really important: it's because it enables us to do hypothesis testing. So getting back to my earlier example, do lilac-crowned Amazons produce more waste, or do African greys produce more waste? Well, if the distributions are really erratic, it's hard for us to say, but if we use the central limit theorem, then we can use something like a t-test to compare the difference in those two means. So we can do hypothesis testing on this well-studied distribution, the normal distribution. So like, what's the history? Who discovered this? That I don't know. Like what were they trying to do? I don't know, I think it's maybe purely... We should look that up. Yeah, we should. The first version was postulated by the French-born mathematician Abraham de Moivre. Oh, thank you. Who, in an article published in 1733, used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin. There you go. Coin tossing. This is per Wikipedia. Thank you for looking it up. That's an interesting addition to the podcast. At least they had lots of data. I was just going to say, I was like, back in the day, where did all the data come from? Yeah, they kept it in weird handwritten logs and stuff. I can't imagine how people lived back then. Well, I was just thinking that they wouldn't have that much data because they're so limited. They don't have the internet to go worldwide. It's just their own little niche. So he just created it by doing coin tosses. Yeah, a lot of fundamental statistical results derive from descriptions of random mechanisms like a coin toss, the binomial distribution. All interesting stuff to discuss in future mini-episodes. So thanks again for joining me, Linda. Thank you, Yoshi and Kyle. Yeah, and until next time, I want to remind everyone to keep thinking skeptically of and with data. Good night, Linda. Good night.
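The t-test comparison mentioned above can be sketched as follows. The group sizes and waste numbers are invented purely for illustration, and only the t statistic is computed; a full test would also look up a p-value for it:

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical daily-waste means (ounces) reported by two groups of
# owners. The shapes of the raw daily data don't matter much, because
# the central limit theorem makes the reported means roughly normal.
greys   = [random.gauss(2.3, 0.3) for _ in range(40)]  # African greys
amazons = [random.gauss(2.0, 0.3) for _ in range(40)]  # lilac-crowned Amazons

# Welch's two-sample t statistic: the difference in means, scaled by
# the combined standard error of the two sample means.
def welch_t(a, b):
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

t = welch_t(greys, amazons)
print(round(t, 2))  # t statistic comparing the two groups
```

A t statistic well above about 2 in magnitude would suggest the difference in means is unlikely to be chance, which is exactly the kind of conclusion the erratic raw distributions alone could not support.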