Archive FM

Data Skeptic

[MINI] p-values

Duration:
16m
Broadcast on:
13 Jun 2014
Audio Format:
other

In this mini, we discuss p-values and their use in hypothesis testing, in the context of a hypothetical experiment on plant flowering, and end with a reference to the Particle Fever documentary and how statistical significance played a role.

[music] Welcome back once again to the Data Skeptic podcast mini episodes. I'm here as always with my co-host and wife Linda. I'm Linda, hello. So Linda, our topic for today is something called p-values. Do you know what these are? Unless they relate to Greenpeace? No. [laughs] So you grow a lot of plants, right? I think they grow themselves. [laughs] But yeah, you could say that. I have plants I care for. Right. So what if someone came along and said, "You know what you need to do for your plants to make them flower? You need to play music for them. In particular, you need to play Scottish music for them, every day in the afternoon, and that will make plants flower more efficiently." Do you think that's a credible hypothesis? No. What type of music would make it credible? I mean, I don't think plants listen to music. If anything about music helps, maybe they like the vibrations. Ah, see, now you have just gotten too logical. I appreciate your skepticism, but let's step back into the realm of nonsense for a moment to entertain this claim: that perhaps by some means, even though, as you pointed out, plants are not capable of listening to music, music might help them in some way. Do you feel qualified to state the null and alternative hypotheses for this claim? No. Would you like me to try? Yes. So the null hypothesis would be that playing music has no relationship whatsoever to the plant flowering, and the alternative hypothesis would be that playing music will have a positive, advantageous effect on making plants flower. And let's assume that you have some plant like the orchids you have over there, which I guess don't flower very often, right? Or they're hard to get to flower? It depends what type of orchid, but most orchids only flower once a year. So how come yours didn't flower for a bunch of years? Because... that's just mean, but I would say it's because they were in a pot that was too big. So orchids like to be cramped. And so it spent several years growing roots, and so it didn't spend its energy making flowers. And so now that its roots have filled out the container, it is now happy enough to make flowers. So let's say we wanted to test this music theory. And we got two different orchids, as identical as we could get. Same species, same breed, same pot, same soil, same everything, as much as we could control. And then for one, we kept it in a room that was, relatively speaking, silent. And for the other, we played music for it every afternoon, all afternoon. And at the end of that... in fact, we should probably use more than just the two orchids, or no, let's just stay with that. Let's say we had two orchids, and it happened to be the case that the one in the music room flowered and the one in the non-music room did not flower. What would be your conclusion from that data? Um, well, I am not an orchid specialist, so I'm not sure what would be the case, but I'm not sure if music would have played into it. Well, given this data, it seems like music playing has 100% efficacy, and non-music playing has 0% efficacy. Might that be the conclusion you would draw? I think you would need a room full of flowers to test this. This is the key concept, because what I presented to you is generally filed under the category of anecdote, or two anecdotes. And I don't know the originator of this quote, but I'm very fond of it, and it's: "The plural of anecdote is not data." Knowing two data points like this, one that flowered and one that didn't, pretty much doesn't tell you anything, but let's say you had two different greenhouses.
Again, everything's all equal except one has music and one doesn't. And with 100 plants in each, in the greenhouse that had music playing, you saw 52 out of 100 plants that flowered, and in the non-music greenhouse, you saw 50 out of 100 plants that flowered. The question you want to ask is, what is the likelihood that this measurement is due to chance? So yes, the music greenhouse produced two more flowering plants than the non-music greenhouse. But is that just the toss of a coin? Or is that because the music has a very small but positive impact? For the exact statistics there, if we treat it purely as numbers, we would look at it from a binomial distribution, because we're saying flowered or not flowered. But realistically, we'd probably want to consult a botanist or a biologist who could tell us a little bit about the nature of the flowering of plants and what variation takes place there, and use that to inform our hypothesis. So we should admit that the example is a little convoluted. But let's step into the key concept, which is the question we want to ask. In fact, the question that pretty much all scientific experiments ask is: what's the likelihood that the result they observed is due to chance? Because we want to either accept or reject the null hypothesis. So what if I told you that the fact that we had two more flowering plants in the music room was 50/50: it could have been chance, or it could have been a real effect? Would that be very convincing for you? Convincing that music works or doesn't work? That it works. No, it's not convincing. So with what reliability would I have to promise you that the effect was not due to chance before you thought it was worth pursuing further? Like a percentage? Yes. At least in the 90s. Interesting. Okay, that's a good number, and that's a number some people actually use. Although when we state it in the literature, typically we refer to it as the complement of that: rather than saying we're 90% sure, we would say the effect is 10% likely to be due to chance. Because 100 minus 90 is 10, and generally that value is called the alpha value of an experiment. And actually the most common value is 5%, which would mean that there's only a 1 in 20 chance, a 1 in 20 possibility, that the effect is due to chance. Although some people use a stricter criterion, such as an alpha of 0.01, meaning there's only one possibility in 100 that the effect is due to chance versus a real effect. And it kind of just depends on the importance of the result and how certain you need to be that the measurements you're taking are not due to chance. So given what I told you about p-values, do you want a higher or lower p-value? I mean, lower. That's right. So if your p-value is lower than your alpha, then you would say we reject the null hypothesis, and rejecting the null hypothesis is implicitly accepting the alternative hypothesis: that the effect you observed is not due to randomness, not due to chance, not due to luck, but is something that's an actual phenomenon you've measured. What is the alpha value again? The alpha value is your tolerance for incorrectly rejecting the null hypothesis. So if your alpha is 0.05, or 5%, then one out of 20 times you will make a mistake, and you will reject the null hypothesis when in fact the null hypothesis was probably the correct one. And this actually touches on one interesting point I wanted to raise that probably won't fit into this mini episode, but an important consideration a lot of people overlook is controlling for multiple comparisons.
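A minimal sketch of the greenhouse example above (not from the episode): it estimates, by simulation, how often chance alone would give the music greenhouse a lead of two or more flowering plants if music had no effect at all. Only the 52-of-100 and 50-of-100 counts come from the conversation; the pooled flowering rate, the simulation approach, the trial count, and the 0.05 alpha are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

music_flowered, quiet_flowered, n = 52, 50, 100
observed_diff = music_flowered - quiet_flowered

# Null hypothesis: music makes no difference, so both greenhouses share one
# underlying flowering probability. Pool the observed counts to estimate it.
pooled_rate = (music_flowered + quiet_flowered) / (2 * n)  # 0.51

# Simulate many pairs of greenhouses under the null and count how often chance
# alone produces a lead at least as large as the one observed (one-sided test).
trials = 100_000
sim_music = rng.binomial(n, pooled_rate, size=trials)
sim_quiet = rng.binomial(n, pooled_rate, size=trials)
p_value = np.mean((sim_music - sim_quiet) >= observed_diff)

alpha = 0.05  # the 1-in-20 tolerance discussed in the episode
print(f"simulated p-value: {p_value:.3f}")
print("reject the null hypothesis" if p_value < alpha else "fail to reject the null hypothesis")
```

Under these assumptions the simulated p-value comes out well above the 0.05 alpha, consistent with the point made above that a 52-to-50 split is easy to produce by chance alone.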
So let's say we were exploring how to grow plants better and we did a thousand different things, such as: in one greenhouse we had music playing and no music in the other; in one greenhouse we ate McDonald's in there for lunch every day, and in the other greenhouse we had Burger King. And if you add in enough variables, eventually, just due to artifacts and noise in the data, there's a high likelihood you'll come up with some nonsense hypothesis that might appear to be true. So a good experiment predefines all its variables, is very transparent about that, and corrects for multiple comparisons. And historically, there's a technique called the Bonferroni correction, which more or less says that the more variables you test, the stricter an alpha value you need to use. So instead of 5%, it's, you know, 5% divided by the number of variables, roughly speaking, that's the level of scrutiny you want to have, which will reduce your likelihood of false positives but might introduce some false negatives. So this is all a very interesting area of research and work that I'm particularly interested in. I just wanted to mention it here because in an upcoming episode, one of my guests is going to mention the Bonferroni correction, and I wanted to make sure listeners had at least some vague understanding of what that meant. Essentially, it means when you're doing many, many comparisons, the odds of coming up with some spurious relationship get high, and Bonferroni is a way of raising the standard of evidence you expect before assuming that some detected phenomenon is true. The other important factor one wants to consider when looking at p-values is the effect size. So in our example with the orchids, where 52 orchids bloomed in the music room and only 50 bloomed in the non-music room, even if for some reason, maybe because of the underlying biological or botanical models, that proved to be statistically significant, the effect size is still rather small. So statistical significance in and of itself is important, but not enough, because one has to also ask, okay, well, maybe it's statistically significant, but does it actually matter? Well, let me ask you this. Have you encountered the concept of statistical significance in your life? Yes. Yeah. We watched that movie, where they built that giant tunnel thing, or giant thing set up that tested something, and then they broke pipes, too, I remember. Oh, oh, this is great. I'm glad you brought this up. You're referring to Particle Fever, that's right, the documentary about the Large Hadron Collider. Yeah, so what's your recollection of the notion of statistical significance there? So in this movie, they were trying to prove something, this particle existed, the Higgs field and the Higgs boson thing existed, right? Correct. And so they had to run tests, and then after they ran these tests, they examined the data, and from the data, they had to compare it to this percentage, which it had to beat, and then everyone believed it. Oh, this is wonderful. I'm so glad you remembered this, because I didn't. This is much better than what I had written out in my notes to talk about. Yes, in fact, and a fine commendation for the filmmaker that he imparted this upon you and your memory. So I can give you a little bit more extra background there, if you care to hear it. Do you? Me? Uh, I guess. My mind already left. You can cut that out. Anyways, sure, I would love to hear it.
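A minimal sketch of the Bonferroni correction described above (not from the episode). The comparisons and their p-values are made up purely for illustration (the red-pot comparison in particular is hypothetical); the idea is just that the per-test threshold becomes the overall alpha divided by the number of tests.

```python
# Made-up per-test p-values, purely for illustration.
alpha = 0.05
p_values = {
    "music vs. no music": 0.04,
    "McDonald's vs. Burger King lunches": 0.03,
    "red pots vs. blue pots": 0.20,
}

# Bonferroni: divide the overall alpha by the number of comparisons,
# so each individual test faces a stricter threshold.
bonferroni_alpha = alpha / len(p_values)

for name, p in p_values.items():
    naive = p < alpha
    corrected = p < bonferroni_alpha
    print(f"{name}: p={p:.2f}  naive significant={naive}  Bonferroni significant={corrected}")
```

With three tests the per-test threshold drops to about 0.017, so results that would look significant at 0.05 no longer clear the bar: fewer false positives, at the cost of possibly missing some real effects.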
The physicists, or some of the theoretical physicists, have come up with something they call the Standard Model, which more or less says that everything is made up of a certain set of particles, what they're made of, and how they relate to one another. And Peter Higgs, in his contribution, said that if all these things are true, then there must be this Higgs particle that exists. So this is a great example of what science is, because he made a verifiable prediction about something that had not yet been observed. He said, "If this theory is true, you will find this particle. If you don't find it, you've falsified the theory. If you do find it, it's consistent." So they said, "Let's go look for it." Yet in looking for it, because it's very small and it's at high energy states, it's very hard to observe and to confirm that it exists. So first and foremost, you have the null hypothesis that it does not exist, and you have the alternative hypothesis that it does. So that part's easy. The hard part is saying, how do we gather the evidence to assess the likelihood that either of these two hypotheses is true? And in their case, they're dealing with very noisy data, where they're seeing things that may or may not suggest that the Higgs is there, but it's hard to say with any certainty from one single observation whether it is or not. The same way we agreed that two orchids in two separate rooms don't tell us anything, but maybe two greenhouses with 200 orchids do tell us something. So their standard of evidence was, I think, 5 or 6 sigma, or was it 4 or 5? I forget how many sigmas of evidence. And in this case, I won't get too deep into what that means, because we'll probably talk more about the Gaussian distribution in the future, or the normal distribution, or, if you don't know what those two words mean, the bell curve. Everyone knows the bell curve, right? Yes, I do. Essentially, the sigma is the width of the bell curve. The more sigmas of confidence you have, the more of the area of the bell curve you can say, with certainty, something falls within. So saying something is 5 or 6 sigma is like saying, I'm 99.999% confident. And I don't know if I got the right number of nines in there; just to say that it's very confident. So the reason they were so happy is because they said that, given the evidence we collected, and we know that there's some noisiness and some imperfection in the data, we can say with almost 100% certainty that we reject the null hypothesis. So I just wanted to backtrack and say during the movie, the movie ended with them presenting their data and talking about the statistical significance. And they basically said the negative: they said, we are certain this is it, except for a blah-blah percentage of error, which is obviously incredibly small. Then this entire auditorium of scientists breaks out and starts crying and screaming. So as you can imagine, if you watch this movie, you will remember. Well, I'm glad it had an impact, because it's certainly a major discovery of our time, and it imparted upon you, amongst many other things, a recollection of what statistical significance means. And I hope that perhaps our conversation today has enhanced that understanding. Maybe, or maybe you just made it cloudier. We'll see. Should we check in in a few episodes? Yeah, ask me again in a few. Okay, thanks, that's a good plan.
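A minimal sketch relating "sigmas of evidence" to the chance probabilities discussed above (not from the episode). It assumes the one-sided convention commonly used in particle physics, where an n-sigma result corresponds to the upper tail of a standard normal distribution beyond n standard deviations; the Higgs discovery announcement used the 5-sigma threshold.

```python
from scipy.stats import norm

# For each sigma level, the one-sided tail probability is the chance of seeing
# a result at least that extreme if the null hypothesis (no Higgs) were true.
for sigma in (1, 2, 3, 4, 5, 6):
    p_chance = norm.sf(sigma)  # survival function: area beyond `sigma` std devs
    print(f"{sigma} sigma: chance probability ≈ {p_chance:.2e} "
          f"(about 1 in {round(1 / p_chance):,})")
```

At 5 sigma the chance probability is roughly one in 3.5 million, which is where the "almost 100% certainty" described in the conversation comes from.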