Archive FM

Data Skeptic

[MINI] Cross Validation

Duration:
16m
Broadcast on:
25 Jul 2014
Audio Format:
other

This mini-episode discusses the technique called cross-validation: a process by which one randomly divides a dataset into numerous small partitions. One partition is (typically) held out, and the rest are used to train some model. The holdout set can then be used to validate how well the model describes or predicts new data.
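As a rough illustration of the partitioning described above, here is a minimal Python sketch that randomly splits a toy dataset into ten folds and holds one out. The fold count, the seed, and the dataset itself are illustrative assumptions, not details from the episode.

```python
import random

# Toy dataset: 100 examples (the values here are purely illustrative).
data = list(range(100))

random.seed(0)
random.shuffle(data)          # randomize before partitioning

k = 10                        # number of partitions ("folds")
folds = [data[i::k] for i in range(k)]

# Hold out one fold; train on the remaining nine.
holdout = folds[0]
training = [x for fold in folds[1:] for x in fold]

print(len(training), "training examples,", len(holdout), "held out")
```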

(upbeat music) - Welcome to another Data Skeptic podcast mini-episode. I'm here as always with my co-host and wife Linda. - Hello. - Linda, today our topic is cross-validation. Is this something you've ever heard of? - No, I've heard of cross and I've heard of validation. - There you go. - But together, I don't know what that means, other than you're validating something. - All right. - It's not a very good definition. - You can't really talk about cross-validation without first talking about fitting. So we went over last time that you've recently started a new job in the corporate headquarters of a large international fashion company. Given your domain expertise, can you tell me what a fit means? - Well, if we're talking about clothes, which is the company I work for, they make clothing, we're talking about how it hangs on a person and whether it nips and tucks in all the right places. That's what a fit means for clothing, because a fit actually refers to the way the clothing has been sewn, as in the size of each cut. - So one might say it's how well the fabric suits the model. - Well, they have people called fit models, and they claim to be a perfect size six or whatever size it is. - And the garment would exactly fit the model, would that be fair to say? - Yes. - Well, in statistics and data science, we have something similar. Instead of the garment fitting the model, the model needs to fit the data. There's a little wordplay there. You see what I did with that? - If you guys don't know Kyle, he really likes cheesy jokes. - That's not cheesy, that's a good joke. - When I say a model, I don't mean a perfect size six, I mean a description of how we expect data to work. And there are models everywhere; the physicists have a lot of good models. For example, Newton's equations are a model for how gravity works. Many people make models of the stock market. There are explanatory models, which try to explain why the data is a certain way. And then there are predictive models, which are the ones I am much more interested in, which try to say: given what we know from the historical data, what might we infer about the future? So a fit is how well your model matches the data that you're trying to either explain or predict. Think about a weatherman, right? A weatherman or weatherwoman has a lot of models that they use to predict whether it will rain tomorrow, what the temperature will be, that sort of thing. And they might not get the exact temperature within 0.01 degrees, but if they're pretty close most of the time, we would say that's a good fit. - So what if the data is wrong? - What data? - Well, you're saying the data and the model, they have to interact to have a good fit. - Ah, yes, this is actually a great question. It's why we're talking about cross-validation. There are two different types of data to talk about here. There's the historical data, so the forecasters can get access to any information they want to help come up with a prediction. But the prediction is something like: the temperature tomorrow will be 80 degrees. If the temperature turns out to be 79 degrees, that's pretty close. But if the temperature is 20 degrees, that's a very poor prediction. And how accurate your predictions turn out to be once you have the real numbers is a measure of the goodness of fit. So when you're coming up with a model that you want to use to either explain or predict data, you generally go through a process called training the model.
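One simple way to put a number on the goodness of fit described here is mean absolute error: average how far the forecasts landed from what actually happened. A small sketch follows; the temperature values, and the choice of mean absolute error as the metric, are assumptions for illustration rather than anything stated in the episode.

```python
# Forecasts vs. what actually happened (made-up values for illustration).
predicted = [80, 75, 68, 90]
actual    = [79, 74, 70, 71]

# Mean absolute error: the average size of the miss.
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
print(f"mean absolute error: {mae:.2f} degrees")  # smaller means a better fit
```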
Let's say you expect that the temperature tomorrow depends on what the temperature is today, yesterday, and the preceding three days. It's a bit naive, but it's one way someone could come up with a temperature forecasting model. One could start to look at the last n days and maybe, I don't know, take the average, or maybe take a weighted average where today's data is much more relevant than the data from seven days ago. Why would you weight today's data more than yesterday's data, and how much more? That's a question you'd usually try to answer by doing some training on the data. So you would say, okay, if we're gonna make our temperature predictions based on the last seven days of data, let's go back five weeks, see how good that would be if we weight today's data heavily, and repeat that process every day over those five weeks. And if a 90% weighting on today's data gives a better prediction than only an 80% weighting on today's data, you would go with the 90% answer because it gives you better predictive power. So how do you do that? One of the procedures that data scientists often use is called cross-validation. And now you see why it was important that we talked about fits and models and stuff first. The idea here is that if you have a certain number of examples available that you're gonna train your model on, and you show the model all the examples, at the end of it you would expect it to have very high accuracy, because you trained it on everything you know, the whole universe of possible options. And the problem that might happen is that you run into a situation called overfitting, which is where your model is hyper-specific to just the training cases you looked at. Imagine you were looking at sales data from your online retail store. Probably there are some seasonal trends you know about; it would not be too surprising if Black Friday and Cyber Monday were very high sales days for you. And in fact, I know nothing about your sales data, but I would guess that November and December put together probably constitute maybe half of your annual sales. So you will see that there are trends in that data, we call them seasonality, you know, the month, the specific days, even weekday versus weekend, have different sales amounts. But of course, that data is what we call noisy. It's not like you can predict to the penny how much money you're gonna make next Saturday, but you know roughly what range it's gonna be in, right? So imagine you plotted all the recent sales days, day by day, for the last six months, and you wanted to draw a line through those points that best explains the data. Now, if you just play connect the dots and precisely connect every dot to the next, you will have a model that is 100% perfect at explaining the old data. But that model will be incredibly complicated to describe, and it will do this thing we call overfitting the data. Meaning, yes, it explains all your historical data, but only because the model is more complex than the data itself, and it has no predictive power for the future. Whereas if you just drew a simpler, smoother line through that data, it wouldn't hit every point, but it would be much more useful in explaining the data. - Yeah, I could see that. - So one of the tricks people can use to try to avoid overfitting their data is called cross-validation. So let's say you were working with six months of sales data.
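The weight-tuning idea described above might look roughly like the following sketch: forecast tomorrow as a blend of today's temperature and the average of the few days before it, then back-test a couple of candidate weights over the historical series and keep whichever predicts best. The temperature values, the blend formula, and the candidate weights here are illustrative assumptions, not the episode's actual model.

```python
# Daily temperatures, oldest first (made-up values for illustration).
temps = [71, 73, 75, 74, 78, 80, 79, 77, 76, 81, 83, 82, 80, 79,
         78, 80, 84, 85, 83, 82, 81, 79, 80, 82, 84, 86, 85, 83]

def forecast(history, w_today):
    """Predict the next day: blend today's temperature with the
    average of the three days before it."""
    today = history[-1]
    prior_avg = sum(history[-4:-1]) / 3
    return w_today * today + (1 - w_today) * prior_avg

def backtest(w_today):
    """Average absolute forecast error over the historical series."""
    errors = []
    for t in range(4, len(temps)):
        pred = forecast(temps[:t], w_today)
        errors.append(abs(pred - temps[t]))
    return sum(errors) / len(errors)

# "Training" here is just picking whichever weight back-tests best.
for w in (0.8, 0.9):
    print(f"weight on today = {w}: mean absolute error = {backtest(w):.2f}")
```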
What if you split that data up into 10 chunks randomly? You train on only nine of those chunks and keep the other one-tenth as what's called a holdout set; some people call this the leave-one-out approach. And then after you've trained your model, and you have this curve that describes how you think sales data will go, you say, aha, let's apply our model to these test cases it's never seen before and see how well it predicts those. Now in the case where you played connect the dots, since your line never went through the dots you didn't see, it's probably gonna be very weak at predicting those dots. But if you have a model that actually seems to understand the data and has learned the right parameters through training, then one would expect that it should also predict quite well on these new test cases it didn't get to train on. - So to be clear, every model has a range of inaccuracy. - Essentially yes, and you should have a good way of evaluating whether your model is performing well or performing poorly. In fact, if you can't do that, you're pretty much not doing data science at all. If you can't evaluate how good your model is, you're just kind of doing nonsense. - Is there an acceptable amount of data to review and test before you could gauge your accuracy rate? - Ah, great question. Unfortunately, it's very case by case. If you have something very simple you're trying to measure, you might need very few measurements, and your model might have very few parameters. But if you're doing something very complex, or if you need a very high degree of accuracy, then probably you need a lot more data and a lot more training. Another important point I haven't mentioned so far that's absolutely critical, though, is that when you split up your testing and your training, you must do that split as a random sample. If you fail to do that, your entire results are garbage. - Random sample. - That's right. In other words, let's go back to the sales forecast example. If you train on only Monday and Tuesday data, you will get a result that is highly biased towards Monday and Tuesday results, whereas really you need to randomly sample and have basically a good representation of all the possible situations. So let's talk a little more generally about a problem we might wanna solve. Let's say we wanna create a classifier that can identify music as being either jazz or not jazz. And let's assume we don't wanna explain what jazz is. We just wanna have a listener kind of learn by example. If you play them, let's say, 30 jazz songs and 30 non-jazz songs, they should on their own kind of come up with a model and say, I think I've figured out what it is that makes something jazz. And in that case you're telling them, you're giving them a labeled dataset. You're saying these are 30 jazz songs, these are 30 non-jazz songs. If, after that's done, you play them one of the songs they've already heard, they had better get the answer right, because they knew it from the labeled set. But if you play them a song they've never heard before, that's where things get interesting, and you can judge how well the training exercise worked. So if you have, let's say, 100 total songs available, you might wanna split that up randomly into 10 groups of 10 and give the person 90 songs to listen to. And from that they will kind of pick out what features they think make something jazz and not jazz.
And then once they're satisfied and they're done with their training, you would give them the 10 holdout songs that they never got to hear. And you would say, please tell me if you think each of these 10 is jazz or not jazz. If they get 10 out of 10 right, then clearly the training was very good and they got a good classifier out of this. But if they get something much less, maybe one or two right, that's worse than chance, right? You should get about five out of 10 right by chance if there's just jazz and not jazz. Then your classifier, or the classification process going on in that person's mind, is not doing a very good job. - For jazz, I thought there was more like a range, like songs weren't so clear-cut. - Yeah, that's very fair. There might be some that are more challenging than others to classify. Depending on how much detail you want to go into, you might want to segment out by the degree to which something is purely jazz or not. But for the sake of our conversation, let's just assume that it's very binary and clear, and there's no argument that something is or is not jazz. - Oh, okay. - Now, the other important thing about cross-validation that my analogy doesn't really hold up well for is that one person can only listen to those 90 songs one time, right? You can't erase their memory and ask them to start again. But with a computer and an algorithm that you're training, you can apply cross-validation in an iterative process. So you make those 10 partitions I described and then run this thing 10 times, where you go through, let's say, 10 different exercises of training. And in each of those, you hold out 10 different songs for them not to hear, as your test set. So you'd have 10 groups of 10. - And they're random. - Yep. So there's some jazz and some not-jazz songs in pretty much all those groups. And you have 10 groups. So you wanna hold out one of those groups and give the other nine groups to the listener to train on. And by listening to those 90 songs, hopefully they'll pull out some model that helps them classify songs as jazz or not jazz. Then you'd like to know whether that training was successful and to what degree, and also hopefully prevent some overfitting of the data. So you then take those 10 holdout songs and you say, here are 10 new songs where I'm no longer gonna tell you if they're jazz or not jazz. You listen and you tell me. And from that, you can measure the success of that training session. Now, in the case of a person, you can't unhear a song. You can't erase their memory, right? But with a computer, you can always start fresh. So another feature of cross-validation is that you'd wanna repeat the training exercise about 10 times, and in each case you hold out a different one of those groupings. So you have kind of 10 different trainings where one section was held out each time, and you can see how that variation affects the accuracy of the training as well. - Okay, so you wanna leave behind 10 so you can judge the success. But isn't that not enough data, just 10 questions? - Well, that's a good observation actually. It may be; it depends. How many samples you need, what size each of these partitions should be, and how big or small the holdout should be are questions that can vary from situation to situation. I have, in my career, done cross-validation where I partitioned things into 10 and into 20.
I've never had cause to look at other cases, but certainly there are other people out there who are playing with these parameters to find the best way to get their models to train. - So you're saying this is an example of cross-validation? - That's right. - I don't understand what cross-validation is, so we're gonna have to backtrack. - Cross-validation is the process of taking some data you have available for training, splitting it into partitions, or subgroups, pulling a holdout out of that, training on the rest of it, and then using the holdout set as your test cases to see how successful your training was. - So what's the difference between excluding some just for the time being versus just getting new data? - Ah, new data is not always easy to find. So in effect, you're kind of coming up with some virtual new data, if you'll excuse that term. Rather than saying, okay, let's test on all the data we have and then go outside and get some more data, it's like saying, let's just set aside some of the data as if we didn't have it and consider it new after the training is complete. - Okay, so this goes back to: if you have limited data on someone or something that's dead, gone, or unlikely to repeat, then you would make a model, and that would be extremely useful. But for predictive models, maybe you would withhold data because you want to test at a set time. - That's right. - And not have to wait for the data to come in. - Yes. - Okay, I will buy that. (upbeat music)
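For readers who want to see the full iterative procedure from the jazz example end to end, here is a hedged sketch using scikit-learn: 100 synthetic "songs" stand in for the labeled dataset, and each of ten passes trains on 90 of them and grades the held-out 10. The two made-up features, the logistic-regression model, and the random seeds are all assumptions for illustration, not details from the episode.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Toy stand-in for 100 labeled songs: two made-up audio features per song,
# with the "jazz" class shifted so there is actually something to learn.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),   # 50 not-jazz songs
               rng.normal(1.5, 1.0, size=(50, 2))])  # 50 jazz songs
y = np.array([0] * 50 + [1] * 50)                    # 0 = not jazz, 1 = jazz

# Ten random groups of ten; each pass trains on 90 songs and tests on the
# held-out 10, so every song is used as a test case exactly once.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()                          # a fresh "listener" each pass
    model.fit(X[train_idx], y[train_idx])                 # train on the 90 labeled songs
    scores.append(model.score(X[test_idx], y[test_idx]))  # grade the 10 holdouts

print("accuracy per fold:", [round(s, 2) for s in scores])
print("mean accuracy:", round(float(np.mean(scores)), 2))
```

Because every example serves as a test case exactly once, the averaged score is a less optimistic estimate of how the model will handle genuinely new data than its accuracy on the examples it trained on.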