Archive FM

Data Skeptic

Data Science at Work in LA County

Duration:
41m
Broadcast on:
29 Jul 2015
Audio Format:
other

In this episode, Benjamin Uminsky enlightens us about some of the ways the Los Angeles County Registrar-Recorder/County Clerk leverages data science and analysis to be more effective and efficient with the services it provides to citizens. Our topics range from forecasting to predicting the likelihood that people will volunteer to be poll workers.

Benjamin recently spoke at Big Data Day LA. Videos have not yet been posted, but you can see the slides from his talk Data Mining Forecasting and BI at the RRCC if this episode has left you hungry to learn more.

During the show, Benjamin encouraged any Los Angeles residents who have some time to serve their community to consider becoming a poll worker.

(upbeat music) Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes. - I'm joined today by Benjamin Uminsky. He holds a degree in history and a master's degree in urban planning and public policy. Listeners from the LA area may have had the opportunity to see his recent talk at Big Data Day LA 2015. For more than five years, Benjamin has worked for the Los Angeles County Registrar, currently as a data scientist, working on how the department can leverage data analysis and data mining to do revenue forecasting and develop predictive models around voter turnout and the likelihood of people volunteering to be poll workers. It's that project in particular that I thought would be interesting to share today on the show. So Benjamin, welcome to Data Skeptic. - Hello Kyle, very happy to join you. - I thought maybe just to get started, we could outline what it is the Registrar's Office does, what their key initiatives and responsibilities are. - Sure, absolutely. When I think about the Registrar-Recorder, I think of really three core activities. The one that I find a lot of my time is devoted to is our election activities. So if you want to cast a vote during the election, you could either get a ballot by mail and return it by mail, or you can go to your local precinct and drop it off at the precinct box, or you can go in person and actually vote at the polls. The polls, the collection of your ballots, our department is directly responsible for all of that. So we're responsible for the logistics, we're responsible for the tallying of the votes on election night, we're responsible for conveying the results of the election to the Secretary of State, certifying the actual election. We handle anything election-related. Now, there are city jurisdictions like L.A. City, there are a number of water districts, there are small municipalities that run their own elections.
But when it comes to a statewide election, such as a gubernatorial election or a presidential election, our department's responsible for voting in all of L.A. County, both in the unincorporated areas and in the incorporated areas as well. So voting is super important, that's a key function for our department. As well as what we call vital records; there's this saying that we have to describe vital records: hatch 'em, match 'em and dispatch 'em. So that relates to birth records: if you're born in L.A. County, your birth record comes to the department, where it's registered as a birth certificate and will be on file in perpetuity. If you ever want to get a copy, you'd be able to come to us so we could issue you a certified copy. Same with marriage certificates: if you want to get married in L.A. County, you do the same, and we keep that certificate as well. And if you pass on into the great beyond while you're here in L.A. County, then your death will be registered with our department. Lastly, and this is a very important one as well, and this is where we've been doing a lot of time series forecasting for our revenues, is real estate records. If you want to buy a home or you want to transact some sort of property transaction, you will be recording all of that with our department. So a deed, a grant deed, a reconveyance. There are well over 140 different title documents that you could theoretically record with our department. So those are really the three core areas that we handle as a function of L.A. County government. - Well, that's a lot of important stuff. Voting especially, we need high precision on that, so I can definitely appreciate how much data and analysis must go into it. Data science in general, I think, can mean a lot of things in a lot of different places. What does it mean to be a data scientist at the County Registrar's Office? - So I think it means a lot of different things.
There are some basic exploratory data analytics that I do in terms of just creating reports, right? So if my boss asked me, hey, how many voters received a vote-by-mail ballot but ended up showing up at the polls and exchanging it so they could vote at the polls, we'd have to do a deep dive into our voting history data set, but we can know those things, right? So it's a matter of just writing some code in R to extract those answers, right? So that's basic business intelligence. Likewise, you can look at stuff on productivity and whether or not our employees are away from work too much, you know, depending on what you mean by too much, so we can create benchmarks as to how productive employees are or how productive a section is at meeting the demands of our customers. Those are all really business-intelligence types of things that you could do with a lot of different programs. For me, I like to use R because I know the R coding and I find it really easy to do the exploratory data analysis. But in addition to that, I also think about other things, such as when we hunt for duplicate records, right? I'll give you a really good example of how we were able to avoid having our staff look at 103,000 records manually to identify duplicate voting records, right? And the way we did that was just through random sampling and identifying how many real true positives we should be looking for in a large haystack of a mixture of true positives and false positives, right? We ran this interesting query on our voter file 'cause we wanted to, we're always in search of potential duplicate voter records, and those records can come about in a myriad of different ways. Sometimes you have two records and it's the same first name, slightly different last name, because maybe there was some sort of clerical error in the input, or the voter provided a different name and it matches on a lot of other things, right?
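A report like the one Benjamin's boss might request — how many vote-by-mail voters exchanged their ballot at the polls — reduces to a simple filtered count. He does this in R; below is a minimal Python sketch over hypothetical voting-history rows (the field names are invented for illustration):

```python
# Toy voting-history rows: one per voter per election. Field names are
# hypothetical, not the department's actual schema.
history = [
    {"voter": "A", "vbm_issued": True,  "voted_at_polls": True},   # exchanged ballot
    {"voter": "B", "vbm_issued": True,  "voted_at_polls": False},  # voted by mail
    {"voter": "C", "vbm_issued": False, "voted_at_polls": True},   # ordinary poll voter
]

# The question: how many vote-by-mail voters showed up at the polls anyway?
exchanged = sum(1 for row in history if row["vbm_issued"] and row["voted_at_polls"])
print(exchanged)
```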
But in this particular case, we were audited by the Auditor-Controller and, I don't remember exactly how long ago, it was a little bit longer than six to nine months ago, it was last summer, we had an NBC investigative report done on duplicate records. It really sent us into overdrive in terms of doing a deep dive into our voter file to find as many of these duplicate voters as we can, because obviously we don't want to keep records on the voter file that should not be there. One of the queries that was run by NBC, and also by the Auditor-Controller's department that did the audit, was an exact match on first name, last name and home address. What do you think that returned? So let me give you some context, right? We had at the time roughly 4.9 million individual voting records that were in our voter file. When we ran that, or when they ran that type of query, it returned back to us about 52,000 duplicates, which yielded a little over 100,000 actual unique records. How would we go about trying to identify those? We could have spent the staff time and literally manually inspected every single one of those records against the identified duplicate. But what we realized when we started sampling was that, by and large, most of those records were false positives. And the way we knew that was we started looking at some of the duplicates, just visually inspecting, and noticing that, wow, these are father and son combinations. - I was just gonna guess that, yeah. - There's a ton of that stuff. When we did a random sample of the 52,000, as it turns out, close to 15, actually it was close to 16%, turned out to be true positives. So for us, we realized, wow, we probably don't need to be spending weeks, if not months, visually inspecting every one of these records. We just need to find a way to parse out where the true dupes exist versus the false positives.
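The sampling step described here can be sketched simply: inspect a random subset of the candidate pairs, measure the true-positive rate, and extrapolate to the whole list. This is an illustrative Python sketch with synthetic data tuned to mirror the roughly 16% rate Benjamin mentions; the function and labels are hypothetical:

```python
import random

def estimate_true_dupe_count(candidate_pairs, is_true_dupe, sample_size=400, seed=42):
    """Estimate how many true duplicates hide in a large candidate list
    by manually inspecting only a random sample of it."""
    rng = random.Random(seed)
    sample = rng.sample(candidate_pairs, min(sample_size, len(candidate_pairs)))
    hits = sum(1 for pair in sample if is_true_dupe(pair))
    rate = hits / len(sample)                        # sample true-positive rate
    return rate, round(rate * len(candidate_pairs))  # extrapolated total

# Synthetic data: ~16% of 52,000 candidate pairs are true dupes (DOB typos),
# the rest are false positives (e.g. father/son), echoing the episode's figures.
pairs = [("dob_typo" if i % 25 < 4 else "father_son", i) for i in range(52000)]
rate, est = estimate_true_dupe_count(pairs, lambda p: p[0] == "dob_typo")
print(f"sampled rate: {rate:.1%}, estimated true dupes: {est}")
```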
And the way we did that was, we noticed in the date of birth field, we could know when there's a generational gap between two records, right? So father and son, clearly there's a generational gap there. But what was happening, and what was causing the vast majority of the true duplicates, was a slight error in the date of birth between two records. So let's say you were born September 13th, 1978; well, your other record that was in the voter file somehow, instead of September 13th, might be showing September 16th, right? So there might have been some sort of mis-key that occurred when we were inputting that information into the voter file that was causing what we'd see as a duplicate, right? So how can we actually parse out those generational differences in DOB versus those actual close matches on date of birth and the other core fields, so first name, last name, home address and date of birth? And we ended up using a fuzzy matching implementation in R, specifically the agrep function. I ran a number of different tests to figure out what the appropriate cost threshold was and was able to really start separating out those true dupes from those false positives. And at the end of all this, when I found the right combination of costs in the agrep function, I ended up reducing that number from 52,000 duplicates that we would have manually inspected down to about 8,500, 8,600 that we were able to send the staff to manually inspect. As it turns out, our target number was about 8,300 true dupes. And as it turns out, we found all 8,300 among those 8,500. So for us, it was a way to reduce the expenditure of resources, just using good statistical tools available to us. - I'm gonna put you on the spot a little bit here. How many man-hours were saved per line of code in this activity? - That's a really good question. So at the time, I hadn't really used the agrep function. It wasn't very familiar to me. It was the first time I was really exploring fuzzy matching.
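Benjamin's actual implementation used R's agrep with a tuned cost threshold. As a rough Python analogue of the idea: threshold the edit distance (agrep's "cost") on the date-of-birth strings, so a one-character mis-key scores low while a generational gap scores high. The threshold of 2 here is an arbitrary stand-in for his tuned cost, and the records are toys:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance -- the 'cost' that agrep thresholds on."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def likely_same_person(rec_a, rec_b, max_cost=2):
    """Flag a pair as a true dupe when the DOBs are nearly identical
    (a mis-key) rather than a generation apart (father and son)."""
    return edit_distance(rec_a["dob"], rec_b["dob"]) <= max_cost

a = {"name": "John Smith", "dob": "1978-09-13"}
b = {"name": "John Smith", "dob": "1978-09-16"}  # likely a mis-keyed day
c = {"name": "John Smith", "dob": "1950-02-01"}  # likely the father
print(likely_same_person(a, b), likely_same_person(a, c))
```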
When I started looking at it, I realized one of the core limitations of agrep was you could only feed it a single pattern. So for me, I was like, wait a minute, one pattern? I have like 52,000 patterns I need to match against themselves. Like, what am I supposed to do? So I was like, okay, maybe I could use an sapply to see if I can start feeding it a number of different patterns. Well, as it turns out, there's a lot of great implementation for parallel processing. And that was another area that I hadn't really taken advantage of, but I have a wonderful server-grade computer that I use. And so I had 20 cores that I could theoretically tap into when I start parallel processing. So it took me a little bit of time to learn how to combine agrep with parallel processing, but I found the right coding. And I'll give a major shout-out to Stack Overflow. I'm on that daily. And I was able to essentially write a function that was able to do the matching. And in terms of the amount of time that the algorithm runs, I think it runs upwards of five to six minutes to go over close to 52,000 records. So man-hours, I probably spent a good amount of time trying to figure out how to do what I wanted to do. But now that I've done it, I can just run it over and over as new voter files come in. And it takes seconds at this point. So I feel like I'm saving myself so much time in the long run with the expenditure of that time up front. - Oh, agreed. I actually meant it in a different way, but I imagine you saved a lot of manual deduplication. - Yeah, no, I can tell you, at the pace that we go, it probably would have taken upwards of maybe four to six months to go over 103,000 records, right? Whereas going over about 17,000 records manually, that took us, I think, close to three to four weeks. So it was a lot of time that we saved, where we didn't have to go looking through a bunch of false positives and could really hone in on the true positives that we were trying to find.
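The pattern he describes — one fuzzy search per record, fanned out across cores — might look like the following in Python (a sketch of the idea, not his R code; the four-record "voter file" and the cost threshold of 1 are toys):

```python
from concurrent.futures import ProcessPoolExecutor

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, standing in for agrep's cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def close_matches(args):
    """Worker: find records within the cost threshold of one pattern."""
    pattern, records, max_cost = args
    return [r for r in records if r != pattern and edit_distance(pattern, r) <= max_cost]

if __name__ == "__main__":
    records = ["1978-09-13", "1978-09-16", "1950-02-01"]
    jobs = [(p, records, 1) for p in records]
    # Fan the per-pattern searches out across worker processes --
    # the same idea as pairing R's agrep with a parallel apply.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(close_matches, jobs))
    dupes = sorted({p for p, found in zip(records, results) if found})
    print(dupes)
```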
- There's a lot of discussion in our community, at least that I've only heard in the last few years, about data lakes, trying to bring everything together in one place. But I would imagine the diversity of services the city offers makes that quite a challenge, given the variety of departments and the data they're working with. So in terms of what you touch and what's under your purview, what sort of challenges do you face in the variety of data that comes across the problems you need to tackle? - Yeah, I can tell you, I utilize a number of different data sets, even just within our department, right? So we have some wonderful databases that we use, databases that we've created, and enterprise systems that we've actually programmed within the department. It's a real point of pride for the IT division within our department. And from those enterprise systems, I'm able to draw lots and lots of different types of data sets, right? So if I want counts of how many real estate transactions we did in a given day or over the year, I can draw down that data. If I wanted to look at our voter file, I'd have to go to a different database and pull down data from there. If I wanted to know things about our absenteeism, or who's showing up to work, that's a whole other data set that I'd have to look at. And that's just within our own department. And there are wonderful data sets in other county departments outside the Registrar-Recorder that we're trying to access as well. We're relying upon state data out of what we call the Electronic Death Registration System to pull down death indices of potential voters that may have already passed away that we need to pull off our voter file, because we don't necessarily track all that information. And we can get that data from our Department of Public Health. So there's lots of great places to access data.
And I can tell you, though, one of the biggest challenges is that when you have a data scientist, and no one's ever asked these departments for these data sets before, there's an initial hesitancy about sharing that data. It's like, wait a minute, why are you asking for this? I don't think I'm allowed to give it to you. So there's a lot of back and forth trying to establish my bona fides and explain to them what our business case is in getting this data. So it can definitely be a challenge. - I know this is probably a struggle you face on a day-to-day basis, but it's maybe comforting to me as a citizen that there's some red tape involved, because there are so many concerns about data privacy these days. It's refreshing to hear that city departments are hesitant to share and want to think through the use cases. - Oh, absolutely, and I can tell you, especially when it comes to health data, HIPAA is a major regulation in place. So if I want to know things, you know, in terms of what data is being kept by the Department of Mental Health or what's kept by the Department of Health Services, I would never be able to touch that data, because I have no need to access it, and that's just data that they cannot share under the current legal environment. Now, can things change in the future? Certainly. And can certain data sets be sanitized and made available either to the public or to other departments? Absolutely. But that's part of the challenge of figuring out what the data needs are and how we can go about sharing those things without violating the law. - So in terms of the duplicate voter records, what's the real concern on the back end? Like, is it that someone could vote twice, or there could be fraudulent voting? Is it that sort of thing we're trying to avoid, or just having an accurate count of who's eligible to vote? - So there's a lot of issues going on.
I had referenced an audit that our department went through, and I'm not ashamed to share that, because departments are audited all the time and this is not something that we feared, because we've always been engaged in trying to maintain a good voter file. But sometimes things slip into the voter file, or you haven't done certain cleaning of the voter file in certain months because you've been pulled into doing other things; you get pulled in all sorts of directions. So there can be vulnerabilities, right? But I'm very happy to share with you that when we were audited, they looked specifically for those issues that you're talking about: oh, did these voters vote more than once? I can tell you, according to the Auditor-Controller, when they did their audit, not a single instance was discovered in which a voter voted more than once. I can tell you, as a department, we felt really good about that. It wasn't a major fear for us that that would be happening. I think there's a lot of mythology that has been created. And I think it comes from a lot of different directions. I think part of it's born out of a little bit of ignorance about how the voter file is maintained and what steps are being taken by different jurisdictions to make sure that the voter file is clean. But I think it is natural for certain people to be curious about that or to be concerned. And I am happy to share that we have not seen that type of thing when audited. That aside, I think there's still just a core understanding by our management team and our executive office that we owe it to ourselves and we owe it to the public to maintain the voter file in the way it needs to be maintained. And I think we're trying to be as proactive as possible in doing those types of things. - Yeah, I know the mayor's office has made it, not as primary, but I think maybe the number two or three initiative, that open data should be important for the city.
And I've really seen that taking place over the last couple of years. There's been a real expansion of what's available in LA's open data portal and, I think, a commitment from the city to it. I'm curious as to whether or not you're accessing data through that portal, or if you're working mostly with closed sets that are accessible to you that wouldn't or shouldn't be available to the public. - So a lot of the data sets I'm working with are of course confidential. So there are data points in the data sets I'm working with that will never be able to be shared with the public. Now, could some of the data sets I'm working with be sanitized, with the confidential data simply removed, and then be shared with the public? Yeah, those are things that we could look at and ultimately put on the open data portal, similar to what LA City has been doing. LA County has its own data portal, and we have put some data sets on the data portal from our department, and we're always looking at other opportunities to find new data sets and share them with the public. Now, in terms of my access, I haven't really explored LA City's open data portal that much, but of course I've done a bit of a dive into our LA County open data portal, and I can tell you there's a lot of interesting things on that data portal. I think as a fellow data scientist you might be really interested to see which departments are sharing data and what types of data points exist that the public has access to. - Absolutely, would you mind sharing the URL for listeners that wanna check it out? - It's data.lacounty.gov, and looking at the site, there are nice links up front about the budget. So if you want to know how dollars have been spent in LA County, if you wanna know about employee salaries, if you wanna know about health data, there's maps and GIS data, property and planning, there's all sorts of really interesting stuff.
Oh, and Kyle, just to differentiate, so your listeners aren't confused: there's the city of Los Angeles, which has its own data portal, and then the county. The county covers essentially all of LA County, where LA City just covers LA City. - Yes, thank you. I've made that mistake a couple of times in our conversation, but yeah, I know you're distinct entities, and both are adding a lot of open data value there. So I know not everyone knows a lot about polling. I thought maybe we could just take a moment and describe a little bit about what the process is and what's required to accomplish it. - So our department doesn't do any type of polling whatsoever, and for really good reason, because for us the only thing that matters is the accessibility of voting to the voting public. Now, how they vote, who shows up? Obviously there are campaigns out there that are really interested in getting their folks out to vote, and they're really interested in the polls to see how their person is doing, right? But for our purposes as an administrator of the voting operation, we can show absolutely no bias toward one candidate over another. So that's an area that our department does not involve itself in, in any way, shape or form. - Yeah, that makes sense. Isn't there some work done on, like, the people that volunteer to help with the polls? - Yes, so that's something a little bit different. So we're talking about our poll workers, right? - Yes. - So our poll workers, those are the folks that, when you go to a precinct and you wanna cast a vote, there's usually someone or a number of people sitting at a table, helping you check in, making sure that you're in the voter file, who are gonna give you a ballot. And you will sign off in the roster of voters and cast your ballot. Now, those folks are, I think technically, volunteers, although we do pay a stipend to cover the cost of the day. Those folks are contacted by our department.
We have what we call poll recruiters. And before any election, we are calling these folks, making sure that we're able to get everyone we need at our precincts so that everything's staffed and we can open up all of our precincts come election day at seven o'clock in the morning. There's a bit of trickiness to this, because I can tell you, it's not the most popular volunteer job for county residents. And it can be really difficult. So let me take this moment to encourage all of your listeners to really make that volunteer commitment if you're contacted by our department, or to contact our department. Let us know if you're really interested, because we're always looking for volunteers. But as part of trying to find those volunteers, right now, we're not using any kind of predictive analytics. We just leave it up to the intuitions of our poll recruiters themselves, because what they usually start off with is lists for the areas that they're trying to recruit from. And they're gonna call the people that, in their mind, "Oh yeah, Joe worked the last election, so I'm gonna call him because he's probably gonna work again," right? So you have those basic intuitions that our poll recruiters are using to find those poll workers. But imagine, and this is the sell that I was giving to our managers that were managing this operation, imagine if we could find a way to essentially amalgamate all of your intuition, right? Not just Fred, who's really good at recruiting in these locations, and Kelly, who's really good at recruiting in these locations, but maybe we could do a deep dive into previous instances in which you recruited poll workers and you had certain outcomes, where you found out, oh, I called Fred over here and he told me yes, I will work, but then he ended up not showing, right? And the thing is, not a single one of our poll recruiters has the time to sift through their own data.
There's just no way they can do it and look for those patterns. They're only gonna rely upon the intuitions and instincts they develop over time during recruitment. But we could do a deep dive into the data that we have, and we have really rich data sets that tell a lot about every single instance in which we called someone and what they told us. So if we called someone, we have a record that we called them, and if they told us, no, thank you, I'm not interested at this time, we use a code that tells us, oh, yeah, this person told us they were unavailable. Or they work for us and everything worked out: we called them, they agreed to work, it worked out, and we list them as active. Or maybe they don't show up, and that's really terrible for us, when they give us a commitment and then no-show, because, I mean, a really horrible situation would be if you had four or five people that need to open up a poll and none of them showed up. That means we're in real trouble, because there are voters waiting to vote at that poll and there's no one there to serve them. So we need to avoid those types of negative outcomes. So the question is, how can we find those patterns in previous instances of recruiting? And I can tell you, the patterns are definitely there. The question for us is, as a matter of human behavior, what drives a poll worker to make a commitment to volunteering at the poll and essentially carrying through on that commitment? So what data sets are available to us? What can we know, right? Well, we can know, did they volunteer in the previous election? We can know what their level of civic engagement is, and we can develop that feature by looking at their voting history, right? So a question for us is, is previous voting history correlated with volunteering at the polls? Intuitively, to me, I think it is, and we're gonna be learning about that as we continue to explore this data. Likewise, what about income?
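Backing up a step, the call-outcome codes described above lend themselves to simple per-person features. A hypothetical sketch of deriving a follow-through rate from coded call logs (the codes and names are invented, not the department's actual scheme):

```python
from collections import defaultdict

# Hypothetical outcome codes from recruitment call logs like those described.
calls = [
    {"person": "Fred", "outcome": "agreed_worked"},
    {"person": "Fred", "outcome": "agreed_no_show"},
    {"person": "Joe",  "outcome": "declined"},
    {"person": "Joe",  "outcome": "agreed_worked"},
]

tally = defaultdict(lambda: {"agreed": 0, "worked": 0})
for c in calls:
    if c["outcome"].startswith("agreed"):
        tally[c["person"]]["agreed"] += 1
        tally[c["person"]]["worked"] += c["outcome"] == "agreed_worked"

# Follow-through rate: of the times someone said yes, how often did they show?
rates = {p: t["worked"] / t["agreed"] for p, t in tally.items()}
print(rates)
```

A feature like this is exactly the kind of thing no recruiter has time to compute by hand across thousands of call records.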
If someone is in a low income bracket, are they more interested in working because there's a stipend associated with it, right? So what are the key data points that can help serve as predictors in answering those questions, as to, hey, if we call this particular person listed in our voter file, are they more likely or less likely to say yes to volunteering? What we ultimately wanna do is provide our poll recruiters essentially stacked lists, right? So instead of working the list that they're currently working, maybe we can reorder those lists based on high probability of saying yes versus high probability of saying no or being a no-show. Because there are certain outcomes we wanna obtain and certain outcomes we wanna avoid. - I like that a lot. It's one of those great situations where even if you ordered it randomly, you're probably not hurting their process, 'cause all they did before was start at the top and go down. But with every piece of insight you provide, you've just made the poll process that much more efficient. - Absolutely. And I can tell you there was a really good saying. When I was explaining all this to our poll recruiters and their supervisors, one of them came up with this analogy as he understood what the predictive algorithm was gonna do. He basically understood it as: I was gonna be giving him and his staff royal flushes all the time. - I like that. - So yeah, he thought of royal flushes as, oh, here's my list, or my cards that I'm gonna have to use in this game of poll worker recruitment, and hey, give me a bunch of aces and kings and that's gonna be awesome for us. And that's ultimately what I wanna provide to them in terms of making their lives much easier. - Yeah, it's a great way to look at it.
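The "stacked list" idea is ultimately just a sort by predicted probability. Assuming a model has already scored each recruit (the names and probabilities below are made up for illustration), the reordering is one line:

```python
# Hypothetical scored recruits: each probability is the model's estimate
# that the person will agree to work AND actually show up.
recruits = [
    {"name": "Fred",  "p_work": 0.31},
    {"name": "Kelly", "p_work": 0.88},
    {"name": "Joe",   "p_work": 0.64},
]

# The "stacked list": recruiters call the highest-probability people first.
call_order = sorted(recruits, key=lambda r: r["p_work"], reverse=True)
print([r["name"] for r in call_order])
```

Since recruiters already work their lists top to bottom, swapping in this ordering changes nothing about their workflow, only the quality of the hand they're dealt.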
So you touched on a few of the data points you have available, and I imagine maybe there are some things that you aren't able to reveal, but could you talk a bit more about some of the features that you're able to look at to help predict who's likely to respond positively? - As it may surprise you, or maybe not surprise you, we do collect some demographic information on our voters, right? So as part of being eligible to vote, you need to provide your date of birth, right? So by providing date of birth, at any given point, so on election day, we can know what your age is, right? So that's a demographic data point. Likewise, we do collect gender, so we can know what your gender is. But there are a number of data points that we just simply don't collect. Partly, it's not mandated by law, and partly, I think there would be concern by the constituency of, oh, should we be collecting this data? Are we even entitled to be getting it, so on and so forth? That doesn't mean, though, that some of the data that we're not collecting isn't available to the department in other forms, and it certainly doesn't mean that those data points wouldn't be useful to a prediction algorithm. So I'll give you a for-instance, right? Income level, right? We don't collect income on any of our voters or any of our poll workers. We don't ask those questions. But I can tell you, the Census Bureau has some really rich data sets that can provide us some approximation of that data. Now, a lot of Census Bureau data is presented at the census tract level. And we know, for all of our voters, for all of our poll workers, where they live, right? So we can use GIS to geocode where they live and do a match with what census tract they belong in. So we can look at that census tract and start estimating, for this particular census tract, what is the median income? But there's a level below that that can add precision, and that's the block group level.
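Once an address has been geocoded to a census tract or block group, attaching an area-level income estimate is just a join against a Census table. A minimal sketch with invented block-group IDs and income figures (real ones would come from Census Bureau data):

```python
# Hypothetical block-group median incomes, as might be pulled from
# Census Bureau tables (IDs and dollar figures are invented).
median_income_by_block_group = {
    "060372653021": 41_200,
    "060372653022": 97_500,
}

# Poll-worker records already geocoded to a block group by the GIS team.
workers = [
    {"name": "Fred",  "block_group": "060372653021"},
    {"name": "Kelly", "block_group": "060372653022"},
]

# Join: attach the area-level income estimate as a model feature,
# since income is never collected from voters directly.
for w in workers:
    w["est_median_income"] = median_income_by_block_group[w["block_group"]]

print([(w["name"], w["est_median_income"]) for w in workers])
```

The same join pattern works for any other block-group-level Census variable, such as educational attainment.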
This is a much smaller grouping in the Census Bureau data, and we're now using that level, because before, I was only familiar with the census tract as I was looking through the Census Bureau data; we can look at it at an even smaller level. And as it turns out, our GIS team can do exactly what they do with census tracts and essentially tell me, for this person's address, what block group does it correspond with? So I can pull that data in from the Census Bureau. Likewise, if we wanted to know things about educational achievement, all of that data exists in the Census Bureau data. It's really amazing what the Census Bureau has and makes available to the public. - So tell me a little bit about some of the techniques you apply to do those predictions. - So I think what we're gonna be doing, or what we've already started doing now that we're at the point of modeling, is there's a whole host of machine learning algorithms that we'll be using. In fact, we just recently ran a CART model and a random forest model, but there are gonna be other algorithms, such as a neural net or some kind of GBM. And in addition to those algorithms, there's a lot of controls that you'd want to use. So what I'm ultimately talking about is just adjusting the training parameters to find the right flexibility in those machine learning algorithms and get better prediction without overfitting, of course. In addition to that, I think there could be some real use in finding clusters of our poll workers and then ultimately predicting on those individual clusters. We might be able to squeeze out some additional percentage points of accuracy by doing that. I think there's a lot of different techniques that we could really be employing to find as accurate a model as possible. - Yeah, definitely. So you had mentioned earlier, just sort of a call to action. Anyone who felt like they want to do their civic duty, while we're on the topic, how could one volunteer if they were so inclined? - Absolutely.
So we have a whole recruitment section. I don't have the number in front of me, but you can contact our department, the Registrar-Recorder/County Clerk. You just need to let them know that you're interested in being a poll worker, and you will absolutely be directed to one of our poll recruiters. And you might need to fill out a very, very short form, just as an application, so we can have your information on file and show that you've expressed interest in working at the polls, so that you will show up on a list for one of our recruiters to be contacted for the next election. And the other thing, too, is there are a lot of interesting data science projects that we're ultimately interested in, beyond just the poll worker predictive model. We're gonna be moving from what we call the precinct model, or we're gonna attempt to move from the precinct model, to the vote center model. In any given election, there are roughly 4,500 to 5,000 individual precincts. And what we hope to do is find that happy medium of vote centers, probably close to around 800, and be open for more days. So not just that Tuesday for election day, but more days, maybe certain weekends, but there are gonna be fewer centers than there are precincts. Now, the question for us ultimately is, what is the optimal placement of these vote centers? Because what we don't wanna do is discourage voting. If we put a vote center in an area that is too far for a number of voters and there's no other vote center that's close enough, maybe they'll be discouraged and not wanna participate. That's something that we do not wanna do. So we're gonna have to find interesting statistical and computational techniques to essentially predict where the optimal vote center placement is gonna be. And that would be a really interesting project that I'd love to tackle. - Yeah, that's a fascinating project. Is that something that's currently on your table or on the near horizon?
- I think it's the near horizon because right now, we're working with a wonderful company called IDEO. They're up north. And they're assisting us with essentially developing a new voting system. And part of that voting system is essentially new voting equipment, making things more accessible to voters and exploring development of vote centers. Now, as we move towards that vote center model, what we don't wanna do is essentially just throw things up on the board and say, oh yeah, we threw the dart and it landed on the map and that's where we'll put the vote center, right? What we ultimately wanna do, and I would say we're probably maybe like four to six months away from really jumping into the data, is we wanna start looking at how we can start predicting who's likely to show up to vote and how they wanna vote. So if they're gonna vote by mail and they're not gonna show up at a poll location, that's something that we wanna know about that voter. So if we can predict that, that's a voter that we won't necessarily have to worry about making it to a vote center, right? So we wanna be able to subset the data in a way so that we're predicting on voters who are actually going to be at these vote centers. So we're gonna have to look at a number of different data sets to help us predict that, right? So this project has lots of small prediction models that are gonna assist in a larger prediction model. And that's ultimately knowing where those vote centers really need to be placed. And I can tell you where the idea came to me. I don't know if you've attended the useR! conference. Did you go to that in 2014 when it was at UCLA? - At UCLA, yeah, I was there. - Yeah, so I went to one of the talks, and the talk was by, I forget the name of the company, I feel terrible about this. And they did a talk on hierarchical Voronoi tessellations. And at the time, I didn't really see how it applied to me.
But when I went, I was like, oh, this looks interesting, they're gonna do some really cool data visualizations and I wanna see what they're doing, right? They were ultimately trying to predict where to place new supermarket chains in a saturated market. You don't wanna place your new supermarket too close to your competitor, but you don't wanna put it too close to one of your own supermarkets. So you gotta find that optimal location, right? So that got me thinking. I was like, wait a minute, that sounds like a real parallel problem to our vote center placement. Because there's lots of parameters as to, well, we wanna place it here, but it can't be too close to this other vote center. And it has to be within an urban environment, or it can't be too far away from this type of structure. So there's lots of parameters that we wanna build into that model. And so it really got me thinking, wow, maybe that type of approach might make a lot of sense for our department. - Yeah, it's fascinating, I really like that. I think there's a great opportunity there. - Yeah, definitely, definitely. So those are the types of things where we ultimately wanna be using data analytics and applied predictive modeling to address business problems and ultimately provide better service to our constituency. - Well, I thought maybe we could wind up our conversation and just touch on one point that's always interesting in the data science world, which is how far our models can take us. If we were physicists, we'd probably be dealing with very fundamental things, and our models would be very precisely predicting exactly what they're measuring. But when it comes to people and the dynamic world we live in, there's a certain limit to the precision we face. Are you facing issues like that? And just how far can data science be a service to our community?
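[A toy analogue of the vote center placement problem described above, not the department's actual method: treat it as a facility-location problem and place k centers so voters' distance to the nearest center stays small. K-means stands in here for the Voronoi-style approach, and the voter coordinates are synthetic.]

```python
# Sketch: pick k vote center locations minimizing voters' travel distance.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
voters = rng.uniform(0, 50, size=(5000, 2))  # synthetic voter coordinates (km)

k = 8  # hypothetical number of vote centers
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(voters)
centers = km.cluster_centers_

# Distance from each voter to their nearest proposed center; the concern in
# the interview is exactly the tail of this distribution (voters too far away).
dists = np.linalg.norm(voters - centers[km.labels_], axis=1)
print("mean travel distance (km):", round(dists.mean(), 2))
print("voters farther than 10 km:", int((dists > 10).sum()))
```

A real version would weight voters by predicted in-person turnout and add the side constraints mentioned above (minimum spacing, urban siting), which pushes the problem toward integer programming rather than plain clustering.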
- Yeah, it's interesting that you brought this up because I was having this conversation with a colleague of mine at our machine learning group meeting today at LA County. And he expressed some real concerns about trying to use machine learning in his department, the Department of Mental Health. Could there be some really interesting and important use cases where you can use machine learning? Of course there can, right? For mental health, imagine, if you will, as a mental health clinician, you want to pore through your database of all your patients and you want to red-flag high suicide risk. That's an outcome you do not want for your patient. Could you imagine having a predictive model that could help a clinician? You know, you feed in the data that's been collected on a patient and it could give you some sort of prediction. That could be a really good thing. But what he was explaining to me was, there's all sorts of interesting, and not just interesting, but really complicated data points that they're collecting that overlap with other data points, and their data sets get really messy. And not just really messy, but in terms of predicting human behavior, those models, of course, can fail. But when we're talking about predicting on something so important, when we start evaluating costs, those failures can be absolutely catastrophic. So those types of concerns about relying on a predictive analytic are really there. The conversation basically was, yeah, Ben, we get what you're doing at the Registrar-Recorder and that's really cool, and we're really jealous in some ways of being able to do that. But a lot of our clinicians and a lot of our managers would be a bit wary of relying upon a predictive analytic to predict those types of human behaviors, especially when it's not 100% or it's not 99%. What if we came up with a predictive analytic that predicted on suicide risk and it was 88% accurate in terms of a binary classification?
In many situations, 88%, that's really good. I mean, I'd be happy with 88% for a lot of things I'm doing, but for a clinician, that level of accuracy or certainty may simply not be enough. And I think there's a lot of concerns that I wasn't even aware of, but things that I'm starting to understand about some of my colleagues that are really involved in the messiness that is human behavior. So I think there's definitely some limitations there. Like I tell a lot of our managers, these predictive analytics, whatever the accuracy is, it's not intended to essentially replace your judgment as a manager or as a staff person doing the work. It's intended to complement your intuitions. It's intended to complement the decision-making that you're doing now. It goes hand in hand. And I think a lot of folks have this kind of Minority Report type of understanding about predictive analytics, where you have these technocrats that basically don't make any judgment calls whatsoever. They just simply rely on these analytics that tell them what to do. And that's just not the reality of the situation. Now for us, we're trying to predict who's gonna show up to vote or not, because certainly you might have certain prediction algorithms that get tricked by parts of the data. Maybe there are data points that are messy and we didn't realize it. There's a lot of noise that maybe we're not accounting for, throwing the algorithm off. We have to use our own intuitions and compare them to what we're seeing in the predictions. So those are some of the considerations that really need to happen, particularly as a data scientist and also as a decision maker that is trying to figure out, how do we provide the best service possible to different constituents in the county of Los Angeles? - Yeah, absolutely.
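[The accuracy caveat discussed here is worth unpacking: with rare outcomes, overall accuracy can look strong while the model misses most of the cases that matter. A contrived illustration with synthetic labels and scikit-learn metrics:]

```python
# Sketch: why headline accuracy can mislead on a rare, high-stakes outcome.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1000 cases, 5% truly high-risk. The model catches only 10 of the 50
# positives (and wrongly flags 30 negatives).
y_true = np.array([1] * 50 + [0] * 950)
y_pred = np.array([1] * 10 + [0] * 40 + [1] * 30 + [0] * 920)

# Accuracy is dominated by the 95% of easy negatives...
print("accuracy:", accuracy_score(y_true, y_pred))
# ...while recall on the class a clinician actually cares about is poor.
print("recall on high-risk class:", recall_score(y_true, y_pred))
```

This is why, for a screening use case like the one described, per-class recall (and the cost of each kind of error) matters more than a single accuracy number.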
It's comforting to know that the county has people, like the colleague you were mentioning and yourself, who are aware of, cognizant of, and considering these things and the actual impact they have, while also leveraging, to the best of your abilities, what data science can do to serve the public. - Sometimes as a data scientist, you have to be careful how you go about it, particularly with people that have never encountered data science and have no idea as to stats, right? Sometimes, you know, it's hard to get through. Like, some people have misperceptions about what a predictive analytic is supposed to do, right? Some people see it as, oh, this is the magic crystal ball. No, it's not. You know, you have to understand, there's very few things that are 100% accurate. So let's put that aside. But likewise, sometimes you have to convince people that these predictions are pretty solid at this level of confidence, at this level of accuracy, you know, and we can rely, statistically speaking, on a number of these predictions up to a certain point. Now, some folks understand that. Other folks, it's still magic to them, and you need to be fine with that. And I have to tell you, I spend a lot of time thinking about intuitive ways to explain how these types of analytics are working so that the lay person, or someone that has no statistical background whatsoever or zero math background, can kind of grasp what it's intending to do. Similar to, you know, that analogy our poll recruiter came up with about stacking the deck to come up with a royal flush. That's the type of intuition I think really reaches folks that have no kind of understanding whatsoever. - Yeah, absolutely. What do you think we should add before I sign off? - Well, I just wanna thank you very much for this interview, and I wanna encourage all of your listeners to get out there and vote. It's an important thing to do, but equally important, we'd love for you to volunteer as a poll worker.
Come help us, you know, make sure we keep those poll locations open. - Absolutely. I encourage people to reach out and do that if they have the time and energy to do their civic duty. I'll put some details in the show notes that we discussed earlier if you wanna follow up. And thanks again for joining me, Benjamin. This has been a really interesting discussion. I'm glad to have you on the show. - Thank you, Kyle. - And until next time, I wanna remind everyone to keep being skeptical of and with data. (upbeat music)