You know, I have to start with a funny story. One of the things about being an engineer is being very close to data; you're constantly thinking about it, whether you like it or not. So I was doing the Santa Monica stairs last week, and these two girls were walking in front of me, talking. "So when's your exam?" And she goes, "Oh, it's next week." And this girl goes, "So are you done with your models?" And she goes, "No, I'm still working on a few." And then they were talking about, "So what's the average?" and she goes, "I think it's about," I don't know exactly what she said, but she said eight or ten at the shoulder. And I was like, wow, they're already talking about curves, and I'm thinking, you know, shoulder, head, knee of the curve, all that stuff. And then she goes, "What about color?" I'm like, oh, here's a new variation: visualization. Yeah, they do visualization, man. That's great. And then: "I don't know, I was thinking of maybe doing some blonde highlights." And I was like, wait a minute, what's happening here? And then I hear them talk, and they're finishing hairdressing school.

Oh, that's hilarious.

Yeah. They're talking about their models and their hair, and here I am: the minute I hear "models" and "average" and all the rest, I'm thinking, oh, they must be doing some statistics.

The Data Skeptic podcast features conversations with researchers and professionals working on problems or projects related to data science. Today's episode features experienced big data engineer Jay Shankar.

I work for a comparison shopping website. We have found that, as far as the website goes, the traffic is really not what it was. We used to have a constant set of loyal people coming to our site. What's happening now is that Google is cutting into that space, and they're really doing a great job; it's right there on the search page.

Yeah, so they're cutting in front of you.

Yeah, they're cutting in front of us, and they've basically cut out the middleman, because we've traditionally been what my boss calls "data pimps." We're like, yeah, we know who has the best listing for, say, a hard drive, and when you come in we say, hey, these are the guys, this is the best guy, and I click on it and go buy my hard drive. So the faster I can get a guy out of my site and into a fulfiller's site, the better for me. And that kind of data pimp business is not really sustainable now, because Google undercuts you. Even before you get to our site, if you do a search on Google, they're right there. That's why a lot of these guys are doing the PLA, the product listing ads; Yahoo is doing the PLA too. It's basically like a local search: they have partnerships with us and other shopping sites, and we can actually place our products with Google, and they will just use that, so they get all these feeds.

I see, so you contribute to that.

Yeah, exactly. And then maybe they massage the data, maybe they don't. They have strategic partnerships with us, they choose to enrich the data or not based on the vertical, and then they give it to the consumer. That way we cannot win. It's not a lot of money, but still, we are in the game. And as far as traffic going to particular sites, unless you have a good SEM play, the organic traffic is just not coming anymore. It's not what it used to be.
Yeah. And previously I was at an online Yellow Pages company, and if you look at their statistics, almost 78% of the traffic is people coming there for the first time. They basically come in, they use it, and they never come back. So it's very hard to be sticky. In that kind of scenario, the best thing for you is to do an SEM play: use Google, use Bing, get your products out there, have your custom vanity landing pages, and make sure it doesn't look like a shady site. So you work on that. But as far as SEO plays go, it's really very tough, especially this holiday season. I won't name it, but there's another comparison shopping site that relied purely on SEO, and they got slammed. This is the kind of stuff that happens, and it's natural; if it didn't happen, we wouldn't be evolving. It's like nature: nothing in nature is permanent. You look at 4.7 billion years of Earth, and there's never been one apex predator that stuck around, right? The cockroaches have stuck around, but that's because they can survive. So even if we only eke out a couple of cents on the dollar, with high volume it really adds up.

So the push now is syndication, because we have the biggest catalog, we have the best catalog, and we have a brand name that's more than a decade old, and I would say, at least for me, trusted.

I know that even though you guys have relationships with a lot of retailers and whomever, you're in the business of finding the best rates. So for something that's truly commoditized, like a hard drive: I'm going to buy a Seagate hard drive, I've already made my decision, I just want the best price. How do I get that? How do I aggregate all that data, or how do I find someone who has done it for me?

Exactly. So our syndication business is really growing, and that's where we're going to keep our effort, and the key to all of it is our catalog. That's the library of all the offers that all our merchants are offering. We used to have a fairly monolithic process that was all written in Perl, and it had really stood the test of time. It was solid, it worked; that's the thing, when something works, it works. The thing is, it would take anywhere from 24 to 48 hours to really get a publication going, and you couldn't rerun anything if we missed something. So we went to more of a loosely coupled architecture, where it's all queue-based, and we can keep ingesting while the catalog keeps changing, but at least it's up to date. It really took us some time to reach steady state, mainly because the ecosystem around the catalog had to be trained on this; they really needed to understand how it works now. A lot of times the salespeople would come and say, well, it used to work, it used to be solid. And yeah, it used to work, but you still had to wait 48 hours, and if something messed up, you just lost 48 hours. Whereas here, we fail fast and we move on.
Yeah, and that's the compromise. It took some time to convince the business, and to convince everyone else in the ecosystem who deals with the merchants, that this is a better way and this is the future. But we had to do it fairly quickly, because our biggest revenue, for all comparison shopping sites, is the holiday season: Thanksgiving and Christmas, then January, and Cyber Monday. That's when we are really making money. So we really have very few months to conceptualize something, test it out, stage it, and put it in production.

Yeah, it has to work.

It has to work. There's no excuse; that thing has to work. So we always have to have an insurance policy, and we've always made sure that we could switch to a backup plan. But this time, when we broke up the catalog, this past holiday season, we didn't have a backup plan. We went with it, because we had done our tests, we had done our due diligence, and it went well. Some of the merchant feeds coming in were taking time because of our hardware constraints, but it was all there; the data integrity was pristine. The problem was with the timings and the SLAs, because people outside don't know that our inner workings have changed, so they expect certain things. It took us about another month to iron out some of the details, but the holiday season went very smoothly.

So now we're saying, okay, we can't really do this again and just hope and pray that things are going to work. That's why we have a push: yes, we have a hardware issue, but let's see how we can leverage the cloud. That's one of the things I'm working on right now. And once you start putting things on the cloud (we've already moved a couple of components there, some of the image processing and so on), the tools that you get from Amazon are just amazing. You don't really have to take your own tools and put them on EC2; you just use their tools. SQS, SNS, you name it.

What was the other one, Kinesis?

Yeah, Kinesis is basically managed Kafka on the cloud. Kafka is another queue management system, but it has partitioning built in and scalability built in. LinkedIn uses Kafka for all their event notification, so you have event streams that you can capture and then process, and it's fairly plug-and-play with Kinesis. In my experience, if you double down on the AWS platform, everything plays together really nicely. But if you want to run your own databases there, or if you don't want to play within the Amazon ecosystem, you're on your own. Not to say that you won't succeed; you will, but it's that much more effort. And why reinvent the wheel? They've already given you everything.

So some of the things I'm working on now are the feed processing and the data ingestion, getting it to a point where our current system can take just the changes and then do the attribution and all the other steps that follow for processing changes and additions. And the deletes are going to be just a batch job that removes stuff from Mongo.
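To make that queue-based, plug-and-play style concrete, here is a minimal sketch using boto3 against SQS; the queue name, region, and message shape are illustrative assumptions, not the production setup discussed here.

    import json
    import boto3

    # Minimal SQS producer/consumer sketch; queue name and message
    # fields are hypothetical.
    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = sqs.get_queue_url(QueueName="merchant-feed-items")["QueueUrl"]

    # Producer: enqueue one work item from a merchant feed.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"merchant_id": 42, "sku": "ABC-123", "op": "change"}),
    )

    # Consumer: pull a batch, process, delete on success (fail fast otherwise).
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                               WaitTimeSeconds=5)
    for msg in resp.get("Messages", []):
        item = json.loads(msg["Body"])
        print("processing", item)
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

Kinesis would look much the same with put_record and a partition key, which is where the Kafka-style partitioning he mentions comes from.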
Sure. So let me ask a little bit about the tools that you're trying to use.

Yeah. When I first started doing this processing of feeds, I was thinking plain MapReduce in Java might be too complicated, so I looked into Cascading, and Cascading seemed like a really elegant solution. If you're not familiar with Cascading, it uses a plumbing analogy: sources are called taps, then you have sinks, and you have pipes that join everything. What it does is abstract away the MapReduce layer, so you're just dealing with sources and sinks and whatever processing needs to happen between them, and you chain them all together. You don't do the brute-force MapReduce or touch the internals; you use their abstractions. They have hash joins, they have CoGroups, and the number of lines of code can be reduced dramatically when you use Cascading.

But one of the things I ran into was that I had to join two big files. The way the hash joins work is that one side has to be a smaller subset that you can push out to the entire cluster through the Hadoop distributed cache. The idea is that the join then becomes very simple, because you're just extracting the common keys, and you don't have to deal with two big datasets on both sides: either you have a smaller subset on the right or a smaller subset on the left, and then Cascading works. But with this kind of huge onslaught of data, Cascading was running out of memory, guaranteed. I wasn't rolling it out to our production cluster; we have a small development cluster, and I was testing there. That's one of the things we do: we quickly do POCs, we fail fast, and we move on.

Then I whipped up some Pig scripts. The thing with Pig is that it works great and the UDFs are fantastic, but for moving data in and out of Pig you have to wrap it in some real solid shell scripts; even for simple things, you have to use shell or some kind of programming language to feed the data to Pig. Once that's taken care of, Pig will really do its magic, and it was really good; Pig worked.

And then I was like, okay, I've come this far, let me just look at raw MapReduce. So I used a design pattern called the reduce-side join. The reduce-side join is not that efficient, but it works great when you have two big inputs. In the simplest form, you isolate the key from one source and flag it as coming from that source, and you isolate the key from the other source and flag it too. In your reduce phase, you now know where the keys come from, and your value is the entire row. So you look at the two values, one from the first source and one from the second, and you compare them iteratively; it's just one loop through the entire dataset. You split the values and look for discrepancies. If there's a discrepancy, there's a change. If the first source has a value but the second source doesn't, then it's an add; it's basically a left outer join, and the reverse, a right outer join, is a delete. Within five or six lines of code I was able to whip this thing out, and it beat Pig, I mean, hands down.
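For readers who want to see the pattern, here is a minimal reduce-side join sketch in Python, Hadoop Streaming style; the "old"/"new" tags and the tab-separated layout are illustrative assumptions, not the actual production job.

    from itertools import groupby

    def mapper(lines, tag):
        # Emit key, source tag, and the whole row; the tag flags which
        # file the record came from.
        for line in lines:
            key, _, rest = line.rstrip("\n").partition("\t")
            print(f"{key}\t{tag}\t{rest}")

    def reducer(lines):
        # Hadoop delivers rows sorted by key; compare old vs. new per key.
        rows = (line.rstrip("\n").split("\t", 2) for line in lines)
        for key, group in groupby(rows, key=lambda r: r[0]):
            values = {tag: rest for _, tag, rest in group}
            old, new = values.get("old"), values.get("new")
            if old is None:
                print(f"{key}\tadd")        # only in the new feed
            elif new is None:
                print(f"{key}\tdelete")     # only in the old feed
            elif old != new:
                print(f"{key}\tchange")     # in both, but different

In Streaming you would run the mapper once over each input with the matching tag, let Hadoop sort and group by key, and pipe the merged stream through the reducer.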
And it just piqued my interest in doing more MapReduce. So then I expanded it to other things, like the pre-processing of the files, and I'm moving lookups into the distributed cache, and it's working really well.

Very cool.

So one of the problems I'm working on right now is trying to do some normalization on our feed files. We get massive numbers of feed files, ranging anywhere from a couple of megabytes all the way to 50 or 60 GB, and these are daily files. What we have to do is look at each work item and say, hey, is this an add, is this a delete, or is this a change? So we retain the previous version, and we've been doing this through the pipeline system: as each item goes through, we look at Mongo, which is our persistence layer, and we say, hey, is this thing here? If it's here, which attributes have changed? Then go process the attributes that changed. If it's there and the version number is the same and everything is the same, don't do anything. If it's not there, do an add. Otherwise, if the version number has expired, do a delete.

So the idea was to try to move these bursty jobs to AWS, so we can spawn as many machines as we want on demand and then bring them back down, especially with the spot prices. As I was doing the research, I basically tried everything from files with 100 lines all the way to files of about 40 GB. What I found is that the curve is kind of erratic in the beginning; with the smaller files the slope varies, and then there's a knee of the curve, and after that it just goes straight. That knee is at about 10 million rows.

You mean the time to process the batch?

Yes. So anything below 10 million, it's really not worth it to parallelize.

Yeah, the overhead.

Exactly. And 10 million meaning each file is 5 million, and you're comparing both files, so the total number of rows is 10 million. It was very interesting to see that only above 10 million is it really linear, where the row count really makes a performance impact: as the number of rows increases, so does the time. And what I also found was that most of the vendors are on the long part of the tail. We have maybe 10 to 15 vendors who are above 10 million, and the rest all kind of taper off. So how do you solve that problem?

That's a standard sort of log-normal curve that shows up everywhere.

Yeah, exactly. That's when I realized that unless a feed is really big, and unless it's really impacting the rest of the feeds, it doesn't make any sense to take everything to the cloud.

Makes sense.

So the idea now is that we'll be talking to just the bigger vendors and saying, let's put the file on S3, and let's do the processing there.

Yeah, that's kind of a neat solution, given that we have Amazon Web Services around. Everyone knows S3. So rather than "here's some FTP server where this file is," it's "I'll just put it in S3, I'll share you the permissions, you can go grab it, and it's where you need it to be."
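As a rough illustration of that add/change/noop logic against the persistence layer, with deletes handled as a batch, here is a minimal sketch assuming pymongo; the database, collection, and field names are all hypothetical.

    from pymongo import MongoClient

    # Catalog persistence layer; database and collection names are made up.
    catalog = MongoClient()["shopping"]["catalog"]

    def classify(item):
        """Return 'add', 'change', or 'noop' for one feed work item."""
        prev = catalog.find_one({"merchant_id": item["merchant_id"],
                                 "sku": item["sku"]})
        if prev is None:
            return "add"                    # never seen this offer before
        if prev["version"] == item["version"]:
            return "noop"                   # same version, nothing changed
        return "change"                     # process the changed attributes

    def expire_deletes(merchant_id, current_version):
        # Deletes as a batch job: anything still on an old version has
        # dropped out of the merchant's feed, so remove it from Mongo.
        catalog.delete_many({"merchant_id": merchant_id,
                             "version": {"$lt": current_version}})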
Yeah. And if it's a really big file, and they're saying that they need a certain SLA, what we do is we just go for the spot instances, or even the regular EMR cluster, and we can actually say, if you want this kind of an SLA, you've got to pay us a little bit more, and we just translate that cost over to the client.

So one of the things we know is that certain things cannot be done in real time. Especially when we want to deliver certain metrics, we basically give them an SLA, and that's usually eight hours or more. We are constrained by that, but the idea is to go almost real time. And when we do almost real time, one of the philosophies we've been looking at is: it's not going to be accurate, but it is going to be a trending tool, which is what it is. Basically, vendors are really interested only in the trending. If they know that their feeds are going to be processed and all their products are going to be online by this time, give or take one or two hours, they're okay with that. What they don't want to do is just wait and not see it for days, and now they have a problem.

So we've been looking at some trending. Earlier I talked about trying to read from the firehose and get an accurate measurement of where exactly each offer, each item, is. The amount of overhead we carry just to maintain that accuracy is not worth it. So the alternate solution we've come up with is this: every queue is a key in Redis, per retailer, and we just log counts. Those counts go up or down based on what's being enqueued or dequeued, and then we have a process every five minutes that takes a snapshot of that into MySQL. So if you look at, say, a day's worth of data, you'll see that, say, 100 million rows of the feed file came in, and these queues processed this many, those queues processed this many, and there were this many no-ops. At the end of the day you add it all up, and it's all going to match everything that came through. That gives a trend graph over time, and that is good enough. We're not reading from the firehose; we're basically doing a sampling over time, and that seems to give us good results. To your point, one of the things in our design philosophy, given our constraints, is that we've decided to take the approach where approximate is good enough.

Yeah, absolutely.

And then there's a push for us to get a bigger data center, and maybe move some other stuff. We have two data centers; one is sparse, but at the same time we don't have the same kind of deal or bandwidth with the other one. So we're working some things out, but once we do get our second one and get some breathing space, then I'm all for accuracy. Because one of the things about accuracy is, if you can accurately predict these things and provide excellent reports to the customer, build it and they'll come, and that just adds to our reputation. And the comparison shopping market is extremely tough right now.
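A minimal sketch of those per-retailer counters and the five-minute snapshot, assuming the redis-py client; the key scheme is hypothetical, and the snapshot prints rows rather than inserting them into MySQL.

    import time
    import redis

    r = redis.Redis()

    def on_enqueue(retailer):
        r.incr(f"queue:{retailer}")   # one more item waiting for this retailer

    def on_dequeue(retailer):
        r.decr(f"queue:{retailer}")   # one item drained

    def snapshot(retailers):
        # In production this row would go into a MySQL table instead.
        now = int(time.time())
        for retailer in retailers:
            count = int(r.get(f"queue:{retailer}") or 0)
            print(f"{now}\t{retailer}\t{count}")

    # Snapshot process: every five minutes, capture the current depths.
    while True:
        snapshot(["retailer_a", "retailer_b"])
        time.sleep(300)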
Yeah, so you guys are tracking at least some amount of inventory, I think, right? You'll know who has what, which will change with a faster velocity around the Cyber Monday type thing. Do you have to work in, you know, a variable-control SLA there somehow, like spin up more and more instances to compensate?

Certain vendors want us to be absolutely real time, and these are big vendors; we have to comply. So yes, that's kind of the idea of why we were doing this. Before, they knew that we could only do so much, but we were still able to deliver pretty close. Now we are primed for it; we just have to make sure we can get this thing going. But yeah, variable SLAs, especially during the holiday season, because everything changes.

So along the lines of picking tools, I'm going to show you a picture you've probably seen, and I'll put it in the show notes for anyone who wants to look. I forget who put this together, but you're seeing this, right?

Oh yeah, I've seen that.

There must be 200 companies on this thing. It's called the big data open source landscape 2.0. I don't even know what half of these are myself, and half I do know; I've played with some and not with others. There are too many solutions, frankly, at least in my opinion. How do you decide, once you've taken things to a certain point and you want to take it to the next level: does it make sense to look at Hive or Impala or Pig? How can I bring all these other things in? Mongo versus Couch? What's your strategy for deciding where to go and what to look at, knowing that you don't have 10 years to look at everything?

Right, right. One of the things I do is look at the specs, look at the features, and ask: is this in line with what we are trying to do? Of course, I do have the luxury of doing some research at work. So, for example, when we were evaluating queues: we were using Mongo as queues, and because of the Mongo table locks and because of Mongo's consistency model, the queues were slow. We had to come up with another solution, and I looked at a couple of tools. I read the features, I looked at some other specs, and I figured that Redis was really blazing fast; the speed was really incredible. Then I spent about a week benchmarking. I built massive queues, enqueued them, dequeued them, and made sure that every data structure was tested, just using simple Python scripts to prime the pump: load up the queues, bring the queues down, hit them simultaneously from three or four different clients. And what I found was that the response was near linear.

Wow.

Simple benchmarking. I know sometimes it's a brute force method, but if you can do it quickly, it's always fail fast. I can't just scratch the surface and then say, oh, this is not the tool; I have to do some kind of research into it. So I looked at a couple of candidates and ran simultaneous tests, but Redis came out really well for this. So that was an example of what I did.
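That week of benchmarking boils down to scripts along these lines; a minimal single-client sketch with redis-py, where the queue name and sizes are arbitrary. Running several copies at once approximates hitting the queues from three or four clients simultaneously.

    import time
    import redis

    r = redis.Redis()
    N = 100_000

    # Enqueue in pipelined batches to prime the pump.
    start = time.time()
    pipe = r.pipeline()
    for i in range(N):
        pipe.lpush("bench:queue", i)
        if i % 10_000 == 9_999:
            pipe.execute()
    pipe.execute()
    print(f"enqueue: {N / (time.time() - start):,.0f} ops/sec")

    # Dequeue one item at a time until the queue is drained.
    start = time.time()
    while r.rpop("bench:queue") is not None:
        pass
    print(f"dequeue: {N / (time.time() - start):,.0f} ops/sec")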
As far as the big data tools go, I'm quite tuned in with all the user groups, so I try to attend at least the talks on the newer products, like Spark or Impala.

And you're trying to keep in touch with the community.

The community has been very helpful. You can just post questions and they'll answer. Stack Exchange is great for people telling you how not to do things, because they've done it and they've failed. But what I usually see is that we have a huge number of products coming out, and I initially look at where a product fits my business. If it doesn't fit my business, I'm going to go with something that does, something I can be up and running with fairly quickly. There are a lot of products in beta, a lot with features still coming, and I try to avoid those; I let them mature a little bit. Because I know the future is not going to be these low-level tools. It's going to be higher-level tools, where you're not even going to know what runs underneath, and it's going to be a lot of visualization, a lot of drag and drop. We're not there yet. It took us a long time to have a UI-based SQL engine. SQL Server, Oracle, they all had one; SQL was always fairly UI-driven, but it took us a while to get there. Once we got there, unless you're curious, you just do a right click, pick an item, and that's it; you don't worry about what SQL runs underneath. So it's going to get there, but right now we're still quite nascent. And with MapReduce 2.0 we have newer paradigms: MapReduce is really on its way out, and we're going more towards directed acyclic graphs, and we have products like Tez that are taking it to the next level, because ideally that is what's going to get you to real time in this kind of big data processing situation.

Just very, like, distributed asynchronous queuing.

Exactly, exactly. Because right now every map has a reduce, and then that data feeds into another map-reduce chain, and another map-reduce chain, and it's batch. But if you really think of it as an acyclic graph, you can have a couple of maps and then one reduce, which is what Tez is supposed to do.

And the Hortonworks products, I'm pretty fond of them.

Yeah, I haven't used Hortonworks stuff myself, but everything I hear is really good.

The reason I like them is not that Cloudera has bad products; they have excellent products. It's that Cloudera, MapR, they're all fairly custom, whereas Hortonworks is 100% Apache open source. Vendor lock-in is something, in my experience, I've had a lot of with databases, and that's something I want to avoid. I'd like to stay with open source. Yes, we may not have certain features, but you have an army of guys coding those features up and testing them. And especially in the open source community, a lot of these companies have people who are Apache committers, so they're very actively doing this. If you look at what was under the Apache project umbrella five years ago and what you have now, every day there's something new.
Yeah, it takes time to mature, sure, but I like to stick with open source, or as close to open source as I can.

Yeah, I tend to agree. And if nothing else, even if the project becomes abandoned, you yourself could choose to maintain it.

Absolutely. I mean, I remember at Business.com we had CloudBase, and it was so interesting, because at that time I was fairly new to MapReduce, and this engineer we hired to work on the product was very gung-ho about it. We used to have long discussions about what kind of design patterns to use, how to optimize the code, how to write the compiler that compiles SQL and breaks it up into MapReduce. We basically approached it by mimicking a database: what would a database absolutely need? We would need those things too. At that time Hive was really kind of languishing, and then the minute we put our first version out there, there was so much buzz, and then they just took Hive to the next level. So what I like about open source is that even if it's a languishing project, somebody, some kid, can light a fire, and next thing you know, oh my god, we've got all these features in this tool and made it part of the whole stack.

Yeah, competitive pressure. I think that's great.

Engineers would love to do it better than the next guy: I can do it faster than you. It's a positive thing. So I think open source is the way to go.

And to your question about navigating all this: a lot of it still has to shake down. We have to see how it all pans out. But for your immediate needs, I would go with things that are tested, that have some track record. If you want to look at a newer product, you have to look at the feature set, and you have to do that quick fail-fast test yourself: is this in line with what I'm trying to do, or is there a better product that does exactly what I want? Sometimes a limited feature set is a good thing, because if it's a perfect match for what you want to do, you don't need to do anything extra.

Yeah, I'd say that about Redis. Redis is very feature rich, but also limited in some ways: a finite set of commands. I don't think they've added much in the last three or four years.

And then the slave setups and the distributed compute; they've made advancements there, but that's all, in some sense, under the hood. The engineer doesn't per se have to care about it. Now, a lot of the tools have almost put in a new chain of command, at least the way I look at it, where what used to be the domain of one software engineer is now almost a three-step process. You have your software engineers close to the product, who know functionally what things are supposed to do. You have sort of an ops guy in the middle, knowing how to implement that, or maybe that's an engineer too. And then an ops guy who's just the gearhead, who knows how to plug these tools together and doesn't even know what the data is doing. Where do you see yourself in that spectrum? Because you could be full spectrum, knowing you.
Yeah. So one of the things about data scientists in general is that there's going to be ETL no matter what. You're going to have to have that guy who's good with moving files around and knows how data works. Then you have the guy who's like Tom Cruise in Minority Report: all right, k-means clustering. Who does that but the statistician, someone who can prototype quick and dirty and get things going. And there's the ops guy, as you said, who's just a gearhead. So it's hard to be full stack in this environment, unless you're running a startup and there's a necessity for you to be stretched into the full stack role. Since I have a fairly deep data warehousing background, I know ETL really well, and that's something I would not want to keep doing. I want to move more into the statistical realm, because other than studying it in school and being familiar with it on the fringes, it's not something I've delved into, and that seems to be my new challenge. I know I can do all the rest, but if I can do that too, I can be better at isolating what I really need to get out of the data.

Yeah, absolutely. If you have a good picture of the end goal, you will not waste your time twiddling bits at the beginning of the data ingestion.

Yeah. So that's where I see myself going, and where I see my educational pursuits going as well.

That's a really interesting point to me, because with all the new tools and technology we have, it's almost easier to say, let's save everything. Which is, in a way, a bit of a naive thing to do, because at the end of the day, that data is only as useful as the decision it helps you make.

Absolutely. And I know companies that hold tons of data for at least three or four years, and there's not a scratch, not a dent on that data. They're paying all this money to keep their SANs chock full of it, and you have all these data center guys, and you're paying for power and conditioning, and it really does not make sense, because it is not answering even one single business question. And can it even answer one single business question? That data is outdated. Especially when you're talking about trends in comparison shopping: what was true early last year is no longer true now. So how can that data really help you? There's a lot of effort towards profiling your customers, making sure you have advertising directed at the customer; it's all customer driven now. What used to be generic, more business driven, is now oriented towards the customer: if I search for guitars on one site and I go to some other site, they're popping up ads for guitars. So having rich data about your customer, enriched data, all of this changes constantly. You cannot say a customer who came to your site two years ago is still a viable option to pursue, to check the profile and see what he may be interested in. That's long gone.

And things are changing so fast these days. I think it makes sense to add telemetry to a lot of your processes, but not to turn it all on at once, and you really need to understand that you may soon be reading from a firehose just to get very little information.
Yeah. So that's why having an idea of the end goal, or at least an idea of what questions you're going to be asking, is very important, because that makes the pipeline efficient.

Yeah. I know a lot of companies have massive petabytes of data, like Facebook or Yahoo, but they are in a slightly different business as well. Everybody knows that trends keep changing. What was the norm five years ago, even with design patterns, is changing a lot. So really, you have to be nimble, and fail fast is something I believe in: make sure you do the due diligence, and if it doesn't work, move to the next one, quickly. I think that's where we should be going, and all these tools will help us get there. But definitely there's going to be a shakedown. When databases started, there were guys like Gupta, with SQLBase, who were defining the RDBMS for the desktop, and they're not around anymore. Some people will be left behind; some other tools are going to rise up. But as a user community, I would always say we have to be a little skeptical, and I think being aware of the entire surroundings, the entire ecosystem, will make us more successful than being drowned in it. Some people are just tools guys; they love working on new tools, but they don't really come up with any solutions. It's very easy to get sucked into that, but after a while you have to realize there's a bigger picture.

So, one of the things I keep running into: just yesterday, the 4th of July, I was at a party and we were discussing data, and somebody was saying, you know, I hear "big data" all over the place, but I don't really see engineers understanding data. And that's one of the things; I've talked to some analysts who may not be great engineers, but they see data day in and day out, they do Excel spreadsheets, they know the pulse of the data, and I can have an intelligent conversation with them about how to process this data and use these newer technologies, as opposed to some engineers who are just so caught up in the technology. That's a trend I see a lot now, and I'm not trying to preach or anything, but my little word of advice to younger engineers coming in is that you really need to understand how data works. It doesn't matter what tool you use, a database or whatever; it doesn't have to crunch through massive amounts of data. You really need to understand how data works. Then you'll have a better idea of what to do. The right tool for the job has always been my motto, and it's worked great. When I know something doesn't have to be that complicated, I'll just whip up a quick Perl script and get it going, or Python, which these days seems to be the new Swiss Army knife, with the incredible libraries we have there.

Yeah. There's a heuristic I picked up somewhere along the line that I apply quite frequently: boil it down to, what question are you trying to answer?

Absolutely.
And if you can isolate that, which is a lot harder in most cases than it should be, then most of the path is laid out before you.

Yeah. And most data scientists, that's what they're supposed to do: even before answering the question, figure out what question you're asking. Things can get really ugly, especially right now. There are a lot of people with big data on the resume who talk about big data, but I'm not really sure they understand it. They may know how to use some tools; they might have run a MapReduce program once or twice, but that is not big data.

So how would you define it?

A simple example. We have a pipeline; it's basically a cataloging system, that's our brain, and we have syndication partners, and everything goes through the catalog, and the catalog is all queues. We have queues for certain topics, and then we have consumers that consume them, process them, put them back into certain other queues, or store them in a persistence layer. Doing this, one of the things we had to confront is eventual consistency. When we first built this, we were like, yeah, I think this will work, but then you run into the things that distributed systems have, like eventual consistency, or knowing where your little granular item of data is. And that's very hard to do, because of the transient nature of some of the states these work items can be in.

So what I suggested was: how about this, let's log everything in a Redis database. Every time an entry goes into a queue, and any time it comes out of a queue, we have a hash that identifies its unique ID, and we just record in Redis: hey, I was in this queue, now I'm in this queue. That tells me how long an item stays in a queue, and we have a Sun Grid Engine set of servers. The minute I did that, we took that data from Redis and started persisting it in Elasticsearch. Before long, we realized we could handle only about one day's worth of data. It's massive amounts of data; every second we were getting bombarded with all this stuff. And that's when I realized, now we have a big data problem.

Just to slice and dice this, I had to write scripts that would pick items from Elasticsearch and put them in our Hadoop cluster, and then I had Pig scripts that would do the aggregation. And we would miss certain keys, so we had to do it every 15 minutes, then have an hourly file at the end of the hour, and then tag along the stuff that wasn't processed for the next hour, because sometimes items stay in a queue for a long time and get popped maybe after five hours, but we still need to correlate the two. Pretty soon it became such a huge overhead just to handle the metrics that we said, okay, we're just doing summary data. But that is an example of a big data problem. We had millions of rows coming in; actually, I think it was close to a billion rows a day.

Wow.
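A minimal sketch of that per-item transition log in Redis, assuming redis-py; the key layout and the printed dwell time are illustrative, and in the real system this stream was shipped off to Elasticsearch.

    import time
    import redis

    r = redis.Redis()

    def log_transition(item_id, to_queue):
        # Record where the item is now and when it got there; the previous
        # entry tells us how long it sat in the last queue.
        now = time.time()
        prev = r.hgetall(f"item:{item_id}")
        if prev:
            dwell = now - float(prev[b"since"])
            print(f"{item_id} spent {dwell:.1f}s in {prev[b'queue'].decode()}")
        r.hset(f"item:{item_id}", mapping={"queue": to_queue, "since": now})

    log_transition("a1b2c3", "ingest")
    log_transition("a1b2c3", "attribution")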
And that's a lot, and it's transient data. So what I did was use D3.js to visualize it, as a kind of staggered block with different colors, where different queues would come in here and leave there, and I had it aggregated over certain other dimensions, mainly because I couldn't do it at the granular level.

So what is a big data problem? I mean, if you're trying to slice and dice even a million rows, a database is pretty much going to beat your big data system, mainly because of the way MapReduce works: the data has to lend itself to being broken into individual parts that have no correlation with each other, and then the reducer has to get its own keys. And the other thing is, we are bringing computing to the data.

Right, that's the other thing. I heard a great analogy once that I often repeat. What is MapReduce? How do we explain it to someone? It would be like, say you had a library on a college campus, and you wanted to ask the question: how many of these books refer to one of the US presidents? Not an easy task, because looking at a book on its own, you don't know. If you're lucky, maybe there's an index, and maybe the index says, okay, George Washington, this page. But in just some novel, there could be a character saying, "oh, the guy looks like Teddy Kennedy," and that wouldn't be a president, whereas "he looks like John F. Kennedy" would. There's no reference, so you kind of have to read every book. So do you have one person go one by one and read every book? Well, no, you can have a team of people, each reading a book, and that's the map, right? You get multiple people reading multiple books, and then reduce is where they come back and compare notes.

Exactly. And I think the analogy holds up an extra step, which is nice, in that you don't have to drive to the library, check out the book, bring it back home, and read it. MapReduce is also about: let's go to the library and read it there, where the book lives.

Exactly. So I like that the analogy works on two levels. And I read somewhere that the Romans were using this for the census.

Really?

Yeah. They would send people out to the cities to collect the individual censuses, and then they would bring them back to centers where they would compile all that data. It was very interesting. We've used it all along; now we call it MapReduce. But divide and conquer has always been there.

So all the patents should be expired, if the Romans had it. Two-thousand-year-old prior art.
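Just for fun, the library analogy reduces to a few lines of Python; a toy in-process version, with a truncated president list and made-up "books."

    from collections import Counter
    from itertools import chain

    PRESIDENTS = ["George Washington", "John F. Kennedy", "Abraham Lincoln"]

    def map_book(text):
        # One "reader" per book: emit (president, count) pairs for this book.
        return [(p, text.count(p)) for p in PRESIDENTS if p in text]

    def reduce_counts(pairs):
        # Compare notes: sum the per-book counts for each president.
        totals = Counter()
        for president, count in pairs:
            totals[president] += count
        return totals

    books = [
        "George Washington crossed the Delaware. George Washington led the army.",
        "The stranger looked a little like John F. Kennedy.",
    ]
    print(reduce_counts(chain.from_iterable(map_book(b) for b in books)))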
Yeah, that was something very interesting. So, I feel like we could go on for hours and hours, but I've kept you a good while so far.

Yeah, we're moving more into the philosophical nature of things and some controversial stuff, so I must stop right here.

So I'd like to wind it up and ask my guest for two references, as I was mentioning to you earlier. One I like to call a benevolent reference: something you're not connected to, but that feels very useful, and we'd love to give a plug for. And then something that you're a little closer to, and might get a benefit out of by giving it some publicity.

Well, I can tell you that if you're new to MapReduce, and if you find it very overwhelming, the Yahoo Developer Network website is fantastic. If you go to developer.yahoo.com and look up MapReduce (and I think IBM has one of those too), they have a very well thought out description of how MapReduce works, and I think anybody who's starting off needs to go read that first. That's really where I got started, because it was overwhelming at first; even though I had gone through a couple of textbooks, things were still not clear. That website makes it very clear.

Nice. So that's one thing I would definitely say you guys can go take a look at. It'll be in the show notes.

Yes. And anything that benefits me? I mean...

Plug your film, if nothing else.

Yeah. If you guys want to see a good independent movie about immigrants coming to this country and how they fare, it's like a family drama. Go check out the website, onewayticketthemovie.com; it's one word, "one way ticket the movie." We have DVD distribution on Netflix, but if the DVD takes too much time, you can just buy the DVD off of the website, and it's extremely cheap.

Is it on iTunes also?

It's, uh, it's on Amazon.

It's on Amazon, yes. So if you guys are interested.

Yeah. Well, thanks for doing this. This was great; I really enjoyed it.

Thank you for listening to the Data Skeptic podcast. For show notes or other information related to the show, please visit our website at www.dataskeptic.com. Follow us on Twitter at @DataSkeptic. If you enjoyed the program, leave us an iTunes review and help others find us. Thanks for listening.

(gentle music)