Archive FM

Data Skeptic

Mining the Social Web with Matthew Russell

Duration:
50m
Broadcast on:
07 Nov 2014
Audio Format:
other

This week's episode explores the possibilities of extracting novel insights from the many great social web APIs available. Matthew Russell's Mining the Social Web is a fantastic survey of the tools and methods, and we discuss a few related topics.

One helpful feature of the book is its use of a Vagrant virtual machine. Using it, readers can easily reproduce the examples from the book, and there's a short video available that will walk you through setting up the Mining the Social Web virtual machine.

The book also has an accompanying GitHub repository, which can be found here.

A quote from Matthew that particularly resonates with me was "The first commandment of Data Science is to 'Know thy data'." Take a listen for a little more context around this sage advice.

In addition to the book, we also discuss some of the work done by Digital Reasoning, where Matthew serves as CTO. One of their products we spend some time discussing is Synthesys, a service that processes unstructured data and delivers knowledge and insight extracted from it.

Some listeners might already be familiar with Digital Reasoning from recent coverage in Fortune Magazine on their cognitive computing efforts.

For his benevolent recommendation, Matthew recommends the Hardcore History podcast, and for his self-serving recommendation, Matthew mentioned that Digital Reasoning is currently hiring for data science roles, if any listeners are looking for new opportunities.

(upbeat music) - The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. - So welcome back to another episode of the Data Skeptic Podcast. I'm here today with my guest, Matthew Russell, author of a number of books, including Mining the Social Web, which is the one we're gonna discuss today. Additionally, Matthew is presently CTO of Digital Reasoning, where some really interesting applied data science work is going on that I look forward to getting a chance to speak about. So, welcome to the show, Matthew. - Hey, thanks for having me, Kyle, I appreciate it. - Before we get too deep into it, maybe you could share some of your background and what inspired you to write the book. - The original inspiration for the book goes back to 2010 or 2011. I was out at a special little gathering in Sebastopol, California, called Foo Camp. It's where a number of people get together on the O'Reilly campus, and it's sort of one of the unconferences where you just sort of show up and you have a lot of interesting people in the same place, and you just start to brainstorm ideas and talk through different concepts. And at the time, it was not clear to me that Facebook or Twitter or the social web in general were actually things that were here to stay. It was sort of in that period where I think MySpace and some other things were starting to even fizzle out a little bit. And around that time, it just did occur to me that, hey, the social web is here to stay. This is a thing, and this is something that really deserves a more holistic treatment. And so I gradually worked through the landscape, put together an outline, socialized it with some colleagues, and ultimately, it manifested into the first edition of the book. - I really enjoyed the book. I found it to be a tour de force for understanding the landscape of a lot of the APIs and tool sets and methodologies that are out there for exploring the social web and all the unstructured and interesting data that's available. The book touches on so many, it'd be a lot to cover, but could you maybe give a summary of some of the highlights of what readers can learn from exploring the book? - Yeah, really, the goal of the book in a nutshell is to give the readership a nice cross-section of the social web landscape, ranging from Twitter to Facebook to LinkedIn to Google+, to blogs and RSS feeds. And even in the second edition, I added a chapter on GitHub because it just occurred to me that GitHub is just so incredibly rich. It's such an interesting graph of information. So it's really anything that's social, some of the larger web properties that are social. The thought was, let's just put the right tools and techniques in people's hands so that instead of sitting around and daydreaming about what might be possible, people can say, hey, here's some starter code, here's some templates, here's some tooling, here are some sample problems to get you started. And I tried to really put it together and package it in a way that it is just trivial to get started, with a virtual machine powered by Vagrant as well as template code that is easy to tweak. So the examples in the book may not be exactly the right templates for everyone, but I think more times than not, they're gonna be a pretty good starting point that'll help someone get their hands dirty and dig in. And probably by that point, they're going to be able to take it, extend it, do some copy and paste, and really make it their own.
- Yeah, I really thought it was clever that you included the virtual machine that way. It's the first book I've read that had that option attached to it. And it saved me a whole heck of a lot of frustration trying to figure out exactly what version and library to use to complement whatever text I was going through. So I think that's a powerful aid to a reader to make sure they can, apples to apples, run the examples and then extend them. What inspired that choice? - Well, what inspired the choice was all of the pain involved in supporting the readership for the first edition. So the second edition is quite a bit more consolidated, even setting aside the virtual machine. In the first edition, there were multiple different data stores that were used, multiple different libraries. Versioning was not really explicit or specified. And in general, what inevitably happened, in addition to some of the social web APIs themselves changing, so for example, Twitter had a pretty substantial update not long after the first edition was published, but in addition to things like that that were just completely beyond my control, different open source libraries would update with breaking changes. It just really became all-consuming to try to support people. And of course, if you buy a textbook, more times than not, you wanna actually run those examples, you wanna dig in, you wanna get your hands dirty. To be successful, I think, as a textbook author, you really have to make sure people can do those things. So it really was a maintenance problem for me. And after a lot of thought and reflection, it occurred to me a virtual machine would really be the way to go. But even then, you still run into some problems. Different VM providers, maybe it works on Windows, maybe it doesn't. By the way, that was another challenge. Mac versus Windows versus Linux environments, believe it or not. So the Vagrant-based approach worked great. Vagrant is sort of the one ring to rule them all for VMs. It gives you this nice consistent interface, allows you to use whatever back-end virtual machine provider you'd like, and gives you just a nicer abstraction, in my opinion, in order to get into that machine and start using it. That worked really well. And then the other real critical piece was, I found that for a lot of readers, dropping into a terminal was just too formidable a thing to do more times than not. Lots of people were just getting tripped up before they could even really begin to start using Python. As someone who's a developer at heart and has programmed since childhood, it's easy to lose sight of that. But it became clear to me, hey, I really need something more accessible than even just the Python interpreter. And as luck would have it, IPython Notebook had really become a thing in the intervening time period between the first and second edition. And the amazing thing about IPython Notebook is that it's not just an interpreter in the web browser. Now, that is a big part of it. You do get the interpreter in the web browser. And that is a much more natural environment for a lot of people to work in. But beyond getting the power of Python in the web browser and the more familiar sort of web app looking user interface, the beauty of IPython Notebook is that you can easily share your work and collaborate with others. It's like that notebook you had in chem lab back in high school. You can document your work. You can frame an experimental approach.
And of course, this works really well in a data science paradigm, because in a data science paradigm, you really do want to frame a hypothesis. You want to test it. You want to take some notes, document your approach, come up with a conclusion, and iterate on that as quickly as possible, and as you need to. So it's a subtle thing. But IPython Notebook, being the UI for Mining the Social Web, really helped, I think. A lot of readers have given it a lot of compliments. - Yeah, I like that as well. I think not only for all the reasons you gave, but also that it gives the power of reproducible analysis and the integration of code and commenting. Of course, all languages have commenting, but this is really more explanatory. You have a lot freer documentation you can integrate. So it helps a user who maybe, as you say, has been intimidated to jump on a command line. Everybody's comfortable in a browser, I have to assume. So it's a great midway point. - Yeah, it's definitely less intimidating. And then to your point, I think the thing we're backing into is that it helps you tell stories better, right? You can interlace the prose with the code in a way that is almost like you're just reading a nice blog post, where you'll have someone telling you a story and they'll drop in some code and some results and then tell you some more story. It just sort of naturally gives you that, and almost forces you in a way to think like that as you're using it. - Yeah, each of these big social APIs offers a wide, robust data set for people to get access to, whether it's the Twitter firehose of tweets or what friend connections you have on Facebook. There's so much someone with an interesting idea can jump in and have the opportunity to explore, especially with the guidance of the book. What are some of your favorite aspects of these APIs, and maybe where might you see some of the most interesting and innovative, untapped potential that's still gonna come out of the social world? - Yeah, one thing I would definitely want to highlight for readers is that, and I elaborate on this in the book quite a bit, but it's that when people think of tweets, they typically think of those 140 characters. And that's the natural way to think about it. That's what you would see in the client that you're using or on your phone. But from a data science perspective, there's several kilobytes of additional structured data, well-formed fields in a nice accessible JSON object, that support those 140 characters. And it ranges from linking together reply chains if there's a conversation going on, to time and space information, when did the tweet happen and, if geolocation services were enabled, where was it sent from, to information about the sender, basically their profile information. There's a considerable amount of information in there. If you just ask Twitter's API for a single tweet and you pretty print that JSON structure and you look at it, some people's jaws sort of drop when they realize how much is in there. Twitter also, they go to the length of parsing out what are called the tweet entities, the mentions, the hashtags, the hyperlinks, but not just the hyperlinks. They give you the fully resolved link, the short link, the display link that might be included in the client with an ellipsis for display purposes. Really, anything you see in Twitter's UI on Twitter.com or in any popular client is powered by their public API.
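As a rough sketch of what that looks like in practice, assuming the Python twitter package used throughout the book and OAuth credentials of your own, fetching one tweet and pretty-printing its full JSON payload takes only a few lines. The credential values and tweet id below are placeholders, not real values:

import json
import twitter  # pip install twitter

# Placeholder OAuth credentials for an app registered at dev.twitter.com.
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
OAUTH_TOKEN = "..."
OAUTH_TOKEN_SECRET = "..."

auth = twitter.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                     CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)

# Fetch a single tweet by its id (placeholder) and dump everything that
# comes back: reply-chain fields, timestamps, geo data if enabled, the
# sender's profile, and the parsed-out entities (hashtags, mentions,
# fully resolved URLs).
tweet = twitter_api.statuses.show(id=123456789)
print(json.dumps(tweet, indent=1))

Pretty-printing the result is the quickest way to see the several kilobytes of structured data that sit behind those 140 characters.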
And that in turn means there's the ability for you as a developer to tap into that same API and generally do the same kinds of things. So that's one thing that I think is definitely worth highlighting. Another one that I would mention, that I've just had a lot of fun with as of late, is looking at GitHub and modeling it as an interest graph. So if you think of GitHub as an interest graph, we effectively have projects, which are things of interest, and we have communities built up around those projects, so we have an aggregation of these projects and people. So we have many different people potentially interested in many different projects. And the projects themselves, of course, have some pretty rich attributes, such as the different programming languages that may have been used in them. And it's just a super rich graph. I don't think it's immediately obvious to people that GitHub can be modeled as an interest graph, but once you have that aha moment, I think there's just a whole class of interesting widgets and add-ons that are in the works. - Yeah, absolutely. I'm really excited to see what potentially even readers of the book will be inspired to do. There's a lot there at GitHub; I don't think we've seen the last novel mashup or whatever may come down the road. One interesting anecdote in the book that I enjoyed was the comparison between Coke and Pepsi. At least at the time of publication, Coke strongly dominated Pepsi in the social market. I think you showed how to retrieve the number of likes via Facebook, and then mentioned that if we look at their market caps, they're much closer than they are if you look at the like data. So, I don't know if you meant it this way, but it's an interesting cautionary tale to me about exactly what conclusions we draw from our analysis of data we're pulling from the social web. Do you have any advice for, let's call them, social explorers on how to frame their conclusions and inferences? - Yes, a couple of thoughts there, absolutely. What I tell everyone here at Digital Reasoning, and generally anytime I have an opportunity to speak publicly, you know, I say the first commandment of data science is to know thy data. It may sound a little cliche, but one of the big problems I see in the data science space is that lots of people want to call themselves data scientists, and they feel entitled to that by virtue of being able to just take a data set and start ripping through it with some tools, do some basic data munging, and maybe shake out something that's, you know, what we'll just call notionally interesting. And I think that to really be a world class data scientist, you have to have a mindset that leads you to really want to very intimately know and understand the data that you're looking at. And a big part of being able to do that effectively is to realize that there are going to be some inherent biases of some kind in the data. After all, the data probably didn't just automatically materialize on its own; some group of humans or some group of agents or machines produced that data, and they themselves inherently have biases. And so in the case of Coke and Pepsi, if you, I think I may have footnoted this in the book, but if you take a closer look at the companies, you'll realize that there was a tremendous amount of marketing and social media that Coke put into this.
And so they created an image for themselves online that projected far beyond what may be the case in the real world when you're comparing them against a foe like Pepsi. And so, you know, as you say, you do want the data to speak for itself, but you also have to take into account the biases, the larger, maybe the backstories that are there, and that just inherently involves knowing the data at a deeper level than a superficial glossing over the surface. - Yeah, absolutely. I found the book to be really accessible. I suppose you need some background in programming. This can't be your first technical thing. But even people with just enough experience to be comfortable, I think, can get a lot of value out of it, whether they're more interested in just hardcore coding, 'cause there are really rich programming examples, or if they wanna go the data science route. And it also covers a lot of the fundamentals. Like one of the things, I think, you learn day one if you wanna get into data analysis is similarity functions, which are well explained in the book. Maybe for a listener who's right on the edge of the diving board, ready to jump in, could you give a quick layperson's definition of a similarity function and how one can use that to do some interesting analysis? - Yeah, similarity functions are really the crux of any type of unsupervised machine learning certainly, but even for just more superficial poking around in the data, they're quite useful. In general, when you're looking at data, you're often gonna wanna cluster it in some particular way, just generally bin it into particular partitions, basically divide and conquer approaches. And similarity functions effectively allow you to take any given data point, any given atom in the data, and compare it in some objective way to any other data point or atom in the data. So the trivial example that's just very intuitive to understand may be a basic lexical distance between two strings. So if you're looking for words that may be typos of one another or simple variations, you would expect maybe there's a short edit distance between them. You would take them and look at the number of characters that differ, and the edit distance would be small if lexicographically they're similar, and it would be larger if they're just totally, completely different words. Of course, the real challenge is that it's not tractable to compare every single data point to every single other data point, and in even a modestly sized data set, that's what computer scientists will call an N squared problem. The idea is that if you have N data points, you need to compare them to N other data points; that's N times N, or N squared, comparisons. So that gets intractable really quickly, and hence we come back to this notion of using some type of dimensionality reduction, or some type of machine learning like maybe a K-means clustering if you have a good sense of what K should be, which is how many clusters you're looking for. There are greedy approximations that you can use to work through it. And in chapter three of the book, this is really where I started to speak about this and to introduce it, in the context of LinkedIn data. So the notion was, if you're looking at people's job titles from your professional network, there are different ways that you can compare job titles. So I think one of the techniques that makes sense to start with is you may just look at a job title as a collection of words, just a set of words.
Principal computer scientist or senior software engineer or chief executive officer, you can perform similarity operations by way of set arithmetic, looking at differences and unions and intersections of these sets. And you can take a technique called Jaccard similarity that gives you a ratio if you're comparing two sets. And the Jaccard similarity is really just, if you have two different sets of things, you take the union of all those things, which is the collection of unique symbols across the two, you take the intersection of the two sets, which is just the overlap of the things across the two sets, the members. And you just look at how many items are in that intersection, how many items are in that union, and you compute it as a ratio. And it's a fractional number. So if you had two sets that were exactly the same, they would be completely overlapping, so the intersection would be the whole set, and the union would be exactly what one of the original sets is. So your ratio would be one over one, which equals one. So that would be a perfect similarity. On the other hand, that number would drive closer to zero the less the two sets are alike. - Yeah, I really like that simple, well, it's not trivial, but easy to understand and implement example of using LinkedIn titles to compare and understand how close particular people are. And it has that, I guess, what a computer scientist would call the bag of words approach. And the book goes on to explore what I find to be the spectrum of other approaches, getting into part of speech tagging and chunking and text summarization and entity search. And in my mind, there's kind of a trade-off: as you go up that ladder of sophistication, generally you're getting an improvement in your results out of it, but you're also perhaps spending a lot more time making sure your approach is working or doing some implementation tweaks. So there's kind of a trade-off there. And I'm wondering what your thoughts are, when you start a new problem, on how to pick the right tool for the job. - I am a big believer that you always start with the low hanging fruit. You always want to think in terms of Occam's razor; you want to try to find the simplest possible explanation for the phenomenon that you're seeing. You want to start with just the easiest, most accessible tool. And from there, you just iteratively peel back the onion and you iteratively work into more and more complexity. Sometimes we will just have the natural propensity to over-engineer or to overthink something. And it may often turn out to be much simpler than we ever imagined. So as a rule of thumb, me, myself, I don't start with sophisticated machine learning. I start with, first of all, just really making sure I understand the data, what's its origin, run it through some basic scripts to count different things, to maybe create some partitions, to look for outliers, to just generally test my assumptions. Because usually with any data set, we are going to make assumptions about it, and we are going to be wrong. We really will have some significant blind spots, and only putting it under the microscope will remove those blind spots.
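To make the similarity measures discussed above concrete, here is a minimal Python sketch of both an edit distance and the Jaccard similarity applied to job titles; the function names and sample titles are purely illustrative:

def edit_distance(s, t):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two sets:
    1.0 for identical sets, approaching 0.0 as they share less."""
    return len(a & b) / len(a | b)

# Typos sit a short edit distance apart.
print(edit_distance("scientist", "scientst"))   # 1

# Treat each job title as a set of lowercase words and compare.
title1 = set("Senior Software Engineer".lower().split())
title2 = set("Principal Software Engineer".lower().split())
print(jaccard_similarity(title1, title2))       # 2 shared / 4 total = 0.5
print(jaccard_similarity(title1, title1))       # identical sets give 1.0

Comparing every title against every other title this way is exactly the N squared problem Matthew mentions, which is why clustering and greedy approximations come into play on larger networks.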
And I think that the simple techniques, just bag of words approaches, computing inverse document frequencies and term frequencies, and the simple information retrieval textbook analysis that lots of people would be familiar with and that would be available in toolkits, is absolutely the right place to start. Now, all that said, I think that will only get you so far, and I'm not sure you will develop a world class type of business that really depends on analytics if you stop there. You're going to run into something more times than not, I think. You're probably going to need a deeper view of the data. You know, there's the old saying, if it were easy, everybody would do it. Well, you know, if the analysis is easy, then probably everybody is doing it. Now, you do have to tie it into a business problem. It's not just about the technique or the analysis. So, you know, there's that side of the story as well. So for example, at Digital Reasoning, you know, we have what I would consider some of the world's most sophisticated natural language processing technology that you're going to find anywhere. And we do that because we want to take human language data, and we really believe, over many years of experience, that you have to create the right building blocks in order to be able to do the right, more effective, upstream advanced analytics. And so using the bag of words approaches, or just counting things, that helps you to size up data. That helps you to get a working understanding, to build a prototype, to test some intuition. But when precision and accuracy and recall and F measures and these sorts of quality control metrics really come into the conversation, you'll find that for the messy problems, the ones you will encounter with human language data and unstructured data, it really does take a level of sophistication beyond what you're going to get in a typical open source toolkit. - Yeah, definitely. I don't think anyone would argue with me if I claimed natural language processing is far from a solved problem. So we've definitely got to explore a lot of these sophisticated techniques, depending on what problem we're trying to solve. I think there's a challenge there in that the more sophisticated it gets, the less accessible the approach becomes to a non-computer scientist. Do you have any advice for how someone using the fundamentals and the tools that they can learn from your book can best frame their results when trying to convince someone who's maybe not familiar with the technologies? - I'm a big believer that the people who understand things thoroughly and deeply are able to actually articulate them in very simple explanations, and I think Einstein has a quote to that effect, right? That if you really understand something, you'd be able to explain it to a child and they would understand it. So yeah, in general, there's a lot of thought in working with data and data science that frames it as a storytelling approach. And I do think that there's a lot of truth in that, in that if you take a big, messy, complicated data set, at the business level, it really should usually boil down to a few sentences or a paragraph. There's the insight, and then there's, okay, why does this matter? And then on a certain level, everything else is just the details and how the sausage is made. So I would encourage people to try to provide the simplest possible explanation, but no simpler.
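For readers who want to see the "information retrieval textbook analysis" mentioned at the top of that exchange, here is a small from-scratch sketch of term frequency and inverse document frequency over a hypothetical toy corpus; real work would typically lean on a toolkit like NLTK or scikit-learn:

import math
from collections import Counter

# A toy corpus of "documents"; in practice these might be tweets or posts.
corpus = [
    "the quick brown fox",
    "the lazy brown dog",
    "the quick red fox jumps",
]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    """Term frequency: how often the term occurs within one document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency: terms rare across the corpus score higher."""
    matches = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / matches)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so its idf drives the score to zero;
# "lazy" appears in only one document, so it scores comparatively high.
for term in ("the", "brown", "lazy"):
    print(term, round(tf_idf(term, docs[1], docs), 3))

The product rewards terms that are frequent within one document but rare across the corpus, which is exactly the "counting things" that helps you size up a data set before reaching for anything more sophisticated.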
- Yeah, I know my next question's maybe a bit out of scope for the book specifically, but given your vast domain knowledge, I'm curious to hear your perspective on, essentially, the responsibility that users, developers, and API providers have in the social world. The user should be careful about what they're sharing, and the API provider should be very thoughtful about how they control the flow of that information, and the developers are certainly not above inspection as well, in terms of how they manage the data that they're given permission to have access to. So ultimately, where does most of the accountability lie among those three roles in making sure that we're honoring everyone's privacy and protection and anonymization, these sorts of important things? - Yeah, there's certainly a fair bit of shared responsibility there. I'm a personal believer that you should never put anything on the internet that you would not want to be public. Maybe that sounds a bit old fashioned, but in general, I think that once any piece of information is digital and online, regardless of anyone's best intentions, or regardless of anyone's best attempts at security, or regardless of anyone's particular take on their company's terms of service on any given day, we as humans are pretty fickle, and opinions and sentiment change over time as to how some of these different policies should work, not to mention just the occasional security breach where all bets are off, regardless of good intentions. Just like any other relationship, I would frame it in terms of one of mutual respect and trust. So if you as a consumer really want to give a social network or any other provider your deepest, darkest, most sensitive data, you really should have some basis for doing that, and it really should be some basis of deep trust. And if you don't have that and you provide that data, personally, I think that's just foolish, and I think the consumer has sort of done it to themselves in this case. Of course, you have other situations where, over time, someone maybe is providing information bit by bit, so like with Google, for example, maybe the consumer's not immediately aware that Google, or any search engine for that matter, is warehousing search results over a prolonged period of time. Besides looking at your bank account and what you're spending your money on, looking at what you're searching for on the internet is maybe the next best thing, or maybe even the best thing, to help me figure out who you really are and what makes you tick if I don't know anything else about you. So in a case like that, I can't really say, well, it's the consumer's fault for using the search engine over a prolonged period of time. That's one where I think Google, or the search engine as a company, would bear a significant amount of responsibility. There's a case where maybe, even though I'm sort of a fan of minimal government in life, maybe that's a case where there should be some regulations and policies to help frame how large multibillion dollar companies use consumer data. And then, of course, at the developer level, there's always some accountability there as well, as you pointed out. Generally, I think if the developers are following the terms of service and providing the right cues to the user, I see that as a little less important than the consumer being thoughtful, and ultimately than the entity warehousing the data, making it available, and providing it, being responsible.
And you see systems such as OAuth, which is covered in the book. It's a system that's largely evolved to provide developers with the ability to get permission to access sensitive data in a way that can be revoked by the consumer, and in a way that the consumer doesn't have to give away sensitive credentials to allow that application access. So I think we'll continue to see maturity there as these systems for authorization continue to grow up with the social web. - There's so much we could cover from the book, we couldn't possibly do it in one sitting. I finished my read-through a couple of weeks ago, but I'm even still working through doing a deep dive on all the examples, which are really informative and helpful. I hope we've enticed listeners enough to go check it out. What are some of the best online outlets by which they could do so? - So really, the heart and soul of the online identity for the book would be the GitHub repository. That's thoroughly linked to in the book. It's also linked off of the book's blog, which is miningthesocialweb.com. There, I've tried to include some posts and some additional material that didn't make the timeframe cutoff for the book. So you can just search for mining the social web, GitHub, and that'll come right up. Just make sure it's the second edition repository; the first edition repository is still out there as well. Just a couple of weeks ago, in fact, the second edition repository received more stars and bookmarks by users than the first edition. So that was an interesting and exciting inflection point for me. - Yeah, with our remaining time, I was hoping we could talk about some specific real world use cases where I presume the tools and techniques of the book are being used, namely at Digital Reasoning. For anyone unfamiliar with the company, could you tell us a little bit about what Digital Reasoning does and what your role is there? - Yeah, I am chief technology officer here at Digital Reasoning. I've been here for seven years and really helped to build the company nearly from scratch seven years ago when I moved to Nashville. Digital Reasoning is a company that, the way I would describe it, is we structure your unstructured data, and then we provide advanced analytics to solve real world problems by using those building blocks. So to a machine, human language data is just an opaque stream of symbols. It has no context whatsoever. We have a series of predictive analytics models that performs really the full spectrum of natural language processing. So we take that human language data, and we effectively turn it into Legos. You put this big sheet of plastic in one side of the machine, and the other side of the machine spits out Legos. And with those Legos, these building blocks of human language, you can start to solve interesting problems. You can build your starship or your truck or whatever it is that you're interested in building. So in that regard, our flagship product, which is called Synthesys, by the way, is very much a platform, a software platform. And in general, we have been working in the defense intelligence space, in areas of national security, for some number of years. We're now really helping some banks on Wall Street to meet their obligations to new legislation. We call that the proactive compliance part of the business. These banks really are facing some significant liability for things that are happening on their networks, in their email systems, in their chat systems.
And they are trying their best to get on top of that. And then beyond that, there are some fairly substantial opportunities in healthcare and just the public sector that we're excited to be a part of. - Yeah, the financial industry efforts I was really excited to be reading about. One of the things the book covers in some detail is the open Enron email analysis that I think most readers are familiar with. But it's a fascinating way to take a look at how and when things changed at that company during the turmoil that went on there. Can you talk a little bit about what Synthesys does, and maybe how it's comparable, or how the banks are using it to find the things that they're looking for? - Yeah, so in general, this sort of gets into this notion of what we're calling, and what some others are now starting to call, cognitive computing. It's this idea that there's a class of problems that are very human-like problems. They're very messy problems to solve. They're vague, complex, ambiguous. And there's generally not what you'd consider a right answer; there's sort of an acceptable answer or a best answer. And so in the case of what a bank may be looking at, or really what anyone looking at human language data may be looking at, it's gonna be the case that there's gonna be some language that's used, and the language is going to be indicative of something that's going on. But it may not say it quite that way. It may be roundabout. It may be intentionally masked. And so an obvious example to go to would be insider trading. So if there's insider trading going on in a bank, the bank would certainly wanna know about it. But of course, two brokers are not going to just email one another on the internal system and say, "Hey Joe, let's do the insider trade. Here's the ticker." It's gonna be far more roundabout than that. But at the same time, there will be signals. There will be anomalies in the language that we will be able to try to tease out. Another example that I will mention, that we are just so grateful to have worked on and to be a part of: there's an organization called Thorn. You can look them up, wearethorn.org. This is an organization that Demi Moore and Ashton Kutcher founded to combat human sex trafficking here in the United States. And generally, what we have going on is that there will be, in the dark parts of the internet, escort forums and escort bulletin boards and such things. And it turns out there are quite a few minors who are being trafficked in these escort scenarios. So how do you look at an escort ad and determine with some degree of efficacy, okay, is this just a legitimate escort ad? Or is this likely to be an escort that might be a minor, that might be a person in distress, that might be controlled by a pimp in a way that they appear to be operating independently, but are under strict supervision? A very, very difficult problem to work on, and a problem you'll never get right 100% of the time, by the way. But there is leverage that you can put on that problem by way of cognitive computing, and that's exactly the kind of problem cognitive computing should be used for. These are the worthy problems, the problems that are worth working on. And I just wanna encourage anyone listening to find a problem that is really gonna make the world the place you want it to be, a better place, and spend some of your time and energy working on that type of problem. - Yeah, that's a really great example.
I can't imagine anyone would be disappointed to hear that work was going into a great social service as well. And it's something where you might need a fleet of 100 people all looking at the various underbelly of the internet, the dark websites, as you say, that one machine can automate, making it possible to find those sorts of connections that might not have been uncovered before. - Yeah, and that's a good point. There's sort of a subtlety there that you sort of backed into, which is that cognitive computing is this area of technology where I think we have humanity and technology augmenting one another so that we as humans can reach our fullest potential. I think that's just a key point to keep in mind as we continue moving forward in the data space, and technology evolves, and new types of machine learning and AI continue to get more sophisticated. Where I think as humans, we're gonna continue to use technology to either enable things that were not previously possible at all, these technology breakthroughs, as we might call them, but there are also just substantial numbers of problems where, as you just said, well, maybe you need 100 people to study the dark belly of the internet. Well, with the right technology, maybe we can get that 100 down to 20. Maybe those 80 can then go on and work on some other problem. Then maybe we can get that 80 down to 16. And then we can take the remainder and work on another problem. So I think there's a virtuous cycle there to be had. - Absolutely. I'm gonna point a lot of readers and listeners in the show notes to check out the recent coverage in Fortune Magazine that I think goes into great detail about what Digital Reasoning is doing. And, if my research is correct, I think you guys started out offering services to the government, is that correct? And helping detect terrorism and credible threats and that sort of thing? - Yeah, that's right. The founder and CEO of the company, Tim Estes, he was at UVA in Charlottesville at the time. He began some work with an Army organization there in Charlottesville called NGIC. And the company really started out, just Tim and maybe a couple of others, working on a project that grew into more of a consultancy, which eventually grew into a services business. And then over the past five, six years, we've really made the transition into more of an enterprise software business. So there's definitely a life cycle there. And it's been a long, difficult road to get to the point that we're at. And we still have some significant challenges ahead. But anything worth having is worth working hard for. And we're just grateful to have an opportunity to make a difference at the moment in time that we're all here together. - Yeah, and these are really hard problems to tackle. I think it's an exciting thing that you can even approach stuff like this. There is one case, though, that's maybe more pragmatic, I'd like to get your take on. It's the case of a New York comedian. His name was Joe Lipari, if I recall correctly. And he was a guy just having a bad day; he went to the Apple store and things didn't go as he had planned. He went home and he started watching Fight Club. And he posted, I think it was either on Facebook or Twitter, something, you know, quoting Edward Norton's character, saying something about coming to the office with a semi-automatic weapon and just unloading on everybody. And I certainly wouldn't argue with anybody who would say that post is in really poor taste.
But within a short period of time, that information was detected. A SWAT team showed up and kicked his door in. And I think it's safe to say this was a false positive. Maybe one worth investigating, but, you know, not an actual threat. Now, there's a debate here about what's an appropriate police response, but while that's interesting, I think it's outside the scope of our discussion. I'm curious to hear your take on what is the right way to control for false positives like that. You certainly don't want to overlook a case like that and filter it out, because you might filter out some actual threats. But is sarcasm maybe the ultimate insurmountable challenge to cognitive computing? - Yeah, so there's a lot wrapped up in that. Let me try to dissect some of it. The general response is not a technology problem at all. It's simply, you know, more of a policy problem. You know, someone could have made a comment in poor taste and had their door kicked down the same way whether they had tweeted it or spray painted it on the side of a building. You know, in terms of the actual issue of sarcasm, it is admittedly, you know, a difficult thing to determine even if you're sitting in a room across from another human being. There are times when that person, even if you know them really well, may make a sarcastic comment, and you're wondering to yourself, well, what did he or she really mean? Were they being serious? Are they having a bad day, or are they being sarcastic? And so I think in general, you either ask them for clarification or you make an assumption: they are, or they aren't, being sarcastic. If you make the assumption, you as a human may effectively encounter, you know, what you'd consider a false positive there as well. You know, if you consider that they were being sarcastic and they weren't, or vice versa, you know, there's an accuracy issue. And if we as humans have that much trouble determining the real world context, then I think it stands to reason that, you know, expecting a machine to do that with a higher degree of fidelity is certainly a challenge. Now, the benefit machines have, especially in the context of cognitive computing, is that there are just tremendous amounts of data that can be used as context. Arguably, a machine may be able to use some of that additional context for its benefit. But, you know, in the particular case you mentioned, you know, it could just as well have been the case that this person could have just flipped a switch and really gone back and done some terrible things. It just so happened that in this case, they didn't; they just sort of blew up and made a comment.