
The Data Stack Show

198: Building AI Search and Customer-Enabled Fine-Tuning with Jesse Clark of Marqo.ai

This week on The Data Stack Show, Eric and John chat with Jesse Clark, the Co-Founder & CTO of Marqo.ai. During the episode, Jesse discusses the evolution of AI and machine learning in enhancing search capabilities, particularly in e-commerce. The group explores the concept of vector search and its advantages over traditional keyword-based methods. The conversation also touches on the challenges of searching for specific items, like car parts for Land Cruisers in Australia, due to the complexity of part numbers and interchangeability. They delve into the difficulties of dealing with unstructured data, such as information locked in PDFs and manuals, and how Marqo is developing AI to search this data and incorporate it into relevant results. The episode also covers the technical aspects of customizing embedding and language models for better search outcomes, the potential of language models to connect different data modalities for advanced search experiences, the future of search interfaces, the role of new technology in search experiences, and more.

Duration:
52m
Broadcast on:
17 Jul 2024
Audio Format:
mp3

Highlights from this week’s conversation include:

  • Jesse’s background and work in data (0:35)
  • E-commerce Application for Search (1:23)
  • Ph.D. in Physics Experience Then Working in Data (2:27)
  • Early Machine Learning Journey (4:35)
  • Machine Learning at Stitch Fix (7:28)
  • Machine Learning at Amazon (10:39)
  • Myths and Realities of AI (13:49)
  • Bolt-On AI vs. Native AI (17:26)
  • Overview of Marqo (19:46)
  • Product launch and fine-tuning models (23:02)
  • Importance of data quality (25:38)
  • The power of machine learning in search (32:02)
  • Future of domain-specific knowledge and product data (34:08)
  • Unstructured data and AI (37:19)
  • Technical aspects of Marqo's system (39:42)
  • Challenges of vector search (43:27)
  • Evolution of search technology (48:15)
  • Future of search interfaces (50:43)
  • Final thoughts and takeaways (51:53)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

(upbeat music) - Hi, I'm Eric Dodds. - And I'm John Wessel. - Welcome to "The Data Stack Show." - "The Data Stack Show" is a podcast where we talk about the technical, business, and human challenges involved in data work. - Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. (upbeat music) - All right, welcome back to "The Data Stack Show." We are here with Jesse Clark from Marqo. Jesse, welcome to "The Data Stack Show." We are super thrilled to talk with you today. - Yeah, great to be here. Thanks so much for having me on. - All right, you have a really long, interesting history which we're gonna dig into in a minute, but give us the abbreviated version. - The abbreviated version, yeah. It started out with a PhD looking at very small things. I spent about six years in academia and decided that wasn't for me, moved into women's fashion doing data science at Stitch Fix, segued into Alexa at Amazon, then robotics and search, and then founding Marqo, which brings me here today. - Awesome, tons to discuss. All right, yeah, welcome on, Jesse. So we talked a little bit before the show about the e-commerce application for search, some of the history of search, and why it's so complicated and messy right now. So what are some topics you're interested in covering? - Yeah, really excited to talk about machine learning in search, vector search, some of the new capabilities that really unlocks, and to be really forward-looking as well, because I think we're gonna see a large evolution in the way that we search, and we've already seen some of that with things like ChatGPT and these kinds of question-answering methods. So looking forward, I think it's really exciting. - Yeah, awesome. All right, shall we do it? Let's do it. - Jesse, so glad to have you on the show, and I have to entertain my curiosity just a little bit before we dive into data stuff, to ask you about physics.
So you have a PhD in physics, you studied very microscopic things, if my very limited understanding is correct. You sort of almost replicated microscopes for things that are too small to have a lens for, or something, which is insane. So I just have to know: having a PhD in physics, what was the most surprising or unexpected thing you discovered while learning so much about physics? - Yeah, that's a great question. And yeah, really well understood in terms of what I was doing, from that very brief chat. I think the most surprising thing in physics, and I did a lot of experimental physics, was just how good you had to become in these adjacent areas. So things like electronics, plumbing. We lived and died by experimental data, and we'd go to great lengths to collect it. So we'd be living in the lead hutch for six days, collecting all this data. And we'd have to program robots to collect the data, we'd have to hook up vacuum pumps, do all the electrical ourselves, but none of this was taught to you. So you had to just work it out. And this was a while ago now; there were far fewer resources on the internet. And so there was really no choice but to work it out on the spot. - Yeah, that's not theoretical physics. (laughing) - That's not theoretical, no. - Yeah, I mean, so my conclusion from this is, like, if I'm going out on an adventure and I need a real renaissance man who can figure out how to hook up a vacuum pump, I just need to find a doctorate in physics, you know, to do this thing. - Who you'd want on your team on "Survivor," right? - That's true. - Awesome. Well, we have a ton to cover today: talking about AI, talking about search, all the challenges of search, talking about how those two worlds are colliding. But what I'd love to do... you have been in the depths of machine learning for years now.
And so you're absolutely one of those people who, I would say, was doing machine learning and, you know, quote unquote AI, and we can talk about that term if we want, way before it was cool. So early on at Stitch Fix, you were doing machine learning across a number of disciplines: of course recommendations, but also, you know, demand forecasting, et cetera. Can you just give me... well, actually, why don't we just start there? Can you just give us your machine learning journey? So after you did your PhD, what types of machine learning stuff did you work on, and where? - Yeah, I think it was quite organic. You know, when you do the PhD, a lot of what you have to do is solve a lot of problems and analyze a lot of data, and you've got to write the programs. And so it starts to become this very natural evolution: you've got to write these programs for analysis, you've got to write these algorithms for analysis. And so you sort of automatically start basically developing your own kind of machine learning libraries as part of your PhD. And, you know, in physics there was a huge amount of talk back in the 2000s about big data. We had huge amounts of data at the time; it was petabytes of data. We used to carry suitcases of hard drives back from these experiments. We had so much data we just didn't know how to analyze a lot of it, really. And everyone talked about this future state where we could get all this information from the data, but at the time we had no idea really how we were going to achieve it. Of course, now, looking back, we have so many tools, and big data really is something that can be leveraged.
And so it's amazing to see that. You know, it took a lot longer than I think everyone expected, but almost 20 years ago it was such a hot topic, and now, 20 years later, we do have a lot of the tools. So yeah, the machine learning kind of happened organically. I didn't realize at the time that it was even called machine learning. You know, I just thought these were algorithms that we had to write. And it wasn't until I started to look outwards from my own discipline that I realized, actually, these are very similar things, applied in many different ways, and really valuable to a lot of other functions outside of core science and physics. - At what point... do you remember the moment when you realized, okay, I've built these algorithms to operate on these petabytes of experiment data, the moment when you realized, okay, I actually have a machine learning skill here that I could take to industry? Do you remember that moment? - Yeah, I think it was slightly different. I didn't think that I had the skill; I was more like, I suddenly realized I lacked a lot of skill. So I was in my physics domain and then started to look outside of it and really think about it. Exactly, I was like, oh, I've learned all this stuff, I must be able to apply it, I'm gonna be pretty good at this straight away. And then I started to look at it and realized, oh, hang on a minute, I actually know nothing here. I need to learn a lot more. - Oh, yeah. - This was quite a long time ago, so it was very early on in the machine learning journey.
But I think it was that realization that there was this huge amount that I still needed to learn which motivated me a lot to actually cover those gaps. Of course, a lot of the stuff that you do learn in fields like physics actually does have a lot of counterparts in other fields like machine learning; it's just that the terminology is very different. So again, once you realize that, you think, actually, I know more than I thought I did; you realize what the mappings are between these subjects. - Yeah, makes total sense. So was Stitch Fix your first sort of in-industry job in machine learning after you left academia? - Yeah, exactly. Stitch Fix was the first sort of industry job, you know, full-time data science and machine learning. And so that was really exciting. The one thing that was so amazing as well was just how, in physics, and particularly in experimental physics, the quality of the data and the quality of the analysis really dictated the outcomes. And it was exactly the same in industry; straight away you could recognize that. It was the same kind of primitives: really thinking about the data, being really careful with it, and thinking about how to actually drive outcomes. - Yeah, makes total sense. So I remember, and I don't remember what year it was, but Stitch Fix got so good at some of their recommendation algorithms that it kind of rivaled, like, Netflix level. At least in the data community, people were super impressed with the interesting signal that you guys were able to extract. Like, how did that come about?
Like, what do you think were some of the keys to that? Because most companies go out and hire data scientists and it just doesn't work out how they want, and it seems like Stitch Fix was very much the exception to that. - Yeah, it was very interesting. I think they worked very hard to build an environment that allowed people quite a bit of freedom in terms of exploration, but then once something showed traction, there was this sort of exploitation and putting it into the business. But I think one of the secrets was honestly, like, a bunch of reformed physicists working on these data problems. I mean, it was an incredible mix of people. There were a lot of PhD physicists, computer scientists, neuroscientists, social scientists. So a lot of people who had deep expertise in experimentation and data. And I think that was really the key: people were, you know, fanatical about this stuff, and you just didn't leave anything to chance. It was, you know, wake up in the middle of the night and think, oh, I've missed this piece in my data code, in my ETL, I need to fix that, otherwise my downstream is going to be cooked. And so I think it was really that combination of getting people who really loved data and then giving them the freedom to execute. - Yeah. And it sounds like the diversity of backgrounds was helpful there too. I mean, even your very practical, hands-on experience with data in physics is very different than hiring a team of PhD data scientists that all have a uniform background, right? That, I find, sometimes doesn't fully translate into actionable intelligence. So yeah, that's cool. - Yeah. Yeah.
I think one thing that was really noticeable as well was, yeah, the diversity of background. You get these sorts of problems that crop up, and people have seen them; it looks slightly different in their field, and they can bring in different lenses, they've got different tools. And so you get this really good better-together story, where people are able to bring in one of these other ideas and solve these problems. - So you moved from Stitch Fix to Amazon, and you did a couple of things at Amazon. So give us the overview of, you know, what kinds of machine learning problems you solved at Amazon. - Yeah, it was really exciting. I joined Amazon and I didn't even know what I was gonna be doing there. Let me add a little more color: I was living in California at the time with my wife, and we had just had twins as well. And so then I decided to take this job with Amazon and move to Seattle, and I didn't even know what I was gonna be doing there. So it was a huge leap of faith. (laughing) Really, you know, I have to thank my wife for the support there. And so, yeah, I went and worked at Amazon on a sort of top-secret project; I was basically the number three or four hire on the team. And it was a really ambitious, zero-to-one kind of project, so it was really exciting. And I think Amazon has a reputation for these projects, and it's certainly true: they're really taking a big bet and trying to make really remarkable things happen. So that was just great to see, how that evolved, and being deeply technical and working on the machine learning there. For this initial project, it was still very connected to the end.
You know, what customer problem are we solving, and how do we take this technology, which is quite complex and nascent and still has a number of rough edges, and make it into something that customers are gonna love and buy? And so that was really interesting, not just from the technical perspective, but from this sort of holistic product-development perspective, and just seeing that iterative cycle. And then, yeah, I moved on to robotics after that. Again, I just saw a huge opportunity in terms of greenfield projects, taking on something ambitious. The fulfillment centers are obviously a huge part of Amazon, and the efficiency there that they've been able to really drive, you know, it's quite remarkable. And so again, to be able to have a potential impact there was really, really interesting. So yeah, I was developing a lot of the intelligence for robots, all the machine learning models to help them see and basically understand the world. After about two and a half years doing that, I also spent some time in sort of retail and shopping. So again, starting to think about how we improve these experiences online. There were a whole bunch of different aspects that this touched: how do you help sellers, how do you help people discover things? So yeah, this was super interesting. - Well, I want to zoom out a little bit, because that is such a wealth of experience across a variety of machine learning disciplines, you know, from making a clothing recommendation with statistics in fashion, to robots, which is crazy. And at a non-trivial scale, obviously. So can we talk about the AI landscape for a little bit?
And maybe a good question to start with would be: what are sort of the big myths you could dispel? There's so much hype out there. If you had a couple of top things where you're like, these are the things that really bother me when I see these headlines, what are those things for you? Because you are actually, truly experiencing this and literally building sort of AI-focused technology, which we'll talk about in a minute. - Yeah, great question. There's a lot, I think, that probably gets me a bit riled up at the moment. With some of the things in AI, a lot of it really is that it's not magic, and there's really no silver bullet here. To solve these problems with AI still requires a lot of the same things that have always been required, right? You need really good data. You need really disciplined approaches. You need really good evaluations. You need to understand what's happening. So I think it's about just staying grounded: this is still a tool. AI is a tool, a technology that you can use, and so all of the usual rules apply. In terms of the things that are a bit misrepresented: the hype has died down a little bit, but particularly when a lot of the large language models started to come out at these incredible scales, people started to talk about all these emergent capabilities and whatnot. But I think now it's actually become evident that these emergent capabilities aren't so emergent; it's really that people just didn't stare at the data long enough. A lot of this stuff was already in the data, and this was entirely expected.
And so training these models is not, again, magical. You don't just suddenly get AGI. You know, let's just talk about LLMs: it's really what's in the data. It's very much like training a muscle: you train a muscle, it gets bigger; you train an LLM on this data, it gets better at it. And that's really it. And I think we need to just be really grounded in what they can do and what we want them to do. - Yeah, yeah, that makes total sense. Well, help us understand: if you had to break modern AI down into its core components from a technical perspective, could you break those down for us? You talk about large language models, there are vector databases giving you, you know, RAG applications, and so on. But what are sort of the main components when you think about a modern AI application? What are the core pieces of that? - Yeah, another good question. I mean, AI, as the name suggests, is artificial intelligence, and it's much more encompassing than something like machine learning, which is much more specific. And so I think AI definitely is much more than just an individual component like a model. It's really worth a lot more than the sum of its parts, and that's what these systems are actually able to do. So from an AI system perspective, in terms of the different parts: at least in modern AI we're sort of centralized around deep learning models, or some other machine learning model, which is kind of driving it. But then you have all of these peripheral components around that.
And so, exactly like you said, being able to store and retrieve data: vector databases, vector search. We've seen, you know, augmenting large language models with the ability to retrieve information, and this has evolved now into, say, tool use. So now you can actually have not just a single database, for example, but other functions that can get called; maybe it needs to request something, or whatnot. So you've got these systems now which are integrating a whole bunch of other data management systems and, again, serving inference. But a lot of it actually looks very similar to what's happened in software engineering before. Again, this is super impressive and really powerful, but a lot of the same engineering practices still apply. And so you've still got all these other components: you've got the serving, you've got the interface, and you've got, obviously, a lot of the sanity checking and the sort of safety around it, particularly with language models of late, you know, with prompt engineering, prompt injection. Yeah. So I think that's probably it. - Well, how would you describe... I've got two mental models here, especially with SaaS apps. I've got one, which I'll call bolt-on AI, which is: I have a product, and then I call the ChatGPT endpoint and, like, return something, right? I think you see things where they just kind of add on AI. - Yeah, exactly. (laughing) - Yeah, where it writes, you know, a little bit.
And in essence, it's kind of repackaging what ChatGPT can do and then focusing it toward their product, which is fine, versus what I think you guys are doing, where it's truly a native AI. And I guess, one, how do you communicate that with people? And two, what are some of the challenges with so much noise in the space? Yeah, that's probably the best way to say it: it's just a very noisy, loud space. So how do you communicate that with people? - Yeah, I think you're absolutely right. There's this sort of consuming of AI to build something, and then producing AI as well, and there's a big distinction here. Previously it was much more about producing the AI, and there were far fewer capabilities in terms of being able to consume it and use it. And we've seen now that people are able to integrate it with just a single API, and now people are sort of AI-enabled. But certainly now the term has become so overloaded, it's very hard to actually distinguish fact from fiction. I think at some point we'll move very quickly away from people talking about AI as powering their application. You know, probably when electricity came about, people were like, electric-powered cooking or something like that, and it was a big selling point. But now, if you said that for a lot of stuff, people would probably look at you quite strangely. Of course, it would be like, why aren't you using electricity to do this stuff? It would be odd if you weren't. And so I think we'll get to the point where it becomes so pervasive that AI is just expected to be part of it. But I think in terms of being able to, I guess, talk about it...
I think we focus a lot on outcomes, I guess, and just sort of business value. And so sometimes the stuff in the middle is actually less important than what the actual business value is and what the sort of outcomes you can derive are. So I think we just try to focus not on, like, the technology and the solution, but really on the problem and actually solving it. - Maybe it'd be good for you to just give us an overview of what Marqo does before we dig into search, because you mentioned a couple of things when you were giving us the overview of the landscape: there's vector search, there's vector databases, and when you look at Marqo, it has, I think, several components there. But just level-set us on Marqo. And then, because John has tried everything in search since his first job, I'd love for you all to talk about why it's such a problem, and some of the history. But yeah, start out with just telling us what Marqo does, and I guess we can start on the vector side. - That'll be most interesting, because when you hear vector search, you kind of instantly think vector database, and there are a bunch of vector databases out there, and not all of them are created equally, obviously. So help us place Marqo correctly in the way that we're thinking about it. - Yeah, absolutely. So one way to think of Marqo is as a vector search platform. And the reason that we're calling it that and working towards this is that vector search itself requires much more than just a vector database. A vector database is built around similarity search.
So you put in a vector and then you can find the nearest vectors, and that's effectively your search: what's returned are the relevant things in terms of a vector search process. However, this is still a very primitive operation, and actually building any kind of search system requires many more components than that. In fact, vector search itself requires a whole bunch of additional machine learning components. And so with Marqo, we're moving beyond just focusing on the similarity piece and actually thinking more holistically: how do we actually bring this vector search technology to developers so that they can actually integrate it? With the current wave of solutions, you get the vector database, and then everything else is left as an exercise to the engineer, who now has to implement all of the orchestration, the extraction layer, and handle the machine learning. And once they've got that, now they've got their sort of hello-world example. But if you actually have any suitably valuable search bar or application, you need to really think about how you tune the search and how you develop the models. And so this is sort of the third piece. What we've done with Marqo is really think: okay, if someone's going to actually put this into production and have a long-lived service that drives a lot of business value, they're going to have to cover off on all of these components. And these are quite different technical domains that require a lot of expertise.
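The similarity search Jesse describes, put in a vector, get back the nearest vectors, can be sketched in a few lines. This is a minimal brute-force illustration with made-up toy embeddings, not Marqo's implementation; production systems run approximate nearest-neighbor indexes (e.g. HNSW) over millions of model-generated vectors.

```python
import numpy as np

def nearest(query_vec, doc_vecs, k=3):
    """Return indices of the k document vectors most similar to the query,
    ranked by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity of each doc to the query
    return np.argsort(-scores)[:k]      # highest-scoring indices first

# Toy 4-dimensional embeddings standing in for real model outputs.
docs = np.array([
    [0.90, 0.10, 0.00, 0.00],
    [0.00, 0.80, 0.20, 0.00],
    [0.85, 0.15, 0.00, 0.10],
])
query = np.array([1.0, 0.0, 0.0, 0.0])
print(nearest(query, docs, k=2))
```

The key point of the "primitive operation" remark is visible here: the function only ranks vectors; everything else (turning documents and queries into vectors, orchestration, tuning) sits outside it.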
And so what we're doing is: we have the vector database, we have the abstraction layer, the orchestration, the machine learning, the inference, so people can get started straight away, documents in, documents out, so you can search with text, you can search with images. And we're now building to the place where you can actually start to fine-tune models, integrate behavioral data and feedback from users, and have this continual system that keeps learning. So you've got this search system which is really performant, covers off all of the components, and just continues to get better over time. - And is that last part... 'cause you just had a big product launch. So can you tell us specifically, 'cause it sounds like that last part is sort of... I know out of the box you can plug Marqo in and get, you know, much more relevant search. But that last part, it sounds like now you're enabling your users to integrate their own sort of first-party data, is that sort of like bringing your own embeddings? But tell us about the launch. I'd love to hear a little bit more about that. - Yeah, really exciting. So like I mentioned, for vector search to be really valuable long-term, you need a few different components. And so the initial Marqo focused on the vector database and the extraction layer. And like you said, now we've just launched the ability for customers to fine-tune their own models on their own data: really specific, domain-specific models which really understand the nuances. And I think everyone is familiar with this: if you search for even sort of basic things, like jeans, on different websites, the notion of what jeans should be is very different.
Customers have a particular flavour, a particular style, and being able to actually capture all of these nuances is what this new product launch enables customers to do. So they can fine-tune the model on their data to get it to really understand their customers and the language they're using. It covers off on, you know, maybe there are new terms, there are slang terms, maybe it's multilingual, maybe it incorporates multiple languages. All of this now can actually be learned and then served, integrated into Marqo, to provide much, much better results. - So I'm curious, especially with your background at Stitch Fix, about user input, which is a funny thing: like, let's just try it, let's see where they click. It's like digitizing the retail interaction you'd have in store, right? So I'm interested in maybe some applications there for your technology, where instead of just running a database, we store the data in a way that we can interact with it, and then you have that higher quality, more precise data. Do you think that's important for the future of search, especially in, you know, AI? - Yeah, I think the data, again, is incredibly important. It's even more important now, because the AI is trained on a particular set of data; it'll be trained on a particular sort of style of data. And so what can happen, and this is a long-known problem in machine learning, is you can have these sort of distribution mismatches. So if the machine learning models were trained on a particular type of data and now start to see different data, they may behave slightly differently, or a bit worse.
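One common recipe for the kind of fine-tuning Jesse describes (an assumption for illustration, not a detail he confirms about Marqo's product) is training the embedding model with a triplet loss over (query, relevant item, irrelevant item) examples mined from a store's own behavioral data. A minimal sketch of the loss itself, with toy vectors:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pulls the positive embedding toward the anchor and
    pushes the negative at least `margin` further away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to relevant item
    d_neg = np.linalg.norm(anchor - negative)   # distance to irrelevant item
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: a query, a relevant product, an irrelevant one.
q   = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
neg = np.array([0.0, 1.0])
print(triplet_loss(q, pos, neg))
```

In practice this loss would be backpropagated through the embedding model over many such triplets, so that domain terms, slang, and multilingual queries land near the right products; the snippet only shows how the loss behaves for one triplet.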
And so data quality is never going to go away. You've got to be fanatical about data quality, and that will always pay dividends. - So you talked about the history, and you looked at multiple solutions. Can you give us a brief overview of what types of solutions you tried or purchased? And then I'd also love for you to bring up some of the dreams you had about what you could do with search that were just impossible. And then Jesse, I'd love to hear which of those things Marqo addresses. - Yeah. So my history with search actually goes way back. There's an open source solution called Apache Solr that, at a previous job, we used to power the search for the app we were building at the time. I was part of the admin team, so we were doing the mappings and trying to run it. - Building all the indexes. - Yeah, re-building indexes in the middle of the night because things broke. It was so much work. And then we moved to Elasticsearch, which felt like a revelation: oh, this is nice, this is better. - Yeah. - And then Amazon started hosting that for you, so there's been this progression. Then I moved into e-commerce and we moved to Shopify. I thought, okay, let's look at Shopify's built-in search. We asked around, we talked to people, and nobody uses it. Then I tried it out and was like, okay, I understand why nobody uses it. But that discovery process was so surprising. There are some entrenched e-commerce search solutions that have been around for a long time and just do e-commerce search. And they're like, oh, we're gonna add Shopify support, 'cause it's the new thing, but they still have that older model.
The schemas for a lot of them are very rigid: we need you to put color here and size here, whatever your parameters are. And I mean, we spent thousands of hours cleaning up data and fixing schemas for a search tool, basically. - Wow. Did you look at, like, Algolia? - We actually started with one of those more bespoke vendors, then moved to Algolia, which was better. It was a little more flexible; you can feed Algolia actual behavioral event data and do a lot of things with it. And then a couple of years in, they announced all these new AI features. So we turned them on: an AI query-prediction feature, insights into what people are typing in, tools for tuning the results. And we had, I think, over a million visitors a month; it's not a small-traffic site. And the insights were barely useful. There wasn't much there. They had, like, synonym recommendations, where somebody would type one thing and it would suggest another. It was just kind of a disappointing experience. And I'm sure it's developed more in the last couple of years. But I think the biggest disappointment was around discoverability. If I knew a keyword that was in the name of the product, it worked. If I knew the part number, it worked. But it never felt intelligent. - So describe a user problem to me. And then, Jesse, I'd love for you to weigh in on it. So there's the keyword in the name of a product, but a lot of times, especially 'cause you were dealing with, like, water filtration parts and stuff, people are describing their problem. - Right. - Right? I want to filter this out of my water. - Yeah. - Right? Was that like a...
- Yeah, describing the problem was one of the biggest ones. In that space it was, does this work with X? - Oh, interesting. - It's all relationships. You're trying to connect this pump to this tube, and you're like... - Back to the water. - Yeah, right. - But, I mean, that was hard. These data models don't capture fits-with or compatible-with relationships, and it just becomes this web of relationships. I think that was probably the most difficult problem: people asking, will this work for me? Or you get into materials, a question like, will this work at this temperature? And it wasn't specifically listed as a property. - And the long tail of edge cases and contingencies is the same? - Yeah, but it might have been buried in a description somewhere, you know? - Right. - Yeah. At least for us, most of the long descriptions had pretty valuable information, but it was basically inaccessible from search. - Wow, all right, Jesse, solve our problem here. Solve our problem. - Yeah, that's right. There's a bit to unpack there. The first problem, like you described, was just the difficulty of using even keyword systems, before any machine learning is involved. And so one thing we've focused on a lot with Marqo is making vector search technology really accessible to developers: taking away a lot of that maintenance and back-end stuff that's really hard to manage. That's part of the value proposition, that we take care of a lot of that. So the developer experience, being able to get up and going, has been a real core focus.
And then, like you say, in terms of different problems: with keyword search, if you know exactly what you want, it's fantastic, because it's literally finding the exact same phrase. But a lot of times you don't know what the correct language is. You don't know how to articulate it. Maybe it's a question. Maybe the answer is buried in the description. And what we've seen with these machine learning based techniques, particularly vector search, is that you can basically define your own relationships in terms of what's similar. So when people start asking questions, or querying in some very different way, you can actually learn these mappings. It's very flexible about what you actually define as being similar. With search, someone puts in a query and gets back products, and these have this natural relationship of similarity. And you can learn all this similarity from past interactions too. It's so powerful in that you can define what is similar; there's no canonical "this is similar to that". You define it through these relationships. And that enables you to ask questions, really anything you want. So it's incredibly powerful. - Yeah. One of the unlocks for me in doing all this search research is that a recommendation is a search executed on your behalf, without input from you. Which seems obvious once you think about it. But we weren't that interested in the search problem. - Yeah. - It's just not something that... - Yeah, I've never framed it that way, but it makes total sense. - Yeah, exactly.
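The "recommendation is a search executed on your behalf" idea can be made concrete: the same nearest-neighbour lookup serves both, and only the source of the query vector differs. This is a hedged sketch with tiny hand-made vectors; in a real system both vectors would come from learned embeddings.

```python
# Search and recommendations as two sides of the same coin:
# one nearest-neighbour function, two ways of producing the query vector.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Invented 3-d "embeddings" for a toy catalog.
catalog = {
    "t-shirt":   [1.0, 0.1, 0.0],
    "jeans":     [0.9, 0.2, 0.1],
    "drill bit": [0.0, 0.1, 1.0],
}

def nearest(vec, exclude=()):
    return max((k for k in catalog if k not in exclude),
               key=lambda k: dot(vec, catalog[k]))

# Search: the vector comes from the user's query.
query_vec = [1.0, 0.0, 0.0]            # pretend embedding of the query "shirt"
search_result = nearest(query_vec)

# Recommendation: the vector comes from past behaviour (mean of clicked items),
# i.e. a search executed on the user's behalf, excluding what they already saw.
clicked = ["t-shirt"]
profile = [sum(catalog[c][i] for c in clicked) / len(clicked) for i in range(3)]
recommendation = nearest(profile, exclude=clicked)
```

The only difference between the two calls is where the vector came from, which is exactly the fluid transition between search and recommendations that Jesse describes next.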
I think that's right, and not many people have quite realized that search and recommendations are really two sides of the same coin. Especially in e-commerce, when you've got a vague head query, maybe it's just an item of clothing, like "t-shirt", there isn't one result. It's not like information retrieval, where you're asking a question, what is the atomic weight of gold, for example, which has a very specific answer, and you might only have one thing that matches. You've got this degeneracy where there are a lot of potential matches, so it's like a recommendations problem. But then you segue into more verbose queries, which may only have one match. So there's this fluid transition between recommendations and search, and being able to think about these different queries and what they actually require is definitely the right approach. - Yeah, I'm really curious, especially for, let's call it specialty e-commerce applications, where some domain knowledge is required to purchase the right thing. Let's talk car parts; maybe that would be a good example. How far away do you think companies are from combining a knowledgeable model, one that knows about cars and car parts, with the product data they already have, to help people navigate a site? What does that look like currently, and what do you think it looks like a couple of years from now? - Yeah, it's very interesting. I think at the moment we've still got quite early methods in terms of what we can do here in terms of understanding. At the moment it's still very much systems with different pieces: you might have an embedding model that knows particular things.
Maybe you couple that with a language model that knows certain different things. And so that's the current state of play. And depending on how you want the results to be displayed, or what language model you have on the outside, maybe you just have raw results. But if you're asking it to understand deeply, say you've got ten results and you're asking it to distill those into an answer for a customer, the language model itself at some point has to actually understand a lot about the domain, depending on what it is. It can't just summarize; it actually has to understand the differences. So I think what we'll see is probably much more end-to-end training, for example. I think we're already starting to see this, where the embedding models and everything else are really informing each other; these aren't necessarily done in complete isolation. So you can get a system which is domain-specific as a whole, not just in individual components. Because these things feed into each other: the results from one thing feed into the other, and if you do have issues in one component, they just flow on through the system. So being able to really optimize end-to-end, I think, is where we're going: systems where language models and embedding models can be optimized together, and then potentially things like the storage component actually living inside the machine learning models, the large language models.
And so going forward, the vector database may be much more tightly integrated with the large language model, for example. - Because, back to what you were saying earlier, Eric, you basically just want to replicate that highly knowledgeable, in-person customer sales rep, that person you go and talk to who helps you. Like, you've been a plumber for 20 years and you know everything about plumbing: someone describes a problem and you know just what they need. That's what you want. - Yes, exactly. - But we're really far away from that. - Yeah, it is interesting. You mentioned car parts; I was thinking, I have a hobby of working on Land Cruisers, which are very popular in Australia. And searching for stuff is phenomenally difficult, because even if you have a base model number, the part interchangeability varies pretty significantly across sets of years, right? So searching for parts is so difficult, and you end up going through these forums, combing through message threads, like, is this the right part number for my specific thing, you know? Which is wild. It really shouldn't be that difficult, because the thing that's crazy to me is that all of that information exists and is actually pretty available. It's not a mystery. It's just that as a human, you have to go through and create these explicit relationships in your own mind, gleaned from Reddit threads and forums. - Yeah. - Yeah, and I think that's what's most exciting about a lot of the current wave in AI and machine learning: we now have much better methods to use a lot of this kind of data that exists and is really relevant, but is unstructured. Previously, we had to really explicitly define these relationships.
But now we've got this ability: if we know data is relevant, we can start to incorporate it, learn from a lot of these other relationships that exist, and bring that into your search system. And that's really powerful, being able to use a lot of this other data which was previously really hard to use. - One other thing on this topic: we also had a ton of really useful data locked up in PDFs, manuals, how-to guides. How would AI and AI search potentially unlock some of that data? - Yeah, I think there are a few different ways. One of the things we focused on with Marqo was exactly that problem: you've got so much data which is unstructured and basically inaccessible. I think the number is something like 80% of the world's data is unstructured, and it's growing at an exponential rate. One of the ways we founded Marqo was to think about the invariants. So much has been changing in AI, so many new models, everything is changing. So how do we build a business, and how do we solve problems, based on the invariants? We know unstructured data is a huge amount of data, we know it's going to keep growing, we know that people need to search it and need relevant results. So that's one way we've been thinking about the problem while building Marqo. Vector search in particular allows people to search across this unstructured data in ways that previously were impossible.
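A common first step toward making a PDF manual searchable, once the raw text has been extracted, is to split it into overlapping passages so each passage can be embedded and indexed on its own. This is a minimal sketch of that chunking step; the chunk size and overlap are illustrative, not recommendations.

```python
# Split extracted document text into overlapping word-window chunks,
# so each chunk can later be embedded and stored in a vector index.
def chunk(text: str, size: int = 8, overlap: int = 2):
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + size]
        if piece:
            chunks.append(" ".join(piece))
        if start + size >= len(words):
            break
    return chunks

# Hypothetical snippet of text pulled out of a product manual PDF.
manual = ("This cartridge is compatible with model X100 housings. "
          "Replace the filter every six months. Operating temperature "
          "must not exceed 40 degrees Celsius for safe use.")
passages = chunk(manual)
```

The overlap means a fact that straddles a chunk boundary (like the compatibility statement discussed earlier) still appears whole in at least one passage, so vector search over the passages can surface it.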
And now, not only can you search across it, but like we just discussed with all the latent data that exists in forums, you can not just search it, you can actually incorporate it into a domain-specific model as well and really understand it. - Okay, Jesse, I want to dig into the technical side a little bit, because, you know, John mentioned earlier that you can have a packaged AI application: you send it data, it sends results back. It has its own embedding model, its own deep learning model, LLM, whatever, right? You talked about Marqo as a system, and now with this latest product launch, which is very exciting, you can bring your own first-party data to inform the system. But you mentioned something earlier that I think is really interesting, and I think it's going to be really helpful, especially for me, and I'd love for listeners to walk away with a better understanding of it: the embedding model, and then the language model, and the things you would want to customize in each of those. What are the separate concerns across the embedding model and the language model that you need to think about, and how does that relationship work in Marqo? - Yeah. So one thing we've done, particularly with the new product launch, is take quite a holistic approach to the way we build these systems and optimize these models. The first consideration, probably on the embedding side, is that we can optimize it in a way that actually mirrors what's being used in production. So it's not done in isolation, in terms of what data is being used or how the data is structured.
You start to see that people have particular data: they might have reviews, titles, descriptions. We know these are often used. Sometimes fields are missing as well, so you actually have to be robust to missing data. I think that's one of the key things. The current paradigm in vector search is that you have one piece of information that gets turned into one vector, and you just search over that. But of course, that's pretty naive compared with what customers and users actually have: they have multiple bits of data, they might have some fields and not others. So the first piece is really trying to optimize the models around what actual customers need and have: the use cases, the data structures, how it will be used in the system. And then also customizing the models in terms of not just the data structures and how they'll be used, but the outcomes people actually want. What are they actually trying to do, are they trying to improve a particular aspect of the business? So it's being able to use a lot of that domain-specific data to actually optimize directly for business outcomes. That's how we're thinking about it, quite holistically, from this optimization perspective. And this then plays into the different pieces, like you mentioned: the LLMs and the embedding models. On the embedding side, it really needs to understand how something that comes in as a query maps to the information that's in the database, and how you create that relationship in a way that's going to retrieve the relevant results.
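The "multiple bits of data, some may be missing" point above can be sketched as a per-field embedding combined into one document vector, skipping absent fields. The field weights and the toy 2-d `embed()` are invented for illustration; a production system would use learned embeddings and tuned weights.

```python
# Build one document vector as a weighted average over whichever fields
# (title, description, reviews) are actually present, so missing fields
# degrade gracefully instead of breaking indexing.
def embed(text):
    # Toy 2-d "embedding": (word count, average word length).
    words = text.split()
    return [float(len(words)), sum(len(w) for w in words) / len(words)]

FIELD_WEIGHTS = {"title": 2.0, "description": 1.0, "reviews": 0.5}

def document_vector(doc):
    total, acc = 0.0, [0.0, 0.0]
    for field, weight in FIELD_WEIGHTS.items():
        value = doc.get(field)
        if not value:              # robust to missing fields
            continue
        vec = embed(value)
        acc = [a + weight * v for a, v in zip(acc, vec)]
        total += weight
    return [a / total for a in acc] if total else acc

full = document_vector({"title": "blue jeans", "description": "slim fit denim"})
partial = document_vector({"title": "blue jeans"})   # no description: still works
```

Both documents get a usable vector; the one with more fields simply blends more signals, which is the robustness-to-missing-data property described above.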
And then on the outside, if you've got a large language model, these are used in many different ways in search, both on the input and the output. On the input side, you can use them for attribute extraction, data enrichment, data cleaning. And on the output side, you can use them to synthesize a set of results and actually reason over them. So depending on what you want to do, it will depend a little bit on what you need from the language model: how much domain knowledge does it need, does it simply need to summarise, does it need to reason about the results? If you start to go beyond pretty simple summarisation and extraction, then these language models themselves will have to become somewhat domain-specific and actually start to understand a lot of the nuance of the field. - Makes total sense. Okay, so this may sound like a funny question. We talked about a basic search index, where I know what I want, I know the keyword, so I search for the keyword and I get the result. It's great, right? And then we talked about some of these challenges that are much more nuanced, that are very difficult: you have these relationships that the user's not going to make explicit, where we need to infer a lot, or learn from the inferences that we're uncovering. What are the use cases where maybe vector search is not a good fit? Are there cases where you wouldn't necessarily use it? - Yeah. I mean, if it's very explicit; I think part numbers are a good case there, where you just want an exact match on a part number and nothing else. I think that's a great example. - Don't infer a relationship to the thing.
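The part-number case just raised is really a routing decision: an exact identifier should bypass semantic retrieval and hit a lexical exact-match lookup, while natural-language questions fall through to vector search. A minimal sketch of that router follows; the part-number pattern is an invented example, not a real catalog's format.

```python
import re

# Route queries between exact-match lookup and vector search.
# Hypothetical part-number format: 2-4 letters, a dash, 3-6 digits.
PART_NUMBER_RE = re.compile(r"^[A-Z]{2,4}-\d{3,6}$")

def route(query: str) -> str:
    """Decide which retrieval path a query should take."""
    if PART_NUMBER_RE.match(query.strip().upper()):
        return "exact_match"      # no inference: return literal matches only
    return "vector_search"        # semantic retrieval for questions, descriptions

print(route("HZJ-10542"))
print(route("does this fit a 1997 Land Cruiser?"))
```

Jesse's point below is that future systems should absorb this routing into the search system itself, so the caller never has to write it by hand.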
- The auto-parts use case. - Yeah, exactly. Don't make an inference. - That's right. Where it's, you know, I need to find this exact thing. But I think many of the shortcomings of vector search at the moment really come down to the model just not being appropriate for the particular use case, which is also exactly why we developed this new product where we can fine-tune embeddings on the custom domain, because that really aligns the model with what the user intent is. But in the future, we're seeing an evolution as well. We're still in the early days of vector search use. We've got these mostly single-vector representations of data, but of course that's pretty naive. So moving beyond a single representation into multiple representations, and having much more intelligent query models as well, I think we'll be able to keep the benefits of keyword search: a part number comes in, the system knows it's a part number and should just return the exact matches; and this is a question, okay, now we can default to different behaviors. So in the future, I think we'll be able to absorb effectively all of these things with vector search, and the system will know and understand exactly what needs to be done. - Yeah. I mean, that was kind of a trick question, because I want a world where the only search is well-tuned search, because if you've experienced it done really well, it's so much better. And I'm trying to think of how to describe it.
It's one of those experiences where you're just like, yes, this is how it should work. I guess it decreases the mental load so much, and it feels so intuitive, that it just feels natural. Almost anti-climactic, you know? I don't know. I mean, what do you think? - The thing that got me recently on this topic was, let's see, ChatGPT has been out more than a year now. And using voice search with Siri or Google, it feels so bad. - Really? - Like, if you type into ChatGPT, or use its little voice feature, versus how somebody might normally use, say, "Okay Google" or Alexa, it's a markedly different experience in accuracy. That was something that really struck me recently: all of this happened just in the last year or so. - Yeah. - And surprisingly, they haven't really updated those two particular things. I guess that's coming eventually. - Yeah, well, what do you think, Jesse? They've got the ability, right? - Yeah, it's pretty interesting. I think that's why search is quite interesting, because there are obviously some incumbents there, but with this wave of AI, they've got an existing business model and revenue to protect. They can't really hard-pivot on short notice, so I think we're seeing some inertia there. And they obviously have to work out where to take a big bet in terms of where the future is going. But I think there's also a lot we don't see, which is that they've got a particular business model and they're optimizing for particular things.
And so something you don't always see, from a search perspective, is the incentive behind the results being provided, right? Someone who's got web-scale search lives off ads. They're selling ads in search, and the way they serve results is going to be dominated by those business objectives. That's what the whole system is being optimized for. So that's one thing to consider as well: what the incentives of the search provider are, because that will dictate a lot of how these things are done. Do they deliberately serve different results? - Yeah. And I guess that's why I was slightly optimistic about voice, as it's not really monetized yet, right? - Yeah. - Maybe they'll innovate there faster. - Yeah. - Yeah, that's such a great point. The erosion of quality search due to revenue, in web search especially. Well, we're really close to the end here, Jesse, but I want to ask you: we talked about the hype, we talked about the over-bullishness, we got a great breakdown of vector search and all the exciting things there. What other things in the AI space are you personally excited about? You're building a company in this space, but what excites you as you look at the landscape? - Yeah, I think it's really exciting to see the evolution of large language models, particularly into different modalities. So large language models that are able to hear, see, and obviously use natural language, but then using them in a way that they can act as an agent or a controller or a critic.
So you can now actually put these language models inside systems where they can make evaluations, they can route logic. This is something that's quite powerful and really hard to achieve otherwise. One of the great things about these models is that they've got this natural language interface. It's kind of lossy, and sometimes you do lose a lot of nuance with it, but it's also incredibly good at gluing together all these different interfaces. So it's this interface layer that allows you to connect audio, language, video into one thing, and then it can go into a database, for example. Being able to unlock the language models with these different sources of data, and then being able to take actions or produce outputs, is really exciting from my perspective. You can think about it from a search perspective: you can have a system that could be optimising itself, right? You can have a language model that knows what good search results would be, say, for your domain. You can literally send it off and it starts to collect data, judge search results, optimise the system. So moving in that direction, I think, is incredibly exciting. - Yeah. I feel like we're going to enter this phase where, when you talk about the history of search, going all the way back to, like, open source Solr and all that sort of stuff, what's interesting is that that was, like, an entire decade. My sense is that search is going to advance a decade's worth in a drastically short amount of time. And I think it's because of what you're saying, Jesse. - Yeah, it's going to be very interesting.
I mean, hopefully it does get better rapidly, and everyone's using Marqo, you know, to solve the perennial frustration of searching. It's going to be fascinating to see how it goes, and I'm incredibly excited, just to add on to that last question, about the future of interfaces as well. We're seeing these evolve a lot. Vector search is really powerful, and if we can move away from this idea of punching in a couple of keywords and pressing enter, and actually think about how we interface with these search systems, the search engines and the experiences will be much, much better. So I think that's very exciting: how can we actually leverage these new experiences with this new technology as well? - Love it. And, I should have asked this earlier, where can people go to see Marqo and try it out? - Yeah, it's marqo.ai. You can go on, we've got live demos on the site, you can check it out there. And GitHub: we've been open source, Apache 2 licensed, since the start, so you can spin it up on your laptop and get going. We've got a really good quick-start guide; you can build your first vector search experience in literally a couple of lines of code, image search and so on, and really experience those aha moments. When you first experience it, it can be quite magical. We've had some customers spin up the end-to-end system for the first time and start searching with emojis, and all of a sudden, for a cat emoji, it's returning pictures of cats. They're like, oh my God, this is amazing.
It understands this stuff. So yeah, head over to marqo.ai, or to the GitHub, which you can get to from the site. - Cool. All right. Well, Jesse, thank you so much for giving us your time. I learned a ton, and we'd love to have you back sometime in the future. - Yeah, thank you very much. That'd be my pleasure. - The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com. (upbeat music)