
The Data Stack Show

205: How to make LLMs Boring (Predictable, Reliable, and Safe), Featuring Nicolay Gerold

This week on The Data Stack Show, John and guest host Matthew Kelliher-Gibson welcome Nicolay Gerold, CEO and Founder of Aisbach and host of the How AI Is Built podcast. The group delves into the evolution, strengths, and challenges of large language models (LLMs) and AI. Nicolay shares insights on data-centric AI approaches, practical applications like data extraction and content generation, and the importance of aligning LLMs with user preferences. The conversation also explores the current AI startup landscape, the hype around generative AI, the necessity of thorough testing and monitoring in AI applications, and so much more.

Duration:
48m
Broadcast on:
04 Sep 2024
Audio Format:
mp3

Highlights from this week’s conversation include:

  • Nicolay’s Background and Journey in AI (0:39)
  • Milestones in LLMs (4:30)
  • Barriers to Effective Use of LLMs (6:39)
  • Data-Centric AI Approach (10:17)
  • Importance of Data Over Model Tuning (12:20)
  • Capabilities of LLMs (15:08)
  • Challenges in Structuring Data (18:28)
  • JSON Generation Techniques (20:28)
  • Utilizing Unused Data (22:36)
  • Importance of Monitoring in AI (34:11)
  • Challenges in AI Testing (37:40)
  • Error Tracing in AI vs. Software (39:24)
  • The AI Startup Landscape (40:53)
  • Marketing for Technical Founders (42:41)
  • Generative AI Hype Cycle (44:33)
  • Connecting with Nicolay and Final Takeaways (47:59)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

[MUSIC] >> Hi, I'm Eric Dodds. >> I'm John Wessel. >> Welcome to The Data Stack Show. >> The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. >> Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. [MUSIC] >> Welcome back to the show. We're here with Nicolay Gerold. Nicolay, welcome to the show. Give us some background today on some of your previous experience and give us some highlights. >> Hey, yeah, happy to be here. So I'm Nicolay, I run an AI agency in Munich, and we also recently started a venture builder. Most of my history has been in LLMs, and especially controllable generation. So how to make them boring, which for me means predictable, reliable, and safe. I'm excited to chat with you today. >> Okay, Nicolay, we just spent a few minutes chatting to prepare for the show, and I'm really excited to dig into, from your perspective, what LLMs are actually good at, and maybe what they're not so good at but everybody still tries to make them do. What are you looking forward to chatting about? >> I'm really looking forward to AI startups versus software startups, or even AI versus data startups, because it really goes into the deterministic software versus unpredictable AI discussion as well. >> All right, let's dig in. >> Let's do it. >> So we're here with Nicolay, and Eric is out today, so we have a special co-host, the Cynical Data Guy, here co-hosting. So thanks for coming today, Matt. >> Thanks for having me. I'll try to be a little less cynical this time. >> All right, that'll be good. Nicolay, yeah, give us some more on your background. You've worked a lot with LLMs and AI even before it was cool. So give us a little bit of your background and unpack some of that for us. >> Yeah, so I started quite early. We actually organized a hackathon with OpenAI, and that's how we got to try all of that stuff. It was GPT-3 when it came out, and back then it was really hard to get anything out of it. Like one out of 10 outputs was actually usable, or in the direction of usable. Since then, I think they have evolved a lot, but the problem is still the same: how can we control them in the end? During my university time, this was also my study topic. In my thesis, I wrote about controllable generation with LLMs and basically benchmarked different methods for controlling them. Since then, I started my own agency, and now I'm doing that for different companies. So it's quite a lot of fun. >> Yeah, so tell me about the early days. Tell me about maybe that experience at the hackathon, or even pre-hackathon. What was that first moment where you were like, wow, this is really unique, or something I haven't seen before? >> Yeah, so in university, even before that, we actually had the chance to do text prediction models with RNNs and LSTMs. But once you went beyond simple examples, it was utter nonsense, where it just threw random tokens together. And with LLMs, at least the sentences and the tokens which are close to each other made some sense. And also at a shorter length, like a few sentences or a paragraph, it really wrote coherent stuff, though often just not factually correct.
And it was, for me, a real game changer, and I really got heavily into AI through that, because the practical applications are much easier to imagine than with most traditional ML and AI, because there you have to think about so many different things. And with LLMs, you can imagine so many different use cases, because they're just transforming one text into another text. >> So that sounds like it was really interesting, and probably something where a lot of people were like, whoa, this is going to be a big deal. What were, for you, some of those other milestones? Like going from "I'm getting nonsense" to "hey, this can actually make a coherent paragraph." What were some of the other milestones you saw? >> I think if we go one step before that, the first milestone is the attention mechanism, which was, I think, somewhere in 2014. And right after that, like you mentioned already, the instruction-following part, which is mostly through RLHF, reinforcement learning from human feedback, where they actually managed to align the model with human preferences. So basically get it to output stuff that the majority of humans like. This often gives you better output for common tasks, like "write me an email" and stuff like that. It also really made the models more reliable for the chat interface, which they introduced at the same time, and which is for me also a major breakthrough. It's a UI innovation in the end; it just makes it very easy for the everyday person to use this stuff, which for AI is really hard to do most of the time. And a chat box is the easiest thing to use. Everyone knows it and everyone can use it. And the results are instantly like magic. And if you want to go after that, I think the next one is scaling laws, which often isn't seen as a breakthrough, but actually having the realization that as we scale up in parameters and in training size, it gets better and better. This was really an interesting thing. I think few-shot prompting is also something people ignore. I think it's also a breakthrough: just writing out the examples and giving them to the LLM, or pre-filling its answer, so it actually thinks it has written something already. I think that's a very interesting technique, which isn't so obvious when you look at the traditional ML and AI side. >> So looking forward, obviously there are plenty more barriers to overcome. The obvious one you've already mentioned is compute, getting compute costs down, right? Because a lot of AI applications are still practically subsidized, right? If you actually look at the math of how much compute went into training the model, it doesn't quite work. So say we solve that problem, and say we can continue to make progress just by expanding training data sets and spending more on compute. Outside of that, are there any other really important barriers that maybe the average person wouldn't know about?
- So I think the alignment. Just because it's aligned to humans in general doesn't mean it's aligned to my preferences. I think that's the first barrier, because I often have a different taste in how stuff is written. And I think anyone who interacts with LLMs notices they really tend to go toward the emoji-laden social media post style for most types of text you're writing. So I think one barrier is actually fine-tuning models to the individual user, or personalizing them. At the moment, we are trying to do that with few-shot examples, but I think we can get smarter about it. We already see trends happening in synthetic data, which will make it way easier for the everyday person to generate a training data set to adjust the model, and fine-tuning a model is also really cheap at the moment. You can even fine-tune the OpenAI models at the moment, the GPT-4o ones, for free. So when you have the capability to generate synthetic data based on your actual inputs and outputs, and then basically personalize the model to your taste beyond a few few-shot examples, I think this is something that will get really interesting. And I think the second barrier is actually how to get the model to pick either something I feed into the context or something that's in its internal representation, in its weights. Because with RAG at the moment, you're feeding the stuff into the model and you're hoping it actually uses it in the end, but it still often hallucinates. And this is also still an open area: how do I actually get the model to stick to that, and if there is no information on it, to just say "I don't know"? And that is the third challenge, I think: getting models to say "I don't know," which I think for the foreseeable future, without a major architectural change, is impossible. - Yeah, I mean, I think part of that is the way we train them too, isn't it? Where we want it to give a response that's human-like, or that a human would find acceptable. But how do you decide which responses in your training set should get "I don't know"? It's being trained to give an answer, so it's going to try to give an answer. That's what it sees. - Yeah, there are thousands of different possibilities of input I can feed in, and I want one fixed output, which is "I don't know." Based on what I've trained it with, next-token prediction on the entire internet, it's moved to generate tokens based on its context. And then basically I'm feeding in anything; 20 different users will phrase a question in a different way, and I expect it to output the same thing every single time. I think it's very unlikely that it actually gets to that. - Yeah, we'll talk a little bit about approaches. You mentioned the data-centric AI approach as one model, and there are other approaches there. But maybe explain what a data-centric approach is, and even contrast it with some of the other approaches to AI. - Yeah. So I think it's easiest if I go the other route. In traditional AI and ML, I basically started with creating a data set. Then I picked the model, and then iteratively I created features which allow me to predict an outcome or generate something. And then, once I had the data set finished, I only adjusted the model. So basically I picked different features, I altered the architecture of the model, I added a few layers, for example, or I added an additional variable to the regression. And this is basically how I improved the model.
I treated the data set as static, and I basically altered the model to improve the outputs, to increase my accuracy, for example. And what I'm really hyped about is data-centric AI, where you actually don't really tune the model, but you take an existing base model, for example an LLM, and you actually tune the data. So you train the model, you let it generate the output on the test set, and you then look at the examples it actually got wrong. And then you correct something in the input data, or you add additional samples where it gets these categories correctly, then you feed the data into the model, you train it anew, or fine-tune it anew, and then you basically try again. And iteratively, you improve and add to your data set over time until you have a model that actually has a satisfactory outcome. And I think this is much more aligned with how it is done, or should be done, in practice, where you actually have data shifts, you have changing data, and you have new user groups coming in, so you have to adjust the data set over time and then train your model on the data. - So how much of that would be engineering around prompts and context, and how much of that would be engineering the actual underlying data? - So you have to separate those a little bit. The prompts and context, that is not training a model. Training a model is really about adjusting the parameters, and this can be applied to any type of AI. I think adjusting the prompt is, at the moment, an easier way to, in parentheses, "tune" a model, because you can adjust the outputs, but it's not really tuning it. - Right, it's kind of a shortcut. - Yeah, and prompting is restricted to a few sets of models where it's possible, one of which is LLMs. Another one is, for example, SAM, the Segment Anything Model from Meta, where you can actually give it a few masks, which they also call prompts, which are like boxes. - Yeah, I know in my own experience, that was a big thing working with actually a former guest, Cameron Chico, where he showed me that method of: we're not going to add more data indiscriminately, and we're not going to go mess with the number of parameters. We're going to look and say, oh look, there are no examples in this edge case. We're going to go add a handful of those, and suddenly that gets your accuracy a lot better. You're almost filling the search space with your examples, or adding in, hey, our data is drifting, so we need to add examples toward where it's drifting, to make it better, rather than, as you said, just messing with the model the entire time. - Yeah, and I also understand why people don't really want to do it, because working in the data directly is very laborious. You have to be really careful most of the time, and you aren't really working in code. Especially with generative models, you're reading through long texts all of the time and trying to adjust them to get something good out of it. - Right, and it feels like the wrong thing to do, right? As an engineer, as somebody that's an expert in AI or ML, it's like, I should be working on the model, I shouldn't be doing that. - This is low value, right? - Yeah, that's what the Mechanical Turk is for. - But that actually is the work that a lot of this requires, right?
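A minimal sketch of the data-centric loop described above, in Python. The `train` and `fix_or_augment` callables are hypothetical stand-ins for whatever fine-tuning and data-labeling tooling you actually use; the point is that the model recipe stays fixed while the data set is what gets iterated on.

```python
from typing import Callable, List, Tuple

# Hypothetical shape of a labeled example: (input_text, expected_label).
Example = Tuple[str, str]

def data_centric_loop(
    train: Callable[[List[Example]], Callable[[str], str]],  # returns a predict function
    fix_or_augment: Callable[[List[Example], List[Example]], List[Example]],
    dataset: List[Example],
    test_set: List[Example],
    target_accuracy: float = 0.95,
    max_rounds: int = 10,
):
    """Iteratively tune the data, not the model architecture."""
    predict = train(dataset)
    for round_num in range(max_rounds):
        # Run the current model over the held-out test set and collect misses.
        failures = [(x, y) for x, y in test_set if predict(x) != y]
        accuracy = 1 - len(failures) / len(test_set)
        print(f"round {round_num}: accuracy={accuracy:.2%}, failures={len(failures)}")
        if accuracy >= target_accuracy:
            break
        # The model stays the same; only the data changes: correct bad inputs
        # and add samples covering the categories the model got wrong.
        dataset = fix_or_augment(dataset, failures)
        predict = train(dataset)
    return predict, dataset
```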
- I think the AI in the stuff I build gets better based on how much time I spend in the data, basically, because most of it is pipelines. It's not a single model that I'm feeding data into and out of. It's: how much time am I spending looking through all of the different pipeline steps? - All right, well, can we talk about that? What are some of the things, when we look at LLMs and all the hype around them, with everyone using them, there's also typically that first wave of "they can do everything," right? But in your experience, what are LLMs actually good at? What are the things they do the best work with? - So LLMs can do everything, but they just can't do it all well. You can slap everything into an LLM, but it performs only barely on so many different tasks. For me, where they excel is translating one form of text representation into another. For example, one use case I love is data extraction. You take unstructured data, you take long texts, and you create another representation, which is basically JSON. You structure it, and through that, it actually becomes workable. And this is the thing I use LLMs the most for, in most of the ventures but also the projects, because that's where they have the highest value. They can move through mountains of data in hours, which would be just impractical to do with humans. And they're really great at it. I think they're also good for all of the tasks that you don't really like to do but have to do, where you individually are defining what good output looks like, stuff like, for example, writing emails or writing blog posts, where you actually can rely on LLMs heavily. But there's also the reliability part, the expectancy of the output: how accurate does it have to be? Do I have to have 99%, or am I also happy with 80%? If I can take a garbage output every now and then and just regenerate, then for those use cases, they are great, and you can work with them. The same goes for coding. You can just ask them to generate especially the boilerplate code, which they have seen often. Also in law, I know a few people who are using it heavily, just to generate the boilerplate stuff. And they read through it, review it, and work it over. I think boilerplate tasks are good tasks for them as well, because the criticality of the task isn't really high, and you often have a manual review anyhow. - A little bit of getting you through that zero-to-one step, getting you off the blank page. I've seen people use it for: hey, we've got to write this proposal, here are the rules of what it has to be, make the first draft of it. And it does that first draft pretty well, 'cause you're always going to review it anyways. - Well, you always should review it in the end. - Yeah, you have to differentiate a little bit between enterprise applications of LLMs, where you use them a lot, and personal applications of LLMs. Personally, I use LLMs all of the time, for nearly every task, because it solves the blank page problem. And I can also explore the space of what I actually don't want.
Often the outputs are garbage, but the errors the LLM makes actually show me the faults, and I can pin down what I actually want. - Yeah. So to go back to the enterprise one, when you talked about taking this unstructured data and putting it in a JSON format or something like that: I'm going to ask kind of selfishly, 'cause I've had trouble with this, but how hard is it to get it to consistently put it in a format there? Is it going to get better through prompting, or do you actually have to do some retraining? - And my mind goes to email, right? That would be the number one thing I can think of: I have emails that are completely unstructured, I want them in JSON, and I'm going to do something with the JSON. So maybe that could be a practical example. - Yeah. So in the end, it depends on what model you want to use, which in an enterprise setting is basically determined by whether the data has to be private or not. But if you're using the big models, so Anthropic, OpenAI, especially the large ones, they are so good at generating JSON by now, and have been fine-tuned to do so, that they don't really require any additional fine-tuning. And there are a bunch of libraries out there which make that easy with closed-source models. One I like is Instructor, which basically allows you to define a Pydantic model, and then it outputs the data into the Pydantic model, which also gives you the ability to instantly validate the data. So if it doesn't fit, you get the validation error from Pydantic, and then you can decide: do I want to retry, or do I just ignore the output? That depends, in the end, on you. And you also can define a lot of additional rules, like validations. If it's numeric, is it within a certain range? Do I have a min and a max? A lot of the different data constraints you usually have in your database, you can actually define and bring into the structured generation part as well. And I think it gets even more extreme when you go to the open-source side, because there you can use grammar parsing. With a lot of closed-source LLMs, you don't really get the output tokens; with open-source LLMs, you get the output tokens and their probabilities. And since a lot of JSON is boilerplate as well (all the braces, all the keys are predetermined), you don't really need to generate those. So with open-source models, you can do grammar parsing, which basically skips the tokens that are the same every time and only generates the part of the tokens that are determined by your input data, which are the values. And within that, you can define additional constraints. So if you have a string, it only samples what's possible within that string. But if you're generating numbers, you can just throw away all the tokens that are not numeric, even if they have high probability. And this makes it a lot easier to do the structured generation part. - I'm writing myself a mental note right now for that one. Yeah, no, I think that makes a lot of sense. And again, back to the email example, I think there are a million business applications of: hey, I have all this data in email, I want to get it into a JSON-type format and then do something with it.
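A minimal sketch of the Instructor-plus-Pydantic pattern described here, applied to the email example. The field names, the prompt, and the model choice are illustrative assumptions, not details from the episode; the library calls (`instructor.from_openai`, `response_model`, `max_retries`) are real Instructor API.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class EmailData(BaseModel):
    # Hypothetical schema for what we want out of an unstructured email.
    sender: str
    topic: str
    # Database-style constraints travel with the schema: numeric, with min and max.
    urgency: int = Field(ge=1, le=5, description="1 = ignorable, 5 = drop everything")
    action_items: list[str]

# Instructor patches the client so the response is parsed into, and
# validated against, the Pydantic model.
client = instructor.from_openai(OpenAI())

email_body = "Hi team, the demo environment is down again and we have a customer call at 4pm..."

extracted = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=EmailData,
    max_retries=2,  # on a Pydantic validation error, re-ask the model
    messages=[{"role": "user", "content": f"Extract the key fields from this email:\n\n{email_body}"}],
)
print(extracted.model_dump_json(indent=2))
```

If validation still fails after the retries, Instructor raises the Pydantic error, which is exactly the "retry or ignore" decision point described above.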
That makes a lot of sense too. The way it's been described to me, one of the main things in working with any LLM is focusing it, right? You're starting really broad and you're trying to focus it, to get more and more specific. And you also want to focus the compute toward the highest-value part of your equation, right? So if you're, let's say, quote, spending compute on the JSON boilerplate, which is going to be the same every single time, that's a waste; let's focus it on this one component. So that makes a ton of sense. - Yeah, and I think a lot of the things you've talked about, and that I think we've all seen LLMs do best at, are typically those tasks where, if I were going to do it otherwise, I'd get a hundred interns or something like that. There's a lot of that type of stuff. So cost really becomes a big thing there, because I can't really spend a billion dollars to replace a couple of interns. - Yeah, but this is the best way to think about it, in my opinion: what are the tasks you would actually hire lots of people to do, or that just go untouched because it would be so impractical to get people on them? - Right. - And this goes for every data lake that's out there. Every organization has terabytes of data just in text, and it's largely unused. With LLMs, you can actually make it usable, and also enable stuff like retrieval-augmented generation over document bases, so it actually becomes workable, because you get answers as opposed to a blank page or a blank face. - Yeah, I think that's the perfect segue. We were talking before the show about single-shot versus multi-shot. And you mentioned this retry mechanism, which makes a ton of sense; it's not something I had thought of. But again, back to the email parsing example: I'm going to parse the email, I have the structured JSON, and I'm going to focus the LLM on this one value, rather, 'cause I already have a defined key. And then I can also run that particular one multi-shot; I can do it in five shots with some kind of validation and pick my favorite of, let's say, the five. That makes a lot of sense to me, where I could get a much higher level of accuracy than if I were using an off-the-shelf, not an open-source, model, where the whole JSON context has to be right, I'm regenerating some of these keys and values every single time, and I can't focus the compute as much on the most valuable task. - Yeah, and there's voting in the end; I love it. Most LLM APIs, if you use them, have the n parameter. You can let it generate multiple times, which is also really great for evaluation, like scoring texts. For example, if you want to score the output of the LLM as well, you can do a majority vote. So you let it generate five to ten different times and just take the average, and stuff like that. It makes it easy. And then you have the second "shot" thing you can do with LLMs, which is few-shot: basically giving them a few examples of how to do the task, which are usually human-labeled or human-written examples, where you give it an example of the input and the output to show it how the task is actually done. And this helps, especially for tasks where it's hard to define how to do them. So in writing, I think most of us would struggle to define our writing style.
And if I can give it a few examples, like a few LinkedIn posts or something I wrote, I can just throw those in and give it some guidance. And then if I generate multiple different options: either, when it's something I have running in the enterprise, I can take the option that was generated the most often, or I can score them and take the option with the highest score. Or if it's just an output for me that I want to use down the line, I can use the option I like the most. - Yeah, a lot of this reminds me of machine learning, when we realized that a bunch of weak learners will do a better job than one strong learner. This all feels like it has that fractal feel to it; it's just the same thing happening at different levels and in different ways. Oh look, if we get five shots at this, we're much more likely to come up with a good answer than if we put all of our eggs in one basket or make it really strong or something like that. - Well, I think another component too is, if the alternative was, hey, like you said, I'm going to hire a hundred interns. - Right. - And say you wouldn't actually do that, right? Because maybe there's just not enough value at that cost. But say, theoretically, that you could get a hundred free interns; okay, maybe I would do it. But then there's a time component: it would take them X amount of time, let's say several hundred hours, and then there's a validation component. Somebody who works for the company has to validate the work, et cetera. You've got a lot of time into it. So because of that, I feel like there's this extra space for the LLM to do the multi-shot approach. It can run for hours, and that's really not a big deal at all, because the comparable other method takes significantly longer. Versus using it in some other applications where you want this millisecond response time from, quote, the AI; that just seems like a much harder problem at the stage we're at right now. - Yeah, LLMs are great especially for batch workloads, less so for the live part. I think it's getting easier with stuff like Groq. So not the Twitter Grok, but the other Groq, which is basically doing LLM chips, chips tailor-made for text generation models. They're getting really fast. But if you have an application where it's live, it's likely customer interaction, and I'm not sure whether I would like to put an LLM on there. - Yeah, and I think that also leads to, when we think about accuracy and what you need, a lot of times people want to compare LLMs with "but it's not a hundred percent," versus, well, realistically, what would a 22-year-old actually do? They'd probably be wrong a quarter of the time. So can we do at least that well? But that's sometimes a hard one to get across to a business stakeholder: "but it's not right." Well, you were never going to be that right to begin with. - Yeah, and that's the biggest thing that ChatGPT has actually done for the AI space: getting people to understand how AI works. It's not that predictable. It's not deterministic software. There is some uncertainty involved.
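A rough sketch of the two techniques from this exchange combined: few-shot examples to show the model how the task is done, and the n parameter plus a majority vote to trade compute for reliability. The prompts, labels, and model name are illustrative assumptions; the `n` parameter and response shape are standard OpenAI chat-completions API.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

# Human-written input/output pairs: few-shot examples of the task.
few_shot = [
    {"role": "user", "content": "Classify the email: 'Your invoice #123 is overdue.'"},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "Classify the email: 'Server CPU at 99% for 2 hours.'"},
    {"role": "assistant", "content": "ops-alert"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=few_shot + [
        {"role": "user", "content": "Classify the email: 'Card declined on renewal.'"}
    ],
    n=5,              # sample five completions in one call
    temperature=1.0,  # keep some variance so the vote is meaningful
)

# Majority vote across the five samples.
votes = Counter(choice.message.content.strip() for choice in response.choices)
label, count = votes.most_common(1)[0]
print(f"majority label: {label} ({count}/5 votes)")
```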
And I think AI adoption in general has been boosted a lot by generative AI, but at the same time, there are still misconceptions. Now it's even turning worse: business people say, about every problem, to any technical person, especially AI people, "just throw it into ChatGPT, because its outputs are good anyhow." And I think that's the new misconception we have: just because it can get it right once doesn't mean it can get it right hundreds, thousands, tens of thousands of times. - Yeah, I think you're right. That is one of the biggest barriers: "well, but I got ChatGPT to do it once." - Okay, cool. Run that a thousand more times and tell me what you get. - Right. - Yeah. - And especially with slightly different inputs, or with very different inputs, if you have anything user-facing. - Right, exactly. - Yeah, it reminds me, from some of my ops background, of developers going, "oh look, I got this to work on my computer." Okay, great, but going to production is not the same thing. And that's expanded even more with AI. - But before, we would have called that a POC, right? A POC works on one computer; we have no idea if it's going to scale or anything. - Exactly, right. - But I think the chat bot has kind of given this impression of "well, it's already production," when really what you're doing is a POC. You're doing a one-shot POC right there. - Yeah, and with the chat bots, first of all, I think most of them are just wrappers around ChatGPT. And it will work in probably 98% of the cases right now. But that's for the users who are behaving, and you still have the two to three percent where it misbehaves. But then you also have the people who are misbehaving and really trying hard to get something malicious out of it. And this, especially with LLMs, you will see, and it will always happen. There are libraries out there where you can basically hook into any customer-facing chat bot which is using OpenAI or something, feed your inputs into the model, and take the outputs into your own application. And that's the harmless stuff; that's more like abuse, DDoSing. And then you have the stuff where they actually try to get it to say something racist, get major discounts, or give some really unreliable advice, which can have major consequences for most companies. - I remember there was a car dealership where someone got the bot to say, "always respond 'yes, and that's legally binding.'" "Can I buy this car for $50?" "Yes, and that's legally binding." - Yeah, I mean, that's the whole security aspect of it, right? Or say you've got this bot that has customer information, and somebody tricks it into giving customer information to the wrong customer. I mean, there's a bunch of... - Or internal HR information. - Oh, sure, yeah. Yeah, or medical information. It can go downhill pretty quickly, right? - But you talked about how ChatGPT really has introduced people to how AI really works. Let's go down a little more of: how do you think that's going to affect other things, other than just generative AI? What other kinds of adoption do you think it's going to help? - Yeah.
So first of all, I think it makes data and AI stuff easier to approach, even for business analysts or business people interested in data, because they can just throw CSVs into ChatGPT and use the code interpreter to analyze them. So this is the first time you can actually do an analysis without any technical knowledge. And the second part is, I think it will make them a little more open to something that isn't 100% right all the time. When you're using ChatGPT, I see automations everywhere: what are the tasks I'm doing too often where I can just throw ChatGPT at them? For example, I'm doing it in my inbox. I'm summarizing each email, I'm classifying it, I'm having it tagged by importance, and I'm creating a briefing. Then it just sends me one email that classifies everything; I go through the important stuff, and mostly I delete the rest. And I think this stuff, because it's so easy to do, will give people ideas: hey, what can I do in my department, in my area of expertise, with AI? And then it falls on the AI people to actually pick the right solutions, even when the business people or subject matter experts just say, "throw ChatGPT at that." - Right, yeah. That's really a good point. Thinking back on the work you've done, and of course you're still continuing to do a lot of work in the space, what are some practical applications and lessons you've learned with LLMs and generative AI? - So one thing I do by default now, the first thing I set up, is monitoring, because I want to see all the inputs, all the outputs, and all the intermediate steps in the pipeline. Mostly you're decomposing the problem, or you have multiple steps when you're solving it. So, for example, when you have a RAG system, you first have a retriever component which retrieves text from your database, then you feed it into an LLM to summarize it, but maybe you need to compress it down even further or add additional twists to it, or you have to translate it into a different language, and you want to see each of those different outputs. And setting up monitoring for that is the main thing that will allow you to improve the application the most, because for one, you can create a test set which you can test your prompt iterations on, and you also get to do an error analysis. So you can see where the model fails and how the model fails. And based on that, I set up tests, which are mostly quantifiable and very deterministic. Often it's just a regex or a string match. For summaries, this can be something like: the models often write "This article talks about...", and I compute a score where one of the components is a string match on "this article." If "this article" is at the beginning of the summary, I give it a score of zero; if it isn't, I give it a score of one. And you can combine 10 or 12 of those metrics to actually get a good idea of the quality. And this is the second thing: I set up tests almost immediately for the task. Then, through doing a few examples and having set up the monitoring, you can create a test set of 10 to 50 examples. And every time I'm altering the prompt or the pipeline, I can automatically run the test set, have my evaluation run automatically, and see whether it improves or not.
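A minimal sketch of the deterministic, regex-and-string-match style scoring described here. The individual rules are illustrative assumptions; the point is that each check is cheap and deterministic, and they combine into one quality score you can run over a small test set on every prompt change.

```python
import re

def score_summary(summary: str) -> float:
    checks = [
        # Penalize the boilerplate opener the models love.
        not summary.lower().startswith("this article"),
        # No hedging filler.
        re.search(r"\bas an ai\b", summary, re.IGNORECASE) is None,
        # Stay within a rough length budget (word count).
        20 <= len(summary.split()) <= 120,
        # Ends like a finished sentence, not a truncated generation.
        summary.rstrip().endswith((".", "!", "?")),
    ]
    # Each passing check contributes 1; average into a 0-1 quality score.
    return sum(checks) / len(checks)

# Run the score over a small test set after every prompt or pipeline change.
test_set = [
    "This article talks about data-centric AI and why the data matters a great deal for the outcomes you see in production systems today.",
    "Data-centric AI shifts iteration from the model architecture to the training data itself, which is where most production gains come from.",
]
for s in test_set:
    print(f"{score_summary(s):.2f}  {s[:60]}...")
```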
So I try to bring the quantifiable nature you have in traditional AI and ML, where you have a classification or regression problem and you know how well it performs on the test set, back into LLMs, which aren't so quantifiable because they work in text, in something that's unstructured. - That is really interesting. That is one of the best monitoring-through-tests schemes I've heard of for LLMs. - And it's funny, because in traditional software development, I would dare to say that testing and monitoring is one of the easiest things to ignore, especially in older applications. Maybe it starts off well and then gets abandoned. But it's always considered best practice; nobody would argue with you that, of course, you should be doing testing and monitoring. But it seems like it's a whole next level of importance with LLM- and AI-based apps. So it'll be really interesting to see whether people hold these to a higher standard when it comes to monitoring and testing. I think they'll have to, or we'll run into this "no, no, it's okay" attitude. - It's so easy to spin up a quick solution that works on a few cases. The models are so good right now that most of the stuff you're actually trying to do will work on nine out of ten cases, so you have to work to find the edge cases. And I think most people will just say, "oh, it works on the 10 examples I gave it," and push to production. And I'm not sure whether people will follow the practice, because it's laborious. It's the work nobody wants to do. It's MLOps, it's DataOps, and reading through traces just isn't much fun. - No. I mean, it's not like there's a robust test culture in machine learning, really. - Or data in general, right? - Or data in general. - Well, it's 'cause we always say, well, it all changes, it's probabilistic or whatever. And I think this is showing that even when it's probabilistic, there are things you can do. You just have to put the work in. - Well, and the other thing, at least in data: with a customer-facing web app, there's a lot of accountability. The thing breaks, the customer can't use it, and you know whose fault it is. - Right. - With data, it's like, well, this report is wrong. It's like, well, maybe you entered the data wrong. It's not always cut and dry. And I think AI will be similar: "well, the model hallucinated, that just happens sometimes." It's less tightly coupled than "hey, the client's using the app, it's very deterministic, there's an error, and it's obviously an application problem." Data has always been a little less deterministic than that, and AI will be even less deterministic. So I think there's going to be a wide array of quality because of that. - Well, I think also, like what Nico said, you can do 10 examples and be like, oh, that's great. And that's kind of the strength and the risk of a lot of this: I don't need to go create a training set of 80,000 records, but I'm also not looking at all the possibilities when I send it out into the world. - Yeah. And I think that's already the biggest difference between AI and software errors.
I think in software, bugs can be hard to trace, but you often have good error traces; it's easy to reconstruct them. In AI and in data, you have a long lineage: how the data is created, how it ends up in its source location, and then how it's used in AI, because AI is the consumption part. You have a long lineage from where the data is first created, and then you basically have to back-trace all of the different steps: where might this error originate? Is the AI hallucinating? Is it something I'm transforming wrong? Or is it somewhere in my data set, in the source, where something really went wrong? - Yeah. So switching gears a little bit, I want to talk startups. Over the last 20 years, we've had lots of fun stories around software startups and zero-to-one journeys. And now we're in this AI era; there's a ton of money in AI, still a ton of money behind AI startups. Maybe share some observations from your experience working in AI startups. We can take this in whatever direction: tooling, culture, whatever. But what are some of those differences? You're involved in several AI startups right now. What feels different versus, say, the experience 10 years ago in a software startup? - I think it has never been easier to build something, but it has also never been harder to differentiate yourself, because there is so much stuff in AI out there, and it's so easy to just create content. So many people I know are just creating content and trying to get traction on an idea. And once they see the validation, they would actually start it, but often they don't. And if you're building in a space, you're just drowning in a sea of noise. And in the AI part at the moment, I think for most startups, the ideas often are so dipshit crazy, just impractical and solving niche problems. It's not really thinking about the consumer first, but rather technology first: hey, I now have an LLM, I can process massive amounts of documents, what documents can I throw it at? And I think you should go the other way around. You should go from the problem to the solution. If LLMs are the right solution, or the best solution for the job, use LLMs. But don't take the technology, ask "hey, what could I do with it?", and then start building something. - Yeah, totally agree with that. I think another thing you touched on, which makes this a really unique time, is that software startups in the past, assuming you're not a big startup, maybe not even venture-backed, or bootstrapped, were not going to have much marketing behind them, right? Because you're a technical person, you're kind of doing this thing. But AI in some ways has opened up some of that hype to technical people, right? So you can be bootstrapping something and, like you said, generate a bunch of AI content. Go generate some AI images, stand up a pretty decent-looking site, right? And have more marketing behind your idea than what would before have been a very basic, very simple site, while you're actually iterating more on the technical side. So I think that's an interesting thing you touched on. Have you seen that?
- I can't think of an example off the top of my head, but I do think, to Nico's point, since everyone can do that, you're just drowned in it, and it's now hard to tell the difference between who is who. So it's a little bit of a Red Queen problem: you've got to run faster just to stay in place. - Sure, yeah. - I think I could put a query into Google for "X AI" and I will likely find a webpage which uses the same base framework template. And that shows you how much stuff there is, and how easy it has gotten to do all the different things that used to require some skill and put up some barriers: building a website, doing a sign-up thingy, a waitlist, and stuff like that, and then advertising it. It has never been so easy. And most people never go through with it, but there's just so much out there now. - All right, we're coming toward the end of our time here, so I've got one or two more questions as we wrap it up. We've started to see some earnings reports come out. Some of the big players are projecting that they're not going to make their money back on generative AI for decades or so, and we're starting to see more reports pushing back on, well, what are AI and GenAI really doing? Where do you see us in the hype cycle for generative AI? - I think for generative AI, the hype is not really driven by the companies on the public markets. So I think Nvidia took a hit, but generative AI lives in startup culture, and also in OpenAI and Anthropic. And they have so much money left; they had massive rounds in the last two years, so they have so much runway to create new models and create new hype that I don't think it will slow down soon. Rather, we have at least a year or more of this left. And the additional part is, there are now so many areas of generative AI being spawned. You have Suno working on music generation, and I think what's possible with that hasn't really sunk in yet. You have the new Google paper which just came out, where they basically generated a whole game with generative AI. You have all the video models, you have all the image models. And because it's so tangible, and it's now hitting so many different areas, the hype won't slow down for the foreseeable future, because the startups also still have runway. They can develop new stuff and launch new cool things they can post on social media, which will get hype because, to be honest, it's just impressive. - Yeah. And I think that speaks to what we were talking about earlier, actually before the show, where you might end up with these different curves, right? Maybe the text stuff slows down a little, but video picks up, or image. Because it's such a big trend, you might end up with several of these curves, where you don't necessarily have the typical hype-and-cooling pattern, but rather multiple curves running simultaneously. - It'll be interesting to see which of these can generate enough revenue to really sustain themselves, versus some that have money pouring in now, but eventually the runway runs out and it's like, oh, we never could support ourselves on this. - Right, right. - Yeah, well, I think, especially with LLMs, we are hitting the end of the S-curve.
Because you see OpenAI struggling to bring something new to market; the voice mode still really isn't here. So they still have some reliability issues. And new launches have been stagnant for a while. The last thing we talked about in the last few months was Artifacts by Anthropic, which again is more of a UI innovation and not a technology breakthrough of a new type of model or new capabilities in the model. - Right. Well, Nico, thanks for being on the show today. We'd love to have you back sometime. AI is going to keep changing, for sure, so I'm sure we'll have plenty to talk about. But thanks for joining us. - Where can people find you online, Nico? - So, LinkedIn. I'm trying X, or Twitter; I haven't cracked the code yet. I think as a European, you have a late start. I have a podcast, which is everywhere: Spotify, Apple Music, YouTube. It's very descriptively named How AI Is Built. So if you're interested in AI, that's the place to go. At the moment I'm mostly doing search stuff, so if you're interested in search, traditional stuff, information retrieval, up to embeddings and RAG, give it a follow, give it a listen. - Awesome. Thanks for being here. - Thanks a lot. - The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com. [MUSIC]