
Category Visionaries

Joe Witt, CEO & Co-Founder of Datavolo: $21 Million Raised to Unlock the Potential of Unstructured Data for AI

Welcome to another episode of Category Visionaries — the show that explores GTM stories from tech’s most innovative B2B founders. In today’s episode, we’re speaking with Joe Witt, CEO & Co-Founder of Datavolo, a pioneering company in the AI data ingestion space that has raised $21 Million in funding.

Here are the most interesting points from our conversation:

  • NSA to Commercial World: Joe transitioned from working at the NSA, where he focused on data collection and security, to founding Datavolo, leveraging his expertise in large-scale data ingestion.

  • From Open Source to Enterprise: The journey of Apache NiFi, an open-source project initiated at the NSA, and how it became a cornerstone for Datavolo’s technology in managing unstructured data.

  • Generative AI and Data Ingestion: Joe discusses the crucial role of data ingestion for generative AI, emphasizing the need to capture and process unstructured data for effective AI training and application.

  • Target Audience: Datavolo primarily targets data engineers who interact with AI engineers, providing tools to automate and optimize data pipelines.

  • Marketing Strategy: The importance of educating the market about the steps involved in integrating company-specific data with large language models (LLMs), and the benefits of automated data ingestion.

  • Funding and Growth: Insights into Datavolo’s successful fundraising journey, emphasizing the importance of a strong founding team and a clear vision in securing investment.
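
For readers who want to see what the unstructured-data ingestion flow discussed in the episode looks like in practice, here is a minimal sketch in Python: pull PDF documents from a folder, extract their text, chunk it, embed the chunks, and stage records for a vector store. The library choices (pypdf, sentence-transformers), the model name, the chunk size, and the ./docs folder are illustrative assumptions, not Datavolo's implementation.

    # Minimal sketch of an unstructured-data ingestion flow: capture PDFs,
    # extract text, chunk, embed, and stage records for a vector store.
    from pathlib import Path

    from pypdf import PdfReader
    from sentence_transformers import SentenceTransformer

    CHUNK_CHARS = 1000                    # naive fixed-size chunking (assumed)
    MODEL_NAME = "all-MiniLM-L6-v2"       # assumed embedding model


    def extract_text(pdf_path: Path) -> str:
        """Pull plain text out of every page of a PDF."""
        reader = PdfReader(str(pdf_path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)


    def chunk(text: str, size: int = CHUNK_CHARS) -> list[str]:
        """Split text into fixed-size character windows."""
        return [text[i:i + size] for i in range(0, len(text), size) if text[i:i + size].strip()]


    def ingest(pdf_dir: str) -> list[dict]:
        """Build embedding records ready to load into a vector or document store."""
        model = SentenceTransformer(MODEL_NAME)
        records = []
        for pdf_path in Path(pdf_dir).glob("*.pdf"):
            pieces = chunk(extract_text(pdf_path))
            vectors = model.encode(pieces)            # one embedding per chunk
            for i, (piece, vector) in enumerate(zip(pieces, vectors)):
                records.append({
                    "id": f"{pdf_path.stem}-{i}",
                    "text": piece,
                    "embedding": vector.tolist(),
                    "source": str(pdf_path),          # lineage: where the chunk came from
                })
        return records


    if __name__ == "__main__":
        print(len(ingest("./docs")), "chunks staged for indexing")

As Joe stresses in the conversation, the hard part is doing this continuously and securely, with change data capture, lineage, and access metadata carried along; a one-shot script like this leaves all of that out.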

//

Sponsors: Front Lines — We help B2B tech companies launch, manage, and grow podcasts that drive demand, awareness, and thought leadership. www.FrontLines.io

The Global Talent Co. — We help tech startups find, vet, hire, pay, and retain amazing marketing talent that costs 50-70% less than the US & Europe.  www.GlobalTalent.co

Duration: 23m
Broadcast on: 30 Jul 2024
Audio format: mp3


[MUSIC]

>> Welcome to Category Visionaries, the show dedicated to exploring exciting visions for the future from the founders on the front lines building it. In each episode, we'll speak with a visionary founder who's building a new category or reimagining an existing one. We'll learn about the problem they solve, how their technology works, and unpack their vision for the future. I'm your host, Brett Stapper, CEO of Frontlines Media. Now, let's dive right into today's episode.

[MUSIC]

>> Hey everyone, and welcome back to Category Visionaries. Today we're speaking with Joe Witt, CEO and co-founder of Datavolo. Joe, welcome to the show.

>> Thanks, Brett. Great to be here.

>> The big question: is it "day-ta" or "dah-ta"?

>> Who knows, we'll take it whichever way they say it, as long as they say it.

>> Sounds good, sounds good. Let's talk about your background and explore a little bit about what you were doing before you founded the company.

>> Yeah, so straight out of college I went to the National Security Agency and really got deep into the problem of, how do you take collection and deliver it to all the systems that need it as fast as possible? There's a whole range of awesome problems to work on there, but I just got really focused on that one at a super enterprise scale, thinking about how you do that efficiently, securely, with strong governance and lineage. And candidly, I didn't know anything about the commercial world, so I had no idea how that carried over out there. But eventually the work we did we were able to get open sourced into the Apache Software Foundation, and thereafter I started a company, we got acquired by Hortonworks, and I spent the next eight years at Hortonworks and Cloudera, helping enterprises on a global scale adopt the technology and really understand how to ingest data at big data scale. For whatever reason, I've just really bought into that problem and enjoyed it ever since.

>> I know I'm asking you all of the hard questions here to start things off, but I'm currently watching Homeland with my wife and loving that show, not sure if you've watched it yet. When it comes to intelligence shows, which one do you think is the most accurate?

>> Oh my gosh, that is a great question. So, it's not Homeland, I can tell you that.

>> Oh, don't tell me that, Joe.

>> That one's more on the CIA front, if I remember, and I was on the NSA side. Yeah, I don't know what the most accurate one out there is, but what I can tell you is it's an awesome mission, and we did a lot of great work. I'm sure they still do really great work there, but it's an amazing opportunity to see data at a scale that, frankly, most companies in a global sense will never see. You also get to deal with interesting complexities about how to do it compliantly and securely, and it was a lot of fun.

>> I've had a number of founders on who also spent time in intelligence, and one of the things they've talked about is the fact that they can't really talk about a lot of the work that they do. One time I had a founder on, and I looked at him, and he looked like he was about 45, and I looked on his LinkedIn, and there was no history at all besides the company he had just founded. And right when we got on the call, he was talking me through and kind of explaining why. Was that a challenge for you, not being able to talk about the work that you did when you moved into the commercial world?
>> Yeah, it was definitely tough at first, but part of the thing is that the software we built got open sourced, and along with that were a couple of press releases that were done by the NSA themselves, so as far as what I did on a broad scale, I can talk about that in pretty generic terms. But look, I've been out now for eight years. It's really easy for me at this point to talk about what the enterprises are dealing with, and candidly, it's not that different from what the Intel community deals with. So it was hard at first, but now it's a non-issue.

>> Makes a lot of sense. Well, let's dive into everything that you're building today. Just to start off at a very high level: what does the platform do? What value does it bring to users and customers?

>> Yeah, so I think the best way to answer that question is to put it in a little bit of historical context. The core product that we're building around, or the core engine we're building around, is this tool called Apache NiFi. It's open source software in the Apache Software Foundation. And if you think about its first eight years of life, we'll call it the NSA years, from like the 2006 to 2014 timeframe, NiFi got to grow up on what we now think of and talk about as unstructured and multimodal data: documents, images, audio, video, all those sorts of things. Now, candidly, at the time we didn't use those terms and it wasn't about AI. But importantly for what we do now, we had to learn how to handle large data, learn how to have really fine-grained lineage, learn how to support users that maybe don't necessarily write great code but still understand how to express requirements and build important pipelines. Then we open sourced the technology, so we left the Intel community effectively, or at least branched beyond the Intel community, and into the enterprise commercial world. And there we did that for the big data community. But I think the real truth is that the vast, vast majority of data being ingested into the big data or Hadoop ecosystem, and early Databricks, early Snowflake kinds of ingest cases, is all structured data: things that look like records and rows and columns and cells, things like that. So we had our first eight years on unstructured data, the next eight years, in broad terms, on really structured ingest feeds, and then there's what has emerged now with generative AI, right? ChatGPT shows up in '22 or '23, depending on how you look at it, and really kind of blows all of our minds with how you can now interact with this thing and ask questions. But importantly, it can only answer questions with any real context on data that it's been trained on to develop an understanding of. And so now what are companies trying to do? Every company on the planet wants to be able to ingest all the data that they have. The vast majority of the data companies have under management today is actually unstructured data. It's PDFs, Excel documents, Word documents, images and audio and so on. They want to capture all of that data, or collect all of that data, parse it, and make it available to ask questions over. But that data, you know, if you're a specific company, your data wasn't trained into the LLM, so ChatGPT's not going to answer your questions. Does that make sense, Brett?

>> Yeah, that does make sense.

>> So for us at Datavolo, the opportunity became kind of obvious: look, companies really need help capturing unstructured data.
And instead of us kind of wading through a vast army of ETL vendors and having kind of an undifferentiated play, we now have an opportunity to be really differentiated, to focus on our core strengths, which happen to be unstructured data ingest, and really help them get that information into the core systems that will drive AI processes and outcomes. So for example, you have a series of PDF documents sitting somewhere in a drive or in S3 or some location like that. How do you sort of woodchipper through them and do the computer vision to extract that information? What text is in there? What structure is it in? What's in the tables? What's in the charts and images? And create sort of structured output that can then go into a vector store, graph databases, document stores, you name it. And now you can combine what something as awesome as OpenAI's LLMs know along with your own company's enterprise data. And so that ingest side of it, that automated and continuous piece, that's what Datavolo's making really easy for companies to do.

>> This show is brought to you by Frontlines Media, a podcast production studio that helps B2B founders launch, manage, and grow their own podcast. Now, if you're a founder, you may be thinking, I don't have time to host a podcast, I've got a company to build. Well, that's exactly what we built our service to do. You show up and host, and we handle literally everything else. To set up a call to discuss launching your own podcast, visit frontlines.io/podcast. Now back to today's episode.

>> And we've had probably 30 or 40 companies, or founders, on the show now that are in the data space. Can you just paint a picture of where you sit in the landscape and what the market category is for this product?

>> Yeah, so in some ways it's talked about as ETL for generative AI, or ingest for AI, or data engineering for AI. Of the people that you talk to and look at in the AI space today, the first set would be people that are building things like LLMs. And that's actually a really heavy focus right now. That's why you see people buying all kinds of GPUs and so on: so that they can train these models on bigger, better, faster data. The next wave of people that you tend to talk to are people who deal with prompt engineering and automating interaction with those LLMs to ask them questions. So part of that's gonna be model training: how do you make better, smarter models? The other part of it is, okay, I've built a model, now how do I ask it questions? You could sort of generically think of that as the front end of a generative AI system. We are much more on the back end, behind the scenes: how did your company's data get to where it could be interacting with an LLM in the first place? So we're on the inference side, if that makes sense, Brett.

>> And who's the main buyer that you're trying to target? Is it the chief data officer? Is it the data engineer? What does that look like?

>> Yeah, so the core user that we interact with is gonna be a data engineer predominantly, and specifically we're talking about data engineers that interact with AI engineers. In the same way that, if you think back to the big data world, the data scientists were kind of the stars of the show, and behind every one of those people were nine to ten data engineers doing all of the difficult work of collecting that data, cleaning it, preparing it, and getting it ready for someone like a data scientist to do analysis on it.
You see a very, very similar pattern on the AI side now, where the AI engineer needs the same kind of assistance. Although, candidly, the data engineers of the AI evolution are gonna benefit tremendously from generative AI itself improving their experience. We could talk about that more as we go. But the core user, make no mistake, is a data engineer. Our goal is to make every data engineer basically a 10x data engineer, to give them the tools and capabilities they need to ingest more of this data, more correctly, faster, more securely. But you're right, the CDO is definitely where a lot of the conversations start, right? Every enterprise on the planet is having to answer to its board for what its generative AI strategy, or AI strategy broadly, is today. And so a lot of this starts from the top down, but the people we end up talking to from a user point of view are the data engineers.

>> What's it like marketing to those types of people? What resonates and what doesn't resonate?

>> Yeah, so for data engineers, I mean, let's be honest, doing ETL, and ETL systems, and the tooling for ETL has been painful for a long time. And so they can be a grumpy bunch. What I really mean by that, though, is that at this point they don't want to hear about new tools. What they want to understand is how the kinds of tools and techniques they already use can be made faster or more automated. What's particularly exciting about the work that we're doing is that it's not just, hey, let's give you a pipeline to ingest documents so you can then go do AI. It's also: how can we leverage generative AI to automatically create the pipelines for you? So as a data engineer, can you express what you want to do in just natural language? Like, literally just write out: where does the data come from? What kinds of transforms would you like to see done on it? Where do you want the data to go? And then we'll automatically create pipelines that are also automatically monitored, elastically scaled, all the things that people end up spending a lot of time toiling with. Or if you need to do a transform on a piece of data, you know, it comes in in a format that's different from what needs to go out, a lot of times that means they've got to write code. Well, again, nowadays with all these code-based copilot solutions it's actually quite easy, certainly easier than it's ever been, to help a data engineer, even essentially handing them code to do transforms automatically as well. So the data engineers really want to hear how you can help them use existing tooling in a lot of cases, but also automate as much of that process as possible. The 10x data engineer is basically looking to do more pipelines that are more stable, more often. And so that's a lot of what we focus on. With the CDOs, the story is quite different, obviously. With the CDOs, or the executive leadership teams broadly, what they want to understand right now is: well, how does this LLM even work? What are my risk profiles and the governance programs that I have to establish? What are my privacy concerns? What happens if I'm running this in SaaS versus on-prem? And so the conversations there are much broader, much more focused on governance and sort of a longer-term view. We spend, I'd say, about half our time with each of these, but they're very different conversations.

>> From a marketing perspective, what tactics and strategies are you really seeing work right now?
>> So I would say the most important thing right now is that we're in sort of the early phase, where we have to educate the market on what the steps are that you really have to do in the first place. I mean, everyone has tried ChatGPT at this point and sort of understands that there's a big aha moment here, where you can really start to interact with this thing and it can even be creative. I mean, that's sort of the point of the generative nature, right? But what they don't necessarily know is, well, how do I take my unique company data and marry that up with this LLM that was never trained on it in the first place? So you really do have to walk them through the journey. Marketing right now is a lot of education, frankly. So how do you capture unstructured data? Well, it's actually change data capture, just like it was in the relational, structured data days, it's just now on different data. You also have to do change data capture not just on the documents, but on updates to the documents, metadata changes, who has access to those documents. You then have to parse that information, convert it from its sort of visual representation to an actual structured, narrative text representation. You have to do things like PII detection and redaction. You have to be able to chunk it, which is itself a rapidly evolving space: it might be that you're chunking based on structural elements, or you might do things like run cosine similarity between sentences, decide that the ideas between two sentences have changed enough, and do semantic structuring or chunking. And then of course you have to generate embeddings, drive this data into vector stores, and then help them understand all of the really cool systems out there in the ecosystem, like LangChain and others, that then let you build RAG applications on that data, combining the power of OpenAI plus the data we've ingested into those vector databases. So I think right now the key way to do marketing is through education. Does that make sense?

>> Yeah, that makes perfect sense. What does the marketing team look like today? How big is the marketing team?

>> Yeah, so I'm very blessed to have a co-founder who's been on the go-to-market side for a long time. I come up through basically the tech ranks, as a software developer and then a leader and a VP of engineering type. And so while obviously I've done a lot on the go-to-market side, it's not been my primary forte. My co-founder, Luke Roquet, has done a lot of this. He led marketing in his most recent role at Cloudera. He had been a sales leader for large portions of the US, and so has a great background in that regard. I'm still involved. Obviously I should certainly be helping share what the vision looks like, and I'm certainly involved when it comes to engaging from a kind of founder-led sales mentality and so on. But Luke's definitely driving our marketing approach today.

>> This show is brought to you by The Global Talent Co., a marketing leader's best friend in these times of budget cuts and efficient growth. We help marketing leaders find, hire, vet, and manage amazing marketing talent for 50 to 70% less than their US and European counterparts. To book a free consultation, visit globaltalent.co.

>> From a go-to-market perspective, what's been the most important decision that you've made so far?

>> So I would say getting really specific about a use case that we can add value on, I think, has probably been the most valuable.
The trick with the technology that we're building on is that it's been in the open source community for coming up on ten years now, nine years as of now, I think. There are somewhere on the order of 8,000 to 10,000 companies using NiFi already today, and it's supporting a super diverse set of use cases. That is a blessing and a curse, right? Right now, what we really need people to understand is how you use it to capture unstructured data, and what the steps are that you have to be able to do and that you need to have optionality on. And so instead of telling people how great and flexible and powerful this specific tool is, our go-to-market motion is really focused on the outcome that somebody's trying to get when it comes to, say, document intelligence or parsing unstructured documents. So our key focus when it comes to go-to-market right now is making that really crisp. If you want to be able to ask questions over a series of PDF documents, and you want to be able to enrich them with other data that you might have in a warehouse already, exactly how do you do that? That's been our big focus, rather than just sort of teaching them that there's a Swiss army knife.

>> And I know the company was founded less than a year ago. How are things going so far? Any numbers that you can share that highlight some of the growth that you're seeing?

>> Yeah, so we're doing very well when it comes to existing users of NiFi. I had expected that we would either need to figure out how to commercialize and monetize the existing user base that was looking for a strong vendor to support them, or we'd have to be really crisply focused on generative AI. And we've actually, I think, found a really nice blend, where we have both investors and customers that we're working with who are really diving in deep on generative AI. So we have an important portion of our team focused on helping them kind of advance the state of the art, get to the point where they're doing continuous and automated ingest of unstructured data, which is super exciting. But we are also extremely lucky at this stage to have customers that just reach out to us, frankly daily, who already use NiFi, they already love it, and they just wanna go to the vendor that knows NiFi the best and can support them the best. We're definitely very blessed to have that inbound, especially 'cause it's not like we can afford to do a tremendous amount of outbound marketing and boots-on-the-ground kinds of sales motions.

>> And I see on Crunchbase it's $21 million raised to date. Is that correct? And as a follow-up, what have you learned about fundraising throughout this journey?

>> Yeah, so that's correct. We had a seed round in the middle of the year, or early fall, last year. We kicked off the Series A raise not long after that, off of some initial customer traction that was awesome. Again, just existing users of NiFi. And I should add, and I probably meant to add this on the previous question, it's not just existing users that need a good vendor. What I was worried about is that it would be existing users that are just doing structured data, where it's pure ETL, and differentiation at that point is tough, 'cause, I mean, there's a tremendous number of ETL vendors out there on structured data. But what's incredible is they're already using NiFi to deal with unstructured data feeds. They do document parsing. They're invoking machine learning models in these data streams as well.
And so it's right in the wheelhouse, or the pocket, of where we're trying to help take the community now. So we've been very lucky in that regard. On the funding topic, you know, like I said, we did the seed in the early fall, after we put together the right co-founding team and a lot of the core early hires and had some early customer traction. We did the Series A raise, and we were very fortunate towards the end: Citi Ventures joined with a really strong sort of conviction on both where they were trying to take the organization and the vision that we had. And so we've spent a lot of time working with them, both in the context of the venture arm, but also developing how we can be helpful to them commercially as well, which is awesome. As far as what I've learned about funding, wow. Yeah, so it's critical to have a strong founding team. And when I say a strong founding team, there are a bunch of different ways to be strong. In our case, we have some advantages in that we have started companies before. I've co-founded a company previously, gone through that funding process, what it's like to hire and build teams, and beyond having founded a company, I've also led really large engineering organizations, where you get into the hundreds-of-people kind of role. That really helps bring a lot of credibility to it. Not only that, but we also represent a really core brain trust or leadership team behind NiFi. I mean, I originally created the project in 2006, so that helps a lot. The core people I've worked with for a long time are key members of the open source community as well. And so that's also seen as really valuable, both for customers as well as investors. But I was also, like I said, really fortunate to have Luke join as a co-founder, who brings a tremendous wealth of experience on the go-to-market side, and he also has a great network of extremely talented people that are really excited to follow him. And so we're really fortunate to have that.

>> Final question for you here before we wrap. Let's zoom out three to five years into the future. What does the big picture vision look like?

>> Yeah, so I think we've got to think about what the journey is that's going to be taken for generative AI broadly. What I would like to see in that sort of three to five year timeframe is that we have a really strong range of continuous and automated ingest of data sets feeding into vector and document databases as well as traditional structured storage. I think over time we're going to see that this distinction between unstructured and structured data is going to kind of blur away. What's going to matter is that you can use natural language, or mechanisms like SQL, to query this data at large scale and understand what's happening in real time. I think also, you know, throughout this conversation I've spent most of the time talking about data ingest feeding into storage points like vector stores and relational databases and so on, before somebody ever asks a question of the LLM. But I do think we're also going to play a key role when it comes to agent access patterns as well, where the LLMs are going to basically use agents and tools to go and search for the data when questions are asked, kind of like going and conducting research. I think we'll be on that side as well, acting as a sort of asynchronous tool that gets invoked to go find data, do these processing steps, and then help sort of educate the LLMs on how best to respond to their original tasking.
And I want to see, in that three to five year range, that there are hundreds of thousands, if not millions, of pipelines running, capturing unstructured, multimodal data streams, audio, video, you name it, into generative AI and, frankly, AI systems broadly. Obviously generative AI has captured all of our imaginations today, but it's quite exciting to think about what comes next. And on top of that, we're saying that we want to help our customers capture this unstructured data. That also means we can't just be resting somewhere comfortable as a SaaS service, saying, well, once you get your data to us, then we can help you. I think in that three to five year timeframe we've got to answer the bell, so to speak, of being hybrid and multi-cloud and being wherever the customer's data is, whether that's on any of the cloud service providers, frankly even all of them at the same time, as well as on-prem or even at the edge. We have a lot of experience in each of these different areas, but we can all imagine where this really goes. People are going to want to understand what their pipelines look like globally. If they're going to be feeding AI systems, they want the best data, the highest quality data, the freshest data. They're going to need a global command and control layer. And I think that's a big part of what we need to unlock.

>> Amazing, love the vision. All right, Joe, we're up on time, so we'll have to wrap here. Before we do, if there are any founders listening in that want to follow along with your journey, where should they go?

>> Yeah, so we've got a page on LinkedIn for Datavolo. We've got the datavolo.io website, which is also rather active, with a Contact Us link. We've been getting in touch with a lot of other companies, partners, and investors, as well as customers, through that channel. We also have a private beta that we're working on with a lot of users today, so get in touch with us and we can get you onto the app so you can easily start building generative AI ingest flows today.

>> Amazing. Joe, thanks so much for taking the time.

>> Likewise, Brett, really appreciate it.

>> This episode of Category Visionaries is brought to you by Frontlines Media, Silicon Valley's leading podcast production studio. If you're a B2B founder looking for help launching and growing your own podcast, visit frontlines.io/podcast. And for the latest episode, search for Category Visionaries on your podcast platform of choice. Thanks for listening, and we'll catch you on the next episode.

(upbeat music)
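
One step from the conversation worth making concrete is the semantic chunking Joe mentions: run cosine similarity between neighbouring sentences and start a new chunk wherever the ideas have shifted enough. Below is a minimal sketch of that idea in Python; the embedding model, the 0.55 threshold, and the regex sentence splitter are illustrative assumptions, not Datavolo's implementation.

    # Sketch of semantic chunking: embed each sentence, then start a new chunk
    # wherever cosine similarity between adjacent sentences drops below a threshold.
    import re

    import numpy as np
    from sentence_transformers import SentenceTransformer

    MODEL = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
    SIMILARITY_THRESHOLD = 0.55                       # assumed boundary threshold


    def semantic_chunks(text: str) -> list[str]:
        # Crude sentence split; a real pipeline would use a proper sentence tokenizer.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        if len(sentences) < 2:
            return sentences

        embeddings = MODEL.encode(sentences)
        # Normalise rows so a dot product equals cosine similarity.
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

        chunks, current = [], [sentences[0]]
        for prev_vec, cur_vec, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
            if float(np.dot(prev_vec, cur_vec)) < SIMILARITY_THRESHOLD:
                chunks.append(" ".join(current))      # topic shift: close the chunk
                current = [sentence]
            else:
                current.append(sentence)
        chunks.append(" ".join(current))
        return chunks

In practice the threshold would be tuned per corpus, and the resulting chunks would then be embedded and loaded into a vector store alongside the change data capture, parsing, and PII redaction steps Joe lists.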