
The Data Stack Show

195: Supply Chain Data Stacks and Snowflake Optimization Pro Tips with Jeff Skoldberg of Green Mountain Data Solutions

This week on The Data Stack Show, Eric and John chat with Jeff Skoldberg, Principal Consultant, Data Architecture and Analytics at Green Mountain Data Solutions. Jeff is a data consultant specializing in supply chain analytics and cost optimization, and he shares his journey from starting as a business analyst at Keurig in 2008 to becoming an independent consultant. They discuss the evolution of the data landscape, including shifts from Microsoft SQL Server to SAP HANA and later to Snowflake. Jeff emphasizes the importance of cost optimization, detailing strategies for managing data costs effectively. The group also discusses two frameworks for using data to control business processes and create actionable dashboards, and more.

Duration:
48m
Broadcast on:
26 Jun 2024
Audio Format:
mp3

Highlights from this week’s conversation include:

  • Jeff's Background and Transition to Independent Consulting (0:03)
  • Working at Keurig and Business Model Changes (2:16)
  • Tech Stack Evolution and SAP HANA Implementation (7:33)
  • Adoption of Tableau and Data Pipelines (11:21)
  • Supply Chain Analytics and Timeless Data Modeling (15:49)
  • Impact of Cloud Computing on Cost Optimization (18:35)
  • Challenges of Managing Variable Costs (20:59)
  • Democratization of Data and Cost Impact (23:52)
  • Quality of Fivetran Connectors (27:29)
  • Data Ingestion and Cost Awareness (29:44)
  • Virtual Warehouse Cost Management (31:22)
  • Auto-Scaling and Performance Optimization (33:09)
  • Cost-Saving Frameworks for Business Problems (38:19)
  • Dashboard Frameworks (40:53)
  • Increasing Dashboards (43:29)
  • Final thoughts and takeaways (46:28)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

(upbeat music) - Hi, I'm Eric Dodds. - And I'm John Wessel. - Welcome to "The Data Stack Show." - "The Data Stack Show" is a podcast where we talk about the technical, business, and human challenges involved in data work. - Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. (upbeat music) (upbeat music) - Welcome back to the show. We are here with Jeff Skoldberg from Green Mountain Data. Jeff, welcome to "The Data Stack Show." We're super excited to have you. - Thanks so much for having me, excited to be here. - All right, well, you are calling in from the Green Mountains. So, you come by the name of your consulting practice honestly, but give us a little background. How'd you get into data? - Sure, so I first started in data in 2008 when I was working at Keurig, as a call center business analyst. I was with Keurig for about 12 years doing data, a couple of years of that actually on the BI team. So kind of transitioning from a business analyst who is just knee deep in data all the time to, now it's actually my job title, even though I've been doing it the whole time. - Yeah. - And then about five years ago, I said, "Hey, let me try and do this on my own, see what happens." So about five years ago, I left Keurig to become an independent consultant. And since then I've been helping companies kind of on their analytics journey, doing end-to-end stuff. So, business analysis, the data pipeline, architecture and solution architecture, as well as the last mile, which is analytics and dashboards. So, that's what I've been doing for the last five years. - So, Jeff, one of the topics I'm really excited to dive into is cost optimization. That's one that's near and dear to my heart, previously being in a CTO role. So, I'm really excited to dive into that one. Any topics you're excited to cover? - Cost optimization sounds awesome. We can talk about some frameworks that I use with my clients as I'm walking them through their business problems and we're thinking about how we're gonna solve the problems that they come to me with. That's another one. And we can also talk about maybe how the data landscape has evolved over time. - Yeah, okay, that sounds great. - All right, let's dig in. Jeff, one thing I love is that we get to talk about sort of your tenure at Keurig a little bit. It seems less and less common, especially in the world of data, for someone to spend over a decade at a single company. And not only did you do that, but what a decade. I mean, you sort of chose the years to span that time at Keurig from sort of before some of the most high impact technologies we use today were even invented. So sort of start at the beginning, like how did you get into, was that like an early job out of school? How did you even get into data at Keurig? - Absolutely. So I went to school at the University of Vermont and this was my first job out of college. And Keurig being the largest employer in Vermont at the time, remote work not being as popular as it is now, it was basically my only choice for, like, working at a big company. My first data role in 2008 there was as a call center business analyst, where we're looking at average wait time, average talk time and average handle time, which is like the sum of the two. And using that for forecasting models, like when should people be taking their lunch breaks or their 15 minute break, et cetera.
So we're really building a staffing model based off of our call history and those types of metrics. So to kind of keep a very long story short, eventually I moved into a role as supply chain analyst where I had very smart managers and mentors who were really great at coming up with KPIs to manage the business. And then I was responsible for bringing them the data so they could really fulfill those KPIs. So we could go into some examples maybe later in the show if we want to dive deeper there, but it was very much an exercise of like, hey, we're going to try and reduce XYZ cost and now we need the data to understand XYZ cost. And let's automate that. So, you know, we're not paying our analysts to pull data, we're paying our analysts to really present it and show where the pain points are and the weak spots. The good bulk of my career at Keurig, I would say like eight of those 12 years, was as a supply chain analyst. And then I moved into an actual BI developer role on the BI team when Keurig decided SAP HANA, or just SAP in general, was going to be their ERP system of choice. My manager made sure that me as a supply chain analyst had unfettered access to the SAP HANA database. So when he had a business question about how the business is doing, is something in stock, what's our in-stock percentage, are things shipping on time, et cetera, I would be able to write those queries and give them the data. And then shortly after that, I joined the BI team as an official member of the BI team, even though I was unofficial BI that entire time. - Yeah, very cool. Okay, I have a couple questions. So one is about the Keurig business and sort of the digital nature of it over that time period. So in terms of like distribution channels, and you said supply chain, was there a big shift to more online sales over that time period? And did that have a big impact on sort of your work with the data? - So, it's interesting. It actually kind of went the opposite way, where it went more B2B in the long run. So when I started there in 2007, they were the largest shipper in Vermont, or I'm sorry, they were the largest shipper in New England. So this is before people bought everything on Amazon. - Yeah. So of course, Amazon is the largest shipper in New England now. Like I don't have to look it up or check that. You just kind of know, right? - Yeah, yeah. - But in 2007, it was Keurig Green Mountain that was the largest shipper in New England. They were doing about 20,000 orders a day out of the Waterbury distribution center. - Holy crap. - Yeah, it was pretty amazing. And that didn't grow a lot. Their consumer-direct business basically stayed about static over the years, plus or minus. What really exploded was their grocery distribution and their club distribution, club being like Sam's, Costco, BJ's. Retail, so like Bed Bath & Beyond, we would call that the retail channel. So that's obviously like separate from grocery. And as they became a nationwide brand, looking at metrics like what percent of grocery stores are we in? They very quickly got to a hundred percent. - Wow. - Of the, you know, the national brands. So it was just absolute explosive growth. - Wow. Okay, so now walk us through the changes in the tech stack, right? Because, I mean, you have sort of business model change and then explosive growth. And then the other variable is the changing tech landscape, right? - Absolutely.
- The explosive growth generally means we have all sorts of new data needs, all sorts of new processes to build, and way more data than we were ever processing before. So give us that story. - So I was very fortunate that I landed in a company that knew what business intelligence was. And they had a BI team since before I worked there. So I started there in 2007, my first data role in 2008. I don't know how long they had a BI team for, but they had one back then, a small one, but, you know, they had a dedicated team called the BI team, which is really cool because a lot of places that I've done consulting at, they're not there yet. So I was really fortunate to learn in an environment where they were thinking about sustainable data assets, where they were thinking about, "Oh, well, Jeff, you might do it this one way in Excel, 'cause that's all you know how to do, but let's show you how you would do it in a database and how you structure your data to support this KPI and automate it." So I actually had really great mentorship on the tech side as well. So that's really cool, that they were light years ahead of their time, but on the other hand, the actual tech that they were using was, of course, Microsoft SQL Server as their business warehouse. Their ERP system was an Oracle-based product called PeopleSoft, which is, I mean, kind of ancient history now, but it was very much a Microsoft shop, and it was interesting to see it evolve to SAP. And honestly, it almost felt like we were downgrading. When we were coming into SAP BusinessObjects and, I forget what it's called, BW, SAP Business Warehouse, it was a huge downgrade from SQL Server Analysis Services, like the cubes that you have in SQL Server. Going from Analysis Services to BusinessObjects and Business Warehouse was a tough pill to swallow, but then we came into SAP HANA, which acts like a relational database, it acts like a modern data warehouse. Some people might not know about HANA, so just to explain real quick: it's an in-memory database. So when you turn it on, it takes all of your tables and it loads them in RAM. So this runs on massive servers that are specially designed for SAP HANA, usually manufactured by Hewlett Packard. So they have a partnership where Hewlett Packard is designing machines specifically for this database. - I had no idea about that. - It's totally wild. We're talking like terabytes of RAM. - Wow. - That's very cool HANA trivia. - Yeah. And so we were able to do forecast analytics on a six billion point data set using a live connection in Tableau. And you were able to just drag and drop and it would just come up on the screen, maybe with like a one second delay or something like that. - Wow. - Computing live on a six billion point data set. - Wow, that's wild. - Totally. So that was really cool. That was my first, I will say, big data. Six billion, you're kind of getting there in the big data realm. You need specialized equipment to process six billion rows. And that's how I define big data. Some people will say a number of terabytes is big data. I define it as: do you need specialized equipment to actually process this data? And certainly with six billion rows you do. That was my first kind of experience there with big data.
And it was really fun to be able to optimize HANA, like learning about the partitioning, learning how we're going to be clustering our data and organizing it, and using a union instead of a join to do those forecast accuracy comparisons was like a huge speed boost. All of the little performance tuning tips that you learn along the way, it was really enjoyable. - Very cool. And when did Tableau come into the picture? Along with SAP or? - Yeah, I think they implemented Tableau in like 2016. So medium-early adopters there, like Tableau was certainly around in 2010. And then it kind of had huge growth. - Were there any major changes in the pipeline over time, especially as you acquired additional data sources, right? I mean, I'm sure you had, you know, all sorts of pipelines from all these different like distributors, vendors, all that sort of stuff. - Sure. So a lot of my analysis came from two places. SAP directly, our ERP system, our system of record, and our demand planning system, which was called Demantra, an Oracle-based product. And we had Demantra re-forecasting every combination of product and customer every single day. And not only is it doing that once every single day, it's doing it for multiple versions of the forecast. So you can have your pure system-generated forecast, you could have your sales outlook forecast. So like the sales reps, they want to put their little juju on the forecast. That's your sales outlook. You could have your annual operating plan, which is what you set at the beginning of the year and that doesn't really get adjusted. And then all together, maybe you have like nine versions of the forecast. So it was forecasting every combination of customer and product, and then nine different sets of those, every single day. And we're saving all of these snapshots for a certain amount of time before we drop them. And there are certain snapshots that we save forever. So like one week leading up to the sale, we really care about that. 13 weeks before the sale, we really care about that. And the reason why we cared about those two things is because one week leading up to the sale, you can make the K-Cup. 13 weeks leading up to the sale, you can buy the Keurig brewer and get it there from China. Six months leading up to the sale and one year leading up to the sale, those are the lags that we would keep forever. And this was years ago, so I'm improvising here a little bit, but more or less it's directionally accurate. And so we had to come up with a process to ingest all of this data, snapshot it, and then delete certain slices of the snapshots as time elapsed, and then actually do the comparison. So that was like my main play in SAP HANA. You were asking me about multiple systems, so let me circle back to that. They also had IRI data. IRI is like cash register data. So Sam's, Costco, BJ's, Walmart. What did you, Eric, or you, John, actually buy at the cash register? And it's collected at the row level like that. Where then they can actually, it's actually pretty creepy what they can do with that data. They can be like, what's the average credit score of the person that buys our K-Cups? Like they can actually, you know, they can do that because when they get IRI data, they know, unless you pay in cash, they know who bought it. And luckily I wasn't responsible for that pipeline. One of my friends and colleagues was responsible for that pipeline. So I just got to consume that data as a source within HANA. That got piped into HANA as well.
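Stepping back to the union-instead-of-join trick Jeff mentioned a moment ago: here is a minimal sketch, in generic SQL, of what a union-style forecast-versus-actuals comparison might look like. The table and column names, the version filter, and the lag value are all hypothetical placeholders, not details from the episode.

```sql
-- Hypothetical sketch: compare a forecast snapshot to actuals by stacking
-- the two sources (UNION ALL) and aggregating, instead of joining them.
WITH combined AS (
    SELECT product_id, customer_id, week_start,
           forecast_qty, 0 AS actual_qty
    FROM forecast_snapshots
    WHERE forecast_version = 'SYSTEM'
      AND snapshot_lag_weeks = 13          -- one retained lag at a time
    UNION ALL
    SELECT product_id, customer_id, week_start,
           0 AS forecast_qty, shipped_qty AS actual_qty
    FROM sales_actuals
)
SELECT product_id, customer_id, week_start,
       SUM(forecast_qty)                   AS forecast_qty,
       SUM(actual_qty)                     AS actual_qty,
       SUM(actual_qty) - SUM(forecast_qty) AS forecast_error
FROM combined
GROUP BY product_id, customer_id, week_start;
```

Because each branch stays a narrow scan of one table, columnar engines like HANA can often process this faster than a wide join across billions of snapshot rows.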
And I was able to act as a consumer. - You say luckily, so I assume that one was a huge pain. - I think it was a big lift, because they didn't have the integrations that they might have today. So you're dealing with like automating exports and stuff like that. - Sure. - Yeah, with like FTP servers. - Yeah, large FTP servers, very large volumes. Exactly, 'cause it's every sale. So fascinating. And so far I feel like we're learning things, like the details of some of this stuff, that we haven't really talked about on the show. - Oh, totally. I love supply chain analytics because it's its own niche thing. So like consumer packaged goods, it's certainly its own niche thing, but just general supply chain, any, like, if you make heaters, I can help you. If you make airplanes, I can help you, you know, because supply chain of like, knowing how much product we have in stock today, when is more product coming in? Or when are we making more product? How's that comparing up against our demand for product? That's all stuff that now I almost have like a template for. It's like easy problems for me to solve. So I really enjoy talking about this stuff. And those are the types of clients where I really excel, even though I've had clients in all types of industries at this point. - John, I know you probably have a ton of questions, but one last question on this subject from me. Sort of looking back over your time and all the change that happened across a number of different vectors, the technology landscape has changed drastically, but what hasn't changed from your perspective, right? I mean, and what kind of made me think about this a little bit was you saying they had a BI team, you know, before you even joined. What types of things haven't changed, even though the world you described in 2008 is just so drastically different in many ways from a technological standpoint, for a company that would be setting up the same sort of infrastructure? - Star schema. It hasn't, like seriously, like I learned about star schema in 2008 when I was at Keurig, because that's how they designed their cubes. And nowadays we take liberties, and every year at Coalesce they're gonna have a talk called "Is Kimball Dead?" or "Is the Star Schema Dead?" And they kind of say, yeah, like you don't really need to do that anymore. But realistically, what we're doing is we're just not following all of the Kimball best practices of like surrogate keys and like foreign keys and all of this stuff, but we're still keeping our facts and dimensions separate. And we're still joining them right before we send the data to Tableau: we take our fact table and we join it to our customer table, our product table and our fiscal calendar. And then we send one big table to Tableau. So it's not really a star schema because we didn't do it by the book, with all the extra steps and added time and complexity that Kimball said you should do. You don't really need to do that anymore. But this concept of facts and dimensions is timeless. Absolutely. And actually I hear Joe Reis talking a lot about how people don't care about data modeling anymore and how the average data model is very sloppy and it's not following a rigid technique. And to a certain extent, I think that's kind of okay. Like on one hand, it's really cute to have a perfect star schema, but the six weeks extra that it would take versus just like getting the answer. Like here's my query and this gets the answer.
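To make that "one big table" idea concrete, here is a minimal sketch of the kind of last-step join Jeff describes: facts and dimensions are kept separate, and only combined into one wide table right before it goes to the BI tool. The table and column names are hypothetical.

```sql
-- Hypothetical sketch: join the fact table to its dimensions into one
-- wide table as the final step before Tableau (or any BI tool).
CREATE OR REPLACE VIEW reporting.sales_obt AS
SELECT
    f.order_id,
    f.order_date,
    f.quantity,
    f.net_sales,
    c.customer_name,
    c.sales_region,
    p.product_name,
    p.brand,
    cal.fiscal_week,
    cal.fiscal_year
FROM fact_sales          AS f
JOIN dim_customer        AS c   ON f.customer_id = c.customer_id
JOIN dim_product         AS p   ON f.product_id  = p.product_id
JOIN dim_fiscal_calendar AS cal ON f.order_date  = cal.calendar_date;
```

The modeling discipline (separate facts and dimensions) is preserved upstream; the wide view is just a convenience layer for the dashboard.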
Sometimes you have to look at time to value. So you always have to look at time to value, right, Eric? Yes. Such a tired phrase, but so true. So true. Yeah. John, I've been monopolizing the mic. Yeah. So yeah, let's dig into some of that cost optimization thing. You know, we talked before the show that, you know, starting in, let's say 2008, you probably had servers that were, maybe you had them in Rackspace, right? Like that's a nice 2008 hosting company, right? Maybe our servers are in Rackspace. And then at some point, let's say in 2010, 2012-ish, you get into AWS. And then you still have servers that are in AWS. And then eventually you've got things that come out like Lambda, and then further and further down the road to basically pay per the hour and then pay per the second. So there's this evolution here. And I guess maybe walk us through, from your data experience, how that impacted your job on a regular basis. Like I was thinking about this in 2008 or 2010 or 2012. Now I'm thinking about this in 2024. Like what's the evolution there, specifically around cost optimization? Sure. I'll kind of think about this in three chunks. There's my pre-SAP HANA days, where you had the world you described. It's like fixed cost. Yeah. More or less fixed cost. So if your data gets bigger, you need to spend more, but you know what you're gonna pay for a year 'cause you pay for a year upfront. Right. But then there's like the HANA days, and then there's my post-2018 days, and that's where I've seen more change. So like from 2008 to 2018, the big change that I went through personally in my data endeavor was this adopting of SAP HANA. Sure. I'd like to talk more about from 2018 onward, as now I'm getting into Snowflake and pay-for-compute models and stuff like that. It is such a drastic mentality shift. One little pre-note is that not all companies are there yet. One of my main clients today is using Amazon Redshift and they're not using the serverless, they're using the pay-per-month. So it's like they know what they're getting upfront. It's a couple thousand dollars a month and it doesn't change. It doesn't matter if you use it or don't use it. It's fixed cost. There's something to be said for that, because they know upfront what they're gonna spend. There's not gonna be like, whoops, I ran a BigQuery query without a limit clause and I accidentally spent $5,000. You know, like that is actually a problem that people have in BigQuery. Yeah, no, it is. I've done that before, not quite to that extent, but yeah, I mean, we had an analyst at my last company exploring some Google Shopping data. And I think it was like 500 bucks, just for like 20, 30 minutes, because of the data set size. Totally. But to your point, companies budget annually based on a fixed cost model. Like that's what your budget is. It's whatever it is, you know, $30,000 for a warehouse, maybe per month at a large company, maybe per year at a smaller company. But then you have these variable costs. So then if you're in any sort of leadership position, it's actually really challenging to manage the cost up or down. You know, obviously cheaper is always better in general, but then you manage it too well and you lose your budget. And then, you know, the next year you're fighting to get the budget back. It's a tough problem. Whereas before it was like, let's buy this. Let's depreciate it over three years.
You know, it's an asset. It's not just an operating cost bucket like it is now for a lot of companies. Which, I know you can buy reserved instances and, you know, sign a Snowflake contract for multiple years and accomplish some of the same things. But I think the mindset, though, is tough. I've always had a tough time with kind of finance and IT figuring that out, you know, like how do we explain how much it's going to be? Like, well, we don't know. It's not a good answer, right? Yeah, totally. So I think it comes down to: your organization needs to grow expertise in cost optimization for the warehouse that you're using. Yeah. And I'll talk about Snowflake 'cause that's what I know the most about. I think that most companies grossly overestimate their usage in the first year and then go way over budget on their usage in the second year. Yeah. Snowflake, I think it's still a $25,000 minimum investment if you don't want to pay on a credit card. So you could sign up with a credit card, your account costs 20 bucks a month if you just use it a little bit, like whatever it is. But if you want to get, okay, now we want you to send us an invoice instead, and you want a little bit of a discount, you have like a $25,000 minimum investment to deal with that sales rep. Most clients aren't hitting that their first year because they're implementing. They don't have a lot of usage. Yeah. Year two, all of a sudden it's like, oops, we spent a lot of money. And so you really kind of have to figure out how to rein it in. And there are certain things, certain areas that we can look at to rein it in. Would you like me to go into a few of those? Yeah. I think one interesting, well, I think the why behind, like, why did it blow up, right? And I think the positive answer would be, we're trying to do democratized data. We've got all these analysts writing queries that didn't have access before. We have, you know, BI tools where people are building their own dashboards and, you know, maybe even kind of composing their own queries in the tool. So I mean, all that is theoretically a positive thing, but you did just democratize data, which means from a cost standpoint, you just gave a bunch of people access to this thing where the meter runs every time, you know, a new person uses it. Well, and hopefully that's the case, but oftentimes the transformation step is spinning the warehouse more than people actually use the data. Sure. And so you want to know what's your ratio of having your dbt jobs running in your warehouse versus users actually using the data. And the other thing is a lot of companies are paying more to load their data into Snowflake than their Snowflake bill is actually costing them, meaning that their Fivetran bill is higher than their Snowflake bill. Yeah. Yeah. You want to know that. Most companies, I think. If that's the case, you want to know that and you want to fix that right now, basically, because it shouldn't cost a lot of money to load data into Snowflake. So that's kind of looking at the entire set. By using something other than Fivetran, is that what you're saying? It is, actually. Okay. I didn't know if you had some magic Fivetran optimization you wanted to share with us. No. I mean, sorry, it's like, you know, I have my opinions, but Fivetran has a very well-known pricing problem in the industry.
And it's like, I'm not here to talk down about any company by any means, but if you're paying by the row, you want to get fewer rows, and paying by the row is really tough. So let me just give an example. With this forecast accuracy thing that we talked about before, where they're re-forecasting every product every day, that means the entire data set changed every day. It's not like event data where you're just getting new events, and it's like, oh, I want Fivetran to bring in the new events. That means that I have like a couple billion new rows every day. So right off the bat, you can't use Fivetran, because you would never be able to afford it. But a lot of times in supply chain data, when you're dealing with MRP, which is materials requirements planning, or you're dealing with inventory snapshots, or you're dealing with forecasting again, you have more data that changes every day than doesn't. Right? By product, how much stock did I have at this warehouse today? And how much did I have yesterday? Almost all of it changed, if you're a busy warehouse. So you don't want to pay by row for those types of things. You want to either come up with your custom loading, or come up with, I love tools. I actually prefer tools over custom programs. I think that when teams adopt tools, they can move faster. They can be a leaner and smaller team by adopting a few smart tools. So you just want to look at how the tool is priced and make sure that it fits your use case. Well, I think in Fivetran's defense, they have done a nice job of developing a lot of quality connectors. 'Cause if you compare them to open-source connectors on a lot of the top connectors, they're clearly better. So in the marketing space, if you compare HubSpot or Salesforce connectors versus some of the open-source ones, and you're not just a massive operation, it makes a ton of sense. So it does make a lot of sense for a lot of data sources. So, the data sources that aren't in the list of what I just mentioned, where you just have massive amounts of changing records every day. For systems like HubSpot, or even Salesforce, you could do okay. Depending on, I've seen Salesforce run out of control on the Fivetran side as well, but a lot of these connectors, I agree with you, they are high quality, maybe the highest quality, in reliability and accuracy and ease of use. When I evaluate another extract-load tool, I basically say, is it Fivetran simple? And if it's not Fivetran simple, it's kind of just off the list. - So the answer is usually no, right? - Yeah. - Like there are some good ones out there, like that are, you know, in the mix now, but there are a lot of them where it's like, this is just gonna be too hard. It's a good tool, but you need a more technical person to manage it. - Yeah. - But circling back to where we started is, you wanna be hyper aware of, am I spending more on loading data into Snowflake than my entire Snowflake bill? Because that means you're a little bit upside down. - Right. - And then you also wanna know, like, am I spending more on my transformation compute than my end users are consuming? And really see if you could get it to be that your end users are the cause of your high Snowflake bill. 'Cause that's what you really want. 'Cause that means you're paying the price for democratizing data, but that's money well spent, whereas the other money might not be money well spent.
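One rough way to eyeball that transformation-versus-consumption split is to group recent query history by workload. Here is a hedged sketch against Snowflake's ACCOUNT_USAGE views; it assumes dbt runs are identifiable by query tag (dbt can set one via config), and it uses elapsed query time as a crude proxy for warehouse spend rather than exact credit attribution, which tools like SELECT do properly.

```sql
-- Hypothetical sketch: rough split of warehouse activity between
-- transformation (dbt) and end-user consumption over the last 30 days.
-- Elapsed time is only a proxy for credits.
SELECT
    CASE
        WHEN query_tag ILIKE '%dbt%' THEN 'transformation'
        ELSE 'consumption'
    END AS workload,
    COUNT(*)                                    AS query_count,
    ROUND(SUM(total_elapsed_time) / 3600000, 1) AS elapsed_hours
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  AND warehouse_name IS NOT NULL
GROUP BY 1
ORDER BY elapsed_hours DESC;
```

If the "transformation" bucket dwarfs the "consumption" bucket, that is the signal Jeff describes: the pipeline, not the people, is what's spinning the warehouse.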
- Well, and the other thing too, is I like to think of it as a pull model versus a push model. Like, the push model is: open Fivetran and check all the boxes. Like connect everything you can find an API key for, check all the boxes, get all the data in. Versus a pull model of like, hey, there's a business requirement, somebody actually needs something, pull it through: check the box in Fivetran, transform it in dbt, you know, deliver it in your BI tool of choice. Like that's an efficient model, but it's kind of easier to do it the other way, right? Of like, hey, business, look at all this data we have for you. We have everything possible from, and then you just list out every system the business uses, it could be 20 systems. And then A, they're overwhelmed. B, you're gonna waste a ton of money, you know, constantly ingesting and storing that data. Do you see people struggle with that too, where they just kind of check all the boxes, suck it all in, and aren't using most of it? - I do, and every single Fivetran customer that I've helped has gone back and unchecked boxes months later. - Yeah. - You know, it's like, how many boxes can we uncheck? What's being used? If it's not being used, let's uncheck it. - Yep. - So. - It is that hoarder mentality though, where you're like, we might need that, you know? - We might use it again later. - It'll be free, right? - Yeah, it'll be there. - It'll be there. - Yeah, exactly, exactly. - One thing I was thinking about related to cost optimization: you said you sort of have these frameworks that you use to help your clients think through data projects. Do they incorporate, like, are they sort of sensitive to cost optimization? Could you explain your framework? - So, I wish that they were. So, maybe it's my lack of being able to communicate this to the actual IT teams instead of business teams, but a lot of times, at least with the clients that I have, and I'll say that I stick with clients for a long time, so it's usually a few clients, they have not been using my framework to optimize their costs, unfortunately, even though I wish they were. We're using my framework to solve business problems, but not always to optimize the costs. But I would say we could shift into the frameworks, or there are one or two more things that I could say about cost. - Okay, yeah, totally, sorry. I didn't mean to change gears, but yeah. Let's run cost optimization into the ground. - So, just to give people a few tips, the number one mistake that I see out there, which is a very expensive mistake, is to use your virtual warehouses in Snowflake as cost centers. So, creating like a supply chain extra small, a finance extra small, an HR extra small. And so now you're trying to apply a cost center, or the department who's using the warehouse, to the warehouse. Because a lot of times, I'll log into the client's Snowflake account and all three of those extra small warehouses that I just said are on at the same time, when one of them could just be on. So if you just had like a reporting extra small, now, instead of paying to have three warehouses on for three different teams, you just have one warehouse on. - That's kind of like my number one tip: have as few virtual warehouses as you can, even to the extent of just one of each size. And then just use those.
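A minimal sketch of what that consolidation might look like as Snowflake DDL, assuming one shared reporting warehouse with aggressive auto-suspend. The warehouse name, role, and settings are illustrative choices, not from the episode, and the multi-cluster parameters require Snowflake's Enterprise edition.

```sql
-- Hypothetical sketch: one shared extra-small warehouse instead of
-- per-department warehouses, billing only while queries actually run.
CREATE WAREHOUSE IF NOT EXISTS reporting_xs
    WAREHOUSE_SIZE      = 'XSMALL'
    AUTO_SUSPEND        = 60        -- suspend after 60 seconds idle
    AUTO_RESUME         = TRUE
    MIN_CLUSTER_COUNT   = 1         -- scale out only for concurrency
    MAX_CLUSTER_COUNT   = 3
    INITIALLY_SUSPENDED = TRUE;

-- Point every team at the same warehouse; attribute cost by query tag
-- or dbt comments instead of by warehouse (as discussed next).
GRANT USAGE ON WAREHOUSE reporting_xs TO ROLE analyst_role;
```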
And then the way that you actually apply your cost centers and figure out who's using the data is through either query tags or putting comments in your dbt models. So, I'm a huge fan of select.dev. They're a Snowflake cost optimization company. They have kind of turnkey ways that you can see who's using what data, and what they do is they apply a comment to every query that gets executed in dbt or Tableau, however you specify it. So then I could say, okay, my supply chain team spent $30 consuming data and my HR team spent $100 consuming data. So it's a much better way of allocating the cost centers than by splitting out your virtual warehouses. - Yeah, let me comment on that. I made that mistake early on as well. - I did too, but that's how everyone did it. You have to learn it, yeah, but go ahead, John, sorry. - Yeah, yeah, I made that mistake early on. Not too badly, but I had an ingestion warehouse, a transformation warehouse and a reporting warehouse. So I split it by workload. And then, you know, you look at the bill, and then the logic of like, oh, wow, all three of those are running. Guess what? That's three times as much as one of them running. So then we just tried like, hey, what if we just literally combined everything into one? And everything was fine, didn't have any performance problems at all. So, yeah. - And then turn on the auto-scaling then, so. - Yeah, sure. - For your extra small, let that scale up to maybe three nodes or, yeah, three clusters actually, yeah. - Well, let's actually talk about auto-scaling just for a second. - Yeah. - I'm curious, like, what's the trade-off? So like you have a smaller warehouse that runs for longer, right, to do something, versus a larger warehouse that can run for shorter, but it's more expensive. Like, how do you think about that? - So, you're talking about scaling up to the next size warehouse? - Yeah, right. - If the query runs for more than two minutes, then you could try and get it under one minute. So basically, one minute is the minimum billing increment. So if you're running a query for exactly a minute, you've had a hundred percent utilization. So you don't want queries running for less than a minute on a larger warehouse size, but you want them running the shortest amount of time possible, down to that one minute. So that's one kind of target that you can go for. - Nice. - And the way that you can tell if scaling up is actually going to help you is, if you look in the query profile, there's something called disk spillage. And then there's spillage to remote storage, which means it actually spilled to S3. - Okay. - So a virtual warehouse is actually, it's a computer, right? So it's trying to process as much as it can in RAM. And when the RAM runs out, it's spilling to disk. But when you fill up the hard drive on that virtual computer, it's now dumping out to S3. And if it's dumping to S3, you know for a fact that going to the next warehouse size up, it's gonna have a bigger hard drive, and it won't dump to S3 'cause now it has a bigger hard drive. - And more RAM, right? - And more RAM. - It has more everything, that's right. So looking at disk spillage, and specifically the remote spillage, is how you can tell if scaling up will help. And then you don't want to scale up to the point where your query is running in two seconds. You want it to run for a minute. Like, let's say the query is taking an hour before on an extra small. - Yep.
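As a sidebar on the spillage check Jeff just described: rather than opening query profiles one at a time, you can scan recent history for remote spillage. A hedged sketch against Snowflake's ACCOUNT_USAGE share follows; the lookback window and limit are arbitrary choices, not from the episode.

```sql
-- Hypothetical sketch: find recent queries that spilled to remote (S3)
-- storage, the signal that a larger warehouse size may pay off.
SELECT
    query_id,
    warehouse_name,
    warehouse_size,
    ROUND(total_elapsed_time / 1000, 1)             AS seconds,
    ROUND(bytes_spilled_to_local_storage  / 1e9, 2) AS gb_spilled_local,
    ROUND(bytes_spilled_to_remote_storage / 1e9, 2) AS gb_spilled_remote
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND bytes_spilled_to_remote_storage > 0
ORDER BY bytes_spilled_to_remote_storage DESC
LIMIT 50;
```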
- You know, you could go an order of magnitude up until that thing is running in a minute, and then it's basically costing the same. - Hmm, yeah, nice, that's super helpful. - Yeah, that is very cool. - And then there's this concept of scaling horizontally. So you can have an extra small warehouse and then you could say min clusters and max clusters. So I could call it Jeff's extra small, okay? And I'll say max clusters is three and min clusters is one. That means, and that's a concurrency issue. So if only I'm using it, it'll just be one cluster running. If now all of a sudden there's like 40 or 50 Tableau users using that same warehouse at the same time, it'll just automatically spin up an extra cluster of Jeff's extra small. So now it's scaling horizontally to handle concurrency, versus scaling up to handle a harder query to process. So I think three is a good number just to start with. So let everyone use the smallest warehouse you can, let it scale out to three if you think you need to. And for Tableau, a lot of times I'll start Tableau with a medium just because I want Tableau to be a little bit faster, but again, letting it scale out to three if it needs to. And it almost never does. - Yeah. - It has to be at least eight queries for it to scale out one more. - Well, if you think about your average workload, right, at a midsize company or even a larger company, I would guess that there are certain peak hours and then peak times of the month where you may have, like, you know, 100 people or 80 people or whatever, but the majority of the time, even during the work day, it's gonna be a fraction of that. So that makes a lot of sense. - Totally, totally. Yeah, and that's what's really nice is that Snowflake will then automatically handle it by, hey, what's the number of queries I have in my queue right now? Okay, let's turn on another one just to kill that queue. - Yeah. - It's nice, so. - Yeah, so that's my best cost saving tip. That's almost like the free lunch one that anyone could do. You could do it right now without making a huge impact to your organization. You do have to be a little bit careful 'cause you might break some jobs which were using the warehouses that you're deleting. So yeah, like do them one at a time, see what breaks. Well, first understand ahead of time what you think is gonna break, fix it, turn off one, see what breaks, you know, that type of thing, so. - Love it. - Super helpful. Why don't we talk about your frameworks to round out the show? - I'd love to. So when a client comes to me with a particular business problem, and it's normally the business teams reaching out to me more so than the IT teams, I'll just kind of put that on the table, that it's very much like there's a business problem that someone's trying to solve. There's two different frameworks that I walk them through. And the overarching idea that sits on top of these two frameworks is this concept that the only reason to use data is to control a business process. So you wanna use data to control something. You know, it's not an FYI thing. So, example: how much did we sell last week? Or, let's even generalize it further, a sales dashboard. Why have a sales dashboard unless you're trying to control your sales to meet that target? Like a sales dashboard isn't an FYI thing. It's a: are we marching towards our goal? And if we're behind our target for this point in the year, what are the reasons why I'm behind my target? So that's kind of the overarching thing that sits on top of my two frameworks.
Framework number one comes from the Six Sigma manufacturing methodology. Within Six Sigma, they have this process called define, measure, analyze, improve, control. And you can see this framework is set up so that it ends with control. So first we start thinking about controlling a business process. And then we say, here are the steps to actually control the process. We're gonna define what you're trying to measure and what your problem is. Then we figure out how to measure it, analyze the result, and what are you gonna do to improve it? And then you get to this point where all you're doing is using the dashboard to control the process and make sure that the process is in control. And you can apply this to anything. You can apply it to sales, to forecast accuracy, to your Snowflake spend. So the one thing that I love about this tool called SELECT that I mentioned earlier is that it has dashboards that show you where you're spending and what your most expensive queries are. But it really comes down to this fact that we're going to use it to control the process of getting our spend under control. So that's kind of framework one. And then framework number two is: what should a dashboard do? What should be on a dashboard? Well, every dashboard should do three things. It should say: where are we today? What is the trend? And what is the call to action? And so if we unpack each one of those three: where are we today? That's at the top, what they call a BAN, a big-ass number, at the top. This Tableau dashboard came to my email, I see a picture of the dashboard in the body of the email, and there's a green check at the top next to a number: I know I'm good. Or there's a red X at the top next to the number: I know I'm bad. That's the where-are-we-today. And if we think about, we'll just use the sales dashboard example 'cause it's a very simple concept for people to think about. If you're behind your target, it's an X, right? So number two of what all dashboards must do is: what is the trend? So this might be your sales in weekly buckets on a line chart. The simplest tools in Tableau, the simplest chart types, are the best chart types. So we have a line chart that says here's our sales by week, here's our target by week, which weeks were above and below the target. And same week last year, maybe. So maybe you have three lines on a chart. So you could see how this year is comparing to your target and to last year. So now you have a really good picture of the trend. And then the third thing is: what is the call to action? Said another way, what are the items within my business topic that are driving my business results? And if we think about a sales dashboard, it's like, well, these three sales reps are really behind on their sales. They're the problem. Or these three product IDs are really behind on their sales. These products aren't doing well, or these brands, or these package categories, or whatever it happens to be. But at the bottom of the dashboard, below the trend chart, is going to be a bunch of different bar charts, usually, that can then act as filters on the stuff above it. So I could click on this sales rep's name and then the whole dashboard filters to just that sales rep. And I could see how far behind is he? What are his products? How are his products doing? And so that brings it all together of, we now have where are we today? What is the trend?
What's the call to action? And I can use that whole thing to more or less bring the process under control. So it's the manager's job, who's consuming the dashboard, to then use it to go figure out why the problems are the problems. That's my framework. - Yeah, I love it. Okay, one question right off the bat, and I think I know the answer, but do you find that this increases or decreases the number of dashboards in a company? - I think it does increase, 'cause they want to control more points, right? But I do like this idea of, so, I mean, Ethan Aaron always talks about 10 dashboards, which I don't think is a reasonable number of dashboards for an organization that has more than 10 departments. Like, just think of any company that makes something that goes on the grocery shelf, right? They have manufacturing, they have distribution, they have purchasing, they have human resources, they have finance. They have more than 10 departments, so you can't just have 10 dashboards, right? But each team should not have more than 10, and each individual team maybe should only have like five KPIs that they're really looking at. So if you're the supply chain team, you should care about things that are in stock, things that are late, things that aren't shipping on time, and like how your inventory is, and is your forecast accurate. And hopefully that's five things that I just said. And then anything else that you want to measure, you tie back to which one of those five it's driving. And then you end up with maybe 50 dashboards for the whole entire organization. But to get back to your question, it does beg a lot more questions, and they say, well, that was so effective, what else can we control? - Yeah. - And I just did this at one of my clients where the first dashboard went live, they put it on the monitors throughout the building, and everyone was looking at it as they were walking around throughout the building. And instantly this particular KPI within like one week got so much better, like hundreds of percent better, basically. Like it was just a marked improvement, because people's names are on the dashboard. If you put someone's name on a dashboard with a problem next to their name, they're gonna go clean up that problem. And the problems started getting cleaned up right away. And you want to be careful because you don't want a punitive culture, but you also want an effective culture as well. - Right, tell me about this. This is something that I guess I hadn't thought about in this context, but I've done before, where it's almost like, from a dashboard perspective, anything you put up, so you mentioned that first one got a lot better, right? But that same dashboard a year later, people just walk by and ignore it. So one of the things that I've done in the past is, say we've got five metrics. It's keeping it visually fresh, but also almost like rotating, because it's like, all right, team, we're gonna focus on improving this metric, and almost visually like, all right, that's the biggest number, everything else is still there. And almost like keeping people's interest as part of the strategy, and keeping people's focus, where you're focusing on five things versus one thing. So let's pick out what do we think's most impactful this month or this quarter. Focus on that, make that really big, make that the focus.
Put everything else out there 'cause we don't want it to completely drop off on the other four, but we wanna focus on the one. Have you seen people implement that strategy or thought about that? - So one thing that I've done is, when it's time to say, hey, this dashboard is no longer the thing that we need, it's not the hot topic anymore because we've achieved control, basically. - Yeah. - Is you just put a watch on it. You could sunset the dashboard and just do like a Slack alert that's like, hey, if this number goes above 300, I wanna know, because then maybe I can pull that dashboard out of the archive and see what's happening. - Yeah. - So yeah, I think it's just like putting a watch on things. So you could use Tableau alerts, you could use Slack alerts if you have a pipeline tool. - Cool, yeah, nice. - Awesome, we're at the buzzer, and this has been an amazing conversation, Jeff. I mean, I don't actually think we've discussed SAP HANA on the show, maybe ever, but we learned some really interesting things about it. We learned how to optimize Snowflake, and love your framework on the dashboarding. I was thinking back through, you know, unfortunately in all the sort of analytics and BI stuff I've done, we never used a framework that clear. But as you were describing that, I was thinking back on which ones worked really well, you know, like which ones can I sort of look back on and be like, that was actually extremely effective, and they basically all aligned with almost exactly that. It was like, oh yeah, the big number at the top where it's like, this is good or bad, and that's really the only thing that matters. Like, yeah, that was really effective. - Awesome, I'm so glad to hear the conversation was enjoyable, and yeah, thanks a lot guys for having me on and asking great questions. - Absolutely, absolutely, well. - Have a great one up in the Green Mountains, and we'll see you out in the data sphere and on LinkedIn. Peace. - The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com. (upbeat music)