Archive.fm

Cloud Commute

Timeplus: Streaming Analytics for Realtime Data | Jove Zhong

Duration:
29m
Broadcast on:
13 Sep 2024
Audio Format:
mp3

- You described it as a data platform. I think for a lot of people, they would probably describe it as a streaming engine or a processing engine, is that correct? - This category is still in the early stage, right? So people all know what is a database, what is a data warehouse, then people talk about the data lake. The stream processor has been there for a long time. If you're familiar with Apache Flink, for example, that's one of the best open source stream processors, and it works well in many cases, and I think there's a huge developer community in Germany and in many other parts of the world. However, Flink is designed in a very elegant but complicated way, right? And it's not easy to get started with Flink, and if you set up your own Flink cluster, it's not cheap. It requires a lot of CPU, memory, and tuning, and the overall experience of Flink is not great enough. That's part of the reason why maybe Apache Spark is a little bit more popular. - You're listening to simplyblock's Cloud Commute podcast, your weekly 20-minute podcast about cloud technologies, Kubernetes, security, sustainability, and more. - Hello everyone, welcome back to this week's episode of simplyblock's Cloud Commute podcast, this week with another incredible guest. And yes, I know, I say that every time, and you know it's true every time. So hello, Jove, thank you for being here. I think we've never met before, so maybe just give me a quick introduction about you and your background. - Hi Chris, I'm so glad to join the show. And yeah, my name is Jove, and I'm a co-founder at Timeplus. So Timeplus is a streaming database, or streaming SQL, or streaming analytics, depending on what kind of category you are coming from. So we are essentially providing a very unique capability for you to understand what's going on right now and also what happened in the past. You can even do some machine learning, real-time training on the current data, and project the future data points.
So we provide both the open source core engine as well as some commercial software, on the cloud, bring-your-own-cloud, or self-hosting. And as a developer, you can connect Timeplus with your real-time data feed, for example, data in Apache Kafka, or, for example, a Postgres database where you can apply some CDC. All that real-time data and historical data can be put into Timeplus, and you can just leverage SQL to understand the patterns with some real-time and low-latency aggregations. And this can be quite useful for any kind of use case that relies on really low-latency data points, for example, whether you are being attacked, in cybersecurity, or if you do any kind of trading in the traditional financial sector, or in web3 blockchain. You might leverage our system to understand: should I buy or sell more of my portfolio, given the price and the momentum of the past few seconds? So we are more like a general-purpose data platform focusing on the real-time part, but it can be applied to many use cases. And I'm happy to be part of the show and share my story and some of the technical details. - Right, so you described it as a data platform. I think for a lot of people, they would probably describe it as a streaming engine or a processing engine. Is that correct? - Yeah, this category is still in the early stage, right? So people all know what is a database, what is a data warehouse. Then people will talk about the data lake, and the stream processor has been there for a long time. If you're familiar with Apache Flink, for example, that's one of the best open source stream processors, and it works well in many cases. And I think there's a huge developer community in Germany and in many other parts of the world. However, Flink is designed in a very elegant but complicated way, right? And it's not easy to get started with Flink.
And if you set up your own Flink cluster, it's not cheap. It requires a lot of CPU, memory, and tuning, and the overall experience of Flink is not great enough. That's part of the reason why maybe Apache Spark is a little bit more popular in the data world. Maybe in terms of real-time processing or streaming processing, Spark is not as strong as Flink, but overall Spark has a much easier developer experience and nice integration with Python, for example, and it's also backed by Databricks. So if you are a Databricks customer, you have a better version of Spark, and you can do a bunch of other magical ML and data engineering stuff in the Databricks platform. However, we think we can do something better than Flink. At least this is our motivation as a startup company. And we are not just focusing on the processing part; we also have our own storage engine. So you can just send data to us, no matter whether it's fresh data or historical data, and we can leverage a single engine to allow you to ask questions about what's happening right now and also what's the data trend or data pattern over the past two years. You don't have to send a query to different systems. Otherwise, you might have to send some data to, for example, Flink and some other data to Snowflake, and when you ask questions, some joins will be very difficult. We are a single platform to handle both. So that's something where we really want to make data engineering a little bit easier, especially if you care about low latency and also want to join real-time and historical data. - Right, and as far as I know, Timeplus is all about SQL. So you basically extended the standard SQL engine, or the standard SQL syntax, and added streaming capabilities to build all the queries from source to sink, right? - Yeah, SQL is, at the very beginning, our primary interface; even at the very beginning, it was the only interface.
And later on, we added a few more things like user-defined functions, UDFs, so we can support calling a remote service, or you can leverage JavaScript or Python to define your own logic. But all that customized logic is still defined as a function, a SQL function, and then you still put this back into your overall SQL statement. So for example, compared to Apache Flink, which is based on low-level APIs where SQL is kind of an abstraction layer, so you write SQL but it will eventually be compiled or translated to some low-level API, that's not the case for Timeplus. For us, SQL is our own API. There's no underlying DataFrame or whatever API. But we do realize that SQL can be very powerful: you can write a bunch of, say, CTEs or sub-queries, but sometimes you do need to write more complex logic, or you want to leverage your own JavaScript or Python libraries. So in that case, we enhanced our UDF framework; we can talk about that later. But that's really something in the middle, a more flexible way for advanced users to apply their own logic in their own SQL pipeline. So you can do many things, and it's very efficient. In many cases, the JavaScript engine, which is Google V8, is almost as fast as native machine code. So you don't really sacrifice performance; SQL basically remains the main interface. - Right, so you already mentioned a few use cases in the beginning, but what are the big things? I guess real-time dashboarding, finance, I think you mentioned the stock market. What are the big ones? - Yeah, sure. So in terms of use cases, there's many things we can do. And to be honest, some of them, or at least half of them, many other streaming processors or vendors can do as well. So to bring it into the overall picture, there's a lot of common use cases. For example, what we call streaming ETL, that is, you keep moving data from A to B. For example, you have your original transactional data in Postgres or MySQL.
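As a rough illustration of the UDF idea Jove describes above, here is a plain-Python sketch (the function name and the symbol-cleanup task are hypothetical; this is not Timeplus's actual UDF registration API): custom logic lives in an ordinary function, and the engine invokes it for every row flowing through a SQL pipeline.

```python
# Hypothetical sketch of the UDF concept: the engine would call a
# user-defined function per row inside a SQL pipeline. Here we just
# simulate that call loop in plain Python.

def normalize_symbol(raw: str) -> str:
    """Hypothetical UDF: canonicalize a raw ticker symbol."""
    return raw.strip().upper().replace(" ", "")

# Simulate the engine applying the UDF to each streaming row.
rows = [{"symbol": " btc-usd "}, {"symbol": "Eth-Usd"}]
out = [normalize_symbol(r["symbol"]) for r in rows]
print(out)  # ['BTC-USD', 'ETH-USD']
```

The point of the design discussed in the episode is that such a function slots back into a SQL statement as just another function call, rather than forcing you down to a low-level API.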
And with all that data you want to create a real-time dashboard, or you want to create a real-time alert: say, if my average customer rating today is lower than three, then notify the manager and do something. Or, for example, if you are offering some service in a certain area and they don't have enough goods or inventory, you might want to send some there. So it is very natural to leverage another system, not the original Postgres, to do all the analysis, because you don't want to slow down your Postgres. So that's real-time CDC; it is very common. And also, you want to do some real-time alerts, or you want to build your feature store in terms of machine learning. You want to grab all those original data sources, as many as possible, as fresh as possible, and then convert them into a bunch of numbers that act as a feature store so that you can apply more machine learning or feature engineering. So there are many things this industry is focusing on, and there are also open-source solutions and commercial solutions. But what really helps Timeplus stand out is that we are very focused on performance, and also a low footprint. So I'll give you an example: many systems can do per-minute, or maybe per-few-seconds, data movement or transformation. But with Timeplus, because we implemented it in a special way, we can easily achieve single-digit milliseconds end-to-end, meaning that when data is pushed to Timeplus, with all the stream processing or SQL logic, we can show the results downstream or trigger the alert, and the entire end-to-end latency is maybe five or six milliseconds. So this is quite useful in some of the scenarios where we need really low latency, for example, trading or risk analysis, right?
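The "average customer rating below three" alert mentioned a moment ago can be sketched in a few lines of plain Python (the window size, threshold, and message text are illustrative assumptions, not Timeplus behavior): keep a rolling window of incoming ratings and fire a notification whenever the mean drops below the threshold.

```python
from collections import deque

WINDOW = 5        # assumed rolling-window size for the sketch
THRESHOLD = 3.0   # alert when the average rating falls below this

ratings = deque(maxlen=WINDOW)  # only the most recent WINDOW ratings
alerts = []

def on_rating(value: float) -> None:
    """Handle one incoming rating event and check the alert condition."""
    ratings.append(value)
    avg = sum(ratings) / len(ratings)
    if avg < THRESHOLD:
        alerts.append(f"avg rating {avg:.2f} below {THRESHOLD}")

# Simulate a stream of rating events arriving one by one.
for r in [5, 4, 2, 1, 1, 2]:
    on_rating(r)

print(alerts)  # two alerts fire once the rolling average dips under 3.0
```

In a streaming SQL system this condition would live in a query over the change stream rather than in application code, but the event-at-a-time evaluation is the same idea.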
I mean, if you are just looking at a dashboard or some other kind of data consolidation, single-digit milliseconds is nice, but it may not really be required. But for the case of trading, right, it's a very competitive space. Everyone wants to be faster, and you don't have to be super fast, but as long as you are faster than others, you have a better chance to win, or to lose less money, right? So everyone pushes very hard to have the best network cables and the low-level optimizations they want; they do everything they can to achieve better performance and lower latency. And it is true that we work with some of the best financial companies, or even some of the people in the blockchain or web3 space. They leverage our technology to get to that low latency. But also because we have this, we call it streaming SQL or a materialized view, which is kind of a new concept if you are not familiar with it, but essentially it's a long-running SQL query. It keeps scanning all the new data and sends you the results. And the query itself is a background job. It never stops, and it keeps working. So that way, you don't have to set intervals, like "I want to query my system every one minute or every two seconds." You don't need to do that. You just define your logic in your SQL, and whenever data comes, in as low as five milliseconds you will get a result, whether it's a buy signal or a sell signal. - Yeah, it's basically a table which is always up to date with the latest data that matched the query. Very, very useful stuff. I think in terms of trading with high performance, it's probably mostly HFT, for people that don't know, that's high-frequency trading. Those guys are insane, like totally insane. I mean, they buy and sell in sub-second orders. It's completely out of this world. - Yeah, they don't really care too much about the quarterly report or what's going on on Twitter.
They just look at the numbers themselves. (laughs) - Right, exactly. - I can't figure out how that's human. - I have a friend who's working a lot in HFT, and just listening to him is like, yeah, whatever, make your money, I'm glad. In terms of sources and sinks, you already mentioned Postgres and MySQL, I guess Flink, Spark, the common ones. Is there anything very special, where you'd say, yeah, that does work or that doesn't work? - Yeah, for us, the most preferred data source is Apache Kafka, right? Right now there are many other Kafka-API-compatible services. They speak a similar or even the same protocol as Apache Kafka, so you don't have to literally use Kafka, but it's still very common to leverage Kafka to consolidate your data sources. So, I mean, the point is that we don't have to talk to individual databases or individual APIs; if there's already some integration to move that data to Apache Kafka, then we only talk to Kafka, and that's easier. And Kafka also supports a very nice schema registry and different data compression and retention policies. And today, for example, if you operate your own Apache Kafka, it's not so easy. So some people choose a managed service such as Confluent Cloud. But that also brings up the issue that, as some people complain, Confluent Cloud is not cheap. So that's the reason why people like Redpanda or WarpStream come up with their own solutions, whether reimplementing the protocol in C++ or using Golang, or choosing whether to use object storage or not. So they can bring the cost down a lot, but that may introduce a few other things you want to worry about. For example, on the latency requirement, WarpStream may not provide as good low latency as others, but they do a very good job of leveraging object storage, so it's much cheaper.
- Right, but that means if Kafka is your preferred data source, you either have your own connectors or you can basically connect anything that Debezium supports as well. - Yes, Debezium is very popular for so-called CDC, moving data from the original OLTP database and translating it to JSON or a similar format to capture what has changed. Debezium can write it out to, for example, a Kafka topic, and we can read it from there. It's actually a very lean and compact structure: what is my schema, which column changed, what's the before value, what's the current value. Then we translate it to our own, no matter whether it's a SQL statement or an insert, so that we can mirror, almost in real time, what was happening in the original database. However, Debezium can also be leveraged, I guess, without Kafka. I think in some cases you can even leverage Debezium as a library to feed into our system, but today we still focus a lot on Kafka data sources. Meanwhile, we also have our own REST API for you to push data to us without something in the middle, and this can be pretty useful if you have, for example, a few IoT devices, right? Maybe those IoT devices can send data to Kafka via some Kafka library, but sending plain HTTP requests is easier, and Timeplus itself has its own buffering and handling, so that you can send data to us as soon as possible, and we can make sure that data can be analyzed in real time. You don't really have to set up Kafka, but if you already have Kafka, then perfect, it's really a nice way to consolidate your data, and as a data engine, we don't have to worry too much about every new protocol. - Right, right. From a developer's perspective or an operations perspective, I know there's Timeplus, sorry, Timeplus Cloud (laughs), and as far as I know, you can also deploy it yourself.
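The change-event shape Jove describes, which column changed, the before value, the current value, can be sketched like this (the JSON below is a simplified, abridged stand-in for a real Debezium envelope, and the upsert logic is an illustrative assumption, not the Timeplus implementation):

```python
import json

# Simplified Debezium-style change event: operation type plus the
# row's before/after images. Real Debezium envelopes carry more
# fields (schema, source, timestamps) than shown here.
event = json.loads("""
{
  "op": "u",
  "before": {"id": 42, "status": "pending"},
  "after":  {"id": 42, "status": "shipped"}
}
""")

# Translate the change into an upsert for a downstream store,
# keeping only the columns whose values actually changed.
if event["op"] in ("c", "u"):  # create or update
    row = event["after"]
    changed = {k: v for k, v in row.items()
               if event["before"] is None or event["before"].get(k) != v}
    print("upsert id", row["id"], "changed:", changed)
```

Replaying these events in order is what lets the downstream engine mirror, almost in real time, what happened in the original database.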
I think there is even a Homebrew, like a macOS Homebrew version, but there are Docker images, there's everything. I mean, the Homebrew version is great for developers who want to try it out, who develop on their own machine. There's literally no easier way to get something on macOS. Maybe just talk a little bit about the different options and what the pros and cons are between all of those. - Yeah, sure. So eventually it's simple: we really want to make developers happy, at least let them move fast, right? I mean, being a developer is not easy. You have to read a lot of documents and you have to set up dependencies. For example, on my machine, I may have 10 different JDK versions, and depending on which software I use, I may have to keep switching; sometimes I even have to switch to a different Node version or Maven version. There's a lot of complexity in the developer world, and we want to make things as easy as possible. So we want to provide different options. We have this core engine open source on GitHub. We call it Proton, or Timeplus Proton. This is our core engine, and it's as simple as a single binary. Depending on your Linux distribution, it might be a 200 or 300 megabyte single binary, a single file, and you just chmod +x to make it executable and you can run it. There's nothing else; it's super easy. But we also provide things like Homebrew so that you don't have to download it yourself, just use Homebrew. I personally run brew update every other day on my Mac to make sure all my software is up to date. It's more like cleanup; it's cleaner than my room, I guess. So it's very easy for you to get to the latest version. - Yeah.
- And also we have a Docker image, for sure, and you can also install our software using a curl script. But more importantly, if you want to run this in production, what we offer is a fully managed cloud. So you just need to log in with your social account, no matter whether it's Google or Microsoft, and you can see some of our live demo systems, that's for sure. But if you want to try it yourself, you can jump from the demo system to your own workspace, and then you get a free trial for 14 days. That's fully managed by us, updated by us, and tuned by us, and it's easier for sure. But if you want to do self-hosting for different reasons, for example, you have your Kafka on-premises and you don't want to expose your Kafka endpoint to the internet, or you have some other databases that are not ready to be put on the cloud, then you can opt to use our self-hosted version. We do provide Kubernetes Helm charts for you to install easily on your Kubernetes, or if you want to just set up three or five different VMs, you can install the binary and configure them to create a cluster by yourself. So all those options are available, and it really depends on the use case. For example, we have normal customers who use the Kubernetes Helm chart, which is very easy, but we also found some users who just need to run the software on a small server, or even a single node, and they have a lot of data. So it's more similar to edge computing: each server has access to certain files, and they want to react to them quite quickly.
We even have some more classic edge scenarios: we have some clients doing a PoC with us where they have a lot of trains, and on a train you don't always get a good signal, for example, if you are in the countryside or you go through some tunnels. So they prefer doing the computing on the train. Every single train has a bunch of small devices, and the devices only have limited resources, for example, eight cores and maybe 16 gigabytes of memory. And in the past, they wanted to put everything together: Kafka, Apache Flink, the warehouse, Grafana, all on such devices, and that caused quite a few issues. Now they can just put Timeplus there, and since Timeplus is a very efficient engine, they can do on-device monitoring and alerting on the train. - Yeah, so that's a lot of deployment options for sure. But yeah, you're free to choose whichever one you think works best for your scenario. - All right, you said Timeplus Cloud. Are you using Kubernetes internally? Is it using Helm charts? - Yeah, sure. (laughing) Not really Helm charts, though. So I guess, again, this is my view, people might have different opinions, but in practice at Timeplus, we use our own Kubernetes operator in the cloud. Again, it's Kubernetes, but literally we use EKS, right, AWS's managed Elastic Kubernetes Service. The reason is that we also want to minimize our operational effort. So, for example, the master node of Kubernetes is managed by AWS, and they also have something like Fargate, so you can worry less about the infra, the VM stuff. AWS doesn't always provide the most recent or latest Kubernetes version, but they do a lot of tests and apply a bunch of patches. So we're happy to be using EKS.
But the reason why we don't directly use a Helm chart, and use an operator instead, is that we also want to handle multi-tenancy and some other things better. Using an operator is essentially customized code, right? So you can write your Golang code in an operator, and then when someone sets up or upgrades, we can do extra things. A Helm chart is really designed for people who don't care too much about multi-tenancy; they just want to set up a multi-node server, and a Helm chart is just a bunch of templates and parameters for you to wire things together. - Yeah. - Yeah, that makes sense. We're almost at the end of our time. One question I always ask people: what do you think is the next big thing? What do you see on the horizon? Be the visionary. It could be in Timeplus, could be in stream processing, could be in databases, AI, whatever you think is cool. - Yeah, I would say certainly AI is a big thing, right? Some people think there's some bubble; I have my own opinion, but I guess there's something we need to do together, which is making sure AI can get the latest fresh data, correct data, right? I mean, usually they don't do this very well. They have a lot of historical patterns, but they don't really have good access to fresh data. That's how AI and data can get together. Even not talking about AI itself, there's a lot of movement today toward cloud-native data warehouses and things like that. People can put more data in the cloud at a very low cost. But step by step, people will realize: I don't just need a huge amount of data, I need to understand what's going on right now. So real-time data, streaming data, I think will have bigger room to grow. And with AI, we can even somehow partner with the AI engines or AI models to come up with better context and provide better recommendations. - All right. I think that makes a lot of sense.
I also feel like machine learning plus real-time analytics data is probably the next step, integrating those two things. And it happens a little bit already, I think. A lot of the, what's it called, fraud detection algorithms are basically using some kind of pre-trained AI model or machine learning model and feeding the current data to it. All right. Cool. I think that is nice. Anything else you want to put out? Anything you want to share with the world? - Yeah. So I'm wearing the Timeplus T-shirt if you are watching the video, but if not, just go to timeplus.com. I think there's a button to allow you to try either our open source version, the free trial version on the cloud, or our self-hosted version. And if you're a user, if you can give us a star on our Proton project, that'd be great. Give us some feedback, raise your requirements, report some issues, help us with the documentation, or just chat with us in our community Slack. You can meet more people, learn about different use cases, and help each other to build a better world together with fresh and correct data. - That's awesome. I'm happy to put all of those links in the show notes. If you want to find anything, you'll find it. Yeah, Jove, thank you very much for being here. It was a pleasure having you. I hope we see you at a conference somewhere soon. - Yeah, Chris, likewise. - I think we never met in person, but for the audience, you know the drill: next time, next week, same place. And I hope you come back and listen in again. Thank you very much for being here as well. - The Cloud Commute podcast is sponsored by simplyblock, your own elastic block storage engine for the cloud. Get higher IOPS and low predictable latency while bringing down your total cost of ownership. www.simplyblock.io (upbeat music) (gentle music)