Archive.fm

Cloud Commute

Introduction to Side Channel Attacks using CacheWarp

Duration:
31m
Broadcast on:
02 Aug 2024
Audio Format:
mp3

But then we started looking at this interaction. Like, okay, we built this super cool stuff, we have that in the CPUs now, that's pretty new. And on the other side, we have instructions in the CPU that were introduced at the beginning of x86, and nobody uses them anymore. CPUs become more and more complex. At that point in time, it was the predictor making guesses where to jump next. Look outside your window and you see if the lights are on. And if the lights are on, there is a certain probability that the neighbor is at home. The lovely noisy neighbor problem that everyone knows from the cloud. I love the example with the neighbor. That's just incredibly visual for everyone to understand. So if this noise that you're seeing depends on some secrets, then it gets interesting. And this is what we do.

You're listening to simplyblock's Cloud Commute Podcast, your weekly 20-minute podcast about cloud technologies, Kubernetes, security, sustainability, and more.

- Hello everyone, welcome back to this week's episode of simplyblock's Cloud Commute Podcast. As you know, I have another incredible guest. I know I say this every single time, and it's true every single time. With me this week is Michael Schwarz, a fellow German, actually a security researcher with CISPA, but he's going to say a few words about that himself. So welcome, welcome, Michael. Maybe just introduce yourself very quickly.

- Thanks for this nice intro. Yes, as you already said, I'm a security researcher. I'm working here at CISPA, which is a big research center in Germany and, if you look at the academic rankings, one of the worldwide leading institutes for cybersecurity. In my position here, I'm leading a research group; I'm a faculty here. Currently I have six PhD students that I advise and, additionally, five student helpers that support me in our tasks. We are a bit bigger, not the biggest group, but still a considerably large group. And we are working on very specific topics, one of which is also what I'm talking about today: we are working on kind of all things related to side channel attacks. That is also a term we probably have to introduce in the podcast, because it's, I guess, not common knowledge.

- Right, right. We'll get back to that in a second. Maybe you can extend a little bit on CISPA. CISPA is interesting because simplyblock itself, well, we're not part of CISPA, but CISPA supports us as well due to the encryption efforts we do with data storage. But maybe say a few more words about CISPA itself.

- Yes. So CISPA is this research center here that has grown a lot. It's still relatively new; it was founded six years ago, and now we already have more than 600 people working here, among them a lot of scientists. We now have roughly 40 group leaders here that have their own research groups, covering all topics that are information security related, but also AI and machine learning, so we are in this area as well. As scientists, we are mainly working on academic problems. We try to solve problems, we write them down, we publish papers at the top academic conferences worldwide. And with that, we are also leading worldwide: there is no other university or research center internationally that has more of these top publications at these conferences. So we really try to be the best in all the things we do. It sounds like marketing now, but this is really what we want to do: we want to be the best, and of course also train the next generation.
- So that means getting new students in. And, as you've seen yourself, we also want to support companies here; we're helping startups with a security background. And, well, we want to make sure that more people work in this area, ideally also in Germany, so that we create more jobs here around these topics, because we believe these are topics that will stay relevant in the future.

- I agree. Your work is oriented towards side channel attacks. I'm a bit of a geek, so I love to look into all of that. I'm probably not, well, I don't want to say technically skilled enough, but I'm probably not deep enough into side channel attacks to specifically explain how they work. I think you're doing a much better job. Maybe just give us an explanation, especially for the audience: what specifically is a side channel attack, and how does it work?

- Yes. Maybe let's start with an intuition from the real world. Sometimes when you live somewhere, you have a neighbor, and you don't know if this neighbor is at home or not, on vacation or not. You're also not talking to this neighbor; you might not even know the neighbor. But you can still learn something by observing side effects of the behavior of your neighbor. For example, you look outside your window and you see if the lights are on. And if the lights are on, there is a certain probability that the neighbor is at home. I mean, you're not 100% sure, right? Sometimes you forget to turn off the lights, then your guess is wrong. Sometimes you are at home and don't have the lights turned on, because maybe you're watching TV or something. But you still have a good chance: if you see lights, you can assume that this person is probably at home; if the lights are off, probably not. If the lights haven't been on for a week, the neighbor is probably on vacation, or died, hopefully not. So you learn certain things just by observing, not the neighbor directly, but what is influenced by the neighbor's behavior. In the real world, we have many such scenarios where we just see something and then try to infer what is really happening. In computer science, we try to do the same on the software level and on the hardware level. So here, in our research, we're not observing neighbors directly, but it's also not so far away if we are talking about the cloud. We also have neighbors in the cloud. We want to see what this neighbor in the cloud, this other user running on the same server, is actually doing. We can't directly talk to them, we can't see what they're doing, but we see certain side effects of it. Also intuitively: think about running an application that uses a lot of resources, that uses all your CPU, all your memory. Then other applications, maybe from a different customer, see some bottlenecks, see a slowdown in their own application. And from that, you can already infer, well, not much. This is the really simple one, a contention-based side channel it is called: someone uses resources, resources are not endless, so you cannot use them. And then you already see that something is happening.

- The lovely noisy neighbor problem that everyone knows from the cloud.

- Yes, and that is often treated as a performance problem, but it's actually a security issue, once you start trying to infer not only that there is a neighbor, but what the neighbor is doing. So if this noise that you're seeing depends on some secrets, then it gets interesting. And this is what we do.
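To make the contention idea a bit more concrete than lights in a window, here is a minimal sketch, not taken from the episode or from any CISPA tooling, of what such an observation can look like in C. The probe repeatedly performs the same fixed amount of memory work and measures how long it takes; when a co-tenant saturates the shared cache and memory bus, those timings rise. Buffer size, stride, and round count are arbitrary illustrative choices.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Contention probe: touch a buffer larger than the last-level cache
 * and time it. If a "neighbor" on the same machine is hammering the
 * shared cache and memory bus, the same work takes visibly longer.
 * We see the lights go on without ever talking to them. */

#define BUF_SIZE (64UL * 1024 * 1024)  /* assumed larger than the LLC */
#define STRIDE   64                    /* one cache line per access   */

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

int main(void) {
    uint8_t *buf = malloc(BUF_SIZE);
    if (!buf) return 1;
    memset(buf, 1, BUF_SIZE);

    for (int round = 0; round < 10; round++) {
        uint64_t start = now_ns(), sum = 0;
        for (size_t i = 0; i < BUF_SIZE; i += STRIDE)
            sum += buf[i];             /* fixed amount of memory work */
        uint64_t us = (now_ns() - start) / 1000;
        /* Rounds that take noticeably longer hint at contention from
         * whoever shares the cache and memory bus with us. */
        printf("round %2d: %6llu us (sum=%llu)\n",
               round, (unsigned long long)us, (unsigned long long)sum);
    }
    free(buf);
    return 0;
}
```

As the guest notes right after, this alone only tells you that something is happening; it becomes an attack once the observed pattern depends on a secret.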
- We try to find such noise patterns that are unique to secrets. For example, if you're doing cryptography, you have a secret key involved that consists of zero and one bits. And depending on whether the bit in the key is currently a zero or a one, your CPU has to do different things. That involves different resources and a different computation, and we can see that in certain patterns. We can see whether memory is accessed or not accessed, whether certain parts of the CPU are active or not active (because then we can't use them ourselves, for example). And we observe that and can infer, like, okay, now there's a zero bit, now there's a one bit, a zero bit, just by observing some other effects in our own application, and from that infer an entire key, for example. Breaking the crypto, even though it's mathematically secure, the implementation is correct, there are no software bugs, but the side effects are observable, and from those we can infer the key. Often not 100% correctly. But even if we have, let's say, an AES key, 128 bits, and we get 120 bits correct, guessing the remaining ones is doable.

- Right, right. You basically just brute-force the remaining bits by trying the potential combinations.

- Exactly. And then maybe we have to try a thousand different keys, but at some point we get it correct.

- Huh. Okay, that's slightly different from what I thought it was. I don't even know if side channel attack would be the correct name for it, but basically, when you try to bring CPUs and such into a hiccup by giving them the wrong signal at the wrong time, at very specific timings, and then you just jump over certain instructions or something.

- Yes, what you mean also exists; those are the hardware-based side channel attacks.

- Okay, so it is a side channel attack. Okay, I wasn't 100% sure if I was actually correct about that.

- Yes, they are also considered side channel attacks, even though nowadays we mostly call them fault attacks, because you can think of it like this: you induce a fault in the CPU, it skips an instruction, it does a wrong calculation, stuff like that.

- Right, fault injection, that was the term I was looking for, right, right. Cool. Yeah, I love the example with the neighbor, because it's just incredibly visual for everyone to understand. But your team, you worked on something very specific, which was CacheWarp. And I think it was released, like, two years ago, a year and a half ago, something like that.

- I think it was last year, November.

- Oh, maybe last year, November. Maybe just say a few words about that. I think I wrote about it on my side. I was actually not aware you guys were behind that, so it was really interesting when I got the chance to talk to you. (laughs)

- Yes, this is a really nice attack, and it shows something very interesting. We have had CPUs for a long time, and we keep adding features on top, on top, on top. And sometimes we forget what we already added back then, let's say in the 80s; there is still some legacy stuff nobody dares to touch. We're adding new features, forgetting about the legacy features, and also forgetting to think about how they could interact. And CacheWarp is a really nice example of that. It targets the newest AMD CPUs with the trusted execution environment used in the cloud, SEV. And this SEV has the security guarantee saying, like: everything you run in there is secure.
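A quick illustration of the "zero bit versus one bit" point from a moment ago, before the SEV part of the story continues. The sketch below is a textbook square-and-multiply modular exponentiation, not code from the guest's research; the branch on each exponent (key) bit is exactly the kind of secret-dependent behavior whose side effects, timing and cache footprint, a co-located attacker can observe.

```c
#include <stdint.h>

/* Textbook square-and-multiply: computes base^exp mod m, where exp is
 * the secret key. For a 1 bit the CPU does an extra multiply (and the
 * corresponding cache/memory activity); for a 0 bit it does not. A
 * side-channel attacker never reads the key, only these differences,
 * and reconstructs the bits from them. Overflow for large moduli is
 * ignored to keep the sketch short; real implementations are written
 * to be constant-time precisely to avoid this kind of leak. */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1;
    base %= m;
    while (exp > 0) {
        if (exp & 1)                      /* key bit is 1: extra multiply */
            result = (result * base) % m;
        base = (base * base) % m;         /* always: square */
        exp >>= 1;                        /* move to the next key bit */
    }
    return result;
}
```

It also shows why recovering 120 of 128 bits is enough: the 8 unknown bits leave only 2^8 = 256 candidates, roughly the "maybe a thousand keys" ballpark from the conversation.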
The guarantee holds even if your cloud provider is malicious, or does not have to be malicious but could be hacked: even with the permissions of the cloud provider, you have no way of seeing what is running inside the virtual machine.

- The way it works is that it encrypts every virtual machine with a specific hardware... well, with an AES key, right?

- Yes, exactly, yeah. So it encrypts, it also attests that it's running on actual real hardware and is not emulated in some way, and it gives you, in theory, pretty good guarantees that everything you run there cannot be modified and cannot be seen. And this looked fine. But then we started looking at this interaction. Like, okay, we built this super cool stuff, we have that in the CPUs now, that's pretty new. And on the other side, we have instructions in the CPUs that were introduced at the beginning of x86 and nobody uses them anymore; we don't have use cases for them anymore in modern operating systems. And then you read the manuals about these instructions to see: what do they do, do they still exist? That's the nice thing about x86, being backwards compatible all the way back. And you're wondering, like, what happens if you use these old instructions that are useless nowadays? And the manual is like: don't use them. I'm like, that's interesting. So it doesn't say anything that would prevent us from using them, it just says: if you use modern features, like multi-core, do not use them, because they don't expect it. Let's see. I mean, if somebody tells me not to do something, my first instinct is: now I'm doing it even harder. So let's see what happens. And that's exactly what we tried, and my student was the main driving force behind that. I told him, like, look, this could be very interesting, see what happens. And at the first tries: okay, we just run this instruction, and everything crashes. It's like, oh, that is not a real problem yet, but definitely interesting.

- I don't know if cloud providers would agree with that.

- Something like that. Well, I mean, if you're the cloud provider, making sure something is not working anymore, that's easy. You could also take a hammer.

- Okay, fair.

- But it's a starting point, and that's pretty interesting. And then we started investigating that, making theories about what could happen. To get technically detailed: what does this instruction do? You have the DRAM, where you store all your memory, and then you have the cache inside the CPU. It stores recently used copies of the data you have in there, to make things fast. And if you modify data in your applications, the modifications first happen only in the cache, and later, when there is time, they are written back to the real, big main memory. And what does this instruction do? It clears the cache. It had a use case for some reason, when booting a server for example, so that you reset everything to a known state. But what if we are running virtual machines now? They modify data, that data is only in the cache at first, and if we then get rid of the cache, is the modification also lost? And the short answer: yes. So for any virtual machine, if this virtual machine modifies some data, somewhere, updating something, we can run this instruction and then this modification is reverted. We go back to an old state of the data.

- And you can run that from any thread, even from a different core? How does that work?
- You can do that for basically everything, yes.

- Wow.

- So this instruction was designed at a time when there was only one core.

- Right, yeah.

- And it also says: if you have more than one core, it's undefined what happens with this.

- Undefined behavior.

- Yes, it's like: don't do it if you have more than one core.

- Okay.

- But even if it's limited to one core, as a cloud provider you can easily schedule the VM you want to attack alone on one core. And this is what CacheWarp essentially can do, this reverting of modifications. So the victim modifies some data, and as an attacker, you go back to the old state. And this is something that does not sound powerful at first, because, well, the data was also there before, no harm done. The interesting thing is that you do it selectively. You're not reverting everything, you're not, like, restoring a snapshot of your machine, but only parts of the memory that you can directly target. And then you get really nice effects, based also on how we write programs. So if you're thinking about programs, for example the sudo binary, which elevates your privileges to root for an operation: how does it do that? Well, it asks the operating system, am I already root? If so, it continues as root. Otherwise it asks for a password, for example. How is that implemented? The root user has the ID zero. When we write programs, stuff is typically initialized to zero first. So the permission is zero, because that's how we start with variables in memory. Then the sudo program asks the operating system, like, who am I? The operating system gives back the number, zero for root, another number for any other user, and updates this value in memory. We use CacheWarp, revert it to zero, and then we are root, because it was initialized like that. And this is not only a specific case of sudo; this is, in many cases, how we write programs. Also for error checking: we start as if we had no error, then we check certain things, and if there is an error, we update that, like, we had an error. But we can revert that, and even if there was an error, there is no trace of it anymore, and we continue.

- Right, right. The second you said it asks the operating system "what user am I?", I was like, oh yeah, okay, ID zero, I see. That is good, because, to be honest, I saw the CacheWarp exploit, but I did not really see how it was used, and now it makes total sense. Because I was also like, how does it... My understanding was that you can actually reset it to a different previous state, and I never understood how that worked. But you basically just clear it out, and it's all zero. And if you want something to be zero, that's the way to go.

- Yes, yes.

- Right, right. Now you got me. (laughs) Wow, that is just brilliant. I would never have thought of using it for that.

- Thanks. Of course, a lot of the cleverness was still needed for finding targets: like, okay, I can reset that back to this value, but where exactly do I do that in a program so that it gives me exactly what I want? But we showed quite a few cases where that works.

- You only have to time it correctly, right? So you have to figure out when the correct point in time is, when sudo would actually ask, like, "Hey, give me this operation, give me this ID."

- Yes, yes.

- So that means you basically create, well, not a remote code execution per se, but you could probably make that happen as well. But you gain, well, privilege escalation, basically.
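A minimal sketch of the zero-initialization pattern just described: an illustration of the programming idiom, not the actual sudo source, with made-up function names. The privilege variable starts out as 0 because that is how variables begin, the operating system's answer is written over it, and that write first lives only as a dirty line in the CPU cache. If that dirty line is thrown away before it reaches memory, which is what CacheWarp does with the legacy INVD instruction from the hypervisor side, the variable falls back to 0 and the check behaves as if we were root.

```c
#include <stdio.h>
#include <unistd.h>

/* Illustration of the zero-initialization pattern discussed above:
 * uid 0 means root, and variables typically start out zeroed. */
static unsigned int current_uid;          /* starts as 0, i.e. "root"  */

static int check_is_root(void) {          /* hypothetical helper name  */
    current_uid = getuid();               /* OS answers "who am I?"    */
    /* At this point the new value may exist only as a dirty line in
     * the CPU cache. CacheWarp's trick: the attacker (a malicious
     * hypervisor) drops that dirty line before it is written back,
     * so current_uid reverts to its old value, 0.                     */
    return current_uid == 0;              /* reverted value passes     */
}

int main(void) {
    if (check_is_root())
        printf("continuing with root privileges\n");
    else
        printf("asking for a password...\n");
    return 0;
}
```

The error-handling variant mentioned in the same breath works the same way: an error flag starts at 0 for "no error", is set when something goes wrong, and a reverted cache line erases any trace that the error ever happened.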
You gain root permission. So the next step would be to find something else, to inject something into memory and say, please execute that. And I guess with CacheWarp you could probably do the same thing, by just making sure you're redirecting to the right memory location.

- Yes, so this is one thing: you can work on a very low level. The CPU also remembers certain things, like, when you go to some different part of the code, how to return; you can reset that, and then you return to some other place, which makes it quite nice. But we also showed this full-chain, end-to-end exploit, in two steps. You have some way to log into a server, SSH typically. There's a password check, and with CacheWarp we trick that password check into believing that it does not matter what we enter, any password is correct. Then we are logged in as a normal user. Then you use it again on sudo, and then we're logged into the server, into any virtual machine, with root privileges. Then we can just execute whatever we want there as root. So, full control of the VM. This entire thing takes just a few seconds.

- Wow.

- Yeah, it's over 90% reliability.

- And you basically start by getting a virtual machine on some shared resource, and that's just how you get into it. It's hard to be precise, I guess, to target a specific company, but you never know who's alongside you.

- Exactly, exactly.

- Right, right.

- I mean, we also don't want to do that. This is where the academic part also ends; we kind of already overstepped it a bit by showing, step by step, how to get from this logic problem in the CPU to taking over an entire VM. We are not going into details like how to attack a specific company. Even though a lot of my students ask these things, like, now I want to break into Microsoft, what do I do? But this is not what we should do, and also not what we do.

- That makes sense. It was more of a rhetorical question, obviously. Wow, that is quite something. Especially since it comes about two years, or two and a half years, after Spectre and Meltdown, which were the other big things. And as you said, CPUs become more and more complex. At that point in time, it was the predictor making guesses where to jump next. And it was the same thing: you do those things to speed up CPUs and to make them better, but now you have all those features, and people eventually figure out how to abuse them. I think people are probably more familiar with Meltdown and Spectre because it was, if you want to say it like that, a big, let's say, f-up. (laughs) Because it also involved AMD, Intel, and even ARM CPUs, as far as I remember, right? It was a little bit late to the game, but people figured out how to do it there as well. And the Meltdown thing was, like, three different CVEs, or four even.

- In the beginning it was one, and two for Spectre.

- Oh, two for... yeah, it was a couple of iterations. From your perspective, which one is worse?

- That's a very difficult question, but also a really interesting one to think about. So Meltdown and Spectre: we published that at the beginning of 2018, and we discovered it in 2017. Back then, we did not fully understand the consequences. Now, in hindsight, I would say it really depends. Meltdown is something that has, or had, a huge impact, but luckily, the way it came out, nobody really saw that impact, because it was disclosed in June 2017 and made public in January 2018, so all the vendors had time to work on fixes and workarounds.
And when it was made public, we already had systems protected against exploitation. So the issues were not fixed, but at least exploitation was made difficult to impossible, depending on the system. So we did not see the impact. If that hadn't happened, it would have had a huge impact, because the code for mounting that attack is so extremely easy. Back then, our group was part of the discovery, we printed T-shirts, and we could fit the entire exploit code on a T-shirt.

- I remember that T-shirt.

- But that was at least easy to mitigate. Spectre, on the other side, is really difficult, and we still have it, we still have to deal with it. We still don't have real solutions, but it's also way harder to exploit for an attacker. That's similar to things we have in software security. Take easier things like buffer overflows: we have known them since the 80s, we know how to exploit them, it gets harder and harder to exploit them, we still have the bugs, we can't fundamentally fix them (we could, but we don't), and we live with that. It's similar with Spectre: we don't know how to fix it, but as it's so difficult to exploit, we kind of live with it. Meltdown was easy to exploit, so we had to do something immediately, and luckily we also found ways of fixing it. CacheWarp goes in a similar direction as Meltdown. It's very easy to exploit, it's extremely powerful, but luckily AMD was also able to fix it very quickly, let's say within half a year. It's not the greatest fix, because it works by removing functionality; nowadays we are used to our CPUs losing functions with security updates. But at least we have a workaround, and it's not exploitable anymore. I hope cloud providers also deploy it, we cannot check, but at least there is something with which we can ensure nobody can exploit it anymore.

- Right, so I guess it's a microcode update, and you basically have to load that through the Linux kernel, or whatever. Yeah, that makes sense. So you think Spectre is the worse one because it's hard to fix?

- Because it is a design issue. It's inherent to the design, whereas the others are implementation issues, where somebody made a mistake implementing something.

- Right, okay, that makes sense. Yeah, we're almost... well, we are out of time. But one last question, the one that I always ask: what do you think is the next big thing? Are you working on something, a CacheWarp 2?

- Always, and even worse. Of course, I can't give any details on that. But yes, we have just started to scratch the surface of all these problems. And this is also not surprising if you think about it. We have had software bugs for years, for decades. Nobody is surprised if there's a bug in software; we need a patch. Everybody is suddenly surprised that we have that in CPUs, but hardware is nowadays also just written in software. We have hardware description languages; these are just programming languages for hardware. We compile our CPUs, so to speak: a program takes care of taking our description and turning it into hardware, simplified. Of course, it's also written by humans, like software, and humans make mistakes. That's something we will always see. So it's not a surprise that we see a lot of problems in CPUs as well, and we'll find more and more. Over the years, so far, we were relatively lucky that we could always add some quick-fix mitigation that maybe disabled some functionality, but prevented exploitation. I hope it stays that way, but I fear not. And at some point, we will see these big problems that we cannot fix.
And this is really bad, because we cannot simply do an update like with software. And then we have to think about what to do.

- I guess the same goes for the ever more complex graphics cards, especially when you go into the AI accelerators and things like that. Those things grow and get more complex day by day.

- Yes, yes. We are currently mostly adding complexity and not trying to simplify things. Complexity introduces problems.

- That is very true, and every software developer, I guess, knows that. I was kind of shocked by how far off I was with the Meltdown and Spectre timing. Seems like COVID completely messed with my feeling for time. It also feels like yesterday to me, I can't imagine it. Yeah, thank you very much. It was a pleasure having you. I hope I get the chance to have you back sometime in the future, after the next big exploit you guys are working on, because this is just incredibly interesting. And I think for our audience, which is often cloud users, it's also very relevant.

- Thanks for having me. It was a pleasure talking to you about this, and yes, I'd be happy to be back.

- All right. Thank you very much to the audience as well. Next week, same time, same place. I hope you're listening in again, and thank you very much for being here as well.

The Cloud Commute podcast is sponsored by simplyblock, your own elastic block storage engine for the cloud. Get higher IOPS and low, predictable latency while bringing down your total cost of ownership. www.simplyblock.io

[MUSIC]