Okay. So, we're through three of our seven optimizations for today. Let's look at the next one. This comes back to the fundamental problem that you can't build something that is both big and fast; the two trade off against each other. You can build a really big cache, you can build big memory — go into some data center and people actually build big boxes of RAM that would fill this entire room — but it's slow, it takes a long time to access. There's a bunch of multiplexers that need to choose between all the different racks of RAM, or if you have a 32 megabyte cache, that's a lot of multiplexing. So you can't have something that is both big and fast.

Our solution to this is multi-level caches: you put progressively larger caches as you go farther out. You have a small cache which you access very frequently and which has a very fast cycle time. You put an L2 behind it which is a little farther away but can hold more data. And then you have main memory, or maybe an L3 out there, which holds more data but takes longer to access. That's really the insight here: you can have larger cache sizes as you get farther away, and it helps to have larger caches because you can hold more capacity. Performance degrades gracefully instead of falling off a cliff where either you're in the cache or you're not. The other thing is that because the next cache is farther away and takes more cycles to reach anyway, you can afford for it to be bigger and take multiple cycles to access. If it takes two cycles to access instead of one, but it already took you a few cycles just to get out to that cache, you have the extra time to do that. So these are multilevel caches — you'll see that for lots of processors people quote the L1 cache size, the L2 cache size, and the L3 cache size.

One thing I wanted to say about this is that there are some rules of thumb that apply here. They're somewhat arbitrary and empirically found, but typically, in a processor design, you want to make your next level of cache about eight times bigger than the previous level for it to have a useful effect. This is just an empirical rule of thumb that people typically apply, and if you go look at almost all the processors in the world, it almost always holds true. The reasoning behind it is that it takes extra time to go access the next level of cache, and if you only go up by a factor of two, it's not really worth it. If you run the numbers from the beginning of lecture — miss penalty multiplied by miss rate — when the miss penalty goes up a bunch, the miss rate needs to go down a lot for the extra level to make any sense. If you only go up by a factor of two in cache size — say, an eight kilobyte cache backed by a sixteen kilobyte cache — it just doesn't help enough. Somebody asked whether the time it takes to get to the next level of cache is rolled into that rule of thumb. It is: if it only takes, say, one more cycle to get to the next level of cache, the rule of thumb might break down.
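To make that concrete, here is a small back-of-the-envelope sketch of the miss-penalty-times-miss-rate argument. All the latencies and miss rates in it are made-up, illustrative numbers (they are not from any slide or real processor), but they show why a next level that is only 2x bigger barely moves the average memory access time while a much bigger one does:

```python
# Sketch of the average-memory-access-time (AMAT) argument behind the
# "make the next level ~8x bigger" rule of thumb. All latencies and miss
# rates below are made-up, illustrative numbers, not measurements.

def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

MEM_LATENCY = 100                  # cycles to main memory (assumed)
L1 = dict(hit=1, miss_rate=0.10)   # small L1, hypothetical 10% miss rate

# Without an L2, every L1 miss pays the full memory latency.
no_l2 = amat(L1["hit"], L1["miss_rate"], MEM_LATENCY)

# With an L2 only 2x the L1, the local miss rate barely improves;
# with an 8x L2, far more of the L1's misses hit in the L2.
for name, l2_hit, l2_local_miss in [("2x L2", 5, 0.80), ("8x L2", 5, 0.40)]:
    l2_penalty = amat(l2_hit, l2_local_miss, MEM_LATENCY)
    with_l2 = amat(L1["hit"], L1["miss_rate"], l2_penalty)
    print(f"{name}: AMAT {with_l2:.1f} cycles vs {no_l2:.1f} without an L2")
```

With these assumed numbers the 2x L2 only shaves the average from about 11 cycles to about 9.5, while the 8x L2 cuts it roughly in half — which is the intuition behind the rule of thumb.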
It might actually make sense then to have a cache that is only two times larger than the previous level. But in reality, people typically don't build a next level unless it's about eight times larger. And you can sit down and plot the miss rates — look in the Hennessy and Patterson book — and you'll see that this rule of thumb works out pretty well.

Okay. So, now that we think we want to build multilevel caches, how does this affect the actual cache design? Believe it or not, having an L2 cache affects the L1. That's an interesting insight. A couple of things can happen. First of all, it might influence your design to have a smaller level-one cache. Because the L2 backs it up, the aggregate performance of the L1 plus the L2 can still be good, and the smaller L1 lets you reduce the cycle time of your processor — that can be a good trade-off. This is something like what they did in the Pentium 4, where cycle time was very important to them: they put in a much smaller L1 cache and had a backing L2 behind it. Another advantage is that you can reduce the energy used to access the cache, because most of your accesses hit in the L1; if you make the L1 smaller, you're effectively firing up fewer transistors.

Another way that having an L2 can influence the L1 is that the L1 doesn't have to be write-back; you can have it be write-through. This is a common thing. For instance, the Tilera processor, which I worked on, does this: we write through from the L1 to the L2. It makes the L1 a lot simpler. It means you never have dirty data in your L1, and that's a big benefit. Not having dirty data in the L1 means you potentially don't need an error correcting code on your L1 data, because none of the data in there is the only copy — it's all going out to the L2 anyway. And since it's all in the L2 anyway, you could just invalidate your entire L1 and the program would still be correct. So to really simplify error recovery, you can replace ECC with just parity. That's a big win of having write-through from the L1 to the L2, and it's actually the reason people typically do this. We haven't talked about this yet — we'll get to it at the end of the term — but it also makes multiple processors easier to build. What do I mean by that? Well, in a cache-coherent system you have multiple processors, each with caches, all trying to use one big shared memory, so when you're looking for data you have to go check everybody else's caches. But if you write through from the L1 to the L2, there's no dirty data in the L1, so you never have to check the L1 — you only ever have to check the L2. From a coherence perspective that's a big win because it makes things a lot simpler. This also sort of goes back to the write buffer we were talking about.
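As a rough illustration of the write-through point — this is a toy sketch, not the Tilera design or any real cache controller — the key property is that a write-through L1 forwards every store to the L2, so nothing in the L1 is ever the only copy and the whole L1 can be thrown away on an error:

```python
# Minimal sketch (illustrative only) of why a write-through L1 keeps no dirty
# data: every store is forwarded to the L2 immediately, so the L1 can be
# invalidated wholesale (e.g., on a parity error) without losing anything.

class WriteBackL2:
    def __init__(self):
        self.lines = {}
        self.dirty = set()           # dirty data lives here, protected by ECC

    def store(self, addr, data):
        self.lines[addr] = data
        self.dirty.add(addr)

class WriteThroughL1:
    def __init__(self, l2):
        self.lines = {}              # address -> data, never the only copy
        self.l2 = l2

    def store(self, addr, data):
        self.lines[addr] = data      # update the L1 copy...
        self.l2.store(addr, data)    # ...and always push the write to the L2

    def recover_from_error(self):
        # Safe because nothing in the L1 is dirty: just throw it all away.
        self.lines.clear()

l2 = WriteBackL2()
l1 = WriteThroughL1(l2)
l1.store(0x100, "data")
l1.recover_from_error()              # L1 wiped; the store survives in the L2
assert l2.lines[0x100] == "data"
```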
If the L1 writes through to the L2 and you do something that invalidates or creates a victim in your L1, you never have to take that victim and write it out to the L2, because all the data in your L1 is clean at all times. Anyway, what I'm trying to get at here is that you would think just adding a multilevel cache doesn't affect the caches closer to the core, but it does, because you typically design your whole cache system together.

Another thing it affects is the inclusion policy. If we look at a sort of Venn diagram of memory, we have data stored in our L2 cache and a smaller L1. In an inclusive cache, the data in the L1 is also stored in the L2. In an exclusive cache, the data in the L2 and the data in the L1 are not the same — the two hold disjoint data. If you look at the inclusive case, you're going to say, I'm wasting space, because any piece of data in the L1 also has to be stored in the L2, while in the exclusive case we only need to store the data in one location. So you would think everyone would want to build exclusive caches. In reality, most people build inclusive caches and don't build exclusive caches, and the reasoning is that it gets hard to build multilevel exclusive caches.

Let's look at this in a little more detail. The exclusive cache stores more data, so that's a positive. The negative is that you need to check both locations when you have to do, let's say, a remote invalidate. Given our cache coherence protocol — our multiple-processor cache coherence protocol — we need to go check where data is. With inclusion, you can check the L2 tags and know whether the data could also be in the L1; with exclusion, you have to check the tags of both caches. Exclusive caches have actually been built — as far as I know, AMD has implemented them. Some of the complexity involved is that you really have to swap lines between the two caches, because a line evicted from one can't already be in the other. When you evict a victim from the L1, it has to go into the L2, and when data comes in, that's a transfer from the L2 to the L1. This gets hard to make correct under all circumstances.

One last little thing about why exclusive caches are hard to do: if the cache line sizes are different, you basically can't build an exclusive cache, because you're swapping lines between the two levels and the sizes don't match up. Lots of times people like to have a smaller block size in the L1 cache, because you get more blocks to work with than in the L2. But if those block sizes are different — say the L2 has a 64-byte block size and the L1 has a 32-byte block size — you can't just swap them: it takes two of the smaller cache lines to put together one of the bigger cache lines, so they're not compatible. An exclusive cache forces you to have the same block size between the two levels. But the benefit, of course, is that you get more effective capacity.
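Here is a tiny toy model of that inclusive-versus-exclusive trade-off. The 32 KB / 256 KB sizes are just example numbers, not from the lecture; the point is the capacity and tag-check difference described above:

```python
# Toy model (illustrative only) of the inclusive-vs-exclusive trade-off:
# effective capacity and how many tag arrays a remote coherence probe
# has to check. Sizes are arbitrary example values.

L1_KB, L2_KB = 32, 256

# Inclusive: everything in the L1 is duplicated in the L2, so the L2's
# capacity bounds the total, and a probe that misses in the L2 tags is
# guaranteed not to hit in the L1 either.
inclusive_capacity = L2_KB
inclusive_tag_checks = 1        # the L2 tags are enough (inclusion property)

# Exclusive: the two caches hold disjoint lines, so capacities add, but a
# probe must check both tag arrays, and evictions swap lines between levels.
exclusive_capacity = L1_KB + L2_KB
exclusive_tag_checks = 2

print(f"inclusive: {inclusive_capacity} KB usable, "
      f"{inclusive_tag_checks} tag array(s) probed")
print(f"exclusive: {exclusive_capacity} KB usable, "
      f"{exclusive_tag_checks} tag array(s) probed")
```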
This is also another reason the rule-of-eight multiplier fits well with inclusive caches: if the next level is many times bigger, the duplicated L1 data is a small fraction of the L2's capacity — basically noise — whereas if the L2 were only slightly bigger, you'd be duplicating a significant part of it.

Okay. Multilevel caches are ubiquitous, so let's look at some real examples. Here we have the Itanium 2 — this is the VLIW architecture from Intel and HP that we talked about. If you look at it, they have a sixteen kilobyte L1, and then an L2 that changes both of these things: it changes the size — it's bigger than an 8x multiplier — and it also changes the line size. Then they even have a third level: once you have two, you might as well go to three. They had a third-level cache because they had the space — this is an expensive processor, they had a lot of area, so they put a level-three cache on it. In the third-level cache they increased the associativity quite a bit, to 12-way set associative, and they made the cache a lot bigger, three megabytes, with the same line size as the L2. Let's look at the latency: single-cycle latency in the L1, which is usually what you want — very fast L1 — five cycles for the L2, twelve cycles for the L3. It steps up as you get farther away.

If we look at the Power7 — this was the penultimate Power architecture from IBM; the Power8 just came out — you can see they have lots of cache in the middle, and this is a multi-core, with eight cores here. Each of those cores has a private L2. They have somewhat bigger L1s: 32 kilobyte I and 32 kilobyte D L1s. The latency is a little higher: INAUDIBLE for the L1, three cycles INAUDIBLE for the L2, then like the Intel processor we've seen before, and INAUDIBLE for the L3. But the L3 is huge, 32 megabytes — that's a lot of data out there — and it's shared between the processors. So this is private L1s, private L2s, and a shared L3. An interesting design. Multilevel caches show up in basically all modern processors; now we have multilevel caches.

Okay. So, let's take a look at the efficacy of the multilevel cache. If you look at just the L1 by itself, its miss penalty goes down: before, if you missed in the L1 you had to go out to main memory, and that was expensive, but now you can go to your L2 or L3. If you instead draw a box around the L1, L2, and L3 together, you both reduce your miss penalty and you have more cache space, which reduces your miss rate. So you reduce the miss penalty and the miss rate together when you draw the box around the whole multilevel cache structure.
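As a sketch of "drawing the box" around the whole hierarchy, here is the same AMAT calculation applied level by level, using the 1 / 5 / 12 cycle latencies quoted for the Itanium 2 above; the miss rates and the 200-cycle memory latency are invented purely for illustration:

```python
# Rough sketch of "drawing the box" around the whole hierarchy, using the
# Itanium-2-style latencies quoted above (1 / 5 / 12 cycles). The miss rates
# and the 200-cycle memory latency are invented for illustration.

MEM = 200                                  # cycles to DRAM (assumed)
levels = [("L1", 1, 0.05),                 # (name, hit latency, local miss rate)
          ("L2", 5, 0.30),
          ("L3", 12, 0.20)]

def amat(levels, mem_latency):
    # Work inward from memory: each level's miss penalty is the AMAT of
    # everything farther out.
    penalty = mem_latency
    for name, hit, miss in reversed(levels):
        penalty = hit + miss * penalty
    return penalty

print(f"L1 only      : {amat(levels[:1], MEM):.1f} cycles")
print(f"L1 + L2      : {amat(levels[:2], MEM):.1f} cycles")
print(f"L1 + L2 + L3 : {amat(levels, MEM):.1f} cycles")
```

With these assumed numbers, each added level cuts the penalty seen by the L1, and the whole boxed hierarchy gets the average access time down to a few cycles — both the miss penalty and the effective miss rate of the box go down.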