Okay. So, we're through three of our seven optimizations for today. Let's look at the next one.

This comes back to the fundamental problem that you can't build something that is both big and fast; the two trade off against each other. You can build a really big cache, you can build a big memory. You can have boxes of memory that fill this entire room; if you go into some data center, people actually build big boxes of RAM that would fill this entire room, but it's slow, it takes a long time to access. There's a bunch of multiplexers that have to choose between all the different racks of RAM, or, if you have a 32-megabyte cache, that's a lot of multiplexing. So, you can't have something that is both big and fast.

Our solution to this, multilevel caches, is to put progressively larger caches as you go farther out. You have a small cache which you access very frequently and which has a very fast cycle time. You put an L2 behind it, which is a little farther away but can hold more data. And then you have, let's say, main memory, or maybe an L3 out here, which holds still more data but takes longer to access.

So, that's really the insight here: you can have larger cache sizes as you get farther away, and it helps to have larger caches because you'll be able to hold more data. Performance then decays gracefully instead of falling off a cliff, where either you're in the cache or you're not. So multiple levels of caches can fix this problem.

The other thing you can do is, because the farther cache already takes more cycles to reach, you can afford for it to be bigger and take multiple cycles to access. If it takes two cycles to access instead of one, well, it already took you, let's say, a few cycles to get out to that cache, so you have the extra time to go do that.

So, these are our multilevel caches. On lots of processors, people will quote the L1 cache size, the L2 cache size, and the L3 cache size.
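To see how the levels compose quantitatively, the usual tool is average memory access time (AMAT), where each level's miss penalty is just the AMAT of the level below it. A minimal statement of the recursion; the notation here is mine, not from the slides, with t for hit time and m for local miss rate:

```latex
\mathrm{AMAT} = t_{L1} + m_{L1}\left( t_{L2} + m_{L2}\left( t_{L3} + m_{L3}\, t_{\mathrm{mem}} \right) \right)
```

The parenthesized term is the miss penalty the L1 sees; each added level shrinks that penalty without touching the L1's own hit time.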
One thing I wanted to say about this is that there are some rules of thumb that apply here, and they're somewhat arbitrary and empirically found. Typically, in a processor design, you're going to want to make your next level of cache about eight times bigger than the previous level of cache for it to have a useful effect. This is just an empirical rule of thumb that people typically apply, and if you go look at almost all the processors in the world, it almost always holds true.

The reasoning behind this is that it takes extra time to go access the next level of cache, and if you're only going to go up by a factor of two, it's not really worth it. If you run the numbers from the beginning, the miss time multiplied by the miss rate: if the miss time goes up a bunch, the miss rate needs to go down a lot for the extra level to actually make any sense. And if you only go up by a factor of two in cache size, let's say an eight-kilobyte cache backed by a sixteen-kilobyte cache, it just doesn't actually help much; the sketch below works through this arithmetic.

Somebody pointed out that rolled into that rule of thumb is how much time it takes to go to the next level of cache. If it only takes, let's say, one more cycle to go to the next level, the rule of thumb might break down; it might actually make sense to have a cache that's only two times larger than the previous level. But in reality, people typically don't build a next level unless it's around eight times bigger. And you can sit down and plot the miss rates in the Hennessy and Patterson book, and you'll see that this rule of thumb works out pretty well.
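Here is a back-of-the-envelope version of that argument as a sketch. Every number below is an illustrative assumption of mine, including the classic "miss rate scales roughly as one over the square root of capacity" rule of thumb; it is meant to show the shape of the tradeoff, not to reproduce data from the lecture.

```python
import math

def amat(t_l1, m_l1, t_l2, m_l2, t_mem):
    """Average memory access time for a two-level hierarchy:
    an L2 miss goes all the way out to main memory."""
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_mem)

def miss_rate(size_kb, base_rate=0.10, base_kb=8):
    # Assumed model: miss rate falls off roughly as 1/sqrt(capacity).
    return base_rate * math.sqrt(base_kb / size_kb)

T_L1, T_L2, T_MEM = 1, 5, 100        # access times in cycles (assumed)
M_L1 = miss_rate(8)                  # 8 KB L1

for l2_kb in (16, 64):               # 2x vs 8x the L1 size
    # Local L2 miss rate: of the accesses that reach the L2, how many
    # miss? Approximated here as global(L2 size) / global(L1 size).
    m_l2_local = miss_rate(l2_kb) / M_L1
    print(f"{l2_kb:>3} KB L2: AMAT = "
          f"{amat(T_L1, M_L1, T_L2, m_l2_local, T_MEM):.2f} cycles")

print(f"no L2 at all: AMAT = {T_L1 + M_L1 * T_MEM:.2f} cycles")
```

With these made-up numbers, the 2x L2 filters out so few additional misses that most L1 misses still end up paying the memory latency, while the 8x L2 cuts the AMAT much more sharply. That asymmetry is exactly the shape of the rule of thumb.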
Okay. So, now that we think we want to build multilevel caches, how does this affect the actual cache design? Believe it or not, having an L2 cache affects the L1. Hm, that's an interesting insight. So, a couple of things can happen.

First of all, it might actually influence your design to have a smaller level-one cache. Since the L2 provides aggregate capacity, shrinking the L1 and taking the performance of the L1 and the L2 together can be a good trade, because a smaller L1 can reduce the cycle time of your processor. This is something like what they did in the Pentium 4, where the cycle time was very important to them: they put in a much smaller L1 cache and had a backing L2 behind it.

Another advantage of this is that you can actually reduce the energy used to access the cache. If you have an L1 and an L2 and you make your L1 smaller, then, because most of the work you're doing is in the L1, you're effectively going to be firing up fewer transistors on each access.

Another way that having an L2 can influence the L1 is that you can potentially have the L1 not be write-back; you can have it be write-through instead. And this is a common thing. For instance, on the Tilera processor, which I worked on, we do this: we have a write-through from the L1 to the L2. It makes the L1 a lot simpler, because you never have to have dirty data in your L1, and this is a big benefit. Not having dirty data in the L1 means you could potentially not need an error-correcting code on your L1 data, because none of the data in there is unique; it has all gone out to the L2 anyway. And if it's all in the L2 anyway, you could just invalidate your entire L1 and the program is still correct. So, to really simplify error recovery, you can replace ECC with just parity. That's a big win of having write-through from the L1 to the L2, and it's actually the reason people typically do this.

There is also something we haven't talked about yet, and will actually talk about at the end of the term: write-through makes multiprocessors easier to build. Now, what do I mean by that? Well, suppose you have a cache-coherent system: multiple processors, each with its own caches, all trying to use one big shared memory. You have to go check everybody else's caches when you're looking for data. But if you have a write-through from the L1 to the L2, that's a benefit, because it means there is no dirty data in your L1, so you never have to check the L1; you only ever have to check the L2. And this is a big win from a coherence perspective, because it makes things a lot simpler.
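Here is a minimal sketch of that write-through idea. It is purely illustrative: the class and field names are mine, a dict stands in for the L2, and a real L1 would sit behind a write buffer rather than forwarding each store instantly. The point is that no L1 line is ever dirty, so the whole L1 can be thrown away, say on a parity error, without losing architectural state.

```python
class WriteThroughL1:
    """Toy write-through L1 backed by a dict standing in for the L2."""

    def __init__(self, l2):
        self.l2 = l2          # backing store: address -> value
        self.lines = {}       # L1 contents; never holds dirty data

    def store(self, addr, value):
        self.lines[addr] = value   # update the L1 (write-allocate, for simplicity)
        self.l2[addr] = value      # ...and write through to the L2 immediately

    def load(self, addr):
        if addr not in self.lines:        # L1 miss: refill from the L2
            self.lines[addr] = self.l2[addr]
        return self.lines[addr]

    def invalidate_all(self):
        # Safe at any time: the L2 already holds every value, so a parity
        # error can be "repaired" by just flushing the L1. No ECC needed.
        self.lines.clear()

l2 = {}
l1 = WriteThroughL1(l2)
l1.store(0x40, 7)
l1.invalidate_all()          # e.g. a parity error was detected
assert l1.load(0x40) == 7    # the value survives in the L2
```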
This also goes back to the write buffer we were talking about. If you have the L1 write through to the L2, and you go do something that invalidates or creates a victim in your L1, you never have to take that victim and write it out to the L2, because all the data in your L1 is clean at all times. Anyway, what I'm trying to get at here is that you would think that just adding a multilevel cache doesn't affect the caches closer to the core, but it does, because you typically design your whole cache system together.

Another thing this affects is the inclusion policy. If we look at a Venn diagram of memory, with the data stored in our L2 cache and a smaller L1, then in an inclusive cache the L1's data is stored in both levels. In an exclusive cache, we have data in our L2 and data in our L1, and they are not the same lines: each line lives in only one place.

If you look at the inclusive cache, you're going to say, I'm wasting space, because any piece of data that's in the L1 has to be stored in both the L1 and the L2, whereas in the exclusive cache we only need to store the data in one location. So you would think everyone would have wanted to build exclusive caches. In reality, most people build inclusive caches and don't build exclusive caches. And the reasoning behind this is that it gets hard to build a multilevel exclusive cache. So let's look at this in a little more detail.

The exclusive cache stores more data, so that's a positive. The negative is that you need to check both levels when you have to do, let's say, a remote invalidate. In our cache coherence protocol, our multiprocessor cache coherence protocol, we need to go check where the data is. With inclusion, we can check just the L2 tags and know that if the line isn't in the L2, it isn't in the L1 either. With exclusion, we now have to check the tags of both caches.
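The tradeoff boils down to two lines of arithmetic. Here is a sketch with assumed sizes; the point is that exclusion buys capacity, while inclusion lets the L2 tags act as a filter for coherence traffic.

```python
L1_KB, L2_KB = 32, 256   # assumed sizes, for illustration only

# Unique data the hierarchy can hold:
inclusive_kb = L2_KB            # every L1 line is duplicated in the L2
exclusive_kb = L1_KB + L2_KB    # the two levels hold disjoint lines

# Tag arrays an external coherence probe (e.g. a remote invalidate)
# must consult:
inclusive_probe = ["L2"]        # an L2 tag miss proves the L1 misses too
exclusive_probe = ["L1", "L2"]  # the line could live in either level

print(f"capacity: {inclusive_kb} KB inclusive vs {exclusive_kb} KB exclusive")
print(f"probe checks: {inclusive_probe} vs {exclusive_probe}")
```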
Exclusive caches have been built; as far as I know, AMD actually implements them. And some of the complexity involved is that you really have to swap lines between the two caches, because a victim you take from one cache can't be duplicated in the other. So when you take a victim from the L1, it gets pushed down to the L2, and when the requested data is coming in, that's a transfer from the L2 up to the L1. This swap gets hard to make correct under all circumstances; you'll see that once you sit down and work through it.

And one last reason why exclusive caches are hard to do: if the cache line sizes of the two levels are different, you basically can't build an exclusive cache, because you're swapping lines between these two structures, and mismatched line sizes just don't work (see the sketch below). Lots of times people like to have a smaller block size in the L1, because you get more blocks to work with than in the L2. But if the block sizes are different, say a 64-byte block in the L2 and a 32-byte block in the L1, you can't represent an evicted L1 line as an L2 line: it takes two of the smaller cache lines to put together one of the bigger cache lines, so you can't really swap them; they're not compatible. So an exclusive cache forces you to have the same line size between the two levels. The benefit, of course, is that you get more effective capacity.

This is also another argument for inclusive caches: if you follow the rule-of-eight multiplication from a multilevel cache perspective, that step up in size means the duplication kind of turns into noise, because yes, you are duplicating something, but the L1's contents are only a small fraction of the L2's capacity.
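The line-size constraint mentioned a moment ago can be stated in one function. A sketch, with sizes assumed purely for illustration: the swap only works when a victim from one level exactly fills a frame in the other.

```python
def can_swap(l1_line_bytes, l2_line_bytes):
    """An exclusive hierarchy swaps victims between levels, so a line
    evicted from one level must exactly fill a frame of the other."""
    return l1_line_bytes == l2_line_bytes

# Two 32-byte L1 lines make up one 64-byte L2 line, so a single L1
# victim only fills half an L2 frame; the other half would have to be
# fetched or tracked separately, which defeats the simple swap.
print(can_swap(32, 64))   # False
print(can_swap(64, 64))   # True
```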
Okay. So, multilevel caches are ubiquitous; let's look at a few real examples. Here we have the Itanium 2, the VLIW (EPIC) architecture from Intel and HP that we talked about. If you look, they have a sixteen-kilobyte L1, and then an L2 that changes both parameters: it changes the size, stepping up by more than an 8x multiplier, and they also changed the line size. And then they even have a third level; once you have two, you might as well go to three. They have a third-level cache because they had the space: this is an expensive processor, they had a lot of die area, so they put a level-three cache on it. In the third-level cache, they actually increased the associativity quite a bit, to 12-way set associative, and they made the cache a lot bigger, three megabytes; the line size is the same as the L2's. Now look at the latency: they have single-cycle latency in the L1, which is usually what you want, very fast latency in the L1; five cycles in the L2; twelve cycles in the L3. The latency steps up as you get farther away.

If we look at the Power7, this was the penultimate Power architecture from IBM; the Power8 just came out. Here you can see they have lots of cache in the middle, and this is a multicore: eight cores. Each of those cores has a 256-kilobyte L2, and slightly bigger L1s: a 32-kilobyte instruction L1 and a 32-kilobyte data L1. The latencies step up a bit as well: the L1 is three cycles, [INAUDIBLE] for the L2, and [INAUDIBLE] for the L3. But the L3 is huge, 32 megabytes, bigger than on the Intel processor we've seen before; that's a lot of data out there, but it's also being shared between the processors. So this is private L1s, private L2s, and a shared L3. An interesting design. Multilevel caches show up in basically all modern processors.

Okay. So, let's take a look at the efficacy of the multilevel cache, first as if it were just the L1 by itself. Looking only at the L1, its miss penalty goes down: before, if you missed in your L1, you had to go out to main memory, and that was expensive, but now you can go to your L2 or L3 instead. And if you draw a box around the L1, L2, and L3, everything together, the hierarchy is both going to reduce your miss penalty and, because you have more cache space, reduce your miss rate. So, drawing the box around all the caches in the structure, a multilevel cache system brings the miss penalty and the miss rate down together.
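To put that box-drawing argument in numbers, here is a sketch using the hit latencies quoted for the Itanium 2 above (1, 5, and 12 cycles); the miss rates and the memory latency are my own illustrative assumptions, just to show the shape.

```python
def amat3(t1, m1, t2, m2, t3, m3, t_mem):
    """Three-level AMAT: each level's miss penalty is the AMAT of the
    level below it."""
    return t1 + m1 * (t2 + m2 * (t3 + m3 * t_mem))

# Hit latencies from the Itanium 2 numbers above; everything else assumed.
T1, T2, T3, T_MEM = 1, 5, 12, 200
M1, M2, M3 = 0.10, 0.30, 0.20       # local miss rates (illustrative)

with_hierarchy = amat3(T1, M1, T2, M2, T3, M3, T_MEM)
l1_only = T1 + M1 * T_MEM           # same L1, missing straight to memory

print(f"L1 -> L2 -> L3 -> memory: {with_hierarchy:.2f} cycles")   # 3.06
print(f"L1 -> memory:             {l1_only:.2f} cycles")          # 21.00
```

With these assumed numbers, the miss penalty the L1 sees drops from 200 cycles to about 20.6, which is the "reduced miss penalty" effect; the extra capacity lowering the miss rates is the second effect of drawing the box around the whole hierarchy.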