Let's think about the next optimization, which is trying to get data into the cache early. This is a technique where we're going to try to combat compulsory misses, or first-time misses. What we do is actually go and fetch data before we need it. We can do this in a couple of different ways: we can do it with just hardware, we can do it with software, or we can do it with a mixed scheme, where the hardware and software cooperate. And this really attacks our compulsory misses.

One thing I actually wanted to say here is that instruction fetches are easier to predict than data cache accesses, it seems. You'll sometimes see this show up in processors, where a processor will have an instruction prefetcher but will not have data prefetching. That's because a lot of the time programs just execute instructions in a row — they're sort of marching through valid instructions and keep falling through. So it's very easy to predict, with high probability, that if you're executing code in one block, the very next block will be accessed. It's sort of the extreme of spatial locality. So when you fetch, let's say, block one, you go and fetch block two of your instructions. A simple next-block read makes an effective prefetcher for instructions.

But nothing in life is free. Prefetching can hurt you — there's a danger of cache pollution here. What you really want from your prefetching of data, let's say from your L2 into your L1, is to increase the probability that you're going to find data there. But if you go and pull in data that just wasn't useful, you're actually going to pollute the cache, or effectively decrease the capacity of the cache, with extra data in there that wasn't needed. So you have to be very careful with prefetching, because it has the ability to get rid of compulsory misses, but it also has the ability to hurt your performance.

Timeliness in particular is an important issue here: if you go and prefetch too late, the data is not really useful. What do we mean by too late? Well, if you were to do a prefetch and the exact next instruction is the load that was going to use the data you just prefetched, it's not going to save you almost any time. It might save you one cycle, so that's not really useful. But if you prefetch too early and the data gets kicked out by some other piece of data before you use it, it didn't do you any good either.
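To make that next-block idea concrete, here is a minimal sketch in C of a next-line instruction prefetcher. All of the names and the tiny direct-mapped cache model are invented for illustration — a real prefetcher is wired into the fetch pipeline, not written as C functions:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BLOCK_SIZE 64   /* assumed cache block size in bytes */
#define NUM_SETS   256  /* tiny direct-mapped I-cache model, for illustration */

static uint64_t tags[NUM_SETS];
static bool     valid[NUM_SETS];

static bool cache_contains(uint64_t block)
{
    return valid[block % NUM_SETS] && tags[block % NUM_SETS] == block;
}

static void install(uint64_t block)
{
    valid[block % NUM_SETS] = true;
    tags[block % NUM_SETS]  = block;
}

/* Next-line prefetcher: on a fetch of block i, speculatively pull in
 * block i+1, checking the cache first so we don't waste bandwidth to
 * the next level on a block we already have. */
static void on_instruction_fetch(uint64_t pc)
{
    uint64_t block = pc / BLOCK_SIZE;
    if (!cache_contains(block))        /* demand miss: must be serviced */
        install(block);

    uint64_t next = block + 1;
    if (!cache_contains(next)) {       /* prefetch block i+1 */
        printf("prefetching block %llu\n", (unsigned long long)next);
        install(next);
    }
}

int main(void)
{
    /* Straight-line code marching through memory: after the very
     * first miss, every block is found already prefetched. */
    for (uint64_t pc = 0; pc < 1024; pc += 4)
        on_instruction_fetch(pc);
    return 0;
}
```

After the first compulsory miss, the sequential fetch stream hits on blocks the prefetcher has already pulled in — which is exactly why this heuristic pays off for instructions.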
It also just wasted bandwidth up to your next level of the cache hierarchy. So prefetching is a really tricky thing to get right, especially this timing — get it wrong and you lose both cache capacity to useless data and bandwidth out to the next level. So you need to make sure you're not using extra bandwidth that you could be using for something else. If you have a demand miss from the L1 or the L2, you want to make sure that always takes precedence over any prefetches.

If we go look at instruction prefetching in particular — as I said, this is usually a benefit, because there's a high correlation that as you go request block i, you're going to go request block i+1 if it's not already in your cache. When you go to do this, though, you just don't want to go blindly request i+1 from the next-level cache. You probably want to check the instruction cache first to make sure you don't already have block i+1, because otherwise you're just wasting extra bandwidth to the L2 that you shouldn't be using.

Okay, so here's an interesting optimization on top of prefetching. Instead of always filling, let's say, the subsequent block into the L1, you can put a little buffer here. It's not quite a victim cache, but it's a similar kind of extra little cache on the side — a fill buffer, or stream buffer. What's nice about this is that you can fetch, let's say, blocks i and i+1, put both of them in here, and maybe even fetch i+2 — you're potentially using some extra bandwidth here. Then if you take a cache miss and the next block is already sitting there, you can execute right out of the stream buffer, or prefetch buffer. This is actually a common theme, used on the instruction side and sometimes on the data side, where people will prefetch into a prefetch buffer rather than into the cache itself, and they do this to prevent polluting the cache with data that may never be used. But the data that's in this prefetch buffer, or stream buffer, has a high probability of being used, so it's probably a good idea to have one of those.
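As a rough illustration of that side buffer, here's a small C model of a stream buffer consulted on a cache miss. The depth, the instant-refill behavior, and the sliding-window policy are all simplifying assumptions for the sketch, not a description of any real design:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define DEPTH 4  /* assumed stream-buffer depth; real ones are similarly small */

/* A stream buffer: a little FIFO of prefetched blocks sitting beside
 * the cache, so speculative data never pollutes the cache itself. */
struct stream_buffer {
    uint64_t block[DEPTH];
    bool     valid[DEPTH];
};

/* On a cache miss for `miss_block`, check the stream buffer first.
 * On a hit, the block is consumed by the cache and the buffer slides
 * forward by prefetching one block past its deepest entry. */
static bool stream_buffer_lookup(struct stream_buffer *sb, uint64_t miss_block)
{
    for (int i = 0; i < DEPTH; i++) {
        if (sb->valid[i] && sb->block[i] == miss_block) {
            sb->block[i] = miss_block + DEPTH;  /* refill: next block in stream */
            sb->valid[i] = true;                /* model: arrives instantly */
            return true;                        /* hit: no trip to L2 needed */
        }
    }
    /* Miss in the buffer too: restart the stream at miss_block+1. */
    for (int i = 0; i < DEPTH; i++) {
        sb->block[i] = miss_block + 1 + i;
        sb->valid[i] = true;
    }
    return false;  /* demand miss goes to the next level as usual */
}

int main(void)
{
    struct stream_buffer sb = {0};
    /* A sequential miss stream: the first miss primes the buffer,
     * then every following block hits in it. */
    for (uint64_t b = 100; b < 110; b++)
        printf("block %llu: %s\n", (unsigned long long)b,
               stream_buffer_lookup(&sb, b) ? "stream-buffer hit"
                                            : "miss to next level");
    return 0;
}
```

The key design point is that speculative blocks live in the buffer, not the cache: a wrong guess costs some bandwidth but never evicts useful cache lines.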
Okay. So what are good heuristics on the data side? There are a couple of different heuristics here. The most basic one is prefetch-on-miss: if you miss on block b, go fetch block b+1 as well. Note that this is not prefetch-on-access. It's not saying that whenever you touch block b somewhere in the cache, you go get the next block; instead, it's saying that if you miss on block b, you also go get the next line. People have actually built a lot of those.

The one we want to be a little more careful with is when you're saying: if I access block b in the cache — if it hits in our L1 cache — should we go and preemptively try to get b+1? You've got to be careful with this one, because you can very easily be pulling lots of data that you don't really need into your cache. Every single load that hits in your L1 cache could pull in more data, so you have to be a little bit apprehensive about doing this one. If you want to be more aggressive and you have big caches, though, you can be more aggressive, because a big cache can absorb the extra data.

If you go look at something like the Pentium 4 and beyond — so that's in your desktop, possibly the Core i7 — they actually have stride prefetch. They will detect a program that is accessing data at regular offsets. So, let's say you access location 1, then location 128+1, then location 256+1 — you're basically just moving through data, not looking for the next line after the next line after the next line, but instead looking with a given stride. This is actually pretty common when you think about structures laid out in memory. If you think of a C-style structure in an array, the elements are all going to be packed, and they're going to be spaced by the size of the structure. Say that inside the structure there's an int, a pointer, and a floating-point number, and you have a loop which increments the first int, we'll say, of each structure. You're going to be striding through that memory, touching the same offset within each structure as you execute that loop.
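Here's a small hypothetical example of that access pattern — the field names and sizes are made up, but the point is that consecutive accesses are separated by sizeof(struct record) bytes, a constant stride:

```c
#include <stddef.h>
#include <stdio.h>

/* A C-style structure: an array of these is packed in memory, with
 * elements spaced by sizeof(struct record). */
struct record {
    int    count;   /* the field the loop touches */
    void  *ptr;
    double value;
};

#define N 1000

int main(void)
{
    static struct record recs[N];

    /* Each iteration loads and stores recs[i].count, so the addresses
     * step by sizeof(struct record) -- a constant stride a hardware
     * stride detector can lock onto and prefetch ahead of. */
    for (size_t i = 0; i < N; i++)
        recs[i].count++;

    printf("stride between accesses: %zu bytes\n", sizeof(struct record));
    return 0;
}
```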
So what you do is you have something that's called a stride detector. It looks at the memory references you make and tries to see if they're correlated somehow, and given the correlation, it will make a prediction. So if you access address 0, then address 128, then address 256, it'll predict and say: I bet they're going to access 256+128 next — and it just goes and pulls that address in. So the Pentium 4 and the Intel Core i7 have stride detectors. An even better example is the POWER5: it has eight independent stride prefetch detectors. So these things get pretty sophisticated. You can go pull in lots of extra data, and this is important in sort of the care and feeding of the processor, so that you don't have to wait for the cache miss.

You can also do this in software, just as well as in hardware. If you go look at software prefetch, you can add extra instructions to your ISA which say: go get this data early. This is kind of akin to things we've looked at doing in out-of-order superscalars, and VLIWs did a lot of this sort of stuff in software. Well, you can do prefetching in software, too. But I should point out that to do this prefetching effectively in software, we changed the ISA — we added some sort of prefetch instruction.

There's one other little trick we have to do here. In this loop — just to say what's going on — we're multiplying A[i] times B[i], and while we're computing the current iteration, we're prefetching a subsequent loop iteration's data, so the prefetch is issued ahead of the iteration that uses it. But you need a few little tweaks to make this work. You probably need a non-blocking memory system, because you need some way for this prefetch to go out to the memory system without just stalling; otherwise, it didn't help you at all.
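The slide's exact code isn't captured here, but a sketch of that kind of loop, using GCC/Clang's __builtin_prefetch as a stand-in for an ISA prefetch instruction, might look like this — the prefetch distance of 16 iterations is an assumed, machine-dependent tuning parameter:

```c
#include <stdio.h>

/* __builtin_prefetch emits the target ISA's prefetch instruction
 * where one exists, or nothing otherwise. */
#define N    4096
#define DIST 16  /* assumed prefetch distance, tuned per machine */

void multiply(double *restrict c, const double *restrict a,
              const double *restrict b)
{
    for (int i = 0; i < N; i++) {
        /* Ask for data a few iterations ahead, so it arrives in the
         * cache by the time the loop actually loads it.  This only
         * helps if the memory system is non-blocking, so the loop
         * keeps running while the prefetches are in flight. */
        if (i + DIST < N) {
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1);
            __builtin_prefetch(&b[i + DIST], 0 /* read */, 1);
        }
        c[i] = a[i] * b[i];
    }
}

int main(void)
{
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0; }
    multiply(c, a, b);
    printf("c[100] = %f\n", c[100]);
    return 0;
}
```

Picking the distance is exactly the timeliness trade-off from before: too small and the data arrives late; too large and it may get kicked out before it's used.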
We'll talk about non-blocking and out-of-order memory systems in the next lecture, but for this software prefetch to work, you probably need something like that. We're out of time today, so I'll pick this up next time with software prefetching, and then some other software optimization techniques, and we'll start talking about out-of-order memory systems.