Let's think about the next optimization, which is trying to get data into the cache early. This is a technique for combating compulsory misses, or first-time misses: we actually go and fetch data before we need it. We can do this a couple of different ways. We can do it purely in hardware, we can do it in software, or we can do it with a mix where the hardware and the software cooperate. And this really attacks our compulsory misses.

I actually wanted to say something here: instruction fetches are easier to predict than data cache accesses. You'll sometimes see this show up in processors, where a processor will have an instruction prefetcher but will not have a data prefetcher. That's because code a lot of the time just executes instructions in a row; it keeps falling through to the next instruction. So it's very easy to predict, with high probability, that if you're executing code in one block, the very next block will be accessed. It's sort of the extreme of spatial locality: if you fetch, let's say, block one of your instructions, you go fetch block two as well. So you can build a very effective prefetcher for instructions.

Now, nothing is free in life; prefetching can hurt you. The whole point of prefetching data, say from your L2 into your L1, is to increase the probability that you're going to find the data there. But if you go and pull in data that just wasn't useful, you're going to pollute the cache, effectively decreasing its capacity with extra data that wasn't needed. So you have to be very careful with prefetching: it has the ability to get rid of compulsory misses, but it also has the ability to hurt your performance.

Timeliness is an important issue here. If you prefetch too late, the data is not really useful. What do we mean by too late? Well, if you do a prefetch and the very next instruction is the load that was going to use the data you just prefetched, it's not going to save you much time, maybe one cycle, so that's not really useful. But if you prefetch too early and the data gets kicked out by some other piece of data before you use it, it didn't do you any good either; it just wasted bandwidth up to your next level of the cache hierarchy. So prefetching is a tricky thing to get right, especially the timing. And of course a bad prefetch both pollutes the cache and increases the bandwidth out to the next level, so you need to make sure you're not using extra bandwidth that you could be using for something else. If you have a demand miss from the L1 or the L2, you want to make sure that it always takes precedence over any prefetches.

If we look at instruction prefetching in particular, as I said, this is usually a benefit, because there's a high correlation: when you request block i, you're going to go request block i+1 if it's not already in your cache. When you do this, though, you don't want to just blindly request block i+1 from the next-level cache. You probably want to check the instruction cache first to make sure you don't already have block i+1, because otherwise you're just wasting extra bandwidth out to the L2.
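As a rough illustration (not from the lecture, with invented helper names), the next-line instruction prefetch check just described might look something like this sketch:

```c
/* Hypothetical sketch of next-line instruction prefetching: on an
 * I-cache miss for block i, prefetch block i+1 from the next level,
 * but only after checking that it isn't already in the I-cache, so
 * we don't waste bandwidth on redundant requests. */
typedef unsigned long block_t;

int  icache_contains(block_t blk);          /* assumed lookup helper   */
void prefetch_from_next_level(block_t blk); /* assumed prefetch helper */

void on_icache_miss(block_t blk)
{
    block_t next = blk + 1;                 /* sequential next-line target */
    if (!icache_contains(next))             /* avoid a redundant prefetch  */
        prefetch_from_next_level(next);
}
```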
Okay, so here's an interesting point, an interesting optimization on top of prefetching. Instead of always filling, let's say, the subsequent block into the L1, you can add a little buffer here. It's not quite a victim cache, but it's an extra little cache on the side, a fill buffer or stream buffer. What's nice about this is you can fetch, say, block i and block i+1, put both of them in there, and maybe even fetch i+2; you're potentially using some extra bandwidth here. Then if you miss in the cache and the next block is already sitting in that buffer, you can execute right out of the stream buffer, or prefetch buffer. This is actually a common theme, used both on the instruction side and sometimes on the data side: people will prefetch into a prefetch buffer rather than into the cache itself, to prevent polluting the cache with data that may never be used. The data that's in this prefetch buffer, or stream buffer, has a high probability of being used, so it's probably a good idea if you have one of those.

Okay, so what are good heuristics on the data side? A couple of different heuristics here. The most basic one is prefetch-on-miss: if you miss on block b, go fetch block b+1. This is not the same as prefetching on access. If you touch block b and it's already somewhere in the cache, this is not saying go get the next block; it only says that if you miss on block b, go get the next line. People have built a lot of those. The one we want to be a little more careful with is prefetching on access, where you say: if I access block b and it hits in our L1 cache, should we preemptively go get b+1? You've got to be careful with this one, because you can very easily be pulling in lots of data you don't really need; effectively, on every single load that hits in your L1 cache, you could be pulling in extra data. So you have to be a little apprehensive about doing this one. If you want to be more aggressive and you have big caches, though, you can be more aggressive, because the cost of filling a big cache with extra data is lower.

If you go look at something like the Pentium 4 and beyond, the processor in your desktop, probably the Core i7 as well, they actually have stride prefetch. They detect a program that is accessing data at a fixed offset going forward. Let's say you access location 1, then location 128 plus 1, then location 256 plus 1: you're basically just moving through the data, not looking for the next line after the next line after the next line, but instead accessing with a given stride. This is actually pretty common when you think about how structures are laid out in memory. If you have an array of C-style structures, the elements are all packed back to back, spaced by the size of the structure. Say the structure has an int, a pointer, and a floating point number inside of it, and you go touch, let's say, the first field of each structure; a little sketch of that layout is shown below.
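A minimal C sketch (names invented for illustration) of the kind of strided access pattern being described: an array of structs where a loop touches one field of each element.

```c
/* Array-of-structs layout: elements are packed back to back, so touching
 * the same field of successive elements produces loads separated by a
 * constant stride of sizeof(struct record). */
#include <stddef.h>

struct record {
    int    count;   /* the field the loop below touches        */
    void  *ptr;     /* other fields just set the overall size  */
    double value;
};

void bump_counts(struct record *a, size_t n)
{
    /* Loads go to &a[0].count, &a[1].count, ... each sizeof(struct record)
     * bytes apart -- exactly the fixed stride a stride prefetcher detects. */
    for (size_t i = 0; i < n; i++)
        a[i].count++;
}
```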
And when you have a loop that, say, increments that first int in each structure, you're going to be striding through that memory: you're touching the same field of successive structures as you execute that loop. So what you build is something called a stride detector. It looks at the memory references you make and tries to see if they are correlated somehow, and given the correlation it makes a prediction. If you access address 0, then address 128, then address 256, it will predict: I bet they're going to access address 384 next, and it just goes and pulls in that address. The Pentium 4 and later Intel processors are listed as having stride detectors, as is the IBM POWER5, which has eight independent stride prefetch detectors. So these can get pretty sophisticated. You can pull in lots of extra data, and this is important in the care and feeding of the processor, so you don't have to wait for the cache miss.

You can also do this in software, just as well as in hardware. With software prefetch, you add extra instructions to your ISA which say: go get this data early. This is akin to things we've looked at before, where superscalars do work in hardware and VLIWs did a lot of that work in software; well, you can do prefetching in software too. But I should point out that to effectively do this prefetching in software, we changed the ISA: we added some sort of prefetch instruction. In the loop here, what's going on is that we're working with a[i] and b[i]; we multiply a[i] times b[i], and we're actually prefetching the data for a subsequent loop iteration, so the prefetch can complete before that later iteration needs the data. But you need a few little details for this to work. You probably need a non-blocking memory system, because otherwise you need some way for this prefetch to go out to the memory system without just stalling; if it stalls, it didn't help you at all. We'll talk about non-blocking and out-of-order memory systems in the next lecture, but to make software prefetch work you probably need something like that. We're out of time today, so I'll pick this up next time with software prefetching and then some other software optimization techniques, and we'll start talking about out-of-order memory systems.
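For reference, here is a minimal sketch of the kind of software-prefetch loop described above, assuming GCC/Clang's __builtin_prefetch intrinsic as the prefetch instruction; the exact loop body and the look-ahead distance are illustrative assumptions, not taken from the lecture slides.

```c
/* Sketch of software prefetching: while computing with a[i] and b[i],
 * prefetch the operands a few iterations ahead so the data arrives
 * before it is needed.  AHEAD is an arbitrary tuning knob. */
#include <stddef.h>

#define AHEAD 8  /* iterations of look-ahead (hypothetical choice) */

void multiply_arrays(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n) {
            __builtin_prefetch(&a[i + AHEAD], 1, 3);  /* will be written */
            __builtin_prefetch(&b[i + AHEAD], 0, 3);  /* read only       */
        }
        a[i] = a[i] * b[i];   /* the work of the current iteration */
    }
}
```

Note that this only pays off if the prefetches can be outstanding while the loop keeps executing, which is exactly the non-blocking memory system requirement mentioned above.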