Let's think about the next optimization, which is trying to get data into the cache early. This is a technique where we're going to try to combat compulsory misses, or first-time misses. What we do is actually go and fetch data before we need it. We can do this in a couple of different ways: we can do it with just hardware, we can do it with software, or we can do it with a mixed scheme, where the hardware and software cooperate. And this really attacks our compulsory misses.

One thing I actually wanted to say here is that instruction fetches are easier to predict than data cache accesses, it seems. You'll sometimes see this show up in processors, where a processor will have an instruction prefetcher but will not have data prefetching. That's because a lot of the time programs just execute instructions in a row — they're sort of marching through valid instructions and keep falling through. So it's very easy to predict, with high probability, that if you're executing code in one block, the very next block will be accessed. It's sort of the extreme of spatial locality. So when you fetch, let's say, block one, you go and fetch block two of your instructions. A simple next-block read makes an effective prefetcher for instructions.

But nothing in life is free. Prefetching can hurt you — there's a danger of cache pollution here. What you really want from your prefetching of data, let's say from your L2 into your L1, is to increase the probability that you're going to find data there. But if you go and pull in data that just wasn't useful, you're actually going to pollute the cache, or effectively decrease the capacity of the cache, with extra data in there that wasn't needed. So you have to be very careful with prefetching, because it has the ability to get rid of compulsory misses, but it also has the ability to hurt your performance.

Timeliness in particular is an important issue here: if you go and prefetch too late, the data is not really useful. What do we mean by too late? Well, if you were to do a prefetch and the exact next instruction is the load that was going to use the data you just prefetched, it's not going to save you almost any time. It might save you one cycle, so that's not really useful. But if you prefetch too early and the data gets kicked out by some other piece of data before you use it, it didn't do you any good either.
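To make that next-block idea concrete, here is a minimal sketch in C of a next-line instruction prefetcher. All of the names and the tiny direct-mapped cache model are invented for illustration — a real prefetcher is wired into the fetch pipeline, not written as C functions:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BLOCK_SIZE 64   /* assumed cache block size in bytes */
#define NUM_SETS   256  /* tiny direct-mapped I-cache model, for illustration */

static uint64_t tags[NUM_SETS];
static bool     valid[NUM_SETS];

static bool cache_contains(uint64_t block)
{
    return valid[block % NUM_SETS] && tags[block % NUM_SETS] == block;
}

static void install(uint64_t block)
{
    valid[block % NUM_SETS] = true;
    tags[block % NUM_SETS]  = block;
}

/* Next-line prefetcher: on a fetch of block i, speculatively pull in
 * block i+1, checking the cache first so we don't waste bandwidth to
 * the next level on a block we already have. */
static void on_instruction_fetch(uint64_t pc)
{
    uint64_t block = pc / BLOCK_SIZE;
    if (!cache_contains(block))        /* demand miss: must be serviced */
        install(block);

    uint64_t next = block + 1;
    if (!cache_contains(next)) {       /* prefetch block i+1 */
        printf("prefetching block %llu\n", (unsigned long long)next);
        install(next);
    }
}

int main(void)
{
    /* Straight-line code marching through memory: after the very
     * first miss, every block is found already prefetched. */
    for (uint64_t pc = 0; pc < 1024; pc += 4)
        on_instruction_fetch(pc);
    return 0;
}
```

After the first compulsory miss, the sequential fetch stream hits on blocks the prefetcher has already pulled in — which is exactly why this heuristic pays off for instructions.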
It also just wasted bandwidth up to your next level of the cache hierarchy. So prefetching is a really tricky thing to get right, especially this timing — get it wrong and you lose both cache capacity to useless data and bandwidth out to the next level. So you need to make sure you're not using extra bandwidth that you could be using for something else. If you have a demand miss from the L1 or the L2, you want to make sure that always takes precedence over any prefetches.

If we go look at instruction prefetching in particular — as I said, this is usually a benefit, because there's a high correlation that as you go request block i, you're going to go request block i+1 if it's not already in your cache. When you go to do this, though, you just don't want to go blindly request i+1 from the next-level cache. You probably want to check the instruction cache first to make sure you don't already have block i+1, because otherwise you're just wasting extra bandwidth to the L2 that you shouldn't be using.

Okay, so here's an interesting optimization on top of prefetching. Instead of always filling, let's say, the subsequent block into the L1, you can put a little buffer here. It's not quite a victim cache, but it's a similar kind of extra little cache on the side — a fill buffer, or stream buffer. What's nice about this is that you can fetch, let's say, blocks i and i+1, put both of them in here, and maybe even fetch i+2 — you're potentially using some extra bandwidth here. Then if you take a cache miss and the next block is already sitting there, you can execute right out of the stream buffer, or prefetch buffer. This is actually a common theme, used on the instruction side and sometimes on the data side, where people will prefetch into a prefetch buffer rather than into the cache itself, and they do this to prevent polluting the cache with data that may never be used. But the data that's in this prefetch buffer, or stream buffer, has a high probability of being used, so it's probably a good idea to have one of those.
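As a rough illustration of that side buffer, here's a small C model of a stream buffer consulted on a cache miss. The depth, the instant-refill behavior, and the sliding-window policy are all simplifying assumptions for the sketch, not a description of any real design:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define DEPTH 4  /* assumed stream-buffer depth; real ones are similarly small */

/* A stream buffer: a little FIFO of prefetched blocks sitting beside
 * the cache, so speculative data never pollutes the cache itself. */
struct stream_buffer {
    uint64_t block[DEPTH];
    bool     valid[DEPTH];
};

/* On a cache miss for `miss_block`, check the stream buffer first.
 * On a hit, the block is consumed by the cache and the buffer slides
 * forward by prefetching one block past its deepest entry. */
static bool stream_buffer_lookup(struct stream_buffer *sb, uint64_t miss_block)
{
    for (int i = 0; i < DEPTH; i++) {
        if (sb->valid[i] && sb->block[i] == miss_block) {
            sb->block[i] = miss_block + DEPTH;  /* refill: next block in stream */
            sb->valid[i] = true;                /* model: arrives instantly */
            return true;                        /* hit: no trip to L2 needed */
        }
    }
    /* Miss in the buffer too: restart the stream at miss_block+1. */
    for (int i = 0; i < DEPTH; i++) {
        sb->block[i] = miss_block + 1 + i;
        sb->valid[i] = true;
    }
    return false;  /* demand miss goes to the next level as usual */
}

int main(void)
{
    struct stream_buffer sb = {0};
    /* A sequential miss stream: the first miss primes the buffer,
     * then every following block hits in it. */
    for (uint64_t b = 100; b < 110; b++)
        printf("block %llu: %s\n", (unsigned long long)b,
               stream_buffer_lookup(&sb, b) ? "stream-buffer hit"
                                            : "miss to next level");
    return 0;
}
```

The key design point is that speculative blocks live in the buffer, not the cache: a wrong guess costs some bandwidth but never evicts useful cache lines.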
Okay. So what are good heuristics on the data side? There are a couple of different heuristics here. The most basic one is prefetch-on-miss: if you miss on block b, go fetch block b+1 as well. Note that this is not prefetch-on-access. It's not saying that whenever you touch block b somewhere in the cache, you go get the next block; instead, it's saying that if you miss on block b, you also go get the next line. People have actually built a lot of those.

The one we want to be a little more careful with is when you're saying: if I access block b in the cache — if it hits in our L1 cache — should we go and preemptively try to get b+1? You've got to be careful with this one, because you can very easily be pulling lots of data that you don't really need into your cache. Every single load that hits in your L1 cache could pull in more data, so you have to be a little bit apprehensive about doing this one. If you want to be more aggressive and you have big caches, though, you can be more aggressive, because a big cache can absorb the extra data.

If you go look at something like the Pentium 4 and beyond — so that's in your desktop, possibly the Core i7 — they actually have stride prefetch. They will detect a program that is accessing data at regular offsets. So, let's say you access location 1, then location 128+1, then location 256+1 — you're basically just moving through data, not looking for the next line after the next line after the next line, but instead looking with a given stride. This is actually pretty common when you think about structures laid out in memory. If you think of a C-style structure in an array, the elements are all going to be packed, and they're going to be spaced by the size of the structure. Say that inside the structure there's an int, a pointer, and a floating-point number, and you have a loop which increments the first int, we'll say, of each structure. You're going to be striding through that memory, touching the same offset within each structure as you execute that loop.
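Here's a small hypothetical example of that access pattern — the field names and sizes are made up, but the point is that consecutive accesses are separated by sizeof(struct record) bytes, a constant stride:

```c
#include <stddef.h>
#include <stdio.h>

/* A C-style structure: an array of these is packed in memory, with
 * elements spaced by sizeof(struct record). */
struct record {
    int    count;   /* the field the loop touches */
    void  *ptr;
    double value;
};

#define N 1000

int main(void)
{
    static struct record recs[N];

    /* Each iteration loads and stores recs[i].count, so the addresses
     * step by sizeof(struct record) -- a constant stride a hardware
     * stride detector can lock onto and prefetch ahead of. */
    for (size_t i = 0; i < N; i++)
        recs[i].count++;

    printf("stride between accesses: %zu bytes\n", sizeof(struct record));
    return 0;
}
```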
So what you do is you have something that's called a stride detector. It looks at the memory references you make and tries to see if they're correlated somehow, and given the correlation, it will make a prediction. So if you access address 0, then address 128, then address 256, it'll predict and say: I bet they're going to access 256+128 next — and it just goes and pulls that address in. So the Pentium 4 and the Intel Core i7 have stride detectors. An even better example is the POWER5: it has eight independent stride prefetch detectors. So these things get pretty sophisticated. You can go pull in lots of extra data, and this is important in sort of the care and feeding of the processor, so that you don't have to wait for the cache miss.

You can also do this in software, just as well as in hardware. If you go look at software prefetch, you can add extra instructions to your ISA which say: go get this data early. This is kind of akin to things we've looked at doing in out-of-order superscalars, and VLIWs did a lot of this sort of stuff in software. Well, you can do prefetching in software, too. But I should point out that to do this prefetching effectively in software, we changed the ISA — we added some sort of prefetch instruction.

There's one other little trick we have to do here. In this loop — just to say what's going on — we're multiplying A[i] times B[i], and while we're computing the current iteration, we're prefetching a subsequent loop iteration's data, so the prefetch is issued ahead of the iteration that uses it. But you need a few little tweaks to make this work. You probably need a non-blocking memory system, because you need some way for this prefetch to go out to the memory system without just stalling; otherwise, it didn't help you at all.
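The slide's exact code isn't captured here, but a sketch of that kind of loop, using GCC/Clang's __builtin_prefetch as a stand-in for an ISA prefetch instruction, might look like this — the prefetch distance of 16 iterations is an assumed, machine-dependent tuning parameter:

```c
#include <stdio.h>

/* __builtin_prefetch emits the target ISA's prefetch instruction
 * where one exists, or nothing otherwise. */
#define N    4096
#define DIST 16  /* assumed prefetch distance, tuned per machine */

void multiply(double *restrict c, const double *restrict a,
              const double *restrict b)
{
    for (int i = 0; i < N; i++) {
        /* Ask for data a few iterations ahead, so it arrives in the
         * cache by the time the loop actually loads it.  This only
         * helps if the memory system is non-blocking, so the loop
         * keeps running while the prefetches are in flight. */
        if (i + DIST < N) {
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1);
            __builtin_prefetch(&b[i + DIST], 0 /* read */, 1);
        }
        c[i] = a[i] * b[i];
    }
}

int main(void)
{
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0; }
    multiply(c, a, b);
    printf("c[100] = %f\n", c[100]);
    return 0;
}
```

Picking the distance is exactly the timeliness trade-off from before: too small and the data arrives late; too large and it may get kicked out before it's used.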
We'll talk about non-blocking and out-of-order memory systems in the next lecture, but for this software prefetch to work, you probably need something like that. We're out of time today, so I'll pick this up next time with software prefetching, and then some other software optimization techniques, and we'll start talking about out-of-order memory systems.