Okay. So, we're through three of our seven optimizations for today. Let's look at the next one.

This comes back to the fundamental problem that you can't build something that is both big and fast; the two trade off against each other. You can build a really big cache, you can build a big memory. You can have boxes of memory that fill this entire room; if you go into some data center, people actually build big boxes of RAM that would fill this entire room, but it's slow, it takes a long time to access. There's a bunch of multiplexers that have to choose between all the different racks of RAM, or, if you have a 32-megabyte cache, that's a lot of multiplexing. So, you can't have something that is both big and fast.

Our solution to this, multilevel caches, is to put progressively larger caches as you go farther out. You have a small cache which you access very frequently and which has a very fast cycle time. You put an L2 behind it, which is a little farther away but can hold more data. And then you have, let's say, main memory, or maybe an L3 out here, which holds still more data but takes longer to access.

So, that's really the insight here: you can have larger cache sizes as you get farther away, and it helps to have larger caches because you'll be able to hold more data. Performance then decays gracefully instead of falling off a cliff, where either you're in the cache or you're not. So multiple levels of caches can fix this problem.

The other thing you can do is, because the farther cache already takes more cycles to reach, you can afford for it to be bigger and take multiple cycles to access. If it takes two cycles to access instead of one, well, it already took you, let's say, a few cycles to get out to that cache, so you have the extra time to go do that.

So, these are our multilevel caches. On lots of processors, people will quote the L1 cache size, the L2 cache size, and the L3 cache size.
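To see how the levels compose quantitatively, the usual tool is average memory access time (AMAT), where each level's miss penalty is just the AMAT of the level below it. A minimal statement of the recursion; the notation here is mine, not from the slides, with t for hit time and m for local miss rate:

```latex
\mathrm{AMAT} = t_{L1} + m_{L1}\left( t_{L2} + m_{L2}\left( t_{L3} + m_{L3}\, t_{\mathrm{mem}} \right) \right)
```

The parenthesized term is the miss penalty the L1 sees; each added level shrinks that penalty without touching the L1's own hit time.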
One thing I wanted to say about this is that there are some rules of thumb that apply here, and they're somewhat arbitrary and empirically found. Typically, in a processor design, you're going to want to make your next level of cache about eight times bigger than the previous level of cache for it to have a useful effect. This is just an empirical rule of thumb that people typically apply, and if you go look at almost all the processors in the world, it almost always holds true.

The reasoning behind this is that it takes extra time to go access the next level of cache, and if you're only going to go up by a factor of two, it's not really worth it. If you run the numbers from the beginning, the miss time multiplied by the miss rate: if the miss time goes up a bunch, the miss rate needs to go down a lot for the extra level to actually make any sense. And if you only go up by a factor of two in cache size, let's say an eight-kilobyte cache backed by a sixteen-kilobyte cache, it just doesn't actually help much; the sketch below works through this arithmetic.

Somebody pointed out that rolled into that rule of thumb is how much time it takes to go to the next level of cache. If it only takes, let's say, one more cycle to go to the next level, the rule of thumb might break down; it might actually make sense to have a cache that's only two times larger than the previous level. But in reality, people typically don't build a next level unless it's around eight times bigger. And you can sit down and plot the miss rates in the Hennessy and Patterson book, and you'll see that this rule of thumb works out pretty well.
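Here is a back-of-the-envelope version of that argument as a sketch. Every number below is an illustrative assumption of mine, including the classic "miss rate scales roughly as one over the square root of capacity" rule of thumb; it is meant to show the shape of the tradeoff, not to reproduce data from the lecture.

```python
import math

def amat(t_l1, m_l1, t_l2, m_l2, t_mem):
    """Average memory access time for a two-level hierarchy:
    an L2 miss goes all the way out to main memory."""
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_mem)

def miss_rate(size_kb, base_rate=0.10, base_kb=8):
    # Assumed model: miss rate falls off roughly as 1/sqrt(capacity).
    return base_rate * math.sqrt(base_kb / size_kb)

T_L1, T_L2, T_MEM = 1, 5, 100        # access times in cycles (assumed)
M_L1 = miss_rate(8)                  # 8 KB L1

for l2_kb in (16, 64):               # 2x vs 8x the L1 size
    # Local L2 miss rate: of the accesses that reach the L2, how many
    # miss? Approximated here as global(L2 size) / global(L1 size).
    m_l2_local = miss_rate(l2_kb) / M_L1
    print(f"{l2_kb:>3} KB L2: AMAT = "
          f"{amat(T_L1, M_L1, T_L2, m_l2_local, T_MEM):.2f} cycles")

print(f"no L2 at all: AMAT = {T_L1 + M_L1 * T_MEM:.2f} cycles")
```

With these made-up numbers, the 2x L2 filters out so few additional misses that most L1 misses still end up paying the memory latency, while the 8x L2 cuts the AMAT much more sharply. That asymmetry is exactly the shape of the rule of thumb.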
Okay. So, now that we think we want to build multilevel caches, how does this affect the actual cache design? Believe it or not, having an L2 cache affects the L1. Hm, that's an interesting insight. So, a couple of things can happen.

First of all, it might actually influence your design to have a smaller level-one cache. Since the L2 provides aggregate capacity, shrinking the L1 and taking the performance of the L1 and the L2 together can be a good trade, because a smaller L1 can reduce the cycle time of your processor. This is something like what they did in the Pentium 4, where the cycle time was very important to them: they put in a much smaller L1 cache and had a backing L2 behind it.

Another advantage of this is that you can actually reduce the energy used to access the cache. If you have an L1 and an L2 and you make your L1 smaller, then, because most of the work you're doing is in the L1, you're effectively going to be firing up fewer transistors on each access.

Another way that having an L2 can influence the L1 is that you can potentially have the L1 not be write-back; you can have it be write-through instead. And this is a common thing. For instance, on the Tilera processor, which I worked on, we do this: we have a write-through from the L1 to the L2. It makes the L1 a lot simpler, because you never have to have dirty data in your L1, and this is a big benefit. Not having dirty data in the L1 means you could potentially not need an error-correcting code on your L1 data, because none of the data in there is unique; it has all gone out to the L2 anyway. And if it's all in the L2 anyway, you could just invalidate your entire L1 and the program is still correct. So, to really simplify error recovery, you can replace ECC with just parity. That's a big win of having write-through from the L1 to the L2, and it's actually the reason people typically do this.

There is also something we haven't talked about yet, and will actually talk about at the end of the term: write-through makes multiprocessors easier to build. Now, what do I mean by that? Well, suppose you have a cache-coherent system: multiple processors, each with its own caches, all trying to use one big shared memory. You have to go check everybody else's caches when you're looking for data. But if you have a write-through from the L1 to the L2, that's a benefit, because it means there is no dirty data in your L1, so you never have to check the L1; you only ever have to check the L2. And this is a big win from a coherence perspective, because it makes things a lot simpler.
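Here is a minimal sketch of that write-through idea. It is purely illustrative: the class and field names are mine, a dict stands in for the L2, and a real L1 would sit behind a write buffer rather than forwarding each store instantly. The point is that no L1 line is ever dirty, so the whole L1 can be thrown away, say on a parity error, without losing architectural state.

```python
class WriteThroughL1:
    """Toy write-through L1 backed by a dict standing in for the L2."""

    def __init__(self, l2):
        self.l2 = l2          # backing store: address -> value
        self.lines = {}       # L1 contents; never holds dirty data

    def store(self, addr, value):
        self.lines[addr] = value   # update the L1 (write-allocate, for simplicity)
        self.l2[addr] = value      # ...and write through to the L2 immediately

    def load(self, addr):
        if addr not in self.lines:        # L1 miss: refill from the L2
            self.lines[addr] = self.l2[addr]
        return self.lines[addr]

    def invalidate_all(self):
        # Safe at any time: the L2 already holds every value, so a parity
        # error can be "repaired" by just flushing the L1. No ECC needed.
        self.lines.clear()

l2 = {}
l1 = WriteThroughL1(l2)
l1.store(0x40, 7)
l1.invalidate_all()          # e.g. a parity error was detected
assert l1.load(0x40) == 7    # the value survives in the L2
```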
This also goes back to the write buffer we were talking about. If you have the L1 write through to the L2, and you go do something that invalidates or creates a victim in your L1, you never have to take that victim and write it out to the L2, because all the data in your L1 is clean at all times. Anyway, what I'm trying to get at here is that you would think that just adding a multilevel cache doesn't affect the caches closer to the core, but it does, because you typically design your whole cache system together.

Another thing this affects is the inclusion policy. If we look at a Venn diagram of memory, with the data stored in our L2 cache and a smaller L1, then in an inclusive cache the L1's data is stored in both levels. In an exclusive cache, we have data in our L2 and data in our L1, and they are not the same lines: each line lives in only one place.

If you look at the inclusive cache, you're going to say, I'm wasting space, because any piece of data that's in the L1 has to be stored in both the L1 and the L2, whereas in the exclusive cache we only need to store the data in one location. So you would think everyone would have wanted to build exclusive caches. In reality, most people build inclusive caches and don't build exclusive caches. And the reasoning behind this is that it gets hard to build a multilevel exclusive cache. So let's look at this in a little more detail.

The exclusive cache stores more data, so that's a positive. The negative is that you need to check both levels when you have to do, let's say, a remote invalidate. In our cache coherence protocol, our multiprocessor cache coherence protocol, we need to go check where the data is. With inclusion, we can check just the L2 tags and know that if the line isn't in the L2, it isn't in the L1 either. With exclusion, we now have to check the tags of both caches.
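The tradeoff boils down to two lines of arithmetic. Here is a sketch with assumed sizes; the point is that exclusion buys capacity, while inclusion lets the L2 tags act as a filter for coherence traffic.

```python
L1_KB, L2_KB = 32, 256   # assumed sizes, for illustration only

# Unique data the hierarchy can hold:
inclusive_kb = L2_KB            # every L1 line is duplicated in the L2
exclusive_kb = L1_KB + L2_KB    # the two levels hold disjoint lines

# Tag arrays an external coherence probe (e.g. a remote invalidate)
# must consult:
inclusive_probe = ["L2"]        # an L2 tag miss proves the L1 misses too
exclusive_probe = ["L1", "L2"]  # the line could live in either level

print(f"capacity: {inclusive_kb} KB inclusive vs {exclusive_kb} KB exclusive")
print(f"probe checks: {inclusive_probe} vs {exclusive_probe}")
```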
Exclusive caches have been built; as far as I know, AMD actually implements them. And some of the complexity involved is that you really have to swap lines between the two caches, because a victim you take from one cache can't be duplicated in the other. So when you take a victim from the L1, it gets pushed down to the L2, and when the requested data is coming in, that's a transfer from the L2 up to the L1. This swap gets hard to make correct under all circumstances; you'll see that once you sit down and work through it.

And one last reason why exclusive caches are hard to do: if the cache line sizes of the two levels are different, you basically can't build an exclusive cache, because you're swapping lines between these two structures, and mismatched line sizes just don't work (see the sketch below). Lots of times people like to have a smaller block size in the L1, because you get more blocks to work with than in the L2. But if the block sizes are different, say a 64-byte block in the L2 and a 32-byte block in the L1, you can't represent an evicted L1 line as an L2 line: it takes two of the smaller cache lines to put together one of the bigger cache lines, so you can't really swap them; they're not compatible. So an exclusive cache forces you to have the same line size between the two levels. The benefit, of course, is that you get more effective capacity.

This is also another argument for inclusive caches: if you follow the rule-of-eight multiplication from a multilevel cache perspective, that step up in size means the duplication kind of turns into noise, because yes, you are duplicating something, but the L1's contents are only a small fraction of the L2's capacity.
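The line-size constraint mentioned a moment ago can be stated in one function. A sketch, with sizes assumed purely for illustration: the swap only works when a victim from one level exactly fills a frame in the other.

```python
def can_swap(l1_line_bytes, l2_line_bytes):
    """An exclusive hierarchy swaps victims between levels, so a line
    evicted from one level must exactly fill a frame of the other."""
    return l1_line_bytes == l2_line_bytes

# Two 32-byte L1 lines make up one 64-byte L2 line, so a single L1
# victim only fills half an L2 frame; the other half would have to be
# fetched or tracked separately, which defeats the simple swap.
print(can_swap(32, 64))   # False
print(can_swap(64, 64))   # True
```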
Okay. So, multilevel caches are ubiquitous; let's look at a few real examples. Here we have the Itanium 2, the VLIW (EPIC) architecture from Intel and HP that we talked about. If you look, they have a sixteen-kilobyte L1, and then an L2 that changes both parameters: it changes the size, stepping up by more than an 8x multiplier, and they also changed the line size. And then they even have a third level; once you have two, you might as well go to three. They have a third-level cache because they had the space: this is an expensive processor, they had a lot of die area, so they put a level-three cache on it. In the third-level cache, they actually increased the associativity quite a bit, to 12-way set associative, and they made the cache a lot bigger, three megabytes; the line size is the same as the L2's. Now look at the latency: they have single-cycle latency in the L1, which is usually what you want, very fast latency in the L1; five cycles in the L2; twelve cycles in the L3. The latency steps up as you get farther away.

If we look at the Power7, this was the penultimate Power architecture from IBM; the Power8 just came out. Here you can see they have lots of cache in the middle, and this is a multicore: eight cores. Each of those cores has a 256-kilobyte L2, and slightly bigger L1s: a 32-kilobyte instruction L1 and a 32-kilobyte data L1. The latencies step up a bit as well: the L1 is three cycles, [INAUDIBLE] for the L2, and [INAUDIBLE] for the L3. But the L3 is huge, 32 megabytes, bigger than on the Intel processor we've seen before; that's a lot of data out there, but it's also being shared between the processors. So this is private L1s, private L2s, and a shared L3. An interesting design. Multilevel caches show up in basically all modern processors.

Okay. So, let's take a look at the efficacy of the multilevel cache, first as if it were just the L1 by itself. Looking only at the L1, its miss penalty goes down: before, if you missed in your L1, you had to go out to main memory, and that was expensive, but now you can go to your L2 or L3 instead. And if you draw a box around the L1, L2, and L3, everything together, the hierarchy is both going to reduce your miss penalty and, because you have more cache space, reduce your miss rate. So, drawing the box around all the caches in the structure, a multilevel cache system brings the miss penalty and the miss rate down together.
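To put that box-drawing argument in numbers, here is a sketch using the hit latencies quoted for the Itanium 2 above (1, 5, and 12 cycles); the miss rates and the memory latency are my own illustrative assumptions, just to show the shape.

```python
def amat3(t1, m1, t2, m2, t3, m3, t_mem):
    """Three-level AMAT: each level's miss penalty is the AMAT of the
    level below it."""
    return t1 + m1 * (t2 + m2 * (t3 + m3 * t_mem))

# Hit latencies from the Itanium 2 numbers above; everything else assumed.
T1, T2, T3, T_MEM = 1, 5, 12, 200
M1, M2, M3 = 0.10, 0.30, 0.20       # local miss rates (illustrative)

with_hierarchy = amat3(T1, M1, T2, M2, T3, M3, T_MEM)
l1_only = T1 + M1 * T_MEM           # same L1, missing straight to memory

print(f"L1 -> L2 -> L3 -> memory: {with_hierarchy:.2f} cycles")   # 3.06
print(f"L1 -> memory:             {l1_only:.2f} cycles")          # 21.00
```

With these assumed numbers, the miss penalty the L1 sees drops from 200 cycles to about 20.6, which is the "reduced miss penalty" effect; the extra capacity lowering the miss rates is the second effect of drawing the box around the whole hierarchy.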