So, today we're going to continue our adventure in computer architecture and talk more about parallel computer architecture. Last time we talked about coherence, memory coherence and cache coherence systems, and how to differentiate that from memory consistency models: a consistency model is a model of how memory is supposed to work, versus the underlying algorithms that try to keep memory consistent and implement those consistency models.

We left off last time talking about MESI, also known as the Illinois protocol, and we walked through all of the different arcs through here. If you recall, what we did was split the shared state from the MSI protocol into two states, shared and exclusive. The insight here is that it's very common for programs to read a memory address, which will pull it into your cache, and then go modify that same memory address. For instance, if you want to increment a number, you're going to do a load, which brings it into your register set but also into your cache; you're going to increment the number, and then you do a write back to the exact same location. That's pretty common in imperative programming languages. Declarative programming languages like Scheme and such may at times copy everything, but in imperative programming languages it's pretty common to actually change state in place. Because of that, you can bring the line right into this exclusive state, and then when you go to modify it, you don't have to broadcast on the bus. You don't have to talk to anybody, and you lose, effectively, this intent-to-write message that you would otherwise have to send across the bus, waiting for that address to be snooped on the bus, or be seen by all the other entities on the bus.
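To make the benefit of the exclusive state concrete, here is a minimal sketch of the store path in a MESI-style controller. This is not the protocol table from the slides; the state names are the standard MESI ones, and the function and return convention are invented for illustration. The point is simply that a store to a line held exclusive upgrades silently, with no bus traffic.

    /* Minimal sketch of MESI line states and the store path.
     * Hypothetical names; real controllers are table-driven hardware. */
    #include <stdio.h>

    enum mesi_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

    /* Returns 1 if the store had to broadcast an intent-to-write on the bus. */
    int handle_store(enum mesi_state *line)
    {
        switch (*line) {
        case MODIFIED:
            return 0;                 /* already writable, no bus traffic   */
        case EXCLUSIVE:
            *line = MODIFIED;         /* silent upgrade: no one else has it */
            return 0;
        case SHARED:
        case INVALID:
        default:
            /* must broadcast so other caches invalidate their copies */
            *line = MODIFIED;
            return 1;
        }
    }

    int main(void)
    {
        enum mesi_state line = EXCLUSIVE;   /* a load brought it in, no sharers */
        printf("bus broadcast needed? %d\n", handle_store(&line));  /* prints 0 */
        return 0;
    }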
Note that I say entities on the bus. We've been talking primarily about processors over the last lecture, but there can be other entities on the bus that want to snoop it. Examples sometimes include coherent I/O devices. This isn't very popular right now, but I think it will become much more popular as we start to have GPUs, graphics processing units, or general-purpose GPUs, which will be sitting effectively very close to our processor on the same bus and will want to take part in the coherence traffic of the processor. A GPU is going to want to read and write the same memory addresses that the processor is reading and writing, and take part in the cache coherence protocol. At a minimum, your I/O devices usually need to effectively tell the processor when they are doing a memory transaction that the processor should know about. So typically, when you are moving data from an I/O device to main memory, that transaction is going to have to go across the bus, everyone is going to have to invalidate their caches, and they will all have to snoop that memory traffic from the I/O device.

So we had talked about MESI as an enhancement to MSI. Where we left off last time, we were going to talk about two more enhancements that are pretty common. One has been used widely in AMD Opterons; my understanding is that they still use something similar to this in AMD processors. The idea is that you add an extra state here, which is called ownership, or the owned state. Effectively, it looks just like our MESI protocol from before, but now, when you have data in the modified state and, let's say, another processor needs to go access that data, instead of having to send all that data back to main memory, invalidate that line out to main memory, and go fetch it back from main memory, you can do a direct cache-to-cache transfer. This is basically an optimization: you don't have to write the data back to main memory, and in fact you can allow main memory to be stale. You can just transfer the data across the bus from the one cache to the cache which needs it.

So in this example here, we're going to look at this edge here. Another processor wants to read the data. We see that other processor's intent to read for a particular cache line, and our processor currently has it in the modified state. We're actually going to provide the data out of our cache, not write it back to main memory, and transition the line in our cache to this owned state. The other processors can now take it in the shared state, so they will have a read-only copy. Note this is only for reads; we'll talk about what happens if another processor wants to write to that line in a second.
So we have it in the owned state, and what we're trying to do here is that this processor is tracking that the data needs to be written back to main memory at some point. That's the whole purpose of this state: we've basically designated one processor which owns the data and owns the modified copy. The processors which take it read-only get it in the shared state, and if they need to invalidate the line, they don't need to contact anybody. Because they have it in the shared state, they only have a read-only copy, so they don't need to make any bus transactions. If you think about it, if you were to have one core read this dirty state from the other core, and then at some point the line just gets invalidated in that second core, and the data is not up to date in main memory, you would lose the changes. So by keeping it in the owned state, one processor keeps track that at some point, if that line gets evicted out of its cache, it needs to write the data out to main memory to keep memory up to date.

Now, there are a couple of other arcs here. You can transition from the owned state back to the modified state if the processor which has it in the owned state wants to do a write. It can't just do that while it's in the owned state, because while it's in the owned state other processors may have shared copies of the line. So when P1 wants to do a write here, it needs to invalidate everyone else's copies across the bus. It's going to have to send an intent to write for that line, everyone else will snoop that traffic and transition to the invalid state, and then this processor will be able to transition to the modified state, and now it's able to actually modify the data.

Okay. Then we've got this arc here, which we sort of already talked about: if you're in the owned state, anyone else can get read-only, shared copies of the line. They can't go get an exclusive copy, because that would basically violate this notion; they would then be able to upgrade to modified without telling anybody, and we don't want that. But they can get shared, read-only copies of the data. And then there's this arc here from owned to invalid, which is what happens if some other processor wants to write the data. Processor one, P1 here, will see the intent to write from that other processor, snoop that traffic, and at that point transition to the invalid state.
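As a rough illustration of the owned state, here is a sketch of what a snoop handler might do when it sees another processor's bus request while holding a line. The function names, the result structure, and the message encoding are all invented for this example; the point is just who supplies the data and who remains responsible for writing it back.

    /* Sketch of MOESI snoop handling for one cache line; names are illustrative. */
    #include <stdbool.h>
    #include <stdio.h>

    enum moesi_state { I, S, E, O, M };

    struct snoop_result {
        enum moesi_state next;   /* our new state for the line     */
        bool supply_data;        /* do we put the data on the bus? */
    };

    /* Another cache issued a read (is_write == false) or an
     * intent-to-write (is_write == true) for a line we hold.
     * Note: if a line in M or O is evicted locally (not shown here),
     * the dirty data must be written back to main memory. */
    struct snoop_result snoop(enum moesi_state cur, bool is_write)
    {
        struct snoop_result r = { cur, false };
        switch (cur) {
        case M:
        case O:
            r.supply_data = true;           /* memory may be stale; we provide it */
            r.next = is_write ? I : O;      /* stay the owner on a remote read    */
            break;
        case E:
        case S:
            r.next = is_write ? I : S;
            break;
        case I:
            break;
        }
        return r;
    }

    int main(void)
    {
        struct snoop_result r = snoop(M, false);
        printf("next=%d supply=%d\n", r.next, r.supply_data);  /* next=3 (O), supply=1 */
        return 0;
    }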
Note here that on this intent to write, we may need to provide data across the bus while we're in the owned state, because if we're the only cache that has the data, and the other processor is going straight into the modified state via a write miss, we're going to need to provide the data.

Okay, so, questions about MOESI? So far it's a basic extra optimization, because we don't have to go out to main memory: we can basically transfer data around, and one cache can have a cache line in the owned state, and later some other cache can have the exact same cache line in the owned state, and it can bounce around without ever having to go out to main memory. This decreases our bandwidth out to the main memory system.

Okay. Then we're going to talk about MESIF, which is actually used in the Core i7, the most up-to-date Intel processors. It looks very similar to MESI, except we're going to see an extra little letter in this one bubble here. Effectively, what's going on is we add an extra state called the forward state. This is similar to the sort of optimization we saw in MOESI, except it can't keep the data writable. What happens in this protocol is that the first cache which does a read miss on a line of widely shared data is going to be elected and is going to get the data in this forward state. Then, if other caches want to get read-only copies and bring the line in shared, instead of having to go out to main memory, the cache that has it in the forward state is going to provide that data across the bus. This is going to effectively decrease our bandwidth to main memory, by providing the data out of another processor's cache.

Now, this is a little bit of a simplification. There is a question here of, if you're in this forward state and you invalidate the data, who has it? Does anyone provide the data? There are sort of two choices. One choice is that no one has it in the forward state, so when there's a snoop request for the line, it just has to go out to main memory. That's kind of the easy case. The other case is that you could try to build a protocol where, when the one cache invalidates its forward copy, it just chooses another cache to be the forwarder. But probably the simplest thing to do is, when the forwarding core invalidates the data for whatever reason, you just go back out to main memory, because there's always a copy in main memory; effectively you're just keeping read-only copies.

[In response to a student question:] Yeah, you're right, you're probably going to enter into the exclusive state. That's a good question. I've read two different versions of this in different books, so I'm not quite sure Intel actually documents what they do here, but you're probably right: if you're the only one with a copy, you probably want to enter straight into the exclusive state. Then what's going to happen is that when you would transition from E to S here, you instead transition from E to F, so you end up in the F state; the first one that downgrades is always going to end up in the F state. But like I said, I saw other references where people implementing something similar to this have some election where they figure out who the forwarding node is. Probably the easiest thing to do, though, is to downgrade from E to F.
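Here is a similarly rough sketch of the forward-state idea, using the simple "fall back to memory" policy just described. The cache count, function name, and election rule (the cache that fetches from memory becomes the forwarder) are assumptions made for illustration, not Intel's documented behavior.

    /* Sketch of the MESIF forward state: among the read-only sharers of a
     * line, at most one is in F and supplies data to new readers, so the
     * others can sit quietly in S. Names and policy are illustrative. */
    #include <stdio.h>

    enum mesif_state { INV, SHR, EXC, MOD, FWD };

    #define NCACHES 4

    /* One line, tracked in every cache. Returns the index of whoever
     * supplied the data, or -1 if the request had to go to main memory. */
    int read_miss(enum mesif_state line[NCACHES], int requester)
    {
        for (int i = 0; i < NCACHES; i++) {
            if (i != requester && line[i] == FWD) {
                line[requester] = SHR;  /* new reader just takes a shared copy */
                return i;               /* forwarder supplies: cache-to-cache  */
            }
        }
        /* No forwarder (e.g. it was invalidated): go to main memory, and the
         * requester becomes the forwarder for later readers. */
        line[requester] = FWD;
        return -1;
    }

    int main(void)
    {
        enum mesif_state line[NCACHES] = { FWD, SHR, INV, INV };
        printf("supplied by cache %d\n", read_miss(line, 2));  /* 0: the forwarder */
        line[0] = INV;                   /* the forwarder invalidates its copy     */
        printf("supplied by cache %d\n", read_miss(line, 3));  /* -1: main memory  */
        return 0;
    }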
So for the rest of the course, we're going to look at how to scale beyond these broadcast and invalidate protocols that have to snoop on a bus. One of the problems of building these snooping systems is that they really affect how you design your processor. First of all, you're going to have to add more bandwidth into your cache, or at least more bandwidth into your tag array. One choice is to dual-port your tags. Another choice is that you can steal cycles for snoops. What I mean by stealing cycles is that if there is a bus transaction happening and you need to check it against your tags, you actually block the main processor associated with that cache from accessing the cache that cycle; you generate a stall signal to the cache, or to the main pipe.
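In pseudo-hardware terms, stealing cycles is just a priority decision at the tag array each cycle. Here is a toy sketch of that arbitration; the signal and structure names are made up for illustration, and a real design would be a pipelined hardware arbiter rather than a function.

    /* Toy model of "stealing" a tag-array cycle for a snoop: if a bus snoop
     * needs the tags this cycle, the processor side is stalled. */
    #include <stdbool.h>
    #include <stdio.h>

    struct tag_port {
        bool snoop_request;   /* a bus transaction needs a tag lookup */
        bool cpu_request;     /* the pipeline wants a load/store      */
    };

    /* Returns true if the CPU must stall this cycle. */
    bool arbitrate(struct tag_port p, bool *grant_snoop, bool *grant_cpu)
    {
        *grant_snoop = p.snoop_request;               /* snoop has priority */
        *grant_cpu   = p.cpu_request && !p.snoop_request;
        return p.cpu_request && p.snoop_request;      /* stall signal       */
    }

    int main(void)
    {
        bool gs, gc;
        struct tag_port p = { .snoop_request = true, .cpu_request = true };
        bool stall = arbitrate(p, &gs, &gc);
        printf("stall=%d snoop=%d cpu=%d\n", stall, gs, gc);   /* 1 1 0 */
        return 0;
    }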
One of the things that gets a little tricky, and this will affect your design, is that if you have a multilevel cache, usually you want to put your L2 tag array on the bus and snoop against your L2 tags. But if a snoop hits there and you figure out that you have to invalidate something, you're going to have to invalidate down the entire cache hierarchy, all the way down to the level-one cache. So this can actually affect the throughput of your level-one cache, and it's also annoying to do, because it effectively has to reach down and touch the tag array of your L1 cache. And as I mentioned briefly last time, if you're thinking about something like an exclusive cache, a cache where the L2 tags don't include the tags in the L1, you're going to have to check both tag arrays for every snoop transaction, and that can be pretty painful to do. Or you have to keep a copy of the L1 tags, but that's effectively the same thing as just having an inclusive cache, maybe with a little less data storage.

Okay, so what limits our performance? Why can't we just build 1,000 processors on a big bus? Well, it's the same idea as having 1,000 people in this room all trying to shout to each other at the same time. At some point you run out of bandwidth, and more importantly you need some way to coordinate them. And because these bus transactions are required to serialize, the occupancy on the bus goes up. If you have one bus with two people talking on it, and they each talk, let's say, ten percent of the time, then you have a twenty percent utilized bus. All of a sudden, if you have ten people on this bus, you have a 100 percent utilized bus, and if you have 1,000 people, you have an oversubscribed bus. So you have to worry about the bandwidth and the occupancy, because we do need to make these different bus transactions atomic; it's not quite just a bandwidth problem. What I mean by that is you could make the bus wider to increase the bandwidth, but it's not going to solve our problems, because there's an occupancy challenge here also: you need effectively atomic transactions to happen across the bus in order to keep the cache coherence protocol correct.
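The back-of-the-envelope math here is just utilization equals the number of agents times the fraction of time each one needs the bus, since the transactions serialize. A tiny sketch of that arithmetic, with the ten-percent figure from the example above:

    /* Back-of-the-envelope bus occupancy: each agent wants the bus 10% of
     * the time, and bus transactions serialize, so the demands add up. */
    #include <stdio.h>

    int main(void)
    {
        double per_agent = 0.10;                     /* 10% of cycles each     */
        int counts[] = { 2, 10, 1000 };
        for (int i = 0; i < 3; i++) {
            double util = counts[i] * per_agent;     /* > 1.0 = oversubscribed */
            printf("%4d agents -> %.0f%% of bus cycles needed\n",
                   counts[i], util * 100.0);
        }
        return 0;
    }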
Okay, so before we move off this topic and into the interconnection networks we'll be talking about, I want to talk about one of the challenges that comes up in simple cache coherence systems, and that's false sharing. Caches like to track information at a particular block size. We've talked about caches which have 64-byte lines, or 64-byte block sizes, and they can be bigger or smaller than that.

Now, one of the things that happens that is pretty unpleasant in these coherence protocols is, let's say you take a piece of data which is shared, needs to be kept coherent between two different processors, and gets communicated relatively often, and you put some other piece of critical data right next to it, on the same cache line. All of a sudden, because they're packed into one cache line, and we only track coherence information on a per-cache-line basis, whenever that first piece of data, let's say a four-byte integer holding a lock or something like that, gets bounced around between caches, you're going to bounce around the other four-byte integer with it, even though it is not shared at all. So this can hurt your common-case performance for non-shared data, just because it shares a line with truly shared data. This is not something that typically happens in a uniprocessor cache system, because there you bring the line in, you get everything on it, and you get spatial locality. If something gets bumped out you can get conflicts, which are sort of equivalent to this, but that's a little bit of a different idea; the two pieces of data are never in the same line. With false sharing, we do see this.

Now, false sharing is interesting because people have come up with a whole set of techniques to avoid it. Anyone have an idea of one really good technique to avoid false sharing? What we can do, and this is pretty common, is that either the programmer or the compiler detects that this is happening and pads the data out: waste memory for highly contended pieces of data, and co-locate them with nothing that is shared. One of the better examples of why you really have to care about this is something like your stack. If you were to have, let's say, a lock on your stack, there's a lot of data there which you need to use often, and it's all local; stacks between threads are all local. But if you have some sort of variable that you pass to someone else which is a struct, and inside that struct is a lock or something like that, all of a sudden you're basically going to be bouncing around a cache line which is part of your stack, and other people are going to be invalidating your stack. So one way to solve this is, when you declare a lock, the compiler can sometimes recognize it, because you can actually designate memory addresses as locks with special keywords, sometimes, depending on the language. When you do that, it will say, oh, don't put this with anything else, or maybe only co-locate this data with things that are other locks, because those may have bad sharing performance anyway. What you really want to do here is not have the false sharing case.

Now, the analog to false sharing is actually true sharing. There are cases where you'll have multiple pieces of data that are shared differently between different processors, but are all widely shared. An example of this is an array of locks, where different processors will be grabbing different locks out of the array, and you can use techniques similar to the false sharing ones. You probably don't want all of those locks to be on the same cache line, because then that line is basically going to be bouncing around, and everyone is going to be contending for that one cache line to get it modified, in the M state, in their own cache. So what you can think about doing is applying the same technique and putting each of those locks on a separate cache line.
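A common way to express the padding fix in C is to force each hot, independently accessed object onto its own cache line. Here is a sketch assuming 64-byte lines; the alignment specifier is standard C11, but the struct layouts and the 64-byte figure are assumptions you would tune for the machine at hand.

    /* Padding out contended data so unrelated fields (or neighboring locks)
     * don't share a 64-byte cache line. Assumes C11 and a 64-byte line. */
    #include <stdalign.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define LINE 64

    /* Bad: the lock and the thread-private counter share one line, so every
     * lock handoff also bounces the counter's line between caches. */
    struct bad {
        atomic_int lock;
        int        private_counter;
    };

    /* Better: each field sits alone on its own cache line. */
    struct padded {
        alignas(LINE) atomic_int lock;
        alignas(LINE) int        private_counter;
    };

    /* Same idea for an array of locks: one lock per line, not sixteen. */
    struct padded_lock {
        alignas(LINE) atomic_int lock;
    };
    struct padded_lock lock_array[32];

    int main(void)
    {
        printf("bad: %zu bytes, padded: %zu bytes, one padded lock: %zu bytes\n",
               sizeof(struct bad), sizeof(struct padded), sizeof(struct padded_lock));
        return 0;
    }

The trade-off is exactly the one mentioned above: you waste most of each 64-byte line to buy back the coherence traffic.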
Okay, so let's switch gears here.