And Dijkstra came up with this naming nomenclature. So, the implementation of semaphores: this idea is all well and good, but we still need some way to come up with mutual exclusion. Believe it or not, you can actually do this with just loads and stores, and we're going to talk about that a little later in class. But it's really pretty slow. There are algorithms that can do this with just loads and stores, but from an implementation perspective, people typically try to go faster. So if you want to go faster, it might be helpful for your computer architecture, your instruction set architecture, to come up with a special instruction which allows you to do something atomically. When we say do something atomically, we mean that nothing else can be interleaved in the meantime. The simple solution here is to add an atomic operation which can do a read, a modify, and a write, all atomically, with nothing else interleaved. So, going back to our locks and semaphores here: inside this P and inside this V, you need to somehow atomically modify S. You need to be able to read S and to write S, if you were going to implement this with code inside the implementation of P and inside the implementation of V.
So the primitive here that makes this faster is having some read-modify-write operation, and we're going to look at a couple of different choices for it. One of the most basic ones is test-and-set. x86 has this instruction; it's one of the most basic atomic operations on x86, and lots of other architectures have an instruction which does this. We write it here as a piece of code, but in reality this is pseudocode: in reality this is an instruction which will actually do this atomically relative to the whole rest of the system. So let's look at what test-and-set does. It's probably one of the most primitive atomic operations you can do. I've seen something slightly simpler, a test-and-clear, which is a little bit easier to implement. But largely the idea is that you're going to look at a memory address, and if that memory address holds what you expect it to, then write some value. Implicit in this piece of code is that the code which executed the test-and-set needs to know whether it succeeded and was able to write the value, or not. So there's a status code, and that's returned in this register R here.
So if we look at this piece of code, you pass into it a memory reference, a memory address, which is the address of the semaphore or the address of the lock, and you pass into it a register name. We do a load into that register, and that happens unconditionally. This part here, though, is the test part. The test is: we check whether the value we just read from the memory address is zero, and if so, we write a one to that memory address. Sometimes these test-and-set operations will actually allow you to write something other than one, but for right now this is the more basic version: it writes a one to that memory address. What's important here is that the architecture guarantees that all of this happens atomically. But what's interesting is, if you think about the processors that we built, this requires a load and a store. We'll talk about how to implement this in a few slides, but you basically have to stop everything else, stop all other loads and stores happening in the system, do this load, do the test, and possibly do the store, in hardware. One of the more basic ways this was done, on some of the original x86-based architectures, is something called the lock bit on the bus.
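The semantics just described can be sketched in C. The function names are mine, and the first version is deliberately NOT atomic; it only illustrates the data flow that real hardware executes as one indivisible instruction:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the test-and-set pseudocode: an unconditional load,
 * a test against zero, and a conditional store of one. On real
 * hardware this whole body is one indivisible instruction; this
 * plain-C version only shows the data flow. */
int test_and_set_model(int *addr) {
    int r = *addr;   /* unconditional load into "register R" */
    if (r == 0)      /* the test: was the location zero?     */
        *addr = 1;   /* the set: write a one                 */
    return r;        /* status: 0 means the set happened     */
}

/* C11 exposes a genuinely atomic version via atomic_flag;
 * it returns the previous value of the flag. */
bool test_and_set_atomic(atomic_flag *f) {
    return atomic_flag_test_and_set(f);
}
```

Note that the returned status is just the old value: reading back 0 tells the caller its write went through.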
So what they did is, when one processor was going to execute an atomic operation like this, a test-and-set for instance, it would actually broadcast to all the other processors in the system saying: I'm about to do this, do not touch anything, do not issue any memory instructions, or at least any bus-based memory instructions. It would do the load, do the test, do the set, and then release the broadcast bit. This was a wire on a multi-drop bus. That's a pretty big hammer, and things have gotten a little better since then, but it gives you an idea of a basic hardware implementation. Things got a little more sophisticated; these fancier atomic operations sort of showed up in the '70s. So, this whole box here is still an atomic operation: all the code in here happens atomically. What it basically does is take a memory address and add something to it. We're going to add Rv to R and put that back into memory at the same memory address, so we do the load and the store here atomically. It's a read-modify-write operation as well, with all of it atomic. Note that this still returns the original value into register R here.
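That operation is fetch-and-add, and C11's `atomic_fetch_add` has exactly this shape (the wrapper name here is mine):

```c
#include <stdatomic.h>

/* Fetch-and-add: atomically add `inc` to *addr and return the
 * ORIGINAL value, the way the lecture's register R receives the
 * old contents. The read, add, and write are one atomic step. */
int fetch_and_add(atomic_int *addr, int inc) {
    return atomic_fetch_add(addr, inc);
}
```

Getting the old value back is what makes this usable for semaphore counts or ticket locks: each caller learns what the counter was just before its own increment.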
And sometimes that's actually very useful: if you're trying to implement a semaphore based on this, you kind of want to see what was there before. There are some jokes around fetch-and-add, though. The atomic-operation community pushed this to the extreme at some point, and people started having fetch-and-insert-something-here: fetch-and-multiply, fetch-and-divide, things like that, and people started building some stranger machines that had fetch-and-something-else. There's an old joke that someone was going to implement an instruction called fetch-and-FFT, an atomic Fourier transform. I don't think it ever got to that point, but in the early microcoded microprocessor days, there were some pretty sophisticated atomic operations. That's largely been decided not to be a good direction to go. You want a little bit of complexity in your atomic operations: maybe test-and-set is a little too simple, but some of these fancier fetch-and-FFTs are probably a little too complex. Here's another interesting operation that is atomic, and out of this one you can actually build some pretty interesting semaphores. It's called swap. It takes a register value and a memory location, and it takes what's in the memory location and what's in the register and swaps the two.
And note, we use a temporary here, because you can't actually swap two things without a temporary. This operation is actually not very common, because it's kind of hard to use. What's a little more common is something called compare-and-swap, which sits somewhere between a test-and-set and a swap operation. That's what's commonly used, and we'll talk about it in a second. Okay, so now we go back to the multiple-consumer problem that we had, where multiple threads were trying to dequeue from a queue, and let's introduce test-and-set and look at our critical section here. We're going to use the term mutex to basically mean a semaphore that only one thread can enter at a time. What happens at the beginning is that there's a tight loop, and this tight loop does the test-and-set on the memory location of the mutex, reading into a temporary register and comparing against zero. Remember, we defined test-and-set as writing a one to memory if the location held a zero, and it returns the old value. So if the value we read back is a one, the test failed: the set did not happen, and we just sit and spin here. We call this busy waiting, or spinning, and the construct is called a spinlock.
If the memory location held a one coming into here, that means someone else already locked it. If it held a zero coming in, that means no one had locked it, but we have now locked it and set it to one. So if we acquired the lock, we atomically set the location to one and read back a zero, and we drop into our critical section and start executing it. Inside the critical section we can do stores, we can do loads, we can do whatever we want, because we're guaranteed that no other processor, no other thread, is doing anything to it. Then at some point we need to do the release, and for that we actually don't need a test-and-set; we just need a store. A store simply clears that memory location, and some other thread is then safe to go and fall into the critical section. Okay. This can be done with lots of different implementations. We showed it here with test-and-set, but it can be done with swap, and it can be done with fetch-and-add; the code just gets a little more complex. But one of the interesting questions that comes up is: what happens if some thread running along acquires the lock, falls into the critical section, and then, let's say, terminates or gets swapped out for a very long period of time? What's going to happen here? A system crash?
Well, we don't reference bad memory here; this is more of a hang. It's going to freeze. This was actually pretty common in older, non-preemptive operating systems. Back in the day of the original Mac OS, before Mac OS became Mac OS X (so versions nine and earlier), it was a cooperative operating system, and if someone grabbed a lock and then died or hit an error case, the whole machine would crash. If you're preemptive, meaning there's a timer going off, it's possible you still won't be able to make forward progress, but at least the OS can interrupt your process and try to do something about it; it can detect that one process is hung. But in a cooperative multi-threading environment, if one process hogs the processor and never gives it back, or just dies (say you grab the lock in one process and die, and then another process comes along and tries to acquire the lock), that other process just sits there spinning in this little spin section forever. You're basically just going to hang the processor; you're not going to see anything else happen. So you've got to be pretty careful, especially in cooperative multi-threading environments.
So one of the tricks here is that if you're in a cooperative multi-threaded environment, you'll typically put something inside the spin loop which will yield to a different thread, so someone else can try to get something done in that time. Because if you just sit there spinning, it's very possible that you won't be fair, or that no one else will ever get to run and unlock the lock. So you've got to be a little careful there. Okay, so we're going to move forward here and look at a slightly different version of the same piece of code, but now with compare-and-swap. Before we do that, let's talk about the semantics of compare-and-swap. Compare-and-swap takes a memory location and two registers. It compares memory with one of the registers; if they're equal, it takes the other register, puts it into memory, effectively swapping them, and returns a status. If the compare fails and we fall through to this else statement here, we don't do any swap. So this is effectively taking a register and a memory location and swapping them, dependent on another register. It's a little bit more powerful than plain swap, and a little harder to implement, as you might imagine, though probably not as hard to implement as something like fetch-and-add.
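These compare-and-swap semantics map directly onto C11's `atomic_compare_exchange_strong` (the thin wrapper and its name are mine; the C11 call additionally writes the observed value back into its `expected` argument on failure):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Compare-and-swap: if *addr holds `expected`, store `desired`
 * and report success; otherwise leave memory alone and report
 * failure. The whole check-then-store is one atomic operation. */
bool cas(atomic_int *addr, int expected, int desired) {
    return atomic_compare_exchange_strong(addr, &expected, desired);
}
```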
The reason is that if you want to do this operation out in the memory system, you can basically send it out as one atomic operation: go to main memory, put a comparator out there, compare, and swap the two things. If instead you have these increments, or adds, or arbitrary instructions, you need to put an adder out there for sure; this only needs a comparator. It's still painful to implement, but it might be a little better than putting a full adder in your memory controller, because that's almost like duplicating your ALU out in your memory subsystem, which doesn't make a whole lot of sense. That's why people used to joke about the fetch-and-FFT. Okay, so let's take a look at the same case here, the multi-reader piece of code. But this time we're not going to have a spin lock at the beginning. Instead, we're going to have a critical section, except that we can concurrently execute the critical sections from different threads, at which point we shouldn't really call it a critical section anymore. We'll say instead that we concurrently execute what was in the other thread's critical section, and then we atomically try to swap out the head pointer and replace it. So we're effectively doing speculative work here, and this is called non-blocking synchronization.
So, in this synchronization primitive, we do the loads; we talked about that before. We check to make sure there's at least something available, and we pull the head pointer off. This is all speculative work at this point, because we have not checked that we can validly pull off the head of the queue. We increment the head pointer into a temporary that we'll call new head, a register. And then we try to swap the new head with the old head atomically. So it takes the register that we think holds the new head and the memory location where the head pointer lives, and we try to swap the two. The thing we're dependent on here is that what we think is the old head still is the old head; if something bad happens, we can try again. So the idea is that we spin in this effectively bigger loop, speculatively trying to change the head. One question that comes up is: what if the head was dequeued from, then added back onto, and it just so happens that the head pointer was changed to B and later changed back to A, all in this short period of time? This is called an ABA problem. Yeah, that might be a problem in this piece of code.
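A sketch of that retry loop for a hypothetical ring buffer (the queue layout, the names, and the -1 empty sentinel are my assumptions, not the lecture's code): read the head slot speculatively, then publish the new head with a compare-and-swap, retrying on failure.

```c
#include <stdatomic.h>

#define QSIZE 64
int buffer[QSIZE];           /* hypothetical shared ring buffer */
atomic_int q_head, q_tail;   /* ever-increasing indices         */

/* Speculatively read the head slot, then try to publish the new
 * head atomically; if another consumer moved the head first, the
 * compare-and-swap fails and we loop back and try again. */
int dequeue(void) {
    for (;;) {
        int old_head = atomic_load(&q_head);
        if (old_head == atomic_load(&q_tail))
            return -1;                          /* empty (sentinel) */
        int value = buffer[old_head % QSIZE];   /* speculative work */
        int new_head = old_head + 1;
        /* succeeds only if q_head still equals old_head */
        if (atomic_compare_exchange_strong(&q_head, &old_head, new_head))
            return value;
        /* else: someone else dequeued first; retry */
    }
}
```

Using an index that only ever increases, and taking it modulo QSIZE just for the array access, sidesteps the ABA hazard until the integer itself wraps around.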
You might need to think about that and carefully reason about it. Usually you can protect against it in some other way, for instance where the pointer wraps around. If you think about a normal array used as a circular buffer, when it wraps around you put something which is not a compare-and-swap on it: more of a spin lock, or a spin-loop sort of lock. So you only have to worry about this in circular buffers when something wraps around, because otherwise there's no other way the head pointer can get back to where it was before. But you do need to worry about the case where you wrap all the way around the array in a very short period of time. So there are some race conditions that can happen with these non-blocking synchronization operations. Another non-blocking synchronization primitive, one that I think is actually really cool, goes by lots of different names: load-locked, load-linked, or load-reserved, paired with store-conditional. You take your ISA, your instruction set, and you add two new instructions: a special load, and a store-conditional which checks to make sure that no one else wrote to that same address in the intervening time. I can explain that one more time; it's a little confusing. But basically the idea is a flag register here.
And in hardware, you do a load-link. It just loads the value into your register, and on the side, somewhere next to the register (it may not be architecturally visible, and probably shouldn't be), it sets a bit. You execute your critical section, and when you come back to do the store for this sort of read-modify-write operation, the store checks that the flag is still set to one. If it's still one, that means, to say it completely, that no one else did a store-conditional to that address in the intervening time. So no one else tried to do a store update. Other people may have tried to do a load, but no one else can have stored to that address, or store-conditional'd to that address. Because if someone else did a store-conditional, what happens is that it cancels the other processors' reservations: it sets their flag bits to zero, atomically. This is a pretty powerful primitive, such that you can basically do a load, and then try to do a store later in the future.
If no one else tried to do a store to that same address in the meantime, or a store-conditional to that address, your store goes through. And everybody else who did a load in the meantime gets invalidated, or rather, they'll find out they were invalidated when they go to do their store-conditional. So if we take the same example here and implement it with load-link and store-conditional, we look at the head pointer, where we normally do the update, and we do a load-link into this register. And when we go to do the store-conditional, if someone else has changed this value in the meantime, we get a fail, and we jump back around and have to redo the load-link. This actually gets rid of the ABA problem too, because if you did have that sort of race, someone else would have done a store-conditional to that address, and the wrap-around cases and all that stuff would invalidate your flag on that memory address. One of the reasons people like this is that when you go to do a load-link, or load-reserve, and store-conditional, you can often piggyback it on a memory coherence protocol, such that a load-link adds an extra bit in your memory, we'll say.
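C has no portable load-linked/store-conditional, but the retry shape survives: compilers typically lower `atomic_compare_exchange_weak` to an LL/SC pair on ARM or RISC-V (ldrex/strex, lr/sc), and "weak", like a store-conditional, is allowed to fail spuriously. A sketch of an atomic increment written in that style:

```c
#include <stdatomic.h>

/* LL/SC-style read-modify-write: load, compute, then attempt a
 * conditional store; if anyone else updated the location in the
 * meantime (or the "SC" fails spuriously), redo it from the load.
 * Returns the value observed just before our increment. */
int ll_sc_style_increment(atomic_int *addr) {
    int old = atomic_load(addr);                 /* the "load-linked" */
    while (!atomic_compare_exchange_weak(addr, &old, old + 1)) {
        /* failure refreshed `old` with the current value; retry */
    }
    return old;
}
```

On a contended location this loop may spin a few times, but each iteration is quiet in the uncontended common case, which is the property discussed next.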
So if you're running some sort of memory coherence protocol, the load-link pulls the line into your cache and you can tag that cache line as special; that's effectively the flag bit here, because this whole line was load-linked. And then, if no one else pushes it out of your cache or fetches it from your cache in the meantime, which would mean no one else touched it, then by the time you go to do the store, you know that no one did anything. What's really nice about this is that no communication has to happen between the processors in the common case, with an uncontended lock. Spin locks, by contrast, are generating memory traffic all the time: they're doing loads, doing stores, trying to grab the lock over and over, generating all this traffic all over the place. This is very quiet: you just pull the line into your local cache, and if, by the time you get back to it, it's still in your cache and still has its bit set, great, you get to do it. In the contended case, though, you may actually get traffic ping-ponging here, but you can still guarantee correctness. So this brings us to performance. As I said, some of these blocking atomic operations can generate a lot of memory traffic: they're spinning on memory, so they'll at least be doing lots of loads.
And those loads may be going out to main memory. Non-blocking atomic operations, things like these compare-and-swap operations or load-link/store-conditional (and load-link/store-conditional, as I was alluding to, you can many times run against your own cache), can sometimes have better performance, because you don't have to sit there spinning in tight loops; you can try to do other stuff in the meantime. The downside, at least for the compare-and-swap implementation, is the speculative work: you're sitting there redoing some work, and if you didn't get the lock, that performance can be bad. There are also, as I alluded to at the beginning of class, algorithms that implement mutual exclusion with just ordinary loads and stores. Their performance is not very good, and we're going to look at an example of that, called Dekker's algorithm, today. Finally, the performance of these isn't just dependent on the operations themselves. These hardware primitives have better or worse performance depending on how highly contended they are. By contended I mean that you have multiple threads trying to acquire the lock at the same time.
If you have lots of threads fighting for the lock, that's high contention. If you have a lock that almost never sees contention, you just go to get the lock and you get it; the contended case is really rare. Different locking primitives have better or worse protocols for those situations. And of course, we're going to talk in a little bit about how you make these integrate well with caches, out-of-order processors, and out-of-order memory