Okay. So, today we're talking about parallelism and synchronization. From a parallelization perspective, we're worried here about parallel programming and how to build computer architectures that can simultaneously run multiple programs, or multiple threads within one program. And today, we're going to talk about some models of that synchronization, and some primitives to help you solve it: synchronization primitives and memory fences.

This is a little bit of background, or motivation, for why we're going to multiple processors nowadays. We show here Moore's Law plotted on a log plot: the number of transistors going up exponentially. Our sequential performance has sort of gone up, and then stopped at this point. If you go look at the new Core i7 or something like that, it's kind of flattened out here; it isn't continuing to go up. And in fact, the numbers that they actually publish for sequential SPECint are actually parallel numbers these days, because they're using multiple cores to make SPECint and SPECfp go a little bit faster. And clock frequency has also flattened out; we didn't get to our 10GHz processor. Largely, that is due to us hitting some power challenges, but it's also just incredibly hard to build and design these very high frequency processors. So instead, we started to use these more and more transistors to implement multiple cores. We started off with two cores, then four cores, and we've gone up from there. Intel, I think, is now selling an eight-core part, and a ten-core part is coming out soon. AMD has a sixteen-core part, but it is not a true sixteen-core part; I think it is two eight-core dies that are glued together on a chip. And there's some other, weirder stuff up here: some actual many-core, multicore processors with more cores. But people built multiprocessors before the multicore revolution. Back in those days, there were multiprocessor systems: you could take multiple chips and put them together somehow into small systems or large systems. And for the rest of the class, between now and the end of the term, we're going to be talking about parallel computing systems, both systems that have multiple cores on one chip and systems with multiple chips. There are some subtle differences between the two of those things, but a lot of commonality.
In today's lecture, we're going to focus on symmetric multiprocessors and memory systems, on sharing data using shared memory. There are other ways to share data, and we're going to touch on that in two lectures when we talk about messaging in more detail. I bring up symmetric multiprocessors because it's the simplest model to reason about: all processors are up here, and memory is down here. Let's assume there are no caches in the system to begin with, everyone is equidistant from memory, and it's one big memory. And you have multiple processors, which could either be running multiple threads in one program, or multiple programs.

A couple of other interesting things: these concurrency challenges don't come up just from having multiple processors. There are other things that can communicate with memory, or communicate via memory. We have all these different I/O controllers down here -- network cards, disk controllers, graphics cards -- and they also want to read and write memory. So, in your memory system, even on a uniprocessor system, there are multiple agents that want to read and write memory simultaneously.

Okay. So, let's talk about synchronization. The dining philosophers problem is an example of a synchronization challenge, and we're going to talk about two major synchronization ideas, though there are more than are shown on the slide; mainly there are sort of broadcast models and other things like that. For right now, when I say synchronization, it's some way to synchronize communication, or arbitrate communication, in a restricted fashion. So, we're going to have two here. We're going to talk about producer-consumer; that's the figure on the right here. A producer, as the name implies, produces some values, and a consumer consumes the values. This is the most basic form: one producer, one consumer. You could think about having one producer and multiple consumers, multiple producers and multiple consumers, or multiple producers and one consumer; a lot of the same ideas hold. So, that's one model that people like to use. Another model that people like to use is that there's some shared resource, and you want to make sure that no more than, let's say, one processor is trying to access that shared resource at a time. This resource could be a disk, a graphics card, or it could actually be a location in memory. We're going to call this mutual exclusion, and the reason it's called mutual exclusion is that access is exclusive: only one entity can access the resource at one time.
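As a concrete illustration of mutual exclusion, here is a minimal sketch using POSIX threads. The shared counter and the worker function are hypothetical stand-ins for whatever the protected resource actually is; this is not code from the lecture.

    #include <pthread.h>

    /* A shared resource -- here just a counter -- that only one
     * thread may touch at a time. */
    static long shared_counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* enter the critical section   */
            shared_counter++;             /* exclusive access to resource */
            pthread_mutex_unlock(&lock);  /* leave the critical section   */
        }
        return NULL;
    }

Without the lock, two threads incrementing concurrently can lose updates; with it, exactly one thread is inside the critical section at any moment.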
Now, that's the most basic form: only one processor, one entity, can access the resource at a time. A more general form is that some fixed number of processors can access the resource at one time. This is the more general semaphore, which we'll be talking about a little bit later. What I mean by that is, you could have P processors -- let's say P is twenty -- and a resource that no more than two processors can access at the same time. You might say, well, why would you want to do that? A good example is something like a network card that has two outbound queues. You can actually have multiple processors using it, but if you try to have three processors use that one network card at the same time, well, there are only two outbound queues; it has to decide, somehow, how to share those. So that's the example of a true semaphore versus just a mutex (there's a counting-semaphore sketch at the end of this discussion).

Okay. So, let's go through some code examples, because these are going to be instructive. Actually, before I do that: you can have synchronization challenges even on a uniprocessor system. How is that possible? Well, as I said, even in a uniprocessor system, you sometimes have multiple concurrent entities. It could be your disk drive doing DMA into memory at the same time as a process is trying to read that block, for instance, and you need to guarantee that the block has been completely transferred before the processor goes and tries to read it. That's one way it happens in a uniprocessor system. Another way it happens is that uniprocessor systems can many times be multiprogrammed: the operating system will time-slice between different programs, and, in a preemptive environment at least, you'll get interleaving of instructions from different processes. So, you might be running one process, the timer tick goes off, you stop that process, you switch to another process and start executing code from it. At some point, the timer tick goes off again, and you switch back to the first process. So, even on a uniprocessor system, you can have synchronization challenges.
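Before moving on, here is the counting-semaphore sketch promised above, using POSIX semaphores. The network-card framing and the function names are illustrative assumptions, not an actual driver API.

    #include <semaphore.h>
    #include <stdio.h>

    /* Counting semaphore initialized to 2: at most two threads may
     * hold an "outbound queue" at once; a third blocks in sem_wait()
     * until one of the first two calls sem_post(). */
    static sem_t queue_slots;

    void send_packet(int thread_id)
    {
        sem_wait(&queue_slots);   /* acquire one of the two slots */
        printf("thread %d is using an outbound queue\n", thread_id);
        sem_post(&queue_slots);   /* release the slot             */
    }

    int main(void)
    {
        sem_init(&queue_slots, 0, 2);  /* pshared = 0, initial count = 2 */
        send_packet(1);
        sem_destroy(&queue_slots);
        return 0;
    }

A mutex is just the special case where the count is one.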
Okay. So, let's look at a producer-consumer example, and let's look at the abstract picture first. You have a producer here and a consumer there. We're going to use memory -- one big symmetric block of memory -- for all of the communication at this point. We have a queue between the producer and the consumer, shown here, and then we have a head and a tail pointer. It's a FIFO, a first-in, first-out queue: the head and the tail chase each other, and it's circular, so they wrap around at some point. You have a register here on the producer, and you can move the tail pointer into this register. And then we have three registers over here: the tail pointer register, the head pointer register, and a register for the value we receive. The reason I point out these different registers is to show that the producer owns this register, and the consumer owns these three registers. As you can see here, this says Rtail and that says Rtail; they're going to have different copies of Rtail, and those could potentially get out of sync. We're going to see some fun hijinks happening.

Okay. So, let's look at the basic sequence of code here, where the producer wants to put an item x into the queue for a consumer to read sometime in the future. The first thing it does is read the tail pointer into its register. This is a load-store architecture, so it then does a store of the value x to the tail address, which puts it here. Now we have to bump the tail. We can't do this atomically, so we add one to it in our own register. And then we store this back into the tail pointer in memory. Seems simple enough.

Let's look at what the consumer does. The consumer loads the head. The first thing it really wants to do is figure out whether the head equals the tail, because that would mean the queue is empty and it just needs to block. So it loads the tail pointer -- head pointer in this register, tail pointer in that register -- and checks whether they're equal. If they are equal, it jumps up here, reloads the tail pointer to see if it got updated by the producer thread, and just spins forever until the tail pointer is not equal to the head pointer, which means there's some data available. Then we fall through. In the fall-through case, we load from the head into R, into this register; that's the read. And then we update the head pointer to the next location, save it off, and do something with the value we got; the process step is just "do something."
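In C, the sequence just described might look like the following minimal sketch. The names queue, head, tail, and QSIZE mirror the figure but are assumptions, and the volatile qualifier only stops the compiler from caching tail in the spin loop; it says nothing about hardware memory ordering, which is exactly the problem discussed next. The four operations we're about to number are marked in the comments.

    #define QSIZE 16

    int queue[QSIZE];          /* the shared FIFO storage           */
    volatile int head = 0;     /* consumer's read index, in memory  */
    volatile int tail = 0;     /* producer's write index, in memory */

    /* Producer: put item x into the queue. */
    void produce(int x)
    {
        int Rtail = tail;              /* load tail into producer's register */
        queue[Rtail] = x;              /* (1) store x at the tail            */
        Rtail = (Rtail + 1) % QSIZE;   /* bump the tail in the register      */
        tail = Rtail;                  /* (2) store the new tail to memory   */
    }

    /* Consumer: spin until data is available, then take one item. */
    int consume(void)
    {
        int Rhead = head;              /* load head into consumer's register */
        while (Rhead == tail)          /* (3) reload tail; spin while empty  */
            ;
        int R = queue[Rhead];          /* (4) load the value at the head     */
        head = (Rhead + 1) % QSIZE;    /* publish the new head               */
        return R;                      /* "process(R)" would go here         */
    }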
So, this program is a little naive, because we've been talking about out-of-order processors and out-of-order memory systems. This code assumes a perfect world, where instructions are executed in order and there's no interleaving in the memory system between the consumer and the producer, and vice versa. So, does anyone think there are any problems with this? Does this work fine? Yup. Okay. So, you're saying that if the consumer runs before the producer does anything, it might just try to take values off the queue. Well, let's guard against that and say the start condition is head pointer equals tail pointer; then that won't happen, you'll sit spinning here. Now, because no one's jumping up and screaming at this point, you're all assuming that these operations happen in order. But we've talked about out-of-order machines which reorder loads relative to stores. We've talked about very interesting interleavings. We've talked about out-of-order memory systems. So much for me jumping up and down and asking: what do we guarantee about loads and stores to different addresses? We said all we know is that a store and a load that are serialized in that order, on the same processor, to the same address, will happen in order. Our out-of-order superscalars and the out-of-order memory systems we've talked about define nothing about ordering between processors, or even between memory operations to different addresses within one processor. So, that's what we're really going to be talking about today: how those instructions get reordered, and how to prevent that from causing havoc.

Okay. So, let's look at this and number these purple instructions: the two stores and the two loads. The reason we look at those in particular is that the stores are the instructions that update the state the consumer thread is reading from, and the loads are the two instructions that read that state. The numbering is: (1) the store of x into the queue, (2) the store of the new tail pointer, (3) the consumer's load of the tail pointer, and (4) the consumer's load of the value. The programmer assumes that there's some level of causality here. Incorrectly, the programmer is assuming, if you look at this code, that if this load, three, happens after this store, two, then there's some relation between one and four; namely, that one has happened by the time four happens.
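Written as a standalone litmus test, that assumed causality is the classic message-passing pattern. Here is a sketch in C11 atomics, with memory_order_relaxed standing in for "no ordering guarantees"; the variable names slot and tail are illustrative.

    #include <stdatomic.h>

    atomic_int slot = 0;   /* stands in for the queue entry: store (1), load (4)  */
    atomic_int tail = 0;   /* stands in for the tail pointer: store (2), load (3) */

    void producer_thread(void)
    {
        atomic_store_explicit(&slot, 42, memory_order_relaxed);  /* (1) */
        atomic_store_explicit(&tail, 1,  memory_order_relaxed);  /* (2) */
    }

    int consumer_thread(void)
    {
        while (atomic_load_explicit(&tail, memory_order_relaxed) == 0)
            ;                                                    /* (3) */
        /* (4): with relaxed ordering, this load may legally return 0,
         * not 42 -- nothing orders (1) before (2), or (3) before (4). */
        return atomic_load_explicit(&slot, memory_order_relaxed);
    }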
That is not a good assumption in our memory systems, and especially not in our out-of-order memory system. Because, if we recall, the out-of-order memory system said nothing about store ordering. The only thing we talked about was having a load to the same address as a store and not moving that load past that store. Otherwise, these two stores can very easily be reordered by an out-of-order processor or an out-of-order memory system. Likewise, over here, these are just two loads, and those loads can happen completely out of order in an out-of-order computer and an out-of-order memory system. No guarantees there. So, this assumption is bad in a multiprocessor system. In a uniprocessor system it might make sense; in a multiprocessor system, it's a bad assumption.

So, let's look at some sequences that are problematic. Namely, let's say two happens, then three, then four, then one. What happens then? Well, two updated the tail. Three reads the value of that new tail pointer. But at this point, we haven't actually stored the value into the queue, because one never happened. We execute four and it just reads garbage from memory. And then, finally, the value gets updated. Totally valid in the out-of-order memory systems that we've talked about up to this point; nothing wrong there. Okay, now let's say four and then one happen: the consumer loads the head, then loads the value, reading garbage; then the value gets written, and then two and three happen. This is just completely out of order; there are no good semantics here. So, this piece of code, which looks like it should have -- which the programmer would assume to have -- good semantics, basically has no useful semantics unless we define what those semantics are.
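As a preview of where the memory-fence discussion is headed, here is one hedged way to pin down those semantics in C11: a release store on the tail paired with an acquire load in the consumer. Explicit fences between the numbered operations would work the same way. This is a sketch of the idea, not the lecture's own code.

    #include <stdatomic.h>

    atomic_int slot = 0;
    atomic_int tail = 0;

    void producer_thread(void)
    {
        atomic_store_explicit(&slot, 42, memory_order_relaxed);   /* (1) */
        /* Release: (1) is guaranteed visible before (2) is. */
        atomic_store_explicit(&tail, 1, memory_order_release);    /* (2) */
    }

    int consumer_thread(void)
    {
        /* Acquire: (4) cannot be hoisted above this load. */
        while (atomic_load_explicit(&tail, memory_order_acquire) == 0)
            ;                                                     /* (3) */
        return atomic_load_explicit(&slot, memory_order_relaxed); /* (4) now sees 42 */
    }

The release/acquire pair restores exactly the causality the programmer was assuming: if (3) observes (2), then (1) has happened by the time (4) happens.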