Okay, so now we get to talk about issues of sequential consistency and how to actually go about implementing these things. So going back to our valid interleavings, we start to ask ourselves: how do we build a sequentially consistent model with out-of-order processors, or out-of-order execution of memory? Well, we already said in our out-of-order system that we can reorder loads to different addresses, and we can reorder loads to the same address; it doesn't really matter. Well, if you take that, all of a sudden that just breaks sequential consistency. We said we could reorder a store and a load in our out-of-order processor if they were to different addresses. We said we could reorder a load and a store the other way. We said we could reorder stores with stores. So out-of-order memory systems and out-of-order processors effectively break sequential consistency. So how do you implement a processor which is useful and is out of order? And we want to go out of order for performance reasons. Also, caches start to become a big problem here for performance. If we want to have true sequential consistency, you're going to have to somehow figure out how to not store any data in your cache.
Because the second that you store data in your cache, you have a visibility problem. Let's say it's a write-back cache: if you do a store to your local cache and you write a one to it, and some other processor tries to read that location, well, it's not going to be able to see that data. So we need to be very careful here about when data is actually visible to other processors if you have a cache involved in a multiprocessor system. So we are going to move toward somewhat weaker, or relaxed, memory models. As I said last class, almost no processor actually implements a truly sequentially consistent memory system; they're all weaker than that. Now the question is how weak do you want to go, and this starts to come up because it's a trade-off between what the programmer can wrap their head around versus, effectively, performance in an out-of-order system. So what we can do is think about going to weaker memory systems, and when we do that, we are going to make the programmer decide when a load and a store can be reordered, or when a store and a store can be reordered, or when a load and a load can be reordered. So let's look at these four different cases independently here: reordering a load with a load, reordering a load with a store after it, reordering a store with a load after it, and reordering two stores.
And note, we're not talking about the same memory addresses here at all. We are talking about two different addresses, but sequential consistency says that even for different addresses we need to worry about this. So how many of these extra dependencies can we remove? We're going to introduce this notion of a memory fence, or, as we sometimes call it, a memory barrier operation, which is a serialization point. I'll be careful what I say here: my reference is actually a serialization operation which says that all of the memory accesses, or at least the named memory accesses, before a certain point have completed before the memory fence completes, and that all of the operations after it have not started or completed before the memory fence completes. So this really is a barrier on your memory operations. The reason I'm being a little bit hesitant here is that the basic memory fence says all memory operations before the memory fence complete, and none after the memory fence have started when the memory fence completes. But people have introduced even weaker versions of the memory fence over time, for performance of course. So you start to see things like load-only memory fences, which act as a barrier against all the previous loads.
Or you have uni-directional memory fences, which say: only look at the stores that happen after this point, and don't allow any of them to happen before this point. So people have actually introduced instructions in modern architectures for all of these things. And if you actually want to go look at a really weak memory model, go read about the Itanium memory model; it's extremely weak, and they need to have fences all over the place. The trade-off here, the performance trade-off, is how many fences you need to put in versus the performance you get from reordering memory operations. So let's name a few of these things, because everyone likes naming. These are computer architecture research names, but they are effectively common across the industry at this point. One of them is called total store ordering. Total store ordering is not a big step away from sequential consistency. In total store ordering, loads do not get reordered with other loads, loads do not get reordered with other stores, and stores don't get reordered with other stores. But a store followed by a load? That can be reordered. So a load can be moved above a store here. And if you want to enforce the order between the store and the load, you need to write this: you need to do a store word, then a fence, then a load word.
And this guarantees that this load is going to happen after that store in the processor. Partial store ordering is a little bit weaker than that. In partial store ordering, loads with other loads don't get reordered, and loads with later stores don't need fences, but stores followed by loads, which is up here, and stores followed by stores both need fences to guarantee ordering. And then finally we get this class of weak ordering, and frankly, most processors these days actually fall into some category of weak ordering; there are different questions about how weak it is, but we're not going into that level of detail here. Here, you actually need a fence to enforce all of these orderings, so it's kind of interesting to think about. The nice thing about fences is that, while these operations are expensive and are big serialization points, you only need to pay for them when you care about the ordering. When you don't care about the ordering, the computer architecture can reorder things and make everything go faster. It's only exactly when you care about it that you need to introduce the fences. One thing I was going to say is that some compilers will do some of this work for you and introduce the fences for you. And usually one of the ways that's done is through atomic operations.
So if a compiler sees something like a lock operation, or a P or a V, in your code, that's sort of a special signal to the compiler: it knows it needs to make sure the memory operations have all been serialized at that point. Usually atomic operations like test-and-set are also fences in their own right, which helps, but sometimes compilers can even figure out where other fences are needed for other memory addresses. And one of the things that I want to say here is that fences don't actually take an argument; a fence does not take an address to fence on. The reason for this is that what we really want is to make sure all the previous memory operations complete, because in the question of consistency we haven't said anything about addresses; we just said previous memory operations in one thread don't move after a subsequent memory operation. Okay, so let's look at an example of where we might need to introduce different levels of fences. Here we have our producer-consumer model. One of the things we were worried about in our original example was these two stores getting reordered. Well, we can force that ordering by putting a memory fence operation in here and saying it's a store-store fence. No stores can pass any other stores, and this will guarantee that these happen in order.
But in this case, well, this one actually already has an arc, so that's not going to get reordered, because this register is dependent there. But let's go over here and take a look. If we have full sequential consistency, we would not be able to reorder these two loads. Well, for performance, maybe we want to reorder those loads, because one of the input registers isn't available; it's the result of a long multiply, for instance, or something like that. So you might want to reorder in your out-of-order processor for performance reasons. In true sequential consistency you wouldn't be able to do that. But with this weaker memory model we can reorder them, and then guarantee correctness by introducing a fence operation here. And this fence operation is going to say: these loads need to complete before this load starts. So you can guarantee that if our head equals our tail, you know at that point that you're safe to go execute the rest of this code. Any questions about that so far, about the fences and why it's kind of pay-as-you-go, if you will?