Okay, so: multiple logical communication channels. All the different message types we just talked about (there are, like, ten not drawn here) can't all go on the same network at the same time. If you have one interconnection network, or one logical network, you can end up with cases where responses get queued behind requests that will create more traffic. In these protocols we have multiple phases of communication: we send messages from point A to point B, then from B to C, then from C back to B, and then B back to A, or something like that. So if you try to overlay this onto an interconnection network and you do head-of-line processing, meaning you process the message at the head of the queue first, it's very common to have a new response sneak in which is dependent on some other transaction that's happening somewhere else in the system, or to have a new request come in that sits in front of a response, and all of a sudden you're going to have a deadlock.

One way people go about solving this is to look at all the different types of messages that you have and figure out which types can coexist on the network at the same time. A good example here: if you look at a relatively basic cache coherence protocol, there are maybe twelve different message types. You can try to shrink that down to maybe three or four different classes of traffic, chosen so that those classes, when intermixed, will never introduce a deadlock. If you wanted to be really safe, you'd just have twelve separate networks, or twelve logically different networks. But that's expensive to build, so people try to come up with equivalence classes of the different traffic and then only have that number of networks. Either way, you have to segregate these flows into different logical or physical channels to make sure deadlock can't happen.
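To make that concrete, here is a minimal sketch in C. It is not from the lecture: the message-type names and the three-way split (requests, forwarded requests, responses) are illustrative assumptions about what such a classification might look like.

```c
/* Hypothetical sketch (not the lecture's protocol): collapse many
 * coherence message types into a few traffic classes, each carried on
 * its own logical network.  The names and the three-way split are
 * illustrative assumptions. */
#include <stdio.h>

typedef enum {
    MSG_GETS, MSG_GETX, MSG_PUTX,          /* requests to the directory  */
    MSG_FWD_GETS, MSG_FWD_GETX, MSG_INV,   /* forwarded by the directory */
    MSG_DATA, MSG_ACK, MSG_NACK            /* responses: terminal        */
} msg_type_t;

typedef enum { NET_REQUEST, NET_FORWARD, NET_RESPONSE } net_class_t;

/* Twelve-ish message types, three logical networks: a response can never
 * be queued behind a request, because they travel on separate channels. */
static net_class_t classify(msg_type_t t) {
    switch (t) {
    case MSG_GETS: case MSG_GETX: case MSG_PUTX:
        return NET_REQUEST;
    case MSG_FWD_GETS: case MSG_FWD_GETX: case MSG_INV:
        return NET_FORWARD;
    default:
        return NET_RESPONSE;
    }
}

int main(void) {
    printf("GETX travels on logical network %d\n", classify(MSG_GETX));
    printf("DATA travels on logical network %d\n", classify(MSG_DATA));
    return 0;
}
```

The usual design rule is that handling a message in one class may only generate messages in later classes, and responses are always sunk unconditionally, so the dependence chain between classes can never form a cycle.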
Another thing that I wanted to point out here is that we just distributed our memory, so we need to start worrying about the memory ordering point: for a given memory address, where do you go, and in what order do memory operations happen? Just like in a bus-based protocol, where you have some sort of arbiter which determines the transaction order, or who wins the bus, in a directory-based system the directory for a given address is typically the ultimate arbiter. So if two messages come in from different cores, both going for write access to a particular address, one of them is going to get to the directory first, and whichever gets to the directory first ultimately wins. Sometimes this can be unfair, so some of these systems try to have something which prevents the core that is close to the directory from always winning access to that contended line, we'll say; some sort of back-off protocol comes into effect there. But this is what effectively guarantees our atomicity: the directory guarantees that a particular line can only be transitioning from one state to another state at a given time.

So we have the directory as the ordering point, and whichever message gets to the home directory first, let's say, wins. Subsequent requests for that line are going to lose. So what do you do on a loss? Well, the directory is probably going to send a negative acknowledgement, or NACK (sending a retry is the same thing here). It's going to say: I can't handle this right now; this line is already being transitioned; someone else won the transition of this line; go retry that memory transaction again in the future. Now, this gets pretty complicated back at the cache controller, because it's going to get a retry, or a NACK, back, and it could potentially have a message coming in for that exact same cache line. It needs to give the directory's message priority, in this case, over the pipeline of the main processor, and order its own retry after that message, because the directory won; the directory is the ultimate ordering point for the memory location.
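Here is a minimal sketch of that serialization at the directory, again in C. The state names and fields are illustrative assumptions rather than the lecture's exact protocol; the point is just that a request arriving while the line is mid-transition gets NACKed and must retry.

```c
/* Hypothetical sketch of the directory as the ordering point for a line.
 * A request that arrives while the line is mid-transition is NACKed; the
 * loser's cache controller must retry later.  State names and the
 * single-winner bookkeeping are illustrative assumptions. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { DIR_STABLE, DIR_TRANSIENT } dir_state_t;

typedef struct {
    dir_state_t state;   /* is a transition already in flight?   */
    int         winner;  /* core whose request is being serviced */
} dir_entry_t;

/* Returns true if the request wins, false if it is NACKed. */
static bool directory_request(dir_entry_t *e, int core) {
    if (e->state == DIR_TRANSIENT) {
        printf("core %d: NACK, line busy (core %d won), retry later\n",
               core, e->winner);
        return false;
    }
    e->state  = DIR_TRANSIENT;   /* one transition at a time: atomicity */
    e->winner = core;
    printf("core %d: wins, transition begins\n", core);
    return true;
}

static void directory_complete(dir_entry_t *e) {
    e->state = DIR_STABLE;       /* done; the next requester can win */
}

int main(void) {
    dir_entry_t line = { DIR_STABLE, -1 };
    directory_request(&line, 0);  /* core 0 arrives first and wins     */
    directory_request(&line, 1);  /* core 1 loses and is told to retry */
    directory_complete(&line);
    directory_request(&line, 1);  /* core 1's retry now succeeds       */
    return 0;
}
```

A real controller would also need to bound the retries (the back-off mentioned above) so a core far from the directory isn't starved.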
Finally, you have to worry about forward progress guarantees. What I mean by this is that it's pretty easy to build a memory system where you pull the data into your cache (your cache controller pulls it into your cache), but before you're able to do the load or store to that actual address, it gets bumped out of your cache. All of a sudden you have a cache line just bouncing between two caches, and that's a livelock scenario. So in directory-based coherence systems (and this also happens with bus-based protocols, though there it's usually less likely) you have to have some sort of forward progress guarantee. The reason it's less likely on a bus is that in a directory-based system the communication time is usually longer, so the window of vulnerability is longer.

A forward progress guarantee means that once a line gets into your cache, you need to do at least one memory operation to it before you relinquish the line back to the directory controller. So if you're doing a load, you actually bring the line into your cache shared and do the one load; you don't respond with a reply to, let's say, an invalidation request from the directory controller until you've done your one memory transaction. Likewise, if you bring a line in modified, do that one store before you cough the data back up, or release the data back to the memory controller. That's really important for making sure you have forward progress in the system and don't end up with a livelock.
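A minimal sketch of that rule follows, using a one-flag-per-line scheme that is my illustrative assumption rather than any real controller's design: invalidations for a line are deferred until the local core has completed one access to it.

```c
/* Hypothetical sketch of a forward-progress guarantee: the cache
 * controller defers an invalidation for a line until the local core has
 * completed at least one access to it.  The per-line flags are an
 * illustrative assumption. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool valid;
    bool used_once;      /* has the core done its one guaranteed access? */
    bool inv_pending;    /* invalidation arrived before that access      */
} line_t;

static void core_access(line_t *l) {
    l->used_once = true;                 /* the guaranteed load or store */
    if (l->inv_pending) {                /* now honor the deferred       */
        l->valid = false;                /* invalidation and release the */
        l->inv_pending = false;          /* line back to the directory   */
        printf("deferred invalidation applied after one access\n");
    }
}

static void recv_invalidate(line_t *l) {
    if (!l->used_once) {
        l->inv_pending = true;     /* defer: stops the line bouncing     */
        printf("invalidation deferred: core hasn't used the line yet\n");
    } else {
        l->valid = false;          /* safe to give the line up now       */
    }
}

int main(void) {
    line_t l = { .valid = true };
    recv_invalidate(&l);  /* arrives before the core's access: deferred  */
    core_access(&l);      /* one access completes, line is then released */
    return 0;
}
```

This is exactly the "do one access before you cough the line back up" guarantee from the lecture.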
Okay, so now we're going to get into the more future-looking stuff here. We've been talking about what's called a full-map directory, where there's a bit per processor core. If you have 1,000 cores in the system, that's a very large bit map, and it's pretty uncommon that a thousand cores will all be reading one particular cache line. So that can be wasteful, and a full-map directory grows at order N, which is not particularly good. People have looked into different protocols here, and I just want to touch on them; this is largely an area of research and future directions.

One idea is a limited pointer based directory. Instead of having a bit mask as the sharer list, you have base-two-encoded core numbers, up to some limit. This is why it's called a limited directory, or limited pointer directory: you can't have all the nodes in the system on the list, because there are only so many entries, but you can name any node, because you have a base-two encoding of its number. If the sharer set gets bigger than the list (say you have five entries and a sixth sharer wants a read-only copy of the line), there's usually an overflow bit which says there are more sharers than the entries can name. When that bit is set and the line transitions to modified, you're going to just send a broadcast message, an invalidate, to every single cache in the system. Usually this is a good trade-off, because it's pretty uncommon to have extremely widely shared lines; it's an interesting trade-off of storage space versus sending more messages in the system.

Likewise, there's an interesting protocol here called LimitLESS. Same idea, a limited directory with an overflow bit, but if it overflows you start to keep the sharer list in software, in a structure in main memory. This requires you to have very fast traps, so that when an overflow happens while you're servicing a cache line, you interrupt the main processor and the main processor provides the rest of the sharer list. And there's a bunch of stuff in between, and future research is still being done actively in this space.
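Here is a minimal sketch of such a directory entry in C. The sizes (five pointers, core IDs up to 10 bits) are illustrative assumptions; the point is the storage-versus-messages trade-off.

```c
/* Hypothetical sketch of a limited pointer directory entry: a few
 * base-two-encoded sharer IDs plus an overflow bit.  Five pointers and
 * 1,024 cores are illustrative choices, not from any real machine. */
#include <stdint.h>
#include <stdio.h>

#define MAX_PTRS 5

typedef struct {
    uint16_t sharer[MAX_PTRS];  /* base-2 encoded core numbers: 10 bits
                                   names any of 1,024 cores, where a full
                                   map would need 1,024 bits per entry   */
    uint8_t  count;             /* pointers currently in use             */
    uint8_t  overflow;          /* set once sharers exceed MAX_PTRS      */
} limited_dir_entry_t;

static void add_sharer(limited_dir_entry_t *e, uint16_t core) {
    if (e->count < MAX_PTRS)
        e->sharer[e->count++] = core;
    else
        e->overflow = 1;  /* can no longer name them all; LimitLESS would
                             instead trap here and keep the rest of the
                             sharer list in software in main memory      */
}

static void invalidate_sharers(const limited_dir_entry_t *e) {
    if (e->overflow) {
        printf("overflow set: broadcast invalidate to every cache\n");
    } else {
        for (uint8_t i = 0; i < e->count; i++)
            printf("invalidate core %u\n", (unsigned)e->sharer[i]);
    }
}

int main(void) {
    limited_dir_entry_t e = {0};
    for (uint16_t c = 0; c < 6; c++)   /* sixth sharer trips the overflow */
        add_sharer(&e, c);
    invalidate_sharers(&e);
    return 0;
}
```

With 1,024 cores, a full map costs 1,024 bits per entry, while five 10-bit pointers plus an overflow bit cost 51; the price is a broadcast (or a LimitLESS-style software trap) on the rare widely shared line.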
Beyond simple directory coherence, we have some pretty cool on-chip coherence. This is why this is actually being studied a lot right now: people built these massively parallel directory-based machines in the past, and they got some use and were very good for some applications. But now we're starting to put more and more cores onto a given chip, so we start to see on-chip coherence, and people are figuring out how to leverage the fast on-chip communication alongside directories to make more interesting coherence protocols. There's also something called COMA systems, or Cache Only Memory Architectures, where instead of having the data live in main memory, you don't really have main memory at all, and the data and directories move around. That's beyond the scope of this course; look at the KSR-1 if you want to read about that kind of machine. Research on how to scale the sharer list is also active.

I briefly wanted to talk about the most up-to-date versions of these things. We have the SGI UV 1000, which is a descendant of the Origin and Origin 2000 machines from SGI. Lots of cores here: 2,560 cores that are all kept cache coherent, using a directory-based coherence protocol. It's very non-uniform, and it's all connected together by a multi-chassis 2D torus. So this is one chassis, and there can actually be up to eight of these. Princeton has one of these machines in the HPCRC center that I think is four chassis, which is about half the maximum size.

An on-chip example here is the TILEPro64, which had 64 cores, and each of the cores is itself a directory home and runs a directory-based protocol. I was talking about dividing communication traffic into different flows: well, there were three different memory networks here, so we had to come up with three different classes of traffic that would not deadlock among themselves, if you will. There are also four memory controllers connected into the interconnect, and because of this the communication latencies differ. A core here talking to this memory controller is very fast, whereas for a core over there it takes longer; but maybe that core is close to the other memory controller. So there's non-uniformity here, too.

Okay, so this is our last slide of the term: beyond ELE 475. If you want to go on and do more, well, start reading some papers from the different computer architecture conferences. The proceedings of the International Symposium on Computer Architecture, ISCA, is a good place to start; that's probably the top major architecture conference in the field. The International Symposium on Microarchitecture, MICRO, is the top microarchitecture conference, so look there if you're trying to look inside a processor at some of the smaller microarchitectural details. ASPLOS, Architectural Support for Programming Languages and Operating Systems, is a really good conference that covers the boundary between software and hardware. And HPCA, High Performance Computer Architecture, looks at, or used to look at, bigger high-performance computer systems, but now a lot of mainstream computer architecture work ends up in that conference as well. Beyond reading: go do research, build chips, build some test FPGA designs, and learn more about parallel computer architecture. And indeed, you can come back in the fall and take ELE 580A, which is going to be a graduate-level primary-sources paper reading course, a more traditional graduate course where you'll learn about all the different parallel computing systems. It's called parallel computation because it covers both parallel computer architecture and the programming models that hook together with those architectures, parallel programming together with architecture, because they go very much hand in hand. So let's stop here for today, and stop here for the course.