So let's think about the snoopy protocols that we've talked about, our bus-based protocols, and their performance and asymptotic performance requirements. What are the challenges of a snooping protocol? As we discussed before, as you add more processors to the system, you have more entities shouting on one shared medium. And you need to hear all of that shouting. You can't just ignore some shout, because you need to snoop it against your local cache. Whenever another core takes a cache miss, you need to snoop that against your cache and make sure that you don't have a copy of that line, or that you invalidate it, or, for instance, write back some data, or make whatever other adjustment the invalidation coherence protocol requires.

What's annoying about this is the amount of bandwidth you require on your bus: every cache miss needs to go across that bus, everyone needs to look at it, and everyone needs a port with enough bandwidth to check all of those transactions against their cache. So if we want to keep up with the same bandwidth of cache misses per core as we add more cores to the system, the bus is going to have to grow order N, where N is the number of processors. You can compute this: say every core has the same cache miss rate; then the total traffic is just that per-core rate multiplied by N, and each core has to snoop all of it. That will be fine when N is eight, but if N goes to a thousand or a million, you're going to have some serious problems; that's a very, very big bus. And it's not just raw bandwidth: you also need to arbitrate for the bus somewhere, and you need atomic transactions going across that bus, so even if you have a very high-bandwidth bus, you may not have enough free cycles to actually operate the bus.

A solution to this is something we're going to call directory cache coherence, or directory protocols. The key idea in a directory protocol is that instead of broadcasting your invalidations to every other core, or every cache, in the system, you go talk to a location that we're going to call the directory. And this directory is going to keep track of which caches have that data.
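As a concrete sketch of what the directory might keep, here is a minimal C representation of one directory entry, assuming a simple MSI-style protocol and a bitmask sharer list; the field names, the three states, and the 64-cache limit are illustrative assumptions, not any particular machine's format:

```c
#include <stdint.h>

/* Hypothetical directory entry for one cache line (MSI-style).
 * Assumes at most 64 caches so the sharer list fits in one word. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;   /* who may hold the line, and how                 */
    uint64_t    sharers; /* bit i set => cache i may hold a copy           */
    uint8_t     owner;   /* valid in DIR_MODIFIED: the one exclusive owner */
} dir_entry_t;
```

With a sharer list like this, an invalidation only has to visit the caches whose bits are set, rather than being broadcast to all N of them.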
What's nice about this is that now, if you take a cache miss, you can go ask the directory: who has this cache line? And if only one other core has the cache line, say readable, and you're trying to take it with exclusive access because you're trying to write to it, you only need to invalidate that one location instead of sending a message to all N processors in your system. So we've cut down what was a broadcast system into a point-to-point system. The overhead we now have to pay is that we need to track, in a directory, all of the caches which could have a particular cache line in them. We'll go through a much more complicated example of that, but that's the overall key idea. This is going to turn what was a broadcast into point-to-point communication, and we can use point-to-point interconnects for this.

Another good point of scalability here is that you can actually have multiple directories. You don't have to have one big monolithic directory. Instead, you can segment the address space somehow, and depending on the address you have, you go to a different directory. And by going to these different directories, you can actually increase the bandwidth to your directories.

Okay, so let's see how this fits into a block diagram. We have CPUs trying to communicate with other CPUs via shared memory, and they check their cache first. Before, on a miss, they would have had to cross a bus, and everyone would have had to look at that traffic. Instead, in our directory cache coherence scheme, the cache sends a message to the directory controller associated with that address. The basic directory controller here is going to keep track, for every single line in memory, of the list of caches which could potentially have that piece of data; we're going to call that the sharer list. And right now, if we look at this, it might still be a uniform communication network. Let's say you have some omega network in here: anyone can talk to anyone else, and the latency through it is fixed. So this is still a uniform memory access system; we didn't have to go non-uniform here.
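Before going on, here is a hedged sketch of what such a directory controller might do when a core misses because it wants to write a line. It builds on the dir_entry_t sketch above; send_invalidate, send_writeback_request, wait_for_acks, and send_data are hypothetical stand-ins for whatever messaging the real interconnect provides:

```c
#include <stdint.h>

/* Stand-ins for the interconnect; a real system implements these. */
void send_invalidate(int cache, uint64_t line);
void send_writeback_request(int cache, uint64_t line);
void wait_for_acks(int n);
void send_data(int cache, uint64_t line);

/* Hypothetical handler: cache `req` wants exclusive (write) access to `line`. */
void handle_write_miss(dir_entry_t *e, int req, uint64_t line)
{
    if (e->state == DIR_SHARED) {
        int n_inval = 0;
        for (int i = 0; i < 64; i++) {          /* point-to-point, not broadcast */
            if (((e->sharers >> i) & 1) && i != req) {
                send_invalidate(i, line);
                n_inval++;
            }
        }
        wait_for_acks(n_inval);                 /* collect invalidation acks     */
    } else if (e->state == DIR_MODIFIED) {
        send_writeback_request(e->owner, line); /* fetch dirty data from owner   */
        wait_for_acks(1);
    }
    e->state   = DIR_MODIFIED;                  /* requester becomes sole owner  */
    e->sharers = 1ULL << req;
    e->owner   = (uint8_t)req;
    send_data(req, line);                       /* finally reply with the line   */
}
```

Note that only the caches on the sharer list ever see a message; everyone else's snoop port stays idle.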
So, in a system like this, no cache is necessarily closer to or farther away from any particular piece of memory. This is kind of our naive directory cache coherence protocol. But what's still nice here is that we don't have to broadcast. With, let's say, an omega network, or a mesh network, or something else on the inside here, we send from this cache directly to this directory controller. If no other cache has the line, readable or writable or anything like that, the controller can just respond back with the data from memory. If not, instead of broadcasting an invalidate to all other cores, the directory can just say: oh, this cache and this cache have copies; I need to send two messages, one to this cache and one to that cache, to invalidate them, wait for the responses, and then reply back with the data. So we can decrease the bandwidth that we use in the common case across our interconnection network.

Now I'm going to show a slightly different picture, which is pretty similar to the previous one. What you'll notice is that the memory and the directory are now connected to an individual CPU. So why do we do this? Well, if you're building one of these scalable systems, some sort of supercomputer, it might be a good property that as you add more CPUs to the system, you also add more RAM, and maybe more directory storage or something like that. Another positive of a design like this is that this CPU is now actually close to this memory bank, and we can try to take advantage of that. So one question comes up: how can we take advantage of that? Anyone have any thoughts?

Okay, so shared data, we don't know where it's going to be accessed; it could be accessed by all six CPUs and all six caches here. But it's very common that the stack for your program is only going to be accessed locally, and the instruction memory for your program is only going to be accessed locally. So you can potentially get performance benefits by putting the instructions and the stack, and maybe even some portion of the heap, close to this core, because then you can access them really quickly, and only shared data has to go across the interconnect. And in fact that has a fancy name: systems where some data is close and some data is far away are called Non-Uniform Memory Access, or NUMA, systems. And you might actually see this even in your desktop processors, which are moving towards NUMA systems.
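Coming back to the per-node directories for a moment: one simple way to decide which node's memory and directory are "home" for a given address is to interleave the physical address space across the nodes. Here is a minimal sketch under that assumption; the 64-byte line size and six nodes mirror the figure, and real machines may interleave at coarser granularities or use lookup tables instead:

```c
#include <stdint.h>

#define LINE_BITS  6   /* assume 64-byte cache lines                       */
#define NUM_NODES  6   /* six CPU/memory/directory nodes, as in the figure */

/* Illustrative home-node computation: cache lines are interleaved
 * round-robin across nodes, so the directory (and the memory) for
 * `addr` live on home_node(addr). */
static inline int home_node(uint64_t addr)
{
    return (int)((addr >> LINE_BITS) % NUM_NODES);
}
```

Because different addresses map to different home directories, misses to different lines can be serviced in parallel, which is exactly the directory-bandwidth benefit mentioned earlier.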
And some of them already are. I believe the AMD chips today are already NUMA systems, even on a single chip with multiple dies or something like that; there are actually two NUMA nodes inside of them, one for one memory controller and one for the other memory controller. So if you go into something like Linux, you can actually see the NUMA configuration it exposes, for instance under /sys/devices/system/node or per-process in /proc/<pid>/numa_maps, and it'll tell you the layout of the different memories. And then the OS can take advantage of this: it can put, for instance, the stack and the instruction memory for a program that's being used by a particular core close to that core, and then maybe make some other placement choice for other data.

Now, I want to make a point here: just because the latency to memory is different does not mean that your system is a directory-based cache-coherent NUMA system. You can still have non-directory-based systems where some memory is close and some memory is far away. You could still have, basically, a bus, or maybe some other interconnection network in there which is still a snooping protocol, or effectively a snooping protocol, but where some data is close and some data is far away. If you see this in the literature, though, people talking about directory-based cache-coherent NUMA systems will usually call them CC-NUMA, or cache-coherent NUMA, systems. That usually means a cache-coherent non-uniform memory access architecture, and it usually implies a directory-based protocol, though there may be other protocols people are using out there as well.

Okay, so I want to go back one slide here, and I wanted to finish off by talking about one topology which is interesting. The difference between these two slides is that we went from a CPU here to CPUs. So this is a multi-core chip now, and where this gets interesting is that you might have a directory-based cache coherence system connecting multiple chips, but then inside of a chip you may have something like a bus-based snooping protocol. So we actually mix and match these two things.
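As a hedged illustration of how software can act on this topology, here is a small C program using Linux's libnuma API; numa_available, numa_max_node, numa_alloc_onnode, and numa_free are real libnuma calls, but treating node 0 as "local" is just an assumption for the sketch:

```c
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>   /* Linux libnuma; compile with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    printf("NUMA nodes: 0..%d\n", numa_max_node());

    /* Place a 1 MiB buffer on node 0; threads running on node 0's
     * cores can then reach it without crossing the interconnect. */
    size_t sz = 1 << 20;
    void *buf = numa_alloc_onnode(sz, 0);
    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    /* ... use buf from threads pinned to node 0 ... */
    numa_free(buf, sz);
    return 0;
}
```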
And the way we go about doing this is: if cores inside of this one chip want to go get data from each other, they can just effectively snoop on each other. But outside of that, your cache controller, or maybe your L3 cache for this particular chip, is going to respond to messages coming from other directories, like invalidation requests, and do something about them. So there's basically a transducer there between a directory-based cache coherence protocol and a bus-based snoopy protocol. And this is pretty common these days, especially given that you have a fair number of multi-core chips showing up and being used in these directory-based cache coherence systems. And we'll talk about one of them at the end, actually: the SGI UV system, or UV 1000, uses off-the-shelf Intel parts, modern-day sort of Core i7 parts, mixed together with a NUMA, directory-based coherence system to connect all the chips together. So there's a transducer from the chip's external snoop bus protocol to the directory-based coherence protocol.
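To tie the two levels together, here is a hedged sketch of what such a transducer might do when a directory invalidation arrives at a chip; broadcast_snoop_invalidate and send_ack are hypothetical stand-ins for the on-chip snoop bus and the off-chip network, not any vendor's actual interface:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two sides of the transducer. */
void broadcast_snoop_invalidate(uint64_t line);   /* all on-chip caches snoop this */
void send_ack(int home_directory, uint64_t line); /* reply to the home directory   */

/* Transducer at the chip boundary (e.g., in the L3 controller):
 * translate a point-to-point directory invalidation into an
 * on-chip snoop broadcast, then acknowledge the home directory
 * so it can complete the remote core's miss. */
void on_external_invalidate(int home_directory, uint64_t line)
{
    broadcast_snoop_invalidate(line);
    send_ack(home_directory, line);
}
```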