Okay, so now we're gonna spend the rest of the lecture talking about different coherence protocols, and their relative merits, on a bus. I wanted to contrast this with what we're gonna talk about in two lectures, where we'll be looking at coherence protocols across switched interconnects, in places where you don't have a shared medium, or a shared broadcast medium.

So as a warm-up here, we're gonna start off by looking at what can happen when I/O transactions happen at the same time as memory transactions in a uni-processor. Let's take a look at where you can have consistency problems in a uni-processor system, as a warm-up and a motivator.

So here we have a processor, and here's its cache, and this is memory. And then, somewhere on the memory bus here, we hang off a DMA agent, or direct memory access agent. Sometimes it's called a bus master, because you can have multiple agents which can effectively drive transactions onto the main memory bus. So if you go look at, for instance, PCI, PCI-X, or PCI Express, which are sort of the extension-card interfaces for your system, they'll use the term bus mastering. What that really means is that there's a DMA engine out at the I/O device. Okay, so what I'm trying to get across here is that you can actually overlap I/O and computation in a uni-processor system.
You can move data from the disk to main memory without having to use the processor. Originally, you would have to use programmed I/O in order to go access the disk: the processor would read an address on the device, pull the data in, and then store it out to memory, word by word. As you can tell, that ties up the processor for the whole transfer, and it's kinda slow. So people decided, let's put dedicated direct memory access engines out at the I/O devices.

This actually goes back to early, early computers. Mainframes had very sophisticated DMA; the System/360, for instance, had programmable DMA engines, its I/O channels, that could effectively run transfers on their own. But the simplest case is basically going to have a set of registers which say where on the disk, where in memory, and how long, and a go button. And you could also do the transfer the other way.

Okay, so let's look at this from a coherence perspective, and let's look at this cache. Let's look at a memory-to-disk transaction. We program the DMA engine to copy this location in memory into this location on the disk, and we tell it to go. Now, while it's doing that, the processor writes to an address inside that page. What happens? The cache is not write-through, and there's no coherence protocol really going on in this case so far.
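The register-plus-go-button interface just described can be sketched as a toy model. This is purely illustrative: the class and register names are made up, and a real DMA engine would be memory-mapped hardware, not a Python object.

```python
# Toy model of the simplest DMA engine described above: registers for
# the disk address, memory address, and length, plus a "go" button.
# All names are illustrative, not any real device's interface.

class ToyDMAEngine:
    def __init__(self, memory, disk):
        self.memory = memory     # bytearray standing in for main memory
        self.disk = disk         # bytearray standing in for the disk
        self.disk_addr = 0       # register: where on the disk
        self.mem_addr = 0        # register: where in memory
        self.length = 0          # register: how many bytes
        self.to_disk = True      # register: transfer direction

    def go(self):
        # The "go button": copy directly between memory and disk,
        # without involving the processor -- or its cache.
        if self.to_disk:
            self.disk[self.disk_addr:self.disk_addr + self.length] = \
                self.memory[self.mem_addr:self.mem_addr + self.length]
        else:
            self.memory[self.mem_addr:self.mem_addr + self.length] = \
                self.disk[self.disk_addr:self.disk_addr + self.length]

memory = bytearray(b"page of data to write out")
disk = bytearray(8)

dma = ToyDMAEngine(memory, disk)
dma.mem_addr, dma.disk_addr, dma.length = 0, 0, 8
dma.go()
print(bytes(disk))   # b'page of '
```

Note that `go()` never touches the processor: that is exactly why a dirty cache line the processor is holding never makes it to the disk.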
Well, hopefully the disk gets the old values. Or maybe it doesn't: maybe some of these cache lines do get written back out to main memory mid-transfer, so the disk gets some of the new values and some of the old values. The point I'm trying to make here is that you don't know what's gonna happen, and that's a little scary. You want determinism in your system.

Now, you can say, well, maybe the processor shouldn't go and write to those memory addresses. That is a valid solution, and it's actually pretty common on modern-day systems: the OS knows when the DMA transfer is in flight, and it will just make sure not to go access those memory addresses. What I was trying to introduce here is that even a uni-processor system can have coherence problems, depending on where the copies of the different addresses live.

You can also have it going the other way, from the disk to main memory, transferring in that direction. Let's say there are some values in the cache of the CPU. This is actually probably the more interesting case: you have some data in the CPU's cache, and the DMA engine writes from the disk into physical memory. But the CPU's cache doesn't pick up the new value. And then the processor wants to go and read that data, like reading a file off a disk or something, and it's gonna get the wrong value. So this moves us to our first idea in our coherence protocols, one that has a funny name: snoopy caches.
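That stale-read hazard can be boiled down to a few lines. This is a minimal sketch, assuming a simple cache model with no snooping at all; the names are made up for illustration.

```python
# Minimal sketch of the disk-to-memory hazard: a cache with no
# snooping happily serves a stale value after a DMA engine writes
# main memory behind its back.

class NoSnoopCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                 # addr -> cached value

    def read(self, addr):
        if addr not in self.lines:      # miss: fill from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]         # hit: cached, possibly stale

memory = {0x40: "old file contents"}
cache = NoSnoopCache(memory)

cache.read(0x40)                        # CPU touches the line; now cached
memory[0x40] = "new file contents"      # DMA writes memory directly
print(cache.read(0x40))                 # stale: "old file contents"
```

The final read hits in the cache and never sees the value the DMA engine deposited in memory, which is exactly the wrong-value scenario above.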
No, this is not named for the dog in the Peanuts cartoons. Instead, Goodman, who I believe is a professor now, and one of his students, came up with the idea that you have the cache snoop on what's going on, and effectively update the cache based on the data that's flying by on the bus.

So, if we look at this from a little bit more of a hardware perspective: we have our cache, and we have the tags, and we have snoop logic looking into the cache, effectively sitting there watching the bus. And if an address that is in the cache flies by on the bus, the cache needs to do something about it. It probably needs to invalidate the address, if it's a write occurring across the bus. You can also do it the other way: if you have a DMA engine which is reading from memory, and the data is dirty in the cache and not in memory, because it's a write-back cache, the cache may need to provide the data to the I/O device that's trying to read from memory, effectively overriding the data coming from main memory. Now, there's also arbitration actually happening on the bus to determine who has the current data. But you have to figure out what the right thing to do is; we'll talk a little bit more in today's lecture about the right thing to do in these interesting corner cases.
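A rough sketch of the invalidate-on-snooped-write idea follows. It's a toy model rather than hardware: the bus broadcasts synchronously, arbitration is ignored, and the dirty-data intervention case for write-back caches is omitted.

```python
# Sketch of snoopy invalidation: every cache watches writes on the
# bus and invalidates its own copy when a matching address flies by.

class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.snoopers = []

    def broadcast_write(self, src, addr, value):
        self.memory[addr] = value            # write reaches main memory
        for cache in self.snoopers:
            if cache is not src:
                cache.snoop_write(addr)      # every other cache checks tags

class SnoopyCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                      # addr -> cached value
        bus.snoopers.append(self)

    def read(self, addr):
        if addr not in self.lines:           # miss: fill from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value             # update own copy
        self.bus.broadcast_write(self, addr, value)

    def snoop_write(self, addr):
        self.lines.pop(addr, None)           # invalidate if we hold it

memory = {0x40: 1}
bus = Bus(memory)
a, b = SnoopyCache(bus), SnoopyCache(bus)

a.read(0x40)          # a caches the old value, 1
b.write(0x40, 2)      # b's write flies by on the bus; a invalidates
print(a.read(0x40))   # a misses and re-fetches: prints 2
```

Without the `snoop_write` step, `a` would keep returning 1 after `b`'s write, which is the uni-processor DMA problem all over again, just between two processors.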
Before we move on here, note that this is getting a little bit hard. Maybe back in 1983 this wasn't so bad, but nowadays we'd have to dual-port our tags. 'Cause in the snoopy protocol, we're gonna need all possible memory transactions that are going on, let's say by one processor and/or a DMA engine, to be verified by every other processor entity in the system. That's a fair amount of bandwidth coming into here. And you could add two ports to the tags, and that's okay if the cache is sort of farther out, but it might slow down your cache if it's the level one cache.

So typically the way that people build this is they'll snoop on the level two cache, and have a level one cache which is not necessarily snooped. Now this is where we get back to inclusive versus exclusive caches. If your level two cache is inclusive of all the data in level one, this is actually not so bad to do, because anything in the level one cache is guaranteed to have its tags in the level two cache. So you don't have to go all the way down to the level one cache for most snoops. If you use an exclusive cache, well, life gets a lot harder, 'cause basically you need to check the level one cache tags too.

So anyway, all I was trying to get across here is that this significantly increases the price of your cache design, your tag design, as you add more ports to it. And it's not actually just an area question, I mean.
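The inclusion argument can be sketched as a tiny snoop filter. The function and tag sets here are made up for illustration; real hardware does this with a tag-array lookup, not Python sets.

```python
# Sketch of why inclusion helps: with an inclusive L2 (the L1's
# contents are always a subset of the L2's), a snoop that misses in
# the L2 tags can be dropped without ever probing the
# timing-critical L1.

def snoop_filter(addr, l2_tags, l1_tags):
    """Return True if the L1 must also be probed/invalidated."""
    if addr not in l2_tags:
        # Inclusive L2: not in L2 implies not in L1 either, so the
        # snoop finishes here, never touching the L1's single port.
        return False
    return addr in l1_tags          # only now do we bother the L1

l2_tags = {0x100, 0x140, 0x180}     # inclusion: l1_tags is a subset
l1_tags = {0x140}

print(snoop_filter(0x200, l2_tags, l1_tags))   # False: filtered at L2
print(snoop_filter(0x140, l2_tags, l1_tags))   # True: must probe L1
```

With an exclusive hierarchy the early-out in `snoop_filter` would be wrong, because a line can live in the L1 without appearing in the L2 tags; that's the "life gets a lot harder" case.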
Adding ports doesn't just make the tag structure larger; it costs you clock cycle performance, especially if it's your level one cache, where the tag check is on a critical path in your processor. One way around all this is that you keep a single-ported tag structure, and you somehow delay the cache snoop transaction while you arbitrate and wait for a turn in order to go access the tags. For instance, you can multiplex the one port of the tags every other cycle: one cycle is for the main processor, one cycle is for the snoop transaction.

So, to have a little more of a block-diagram view of this: we have a bus with multiple processors, and our snoopy cache, which actually has to see all of the I/O traffic and all of the main processor traffic across this bus. That's a lot of bandwidth, because essentially you might have to broadcast all of your actions from one processor to all the other processors. But we'll be looking at techniques to reduce the requirements of this broadcast to a subset of the data. Okay, so, questions so far on adding snoop ports, before going into protocols?
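The every-other-cycle multiplexing scheme mentioned above might be sketched like this. It's a toy scheduler, not a hardware model, and the function and request lists are made up for illustration.

```python
# Sketch of time-multiplexing one tag port: even cycles go to the
# main processor, odd cycles to delayed bus-snoop transactions.

from collections import deque

def run_tag_port(cpu_requests, snoop_requests, cycles):
    """Return which requester owns the single tag port each cycle."""
    cpu, snoop = deque(cpu_requests), deque(snoop_requests)
    schedule = []
    for cycle in range(cycles):
        if cycle % 2 == 0 and cpu:       # even cycle: processor's turn
            schedule.append(("cpu", cpu.popleft()))
        elif cycle % 2 == 1 and snoop:   # odd cycle: snoop's turn
            schedule.append(("snoop", snoop.popleft()))
        else:
            schedule.append(("idle", None))
    return schedule

sched = run_tag_port([0x10, 0x20], [0x40], cycles=4)
print(sched)
# [('cpu', 16), ('snoop', 64), ('cpu', 32), ('idle', None)]
```

The snoop at `0x40` simply waits a cycle for its turn; that waiting is exactly the delay-and-arbitrate trade-off described above, bought in exchange for keeping the tags single-ported.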