We move on to our final topic of ELE 475, which is directory-based cache coherence. Let's start with a little bit of a warm-up. Recall the three Cs of caches: compulsory misses, capacity misses, and conflict misses. We're going to add a new miss type here, the coherence miss. A coherence miss occurs when some other cache, or some other entity, reaches down into our cache and invalidates something there. This is strictly different from compulsory, capacity, and conflict misses. If anything, it looks closest to a compulsory miss, because it's effectively a first miss, except that some other entity bumped the data out of your cache. But it's not really any of those three. This communication coming from other cores, for instance in a snooping protocol on a symmetric shared-memory multiprocessor, is traffic that comes in and actually bumps things out of your cache, and you need to worry about it. Now we're going to take these coherence misses, which we talked about briefly a couple of lectures ago, and put them into two categories of misses.
True sharing misses are the misses that would still occur even if the cache block length, or cache line size, were exactly one byte, the minimal size you can have on your machine. A true sharing miss means you're actually sharing data. If one processor writes some data, and another processor wants to read that data and needs to pull it into its cache, that's a true sharing miss; that communication has to happen. To contrast that, we talk about false sharing misses. A false sharing miss is one that would disappear if you reduced the block size down to, say, one byte or one word. If you run the same program and the miss occurs with the larger block size but not when the block size is one, then that is a false sharing miss. The block size was too big, and you had two pieces of data in the same cache line that are effectively causing sharing, even though there was no true sharing of the data going on. It just so happens they're packed into the same cache line.
Now, it's a little bit broader than that. We're also going to say that false sharing can happen when data gets moved around or invalidated, and even though that data may be shared later in the program, that exact miss was not caused by data being communicated. So this is a little broader than what we said last time. It still happens because there are two pieces of data packed into the same line, but what I'm trying to get at is that you can have data bouncing around between different caches, where the same instruction sequence, the same sequence of loads and stores, would not cause the miss with a very small cache line size, but does cause it with a large cache line size. So let's look at an example and try to categorize these different misses. The initial conditions are that X1 and X2, two words of data, are packed into the same cache block, or the same cache line. P1 and P2 have both read the data, so at the beginning of time it's readable in both caches. First, we do a write to X1. What do we have to do? Well, we're going to have to invalidate X1 in P2. This is a true sharing miss, because the data was in both caches and we need to actually invalidate the other copy; this is real shared data.
The next thing that happens is P2 executes a read of X2. What you may notice is that at the beginning of time X2 was in the cache of P2, but it got bumped out. P1 never went and wrote X2, so this is a false sharing miss. The earlier write brought the line exclusive into P1's cache, and this read is going to pull it back out of that cache. We call it a false sharing miss because X1 was irrelevant to P2 for this miss. Next we see another write to X1. P2 didn't actually touch X1, so likewise this is completely false sharing. Then we see a write to X2; we didn't see any real communication going on here either, so this is also a false sharing miss. Finally, we have something that is real: we read X2 after X2 was written on the other processor, so we are actually communicating data. This is true sharing, and that's okay. But we want to try to minimize these false sharing patterns. This is just a warm-up to motivate directory-based coherence a little bit. So let's motivate this a little more and look at something like an online transaction processing workload. This is a multiprocessor database workload, and it uses threads. We're going to run the same workload on a four-processor system with four different cache sizes.
This data is taken from a paper referenced in your book. What you'll notice is that as you increase the cache size, the false sharing and true sharing misses don't really change. You still need to communicate data, and you're still going to get false sharing: making the cache bigger didn't change the block size, so you're still going to get the same false sharing patterns. But as you increase the cache size, the instruction misses, the capacity misses, the conflict misses, and the compulsory misses, which we call cold misses here, all go down, because the cache is getting bigger. So for the non-shared cache lines, the number of memory cycles, the amount of time you spend on memory misses, is going down, but the rest is not changing. Hmm, okay, this is interesting. So the second question that comes up is: what happens if we increase the number of cores in our system? This is a relatively small system here, so let's plot the number of cores from one to eight for the same workload and look at what happens. Something else is invariant here. This is for a fixed cache size, and now we're plotting the number of processors along the bottom. The number of memory cycles per instruction due to instruction misses, conflict and capacity misses, and compulsory misses doesn't change; it stays the same, because that behavior is basically uniprocessor-based.
But as you add more cores, you get both more true sharing and more false sharing. Hmm, this is a little scary, because our performance is basically going down as we add more cores. And this is only up to eight cores; what happens if we're way out at 100 cores? What do we think is going to happen? Well, our performance is probably going to be dominated by the true sharing, the false sharing, and these coherence misses. So we're going to start thinking about that, figure out how to solve it, and think about the scalability of coherence.