We move on to our final topic of ELE 475, which is directory-based cache coherence. Let's start with a little bit of a warm-up. Recall the three Cs of caches: compulsory misses, capacity misses, and conflict misses. We're going to add a new miss type here, the coherence miss. A coherence miss occurs when some other cache, or some other entity, reaches down into our cache and invalidates something there. This is strictly different from compulsory, capacity, and conflict misses. If anything, it looks closest to a compulsory miss, because it's effectively a first miss, except that some other entity bumped the data out of your cache. But it's not really any of those three. This communication coming from other cores, for instance in a snooping protocol on a symmetric shared-memory multiprocessor, is traffic that comes in and actually bumps things out of your cache, and you need to worry about it. Now we're going to take these coherence misses, which we talked about briefly a couple of lectures ago, and put them into two categories of misses.
True sharing misses are the misses that would still occur even if the cache block length, or cache line size, were exactly one byte, the minimal size you can have on your machine. A true sharing miss means you're actually sharing data. If one processor writes some data, and another processor wants to read that data and needs to pull it into its cache, that's a true sharing miss; that communication has to happen. To contrast that, we talk about false sharing misses. A false sharing miss is one that would disappear if you reduced the block size down to, say, one byte or one word. If you run the same program and the miss occurs with the larger block size but not when the block size is one, then that is a false sharing miss. The block size was too big, and you had two pieces of data in the same cache line that are effectively causing sharing, even though there was no true sharing of the data going on. It just so happens they're packed into the same cache line.
Now, it's a little bit broader than that. We're also going to say that false sharing can happen when data gets moved around or invalidated, and even though that data may be shared later in the program, that exact miss was not caused by data being communicated. So this is a little broader than what we said last time. It still happens because there are two pieces of data packed into the same line, but what I'm trying to get at is that you can have data bouncing around between different caches, where the same instruction sequence, the same sequence of loads and stores, would not cause the miss with a very small cache line size, but does cause it with a large cache line size. So let's look at an example and try to categorize these different misses. The initial conditions are that X1 and X2, two words of data, are packed into the same cache block, or the same cache line. P1 and P2 have both read the data, so at the beginning of time it's readable in both caches. First, we do a write to X1. What do we have to do? Well, we're going to have to invalidate X1 in P2. This is a true sharing miss, because the data was in both caches and we need to actually invalidate the other copy; this is real shared data.
The next thing that happens is P2 executes a read of X2. What you may notice is that at the beginning of time X2 was in the cache of P2, but it got bumped out. P1 never went and wrote X2, so this is a false sharing miss. The earlier write brought the line exclusive into P1's cache, and this read is going to pull it back out of that cache. We call it a false sharing miss because X1 was irrelevant to P2 for this miss. Next we see another write to X1. P2 didn't actually touch X1, so likewise this is completely false sharing. Then we see a write to X2; we didn't see any real communication going on here either, so this is also a false sharing miss. Finally, we have something that is real: we read X2 after X2 was written on the other processor, so we are actually communicating data. This is true sharing, and that's okay. But we want to try to minimize these false sharing patterns. This is just a warm-up to motivate directory-based coherence a little bit. So let's motivate this a little more and look at something like an online transaction processing workload. This is a multiprocessor database workload, and it uses threads. We're going to run the same workload on a four-processor system with four different cache sizes.
This data is taken from a paper referenced in your book. What you'll notice is that as you increase the cache size, the false sharing and true sharing misses don't really change. You still need to communicate data, and you're still going to get false sharing: making the cache bigger didn't change the block size, so you're still going to get the same false sharing patterns. But as you increase the cache size, the instruction misses, the capacity misses, the conflict misses, and the compulsory misses, which we call cold misses here, all go down, because the cache is getting bigger. So for the non-shared cache lines, the number of memory cycles, the amount of time you spend on memory misses, is going down, but the rest is not changing. Hmm, okay, this is interesting. So the second question that comes up is: what happens if we increase the number of cores in our system? This is a relatively small system here, so let's plot the number of cores from one to eight for the same workload and look at what happens. Something else is invariant here. This is for a fixed cache size, and now we're plotting the number of processors along the bottom. The number of memory cycles per instruction due to instruction misses, conflict and capacity misses, and compulsory misses doesn't change; it stays the same, because that behavior is basically uniprocessor-based.
But as you add more cores, you get both more true sharing and more false sharing. Hmm, this is a little scary, because our performance is basically going down as we add more cores. And this is only up to eight cores; what happens if we're way out at 100 cores? What do we think is going to happen? Well, our performance is probably going to be dominated by the true sharing, the false sharing, and these coherence misses. So we're going to start thinking about that, figure out how to solve it, and think about the scalability of coherence.