Okay, so now we're gonna spend the rest of the lecture talking about different coherence protocols, and their relative merits, on a bus. I wanted to contrast this with what we're gonna talk about in two lectures, where we'll be looking at coherence protocols across switched interconnects, in places where you don't have a shared medium, or a shared broadcast medium.

So as a warm-up here, we're gonna start off by looking at what can happen when I/O transactions happen at the same time as memory transactions in a uni-processor. Let's take a look at where you can have consistency problems in a uni-processor system, as a warm-up and a motivator.

So here we have a processor, and here's its cache, and this is memory. And then, somewhere on the memory bus here, we hang off a DMA agent, or direct memory access agent. Sometimes it's called a bus master, because you can have multiple agents which can effectively drive transactions onto the main memory bus. So if you go look at, for instance, PCI, PCI-X, or PCI Express, which are sort of the extension-card interfaces for your system, they'll use the term bus mastering. What that really means is that there's a DMA engine out at the I/O device. Okay, so what I'm trying to get across here is that you can actually overlap I/O and computation in a uni-processor system.
You can move data from the disk to main memory without having to use the processor. Originally, you would have to use programmed I/O in order to go access the disk: the processor would read an address on the device, pull the data in, and then store it out to memory, word by word. As you can tell, that ties up the processor for the whole transfer, and it's kinda slow. So people decided, let's put dedicated direct memory access engines out at the I/O devices.

This actually goes back to early, early computers. Mainframes had very sophisticated DMA; the System/360, for instance, had programmable DMA engines, its I/O channels, that could effectively run transfers on their own. But the simplest case is basically going to have a set of registers which say where on the disk, where in memory, and how long, and a go button. And you could also do the transfer the other way.

Okay, so let's look at this from a coherence perspective, and let's look at this cache. Let's look at a memory-to-disk transaction. We program the DMA engine to copy this location in memory into this location on the disk, and we tell it to go. Now, while it's doing that, the processor writes to an address inside that page. What happens? The cache is not write-through, and there's no coherence protocol really going on in this case so far.
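The register-plus-go-button interface just described can be sketched as a toy model. This is purely illustrative: the class and register names are made up, and a real DMA engine would be memory-mapped hardware, not a Python object.

```python
# Toy model of the simplest DMA engine described above: registers for
# the disk address, memory address, and length, plus a "go" button.
# All names are illustrative, not any real device's interface.

class ToyDMAEngine:
    def __init__(self, memory, disk):
        self.memory = memory     # bytearray standing in for main memory
        self.disk = disk         # bytearray standing in for the disk
        self.disk_addr = 0       # register: where on the disk
        self.mem_addr = 0        # register: where in memory
        self.length = 0          # register: how many bytes
        self.to_disk = True      # register: transfer direction

    def go(self):
        # The "go button": copy directly between memory and disk,
        # without involving the processor -- or its cache.
        if self.to_disk:
            self.disk[self.disk_addr:self.disk_addr + self.length] = \
                self.memory[self.mem_addr:self.mem_addr + self.length]
        else:
            self.memory[self.mem_addr:self.mem_addr + self.length] = \
                self.disk[self.disk_addr:self.disk_addr + self.length]

memory = bytearray(b"page of data to write out")
disk = bytearray(8)

dma = ToyDMAEngine(memory, disk)
dma.mem_addr, dma.disk_addr, dma.length = 0, 0, 8
dma.go()
print(bytes(disk))   # b'page of '
```

Note that `go()` never touches the processor: that is exactly why a dirty cache line the processor is holding never makes it to the disk.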
Well, hopefully the disk gets the old values. Or maybe it doesn't: maybe some of these cache lines do get written back out to main memory mid-transfer, so the disk gets some of the new values and some of the old values. The point I'm trying to make here is that you don't know what's gonna happen, and that's a little scary. You want determinism in your system.

Now, you can say, well, maybe the processor shouldn't go and write to those memory addresses. That is a valid solution, and it's actually pretty common on modern-day systems: the OS knows when the DMA transfer is in flight, and it will just make sure not to go access those memory addresses. What I was trying to introduce here is that even a uni-processor system can have coherence problems, depending on where the copies of the different addresses live.

You can also have it going the other way, from the disk to main memory, transferring in that direction. Let's say there are some values in the cache of the CPU. This is actually probably the more interesting case: you have some data in the CPU's cache, and the DMA engine writes from the disk into physical memory. But the CPU's cache doesn't pick up the new value. And then the processor wants to go and read that data, like reading a file off a disk or something, and it's gonna get the wrong value. So this moves us to our first idea in our coherence protocols, one that has a funny name: snoopy caches.
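That stale-read hazard can be boiled down to a few lines. This is a minimal sketch, assuming a simple cache model with no snooping at all; the names are made up for illustration.

```python
# Minimal sketch of the disk-to-memory hazard: a cache with no
# snooping happily serves a stale value after a DMA engine writes
# main memory behind its back.

class NoSnoopCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                 # addr -> cached value

    def read(self, addr):
        if addr not in self.lines:      # miss: fill from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]         # hit: cached, possibly stale

memory = {0x40: "old file contents"}
cache = NoSnoopCache(memory)

cache.read(0x40)                        # CPU touches the line; now cached
memory[0x40] = "new file contents"      # DMA writes memory directly
print(cache.read(0x40))                 # stale: "old file contents"
```

The final read hits in the cache and never sees the value the DMA engine deposited in memory, which is exactly the wrong-value scenario above.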
No, this is not named for the dog in the Peanuts cartoons. Instead, Goodman, who I believe is a professor now, and one of his students, came up with the idea that you have the cache snoop on what's going on, and effectively update the cache based on the data that's flying by on the bus.

So, if we look at this from a little bit more of a hardware perspective: we have our cache, and we have the tags, and we have snoop logic looking into the cache, effectively sitting there watching the bus. And if an address that is in the cache flies by on the bus, the cache needs to do something about it. It probably needs to invalidate the address, if it's a write occurring across the bus. You can also do it the other way: if you have a DMA engine which is reading from memory, and the data is dirty in the cache and not in memory, because it's a write-back cache, the cache may need to provide the data to the I/O device that's trying to read from memory, effectively overriding the data coming from main memory. Now, there's also arbitration actually happening on the bus to determine who has the current data. But you have to figure out what the right thing to do is; we'll talk a little bit more in today's lecture about the right thing to do in these interesting corner cases.
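A rough sketch of the invalidate-on-snooped-write idea follows. It's a toy model rather than hardware: the bus broadcasts synchronously, arbitration is ignored, and the dirty-data intervention case for write-back caches is omitted.

```python
# Sketch of snoopy invalidation: every cache watches writes on the
# bus and invalidates its own copy when a matching address flies by.

class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.snoopers = []

    def broadcast_write(self, src, addr, value):
        self.memory[addr] = value            # write reaches main memory
        for cache in self.snoopers:
            if cache is not src:
                cache.snoop_write(addr)      # every other cache checks tags

class SnoopyCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                      # addr -> cached value
        bus.snoopers.append(self)

    def read(self, addr):
        if addr not in self.lines:           # miss: fill from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value             # update own copy
        self.bus.broadcast_write(self, addr, value)

    def snoop_write(self, addr):
        self.lines.pop(addr, None)           # invalidate if we hold it

memory = {0x40: 1}
bus = Bus(memory)
a, b = SnoopyCache(bus), SnoopyCache(bus)

a.read(0x40)          # a caches the old value, 1
b.write(0x40, 2)      # b's write flies by on the bus; a invalidates
print(a.read(0x40))   # a misses and re-fetches: prints 2
```

Without the `snoop_write` step, `a` would keep returning 1 after `b`'s write, which is the uni-processor DMA problem all over again, just between two processors.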
Before we move on here, note that this is getting a little bit hard. Maybe back in 1983 this wasn't so bad, but nowadays we'd have to dual-port our tags. 'Cause in the snoopy protocol, we're gonna need all possible memory transactions that are going on, let's say by one processor and/or a DMA engine, to be verified by every other processor entity in the system. That's a fair amount of bandwidth coming into here. And you could add two ports to the tags, and that's okay if the cache is sort of farther out, but it might slow down your cache if it's the level one cache.

So typically the way that people build this is they'll snoop on the level two cache, and have a level one cache which is not necessarily snooped. Now this is where we get back to inclusive versus exclusive caches. If your level two cache is inclusive of all the data in level one, this is actually not so bad to do, because anything in the level one cache is guaranteed to have its tags in the level two cache. So you don't have to go all the way down to the level one cache for most snoops. If you use an exclusive cache, well, life gets a lot harder, 'cause basically you need to check the level one cache tags too.

So anyway, all I was trying to get across here is that this significantly increases the price of your cache design, your tag design, as you add more ports to it. And it's not actually just an area question, I mean.
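The inclusion argument can be sketched as a tiny snoop filter. The function and tag sets here are made up for illustration; real hardware does this with a tag-array lookup, not Python sets.

```python
# Sketch of why inclusion helps: with an inclusive L2 (the L1's
# contents are always a subset of the L2's), a snoop that misses in
# the L2 tags can be dropped without ever probing the
# timing-critical L1.

def snoop_filter(addr, l2_tags, l1_tags):
    """Return True if the L1 must also be probed/invalidated."""
    if addr not in l2_tags:
        # Inclusive L2: not in L2 implies not in L1 either, so the
        # snoop finishes here, never touching the L1's single port.
        return False
    return addr in l1_tags          # only now do we bother the L1

l2_tags = {0x100, 0x140, 0x180}     # inclusion: l1_tags is a subset
l1_tags = {0x140}

print(snoop_filter(0x200, l2_tags, l1_tags))   # False: filtered at L2
print(snoop_filter(0x140, l2_tags, l1_tags))   # True: must probe L1
```

With an exclusive hierarchy the early-out in `snoop_filter` would be wrong, because a line can live in the L1 without appearing in the L2 tags; that's the "life gets a lot harder" case.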
Adding ports doesn't just make the tag structure larger; it costs you clock cycle performance, especially if it's your level one cache, where the tag check is on a critical path in your processor. One way around all this is that you keep a single-ported tag structure, and you somehow delay the cache snoop transaction while you arbitrate and wait for a turn in order to go access the tags. For instance, you can multiplex the one port of the tags every other cycle: one cycle is for the main processor, one cycle is for the snoop transaction.

So, to have a little more of a block-diagram view of this: we have a bus with multiple processors, and our snoopy cache, which actually has to see all of the I/O traffic and all of the main processor traffic across this bus. That's a lot of bandwidth, because essentially you might have to broadcast all of your actions from one processor to all the other processors. But we'll be looking at techniques to reduce the requirements of this broadcast to a subset of the data. Okay, so, questions so far on adding snoop ports, before going into protocols?
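The every-other-cycle multiplexing scheme mentioned above might be sketched like this. It's a toy scheduler, not a hardware model, and the function and request lists are made up for illustration.

```python
# Sketch of time-multiplexing one tag port: even cycles go to the
# main processor, odd cycles to delayed bus-snoop transactions.

from collections import deque

def run_tag_port(cpu_requests, snoop_requests, cycles):
    """Return which requester owns the single tag port each cycle."""
    cpu, snoop = deque(cpu_requests), deque(snoop_requests)
    schedule = []
    for cycle in range(cycles):
        if cycle % 2 == 0 and cpu:       # even cycle: processor's turn
            schedule.append(("cpu", cpu.popleft()))
        elif cycle % 2 == 1 and snoop:   # odd cycle: snoop's turn
            schedule.append(("snoop", snoop.popleft()))
        else:
            schedule.append(("idle", None))
    return schedule

sched = run_tag_port([0x10, 0x20], [0x40], cycles=4)
print(sched)
# [('cpu', 16), ('snoop', 64), ('cpu', 32), ('idle', None)]
```

The snoop at `0x40` simply waits a cycle for its turn; that waiting is exactly the delay-and-arbitrate trade-off described above, bought in exchange for keeping the tags single-ported.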