Okay, so: multiple logical communication channels. All the different message types we just talked about (there are, like, ten not drawn here) can't all go on the same network at the same time. If you have one interconnection network, or one logical network, you can end up with cases where responses get queued behind requests that will create more traffic. In these protocols we have multiple phases of communication: we send messages from point A to point B, then from B to C, then from C back to B, and then B back to A, or something like that. So if you try to overlay this onto an interconnection network and you do head-of-line processing, meaning you process the message at the head of the queue first, it's very common to have a new response sneak in which is dependent on some other transaction that's happening somewhere else in the system, or to have a new request come in that sits in front of a response, and all of a sudden you're going to have a deadlock.

One way people go about solving this is to look at all the different types of messages that you have and figure out which types can coexist on the network at the same time. A good example here: if you look at a relatively basic cache coherence protocol, there are maybe twelve different message types. You can try to shrink that down to maybe three or four different classes of traffic, chosen so that those classes, when intermixed, will never introduce a deadlock. If you wanted to be really safe, you'd just have twelve separate networks, or twelve logically different networks. But that's expensive to build, so people try to come up with equivalence classes of the different traffic and then only have that number of networks. Either way, you have to segregate these flows into different logical or physical channels to make sure deadlock can't happen.
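To make that concrete, here is a minimal sketch in C. It is not from the lecture: the message-type names and the three-way split (requests, forwarded requests, responses) are illustrative assumptions about what such a classification might look like.

```c
/* Hypothetical sketch (not the lecture's protocol): collapse many
 * coherence message types into a few traffic classes, each carried on
 * its own logical network.  The names and the three-way split are
 * illustrative assumptions. */
#include <stdio.h>

typedef enum {
    MSG_GETS, MSG_GETX, MSG_PUTX,          /* requests to the directory  */
    MSG_FWD_GETS, MSG_FWD_GETX, MSG_INV,   /* forwarded by the directory */
    MSG_DATA, MSG_ACK, MSG_NACK            /* responses: terminal        */
} msg_type_t;

typedef enum { NET_REQUEST, NET_FORWARD, NET_RESPONSE } net_class_t;

/* Twelve-ish message types, three logical networks: a response can never
 * be queued behind a request, because they travel on separate channels. */
static net_class_t classify(msg_type_t t) {
    switch (t) {
    case MSG_GETS: case MSG_GETX: case MSG_PUTX:
        return NET_REQUEST;
    case MSG_FWD_GETS: case MSG_FWD_GETX: case MSG_INV:
        return NET_FORWARD;
    default:
        return NET_RESPONSE;
    }
}

int main(void) {
    printf("GETX travels on logical network %d\n", classify(MSG_GETX));
    printf("DATA travels on logical network %d\n", classify(MSG_DATA));
    return 0;
}
```

The usual design rule is that handling a message in one class may only generate messages in later classes, and responses are always sunk unconditionally, so the dependence chain between classes can never form a cycle.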
Another thing that I wanted to point out here is that we just distributed our memory, so we need to start worrying about the memory ordering point: for a given memory address, where do you go, and in what order do memory operations happen? Just like in a bus-based protocol, where you have some sort of arbiter which determines the transaction order, or who wins the bus, in a directory-based system the directory for a given address is typically the ultimate arbiter. So if two messages come in from different cores, both going for write access to a particular address, one of them is going to get to the directory first, and whichever gets to the directory first ultimately wins. Sometimes this can be unfair, so some of these systems try to have something which prevents the core that is close to the directory from always winning access to that contended line, we'll say; some sort of back-off protocol comes into effect there. But this is what effectively guarantees our atomicity: the directory guarantees that a particular line can only be transitioning from one state to another state at a given time.

So we have the directory as the ordering point, and whichever message gets to the home directory first, let's say, wins. Subsequent requests for that line are going to lose. So what do you do on a loss? Well, the directory is probably going to send a negative acknowledgement, or NACK (sending a retry is the same thing here). It's going to say: I can't handle this right now; this line is already being transitioned; someone else won the transition of this line; go retry that memory transaction again in the future. Now, this gets pretty complicated back at the cache controller, because it's going to get a retry, or a NACK, back, and it could potentially have a message coming in for that exact same cache line. It needs to give the directory's message priority, in this case, over the pipeline of the main processor, and order its own retry after that message, because the directory won; the directory is the ultimate ordering point for the memory location.
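Here is a minimal sketch of that serialization at the directory, again in C. The state names and fields are illustrative assumptions rather than the lecture's exact protocol; the point is just that a request arriving while the line is mid-transition gets NACKed and must retry.

```c
/* Hypothetical sketch of the directory as the ordering point for a line.
 * A request that arrives while the line is mid-transition is NACKed; the
 * loser's cache controller must retry later.  State names and the
 * single-winner bookkeeping are illustrative assumptions. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { DIR_STABLE, DIR_TRANSIENT } dir_state_t;

typedef struct {
    dir_state_t state;   /* is a transition already in flight?   */
    int         winner;  /* core whose request is being serviced */
} dir_entry_t;

/* Returns true if the request wins, false if it is NACKed. */
static bool directory_request(dir_entry_t *e, int core) {
    if (e->state == DIR_TRANSIENT) {
        printf("core %d: NACK, line busy (core %d won), retry later\n",
               core, e->winner);
        return false;
    }
    e->state  = DIR_TRANSIENT;   /* one transition at a time: atomicity */
    e->winner = core;
    printf("core %d: wins, transition begins\n", core);
    return true;
}

static void directory_complete(dir_entry_t *e) {
    e->state = DIR_STABLE;       /* done; the next requester can win */
}

int main(void) {
    dir_entry_t line = { DIR_STABLE, -1 };
    directory_request(&line, 0);  /* core 0 arrives first and wins     */
    directory_request(&line, 1);  /* core 1 loses and is told to retry */
    directory_complete(&line);
    directory_request(&line, 1);  /* core 1's retry now succeeds       */
    return 0;
}
```

A real controller would also need to bound the retries (the back-off mentioned above) so a core far from the directory isn't starved.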
Finally, you have to worry about forward progress guarantees. What I mean by this is that it's pretty easy to build a memory system where you pull the data into your cache (your cache controller pulls it into your cache), but before you're able to do the load or store to that actual address, it gets bumped out of your cache. All of a sudden you have a cache line just bouncing between two caches, and that's a livelock scenario. So in directory-based coherence systems (and this also happens with bus-based protocols, though there it's usually less likely) you have to have some sort of forward progress guarantee. The reason it's less likely on a bus is that in a directory-based system the communication time is usually longer, so the window of vulnerability is longer.

A forward progress guarantee means that once a line gets into your cache, you need to do at least one memory operation to it before you relinquish the line back to the directory controller. So if you're doing a load, you actually bring the line into your cache shared and do the one load; you don't respond with a reply to, let's say, an invalidation request from the directory controller until you've done your one memory transaction. Likewise, if you bring a line in modified, do that one store before you cough the data back up, or release the data back to the memory controller. That's really important for making sure you have forward progress in the system and don't end up with a livelock.
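A minimal sketch of that rule follows, using a one-flag-per-line scheme that is my illustrative assumption rather than any real controller's design: invalidations for a line are deferred until the local core has completed one access to it.

```c
/* Hypothetical sketch of a forward-progress guarantee: the cache
 * controller defers an invalidation for a line until the local core has
 * completed at least one access to it.  The per-line flags are an
 * illustrative assumption. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool valid;
    bool used_once;      /* has the core done its one guaranteed access? */
    bool inv_pending;    /* invalidation arrived before that access      */
} line_t;

static void core_access(line_t *l) {
    l->used_once = true;                 /* the guaranteed load or store */
    if (l->inv_pending) {                /* now honor the deferred       */
        l->valid = false;                /* invalidation and release the */
        l->inv_pending = false;          /* line back to the directory   */
        printf("deferred invalidation applied after one access\n");
    }
}

static void recv_invalidate(line_t *l) {
    if (!l->used_once) {
        l->inv_pending = true;     /* defer: stops the line bouncing     */
        printf("invalidation deferred: core hasn't used the line yet\n");
    } else {
        l->valid = false;          /* safe to give the line up now       */
    }
}

int main(void) {
    line_t l = { .valid = true };
    recv_invalidate(&l);  /* arrives before the core's access: deferred  */
    core_access(&l);      /* one access completes, line is then released */
    return 0;
}
```

This is exactly the "do one access before you cough the line back up" guarantee from the lecture.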
Okay, so now we're going to get into the more future-looking stuff here. We've been talking about what's called a full-map directory, where there's a bit per processor core. If you have 1,000 cores in the system, that's a very large bit map, and it's pretty uncommon that a thousand cores will all be reading one particular cache line. So that can be wasteful, and a full-map directory grows at order N, which is not particularly good. People have looked into different protocols here, and I just want to touch on them; this is largely an area of research and future directions.

One idea is a limited pointer based directory. Instead of having a bit mask as the sharer list, you have base-two-encoded core numbers, up to some limit. This is why it's called a limited directory, or limited pointer directory: you can't have all the nodes in the system on the list, because there are only so many entries, but you can name any node, because you have a base-two encoding of its number. If the sharer set gets bigger than the list (say you have five entries and a sixth sharer wants a read-only copy of the line), there's usually an overflow bit which says there are more sharers than the entries can name. When that bit is set and the line transitions to modified, you're going to just send a broadcast message, an invalidate, to every single cache in the system. Usually this is a good trade-off, because it's pretty uncommon to have extremely widely shared lines; it's an interesting trade-off of storage space versus sending more messages in the system.

Likewise, there's an interesting protocol here called LimitLESS. Same idea, a limited directory with an overflow bit, but if it overflows you start to keep the sharer list in software, in a structure in main memory. This requires you to have very fast traps, so that when an overflow happens while you're servicing a cache line, you interrupt the main processor and the main processor provides the rest of the sharer list. And there's a bunch of stuff in between, and future research is still being done actively in this space.
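Here is a minimal sketch of such a directory entry in C. The sizes (five pointers, core IDs up to 10 bits) are illustrative assumptions; the point is the storage-versus-messages trade-off.

```c
/* Hypothetical sketch of a limited pointer directory entry: a few
 * base-two-encoded sharer IDs plus an overflow bit.  Five pointers and
 * 1,024 cores are illustrative choices, not from any real machine. */
#include <stdint.h>
#include <stdio.h>

#define MAX_PTRS 5

typedef struct {
    uint16_t sharer[MAX_PTRS];  /* base-2 encoded core numbers: 10 bits
                                   names any of 1,024 cores, where a full
                                   map would need 1,024 bits per entry   */
    uint8_t  count;             /* pointers currently in use             */
    uint8_t  overflow;          /* set once sharers exceed MAX_PTRS      */
} limited_dir_entry_t;

static void add_sharer(limited_dir_entry_t *e, uint16_t core) {
    if (e->count < MAX_PTRS)
        e->sharer[e->count++] = core;
    else
        e->overflow = 1;  /* can no longer name them all; LimitLESS would
                             instead trap here and keep the rest of the
                             sharer list in software in main memory      */
}

static void invalidate_sharers(const limited_dir_entry_t *e) {
    if (e->overflow) {
        printf("overflow set: broadcast invalidate to every cache\n");
    } else {
        for (uint8_t i = 0; i < e->count; i++)
            printf("invalidate core %u\n", (unsigned)e->sharer[i]);
    }
}

int main(void) {
    limited_dir_entry_t e = {0};
    for (uint16_t c = 0; c < 6; c++)   /* sixth sharer trips the overflow */
        add_sharer(&e, c);
    invalidate_sharers(&e);
    return 0;
}
```

With 1,024 cores, a full map costs 1,024 bits per entry, while five 10-bit pointers plus an overflow bit cost 51; the price is a broadcast (or a LimitLESS-style software trap) on the rare widely shared line.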
Beyond simple directory coherence, we have some pretty cool on-chip coherence. This is why this is actually being studied a lot right now: people built these massively parallel directory-based machines in the past, and they got some use and were very good for some applications. But now we're starting to put more and more cores onto a given chip, so we start to see on-chip coherence, and people are figuring out how to leverage the fast on-chip communication alongside directories to make more interesting coherence protocols. There's also something called COMA systems, or Cache Only Memory Architectures, where instead of having the data live in main memory, you don't really have main memory at all, and the data and directories move around. That's beyond the scope of this course; look at the KSR-1 if you want to read about that kind of machine. Research on how to scale the sharer list is also active.

I briefly wanted to talk about the most up-to-date versions of these things. We have the SGI UV 1000, which is a descendant of the Origin and Origin 2000 machines from SGI. Lots of cores here: 2,560 cores that are all kept cache coherent, using a directory-based coherence protocol. It's very non-uniform, and it's all connected together by a multi-chassis 2D torus. So this is one chassis, and there can actually be up to eight of these. Princeton has one of these machines in the HPCRC center that I think is four chassis, which is about half the maximum size.

An on-chip example here is the TILEPro64, which had 64 cores, and each of the cores is itself a directory home and runs a directory-based protocol. I was talking about dividing communication traffic into different flows: well, there were three different memory networks here, so we had to come up with three different classes of traffic that would not deadlock among themselves, if you will. There are also four memory controllers connected into the interconnect, and because of this the communication latencies differ. A core here talking to this memory controller is very fast, whereas for a core over there it takes longer; but maybe that core is close to the other memory controller. So there's non-uniformity here, too.

Okay, so this is our last slide of the term: beyond ELE 475. If you want to go on and do more, well, start reading some papers from the different computer architecture conferences. The proceedings of the International Symposium on Computer Architecture, ISCA, is a good place to start; that's probably the top major architecture conference in the field. The International Symposium on Microarchitecture, MICRO, is the top microarchitecture conference, so look there if you're trying to look inside a processor at some of the smaller microarchitectural details. ASPLOS, Architectural Support for Programming Languages and Operating Systems, is a really good conference that covers the boundary between software and hardware. And HPCA, High Performance Computer Architecture, looks at, or used to look at, bigger high-performance computer systems, but now a lot of mainstream computer architecture work ends up in that conference as well. Beyond reading: go do research, build chips, build some test FPGA designs, and learn more about parallel computer architecture. And indeed, you can come back in the fall and take ELE 580A, which is going to be a graduate-level primary-sources paper reading course, a more traditional graduate course where you'll learn about all the different parallel computing systems. It's called parallel computation because it covers both parallel computer architecture and the programming models that hook together with those architectures, parallel programming together with architecture, because they go very much hand in hand. So let's stop here for today, and stop here for the course.