So let's think about the snoopy protocols that we've talked about, our bus-based protocols, and their performance and asymptotic performance requirements. What are the challenges of a snooping protocol? As we discussed before, as you add more processors to the system, you have more entities shouting on one shared medium. And you need to hear all of that shouting. You can't just ignore some shout, because you need to snoop it against your local cache. Whenever another core takes a cache miss, you need to snoop that against your cache and make sure that you don't have a copy of that line, or that you invalidate it, or, for instance, write back some data, or make whatever other adjustment the invalidation coherence protocol requires.

What's annoying about this is the amount of bandwidth you require on your bus: every cache miss needs to go across that bus, everyone needs to look at it, and everyone needs a port with enough bandwidth to check all of those transactions against their cache. So if we want to keep up with the same bandwidth of cache misses per core as we add more cores to the system, the bus is going to have to grow order N, where N is the number of processors. You can compute this: say every core has the same cache miss rate; then the total traffic is just that per-core rate multiplied by N, and each core has to snoop all of it. That will be fine when N is eight, but if N goes to a thousand or a million, you're going to have some serious problems; that's a very, very big bus. And it's not just raw bandwidth: you also need to arbitrate for the bus somewhere, and you need atomic transactions going across that bus, so even if you have a very high-bandwidth bus, you may not have enough free cycles to actually operate the bus.

A solution to this is something we're going to call directory cache coherence, or directory protocols. The key idea in a directory protocol is that instead of broadcasting your invalidations to every other core, or every cache, in the system, you go talk to a location that we're going to call the directory. And this directory is going to keep track of which caches have that data.
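As a concrete sketch of what the directory might keep, here is a minimal C representation of one directory entry, assuming a simple MSI-style protocol and a bitmask sharer list; the field names, the three states, and the 64-cache limit are illustrative assumptions, not any particular machine's format:

```c
#include <stdint.h>

/* Hypothetical directory entry for one cache line (MSI-style).
 * Assumes at most 64 caches so the sharer list fits in one word. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;   /* who may hold the line, and how                 */
    uint64_t    sharers; /* bit i set => cache i may hold a copy           */
    uint8_t     owner;   /* valid in DIR_MODIFIED: the one exclusive owner */
} dir_entry_t;
```

With a sharer list like this, an invalidation only has to visit the caches whose bits are set, rather than being broadcast to all N of them.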
What's nice about this is that now, if you take a cache miss, you can go ask the directory: who has this cache line? And if only one other core has the cache line, say readable, and you're trying to take it with exclusive access because you're trying to write to it, you only need to invalidate that one location instead of sending a message to all N processors in your system. So we've cut down what was a broadcast system into a point-to-point system. The overhead we now have to pay is that we need to track, in a directory, all of the caches which could have a particular cache line in them. We'll go through a much more complicated example of that, but that's the overall key idea. This is going to turn what was a broadcast into point-to-point communication, and we can use point-to-point interconnects for this.

Another good point of scalability here is that you can actually have multiple directories. You don't have to have one big monolithic directory. Instead, you can segment the address space somehow, and depending on the address you have, you go to a different directory. And by going to these different directories, you can actually increase the bandwidth to your directories.

Okay, so let's see how this fits into a block diagram. We have CPUs trying to communicate with other CPUs via shared memory, and they check their cache first. Before, on a miss, they would have had to cross a bus, and everyone would have had to look at that traffic. Instead, in our directory cache coherence scheme, the cache sends a message to the directory controller associated with that address. The basic directory controller here is going to keep track, for every single line in memory, of the list of caches which could potentially have that piece of data; we're going to call that the sharer list. And right now, if we look at this, it might still be a uniform communication network. Let's say you have some omega network in here: anyone can talk to anyone else, and the latency through it is fixed. So this is still a uniform memory access system; we didn't have to go non-uniform here.
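Before going on, here is a hedged sketch of what such a directory controller might do when a core misses because it wants to write a line. It builds on the dir_entry_t sketch above; send_invalidate, send_writeback_request, wait_for_acks, and send_data are hypothetical stand-ins for whatever messaging the real interconnect provides:

```c
#include <stdint.h>

/* Stand-ins for the interconnect; a real system implements these. */
void send_invalidate(int cache, uint64_t line);
void send_writeback_request(int cache, uint64_t line);
void wait_for_acks(int n);
void send_data(int cache, uint64_t line);

/* Hypothetical handler: cache `req` wants exclusive (write) access to `line`. */
void handle_write_miss(dir_entry_t *e, int req, uint64_t line)
{
    if (e->state == DIR_SHARED) {
        int n_inval = 0;
        for (int i = 0; i < 64; i++) {          /* point-to-point, not broadcast */
            if (((e->sharers >> i) & 1) && i != req) {
                send_invalidate(i, line);
                n_inval++;
            }
        }
        wait_for_acks(n_inval);                 /* collect invalidation acks     */
    } else if (e->state == DIR_MODIFIED) {
        send_writeback_request(e->owner, line); /* fetch dirty data from owner   */
        wait_for_acks(1);
    }
    e->state   = DIR_MODIFIED;                  /* requester becomes sole owner  */
    e->sharers = 1ULL << req;
    e->owner   = (uint8_t)req;
    send_data(req, line);                       /* finally reply with the line   */
}
```

Note that only the caches on the sharer list ever see a message; everyone else's snoop port stays idle.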
So, in a system like this, no cache is necessarily closer to or farther away from any particular piece of memory. This is kind of our naive directory cache coherence protocol. But what's still nice here is that we don't have to broadcast. With, let's say, an omega network, or a mesh network, or something else on the inside here, we send from this cache directly to this directory controller. If no other cache has the line, readable or writable or anything like that, the controller can just respond back with the data from memory. If not, instead of broadcasting an invalidate to all other cores, the directory can just say: oh, this cache and this cache have copies; I need to send two messages, one to this cache and one to that cache, to invalidate them, wait for the responses, and then reply back with the data. So we can decrease the bandwidth that we use in the common case across our interconnection network.

Now I'm going to show a slightly different picture, which is pretty similar to the previous one. What you'll notice is that the memory and the directory are now connected to an individual CPU. So why do we do this? Well, if you're building one of these scalable systems, some sort of supercomputer, it might be a good property that as you add more CPUs to the system, you also add more RAM, and maybe more directory storage or something like that. Another positive of a design like this is that this CPU is now actually close to this memory bank, and we can try to take advantage of that. So one question comes up: how can we take advantage of that? Anyone have any thoughts?

Okay, so shared data, we don't know where it's going to be accessed; it could be accessed by all six CPUs and all six caches here. But it's very common that the stack for your program is only going to be accessed locally, and the instruction memory for your program is only going to be accessed locally. So you can potentially get performance benefits by putting the instructions and the stack, and maybe even some portion of the heap, close to this core, because then you can access them really quickly, and only shared data has to go across the interconnect. And in fact that has a fancy name: systems where some data is close and some data is far away are called Non-Uniform Memory Access, or NUMA, systems. And you might actually see this even in your desktop processors, which are moving towards NUMA systems.
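Coming back to the per-node directories for a moment: one simple way to decide which node's memory and directory are "home" for a given address is to interleave the physical address space across the nodes. Here is a minimal sketch under that assumption; the 64-byte line size and six nodes mirror the figure, and real machines may interleave at coarser granularities or use lookup tables instead:

```c
#include <stdint.h>

#define LINE_BITS  6   /* assume 64-byte cache lines                       */
#define NUM_NODES  6   /* six CPU/memory/directory nodes, as in the figure */

/* Illustrative home-node computation: cache lines are interleaved
 * round-robin across nodes, so the directory (and the memory) for
 * `addr` live on home_node(addr). */
static inline int home_node(uint64_t addr)
{
    return (int)((addr >> LINE_BITS) % NUM_NODES);
}
```

Because different addresses map to different home directories, misses to different lines can be serviced in parallel, which is exactly the directory-bandwidth benefit mentioned earlier.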
And some of them already are. I believe the AMD chips today are already NUMA systems, even on a single chip with multiple dies or something like that; there are actually two NUMA nodes inside of them, one for one memory controller and one for the other memory controller. So if you go into something like Linux, you can actually see the NUMA configuration it exposes, for instance under /sys/devices/system/node or per-process in /proc/<pid>/numa_maps, and it'll tell you the layout of the different memories. And then the OS can take advantage of this: it can put, for instance, the stack and the instruction memory for a program that's being used by a particular core close to that core, and then maybe make some other placement choice for other data.

Now, I want to make a point here: just because the latency to memory is different does not mean that your system is a directory-based cache-coherent NUMA system. You can still have non-directory-based systems where some memory is close and some memory is far away. You could still have, basically, a bus, or maybe some other interconnection network in there which is still a snooping protocol, or effectively a snooping protocol, but where some data is close and some data is far away. If you see this in the literature, though, people talking about directory-based cache-coherent NUMA systems will usually call them CC-NUMA, or cache-coherent NUMA, systems. That usually means a cache-coherent non-uniform memory access architecture, and it usually implies a directory-based protocol, though there may be other protocols people are using out there as well.

Okay, so I want to go back one slide here, and I wanted to finish off by talking about one topology which is interesting. The difference between these two slides is that we went from a CPU here to CPUs. So this is a multi-core chip now, and where this gets interesting is that you might have a directory-based cache coherence system connecting multiple chips, but then inside of a chip you may have something like a bus-based snooping protocol. So we actually mix and match these two things.
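As a hedged illustration of how software can act on this topology, here is a small C program using Linux's libnuma API; numa_available, numa_max_node, numa_alloc_onnode, and numa_free are real libnuma calls, but treating node 0 as "local" is just an assumption for the sketch:

```c
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>   /* Linux libnuma; compile with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    printf("NUMA nodes: 0..%d\n", numa_max_node());

    /* Place a 1 MiB buffer on node 0; threads running on node 0's
     * cores can then reach it without crossing the interconnect. */
    size_t sz = 1 << 20;
    void *buf = numa_alloc_onnode(sz, 0);
    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    /* ... use buf from threads pinned to node 0 ... */
    numa_free(buf, sz);
    return 0;
}
```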
And the way we go about doing this is: if cores inside of this one chip want to go get data from each other, they can just effectively snoop on each other. But outside of that, your cache controller, or maybe your L3 cache for this particular chip, is going to respond to messages coming from other directories, like invalidation requests, and do something about them. So there's basically a transducer there between a directory-based cache coherence protocol and a bus-based snoopy protocol. And this is pretty common these days, especially given that you have a fair number of multi-core chips showing up and being used in these directory-based cache coherence systems. And we'll talk about one of them at the end, actually: the SGI UV system, or UV 1000, uses off-the-shelf Intel parts, modern-day sort of Core i7 parts, mixed together with a NUMA, directory-based coherence system to connect all the chips together. So there's a transducer from the chip's external snoop bus protocol to the directory-based coherence protocol.
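To tie the two levels together, here is a hedged sketch of what such a transducer might do when a directory invalidation arrives at a chip; broadcast_snoop_invalidate and send_ack are hypothetical stand-ins for the on-chip snoop bus and the off-chip network, not any vendor's actual interface:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two sides of the transducer. */
void broadcast_snoop_invalidate(uint64_t line);   /* all on-chip caches snoop this */
void send_ack(int home_directory, uint64_t line); /* reply to the home directory   */

/* Transducer at the chip boundary (e.g., in the L3 controller):
 * translate a point-to-point directory invalidation into an
 * on-chip snoop broadcast, then acknowledge the home directory
 * so it can complete the remote core's miss. */
void on_external_invalidate(int home_directory, uint64_t line)
{
    broadcast_snoop_invalidate(line);
    send_ack(home_directory, line);
}
```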