Okay. So, today we're talking about parallelism and synchronization. From a parallelization perspective, we're worried here about parallel programming and how to build computer architectures that can simultaneously run multiple programs, or multiple threads within one program. And today, we're going to talk about some models of that synchronization, and some primitives to help you solve it: synchronization primitives and memory fences.

This is a little bit of background, or motivation, for why we're going to multiple processors nowadays. We show here Moore's Law plotted on a log plot: the number of transistors going up exponentially. Our sequential performance has sort of gone up, and then stopped at this point. If you go look at the new Core i7 or something like that, it's kind of flattened out here; it isn't continuing to go up. And in fact, the numbers that they actually publish for sequential SPECint are actually parallel numbers these days, because they're using multiple cores to make SPECint and SPECfp go a little bit faster. And clock frequency has also flattened out; we didn't get to our 10GHz processor. Largely, that is due to us hitting some power challenges, but it's also just incredibly hard to build and design these very high frequency processors. So instead, we started to use these more and more transistors to implement multiple cores. We started off with two cores, then four cores, and we've gone up from there. Intel, I think, is now selling an eight-core part, and a ten-core part is coming out soon. AMD has a sixteen-core part, but it is not a true sixteen-core part; I think it is two eight-core dies that are glued together on a chip. And there's some other, weirder stuff up here: some actual many-core, multicore processors with more cores. But people built multiprocessors before the multicore revolution. Back in those days, there were multiprocessor systems: you could take multiple chips and put them together somehow into small systems or large systems. And for the rest of the class, between now and the end of the term, we're going to be talking about parallel computing systems, both systems that have multiple cores on one chip and systems with multiple chips. There are some subtle differences between the two of those things, but a lot of commonality.
In today's lecture, we're going to focus on symmetric multiprocessors and memory systems, on sharing data using shared memory. There are other ways to share data, and we're going to touch on that in two lectures when we talk about messaging in more detail. I bring up symmetric multiprocessors because it's the simplest model to reason about: all processors are up here, and memory is down here. Let's assume there are no caches in the system to begin with, everyone is equidistant from memory, and it's one big memory. And you have multiple processors, which could either be running multiple threads in one program, or multiple programs.

A couple of other interesting things: these concurrency challenges don't come up just from having multiple processors. There are other things that can communicate with memory, or communicate via memory. We have all these different I/O controllers down here -- network cards, disk controllers, graphics cards -- and they also want to read and write memory. So, in your memory system, even on a uniprocessor system, there are multiple agents that want to read and write memory simultaneously.

Okay. So, let's talk about synchronization. The dining philosophers problem is an example of a synchronization challenge, and we're going to talk about two major synchronization ideas, though there are more than are shown on the slide; mainly there are sort of broadcast models and other things like that. For right now, when I say synchronization, it's some way to synchronize communication, or arbitrate communication, in a restricted fashion. So, we're going to have two here. We're going to talk about producer-consumer; that's the figure on the right here. A producer, as the name implies, produces some values, and a consumer consumes the values. This is the most basic form: one producer, one consumer. You could think about having one producer and multiple consumers, multiple producers and multiple consumers, or multiple producers and one consumer; a lot of the same ideas hold. So, that's one model that people like to use. Another model that people like to use is that there's some shared resource, and you want to make sure that no more than, let's say, one processor is trying to access that shared resource at a time. This resource could be a disk, a graphics card, or it could actually be a location in memory. We're going to call this mutual exclusion, and the reason it's called mutual exclusion is that access is exclusive: only one entity can access the resource at one time.
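As a concrete illustration of mutual exclusion, here is a minimal sketch using POSIX threads. The shared counter and the worker function are hypothetical stand-ins for whatever the protected resource actually is; this is not code from the lecture.

    #include <pthread.h>

    /* A shared resource -- here just a counter -- that only one
     * thread may touch at a time. */
    static long shared_counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* enter the critical section   */
            shared_counter++;             /* exclusive access to resource */
            pthread_mutex_unlock(&lock);  /* leave the critical section   */
        }
        return NULL;
    }

Without the lock, two threads incrementing concurrently can lose updates; with it, exactly one thread is inside the critical section at any moment.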
Now, that's the most basic form: only one processor, one entity, can access the resource at a time. A more general form is that some fixed number of processors can access the resource at one time. This is the more general semaphore, which we'll be talking about a little bit later. What I mean by that is, you could have P processors -- let's say P is twenty -- and a resource that no more than two processors can access at the same time. You might say, well, why would you want to do that? A good example is something like a network card that has two outbound queues. You can actually have multiple processors using it, but if you try to have three processors use that one network card at the same time, well, there are only two outbound queues; it has to decide, somehow, how to share those. So that's the example of a true semaphore versus just a mutex (there's a counting-semaphore sketch at the end of this discussion).

Okay. So, let's go through some code examples, because these are going to be instructive. Actually, before I do that: you can have synchronization challenges even on a uniprocessor system. How is that possible? Well, as I said, even in a uniprocessor system, you sometimes have multiple concurrent entities. It could be your disk drive doing DMA into memory at the same time as a process is trying to read that block, for instance, and you need to guarantee that the block has been completely transferred before the processor goes and tries to read it. That's one way it happens in a uniprocessor system. Another way it happens is that uniprocessor systems can many times be multiprogrammed: the operating system will time-slice between different programs, and, in a preemptive environment at least, you'll get interleaving of instructions from different processes. So, you might be running one process, the timer tick goes off, you stop that process, you switch to another process and start executing code from it. At some point, the timer tick goes off again, and you switch back to the first process. So, even on a uniprocessor system, you can have synchronization challenges.
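Before moving on, here is the counting-semaphore sketch promised above, using POSIX semaphores. The network-card framing and the function names are illustrative assumptions, not an actual driver API.

    #include <semaphore.h>
    #include <stdio.h>

    /* Counting semaphore initialized to 2: at most two threads may
     * hold an "outbound queue" at once; a third blocks in sem_wait()
     * until one of the first two calls sem_post(). */
    static sem_t queue_slots;

    void send_packet(int thread_id)
    {
        sem_wait(&queue_slots);   /* acquire one of the two slots */
        printf("thread %d is using an outbound queue\n", thread_id);
        sem_post(&queue_slots);   /* release the slot             */
    }

    int main(void)
    {
        sem_init(&queue_slots, 0, 2);  /* pshared = 0, initial count = 2 */
        send_packet(1);
        sem_destroy(&queue_slots);
        return 0;
    }

A mutex is just the special case where the count is one.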
Okay. So, let's look at a producer-consumer example, and let's look at the abstract picture first. You have a producer here and a consumer there. We're going to use memory -- one big symmetric block of memory -- for all of the communication at this point. We have a queue between the producer and the consumer, shown here, and then we have a head and a tail pointer. It's a FIFO, a first-in, first-out queue: the head and the tail chase each other, and it's circular, so they wrap around at some point. You have a register here on the producer, and you can move the tail pointer into this register. And then we have three registers over here: the tail pointer register, the head pointer register, and a register for the value we receive. The reason I point out these different registers is to show that the producer owns this register, and the consumer owns these three registers. As you can see here, this says Rtail and that says Rtail; they're going to have different copies of Rtail, and those could potentially get out of sync. We're going to see some fun hijinks happening.

Okay. So, let's look at the basic sequence of code here, where the producer wants to put an item x into the queue for a consumer to read sometime in the future. The first thing it does is read the tail pointer into its register. This is a load-store architecture, so it then does a store of the value x to the tail address, which puts it here. Now we have to bump the tail. We can't do this atomically, so we add one to it in our own register. And then we store this back into the tail pointer in memory. Seems simple enough.

Let's look at what the consumer does. The consumer loads the head. The first thing it really wants to do is figure out whether the head equals the tail, because that would mean the queue is empty and it just needs to block. So it loads the tail pointer -- head pointer in this register, tail pointer in that register -- and checks whether they're equal. If they are equal, it jumps up here, reloads the tail pointer to see if it got updated by the producer thread, and just spins forever until the tail pointer is not equal to the head pointer, which means there's some data available. Then we fall through. In the fall-through case, we load from the head into R, into this register; that's the read. And then we update the head pointer to the next location, save it off, and do something with the value we got; the process step is just "do something."
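In C, the sequence just described might look like the following minimal sketch. The names queue, head, tail, and QSIZE mirror the figure but are assumptions, and the volatile qualifier only stops the compiler from caching tail in the spin loop; it says nothing about hardware memory ordering, which is exactly the problem discussed next. The four operations we're about to number are marked in the comments.

    #define QSIZE 16

    int queue[QSIZE];          /* the shared FIFO storage           */
    volatile int head = 0;     /* consumer's read index, in memory  */
    volatile int tail = 0;     /* producer's write index, in memory */

    /* Producer: put item x into the queue. */
    void produce(int x)
    {
        int Rtail = tail;              /* load tail into producer's register */
        queue[Rtail] = x;              /* (1) store x at the tail            */
        Rtail = (Rtail + 1) % QSIZE;   /* bump the tail in the register      */
        tail = Rtail;                  /* (2) store the new tail to memory   */
    }

    /* Consumer: spin until data is available, then take one item. */
    int consume(void)
    {
        int Rhead = head;              /* load head into consumer's register */
        while (Rhead == tail)          /* (3) reload tail; spin while empty  */
            ;
        int R = queue[Rhead];          /* (4) load the value at the head     */
        head = (Rhead + 1) % QSIZE;    /* publish the new head               */
        return R;                      /* "process(R)" would go here         */
    }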
So, this program is a little naive, because we've been talking about out-of-order processors and out-of-order memory systems. This code assumes a perfect world, where instructions are executed in order and there's no interleaving in the memory system between the consumer and the producer, and vice versa. So, does anyone think there are any problems with this? Does this work fine? Yup. Okay. So, you're saying that if the consumer runs before the producer does anything, it might just try to take values off the queue. Well, let's guard against that and say the start condition is head pointer equals tail pointer; then that won't happen, you'll sit spinning here. Now, because no one's jumping up and screaming at this point, you're all assuming that these operations happen in order. But we've talked about out-of-order machines which reorder loads relative to stores. We've talked about very interesting interleavings. We've talked about out-of-order memory systems. So much for me jumping up and down and asking: what do we guarantee about loads and stores to different addresses? We said all we know is that a store and a load that are serialized in that order, on the same processor, to the same address, will happen in order. Our out-of-order superscalars and the out-of-order memory systems we've talked about define nothing about ordering between processors, or even between memory operations to different addresses within one processor. So, that's what we're really going to be talking about today: how those instructions get reordered, and how to prevent that from causing havoc.

Okay. So, let's look at this and number these purple instructions: the two stores and the two loads. The reason we look at those in particular is that the stores are the instructions that update the state the consumer thread is reading from, and the loads are the two instructions that read that state. The numbering is: (1) the store of x into the queue, (2) the store of the new tail pointer, (3) the consumer's load of the tail pointer, and (4) the consumer's load of the value. The programmer assumes that there's some level of causality here. Incorrectly, the programmer is assuming, if you look at this code, that if this load, three, happens after this store, two, then there's some relation between one and four; namely, that one has happened by the time four happens.
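Written as a standalone litmus test, that assumed causality is the classic message-passing pattern. Here is a sketch in C11 atomics, with memory_order_relaxed standing in for "no ordering guarantees"; the variable names slot and tail are illustrative.

    #include <stdatomic.h>

    atomic_int slot = 0;   /* stands in for the queue entry: store (1), load (4)  */
    atomic_int tail = 0;   /* stands in for the tail pointer: store (2), load (3) */

    void producer_thread(void)
    {
        atomic_store_explicit(&slot, 42, memory_order_relaxed);  /* (1) */
        atomic_store_explicit(&tail, 1,  memory_order_relaxed);  /* (2) */
    }

    int consumer_thread(void)
    {
        while (atomic_load_explicit(&tail, memory_order_relaxed) == 0)
            ;                                                    /* (3) */
        /* (4): with relaxed ordering, this load may legally return 0,
         * not 42 -- nothing orders (1) before (2), or (3) before (4). */
        return atomic_load_explicit(&slot, memory_order_relaxed);
    }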
That is not a good assumption in our memory systems, and especially not in our out-of-order memory system. Because, if we recall, the out-of-order memory system said nothing about store ordering. The only thing we talked about was having a load to the same address as a store and not moving that load past that store. Otherwise, these two stores can very easily be reordered by an out-of-order processor or an out-of-order memory system. Likewise, over here, these are just two loads, and those loads can happen completely out of order in an out-of-order computer and an out-of-order memory system. No guarantees there. So, this assumption is bad in a multiprocessor system. In a uniprocessor system it might make sense; in a multiprocessor system, it's a bad assumption.

So, let's look at some sequences that are problematic. Namely, let's say two happens, then three, then four, then one. What happens then? Well, two updated the tail. Three reads the value of that new tail pointer. But at this point, we haven't actually stored the value into the queue, because one never happened. We execute four and it just reads garbage from memory. And then, finally, the value gets updated. Totally valid in the out-of-order memory systems that we've talked about up to this point; nothing wrong there. Okay, now let's say four and then one happen: the consumer loads the head, then loads the value, reading garbage; then the value gets written, and then two and three happen. This is just completely out of order; there are no good semantics here. So, this piece of code, which looks like it should have -- which the programmer would assume to have -- good semantics, basically has no useful semantics unless we define what those semantics are.
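As a preview of where the memory-fence discussion is headed, here is one hedged way to pin down those semantics in C11: a release store on the tail paired with an acquire load in the consumer. Explicit fences between the numbered operations would work the same way. This is a sketch of the idea, not the lecture's own code.

    #include <stdatomic.h>

    atomic_int slot = 0;
    atomic_int tail = 0;

    void producer_thread(void)
    {
        atomic_store_explicit(&slot, 42, memory_order_relaxed);   /* (1) */
        /* Release: (1) is guaranteed visible before (2) is. */
        atomic_store_explicit(&tail, 1, memory_order_release);    /* (2) */
    }

    int consumer_thread(void)
    {
        /* Acquire: (4) cannot be hoisted above this load. */
        while (atomic_load_explicit(&tail, memory_order_acquire) == 0)
            ;                                                     /* (3) */
        return atomic_load_explicit(&slot, memory_order_relaxed); /* (4) now sees 42 */
    }

The release/acquire pair restores exactly the causality the programmer was assuming: if (3) observes (2), then (1) has happened by the time (4) happens.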