1 00:00:03,320 --> 00:00:12,443 Okay. So, let's introduce a model of, and introduce some semantics to describe 2 00:00:12,443 --> 00:00:18,216 ordering of memory operations between different processors. One of the most 3 00:00:18,216 --> 00:00:24,221 common ones that you're going to see is called Sequential Consistencies by gnoming 4 00:00:24,683 --> 00:00:32,291 the only one out there. This is a very strong ordering. compared to any of the 5 00:00:32,291 --> 00:00:37,122 processors you guys run on your computers today, this is extremely strong. None of 6 00:00:37,122 --> 00:00:41,714 your computers actually implement this strong of an ordering. But, it's a good 7 00:00:41,714 --> 00:00:46,366 basis to think reason about and reason about parallel programs. So, we're going 8 00:00:46,366 --> 00:00:50,839 to study sequential consistency sums. So, up here, we have one, two, three, four, 9 00:00:50,839 --> 00:00:55,491 five processors, one big memory, still very basic model. We don't have, we've not 10 00:00:55,491 --> 00:01:00,620 introduced any caches or anything crazy like that yet because that makes all these 11 00:01:00,620 --> 00:01:06,147 things a little bit harder. Either caches are out of order. And, what sequential 12 00:01:06,147 --> 00:01:14,496 consistency is, is the idea that you take all of the instructions and all of the 13 00:01:14,496 --> 00:01:21,837 programs executing on all of the processors, and you guarantee that the 14 00:01:21,837 --> 00:01:33,037 execution sequence that is seen by all of the processors is some valid in order 15 00:01:33,037 --> 00:01:41,057 interleaving of those instructions of the relative processors themselves at the 16 00:01:41,057 --> 00:01:56,403 beginning. So, to give you an example. If you have processor one, and processor two. 17 00:01:56,403 --> 00:02:03,793 And processor one goes one, two, three, four. And processor two goes execution 18 00:02:03,793 --> 00:02:11,644 instruction five, six, seven, eight, you're guaranteed that some intervening 19 00:02:11,644 --> 00:02:18,017 here is what happened in sequential consistency. So, let's say, one and two 20 00:02:18,017 --> 00:02:23,849 happens. And then, five happens, and then three happens, and then six happens, and 21 00:02:23,849 --> 00:02:30,192 then seven, and then eight, and then four. That is a valid sequential consistent 22 00:02:30,192 --> 00:02:38,076 order. Likewise, another one is easier to reason out is one, two, three, four, five, 23 00:02:38,076 --> 00:02:44,615 six, seven, eight. That's valid in sequential consistent, sequential 24 00:02:44,615 --> 00:02:53,901 consistency. Likewise, I'm going to say five, six, seven, one, two, three, four, 25 00:02:54,602 --> 00:03:12,370 eight is valid order. What is not valid is going to be something like five, one, I 26 00:03:12,370 --> 00:03:26,786 don't know. Three, two, four, seven, eight. I see it's missing one. five. No, 27 00:03:26,786 --> 00:03:36,037 five, six, no, six. Okay. We'll put six in there. Because we've reordered one of the 28 00:03:36,037 --> 00:03:46,440 individual address sequences, two and three should be ordered. So, this gives us 29 00:03:46,440 --> 00:03:53,485 some guarantees. so the, the basic idea is that, it's an arbitrary, order-preserving 30 00:03:53,485 --> 00:03:59,247 interweaving. So, each processors addresses' are preserved in order, but can 31 00:03:59,247 --> 00:04:07,306 be interweaved in between the different threads arbitrarily. Okay. I want that to 32 00:04:07,306 --> 00:04:13,548 sink in for a second. Anyone have questions about what the model is? Or, why 33 00:04:13,548 --> 00:04:32,158 this is a good model? Okay. I'm going to, I'm going to ask a few questions here. Why 34 00:04:32,158 --> 00:04:47,250 is this a model that almost no one ever implements? So, the reason no one actually 35 00:04:47,250 --> 00:04:51,963 goes to try to do this is as all the things we talked about when we try out of 36 00:04:51,963 --> 00:04:56,877 our processors, you want to try to re-order your lows in stores for 37 00:04:56,877 --> 00:05:02,771 performance reasons. and the second, you try to introduce something like cache into 38 00:05:02,771 --> 00:05:07,764 your processor, this gets very hard because communication becomes a real 39 00:05:07,764 --> 00:05:13,381 bottleneck. You might have one processor that did a, for instance, a store into its 40 00:05:13,381 --> 00:05:18,998 cache and not everyone has seen that value update. So, unless you actually want to 41 00:05:18,998 --> 00:05:23,991 have one monolithic memory that you play all loads and stores against in some 42 00:05:23,991 --> 00:05:30,737 interleaved yet in order to the processor order, and everyone sees that exact 43 00:05:30,737 --> 00:05:34,610 ordering, you're not going to want to actually implement that. That's very hard 44 00:05:34,610 --> 00:05:47,027 to do. Okay. So, fun, funny example here of sequential consistency. Let's look at 45 00:05:47,027 --> 00:05:54,522 two threads, T1 and T2. And two variables, actually, we'll talk four variables here. 46 00:05:54,522 --> 00:06:01,872 X, Y, X prime, and Y prime. X prime and Y prime are the outputs here so we do stores 47 00:06:01,872 --> 00:06:12,373 to those. At the beginning of time, we are going to say X and Y are initialized to 48 00:06:12,373 --> 00:06:24,503 zero and ten, respectively as shown. Okay. Let's, let's talk about what our valid 49 00:06:24,503 --> 00:06:32,744 sequential consistent outcomes here for X prime and Y prime. Now, this is not 50 00:06:32,744 --> 00:06:38,612 actually easy to detect right off the get go. Does anyone know which of these four, 51 00:06:38,612 --> 00:06:44,409 I think the answer is on your sheet, but don't go look at it. it's n ot actually 52 00:06:44,409 --> 00:06:50,277 easy to detect which one is sequentially consistent, which one is not sequentially 53 00:06:50,277 --> 00:06:55,787 consistent. So, let's, let's walk through a few interweavings here to see what 54 00:06:55,787 --> 00:07:16,273 happens. Okay. So, we're going to have thread one, thread two, X, Y and X, Y 55 00:07:16,273 --> 00:07:30,859 primes. Let's say, we execute T1, and then T2 in time. So we're, we'll draw time this 56 00:07:30,859 --> 00:07:41,631 way. So we do, we do a store of one. We do a store word eleven. And then, we do a 57 00:07:41,631 --> 00:07:55,360 load word of Y. A store word of Y prime, load word of X and a store word of X 58 00:07:55,360 --> 00:08:04,400 prime. We do this store here. at the beginning of time, as I said, X has zero 59 00:08:05,140 --> 00:08:16,194 and Y has ten. We do a store here and this is going to be storing to X one, so we're 60 00:08:16,194 --> 00:08:26,077 going to update the value of X there. Now, we're going to store eleven to Y. Okay. 61 00:08:26,077 --> 00:08:35,624 So, eleven gets loaded there in time. We do this load. Nothing, no, nothing 62 00:08:35,624 --> 00:08:44,350 changes. We do a store here to Y prime and we look at what we got when we did this 63 00:08:44,350 --> 00:08:51,841 load of Y. We loaded eleven and we're going to store eleven to Y prime, at this 64 00:08:51,841 --> 00:09:00,617 point. Then, we do a load of X which has one in it now, and we do a store to X 65 00:09:00,617 --> 00:09:11,771 prime. It's going to have one. So, to sum up, you have eleven and one. Okay. That's 66 00:09:11,771 --> 00:09:19,018 this one. That's a valid output. And what's interesting to see here is, even 67 00:09:19,018 --> 00:09:24,396 with the stitch memory model, we're going to end up with different possible 68 00:09:24,396 --> 00:09:29,080 outcomes. There's not only one possible outcome. There's actually multiple 69 00:09:29,080 --> 00:09:36,780 possible valid outcomes here. Okay. So, let's look at another. Let's use the, the 70 00:09:36,780 --> 00:09:45,699 same columns at the top here and let's say we have these two split so that the first 71 00:09:45,699 --> 00:09:51,434 store happens here and the second, the second store from thread one happens 72 00:09:51,434 --> 00:09:57,022 there. Still sequentially consistent because they're being ordered inside of 73 00:09:57,022 --> 00:10:02,684 the respective threads, and it's a global ordering that everyone sees. Okay. So, 74 00:10:02,684 --> 00:10:12,222 let's, let's go do that one. So we do store word of one. Then, we execute in 75 00:10:12,222 --> 00:10:21,180 thread two load word of R1, Y. Store word of R1, Y prime. Okay. So, let's go to the 76 00:10:21,180 --> 00:10:30,253 store here. The store's going to, oh well, actually, we'll finish writing it first. 77 00:10:30,253 --> 00:10:39,326 Load word of R2 from X st ore word R2, X prime. And then finally, we, we finish off 78 00:10:39,326 --> 00:11:07,523 here writing the store word of one to X. Okay. Oops, it is wrong. Yep. That's what 79 00:11:07,523 --> 00:11:16,644 I do. Okay. Store word's going to happen here first. And, remember we started out 80 00:11:16,644 --> 00:11:21,904 with zero and ten, and zero in, in X and Y. So store word of one is going to 81 00:11:22,422 --> 00:11:28,803 happen. It's going to update X to be one. We do this load and it's going to be at 82 00:11:28,803 --> 00:11:39,189 ten now. And we store ten into Y prime. We do a load word here of X, or we're 83 00:11:39,189 --> 00:11:44,960 basically going to store that back into X prime. So, what was X? We look back at 84 00:11:44,960 --> 00:11:50,509 this column. Most up to date value was one. Okay. So we come over here, and we 85 00:11:50,509 --> 00:12:02,233 say, that's one. And then finally, we do a store word of eleven into Y. Okay? Or Y to 86 00:12:02,233 --> 00:12:08,882 Y, so it gets two, eleven into Y. But what's the output here? Well, the output 87 00:12:08,882 --> 00:12:15,438 is one and ten. And this value, this store, didn't do anything. I mean, it 88 00:12:15,438 --> 00:12:22,368 updated the value of Y but it didn't effect Y prime or X prime in any way, 89 00:12:22,368 --> 00:12:31,844 shape, or form. Okay. So that ends up being one in ten is a valid sequentially 90 00:12:31,844 --> 00:12:40,256 consistent output. Let's try a non-sequentially consistent execution 91 00:12:40,256 --> 00:12:57,302 order and see what happens. Let's say, I don't know, which is a good one to 92 00:12:57,302 --> 00:13:04,032 execute. there's a couple different ways to get this. Let's execute this 93 00:13:04,032 --> 00:13:12,650 instruction here first. So, to store to Y. And we're going to reorder that with a 94 00:13:12,650 --> 00:13:19,886 stored X. And, we're going to execute thread two in between those two 95 00:13:19,886 --> 00:13:45,672 operations. So, we use store word of eleven to Y. So, we've out of, let's say, 96 00:13:45,672 --> 00:13:52,454 T1 is an out of order processor here, for instance. Then, we're going to execute all 97 00:13:52,454 --> 00:14:13,946 of thread two, and we'll execute all of thread two in order. And then finally, 98 00:14:13,946 --> 00:14:24,455 we're going to come back and thread one here and do the store of let's say, one to 99 00:14:24,455 --> 00:14:32,462 X. Okay. So, X and Y start out with zero and eleven, or excuse me, zero and ten, 100 00:14:32,462 --> 00:14:43,393 rather. Store of eleven of to Y, okay that's going to update. This could be 101 00:14:43,393 --> 00:14:53,200 eleven. We're going to load Y now, so we're going to load this eleven into R1. 102 00:14:53,200 --> 00:15:00,163 And then, we're going to store up to Y prime eleven. Then we're going to load, 103 00:15:00,163 --> 00:15:04,268 let's say, X her e. Well, X is just the initial value, so 104 00:15:04,268 --> 00:15:10,000 we're going to get zero into R2, then we're going to store our zero into X prime 105 00:15:10,000 --> 00:15:16,813 there. And then finally, here, we're going to do a store of one to X. So, what's our 106 00:15:16,813 --> 00:15:22,504 output? Well, our output is zero and eleven. And, but we had a non-sequentially 107 00:15:22,504 --> 00:15:28,408 consistent, this is a non-sequentially consistent output here. So, there's no way 108 00:15:28,408 --> 00:15:34,162 we can work through all the different possible combinations. But the only way 109 00:15:34,162 --> 00:15:40,214 you're possibly going to get a zero for X prime and eleven for Y prime is if you 110 00:15:40,214 --> 00:15:45,445 execute non-sequentially consistent execution sequence. And this was a, 111 00:15:45,445 --> 00:15:54,070 because we flipped the ordering of these two instructions, that made this not 112 00:15:54,070 --> 00:16:04,650 sequentially consistent. So, we have a big X over it. We could work through all the 113 00:16:04,650 --> 00:16:08,530 other possible combinations here. But, there's actually another, interestingly 114 00:16:08,530 --> 00:16:12,410 enough, there's actually another way, another interweaving that gets you this 115 00:16:12,410 --> 00:16:16,648 same non sequentially consistently output from a different sequentially consistent 116 00:16:16,648 --> 00:16:20,579 ordering of these instructions. Or, or a different non-sequentially consistent 117 00:16:20,579 --> 00:16:33,214 ordering of the instructions. So that, that can't be true. Okay. So, what this is 118 00:16:33,214 --> 00:16:42,357 really trying to say is that, in our processors we've looked at up to this 119 00:16:42,357 --> 00:16:54,440 point, we've had some arcs. And example of an arc here is that if you do a load into 120 00:16:54,440 --> 00:17:01,852 R1 and then use the value of R1 to do the store, sorry these should be order flipped 121 00:17:01,852 --> 00:17:11,778 for MIPs compatibility. Yeah. I flipped them there, I missed here. Okay. for this 122 00:17:11,778 --> 00:17:17,310 store here to happen, it needs to wait for the value of R1. So, just simple read 123 00:17:17,310 --> 00:17:23,339 after write dependence. Likewise, here you have a read after write dependence. And 124 00:17:23,339 --> 00:17:28,420 it's sequential consistency you can have other dependencies. So for instance, if 125 00:17:28,420 --> 00:17:33,261 you do a load in the store and their to the same address, there needs to be order 126 00:17:33,261 --> 00:17:37,924 between those. Which we've already talked about in this class. What we've not talked 127 00:17:37,924 --> 00:17:42,646 about is additional sequential consistency requirements. So, if you want to have 128 00:17:42,646 --> 00:17:47,129 additional sequential consistency requirements, what it looks l ike is every 129 00:17:47,129 --> 00:17:53,044 memory operation is dependent on the prior memory operations in its own thread. So, 130 00:17:53,044 --> 00:17:59,194 we're going to draw that as a red arc here. And you can sort of see it here, but 131 00:17:59,194 --> 00:18:02,750 also, this is dependent on this, and this is dependent on that, and that is 132 00:18:02,750 --> 00:18:06,648 dependent on that, but sort of through transitivity they, they don't draw all the 133 00:18:06,648 --> 00:18:10,883 arcs. So, some, something to think about. Okay. So, we've got a question that comes 134 00:18:10,883 --> 00:18:15,820 up here. I think we already asked this question but, does a system of caches and 135 00:18:15,820 --> 00:18:21,520 out-of-order processors provide a sequentially consistent unit of memory? 136 00:18:27,080 --> 00:18:33,978 Answer's probably not. It's pretty hard to do. You could potentially try to build a 137 00:18:33,978 --> 00:18:40,916 processor which is out of order and, and do this. But, we talked about breaking and 138 00:18:40,916 --> 00:18:46,317 reordering all of these instructions in an out of order processor. So, we're going to 139 00:18:46,750 --> 00:18:51,863 try to think about what is the current, correct model if you cannot go for full 140 00:18:51,863 --> 00:18:58,416 sequential consistency. You're asking me, is that the Java compiler to guarantee 141 00:18:58,416 --> 00:19:03,241 this ordering. Okay. So, what we're actually talking about here is the 142 00:19:03,241 --> 00:19:08,325 hardware breaking the ordering And just reorder these randomly that the compiler 143 00:19:08,325 --> 00:19:16,237 can possibly never be able to try to control. That's, the, the programmer is at 144 00:19:16,237 --> 00:19:20,376 fault there. The programmer should not be assuming that. [LAUGH] The programmer 145 00:19:20,376 --> 00:19:24,197 assumes, assumes some, basically, programmers try to assume something like 146 00:19:24,197 --> 00:19:28,335 sequential consistency. And what we're saying is that none of the hardware out 147 00:19:28,335 --> 00:19:32,527 there implements sequential consistency. So the programmer can't assume that, it 148 00:19:32,527 --> 00:19:36,959 was a bad assumption on the programmer's perspective. so the compiler, for 149 00:19:36,959 --> 00:19:41,064 instance, the reason why the programmer assumes that is because it's kind of a 150 00:19:41,064 --> 00:19:45,444 rational thing to do. It's like the, the reasonable rational thing to do. But, it's 151 00:19:45,444 --> 00:19:49,823 really hard to implement something fast under those constraints. So, instead what 152 00:19:49,823 --> 00:19:54,202 people do is they try to come up with something weaker than that which still 153 00:19:54,202 --> 00:19:58,472 gives the programmer some simblence of programmability. And that's what we are 154 00:19:58,472 --> 00:20:03,016 going to be talking about the res t of today's lecture and next lecture what are 155 00:20:03,016 --> 00:20:07,500 those simblences of rationality that we can give the programmer. But, this is one 156 00:20:07,500 --> 00:20:12,466 program and this is another program. Or this is one thread and this is another 157 00:20:12,466 --> 00:20:17,126 thread in the same program. But the compiler cannot guarantee this ordering 158 00:20:17,126 --> 00:20:21,936 because the out of order processor will just move things around, disregards the 159 00:20:21,936 --> 00:20:26,720 compiler. We already talked about the out of load processors moving loads past 160 00:20:26,720 --> 00:20:31,446 storers and storers past loads in the hardware. So, and what we're saying here 161 00:20:31,446 --> 00:20:35,758 is if you want to actually implement full sequential consistency, all those 162 00:20:35,758 --> 00:20:40,425 optimizations that the hardware out of order processor want to do are invalid. We 163 00:20:40,425 --> 00:20:49,840 don't want to really do that. okay. So, that's sequential consistency.