And Dijkstra came up with this naming nomenclature. So, the implementation of semaphores: this idea is all well and good, but we still need some way to come up with mutual exclusion. Believe it or not, you can actually do this with just loads and stores, and we're going to talk about that a little later in class. But it's really pretty slow. There are algorithms that can do this with just loads and stores, but from an implementation perspective, people typically try to go faster. So if you want to go faster, it might be helpful for your computer architecture, your instruction set architecture, to come up with a special instruction which allows you to do something atomically. When we say do something atomically, we mean that nothing else can be interleaved in the meantime. The simple solution here is to add an atomic operation which can do a read, a modify, and a write, all atomically, with nothing else interleaved. So, going back to our locks and semaphores here: inside this P and inside this V, you need to somehow atomically modify S. You need to be able to read S and to write S, if you were going to implement this with code inside the implementation of P and inside the implementation of V.
So the primitive here that makes this faster is having some read-modify-write operation, and we're going to look at a couple of different choices for it. One of the most basic ones is test-and-set. x86 has this instruction; it's one of the most basic atomic operations on x86, and lots of other architectures have an instruction which does this. We write it here as a piece of code, but in reality this is pseudocode: in reality this is an instruction which will actually do this atomically relative to the whole rest of the system. So let's look at what test-and-set does. It's probably one of the most primitive atomic operations you can do. I've seen something slightly simpler, a test-and-clear, which is a little bit easier to implement. But largely the idea is that you're going to look at a memory address, and if that memory address holds what you expect it to, then write some value. Implicit in this piece of code is that the code which executed the test-and-set needs to know whether it succeeded and was able to write the value, or not. So there's a status code, and that's returned in this register R here.
So if we look at this piece of code, you pass into it a memory reference, a memory address, which is the address of the semaphore or the address of the lock, and you pass into it a register name. We do a load into that register, and that happens unconditionally. This part here, though, is the test part. The test is: we check whether the value we just read from the memory address is zero, and if so, we write a one to that memory address. Sometimes these test-and-set operations will actually allow you to write something other than one, but for right now this is the more basic version: it writes a one to that memory address. What's important here is that the architecture guarantees that all of this happens atomically. But what's interesting is, if you think about the processors that we built, this requires a load and a store. We'll talk about how to implement this in a few slides, but you basically have to stop everything else, stop all other loads and stores happening in the system, do this load, do the test, and possibly do the store, in hardware. One of the more basic ways this was done, on some of the original x86-based architectures, is something called the lock bit on the bus.
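The semantics just described can be sketched in C. The function names are mine, and the first version is deliberately NOT atomic; it only illustrates the data flow that real hardware executes as one indivisible instruction:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the test-and-set pseudocode: an unconditional load,
 * a test against zero, and a conditional store of one. On real
 * hardware this whole body is one indivisible instruction; this
 * plain-C version only shows the data flow. */
int test_and_set_model(int *addr) {
    int r = *addr;   /* unconditional load into "register R" */
    if (r == 0)      /* the test: was the location zero?     */
        *addr = 1;   /* the set: write a one                 */
    return r;        /* status: 0 means the set happened     */
}

/* C11 exposes a genuinely atomic version via atomic_flag;
 * it returns the previous value of the flag. */
bool test_and_set_atomic(atomic_flag *f) {
    return atomic_flag_test_and_set(f);
}
```

Note that the returned status is just the old value: reading back 0 tells the caller its write went through.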
So what they did is, when one processor was going to execute an atomic operation like this, a test-and-set for instance, it would actually broadcast to all the other processors in the system saying: I'm about to do this, do not touch anything, do not issue any memory instructions, or at least any bus-based memory instructions. It would do the load, do the test, do the set, and then release the broadcast bit. This was a wire on a multi-drop bus. That's a pretty big hammer, and things have gotten a little better since then, but it gives you an idea of a basic hardware implementation. Things got a little more sophisticated; these fancier atomic operations sort of showed up in the '70s. So, this whole box here is still an atomic operation: all the code in here happens atomically. What it basically does is take a memory address and add something to it. We're going to add Rv to R and put that back into memory at the same memory address, so we do the load and the store here atomically. It's a read-modify-write operation as well, with all of it atomic. Note that this still returns the original value into register R here.
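That operation is fetch-and-add, and C11's `atomic_fetch_add` has exactly this shape (the wrapper name here is mine):

```c
#include <stdatomic.h>

/* Fetch-and-add: atomically add `inc` to *addr and return the
 * ORIGINAL value, the way the lecture's register R receives the
 * old contents. The read, add, and write are one atomic step. */
int fetch_and_add(atomic_int *addr, int inc) {
    return atomic_fetch_add(addr, inc);
}
```

Getting the old value back is what makes this usable for semaphore counts or ticket locks: each caller learns what the counter was just before its own increment.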
And sometimes that's actually very useful: if you're trying to implement a semaphore based on this, you kind of want to see what was there before. There are some jokes around fetch-and-add, though. The atomic-operation community pushed this to the extreme at some point, and people started having fetch-and-insert-something-here: fetch-and-multiply, fetch-and-divide, things like that, and people started building some stranger machines that had fetch-and-something-else. There's an old joke that someone was going to implement an instruction called fetch-and-FFT, an atomic Fourier transform. I don't think it ever got to that point, but in the early microcoded microprocessor days, there were some pretty sophisticated atomic operations. That's largely been decided not to be a good direction to go. You want a little bit of complexity in your atomic operations: maybe test-and-set is a little too simple, but some of these fancier fetch-and-FFTs are probably a little too complex. Here's another interesting operation that is atomic, and out of this one you can actually build some pretty interesting semaphores. It's called swap. It takes a register value and a memory location, and it takes what's in the memory location and what's in the register and swaps the two.
And note, we use a temporary here, because you can't actually swap two things without a temporary. This operation is actually not very common, because it's kind of hard to use. What's a little more common is something called compare-and-swap, which sits somewhere between a test-and-set and a swap operation. That's what's commonly used, and we'll talk about it in a second. Okay, so now we go back to the multiple-consumer problem that we had, where multiple threads were trying to dequeue from a queue, and let's introduce test-and-set and look at our critical section here. We're going to use the term mutex to basically mean a semaphore that only one thread can enter at a time. What happens at the beginning is that there's a tight loop, and this tight loop does the test-and-set on the memory location of the mutex, reading into a temporary register and comparing against zero. Remember, we defined test-and-set as writing a one to memory if the location held a zero, and it returns the old value. So if the value we read back is a one, the test failed: the set did not happen, and we just sit and spin here. We call this busy waiting, or spinning, and the construct is called a spinlock.
If the memory location held a one coming into here, that means someone else already locked it. If it held a zero coming in, that means no one had locked it, but we have now locked it and set it to one. So if we acquired the lock, we atomically set the location to one and read back a zero, and we drop into our critical section and start executing it. Inside the critical section we can do stores, we can do loads, we can do whatever we want, because we're guaranteed that no other processor, no other thread, is doing anything to it. Then at some point we need to do the release, and for that we actually don't need a test-and-set; we just need a store. A store simply clears that memory location, and some other thread is then safe to go and fall into the critical section. Okay. This can be done with lots of different implementations. We showed it here with test-and-set, but it can be done with swap, and it can be done with fetch-and-add; the code just gets a little more complex. But one of the interesting questions that comes up is: what happens if some thread running along acquires the lock, falls into the critical section, and then, let's say, terminates or gets swapped out for a very long period of time? What's going to happen here? A system crash?
Well, we don't reference bad memory here; this is more of a hang. It's going to freeze. This was actually pretty common in older, non-preemptive operating systems. Back in the day of the original Mac OS, before Mac OS became Mac OS X (so versions nine and earlier), it was a cooperative operating system, and if someone grabbed a lock and then died or hit an error case, the whole machine would crash. If you're preemptive, meaning there's a timer going off, it's possible you still won't be able to make forward progress, but at least the OS can interrupt your process and try to do something about it; it can detect that one process is hung. But in a cooperative multi-threading environment, if one process hogs the processor and never gives it back, or just dies (say you grab the lock in one process and die, and then another process comes along and tries to acquire the lock), that other process just sits there spinning in this little spin section forever. You're basically just going to hang the processor; you're not going to see anything else happen. So you've got to be pretty careful, especially in cooperative multi-threading environments.
So one of the tricks here is that if you're in a cooperative multi-threaded environment, you'll typically put something inside the spin loop which will yield to a different thread, so someone else can try to get something done in that time. Because if you just sit there spinning, it's very possible that you won't be fair, or that no one else will ever get to run and unlock the lock. So you've got to be a little careful there. Okay, so we're going to move forward here and look at a slightly different version of the same piece of code, but now with compare-and-swap. Before we do that, let's talk about the semantics of compare-and-swap. Compare-and-swap takes a memory location and two registers. It compares memory with one of the registers; if they're equal, it takes the other register, puts it into memory, effectively swapping them, and returns a status. If the compare fails and we fall through to this else statement here, we don't do any swap. So this is effectively taking a register and a memory location and swapping them, dependent on another register. It's a little bit more powerful than plain swap, and a little harder to implement, as you might imagine, though probably not as hard to implement as something like fetch-and-add.
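These compare-and-swap semantics map directly onto C11's `atomic_compare_exchange_strong` (the thin wrapper and its name are mine; the C11 call additionally writes the observed value back into its `expected` argument on failure):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Compare-and-swap: if *addr holds `expected`, store `desired`
 * and report success; otherwise leave memory alone and report
 * failure. The whole check-then-store is one atomic operation. */
bool cas(atomic_int *addr, int expected, int desired) {
    return atomic_compare_exchange_strong(addr, &expected, desired);
}
```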
The reason is that if you want to do this operation out in the memory system, you can basically send it out as one atomic operation: go to main memory, put a comparator out there, compare, and swap the two things. If instead you have these increments, or adds, or arbitrary instructions, you need to put an adder out there for sure; this only needs a comparator. It's still painful to implement, but it might be a little better than putting a full adder in your memory controller, because that's almost like duplicating your ALU out in your memory subsystem, which doesn't make a whole lot of sense. That's why people used to joke about the fetch-and-FFT. Okay, so let's take a look at the same case here, the multi-reader piece of code. But this time we're not going to have a spin lock at the beginning. Instead, we're going to have a critical section, except that we can concurrently execute the critical sections from different threads, at which point we shouldn't really call it a critical section anymore. We'll say instead that we concurrently execute what was in the other thread's critical section, and then we atomically try to swap out the head pointer and replace it. So we're effectively doing speculative work here, and this is called non-blocking synchronization.
So, in this synchronization primitive, we do the loads; we talked about that before. We check to make sure there's at least something available, and we pull the head pointer off. This is all speculative work at this point, because we have not checked that we can validly pull off the head of the queue. We increment the head pointer into a temporary that we'll call new head, a register. And then we try to swap the new head with the old head atomically. So it takes the register that we think holds the new head and the memory location where the head pointer lives, and we try to swap the two. The thing we're dependent on here is that what we think is the old head still is the old head; if something bad happens, we can try again. So the idea is that we spin in this effectively bigger loop, speculatively trying to change the head. One question that comes up is: what if the head was dequeued from, then added back onto, and it just so happens that the head pointer was changed to B and later changed back to A, all in this short period of time? This is called an ABA problem. Yeah, that might be a problem in this piece of code.
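A sketch of that retry loop for a hypothetical ring buffer (the queue layout, the names, and the -1 empty sentinel are my assumptions, not the lecture's code): read the head slot speculatively, then publish the new head with a compare-and-swap, retrying on failure.

```c
#include <stdatomic.h>

#define QSIZE 64
int buffer[QSIZE];           /* hypothetical shared ring buffer */
atomic_int q_head, q_tail;   /* ever-increasing indices         */

/* Speculatively read the head slot, then try to publish the new
 * head atomically; if another consumer moved the head first, the
 * compare-and-swap fails and we loop back and try again. */
int dequeue(void) {
    for (;;) {
        int old_head = atomic_load(&q_head);
        if (old_head == atomic_load(&q_tail))
            return -1;                          /* empty (sentinel) */
        int value = buffer[old_head % QSIZE];   /* speculative work */
        int new_head = old_head + 1;
        /* succeeds only if q_head still equals old_head */
        if (atomic_compare_exchange_strong(&q_head, &old_head, new_head))
            return value;
        /* else: someone else dequeued first; retry */
    }
}
```

Using an index that only ever increases, and taking it modulo QSIZE just for the array access, sidesteps the ABA hazard until the integer itself wraps around.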
You might need to think about that and carefully reason about it. Usually you can protect against it in some other way, for instance where the pointer wraps around. If you think about a normal array used as a circular buffer, when it wraps around you put something which is not a compare-and-swap on it: more of a spin lock, or a spin-loop sort of lock. So you only have to worry about this in circular buffers when something wraps around, because otherwise there's no other way the head pointer can get back to where it was before. But you do need to worry about the case where you wrap all the way around the array in a very short period of time. So there are some race conditions that can happen with these non-blocking synchronization operations. Another non-blocking synchronization primitive, one that I think is actually really cool, goes by lots of different names: load-locked, load-linked, or load-reserved, paired with store-conditional. You take your ISA, your instruction set, and you add two new instructions: a special load, and a store-conditional which checks to make sure that no one else wrote to that same address in the intervening time. I can explain that one more time; it's a little confusing. But basically the idea is a flag register here.
And in hardware, you do a load-link. It just loads the value into your register, and on the side, somewhere next to the register (it may not be architecturally visible, and probably shouldn't be), it sets a bit. You execute your critical section, and when you come back to do the store for this sort of read-modify-write operation, the store checks that the flag is still set to one. If it's still one, that means, to say it completely, that no one else did a store-conditional to that address in the intervening time. So no one else tried to do a store update. Other people may have tried to do a load, but no one else can have stored to that address, or store-conditional'd to that address. Because if someone else did a store-conditional, what happens is that it cancels the other processors' reservations: it sets their flag bits to zero, atomically. This is a pretty powerful primitive, such that you can basically do a load, and then try to do a store later in the future.
If no one else tried to do a store to that same address in the meantime, or a store-conditional to that address, your store goes through. And everybody else who did a load in the meantime gets invalidated, or rather, they'll find out they were invalidated when they go to do their store-conditional. So if we take the same example here and implement it with load-link and store-conditional, we look at the head pointer, where we normally do the update, and we do a load-link into this register. And when we go to do the store-conditional, if someone else has changed this value in the meantime, we get a fail, and we jump back around and have to redo the load-link. This actually gets rid of the ABA problem too, because if you did have that sort of race, someone else would have done a store-conditional to that address, and the wrap-around cases and all that stuff would invalidate your flag on that memory address. One of the reasons people like this is that when you go to do a load-link, or load-reserve, and store-conditional, you can often piggyback it on a memory coherence protocol, such that a load-link adds an extra bit in your memory, we'll say.
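C has no portable load-linked/store-conditional, but the retry shape survives: compilers typically lower `atomic_compare_exchange_weak` to an LL/SC pair on ARM or RISC-V (ldrex/strex, lr/sc), and "weak", like a store-conditional, is allowed to fail spuriously. A sketch of an atomic increment written in that style:

```c
#include <stdatomic.h>

/* LL/SC-style read-modify-write: load, compute, then attempt a
 * conditional store; if anyone else updated the location in the
 * meantime (or the "SC" fails spuriously), redo it from the load.
 * Returns the value observed just before our increment. */
int ll_sc_style_increment(atomic_int *addr) {
    int old = atomic_load(addr);                 /* the "load-linked" */
    while (!atomic_compare_exchange_weak(addr, &old, old + 1)) {
        /* failure refreshed `old` with the current value; retry */
    }
    return old;
}
```

On a contended location this loop may spin a few times, but each iteration is quiet in the uncontended common case, which is the property discussed next.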
So if you're running some sort of memory coherence protocol, the load-link pulls the line into your cache and you can tag that cache line as special; that's effectively the flag bit here, because this whole line was load-linked. And then, if no one else pushes it out of your cache or fetches it from your cache in the meantime, which would mean no one else touched it, then by the time you go to do the store, you know that no one did anything. What's really nice about this is that no communication has to happen between the processors in the common case, with an uncontended lock. Spin locks, by contrast, are generating memory traffic all the time: they're doing loads, doing stores, trying to grab the lock over and over, generating all this traffic all over the place. This is very quiet: you just pull the line into your local cache, and if, by the time you get back to it, it's still in your cache and still has its bit set, great, you get to do it. In the contended case, though, you may actually get traffic ping-ponging here, but you can still guarantee correctness. So this brings us to performance. As I said, some of these blocking atomic operations can generate a lot of memory traffic: they're spinning on memory, so they'll at least be doing lots of loads.
And those loads may be going out to main memory. Non-blocking atomic operations, things like these compare-and-swap operations or load-link/store-conditional (and load-link/store-conditional, as I was alluding to, you can many times run against your own cache), can sometimes have better performance, because you don't have to sit there spinning in tight loops; you can try to do other stuff in the meantime. The downside, at least for the compare-and-swap implementation, is the speculative work: you're sitting there redoing some work, and if you didn't get the lock, that performance can be bad. There are also, as I alluded to at the beginning of class, algorithms that implement mutual exclusion with just ordinary loads and stores. Their performance is not very good, and we're going to look at an example of that, called Dekker's algorithm, today. Finally, the performance of these isn't just dependent on the operations themselves. These hardware primitives have better or worse performance depending on how highly contended they are. By contended I mean that you have multiple threads trying to acquire the lock at the same time.
If you have lots of threads fighting for the lock, that's high contention. If you have a lock that almost never sees contention, you just go to get the lock and you get it; the contended case is really rare. Different locking primitives have better or worse protocols for those situations. And of course, we're going to talk in a little bit about how you make these integrate well with caches, out-of-order processors, and out-of-order memory