Okay, so now we get to talk about issues of sequential consistency and how to actually go about implementing these things. So going back to our valid interleavings, we start to ask ourselves: how do we build a sequentially consistent model with out-of-order processors, or out-of-order execution of memory? Well, we already said in our out-of-order system that we can reorder loads to different addresses, and we can reorder loads to the same address; it doesn't really matter. Well, if you take that, all of a sudden that just breaks sequential consistency. We said we could reorder a store and a load in our out-of-order processor if they were to different addresses. We said we could reorder a load and a store the other way. We said we could reorder stores with stores. So out-of-order memory systems and out-of-order processors effectively break sequential consistency. So how do you implement a processor which is useful and is out of order? And we want to go out of order for performance reasons. Also, caches start to become a big problem here for performance. If we want to have true sequential consistency, you're going to have to somehow figure out how to not store any data in your cache.
Because the second that you store data in your cache, you have a visibility problem. Let's say it's a write-back cache: if you do a store to your local cache and you write a one to it, and some other processor tries to read that location, well, it's not going to be able to see that data. So we need to be very careful here about when data is actually visible to other processors if you have a cache involved in a multiprocessor system. So we are going to move toward somewhat weaker, or relaxed, memory models. As I said last class, almost no processor actually implements a truly sequentially consistent memory system; they're all weaker than that. Now the question is how weak do you want to go, and this starts to come up because it's a trade-off between what the programmer can wrap their head around versus, effectively, performance in an out-of-order system. So what we can do is think about going to weaker memory systems, and when we do that, we are going to make the programmer decide when a load and a store can be reordered, or when a store and a store can be reordered, or when a load and a load can be reordered. So let's look at these four different cases independently here: reordering a load with a load, reordering a load with a store after it, reordering a store with a load after it, and reordering two stores.
And note, we're not talking about the same memory addresses here at all. We are talking about two different addresses, but sequential consistency says that even for different addresses we need to worry about this. So how many of these extra dependencies can we remove? We're going to introduce this notion of a memory fence, or, as we sometimes call it, a memory barrier operation, which is a serialization point. I'll be careful what I say here: my reference is actually a serialization operation which says that all of the memory accesses, or at least the named memory accesses, before a certain point have completed before the memory fence completes, and that all of the operations after it have not started or completed before the memory fence completes. So this really is a barrier on your memory operations. The reason I'm being a little bit hesitant here is that the basic memory fence says all memory operations before the memory fence complete, and none after the memory fence have started when the memory fence completes. But people have introduced even weaker versions of the memory fence over time, for performance of course. So you start to see things like load-only memory fences, which act as a barrier against all the previous loads.
Or you have uni-directional memory fences, which say: only look at the stores that happen after this point, and don't allow any of them to happen before this point. So people have actually introduced instructions in modern architectures for all of these things. And if you actually want to go look at a really weak memory model, go read about the Itanium memory model; it's extremely weak, and they need to have fences all over the place. The trade-off here, the performance trade-off, is how many fences you need to put in versus the performance you get from reordering memory operations. So let's name a few of these things, because everyone likes naming. These are computer architecture research names, but they are effectively common across the industry at this point. One of them is called total store ordering. Total store ordering is not a big step away from sequential consistency. In total store ordering, loads do not get reordered with other loads, loads do not get reordered with other stores, and stores don't get reordered with other stores. But a store followed by a load? That can be reordered. So a load can be moved above a store here. And if you want to enforce the order between the store and the load, you need to write this: you need to do a store word, then a fence, then a load word.
And this guarantees that this load is going to happen after that store in the processor. Partial store ordering is a little bit weaker than that. In partial store ordering, loads with other loads don't get reordered, and loads with later stores don't need fences, but stores followed by loads, which is up here, and stores followed by stores both need fences to guarantee ordering. And then finally we get this class of weak ordering, and frankly, most processors these days actually fall into some category of weak ordering; there are different questions about how weak it is, but we're not going into that level of detail here. Here, you actually need a fence to enforce all of these orderings, so it's kind of interesting to think about. The nice thing about fences is that, while these operations are expensive and are big serialization points, you only need to pay for them when you care about the ordering. When you don't care about the ordering, the computer architecture can reorder things and make everything go faster. It's only exactly when you care about it that you need to introduce the fences. One thing I was going to say is that some compilers will do some of this work for you and introduce the fences for you. And usually one of the ways that's done is through atomic operations.
So if a compiler sees something like a lock operation, or a P or a V, in your code, that's sort of a special signal to the compiler: it knows it needs to make sure the memory operations have all been serialized at that point. Usually atomic operations like test-and-set are also fences in their own right, which helps, but sometimes compilers can even figure out where other fences are needed for other memory addresses. And one of the things that I want to say here is that fences don't actually take an argument; a fence does not take an address to fence on. The reason for this is that what we really want is to make sure all the previous memory operations complete, because in the question of consistency we haven't said anything about addresses; we just said previous memory operations in one thread don't move after a subsequent memory operation. Okay, so let's look at an example of where we might need to introduce different levels of fences. Here we have our producer-consumer model. One of the things we were worried about in our original example was these two stores getting reordered. Well, we can force that ordering by putting a memory fence operation in here and saying it's a store-store fence. No stores can pass any other stores, and this will guarantee that these happen in order.
But in this case, well, this one actually already has an arc, so that's not going to get reordered, because this register is dependent there. But let's go over here and take a look. If we have full sequential consistency, we would not be able to reorder these two loads. Well, for performance, maybe we want to reorder those loads, because one of the input registers isn't available; it's the result of a long multiply, for instance, or something like that. So you might want to reorder in your out-of-order processor for performance reasons. In true sequential consistency you wouldn't be able to do that. But with this weaker memory model we can reorder them, and then guarantee correctness by introducing a fence operation here. And this fence operation is going to say: these loads need to complete before this load starts. So you can guarantee that if our head equals our tail, you know at that point that you're safe to go execute the rest of this code. Any questions about that so far, about the fences and why it's kind of pay-as-you-go, if you will?