But before we do that, I wanted to talk about programming and a competitor to using shared memory for programming. So far, we have looked at a shared memory architecture to communicate from one core to another core. What that means is that one core does a store to some memory address, and sometime in the future another core goes to read that memory address and gets the data. What's interesting about that paradigm is that the sender of the data, the core which is writing the data or doing the store, does not need to know which core is going to read the data in the future. It's also very possible that no one will ever go read that data, and in a shared memory architecture that doesn't affect anything. One core writes a piece of data into memory at some address, and given this implicit name we effectively have an associative match: you can look up the piece of data based on the address and go find it sometime in the future. Now, you probably need to use locking to guarantee causality between one processor writing the data and another processor reading the data in shared memory, but you don't need to know the destination.

In contrast, we're going to talk about explicit message passing as a programming model. There are lots of hardware implementations you can use to implement this, but let's start off by talking about explicit messaging as a programming model to communicate between different threads, processes, or cores. What we're going to have here is an API, a programming interface, with a send that names the destination and passes in a pointer to some data. Our messaging API is going to take this data and somehow get it to the receiver, the destination. On the receive side, the most basic thing you can do is receive data. Note that this receive does not denote who it's receiving data from. So a common thing to do is just receive in order: receive the next piece of data that comes in. Now, that may not be super useful, so you may want to extend this so that the receive also takes a source. But it's not required. The reason I'm talking about this in such an abstract fashion is that there are many different explicit message passing programming interfaces, because it's just software that people have implemented over the years.
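To make that concrete, a minimal send/receive interface might look something like the sketch below. These function names and signatures are purely illustrative; they are not taken from the lecture or from any particular library.

    #include <stddef.h>

    /* Hypothetical message-passing interface, for illustration only. */

    /* Send len bytes starting at data to the process identified by dest. */
    int msg_send(int dest, const void *data, size_t len);

    /* Receive the next incoming message into buf; the sender is not named. */
    int msg_recv(void *buf, size_t max_len);

    /* Optional variant: only accept messages from a particular source. */
    int msg_recv_from(int source, void *buf, size_t max_len);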
But at the fundamental level, there's a send and a receive. You can think about it as: the send requires a destination, and the receive can just pull from, basically, the input queue, or it can try to receive from a pre-specified source. Another way to do this is what they do in MPI, which stands for the Message Passing Interface and is probably the most common programming interface for messaging: the receive won't take a source, but instead it will take a tag, which is effectively a number that says, not what message type it is, sorry, but what traffic flow it is. You sort of name the flow. A tag is sort of a more generalized form of what TCP/IP has with ports, for instance, right? And UDP has ports. Okay. So, questions so far about this basic programming model and why it's different from loads and stores through shared addresses?

It's very different. They are duals of each other. I'm going to tell you right now that if you can implement shared memory, you can implement messaging, and vice versa. There have been lots of fights over this, and at some point the whole community realized that you can do either with either: for any programming model, you can implement one in the other.

Okay. So, the default message type we're going to talk about is unicast. That's one-to-one. If you send to a destination, the destination is one other node, or one other process, or one other thread. The reason I say process, thread, and node is that it depends on how the API is structured. You could use an explicit message passing system on something like a uniprocessor, where different threads or different processes are running and you are only messaging from one process to another process, even though it is one core. But there's a pretty natural extension of that: you can take those two processes, put them on different cores, and have them communicate using the same interface. We could also do this between two threads, but that's less common. Okay. So, unicast is one-to-one. Some messaging networks and some messaging primitives, some programming models really, will have a couple of other choices here. We can have multicast. This is communication from one node, but instead of having a destination here which denotes one other location, it can denote a set of other parties to communicate with.
It is still messaging, except we've expanded our notion of destination here so it can be a set, not just a single thing. We can also have broadcast, where one node, instead of having a destination here, has some magical flag which says: communicate this with every other node in the system, or every other process in the system, or every other thread in the system, or every other process in that process group or thread group. There are ways to say that in these software protocols. Just a quick question here: has anyone here programmed in MPI? Okay. And done a bit of parallel programming once in a while? Good.

So, I'm going to show a brief code example here of one process communicating with another process via a very common messaging library called the Message Passing Interface, or MPI for short. Let's look at this piece of code. It's C code, and what's happening here is: we start our program, and we have myid. The programming model of MPI is that it will actually start the same program on multiple cores or in multiple processes; it's not threads, but multiple processes, or multiple cores, will start up the exact same program. This is sometimes called SPMD, the single program, multiple data programming model. So in the SPMD model here, what you're going to notice is that we're actually going to execute this program twice: once on one core, and once on the second core. Now, you're going to say, well, if you're executing the exact same program, are they going to do the exact same thing? Well, let's take a look.

Okay. So, we have this thing called myid. This is what tells us which core we are. And there's something here called numprocs, which is going to tell us how many processors there are in the computer. And we fill these in. You see here, we start off by calling MPI_Init. MPI_Init sets up the message passing system from a software perspective. Then we call MPI_Comm_size with some magic parameters, and it's going to fill in this field, which tells us how many processors there are in this MPI program. When you launch an MPI program, there's a special MPI start command that you have to run, which takes as parameters the program you are trying to run and the number of processes you are trying to run it on.
So, our program can now dynamically detect how many processors we're running on. It can also detect what's called our rank, which is our ID. So now we can detect whether we're the first process or the second process launched in a two-process system, which one we are, and we fill in myid with the rank. This is a library which will figure this out for us in this MPI implementation. Okay. So, we assert now that the number of procs equals two. This is just to make sure we're not running on three processors or ten processors or something like that. We want strictly two, and not one, so the assert will fail if we have a problem there.

Okay. So now, this is where we can do different things. Single program, multiple data: the reason it's called this is that we can make decisions based on our processor ID, or myid here. So, we'll do something which says: is myid zero? If so, do this; else, do this other piece of code, okay? So, does anyone want to wager a guess at what this program does so far? One processor is going to be executing this, and the other processor is going to be executing that.

The first processor executes a send. Let's look at this first line here and see what happens. This is the data we're trying to send: we're sending x, which is an integer, which we load with the number 475, which is our class, or course, number. And we send that with a particular tag. MPI is structured around this notion of tags, which is how you can effectively connect up a sender and a receiver; sorry, it's how you can connect up multiple pieces of traffic between senders and receivers. The other way you figure out a match is by looking at a number saying who you're sending to and receiving from, which is the core number. So, what this says here is: send x to rank number one with a tag, and we're going to say the tag is 475 also; the rest of this MPI stuff here, don't worry about. And this is a length. So, this is going to say: send one word of size integer to processor one with a particular tag. And this MPI_COMM_WORLD is basically extra flags you can pass which say, do I want to send to all other MPI sub-processes, or you can make subgroups; there's this notion of groups that MPI has, but it's more complicated. Okay. So, at the same time, this other program is executing here. The first thing it's going to do is a receive.
So, if it gets here first, it's just going to block and wait in this receive. Conveniently, the first processor did a send, and it has a matched receive here. So, it's going to receive, with tag 475, a message of length one of an MPI integer from rank zero, and return a status code. It's going to fill in y with the data that got sent in this message. That's pretty fun: we just communicated information between two processes.

Okay. So, let's walk through what's happening here. Process zero does the send, so it's going to send 475 to processor one. Processor one could have gotten here early, at this receive. Processor one does not execute any of this other code, and it has a different address space; the x's and y's are different, we have no shared memory here. It gets here, and it's going to do the receive. At some point the message shows up at the receiver. It gets received and put into y; it just fills it in, which is why we pass the address of y. Now, we're going to increment y by 105, so we did some computation on this node. At the same time, while we're doing this receive, and this increment, and this send, process zero is basically just sitting here waiting, trying to receive. So, we have process zero send some data to process one. Process one is going to do some math on it, incrementing the number, and it's going to send that number back. So, it does the send back here with y again: we're going to send y to process zero, of length one, and when the message shows up, we're going to receive it in process zero, and process zero is going to print out a message. Okay. What does it say? Received number 580. Yes, you should go to 580A, not 580, not 580 without an A, because 580 without an A is not the parallel programming class, that's the security class next to it. But 580A, you should go to it. So, it prints out the number. So, we can communicate from one process to the other process and back with messaging.

One of the things we should note here is that we are both moving data and synchronizing at the same time in this message. So, it's doing two different operations here. So, this is a great question. This is a programming model right now; there are many different ways to implement MPI, and people have implemented the interface of MPI send and MPI receive in many different ways.
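The slide code itself isn't reproduced in this transcript, but based on the walkthrough above it is roughly the following. This is a reconstruction using the standard MPI C API; the exact variable names and argument spellings on the slide may differ.

    #include <assert.h>
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int myid, numprocs;
        int x, y;
        int tag = 475;
        MPI_Status status;

        MPI_Init(&argc, &argv);                    /* set up the message passing system */
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);  /* how many processes were launched  */
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);      /* our rank: which process we are    */
        assert(numprocs == 2);                     /* this example wants exactly two    */

        if (myid == 0) {
            x = 475;                               /* the course number                 */
            /* send one integer, tag 475, to rank 1 */
            MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
            /* block until rank 1 sends the incremented value back */
            MPI_Recv(&y, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
            printf("Received number %d\n", y);     /* prints 580 */
        } else {
            /* block until the message from rank 0 shows up; the data lands in y */
            MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
            y += 105;                              /* do some computation on this node  */
            MPI_Send(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

A program like this is typically launched with a command along the lines of "mpirun -np 2 ./program", which is the special MPI start command mentioned earlier.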
For instance, people have implemented this over a hardware message passing network, which does not have to go into the OS. You do a send, and there's effectively hardware there which implements MPI send and receive: it sends the data out over the network and receives it on the receive side, with some interconnection network in the middle. That's one way. Another way is to run MPI over a shared memory machine, at which point MPI send, and this is going to sound kind of funny, basically takes the data and copies it into RAM somewhere, and then when the receiver goes to receive it, it looks it up in a hash table, finds the location in RAM, and reads from that pointer. That's an option, and it is actually one of the most common ways people run MPI today; it doesn't have to go into the OS because it's just shared memory. On small machines, people run MPI that way. And it has typically had better performance than the equivalent shared memory program. The reason is that you've made the communication explicit, so the coherence protocol is basically able to optimize for producer-consumer: you have a write to some location, and then someone else is going to read it. You're not going to have random false sharing problems, and you're not going to have lots of other problems. So, that's pretty common.

Where MPI has its biggest use today is in massively parallel computations. This is the supercomputers of the world, all the supercomputers minus the vector supercomputers, which typically don't use MPI: the massively parallel computers. So, the biggest computers in the world right now, the Roadrunner computer at Los Alamos, for instance, and similar sorts of computers, will have special network cards that effectively implement MPI, and they do it in user space. Now, another way to do it is to implement MPI over TCP/IP, going through the operating system. That's pretty common when people just run small clusters of computers: MPI send and receive will trap into the operating system. It goes into the library first, the library reads and writes to sockets, and at some point that goes out over the network card.
All of those are possible, and all of those have very optimized MPI implementations, because this is the most widely used parallel messaging library for high performance applications. So, people have optimized this quite a bit.

Okay. So, this brings us to the question, and there are a lot of words on this slide, but I do want to talk about this, of message passing versus shared memory. We have two different parallel programming models. Message passing, where you have to explicitly name your destinations. And shared memory, where you write to some location and at some point in the future someone goes and reads from it, and you don't know who the reader is. It could be any of the processors in the system; it could even be literally random, like you read some random number and it tells you who is going to do the read of that location. But if you think about it, because you can write to a location and anyone else can go read it in shared memory, you effectively have one-to-all communication, whereas message passing allows your microarchitecture and your system to optimize around point-to-point communication.

Okay. So, let's compare these two things. We have message passing here. In message passing, memory is typically private, per node or per process or per core. So, memory is not shared. In shared memory, by definition, memory is shared. Message passing has explicit sends and receives in our software code: we're actually going to put an MPI send and an MPI receive in there, actual sends and receives. Conversely, in shared memory we have implicit communication via loads and stores. A load or a store names an address, but it does not name a core number or a process number or a thread number. Message passing moves data: we do a send and we put some data in it. Also, by definition of receiving a message on the receive side, there's some synchronization there; there's a producer-consumer relationship from the sender to the receiver. In shared memory, you have to explicitly have synchronization. Sorry, I should say explicit: you need to add explicit synchronization via fences, locks, and flags. You need to add something in there to do synchronization, and if you don't have synchronization, you could have race conditions. You can get the wrong data.
You can pick up incorrect data, which is a pretty common programming error in shared memory programming. And you don't need to know the destination in shared memory, as we said before.

From a programming perspective, what I wanted to point out here is that message passing is very natural for producer-consumer style computations. If you have one node producing some value and another node reading the value, it's very, very natural: you set up a send from one node to the other node, receive it at the other node, and you have a channel between them over which you can communicate. You can just send data down the channel, it's all in order, we'll say, and stuff comes out the other side. Very natural for producer-consumer. In shared memory, you have to implement producer-consumer like we did last lecture, where you have something like a FIFO in memory and locks on that structure.

But what is easy in shared memory, and hard to do in message passing, is having a large shared data structure. Take, for example, a big table that you're trying to process in parallel, and you have, I don't know, lots of little files. You give each of these files to a different processor. The processor reads its file, and what we're trying to do is build a histogram: it reads some number out of a file and, based on that number, it looks at a shared data structure and increments that location by one. Now, because it's shared memory, you want to make sure that two processors or two threads or two processes are not trying to increment the number at the same time, so you probably want to lock that location in the table, or lock the whole table, increment it, and then unlock it, so you don't have two people trying to increment it at the same time (there's a small sketch of this below). But it's very natural to build a shared table in shared memory, have locks on that shared table, and have different people operating on that shared piece of data at the same time.

So, the interesting thing here is that you can actually tunnel shared memory over messaging, and you can tunnel messaging over shared memory. Let's look at the first example, how to implement shared memory on top of messaging. You could do this in software.
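Here is the small sketch of that shared-table histogram promised above, written with pthreads. It is a minimal illustration, not the lecture's code: the file-reading part is replaced by in-memory arrays of values, and all of the names (histogram, table_lock, count, and so on) are made up.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_BINS    16
    #define NUM_THREADS 2

    static int histogram[NUM_BINS];          /* the shared table */
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for "read numbers out of your files": each thread gets an array. */
    static int inputs[NUM_THREADS][4] = {
        { 1, 3, 3, 7 },
        { 2, 3, 5, 7 },
    };

    static void *count(void *arg) {
        int *vals = (int *)arg;
        for (int i = 0; i < 4; i++) {
            pthread_mutex_lock(&table_lock);     /* lock the whole table...   */
            histogram[vals[i] % NUM_BINS]++;     /* ...increment one bin...   */
            pthread_mutex_unlock(&table_lock);   /* ...and unlock it          */
        }
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, count, inputs[t]);
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);
        for (int b = 0; b < NUM_BINS; b++)
            if (histogram[b]) printf("bin %d: %d\n", b, histogram[b]);
        return 0;
    }

A finer-grained version could keep one lock per bin instead of one lock for the whole table; that is the "lock that location in the table or lock the whole table" choice mentioned above.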
Back to implementing shared memory on top of messaging in software: the way you do this is effectively to turn all of your loads and stores into sends and receives. Maybe there is one centralized node which holds the whole notion of memory, that's one way to do it, or you distribute it somehow. It's going to be pretty painful to do, but people have implemented systems like this, where you actually take loads and stores, or your compiler goes and picks out loads and stores, and turn them into messages.

A more common thing is to have hardware that automatically turns communication into messaging. So, we're going to take this notion of a programming model and instead turn it into a broader notion of entities trying to communicate via explicit messages that are sent over some sort of interconnect. And this is actually the most common way these days that people go about implementing large shared memory machines: the memory traffic gets packetized, and we'll talk about packetization in a second, then it gets sent over a network, received, and some receiver does something with it. As an example, let's say we have a core here and this core wants to do a load, and the data is in main memory. On a bus, the core could just shout, "I need address five," and the memory would shout back, "Got it, address five has the value six." If we have a switched interconnect, with switches in here, we can have the core actually packetize the memory request. It takes the load, makes a message out of it, and sends it over the network, and the message shows up at the memory. The memory reads the message, makes a response, and sends it back over the network as well. So effectively, we can tunnel shared memory over a message network. For large cache coherence systems, this is actually the most common thing that happens. We'll be talking about this in greater depth in two classes, when we cover directory-based protocols, which effectively implement shared memory over these switched interconnection networks.

You can also implement messaging over shared memory, and this is pretty common in small systems that have shared memory and no other way to communicate. This is exactly what we talked about last lecture: we had a FIFO, we had head and tail pointers, and a producer can effectively enqueue onto this queue in main memory.
A consumer can go read from it, and you can implement messaging this way. So, I guess what I'm trying to get at here is that messaging and shared memory are duals of each other. One may be more natural for one thing or another, but you can implement any algorithm with either one; you can implement one in the other. And that's been shown at this point.
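To close the loop on that last point, here is a minimal sketch of messaging layered on shared memory: a single-producer, single-consumer ring buffer with head and tail pointers, in the spirit of the FIFO from last lecture. The names and sizes are illustrative, and C11 acquire/release atomics stand in for the fences discussed previously.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define QUEUE_SIZE 64u                   /* number of slots, a power of two */

    struct msg_queue {
        int data[QUEUE_SIZE];
        atomic_uint head;                    /* advanced by the consumer */
        atomic_uint tail;                    /* advanced by the producer */
        /* head and tail start at zero, e.g. zero-initialized shared memory */
    };

    /* Producer side: "send" a message by enqueueing it. Returns false if full. */
    bool queue_send(struct msg_queue *q, int value) {
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail - head == QUEUE_SIZE)
            return false;                    /* queue full */
        q->data[tail % QUEUE_SIZE] = value;
        /* release: the data write must be visible before the new tail is */
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer side: "receive" the next message. Returns false if empty. */
    bool queue_recv(struct msg_queue *q, int *value) {
        unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head == tail)
            return false;                    /* queue empty */
        *value = q->data[head % QUEUE_SIZE];
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return true;
    }

One process (or thread) calls queue_send on a queue placed in shared memory and the other calls queue_recv; both the data movement and the synchronization ride on ordinary loads and stores.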