Okay. So, we talked about vector processors, and we introduced this short-vector idea with single instruction, multiple data instructions. I wanted to talk a little bit about one of the common places that you see something that's like a vector processor, but isn't quite a vector processor, and that's in graphics processing units. This is the graphics card in your computer. My laptop here has an ATI graphics chip in it, and I've actually run OpenCL code compiled for it, which is a general-purpose use of a graphics processing unit.

One of the things I wanted to kick this off with is that these architectures look strange from a computer architecture perspective. It really goes back to the fact that they were not designed to be general-purpose computing machines; they were designed to render 3D graphics. The early versions had very fixed pipelines and no programmability, so you couldn't use them for general-purpose computation at all. Then they started to get a little more flexibility. An example of this: the original Nvidia chips had something called pixel shaders. The idea is that, on a per-pixel basis, as you render a three-dimensional picture for a game or something like that, you could apply a little customization to how each pixel gets rendered. The game programmer, for instance, got to write a little program that ran for each pixel as it was rendered to the screen.

What's interesting is that a programming language actually developed around these pixel shaders, and the first implementations of general-purpose computing on these devices came from people writing pixel shaders that nominally rendered pixels but actually computed something else entirely. You could do a matrix multiplication, and the result would come out as a picture. [laugh] Kind of a strange idea: we take one picture and another picture, run some special shading on them that makes it look sort of like a 3D image, and all of a sudden the output is actually the result we wanted.
Well, the people who make graphics processing units got the bright idea that maybe they could make this a little easier, and grow their user base beyond people buying graphics cards for graphics, if they encouraged people to write programs on these. So they started to make them more and more general purpose. Instead of just running a pixel shader, working on one pixel at a time with everything very pixel-oriented, or graphics-oriented, they renamed everything, came up with a programming model, and exposed some of the architecture, making it a little more general purpose so that you might be able to run other programs on it. That's really what happened. As I said, the first people who tried to use these machines for general-purpose work had to program them through the pixel shaders and the per-vertex computations that were baked into the original architectures, and that was very hard, because you were basically trying to think about everything as a picture in a frame and then back-compute what was going on. But as things moved along, we started to see new languages, new programming support, and the architectures became more general purpose.

That brings us to general-purpose graphics processing units. We still have "GPU" in the name, so it's still special purpose, but we stick "GP" on the front and call these GPGPUs: General-Purpose Graphics Processing Units. A good example of this is the programming language Nvidia came up with, called CUDA. Now, this was not the first foray into this area; some research languages and ideas predated it. But the idea here, well, we'll come back to that in a second. The GPGPU programming model is a little bit strange: it's a threading model, and we're going to talk about that in a bit more detail.

First of all, though, I wanted to point out some differences between a GPGPU and a true vector processor. With a GPGPU, there's a host CPU, which is the x86 processor in your computer. Then you go across the bus, the PCIe bus, or PCI bus, or AGP bus, out onto a graphics card.
So there is a host processor, but there's no control processor like in the vector processors we talked about last lecture. The scalar processor is really far away and doesn't drive everything as strictly; you basically have two processors running, the host processor and the graphics processing unit. Because of this, you actually have to run all of your control code on the vector unit, somehow. This attached-processor model does have some advantages, though: you can run the data-parallel parts of the program on the GPU and have the host processor, the x86 processor connected to it across the bus, running some other program at the same time.

So, let's dive into detail here and talk about the Compute Unified Device Architecture, which is what CUDA stands for. CUDA is Nvidia's programming model for their graphics processors. There's also a broader, industry-accepted way to program these that uses roughly the same programming model, called OpenCL; the name is designed to evoke OpenGL, the graphics language that is widely used for 3D rendering.

You have to suspend disbelief here for a second, because this model is a little bit odd, I think. But let's talk about what CUDA is. Let's look at a non-CUDA program first; it's the upper portion of this code. We're computing y of i plus x of i times some scalar value. This is the traditional "scalar times one vector plus another vector" operation, and it shows up as the inner loop of LINPACK, which is a benchmark that a lot of people run. It's pretty similar to what we were doing before: you're adding one vector to another vector, except that one of the vectors gets multiplied by a scalar first. Pretty simple.

In CUDA, the basic idea is that you don't do a lot of work per, what we're going to call, thread. What we do here is define a block of data, with some special annotations to say what's going on, and this block has some size.
Then what we do is define a thread that can operate in parallel across these 256 data values, with 256 threads per block. What's going to happen is that each thread does the same operation, y of i gets a times x of i plus y of i, and stores the result back. But look at where i comes from: i is computed from some special keywords that tell you which thread number you are, effectively which index you are within this thread block. And this if statement here is the moral equivalent of our strip-mining loop. It says: if you're past however much work we're trying to do, don't do any more work. So we can block the work into some number of threads, pass in an n, where n is the amount of work to actually do, the length of the vector, and check against it along the way.

What I find interesting is that, at first view, it looks like each of these threads is completely independent. And that is the programming model: each thread is independent. Unfortunately, on the computer architecture this actually runs on, the threads are not independent. So in CUDA you don't want the threads to diverge. It's allowed; for instance, if there's an if statement in here, one thread can take the branch while a different thread falls through. But if you do that, you're basically just not using one of the pipelines for a while. I'll come back to this.

Let's talk about how this programming model shows up. The model is that you have lots of different little threads, you make these sorts of micro-threads, and then the runtime system plus the compiler, the CUDA compiler, packs the micro-threads together so that they operate on the exact same instructions at the same time. So, they're hiding a single instruction, multiple data architecture under the hood of these threads. If we look at our example, each thread does a load, a multiply, another load, an add, and the store; that's our a times x plus y operation.
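What's being described here is essentially the standard CUDA SAXPY/DAXPY example. Below is a minimal sketch of both versions; the function and variable names are mine rather than necessarily the ones on the slide, the data is assumed to be single precision, and d_x and d_y are assumed to be device pointers already set up with cudaMalloc and cudaMemcpy.

    // Conventional C loop: runs entirely on the host CPU.
    void saxpy_serial(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // CUDA version: each micro-thread handles one element.
    // __global__ marks this as a kernel the host launches on the GPU.
    __global__ void saxpy_parallel(int n, float a, float *x, float *y) {
        // Special keywords give each thread its index in the grid.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Moral equivalent of the strip-mining check: threads that land
        // past the end of the vector do no work.
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: 256 threads per block, enough blocks to cover n elements.
    void run_saxpy(int n, float a, float *d_x, float *d_y) {
        int threads_per_block = 256;
        int nblocks = (n + threads_per_block - 1) / threads_per_block;
        saxpy_parallel<<<nblocks, threads_per_block>>>(n, a, d_x, d_y);
        cudaDeviceSynchronize();  // wait for the GPU to finish
    }

Note how little work each thread does: one multiply and one add. All of the looping structure has been traded for the grid of threads plus the if guard.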
And across the other dimension, all of these threads are doing the same operation, or effectively the same operation. They call this single instruction, multiple thread, which is kind of a funny idea. In reality, it's single instruction, multiple data, but they introduce this notion of threads, with predication, on top of the single instruction, multiple data machine, to allow one pipe, one micro-thread, to do something slightly different, via predication.

So what are the implications of single instruction, multiple thread? Well, strangely enough, because it's hard to control the ordering of the threads relative to each other and to the data, the memory system has to support all different alignments and all different kinds of scatter-gather operations. They don't actually try to control the addressing, because each thread could potentially do some scatter operation. Instead, they have some really smart logic that takes all the addresses coming out of the execution units and says, oh, these look like they line up, so we'll issue them at the same time. If you happen to have threads that all access, say, a of i, where i is the thread number, then you have unit-stride accesses, and the hardware goes and tries to pack them all together. And, as I mentioned before, you have to use predication if you have different control flow; you need strong use of predication to allow threads to go in different directions.

Things get even more complicated in these general-purpose GPU architectures, where you can take the word "warp" and replace it with what the rest of the field would call a thread. If you go read your textbook, it has a nice table that translates GPGPU nomenclature into the nomenclature used by the rest of computer architecture. Unfortunately, the GPU people came up with completely different names for everything, which is just kind of annoying, because names already existed for everything. If you look inside one of these GPUs, they really are a massively parallel architecture with multiple lanes, like our vector architecture, and then, on top of that, they are a multithreaded architecture.
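To make the divergence and address-coalescing points concrete, here is a small sketch; the kernel and its names are illustrative assumptions of mine, not code from the lecture.

    __global__ void divergence_and_coalescing(int n, const float *a, float *b) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Coalesced access: consecutive threads touch consecutive elements
        // (a[i], where i is the thread number), so the hardware can merge
        // a warp's loads into a few wide memory transactions.
        float v = a[i];

        // Divergent branch: within one warp, some threads take the "if"
        // side and others the "else" side. The hardware runs both paths,
        // using per-thread predication to mask off the inactive lanes,
        // so pipeline slots are wasted while this executes.
        if (i % 2 == 0)
            v = v * 2.0f;
        else
            v = v + 1.0f;

        b[i] = v;
    }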
Typically, these architectures don't have caches. To hide a lot of the memory latency, and this is part of the reason they have this threading model, they take all the threads that are active in the machine and schedule one of them; if that thread misses out to memory, they time-slice it out and schedule a different thread. So they fine-grain interleave threads on a functional unit. It's a strange idea, mixing multithreading with SIMD at the same time; lots of different kinds of parallelism coming together in these GPGPUs.

I don't want to go into too much detail, because then you'd have a whole class on how to program GPGPUs. The basic idea I wanted to get across is that they are a multithreaded, single instruction, multiple data machine, with this strange notion of threads overlaid on top. And when the threads don't do exactly the same work, because it's a SIMD machine underneath, you basically just end up wasting slots.

So these have a lot of performance. As an example, this is the modern-day Nvidia GPU architecture you can go buy; they call it Fermi. This is what's in that Tesla card I showed last time. Let's zoom in on one of these and talk about what's inside. First of all, look at the stuff that's not programmable, which is actually a significant portion of the design. Down here they have vertex shaders and tessellation units and texture mapping units and texture caches, and really, what all of that is for is graphics processing. [laugh] And then [laugh] they sort of smush onto that an array of general-purpose units, or mildly general-purpose units. Inside each one of these cores there's a floating-point unit and an integer unit. If you cut this way, each of these things they call a core is effectively a lane, so they're replicated in that direction; and then one, two, three, four, five, six, seven, eight, this direction is SIMD. So they basically have a SIMD architecture with multiple lanes, with lots of parallelism going on.
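The practical upshot of this latency-hiding scheme for the programmer is simply to expose many more threads than the machine has execution units, so the scheduler always has another warp ready to issue while one is waiting on memory. One common way to write that is the grid-stride loop sketched below; this is a hedged illustration with made-up launch sizes, not code from the lecture.

    // Grid-stride loop: launch a fixed, oversubscribed number of threads
    // and let each one walk over many elements. While one warp waits on
    // memory, the hardware issues instructions from another warp.
    __global__ void saxpy_gridstride(int n, float a, const float *x, float *y) {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }

    // Host side: far more threads in flight than the GPU has lanes,
    // so there is always other work to hide memory latency behind.
    // saxpy_gridstride<<<1024, 256>>>(n, a, d_x, d_y);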
Then, at the top, they have what they call the warp scheduler, which is the thread scheduler; it assigns instructions down into the different parallel units. So, lots of interesting things going on in parallel on these machines.

I wanted to stop here on GPUs, because we could have a whole other lecture on that, and I don't really want to go into that level of detail. Let's switch topics and start talking about multithreading. Actually, before we leave this slide, I'm sure people have questions about GPUs. I'll take one or two, but I don't want to go into too much detail. I just wanted to introduce the idea that you could use graphics processors this way; they have some similarity to vector processors, but they're kind of the degenerate case of vector processors.

Actually, a quick show of hands: who has an ATI video card in their machine? Okay. Who has Nvidia? Okay. Who has Intel? Aha. So, interesting tidbit here. Everyone always thinks of ATI and Nvidia as the leaders in these fancy graphics cards, but in reality, Intel sells the largest number of graphics processors in the world today. That's partially because they kind of give them away for free; they've effectively integrated them onto all the Intel chips now. So it's kind of a funny thing that the least innovative, least exciting graphics processors out there are everywhere not because they're good, but because they're cheap. Lots of things in the world work like that.