Today, we move on to our new topic: vector computers.

So, a little bit of introduction on vectors. A vector machine is a vector processor. Broadly, it's a way to get data-level parallelism. Many times, for, let's say, array operations, you're going to want to take one whole array and add it to another whole array. And let's say these arrays are large. Does it really make sense to have a processor sit in a tight loop doing load, add, store, load, add, store, load, add, store? The insight that comes out of this is that if you have computations that work on vectors or matrices, or even multi-dimensional matrices, you can think about building an architecture where you don't need as much instruction fetch and instruction decode bandwidth. You don't have to sit there fetching new instructions and continually operating on those new instructions. You could just have one instruction which encodes some large amount of computation, because it's all the same. That's the insight.

Also in today's lecture, we're going to be talking about single instruction, multiple data architectures. This is kind of a degenerate case of vector architectures. A good example of this is something like the multimedia extensions, MMX, in the Intel processors, or AltiVec in the PowerPC architecture.
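To make that tight loop concrete, here is a minimal sketch in C (the function name and signature are my own illustration, not from the lecture) of the scalar load/add/store loop a conventional processor would grind through, one element per iteration:

```c
#include <stddef.h>

/* Element-wise add of two arrays: the loop body compiles down to a
 * load, load, add, store sequence repeated n times -- exactly the
 * tight loop described above. */
void array_add(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   /* load a[i], load b[i], add, store c[i] */
    }
}
```

A vector machine instead expresses the same work as a handful of instructions, amortizing instruction fetch and decode over all the elements.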
The newer thing that Intel has added now they call SSE, the Streaming SIMD Extensions. And then they also now have something they call AVX, which is even wider. They can basically keep adding more instructions to make the short-vector support better.

And then finally today, if we have time, we'll be talking about graphics processing units. So, I have some examples here. This is the ATI FirePro 3D V7800, and then we have the Nvidia competitor, which is the Nvidia Tesla, I think the C2075. These are both very fast processors. And what is interesting is, these started out as graphics processors. They started out to play video games, effectively, or to do some sort of rendering of three-dimensional data. So you're taking some data and operating on it, and there's massive parallelism there: lots of different triangles in a three-dimensional image, for instance, in three-dimensional rendering. And people had this insight that the same processing architecture that is good at rendering triangles might be good at doing, let's say, dense matrix operations also. We've seen this outgrowth, and we've seen a whole programming model come up around it, and this is all very recent.
To some extent, these architectures don't come from the same lineage as normal processors. They really come from fixed-function hardware that was designed to render video games and three-dimensional sorts of scenes. So their architectures look quite a bit different, and the naming is very different. If you go pick up the manual that tells you how to program one of these things, and you come from a computer architecture background, you're just not going to understand any of the words. Your book, actually, the Hennessy and Patterson book, has a very good table which translates traditional computer architecture words into GPU words, and that makes life a lot easier.

Okay, so let's get started looking at vector processors, and let's look at the programming model first, before we look at the architecture. So this is the software model, not what the hardware looks like.

To start off here, a couple of things to note. In the traditional vector architecture, you're going to have some scalar registers. These are the registers like in a normal microprocessor; they just hold one value. They're maybe, let's say, 32 or 64 bits in width. And then you have a second register file, which holds vectors.
When you go to access one of these vectors, it's the same as with a register file: if you go to access, let's say, vector register three, that doesn't denote one value. Instead, it denotes many values at one time. We have a fixed width drawn here, but typically these things are very long. For instance, the Cray-1 had a maximum vector length of 64 elements, where each element was 64 bits. So it's a lot of data that you're moving around at one time with one operation.

An important piece of architectural, or at least programming-model, state here is the vector-length register. The vector-length register says how many of these elements are actually populated, and we'll see why that's important. But for right now, let's just think of the vector-length register as being equal to the maximum number of elements in the vector. So think of it as having 64 elements, and the vector-length register just says you're always operating on all 64 entries of data in parallel.

Now, if we go look at the programming model connected to this, we need to add some extra instructions.
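As a rough sketch of this programmer-visible state, using the Cray-1 numbers above (the struct layout and the register counts are my own illustration, not a real machine definition):

```c
#include <stdint.h>

#define MVL 64  /* maximum vector length: 64 elements, as on the Cray-1 */

/* Programmer-visible state of a classic vector machine: scalar
 * registers hold one value each; vector registers hold up to MVL
 * 64-bit elements; the vector-length register (vl) says how many
 * of those elements are live for the next operation. */
typedef struct {
    uint64_t scalar[8];     /* scalar registers, one 64-bit value each */
    uint64_t vec[8][MVL];   /* vector registers, MVL elements apiece   */
    unsigned vl;            /* vector-length register, 0..MVL          */
} VectorState;
```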
In our scalar processors, or all the processors we've been talking about up to this point, an instruction operates on one register with one other register. That still exists in this model, but it operates only on the scalar registers. Now, the reason we still have the scalar registers around in this model is that we want to have things like branch conditions and address computations, things like that, which are not vectorizable. You don't have 64 addresses. Well, maybe you do in certain cases, but typically you're not going to have that lying around; you're just going to have one address you need to load from, and for branches, you need to branch based on some single value, not all 64 values.

But we now add some special extra instructions. If you go look in your book, they develop an architecture they call VMIPS, or vector MIPS. They add some extra instructions which look very similar to normal MIPS, but all of a sudden they put some Vs at the end. So, .VV means an instruction operates on a vector with another vector.
They also define some instructions which have a V and an S, where the S is for scalar, so you can do a vector plus a scalar. That would be something along the lines of ADDVS, add vector-scalar, where you're adding one vector with a scalar register. If the scalar register is, let's say, loaded with one, you could do this add and it will increment every element of the vector by one.

You also have loads and stores, which can pull very large chunks of memory out of the arrays in memory, and put very large chunks back. But if you look at what's going on in one of these instructions, we're taking one vector and another vector, putting them into some sort of arithmetic operation, and then storing the result into another register. This is a register-register vector architecture. There have been some register-memory and memory-memory vector architectures out there, where instead of naming vector registers you can name places in memory, but the register-register variants are the most popular, just like the register-register scalar computer architectures are now the most popular.

One thing I did want to point out here is, we've said nothing about how many ALUs there are in this architecture.
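A minimal sketch, in C, of what an add-vector-scalar instruction computes; vl stands in for the vector-length register (the function name is mine, not real VMIPS syntax):

```c
#include <stddef.h>

/* ADDVS-style semantics: add one scalar value to every live element
 * of a source vector, writing the result into a destination vector. */
void addvs(double *vdst, const double *vsrc, double scalar, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        vdst[i] = vsrc[i] + scalar;
    }
}
```

With the scalar set to 1.0, one call increments every element by one, as in the example above.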
This is just the abstract programming model. So don't get this confused with having one, two, three, four, five, six functional units or something like that. This is just an abstract model right now; we have not talked about the hardware.

So, this brings up: how do we get data? We have an instruction here that we'll call load vector. Load vector has a destination, which is a vector register, and a source, which is a scalar register. You might have another offset register, but let's say there's only one register in our basic load vector operation here, and this is the address that points to the base of the vector in memory. When you go to do this load, it's actually going to pull data in from memory into our vector register.

You could also start to think about having interesting offsets or strides here. That's what this picture is trying to show: we have a base pointer, pointed to by register one. It's a scalar register, and note that the naming is different: these have Vs and these are Rs. And then we have a stride here which says where in memory to take from. So you can think about having something where you read from multiple locations in memory, but you want every fifth element or something like that.
So, you could load register two here with five and register one here with the base address, and then do this load vector instruction, and it'll take every fifth piece of memory, of some data size, and load it into the vector register. This is our abstract model, but at the beginning here, let's assume what's called unit stride, which basically means the stride here is always one, so it's always getting the next value in a row. We'll talk about the more complicated cases with non-unit stride later.

Okay, so let's look at what this does to code. Here we have a basic code example. It's going to multiply, element-wise, the different elements of two vectors here, A and B, and deposit the result into vector C. Now, this is in memory, because this is C code, so these are actually arrays. And obviously this is not matrix multiplication, because matrix math is much more complicated; this is an element-wise multiplication.

If you go look at the scalar assembly code, well, first of all, we need to have a loop. We have to load the first value, load the second value, do the multiply, do the store. This is showing code for double-precision floating-point multiplies. Then you have to increment a bunch of pointers, check the boundary case, and loop around.
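Stepping back to the strided load from a moment ago, here is a C sketch of its semantics; for simplicity the stride is given in elements rather than bytes, and the function name is my own:

```c
#include <stddef.h>

/* Strided load-vector semantics: gather vl elements starting at base,
 * stepping stride elements between accesses (stride 1 = unit stride,
 * stride 5 = every fifth element, as in the example). */
void load_vector_strided(double *vreg, const double *base,
                         size_t stride, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        vreg[i] = base[i * stride];
    }
}
```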
On your vector architecture, life gets a little bit easier here, because we can do all 64 of these in one instruction; we don't have to loop. All you really have to do is load vector, load vector, multiply, and store. And the instruction at the top here loads the vector-length register. We load the vector-length register with 64 here, because we're trying to do 64 elements. But if we were to load the vector-length register with, say, 32, we would only do the first 32 multiplications. And you can set that vector-length register anywhere up to the maximum vector length.

So, for the vector-length register, there's this value we call the maximum vector length, which is the largest possible length of a vector. The vector-length register says, for the given operation we're about to compute, how many of those operations we should do. So you could easily have something with a vector length of a thousand, but you only want to do, let's say, the first 64 operations, so you load your vector-length register with 64 and only do 64 operations.

A good example of this, actually, is some of the supercomputers.
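The interplay between the vector-length register and the maximum vector length is usually expressed as a strip-mined loop. Here is a sketch in C, with mulv standing in for one vector multiply instruction and vl playing the role of the vector-length register (the helper names are my own illustration):

```c
#include <stddef.h>

#define MVL 64  /* maximum vector length, as on the Cray-1 */

/* One "vector instruction": multiply vl elements, element-wise. */
static void mulv(const double *a, const double *b, double *c, size_t vl) {
    for (size_t i = 0; i < vl; i++) c[i] = a[i] * b[i];
}

/* Strip-mined driver: each trip sets the vector length to at most MVL,
 * issues one vector multiply, and advances to the next strip, so an
 * array of any size is handled in MVL-sized chunks. */
void vmul(const double *a, const double *b, double *c, size_t n) {
    for (size_t done = 0; done < n; done += MVL) {
        size_t vl = (n - done < MVL) ? (n - done) : MVL; /* set VL */
        mulv(a + done, b + done, c + done, vl);
    }
}
```

For n = 100, the loop issues one full strip of 64 and a final strip of 36, which is exactly what setting the vector-length register per iteration buys you.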
The Cray machines have relatively short maximum vector lengths, but if you go look at something like the NEC Japanese supercomputers, the NEC SX-8 or SX-9, or whatever is the newest (I think it's the SX-9, the new Japanese vector supercomputer, which I think is actually now probably the fastest computer in the world), they have very long maximum vector lengths. They can actually have a vector length of a thousand, so that in one instruction they can basically encode a thousand operations, which is pretty fancy. But you still need to be able to set the vector length, because maybe you don't want to do all thousand all the time.

Okay, so why does this vector stuff have advantages? If you go back to the early vector machines, something like the Control Data Corporation 6600 or the Cray-1, they have very deep pipelines. And if you think about the architecture we've been building up to this point, we had to add a lot of forwarding logic and a lot of bypasses to be able to bypass one value to the next. Well, if you have a very deep pipeline, and you observe back-to-back multiplies or something like that, you're going to stall a lot.
But in a vector computer, because you know you're operating on, let's say, 64 operations at a time anyway, this actually allows you to take out a lot of the bypassing. Some of these vector architectures have no bypassing in them at all, because if you're going to be operating on 64 things and your pipeline length is six anyway, there's no possibility that you'll ever actually have to forward data back to, let's say, the same operation; in the early machines, all the communication between different operations happened through the register file itself.

Also, deep pipelines are good because you can have very fast clock rates. To give you an example, the old Cray-1 had an 80 megahertz clock. Now, you might say, 80 megahertz, ooh, that's not very fast. But 80 megahertz back in, probably, the late '60s or early '70s was a very fast clock rate for a processor. I mean, these were supercomputers, mind you, but they were very aggressive, and they could do that because they had deep, deep pipelining and lots and lots of logic, and these things were physically large.

I mentioned the memory system. Vector computers have some interesting changes that you have to think about in the memory system. One of the things you can do, because you have so many memory operations going on, is use vector loads.
With a vector load, you can actually overlap going out to main memory for one element with doing the next load, effectively, even if you're issuing the accesses sequentially. Most of these vector architectures have many, many memory banks. And what's nice is, if you have unit stride, you know that your first load is going to go to this bank, the next to that bank, then that bank, that bank, that bank, so you get basically a very good bank distribution, or bank utilization. And this is assuming, right now, that we're actually only issuing one memory load at a time.

I have a little note up here that says, okay, each load takes, let's say, four cycles of bank busy time, and you have a twelve-cycle latency to get out to memory in this Cray-1 machine. Well, on a normal architecture this would be pretty bad, because you'd be stalling twelve cycles, let's say, to go out to your memory system. That's not the end of the world, but it's not great if you have a load and then a use, a load and a use, and you just keep going back and forth between those loads and uses.
But in the vector architecture, because we have a long vector length and we're loading 64 different values, and we know that they're going to have a good distribution over many different memory banks, we can effectively do this one load and overlap the latencies in the memory banks with each other. So we'll start one load here, and then one load here, one load here. And if each access has a four-cycle occupancy on its respective bank, and we have a 64-entry vector, then definitely by the time we wrap around and get back to using this bank again, that first operation will be done. So it's a relatively effective way to increase the bandwidth of your architecture and guarantee that you're not going to have bank conflicts.
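A back-of-the-envelope check of that claim: the sketch below models the four-cycle bank busy time from the lecture, assuming for illustration 16 interleaved banks and one access issued per cycle (the bank count and the function itself are my own assumptions, not Cray-1 specifics).

```c
#include <stdbool.h>
#include <stddef.h>

#define NBANKS    16  /* assumed number of interleaved banks (illustrative) */
#define BUSY_TIME  4  /* cycles a bank stays busy per access (lecture)      */

/* Returns true if issuing one access per cycle with the given element
 * stride ever revisits a bank that is still busy. */
bool has_bank_conflict(size_t stride, size_t n_accesses) {
    long last_use[NBANKS];
    for (int b = 0; b < NBANKS; b++) last_use[b] = -BUSY_TIME; /* "never used" */
    for (size_t i = 0; i < n_accesses; i++) {
        size_t bank = (i * stride) % NBANKS; /* element index mod bank count */
        long cycle = (long)i;                /* one access issued per cycle  */
        if (cycle - last_use[bank] < BUSY_TIME) return true;
        last_use[bank] = cycle;
    }
    return false;
}
```

Unit stride revisits a bank only every 16 accesses, comfortably more than the 4 busy cycles, while a stride equal to the bank count hits the same bank every cycle and conflicts immediately.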