1 00:00:00,000 --> 00:00:04,311 Okay. 2 00:00:04,311 --> 00:00:10,740 So now we're going to move off of vectors and talk about a near cousin of 3 00:00:10,740 --> 00:00:14,111 vectors: how you can have vector 4 00:00:14,111 --> 00:00:22,153 computing in your desktop today. A lot of this was done 5 00:00:22,153 --> 00:00:30,271 by Ruby Lee, here at Princeton. She added a lot of multimedia extensions 6 00:00:30,271 --> 00:00:36,780 to the HP PA-RISC architecture. There were a couple of other people involved 7 00:00:36,780 --> 00:00:43,022 in this, but she was pretty influential in getting this done. 8 00:00:43,022 --> 00:00:49,421 The idea here is that if you have a wide register, so if you're doing, let's 9 00:00:49,421 --> 00:00:55,067 say, 64-bit additions, and you don't want to have to do 64-bit 10 00:00:55,067 --> 00:01:00,413 additions, or don't actually have 64-bit data lying around, you could cut it in 11 00:01:00,413 --> 00:01:03,477 half and do two 32-bit operations at the same time, 12 00:01:03,477 --> 00:01:07,520 or you can use that same ALU and do four 16-bit 13 00:01:07,980 --> 00:01:13,215 or eight 8-bit operations. So this is called SIMD, or Single 14 00:01:13,215 --> 00:01:19,846 Instruction, Multiple Data, or short SIMD instructions, because 15 00:01:19,846 --> 00:01:24,034 typically the vector length is pretty short, 16 00:01:24,034 --> 00:01:30,055 or multimedia extensions. And you have an instruction which says, I 17 00:01:30,055 --> 00:01:34,680 want to do two 32-bit adds, we'll say, at the same time. 18 00:01:36,400 --> 00:01:42,555 This was popularized in x86, at least, by MMX, which was the first 19 00:01:42,555 --> 00:01:48,182 implementation of this. And it's gone on from there 20 00:01:48,182 --> 00:01:51,348 to SSE, SSE2, SSE3, SSE4, and now Intel AVX.
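As a rough illustration of the partitioning idea described above, here is a small Python sketch that treats one 64-bit word as one 64-bit lane, two 32-bit lanes, four 16-bit lanes, or eight 8-bit lanes. The function name `packed_add` and its parameters are made up for this example; it models only the arithmetic result, not any real instruction from MMX, SSE, or AVX.

```python
def packed_add(a: int, b: int, lane_bits: int, width: int = 64) -> int:
    """Add two width-bit words lane by lane, wrapping inside each lane.

    Hypothetical helper for illustration; not a real SIMD instruction.
    """
    if width % lane_bits != 0:
        raise ValueError("lane_bits must divide width")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, width, lane_bits):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # Each lane wraps on its own, so no carry leaks into the next lane.
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result


a = 0x00000001_FFFFFFFF
b = 0x00000001_00000001
print(hex(packed_add(a, b, 64)))  # one 64-bit add:   0x300000000
print(hex(packed_add(a, b, 32)))  # two 32-bit adds:  0x200000000
print(hex(packed_add(a, b, 16)))  # four 16-bit adds: 0x2ffff0000
print(hex(packed_add(a, b, 8)))   # eight 8-bit adds: 0x2ffffff00
```

The same two input words give different results depending on where the lane boundaries sit, which is the whole point: the instruction, not the data, decides how the register is carved up.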
21 00:01:51,348 --> 00:01:58,856 And the differences between MMX and all the different SSEs largely have to do with 22 00:01:58,856 --> 00:02:03,266 the length of the register and how many instructions they had. 23 00:02:03,479 --> 00:02:09,383 So in AVX we've gone to 256-bit registers, wider registers, and it's extensible to, I 24 00:02:09,383 --> 00:02:16,326 think, 1024 bits. One thing I do want to point out about 25 00:02:16,326 --> 00:02:22,135 this, which is interesting, is that this requires changes to your data path. 26 00:02:22,135 --> 00:02:28,348 If you have an adder, and you have a 64-bit add, and now you want to do eight 27 00:02:28,348 --> 00:02:35,640 8-bit adds, you need to cut the carry chain in seven places. 28 00:02:36,040 --> 00:02:42,345 Now, that's if you have a basic adder. It gets a little more complicated 29 00:02:42,345 --> 00:02:49,517 if you have something like a carry-lookahead adder, or 30 00:02:49,517 --> 00:02:54,060 something like that, because you may not have a simple place to 31 00:02:54,060 --> 00:02:58,685 go snip the carry chain. There is still some place to cut it, 32 00:02:58,685 --> 00:03:02,054 but in your original design you might have propagated across 33 00:03:02,054 --> 00:03:05,958 where now you need to cut at the boundary. So this is definitely a 34 00:03:05,958 --> 00:03:08,579 challenge. Also, for things like multiplies, if you 35 00:03:08,579 --> 00:03:12,964 want to do eight 8-bit multiplies, the structure looks a little bit 36 00:03:12,964 --> 00:03:15,852 different there. But the big insight 37 00:03:15,852 --> 00:03:20,130 here is, you had that logic anyway. You're just effectively adding muxes on 38 00:03:20,130 --> 00:03:23,204 the carry chains in the data path. 39 00:03:23,204 --> 00:03:26,837 And for some operations you don't even need to add anything.
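The carry-chain cut described above has a well-known software analogue, sketched here in Python under the assumption of a 64-bit word split into eight 8-bit lanes. Hardware would gate the carry between lanes (the muxes mentioned in the lecture); this sketch instead uses masks so that no carry can cross a lane boundary while still using one ordinary wide add. The names `add_8x8`, `LOW_BITS`, and `HIGH_BITS` are invented for the example.

```python
MASK64 = (1 << 64) - 1
HIGH_BITS = 0x8080808080808080  # top bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F7F7F7F7F   # low seven bits of each 8-bit lane


def add_8x8(a: int, b: int) -> int:
    """Eight independent 8-bit adds inside one 64-bit word (illustrative)."""
    # Add only the low 7 bits of every lane. A lane's sum can carry into
    # its own bit 7 but never past it, so lanes cannot disturb each other.
    low = (a & LOW_BITS) + (b & LOW_BITS)
    # Add the top bit of each lane without carry (XOR). This throws away
    # the carry out of each lane: the seven "cuts" in the carry chain.
    return (low ^ ((a ^ b) & HIGH_BITS)) & MASK64


a = 0x00000001_FFFFFFFF
b = 0x00000001_00000001
print(hex((a + b) & MASK64))  # plain 64-bit add, carries ripple: 0x300000000
print(hex(add_8x8(a, b)))     # carries cut at lane boundaries:   0x2ffffff00
```

The plain add lets the carry out of the low byte ripple all the way up, while the masked version stops it at every 8-bit boundary.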
40 00:03:26,837 --> 00:03:33,195 Obviously, if you're operating on something like eight 8-bit values and you want to 41 00:03:33,195 --> 00:03:36,638 do the logical OR of them, you don't need to add a special 42 00:03:36,638 --> 00:03:44,119 instruction for that. From an implementation perspective, this is 43 00:03:44,119 --> 00:03:49,873 what I was trying to get at here: you have independent adds going on, and they 44 00:03:49,873 --> 00:03:55,142 all happen in parallel. So why do we like multimedia extensions, or these 45 00:03:55,142 --> 00:03:58,747 vector instructions, or short vector instructions? 46 00:03:58,747 --> 00:04:02,075 Let's compare them to our big vector machines. 47 00:04:02,075 --> 00:04:07,344 One of the major differences is that you can't control the vector length. 48 00:04:07,344 --> 00:04:14,103 The vector length is fixed by the length of the native data word, 49 00:04:14,103 --> 00:04:18,474 or rather, the length of the 50 00:04:18,474 --> 00:04:23,780 native data type for your instruction set. And 51 00:04:24,040 --> 00:04:27,593 strided, scatter-gather, these other operations are hard to do, 52 00:04:27,593 --> 00:04:30,797 because typically you just have a single load and store, 53 00:04:30,797 --> 00:04:34,176 and you use the processor's load and store instructions, 54 00:04:34,176 --> 00:04:38,487 because the processor doesn't care. It's the same way that unary 55 00:04:38,487 --> 00:04:43,147 operations or logical operations don't need special instructions to do short 56 00:04:43,147 --> 00:04:46,293 vector, or single instruction multiple data, operations. 57 00:04:46,293 --> 00:04:51,012 You don't need special instructions for SIMD data to be able to do loads and 58 00:04:51,012 --> 00:04:52,760 stores. You just load the data 59 00:04:53,020 --> 00:04:57,937 and store the data.
This is actually starting to change a 60 00:04:57,937 --> 00:05:02,199 little bit. Some of the newer versions of SSE actually 61 00:05:02,199 --> 00:05:06,420 do have some scatter-gather modifications. 62 00:05:06,420 --> 00:05:13,800 It's a little bit harder, if you think about it, because you can't hold a 63 00:05:13,800 --> 00:05:20,200 full address, if you will, in a vector. So it's not like you can actually do 64 00:05:20,200 --> 00:05:24,160 indexed addressing directly, because you can't 65 00:05:24,160 --> 00:05:26,740 necessarily hold the full address in there. 66 00:05:26,740 --> 00:05:31,780 But, in essence, they've come up with some way to do scatter and gather 67 00:05:31,780 --> 00:05:38,259 operations. One consequence of having the vector 68 00:05:38,259 --> 00:05:45,033 register length be limited is that you can't do as much work in one operation. 69 00:05:45,033 --> 00:05:51,556 So you can't necessarily do 64 operations in one instruction, like we did 70 00:05:51,556 --> 00:05:56,908 with our vector length of 64. So that is a 71 00:05:56,908 --> 00:06:00,922 problem. And unfortunately, what happens here 72 00:06:00,922 --> 00:06:06,860 is you end up having to do more operations and issue more instructions, 73 00:06:07,740 --> 00:06:13,615 and you're effectively increasing the bandwidth you need out of your fetch unit. 74 00:06:13,615 --> 00:06:16,791 So it's not as good. 75 00:06:17,030 --> 00:06:22,775 And finally, I just wanted to say that 76 00:06:22,775 --> 00:06:28,236 these multimedia extensions are starting to move a little bit towards vector 77 00:06:28,236 --> 00:06:31,995 processors as they add richer instruction sets. 78 00:06:31,995 --> 00:06:37,599 So as we get to SSE4, for instance, or SSE4.2, there are more instructions in there, 79 00:06:37,599 --> 00:06:43,486 in x86, that can do fancier things.
And the vector length is even getting 80 00:06:43,486 --> 00:06:47,600 longer, up to 1024 bits.
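To put rough numbers on the earlier point about issuing more instructions, the sketch below counts how many arithmetic instructions it takes to cover 64 elements at different register widths. It assumes 32-bit elements and ignores loads, stores, and loop overhead, so the counts are a lower bound rather than a measurement. The function name `instructions_needed` is made up for this example.

```python
def instructions_needed(n_elements: int, register_bits: int,
                        element_bits: int = 32) -> int:
    """Arithmetic instructions needed to touch every element once.

    Illustrative model only: assumes one instruction processes one full
    register's worth of lanes, and counts nothing else.
    """
    lanes = register_bits // element_bits
    if lanes < 1:
        raise ValueError("register is narrower than one element")
    # Ceiling division: a partly filled last register still costs one issue.
    return -(-n_elements // lanes)


for bits in (64, 128, 256, 1024):
    print(bits, instructions_needed(64, bits))
# 64-bit registers   -> 32 instructions
# 128-bit registers  -> 16 instructions
# 256-bit registers  -> 8 instructions
# 1024-bit registers -> 2 instructions
```

A vector machine with a vector length of 64 would cover the same 64 elements with a single instruction, which is why shorter registers put more pressure on instruction fetch and why longer registers narrow that gap.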