Okay. So, all here. So, let's get started. We're continuing our ELE 475 experience, and we're going to pick up where we left off last time, talking about vectors and vector machines.

Just to recap, because we went through this really fast at the end of lecture last time: when you have a vector computer, the easy thing to do is to add vectors of numbers. But what if you want to do work inside of a vector? Say you want to take a vector and sum all of the elements in it. We call this a reduction, a vector reduction. If you're trying to do this with a vector machine, you'd need some special instruction which looks at all the different elements, and that's probably a bad thing to do, because you would lose all the advantages of having a lane structure, where the elements are partitioned across the lanes. To do the reduction, you would have to have, let's say, one ALU consume all of the elements from these different lanes, and that would be sad. So if you want to do a reduction, one of the ways to go about it is to still use vectors, but to use them temporally.
And you can use, if you will, a binary tree algorithm here. You start off with a big, long vector that you want to take the sum of. The first step is you just cut it in half: you take this half of the vector and that half of the vector and add them, and you end up with partial sums, which is a vector half the length. Then again, you add this half with that half, and you can use vector instructions to do that, on something half the length. Continue, and at some point you end up with a scalar, which is the sum. So, this is pretty widely used to do vector reductions.

At the end of last class's lecture, we also briefly touched on more interesting addressing modes than the vector loads and stores we've been talking about. Up to this point, you could bank memory very well, and you could assign, let's say, different regions of memory to different lanes. And you would always be able to do a load by just reading out from the bank that was attached to a particular lane. Well, that works well for very well-structured memory accesses. But all of a sudden, let's say you want to do an operation where you have C[D[i]]. So you have a vector, D, and you want to index into that vector. So, it's a vector of addresses.
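The binary-tree reduction described above can be sketched in a few lines. This is an illustrative sketch, not the lecture's code; the function name and the power-of-two length assumption are mine. Each pass of the inner loop stands in for one vector add of half the current length.

```python
def tree_reduce_sum(v):
    """Sum v's elements by repeatedly adding the upper half of the
    vector into the lower half, halving the active length each step,
    until a single scalar remains."""
    v = list(v)
    n = len(v)
    assert n > 0 and (n & (n - 1)) == 0, "sketch assumes a power-of-two length"
    while n > 1:
        half = n // 2
        # One vector add of length `half`: lower half += upper half.
        for i in range(half):
            v[i] += v[i + half]
        n = half
    return v[0]
```

For a 64-element vector, this takes six vector adds of lengths 32, 16, 8, 4, 2, and 1, instead of 63 serial scalar adds.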
Or rather, a vector of indexes. And then you want to take that index and use it to index into C. So, this is something you commonly want to do, but you need special support for it, and a basic vector architecture may not have it. But you can add it. The MIPS vector architecture that's developed in the Hennessy and Patterson book has this instruction called load vector indirect, where you can actually have two vector registers, and the one will index into the other, and then you have a destination vector register. We call this gather. But because you don't know the addressing a priori, if you will, your memory system might get big and complex. You need to be able to have all the lanes in your vector processor talk to all of the memory. And that's probably a good thing to do anyway, to make your machine a little more flexible and to allow vectors that don't have to align to a particular address. But you have to make your memory system much more complicated to be able to do these sorts of gather operations. The scatter operation is the inverse of this. It would be SVI, a store vector indirect, which does the store where you have an indirect address for the store.
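In plain code, gather and scatter amount to indexed reads and writes over a whole vector at once. Here's a minimal sketch, with function names of my own choosing (the lecture only names the LVI and SVI instructions):

```python
def gather(mem, idx):
    """LVI-style gather: result[i] = mem[idx[i]] for each lane i."""
    return [mem[i] for i in idx]

def scatter(mem, idx, src):
    """SVI-style scatter: mem[idx[i]] = src[i] for each lane i.
    This covers the case where C[D[i]] appears on the left-hand
    side of an assignment."""
    for i, x in zip(idx, src):
        mem[i] = x
```

For example, `gather(C, D)` computes the vector C[D[i]]; the hardware cost comes from the fact that each lane's address can land in any memory bank.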
So, this would be used when the indexed expression is on the left-hand side of an assignment operation.

Okay. So now we get to talk about a couple of examples. Well, we'll touch on one example right now of a vector machine. And this is what I was trying to say when I was coming in: if you're going to build a really fast computer, and it could cost millions of dollars, it's going to look cool. So, the picture on the right here is the Cray-1. And I've had the pleasure of seeing a couple of these, and sitting on a couple of these. It has a nice little seat built into it. You can actually sit down on it, and it's warm, because this is a water-cooled machine. They later went to something called Fluorinert to cool these machines. The Cray-1 was never Fluorinert-cooled, but the Cray-2, I think, was, and the Cray-3 definitely was. But the idea is that you use water, and the operator has a nice place to sit down while he or she is working on the machine. And it's heated because these machines run quite hot, and part of the power supplies are actually under the bench here. The other fun thing about these is you'll notice they're shaped like the letter C, for Cray.
No one really knows if that's true. I think Seymour Cray actually claimed it was to somehow make the distance across the backplane shorter. But it is shaped like a C, and Seymour Cray, who's the founder of Cray, does have a C as the first letter of his name.

But for a little bit more perspective on what's actually inside of here: the Cray-1 did not actually have lots of different lanes. Instead, it was a vector computer that had very long pipelines, or long for the time, and it had a couple of pipelines for the different functional units. And it was a vector-register-style machine. Some of the interesting things about it: it didn't have any caches, and it did away with virtual memory and all that other stuff, because this is really a supercomputer; you're using it to solve some big problem. So you didn't need all this fancy multitasking and virtualization. You ran one really big problem on it; you were trying to, I don't know, somehow model nuclear weapons, or use it to crack codes, or something like that.

Here's the microarchitecture of the Cray-1. And what we see is they have 64 vector registers, or excuse me, eight vector registers with 64 elements each.
Their vector length is 64; their maximum vector length is 64. And they also have a bunch of scalar registers, and they have a separate address register bank, and you can only do loads and stores based on these address registers. What I was trying to get at here is you can see that they basically had only one pipe for each of the different operations, but these pipes were relatively long. So, to give you an idea, something like the multiply took six cycles, which today sounds like, well, things are pipelined pretty deep, we have lots of transistors. But, you know, it's 1976: there weren't that many transistors, and this thing was physically large, so building a pipeline that long took space. Another example here: I think the reciprocal took about fourteen cycles, and that was pipelined. And this did not have interlocking between the different pipe stages, and it didn't have to have bypassing, because the vector length was so long; you didn't have to bypass from some place in the pipe to some place else in the pipe. They did have chaining, and they did have inter-pipeline bypassing, but intra-pipeline bypassing wasn't really there.

A couple of other things: this machine ran really pretty fast for the day.
80 megahertz, I'm sure, was the fastest clock tick of the day. Today, that sounds pretty slow, but that was pretty good for 1976.