Okay. So, all here. So, let's get started. We're continuing our ELE 475 experience, and we're going to pick up where we left off last time, talking about vectors and vector machines.

Just to recap, because we went through this really fast at the end of lecture last time: when you have a vector computer, the easy thing to do is to add vectors of numbers. But what if you want to do work inside of a vector? Say you want to take a vector and sum all of the elements in it. We call this a reduction, a vector reduction. If you're trying to do this with a vector machine, you'd need some special instruction which looks at all the different elements, and that's probably a bad thing to do, because you would lose all the advantages of having a lane structure, where the elements are partitioned across the lanes. To do the reduction, you would have to have, let's say, one ALU consume all of the elements from these different lanes, and that would be sad. So if you want to do a reduction, one of the ways to go about it is to still use vectors, but to use them temporally.
And you can use, if you will, a binary tree algorithm here. You start off with a big, long vector that you want to take the sum of. The first step is you just cut it in half: you take this half of the vector and that half of the vector and add them, and you end up with partial sums, which is a vector half the length. Then again, you add this half with that half, and you can use vector instructions to do that, on something half the length. Continue, and at some point you end up with a scalar, which is the sum. So, this is pretty widely used to do vector reductions.

At the end of last class's lecture, we also briefly touched on more interesting addressing modes than the vector loads and stores we've been talking about. Up to this point, you could bank memory very well, and you could assign, let's say, different regions of memory to different lanes. And you would always be able to do a load by just reading out from the bank that was attached to a particular lane. Well, that works well for very well-structured memory accesses. But all of a sudden, let's say you want to do an operation where you have C[D[i]]. So you have a vector, D, and you want to index into that vector. So, it's a vector of addresses.
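The binary-tree reduction described above can be sketched in a few lines. This is an illustrative sketch, not the lecture's code; the function name and the power-of-two length assumption are mine. Each pass of the inner loop stands in for one vector add of half the current length.

```python
def tree_reduce_sum(v):
    """Sum v's elements by repeatedly adding the upper half of the
    vector into the lower half, halving the active length each step,
    until a single scalar remains."""
    v = list(v)
    n = len(v)
    assert n > 0 and (n & (n - 1)) == 0, "sketch assumes a power-of-two length"
    while n > 1:
        half = n // 2
        # One vector add of length `half`: lower half += upper half.
        for i in range(half):
            v[i] += v[i + half]
        n = half
    return v[0]
```

For a 64-element vector, this takes six vector adds of lengths 32, 16, 8, 4, 2, and 1, instead of 63 serial scalar adds.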
Or rather, a vector of indexes. And then you want to take that index and use it to index into C. So, this is something you commonly want to do, but you need special support for it, and a basic vector architecture may not have it. But you can add it. The MIPS vector architecture that's developed in the Hennessy and Patterson book has this instruction called load vector indirect, where you can actually have two vector registers, and the one will index into the other, and then you have a destination vector register. We call this gather. But because you don't know the addressing a priori, if you will, your memory system might get big and complex. You need to be able to have all the lanes in your vector processor talk to all of the memory. And that's probably a good thing to do anyway, to make your machine a little more flexible and to allow vectors that don't have to align to a particular address. But you have to make your memory system much more complicated to be able to do these sorts of gather operations. The scatter operation is the inverse of this. It would be SVI, a store vector indirect, which does the store where you have an indirect address for the store.
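In plain code, gather and scatter amount to indexed reads and writes over a whole vector at once. Here's a minimal sketch, with function names of my own choosing (the lecture only names the LVI and SVI instructions):

```python
def gather(mem, idx):
    """LVI-style gather: result[i] = mem[idx[i]] for each lane i."""
    return [mem[i] for i in idx]

def scatter(mem, idx, src):
    """SVI-style scatter: mem[idx[i]] = src[i] for each lane i.
    This covers the case where C[D[i]] appears on the left-hand
    side of an assignment."""
    for i, x in zip(idx, src):
        mem[i] = x
```

For example, `gather(C, D)` computes the vector C[D[i]]; the hardware cost comes from the fact that each lane's address can land in any memory bank.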
So, this would be used when the indexed expression is on the left-hand side of an assignment operation.

Okay. So now we get to talk about a couple of examples. Well, we'll touch on one example right now of a vector machine. And this is what I was trying to say when I was coming in: if you're going to build a really fast computer, and it could cost millions of dollars, it's going to look cool. So, the picture on the right here is the Cray-1. And I've had the pleasure of seeing a couple of these, and sitting on a couple of these. It has a nice little seat built into it. You can actually sit down on it, and it's warm, because this is a water-cooled machine. They later went to something called Fluorinert to cool these machines. The Cray-1 was never Fluorinert-cooled, but the Cray-2, I think, was, and the Cray-3 definitely was. But the idea is that you use water, and the operator has a nice place to sit down while he or she is working on the machine. And it's heated because these machines run quite hot, and part of the power supplies are actually under the bench here. The other fun thing about these is you'll notice they're shaped like the letter C, for Cray.
No one really knows if that's true. I think Seymour Cray actually claimed it was to somehow make the distance across the backplane shorter. But it is shaped like a C, and Seymour Cray, who's the founder of Cray, does have a C as the first letter of his name.

But for a little bit more perspective on what's actually inside of here: the Cray-1 did not actually have lots of different lanes. Instead, it was a vector computer that had very long pipelines, or long for the time, and it had a couple of pipelines for the different functional units. And it was a vector-register-style machine. Some of the interesting things about it: it didn't have any caches, and it did away with virtual memory and all that other stuff, because this is really a supercomputer; you're using it to solve some big problem. So you didn't need all this fancy multitasking and virtualization. You ran one really big problem on it; you were trying to, I don't know, somehow model nuclear weapons, or use it to crack codes, or something like that.

Here's the microarchitecture of the Cray-1. And what we see is they have 64 vector registers, or excuse me, eight vector registers with 64 elements each.
Their vector length is 64; their maximum vector length is 64. And they also have a bunch of scalar registers, and they have a separate address register bank, and you can only do loads and stores based on these address registers. What I was trying to get at here is you can see that they basically had only one pipe for each of the different operations, but these pipes were relatively long. So, to give you an idea, something like the multiply took six cycles, which today sounds like, well, things are pipelined pretty deep, we have lots of transistors. But, you know, it's 1976: there weren't that many transistors, and this thing was physically large, so building a pipeline that long took space. Another example here: I think the reciprocal took about fourteen cycles, and that was pipelined. And this did not have interlocking between the different pipe stages, and it didn't have to have bypassing, because the vector length was so long; you didn't have to bypass from some place in the pipe to some place else in the pipe. They did have chaining, and they did have inter-pipeline bypassing, but intra-pipeline bypassing wasn't really there.

A couple of other things: this machine ran really pretty fast for the day.
80 megahertz, I'm sure, was the fastest clock tick of the day. Today, that sounds pretty slow, but that was pretty good for 1976.