Today, we move on to our new topic: vector computers.

So, a little bit of introduction on vectors. A vector machine is a vector processor. Broadly, it's a way to get data-level parallelism. Many times, for, let's say, array operations, you're going to want to take one whole array and add it to another whole array. And let's say these arrays are large. Does it really make sense to have a processor sit in a tight loop doing load, add, store, load, add, store, load, add, store? The insight that comes out of this is that if you have computations that work on vectors or matrices, or even multi-dimensional matrices, you can think about building an architecture where you don't need as much instruction fetch and instruction decode bandwidth. You don't have to sit there fetching new instructions and continually operating on those new instructions. You could just have one instruction which encodes some large amount of computation, because it's all the same. That's the insight.

Also in today's lecture, we're going to be talking about single instruction, multiple data architectures. This is kind of a degenerate case of vector architectures. A good example of this is something like the multimedia extensions, MMX, in the Intel processors, or AltiVec in the PowerPC architecture.
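To make that tight loop concrete, here is a minimal sketch in C (the function name and signature are my own illustration, not from the lecture) of the scalar load/add/store loop a conventional processor would grind through, one element per iteration:

```c
#include <stddef.h>

/* Element-wise add of two arrays: the loop body compiles down to a
 * load, load, add, store sequence repeated n times -- exactly the
 * tight loop described above. */
void array_add(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   /* load a[i], load b[i], add, store c[i] */
    }
}
```

A vector machine instead expresses the same work as a handful of instructions, amortizing instruction fetch and decode over all the elements.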
The newer thing that Intel has added now they call SSE, the Streaming SIMD Extensions. And then they also now have something they call AVX, which is even wider. They can basically keep adding more instructions to make the short-vector support better.

And then finally today, if we have time, we'll be talking about graphics processing units. So, I have some examples here. This is the ATI FirePro 3D V7800, and then we have the Nvidia competitor, which is the Nvidia Tesla, I think the C2075. These are both very fast processors. And what is interesting is, these started out as graphics processors. They started out to play video games, effectively, or to do some sort of rendering of three-dimensional data. So you're taking some data and operating on it, and there's massive parallelism there: lots of different triangles in a three-dimensional image, for instance, in three-dimensional rendering. And people had this insight that the same processing architecture that is good at rendering triangles might be good at doing, let's say, dense matrix operations also. We've seen this outgrowth, and we've seen a whole programming model come up around it, and this is all very recent.
To some extent, these architectures don't come from the same lineage as normal processors. They really come from fixed-function hardware that was designed to render video games and three-dimensional sorts of scenes. So their architectures look quite a bit different, and the naming is very different. If you go pick up the manual that tells you how to program one of these things, and you come from a computer architecture background, you're just not going to understand any of the words. Your book, actually, the Hennessy and Patterson book, has a very good table which translates traditional computer architecture words into GPU words, and that makes life a lot easier.

Okay, so let's get started looking at vector processors, and let's look at the programming model first, before we look at the architecture. So this is the software model, not what the hardware looks like.

To start off here, a couple of things to note. In the traditional vector architecture, you're going to have some scalar registers. These are the registers like in a normal microprocessor; they just hold one value. They're maybe, let's say, 32 or 64 bits in width. And then you have a second register file, which holds vectors.
When you go to access one of these vectors, it's the same as with a register file: if you go to access, let's say, vector register three, that doesn't denote one value. Instead, it denotes many values at one time. We have a fixed width drawn here, but typically these things are very long. For instance, the Cray-1 had a maximum vector length of 64 elements, where each element was 64 bits. So it's a lot of data that you're moving around at one time with one operation.

An important piece of architectural, or at least programming-model, state here is the vector-length register. The vector-length register says how many of these elements are actually populated, and we'll see why that's important. But for right now, let's just think of the vector-length register as being equal to the maximum number of elements in the vector. So think of it as having 64 elements, and the vector-length register just says you're always operating on all 64 entries of data in parallel.

Now, if we go look at the programming model connected to this, we need to add some extra instructions.
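As a rough sketch of this programmer-visible state, using the Cray-1 numbers above (the struct layout and the register counts are my own illustration, not a real machine definition):

```c
#include <stdint.h>

#define MVL 64  /* maximum vector length: 64 elements, as on the Cray-1 */

/* Programmer-visible state of a classic vector machine: scalar
 * registers hold one value each; vector registers hold up to MVL
 * 64-bit elements; the vector-length register (vl) says how many
 * of those elements are live for the next operation. */
typedef struct {
    uint64_t scalar[8];     /* scalar registers, one 64-bit value each */
    uint64_t vec[8][MVL];   /* vector registers, MVL elements apiece   */
    unsigned vl;            /* vector-length register, 0..MVL          */
} VectorState;
```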
In our scalar processors, or all the processors we've been talking about up to this point, an instruction operates on one register with one other register. That still exists in this model, but it operates only on the scalar registers. Now, the reason we still have the scalar registers around in this model is that we want to have things like branch conditions and address computations, things like that, which are not vectorizable. You don't have 64 addresses. Well, maybe you do in certain cases, but typically you're not going to have that lying around; you're just going to have one address you need to load from, and for branches, you need to branch based on some single value, not all 64 values.

But we now add some special extra instructions. If you go look in your book, they develop an architecture they call VMIPS, or vector MIPS. They add some extra instructions which look very similar to normal MIPS, but all of a sudden they put some Vs at the end. So, .VV means an instruction operates on a vector with another vector.
They also define some instructions which have a V and an S, where the S is for scalar, so you can do a vector plus a scalar. That would be something along the lines of ADDVS, add vector-scalar, where you're adding one vector with a scalar register. If the scalar register is, let's say, loaded with one, you could do this add and it will increment every element of the vector by one.

You also have loads and stores, which can pull very large chunks of memory out of the arrays in memory, and put very large chunks back. But if you look at what's going on in one of these instructions, we're taking one vector and another vector, putting them into some sort of arithmetic operation, and then storing the result into another register. This is a register-register vector architecture. There have been some register-memory and memory-memory vector architectures out there, where instead of naming vector registers you can name places in memory, but the register-register variants are the most popular, just like the register-register scalar computer architectures are now the most popular.

One thing I did want to point out here is, we've said nothing about how many ALUs there are in this architecture.
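A minimal sketch, in C, of what an add-vector-scalar instruction computes; vl stands in for the vector-length register (the function name is mine, not real VMIPS syntax):

```c
#include <stddef.h>

/* ADDVS-style semantics: add one scalar value to every live element
 * of a source vector, writing the result into a destination vector. */
void addvs(double *vdst, const double *vsrc, double scalar, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        vdst[i] = vsrc[i] + scalar;
    }
}
```

With the scalar set to 1.0, one call increments every element by one, as in the example above.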
This is just the abstract programming model. So don't get this confused with having one, two, three, four, five, six functional units or something like that. This is just an abstract model right now; we have not talked about the hardware.

So, this brings up: how do we get data? We have an instruction here that we'll call load vector. Load vector has a destination, which is a vector register, and a source, which is a scalar register. You might have another offset register, but let's say there's only one register in our basic load vector operation here, and this is the address that points to the base of the vector in memory. When you go to do this load, it's actually going to pull data in from memory into our vector register.

You could also start to think about having interesting offsets or strides here. That's what this picture is trying to show: we have a base pointer, pointed to by register one. It's a scalar register, and note that the naming is different: these have Vs and these are Rs. And then we have a stride here which says where in memory to take from. So you can think about having something where you read from multiple locations in memory, but you want every fifth element or something like that.
So, you could load register two here with five and register one here with the base address, and then do this load vector instruction, and it'll take every fifth piece of memory, of some data size, and load it into the vector register. This is our abstract model, but at the beginning here, let's assume what's called unit stride, which basically means the stride here is always one, so it's always getting the next value in a row. We'll talk about the more complicated cases with non-unit stride later.

Okay, so let's look at what this does to code. Here we have a basic code example. It's going to multiply, element-wise, the different elements of two vectors here, A and B, and deposit the result into vector C. Now, this is in memory, because this is C code, so these are actually arrays. And obviously this is not matrix multiplication, because matrix math is much more complicated; this is an element-wise multiplication.

If you go look at the scalar assembly code, well, first of all, we need to have a loop. We have to load the first value, load the second value, do the multiply, do the store. This is showing code for double-precision floating-point multiplies. Then you have to increment a bunch of pointers, check the boundary case, and loop around.
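Stepping back to the strided load from a moment ago, here is a C sketch of its semantics; for simplicity the stride is given in elements rather than bytes, and the function name is my own:

```c
#include <stddef.h>

/* Strided load-vector semantics: gather vl elements starting at base,
 * stepping stride elements between accesses (stride 1 = unit stride,
 * stride 5 = every fifth element, as in the example). */
void load_vector_strided(double *vreg, const double *base,
                         size_t stride, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        vreg[i] = base[i * stride];
    }
}
```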
On your vector architecture, life gets a little bit easier here, because we can do all 64 of these in one instruction; we don't have to loop. All you really have to do is load vector, load vector, multiply, and store. And the instruction at the top here loads the vector-length register. We load the vector-length register with 64 here, because we're trying to do 64 elements. But if we were to load the vector-length register with, say, 32, we would only do the first 32 multiplications. And you can set that vector-length register anywhere up to the maximum vector length.

So, for the vector-length register, there's this value we call the maximum vector length, which is the largest possible length of a vector. The vector-length register says, for the given operation we're about to compute, how many of those operations we should do. So you could easily have something with a vector length of a thousand, but you only want to do, let's say, the first 64 operations, so you load your vector-length register with 64 and only do 64 operations.

A good example of this, actually, is some of the supercomputers.
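The interplay between the vector-length register and the maximum vector length is usually expressed as a strip-mined loop. Here is a sketch in C, with mulv standing in for one vector multiply instruction and vl playing the role of the vector-length register (the helper names are my own illustration):

```c
#include <stddef.h>

#define MVL 64  /* maximum vector length, as on the Cray-1 */

/* One "vector instruction": multiply vl elements, element-wise. */
static void mulv(const double *a, const double *b, double *c, size_t vl) {
    for (size_t i = 0; i < vl; i++) c[i] = a[i] * b[i];
}

/* Strip-mined driver: each trip sets the vector length to at most MVL,
 * issues one vector multiply, and advances to the next strip, so an
 * array of any size is handled in MVL-sized chunks. */
void vmul(const double *a, const double *b, double *c, size_t n) {
    for (size_t done = 0; done < n; done += MVL) {
        size_t vl = (n - done < MVL) ? (n - done) : MVL; /* set VL */
        mulv(a + done, b + done, c + done, vl);
    }
}
```

For n = 100, the loop issues one full strip of 64 and a final strip of 36, which is exactly what setting the vector-length register per iteration buys you.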
The Cray machines have relatively short maximum vector lengths, but if you go look at something like the NEC Japanese supercomputers, the NEC SX-8 or SX-9, or whatever is the newest (I think it's the SX-9, the new Japanese vector supercomputer, which I think is actually now probably the fastest computer in the world), they have very long maximum vector lengths. They can actually have a vector length of a thousand, so that in one instruction they can basically encode a thousand operations, which is pretty fancy. But you still need to be able to set the vector length, because maybe you don't want to do all thousand all the time.

Okay, so why does this vector stuff have advantages? If you go back to the early vector machines, something like the Control Data Corporation 6600 or the Cray-1, they have very deep pipelines. And if you think about the architecture we've been building up to this point, we had to add a lot of forwarding logic and a lot of bypasses to be able to bypass one value to the next. Well, if you have a very deep pipeline, and you observe back-to-back multiplies or something like that, you're going to stall a lot.
But in a vector computer, because you know you're operating on, let's say, 64 operations at a time anyway, this actually allows you to take out a lot of the bypassing. Some of these vector architectures have no bypassing in them at all, because if you're going to be operating on 64 things and your pipeline length is six anyway, there's no possibility that you'll ever actually have to forward data back to, let's say, the same operation; in the early machines, all the communication between different operations happened through the register file itself.

Also, deep pipelines are good because you can have very fast clock rates. To give you an example, the old Cray-1 had an 80 megahertz clock. Now, you might say, 80 megahertz, ooh, that's not very fast. But 80 megahertz back in, probably, the late '60s or early '70s was a very fast clock rate for a processor. I mean, these were supercomputers, mind you, but they were very aggressive, and they could do that because they had deep, deep pipelining and lots and lots of logic, and these things were physically large.

I mentioned the memory system. Vector computers have some interesting changes that you have to think about in the memory system. One of the things you can do, because you have so many memory operations going on, is use vector loads.
With a vector load, you can actually overlap going out to main memory for one element with doing the next load, effectively, even if you're issuing the accesses sequentially. Most of these vector architectures have many, many memory banks. And what's nice is, if you have unit stride, you know that your first load is going to go to this bank, the next to that bank, then that bank, that bank, that bank, so you get basically a very good bank distribution, or bank utilization. And this is assuming, right now, that we're actually only issuing one memory load at a time.

I have a little note up here that says, okay, each load takes, let's say, four cycles of bank busy time, and you have a twelve-cycle latency to get out to memory in this Cray-1 machine. Well, on a normal architecture this would be pretty bad, because you'd be stalling twelve cycles, let's say, to go out to your memory system. That's not the end of the world, but it's not great if you have a load and then a use, a load and a use, and you just keep going back and forth between those loads and uses.
But in the vector architecture, because we have a long vector length and we're loading 64 different values, and we know that they're going to have a good distribution over many different memory banks, we can effectively do this one load and overlap the latencies in the memory banks with each other. So we'll start one load here, and then one load here, one load here. And if each access has a four-cycle occupancy on its respective bank, and we have a 64-entry vector, then definitely by the time we wrap around and get back to using this bank again, that first operation will be done. So it's a relatively effective way to increase the bandwidth of your architecture and guarantee that you're not going to have bank conflicts.
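A back-of-the-envelope check of that claim: the sketch below models the four-cycle bank busy time from the lecture, assuming for illustration 16 interleaved banks and one access issued per cycle (the bank count and the function itself are my own assumptions, not Cray-1 specifics).

```c
#include <stdbool.h>
#include <stddef.h>

#define NBANKS    16  /* assumed number of interleaved banks (illustrative) */
#define BUSY_TIME  4  /* cycles a bank stays busy per access (lecture)      */

/* Returns true if issuing one access per cycle with the given element
 * stride ever revisits a bank that is still busy. */
bool has_bank_conflict(size_t stride, size_t n_accesses) {
    long last_use[NBANKS];
    for (int b = 0; b < NBANKS; b++) last_use[b] = -BUSY_TIME; /* "never used" */
    for (size_t i = 0; i < n_accesses; i++) {
        size_t bank = (i * stride) % NBANKS; /* element index mod bank count */
        long cycle = (long)i;                /* one access issued per cycle  */
        if (cycle - last_use[bank] < BUSY_TIME) return true;
        last_use[bank] = cycle;
    }
    return false;
}
```

Unit stride revisits a bank only every 16 accesses, comfortably more than the 4 busy cycles, while a stride equal to the bank count hits the same bank every cycle and conflicts immediately.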