1
00:00:00,000 --> 00:00:04,311
Okay.

2
00:00:04,311 --> 00:00:10,740
So now we're going to move off of vectors
and talk about sort of a near cousin of

3
00:00:10,740 --> 00:00:14,111
vectors,
or how you can have vector

4
00:00:14,111 --> 00:00:22,153
computing on your desktop today.
A lot of this was done

5
00:00:22,153 --> 00:00:30,271
actually by Ruby Lee here at Princeton.
She added a lot of multimedia extensions

6
00:00:30,271 --> 00:00:36,780
to the HP PA-RISC architecture.
There's a couple of other people involved

7
00:00:36,780 --> 00:00:43,022
in this, but she was actually pretty
influential in getting this done.

8
00:00:43,022 --> 00:00:49,421
The idea here is that if you have a
wide register, so if you're doing, let's

9
00:00:49,421 --> 00:00:55,067
say, 64-bit additions,
and you don't want to have to do 64-bit

10
00:00:55,067 --> 00:01:00,413
additions, or don't actually have 64-bit
data lying around, you could cut it in

11
00:01:00,413 --> 00:01:03,477
half and do two 32-bit operations at the
same time,

12
00:01:03,477 --> 00:01:07,520
or you can use that same ALU and try to
do four 16-bit operations,

13
00:01:07,980 --> 00:01:13,215
or eight 8-bit operations.
So, this is called SIMD, or Single

14
00:01:13,215 --> 00:01:19,846
Instruction, Multiple Data,
or short SIMD instructions here, because

15
00:01:19,846 --> 00:01:24,034
typically the vector length is pretty
short,

16
00:01:24,034 --> 00:01:30,055
or multimedia extensions.
And you have an instruction which says, I

17
00:01:30,055 --> 00:01:34,680
want to do two 32-bit adds, we'll say, at
the same time.
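
As a rough illustration of what "two 32-bit adds at the same time" means, here is a minimal C sketch. The helper name `add_2x32` is hypothetical and not from the lecture; it emulates in software what a single SIMD instruction would do on a 64-bit register, with each 32-bit lane wrapping around independently.

```c
#include <stdint.h>

/* Hypothetical helper (not a real instruction or library call):
 * treats each 64-bit word as two independent 32-bit lanes and
 * adds them lane by lane. No carry crosses from bit 31 to bit 32. */
uint64_t add_2x32(uint64_t a, uint64_t b)
{
    /* Low lane: add the bottom 32 bits, discarding any carry out. */
    uint64_t lo = (uint32_t)((uint32_t)a + (uint32_t)b);
    /* High lane: add the top 32 bits, again discarding the carry out. */
    uint64_t hi = (uint32_t)((uint32_t)(a >> 32) + (uint32_t)(b >> 32));
    return (hi << 32) | lo;
}
```

For example, `add_2x32(0x00000001FFFFFFFFULL, 0x0000000100000001ULL)` gives `0x0000000200000000ULL`: the low lane wraps to zero without disturbing the high lane, whereas an ordinary 64-bit add would carry into it.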

18
00:01:36,400 --> 00:01:42,555
This was popularized in x86 at least;
MMX was the first

19
00:01:42,555 --> 00:01:48,182
implementation of this.
And it's sort of gone on from there

20
00:01:48,182 --> 00:01:51,348
to SSE, SSE3, SSE4, and now Intel
AVX.

21
00:01:51,348 --> 00:01:58,856
And the differences between MMX and all
the different SSEs largely have to do with

22
00:01:58,856 --> 00:02:03,266
the length of the register and how many
instructions they had.

23
00:02:03,479 --> 00:02:09,383
So in AVX we've gone to 256-bit registers,
wider registers, and it's extensible to, I

24
00:02:09,383 --> 00:02:16,326
think, 1024 bits.
One thing I do want to point out about

25
00:02:16,326 --> 00:02:22,135
this, which is interesting, is that this requires
changes to your data path.

26
00:02:22,135 --> 00:02:28,348
If you have an adder doing a 64-bit
add, and now you want to do eight

27
00:02:28,348 --> 00:02:35,640
8-bit adds, you need to cut the carry
chain in seven places.
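
To make the carry-chain point concrete, here is a small C sketch, assuming a 64-bit word that holds eight 8-bit lanes. It is a software emulation, not the hardware design described in the lecture, and the name `add_8x8` is hypothetical. The mask plays the role of the cuts: carries are kept from crossing any of the seven internal byte boundaries.

```c
#include <stdint.h>

/* Hypothetical helper: eight independent 8-bit adds inside one 64-bit
 * word, done in software by blocking carries at byte boundaries. */
uint64_t add_8x8(uint64_t a, uint64_t b)
{
    /* The top bit of each of the eight bytes. */
    const uint64_t H = 0x8080808080808080ULL;

    /* Add only the low 7 bits of every byte. The largest per-byte sum
     * is 0x7F + 0x7F = 0xFE, so a carry can reach bit 7 of its own
     * byte but can never spill into the neighbouring byte. */
    uint64_t low7 = (a & ~H) + (b & ~H);

    /* Bit 7 of each byte in low7 is the carry into that position.
     * XOR in the original top bits to finish the sum; the carry out
     * of each byte is simply dropped. */
    return low7 ^ ((a ^ b) & H);
}
```

For instance, `add_8x8(0x01FF, 0x0001)` yields `0x0100`: the low byte wraps from `0xFF` to `0x00` and the byte above it is left alone, where a plain 64-bit add would produce `0x0200`.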

28
00:02:36,040 --> 00:02:42,345
Now, that's if you have a basic adder.
I guess it gets a little more complicated

29
00:02:42,345 --> 00:02:49,517
if you have something like a
carry-lookahead adder, or

30
00:02:49,517 --> 00:02:54,060
something like that,
because you may not have a simple place to

31
00:02:54,060 --> 00:02:58,685
go snip the carry chains.
There is still some place to cut it,

32
00:02:58,685 --> 00:03:02,054
but in your original design, you
might have propagated across,

33
00:03:02,054 --> 00:03:05,958
where now you need to cut at the boundary.
So this is definitely a

34
00:03:05,958 --> 00:03:08,579
challenge.
Also, for things like multiplies, if you

35
00:03:08,579 --> 00:03:12,964
want to do eight 8-bit multiplies,
the structure looks a little bit

36
00:03:12,964 --> 00:03:15,852
different there.
But the big insight

37
00:03:15,852 --> 00:03:20,130
here is that you had that logic anyway.
You're just effectively adding muxes on

38
00:03:20,130 --> 00:03:23,204
the carry chains in the data
path.

39
00:03:23,204 --> 00:03:26,837
And for some operations you don't even need to
add anything.

40
00:03:26,837 --> 00:03:33,195
Obviously, if you're operating on something
like eight 8-bit values and you want to

41
00:03:33,195 --> 00:03:36,638
do the logical OR of them,
you don't need to add a special

42
00:03:36,638 --> 00:03:44,119
instruction for that.
From an implementation perspective, this is

43
00:03:44,119 --> 00:03:49,873
what I was trying to get at here. You
have independent adds going on, and they

44
00:03:49,873 --> 00:03:55,142
all happen in parallel. So why do we
like multimedia extensions, or these

45
00:03:55,142 --> 00:03:58,747
vector instructions or short vector
instructions?

46
00:03:58,747 --> 00:04:02,075
And let's compare them to our big vector
machines.

47
00:04:02,075 --> 00:04:07,344
So, one of the major differences is that
you can't control the vector length.

48
00:04:07,344 --> 00:04:14,103
The vector length is fixed by the length of
the native data word, or the length of

49
00:04:14,103 --> 00:04:18,474
the instruction set,
or rather, the length of the

50
00:04:18,474 --> 00:04:23,780
native data type for your instruction set.
And,

51
00:04:24,040 --> 00:04:27,593
strided, scatter-gather, and these other
operations are hard to do,

52
00:04:27,593 --> 00:04:30,797
because typically you just have a single
load and store.

53
00:04:30,797 --> 00:04:34,176
And you use the processor's load and
store instructions,

54
00:04:34,176 --> 00:04:38,487
because the processor doesn't care.
It's just the same way that unary

55
00:04:38,487 --> 00:04:43,147
operations or logical operations don't
need special instructions to do short

56
00:04:43,147 --> 00:04:46,293
vector, or single instruction multiple
data operations.

57
00:04:46,293 --> 00:04:51,012
You don't need special instructions for
SIMD data to be able to do loads and

58
00:04:51,012 --> 00:04:52,760
stores.
You just load the data.

59
00:04:53,020 --> 00:04:57,937
And store the data.
This is actually starting to change a

60
00:04:57,937 --> 00:05:02,199
little bit.
Some of the new versions of SSE actually

61
00:05:02,199 --> 00:05:06,420
do have some scatter-gather
modifications.

62
00:05:06,420 --> 00:05:13,800
It's a little bit harder if you
think about it, because you can't hold a

63
00:05:13,800 --> 00:05:20,200
full address, if you will, in a vector.
So it's not like you can actually do sort

64
00:05:20,200 --> 00:05:24,160
of indexed addressing,
indexed addresses, because you can't

65
00:05:24,160 --> 00:05:26,740
necessarily hold the full address in
there.

66
00:05:26,740 --> 00:05:31,780
But, in essence, they've sort of come up
with some way to do scatter and gather

67
00:05:31,780 --> 00:05:38,259
operations.
One thing about having the vector

68
00:05:38,259 --> 00:05:45,033
register length being limited is that you
can't do as much work in one operation.

69
00:05:45,033 --> 00:05:51,556
So, you can't necessarily do 64
operations in one instruction, like we did

70
00:05:51,556 --> 00:05:56,908
with our vector length of 64.
So that just is a

71
00:05:56,908 --> 00:06:00,922
problem.
And unfortunately, what happens here

72
00:06:00,922 --> 00:06:06,860
is you end up having to do more operations
and issue more instructions.
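
A back-of-the-envelope sketch of that cost, written in C. The function name `simd_ops_needed` and the choice of 32-bit elements are assumptions made for illustration; the lecture only gives the vector length of 64.

```c
#include <stddef.h>

/* Hypothetical helper: how many SIMD instructions it takes to apply one
 * operation to n_elems elements, given the register width and element
 * width in bits. Rounds up so a partial final register still counts. */
size_t simd_ops_needed(size_t n_elems, size_t reg_bits, size_t elem_bits)
{
    size_t lanes = reg_bits / elem_bits;   /* elements per register */
    return (n_elems + lanes - 1) / lanes;  /* ceiling division */
}
```

Under these assumptions, 64 elements of 32 bits each take 16 instructions with 128-bit registers (`simd_ops_needed(64, 128, 32)`) and 8 with 256-bit registers, compared with a single instruction on a vector machine whose vector length is 64.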

73
00:06:07,740 --> 00:06:13,615
And you're effectively increasing the
bandwidth out of your fetch unit.

74
00:06:13,615 --> 00:06:16,791
So it's not as
good.

75
00:06:17,030 --> 00:06:22,775
And finally, I just wanted to say
that

76
00:06:22,775 --> 00:06:28,236
these multimedia extensions are starting
to move a little bit towards vector

77
00:06:28,236 --> 00:06:31,995
processors as they add richer
instruction sets.

78
00:06:31,995 --> 00:06:37,599
So, as we get to SSE4, for instance, or
SSE4.2, there are more instructions in there

79
00:06:37,599 --> 00:06:43,486
in x86 that can do fancier things.
And the vector length is even getting

80
00:06:43,486 --> 00:06:47,600
longer, up to 1024 bits.