Okay. So, we talked about vector processors, and we introduced this short-vector idea with single instruction, multiple data instructions. I wanted to talk a little bit about one of the common places that you see something that's like a vector processor, but isn't quite a vector processor, and that's in graphics processing units. This is the graphics card in your computer. My laptop here has an ATI graphics chip in it, and I've actually run OpenCL code compiled for it, which is a general-purpose use of a graphics processing unit.

One of the things I wanted to kick this off with is that these architectures look strange from a computer architecture perspective. It really goes back to the fact that they were not designed to be general-purpose computing machines; they were designed to render 3D graphics. The early versions had very fixed pipelines and no programmability, so you couldn't use them for general-purpose computation at all. Then they started to get a little more flexibility. An example of this: the original Nvidia chips had something called pixel shaders. The idea is that, on a per-pixel basis, as you render a three-dimensional picture for a game or something like that, you could apply a little customization to how each pixel gets rendered. The game programmer, for instance, got to write a little program that ran for each pixel as it was rendered to the screen.

What's interesting is that a programming language actually developed around these pixel shaders, and the first implementations of general-purpose computing on these devices came from people writing pixel shaders that nominally rendered pixels but actually computed something else entirely. You could do a matrix multiplication, and the result would come out as a picture. [laugh] Kind of a strange idea: we take one picture and another picture, run some special shading on them that makes it look sort of like a 3D image, and all of a sudden the output is actually the result we wanted.
Well, the people who make graphics processing units got the bright idea that maybe they could make this a little easier, and grow their user base beyond people buying graphics cards for graphics, if they encouraged people to write programs on these. So they started to make them more and more general purpose. Instead of just running a pixel shader, working on one pixel at a time with everything very pixel-oriented, or graphics-oriented, they renamed everything, came up with a programming model, and exposed some of the architecture, making it a little more general purpose so that you might be able to run other programs on it. That's really what happened. As I said, the first people who tried to use these machines for general-purpose work had to program them through the pixel shaders and the per-vertex computations that were baked into the original architectures, and that was very hard, because you were basically trying to think about everything as a picture in a frame and then back-compute what was going on. But as things moved along, we started to see new languages, new programming support, and the architectures became more general purpose.

That brings us to general-purpose graphics processing units. We still have "GPU" in the name, so it's still special purpose, but we stick "GP" on the front and call these GPGPUs: General-Purpose Graphics Processing Units. A good example of this is the programming language Nvidia came up with, called CUDA. Now, this was not the first foray into this area; some research languages and ideas predated it. But the idea here, well, we'll come back to that in a second. The GPGPU programming model is a little bit strange: it's a threading model, and we're going to talk about that in a bit more detail.

First of all, though, I wanted to point out some differences between a GPGPU and a true vector processor. With a GPGPU, there's a host CPU, which is the x86 processor in your computer. Then you go across the bus, the PCIe bus, or PCI bus, or AGP bus, out onto a graphics card.
So there is a host processor, but there's no control processor like in the vector processors we talked about last lecture. The scalar processor is really far away and doesn't drive everything as strictly; you basically have two processors running, the host processor and the graphics processing unit. Because of this, you actually have to run all of your control code on the vector unit, somehow. This attached-processor model does have some advantages, though: you can run the data-parallel parts of the program on the GPU and have the host processor, the x86 processor connected to it across the bus, running some other program at the same time.

So, let's dive into detail here and talk about the Compute Unified Device Architecture, which is what CUDA stands for. CUDA is Nvidia's programming model for their graphics processors. There's also a broader, industry-accepted way to program these that uses roughly the same programming model, called OpenCL; the name is designed to evoke OpenGL, the graphics language that is widely used for 3D rendering.

You have to suspend disbelief here for a second, because this model is a little bit odd, I think. But let's talk about what CUDA is. Let's look at a non-CUDA program first; it's the upper portion of this code. We're computing y of i plus x of i times some scalar value. This is the traditional "scalar times one vector plus another vector" operation, and it shows up as the inner loop of LINPACK, which is a benchmark that a lot of people run. It's pretty similar to what we were doing before: you're adding one vector to another vector, except that one of the vectors gets multiplied by a scalar first. Pretty simple.

In CUDA, the basic idea is that you don't do a lot of work per, what we're going to call, thread. What we do here is define a block of data, with some special annotations to say what's going on, and this block has some size.
Then what we do is define a thread that can operate in parallel across these 256 data values, with 256 threads per block. What's going to happen is that each thread does the same operation, y of i gets a times x of i plus y of i, and stores the result back. But look at where i comes from: i is computed from some special keywords that tell you which thread number you are, effectively which index you are within this thread block. And this if statement here is the moral equivalent of our strip-mining loop. It says: if you're past however much work we're trying to do, don't do any more work. So we can block the work into some number of threads, pass in an n, where n is the amount of work to actually do, the length of the vector, and check against it along the way.

What I find interesting is that, at first view, it looks like each of these threads is completely independent. And that is the programming model: each thread is independent. Unfortunately, on the computer architecture this actually runs on, the threads are not independent. So in CUDA you don't want the threads to diverge. It's allowed; for instance, if there's an if statement in here, one thread can take the branch while a different thread falls through. But if you do that, you're basically just not using one of the pipelines for a while. I'll come back to this.

Let's talk about how this programming model shows up. The model is that you have lots of different little threads, you make these sorts of micro-threads, and then the runtime system plus the compiler, the CUDA compiler, packs the micro-threads together so that they operate on the exact same instructions at the same time. So, they're hiding a single instruction, multiple data architecture under the hood of these threads. If we look at our example, each thread does a load, a multiply, another load, an add, and the store; that's our a times x plus y operation.
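What's being described here is essentially the standard CUDA SAXPY/DAXPY example. Below is a minimal sketch of both versions; the function and variable names are mine rather than necessarily the ones on the slide, the data is assumed to be single precision, and d_x and d_y are assumed to be device pointers already set up with cudaMalloc and cudaMemcpy.

    // Conventional C loop: runs entirely on the host CPU.
    void saxpy_serial(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // CUDA version: each micro-thread handles one element.
    // __global__ marks this as a kernel the host launches on the GPU.
    __global__ void saxpy_parallel(int n, float a, float *x, float *y) {
        // Special keywords give each thread its index in the grid.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Moral equivalent of the strip-mining check: threads that land
        // past the end of the vector do no work.
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: 256 threads per block, enough blocks to cover n elements.
    void run_saxpy(int n, float a, float *d_x, float *d_y) {
        int threads_per_block = 256;
        int nblocks = (n + threads_per_block - 1) / threads_per_block;
        saxpy_parallel<<<nblocks, threads_per_block>>>(n, a, d_x, d_y);
        cudaDeviceSynchronize();  // wait for the GPU to finish
    }

Note how little work each thread does: one multiply and one add. All of the looping structure has been traded for the grid of threads plus the if guard.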
And across the other dimension, all of these threads are doing the same operation, or effectively the same operation. They call this single instruction, multiple thread, which is kind of a funny idea. In reality, it's single instruction, multiple data, but they introduce this notion of threads, with predication, on top of the single instruction, multiple data machine, to allow one pipe, one micro-thread, to do something slightly different, via predication.

So what are the implications of single instruction, multiple thread? Well, strangely enough, because it's hard to control the ordering of the threads relative to each other and to the data, the memory system has to support all different alignments and all different kinds of scatter-gather operations. They don't actually try to control the addressing, because each thread could potentially do some scatter operation. Instead, they have some really smart logic that takes all the addresses coming out of the execution units and says, oh, these look like they line up, so we'll issue them at the same time. If you happen to have threads that all access, say, a of i, where i is the thread number, then you have unit-stride accesses, and the hardware goes and tries to pack them all together. And, as I mentioned before, you have to use predication if you have different control flow; you need strong use of predication to allow threads to go in different directions.

Things get even more complicated in these general-purpose GPU architectures, where you can take the word "warp" and replace it with what the rest of the field would call a thread. If you go read your textbook, it has a nice table that translates GPGPU nomenclature into the nomenclature used by the rest of computer architecture. Unfortunately, the GPU people came up with completely different names for everything, which is just kind of annoying, because names already existed for everything. If you look inside one of these GPUs, they really are a massively parallel architecture with multiple lanes, like our vector architecture, and then, on top of that, they are a multithreaded architecture.
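To make the divergence and address-coalescing points concrete, here is a small sketch; the kernel and its names are illustrative assumptions of mine, not code from the lecture.

    __global__ void divergence_and_coalescing(int n, const float *a, float *b) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Coalesced access: consecutive threads touch consecutive elements
        // (a[i], where i is the thread number), so the hardware can merge
        // a warp's loads into a few wide memory transactions.
        float v = a[i];

        // Divergent branch: within one warp, some threads take the "if"
        // side and others the "else" side. The hardware runs both paths,
        // using per-thread predication to mask off the inactive lanes,
        // so pipeline slots are wasted while this executes.
        if (i % 2 == 0)
            v = v * 2.0f;
        else
            v = v + 1.0f;

        b[i] = v;
    }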
Typically, these architectures don't have caches. To hide a lot of the memory latency, and this is part of the reason they have this threading model, they take all the threads that are active in the machine and schedule one of them; if that thread misses out to memory, they time-slice it out and schedule a different thread. So they fine-grain interleave threads on a functional unit. It's a strange idea, mixing multithreading with SIMD at the same time; lots of different kinds of parallelism coming together in these GPGPUs.

I don't want to go into too much detail, because then you'd have a whole class on how to program GPGPUs. The basic idea I wanted to get across is that they are a multithreaded, single instruction, multiple data machine, with this strange notion of threads overlaid on top. And when the threads don't do exactly the same work, because it's a SIMD machine underneath, you basically just end up wasting slots.

So these have a lot of performance. As an example, this is the modern-day Nvidia GPU architecture you can go buy; they call it Fermi. This is what's in that Tesla card I showed last time. Let's zoom in on one of these and talk about what's inside. First of all, look at the stuff that's not programmable, which is actually a significant portion of the design. Down here they have vertex shaders and tessellation units and texture mapping units and texture caches, and really, what all of that is for is graphics processing. [laugh] And then [laugh] they sort of smush onto that an array of general-purpose units, or mildly general-purpose units. Inside each one of these cores there's a floating-point unit and an integer unit. If you cut this way, each of these things they call a core is effectively a lane, so they're replicated in that direction; and then one, two, three, four, five, six, seven, eight, this direction is SIMD. So they basically have a SIMD architecture with multiple lanes, with lots of parallelism going on.
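The practical upshot of this latency-hiding scheme for the programmer is simply to expose many more threads than the machine has execution units, so the scheduler always has another warp ready to issue while one is waiting on memory. One common way to write that is the grid-stride loop sketched below; this is a hedged illustration with made-up launch sizes, not code from the lecture.

    // Grid-stride loop: launch a fixed, oversubscribed number of threads
    // and let each one walk over many elements. While one warp waits on
    // memory, the hardware issues instructions from another warp.
    __global__ void saxpy_gridstride(int n, float a, const float *x, float *y) {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }

    // Host side: far more threads in flight than the GPU has lanes,
    // so there is always other work to hide memory latency behind.
    // saxpy_gridstride<<<1024, 256>>>(n, a, d_x, d_y);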
Then, at the top, they have what they call the warp scheduler, which is the thread scheduler; it assigns instructions down into the different parallel units. So, lots of interesting things going on in parallel on these machines.

I wanted to stop here on GPUs, because we could have a whole other lecture on that, and I don't really want to go into that level of detail. Let's switch topics and start talking about multithreading. Actually, before we leave this slide, I'm sure people have questions about GPUs. I'll take one or two, but I don't want to go into too much detail. I just wanted to introduce the idea that you could use graphics processors this way; they have some similarity to vector processors, but they're kind of the degenerate case of vector processors.

Actually, a quick show of hands: who has an ATI video card in their machine? Okay. Who has Nvidia? Okay. Who has Intel? Aha. So, interesting tidbit here. Everyone always thinks of ATI and Nvidia as the leaders in these fancy graphics cards, but in reality, Intel sells the largest number of graphics processors in the world today. That's partially because they kind of give them away for free; they've effectively integrated them onto all the Intel chips now. So it's kind of a funny thing that the least innovative, least exciting graphics processors out there are everywhere not because they're good, but because they're cheap. Lots of things in the world work like that.