So the idea here is that if we can guarantee there's no dependence between instructions, we can execute those instructions in parallel. And one really good way to have guaranteed independence is to have completely different programs, or completely different threads. The naive approach here is to interleave instructions from different threads to cover latency. So we send a load out, and while this load is off resolving in the memory system (this is our thread one that we had here before; we load into R1 and then go to use the result, also in thread one), by the time the data comes back from memory it's all ready to go, and in the meantime we've put other work into our processor pipeline. This is called multithreading.

Now, one of the big insights here is that you need to have enough other work to do, and these other threads add some complexity to your processor design. They may also cause some locality challenges. If you're trying to exploit temporal locality in, let's say, a cache or a translation lookaside buffer or some other buffer, and you start putting other data accesses or other operations in the middle, this might destroy your locality. But let's think about this in a happy way for right now: if you can find other work to do, you can just shove it into the slots while you're waiting for this load to come back.

The most basic version of multithreading is going to be some sort of very basic interleaving. We run thread one, then thread two, then thread three, then thread four; thread one, thread two, thread three, thread four. We don't pay attention to the latencies or dependencies of the instructions when we go to schedule. And if you come around to a thread's slot and there's nothing to do, let's say this load missed in the cache and took a thousand cycles to come back, you just don't schedule anything; you have a dead cycle there. One of the nice things about multithreading that you can take advantage of is that you don't have to worry about bypassing anymore. If you know that you're not going to execute another instruction from the same thread until several cycles later, why do you need a fast bypass from the ALU back to itself? You just don't care.
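To make that fixed interleaving concrete, here's a minimal sketch, not from the lecture, with a made-up thread count and miss latency: a round-robin thread selector issues one instruction per thread per rotation, and if a thread's load hasn't returned when its slot comes up, the slot is simply a dead cycle.

```c
#include <stdio.h>

#define NUM_THREADS 4

/* Cycle at which each thread's outstanding load will be ready.
 * 0 means the thread has no load in flight and can issue immediately. */
static unsigned long load_ready_cycle[NUM_THREADS];

int main(void)
{
    unsigned tid = 0;

    /* Pretend thread 0 issued a load that misses and is ready at cycle 9. */
    load_ready_cycle[0] = 9;

    /* Fixed round-robin interleave: one issue slot per thread per rotation.
     * No dependence or latency information is consulted when scheduling. */
    for (unsigned long cycle = 0; cycle < 16; cycle++) {
        if (load_ready_cycle[tid] > cycle) {
            /* The thread's slot came up but its load hasn't returned yet:
             * nothing is scheduled, and the slot is simply a dead cycle. */
            printf("cycle %2lu: thread %u waiting on memory (dead cycle)\n",
                   cycle, tid);
        } else {
            load_ready_cycle[tid] = 0;
            printf("cycle %2lu: issue one instruction from thread %u\n",
                   cycle, tid);
        }
        tid = (tid + 1) % NUM_THREADS;   /* the simple incrementing thread select */
    }
    return 0;
}
```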
So, let's see, do we actually have an example of that? Here's one: say you have a write to R1 and then a read of R1. What value does thread two get here? The first versions of multithreading machines tried to partition the register file, so you compiled your programs differently. That's not really common anymore; we'll talk about an example where they actually did that. That was actually also a limiter on the number of threads for a while, until people came up with the idea of what you just suggested, which is to somehow change the namespace of the registers. The way that was actually done, in the first implementation, which we'll talk about in a little bit, was on SPARC. They used the register windows on SPARC, which gave you different namespaces for different registers. That finally evolved into actually having a thread identifier and having copies of the entire register space, so each thread had a different register set. And that's what we're going to show in this picture here.

As shown here, you have to copy the general-purpose register file four times, and you have to copy the program counter four times. Then you have an incrementer out on the front of the pipe, which chooses the thread ID, or the thread select. In our simple case here, we're just going to keep incrementing this and choose one, two, three, four; one, two, three, four; and so on. Likewise, that's an index into which general-purpose register file we're supposed to be accessing, or which logical general-purpose register file, because most of these architectures will actually put this together into one larger general-purpose register file and then have this just be some addressing bits that are not architecturally visible.

But you bring up a good point: there are a couple of userland threading libraries out there. They use this notion of threading, but not on a multithreaded computer architecture, so you don't have this replicated state. There are still some advantages to that, because you can still swap between different threads and try to cover memory latency, since your load-to-use distance will be longer.
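As a small aside before we continue, here's a rough sketch of the replicated state just described; the names and sizes are illustrative rather than anything from the slides. The PC and the general-purpose register file are indexed by the hardware thread ID, which is also why thread two reading R1 never sees thread one's write to R1.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 4   /* hardware thread contexts */
#define NUM_GPRS    32  /* architectural registers per thread */

/* Per-thread architectural state: one PC and one logical GPR file each.
 * Physically this could be one larger register file in which the thread ID
 * supplies extra addressing bits that are not architecturally visible. */
struct hw_thread_state {
    uint64_t pc;
    uint64_t gpr[NUM_GPRS];
};

static struct hw_thread_state ctx[NUM_THREADS];

int main(void)
{
    /* Thread 0 writes R1; thread 1 then reads R1.  Because the register file
     * is replicated (indexed by thread ID), thread 1 sees its own R1, not
     * thread 0's value. */
    ctx[0].gpr[1] = 42;
    printf("thread 0 R1 = %llu\n", (unsigned long long)ctx[0].gpr[1]);
    printf("thread 1 R1 = %llu\n", (unsigned long long)ctx[1].gpr[1]);

    /* Thread select at the front of the pipe: just an incrementing counter
     * that picks which PC to fetch from and which logical GPR file to use. */
    unsigned thread_select = 0;
    for (int i = 0; i < 8; i++) {
        printf("fetch from PC of thread %u (pc=%llu)\n",
               thread_select, (unsigned long long)ctx[thread_select].pc);
        thread_select = (thread_select + 1) % NUM_THREADS;
    }
    return 0;
}
```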
Coming back to those userland threading libraries: they largely do still try to partition the register file space. You can actually go download one of these on, you know, Linux and go run it; people still use these userland thread libraries to do that. The other way to do it is to swap out the entire register space, but people typically try not to do that, because if you're trying to do fine-grained interleaving and you want to save your entire register space to memory, that's very expensive. Instead, these threading libraries work together with the compiler and tell the compiler: don't use all the registers for thread one. Use half the registers for thread one, and half the registers, let's say, for thread two.

So this is our simple multithreading pipeline. It's relatively small changes so far: replicate the register file, replicate the PC. And this can help us recover utilization on our ALU, for instance. What we wanted to point out is what this looks like to software: it looks like multiple processors. If you go use something like a modern-day Core i7, those have two hardware threads per core, but when you go to look at it, it'll look like there are twice as many cores in the machine as you think there are. So, for instance, if you have a four-core Core i7 and you open up the process management dialog box in Windows, you'll see eight little bars in there, because there are two virtual processors per physical core in the machine. So they actually look like multiple slower CPUs.

Okay, so what are the costs? The easy costs are replicating the program counter and the general-purpose registers. Things that start to become harder to replicate, but which you're also going to have to replicate if you want full isolation of these virtual CPUs, are the system state: things like page tables, page table base registers, and all the different system state around interrupt handling, such as the exception PC. This actually gets a little bit hard to do, and some processors have a fair amount of system state. If you look at x86, there's a lot of system state.
MIPS is relatively minimal, because it's basically the exception PC and the interrupt mask register. The TLB is software-maintained, so you don't even have a virtual memory page table base register. But it can still be hard, and the thing to think about is that you have to replicate all of that state per thread. So there is some cost to it, but you still get to hide latencies to memory, and you still get to reuse your ALUs and increase their efficiency. Personally, I think these are the smaller issues: the first one is very small, and the second is bigger than the first but mostly just harder to do.

The big issue is when you start to have temporal locality conflicts and spatial locality conflicts. If you have four threads that are all executing and they're fighting for capacity in your cache, this can be a problem. Or, let's say you have sixteen hardware threads and an 8-way set-associative cache, and they all want to access, let's say, index zero in the cache. This is a common thing; your stack always starts there, or something like that. Well, you only have 8-way associativity, and if you're time-multiplexing threads one through sixteen and then coming back around to thread one, you're guaranteed that by the time you come back to thread one, because you only have an 8-way set-associative cache, you've bumped out the data that it needed. So you get no temporal locality; by the time you come back, it's not in your cache, and you have a conflict going on in your cache.

Okay, so that's the negative side: the threads can conflict with each other. But if two threads are working together, you can actually get positive interference, or constructive interference, in the cache, and it definitely happens. That was actually one of the reasons people originally did this multithreading: if you have a program where different threads are working on roughly the same data, this is actually very good, because the threads will be pre-caching data for each other.
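To make the destructive case concrete, here's a small illustrative sketch, not from the lecture: sixteen round-robin threads each touch their own line, all of which map to the same 8-way set, and with LRU replacement every single access misses, because a thread's line has been bumped out before its next turn comes around.

```c
#include <stdio.h>

#define WAYS     8      /* one set of an 8-way set-associative cache */
#define THREADS 16      /* sixteen hardware threads, fixed round-robin */

static int           line_owner[WAYS];  /* which thread's line sits in each way */
static unsigned long last_use[WAYS];    /* 0 = way never used (empty) */
static unsigned long now = 1;

/* Each thread touches its own line, but all the lines map to this one set. */
static int access_set(int tid)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++)
        if (last_use[w] != 0 && line_owner[w] == tid) {
            last_use[w] = now++;
            return 1;                                  /* hit */
        }
    for (int w = 1; w < WAYS; w++)                     /* miss: pick LRU or empty way */
        if (last_use[w] < last_use[victim])
            victim = w;
    line_owner[victim] = tid;
    last_use[victim] = now++;
    return 0;
}

int main(void)
{
    int hits = 0, misses = 0;
    for (int i = 0; i < 64; i++) {
        int tid = i % THREADS;                         /* cycle through all 16 threads */
        if (access_set(tid)) hits++; else misses++;
    }
    /* With 16 threads sharing only 8 ways, every access misses: by the time a
     * thread's slot comes around again, its line has already been evicted. */
    printf("hits=%d misses=%d\n", hits, misses);
    return 0;
}
```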
Well, you have to make sure; it's a fine line to walk there, between when the threads fight for shared resources and when they collaborate on pulling in shared data and exploiting the locality structures in the processor. The cache is one example; another example is translation lookaside buffer entries. One solution to this is that you can actually partition the cache, or partition the TLB, and then you can guarantee that the threads do not fight. Unfortunately, if you go to do this and you have a fixed-size cache, you effectively cut the cache in half, or by however many threads you have, by N for N threads. So typically people don't try to statically partition these things, but you could think about doing that if you want no interference between the different threads.

Okay, so let's look at a couple of different ways that you can try to schedule threads. They've gone from simple, to a little more complicated, to even more complicated over time. Simple is what we've been talking about up to this point: there's some fixed interleaving of N hardware threads, and you basically execute one instruction from each thread and then change to the next thread. That's our first case. This fixed interleaving dates back to the 60s, and it's pretty simple. What's nice about it is that you can remove some of the interlocking and some of the bypassing from your processor.

The next thing you can think about is trying to allocate different locations in your scheduling quantum to different threads. For instance, say we know the cyan thread here is the highest-priority thread; it needs the most slots. So we can do an uneven distribution and say, over our slots here, the cyan thread gets every other one, and then the orange one gets a smaller percentage and the blue one gets a smaller percentage. So it's still a fixed interleaving to some extent, or a fixed ordering, but it's controllable by software, or by the operating system, depending on the priorities of the different threads.
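Here's a minimal sketch of that software-programmable interleaving; the table contents and priorities are made up for illustration. The OS fills in a fixed schedule of thread IDs for one scheduling quantum, and the hardware walks the table instead of just incrementing a counter.

```c
#include <stdio.h>

#define QUANTUM 8   /* issue slots per scheduling quantum */

/* Slot table written by software (the OS) to reflect thread priorities.  Here
 * the high-priority "cyan" thread (0) gets every other slot, "orange" (1)
 * gets two slots, and threads 2 and 3 get one slot each. */
static unsigned slot_table[QUANTUM] = { 0, 1, 0, 2, 0, 1, 0, 3 };

/* Thread select: instead of a free-running counter, walk the table. */
static unsigned next_thread(void)
{
    static unsigned slot = 0;
    unsigned tid = slot_table[slot];
    slot = (slot + 1) % QUANTUM;
    return tid;
}

int main(void)
{
    /* First quantum with the allocation above. */
    for (int i = 0; i < QUANTUM; i++)
        printf("slot %d -> thread %u\n", i, next_thread());

    /* The OS can rewrite the table for the next quantum, for example to make
     * "orange" the high-priority thread instead of "cyan". */
    unsigned new_table[QUANTUM] = { 1, 0, 1, 2, 1, 0, 1, 3 };
    for (int i = 0; i < QUANTUM; i++) slot_table[i] = new_table[i];
    for (int i = 0; i < QUANTUM; i++)
        printf("slot %d -> thread %u\n", i, next_thread());
    return 0;
}
```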
So it's a little bit more flexible in hardware than the completely fixed interleave design, and it does require a little bit of change here, because you can't just have this counter incrementing your thread ID. Instead, you need something else here that acts as a picker for which thread you go execute, but it's still relatively simple. And because it's software programmable, you can actually choose a time and then reallocate, so the OS can change the allocation for a different time period and, let's say, make orange a higher priority instead of cyan.

Then we start to go to something a little more complicated; the schemes so far still have relatively fixed priorities. You could think about something where the hardware actually makes decisions about which thread to execute. You could even have it go as far as determining that the thread it's executing has hit a long-latency operation, and switching to another thread at that point. So it purposely backfills work from other threads when one thread gets swapped out, or when a thread goes off to do something that's a long-latency operation. You can think about designs like that, and the HEP processor, which we'll be talking about in a minute or two, was one of the first early examples of that.

So just to recap here: we're going to call this coarse-grain multithreading. The reason we call it coarse-grain is that in any one cycle, only one thread is running. You might scratch your head and say, how could you have multiple threads running in one cycle? Well, we'll show that if you go to a superscalar, or multi-issue, processor, you could think about having the different pipelines executing instructions from different threads simultaneously. That's called simultaneous multithreading, and we'll talk about that in a second. But in our coarse-grain approach here, just to recap, you can swap threads on hardware misses, and you can take advantage of bubbles in the processor to do other work.

Okay, so a little bit of a history lesson here. The HEP processor was Burton Smith's machine; he's now at Microsoft Research, and he was the chief architect of this back in the 80s.
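Before we get into the HEP itself, here's a rough sketch of that hardware-driven switch-on-long-latency policy. It's purely illustrative, not the HEP's actual mechanism: the picker stays with the current thread until that thread issues a long-latency operation, then backfills with any other thread that's ready to run.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4

static bool waiting_on_memory[NUM_THREADS];  /* set while a thread's load is outstanding */
static unsigned current = 0;

/* Called each cycle: stay with the current thread until it hits a long-latency
 * event, then backfill with any other thread that is ready to run. */
static unsigned pick_thread(void)
{
    if (!waiting_on_memory[current])
        return current;                      /* no reason to switch */
    for (unsigned i = 1; i <= NUM_THREADS; i++) {
        unsigned candidate = (current + i) % NUM_THREADS;
        if (!waiting_on_memory[candidate])
            return current = candidate;      /* switch to a ready thread */
    }
    return current;                          /* every thread is stalled: dead cycle */
}

int main(void)
{
    for (int cycle = 0; cycle < 8; cycle++) {
        unsigned tid = pick_thread();
        printf("cycle %d: run thread %u\n", cycle, tid);
        if (cycle == 2)
            waiting_on_memory[tid] = true;   /* the running thread misses at cycle 2 */
    }
    return 0;
}
```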
This processor is actually pretty interesting, because there were lots and lots of threads and a small number of processors. There were 120 threads per processor in this machine, and a relatively modest clock rate for the 80s, but what they were trying to do was deal with memory latency. This machine had a very high-bandwidth memory system, and effectively you were allowed to have a load or a store in every instruction, and your performance would never degrade even if you did. Because this machine had 120 threads per processor and the memory latency was less than 120 cycles, if you had a load in every instruction and enough bandwidth to feed all those loads, you could execute a load from each thread, and each thread was independent of the others. You could basically go out to the memory system, pipeline the latency out to memory, and pipeline the memory access, and by the time you came back around to execute the next instruction from that thread, 120 cycles later, the memory result would be there.

This insight was carried forward: Burton went and started a company called Tera, or Tera Systems, which later went on to buy Cray, strangely enough, because I always thought Cray was a bigger company than Tera, and it definitely was. We'll call it a merger, and I think that's what they called it, but in reality Tera acquired Cray. Tera had a similar sort of idea here, with more processors; it was a further evolution of the HEP processor: 128 active threads per processor, and lots of processors, 256 of them. So there are lots of active programs, and you had to find enough thread-level parallelism in your program. This architecture has no caches, because it doesn't need them: if you have enough bandwidth out to your memory system, you can issue a load every cycle and cover the latency with other threads.

Why do you care? Some interesting things about this: it may not be good for power. You're not exploiting locality in your data references at all, and you're going out to the memory system every cycle, and that could be far away. And it is far away in this machine.
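To pin down the arithmetic behind that HEP claim, here's a tiny helper, purely illustrative: with round-robin issue, the gap between consecutive instructions of one thread is the thread count, so a memory operation is fully hidden whenever the memory latency fits in that gap.

```c
#include <stdbool.h>
#include <stdio.h>

/* With fixed round-robin issue, consecutive instructions from one thread are
 * num_threads cycles apart, so a memory operation started on one turn has
 * completed by the thread's next turn whenever the latency fits in that gap. */
static bool memory_latency_hidden(unsigned num_threads, unsigned latency_cycles)
{
    return latency_cycles <= num_threads;
}

int main(void)
{
    /* HEP: 120 threads per processor, memory latency under 120 cycles. */
    printf("HEP load hidden: %s\n", memory_latency_hidden(120, 119) ? "yes" : "no");
    return 0;
}
```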
So you've got to be a little careful about these machines. The second point here is that you have to come up with lots of threads. We're now talking about numbers of threads similar to modern-day many-core machines, but it's still a fair number of threads to be able to use the machine to its best performance. One thing I wanted to say about this architecture: where it mostly got used was in applications that had no temporal or spatial locality in their data references anyway. Good examples of this were things like data mining: huge data sets, arbitrary access to the data sets, where you couldn't have a prefetcher predict where your next memory reference is going to be. So if you were basically not going to be able to fit in a cache or use a cache effectively anyway, remove the cache and go multithreaded. That was the idea behind these machines, and they actually saw some big speedups for applications with data sets like that.

This still lives on today in the Cray XMT, the extreme multithreaded architecture. You can go buy this machine from what's nowadays called Cray Computer Corp., which was Tera eating Cray, or buying Cray. It's a modest clock speed by modern-day standards, only 500 MHz, but they can intermix these with Opterons and other processors, because they standardized on the AMD bus protocol, so you can plug AMD chips and these XMT chips into the same system. Just a recap of what their memory system looked like: their instruction pipelines were very long, and they didn't have to bypass, because you could never execute an instruction and then a subsequent instruction that would use its result. The memory latencies were about 150 cycles and they had 128 threads, so they were typically not waiting for the memory system and could effectively pipeline their memory operations.

Another little tidbit of history here. This is the machine I was talking about that's my academic lineage a little bit. When I first showed up at MIT, there was this machine in the corner called the MIT Alewife processor, which was a multiprocessor machine. It went up to 128 nodes, which was a lot in 1990.
One of the little tidbits about this machine is that it had SPARC processors, and a SPARC processor has this notion of a register window. What a register window is: every time you do a function call, you basically change your addressing into the register file. You have a larger register file and a smaller window onto it, and you slide this window across the register file on function calls and function returns. The MIT Alewife processor used this, and they extended the register window idea so that they could have a special instruction which would change which part of the register file they were looking at at a time. That way they could swap the entire register set very quickly, and this allowed them to multithread very effectively with four threads. This was one of the early multithreaded processors, and they introduced this notion of thread switch on remote memory access. If you had a memory access that had to go to one of the other nodes in the machine, say you were on this node here and you needed to go down to that one, in a multiprocessor you actually had to send a message and get a response back, and that took a long time. So you would switch threads at that time.

Multithreading lives on today, and especially this coarse-grain multithreading lives on today. A good example is the Oracle, formerly Sun, and before that Afara, Niagara processors. Afara was the name of the startup company that made this; then Sun acquired them, and then Oracle acquired Sun. The Sun Niagara processor was designed for throughput computing, so that you could have lots of threads running, all independent, on a multithreaded processor. The Niagara 1 had eight cores and four threads per core, and they've turned that up over time. Now, with the Niagara 3 (so the first was the Sun T1, then the Sun T2, and this is the Sun T3 now), they have sixteen cores with eight threads per core. So there's a lot of parallelism going on here, and this coarse-grain multithreading goes on. Here's a die photo of the Niagara 3 processor, where we can see the different cores.
You might look at this picture and say there appear to be sixteen cores, and I do say sixteen cores on the slide here, but one of the interesting things is that they actually conjoin two cores together internally. It's a strange design idea: they mix the threads and the cores together, and they share, I think it was the floating-point unit, between the two cores. So these cores are, to some extent, conjoined together.