We left off last time talking about some implementation issues around multithreading. So, a little bit of background on what we're doing today: we're going to be finishing up multithreading. We're going to be using cutlery, and you'll see why we're using cutlery in a minute. That's right, a whole box of forks. And we'll be talking about synchronization and synchronization primitives.

So let's go back to multithreading and look at actual simultaneous multithreading, which is where we left off. Just to recap, we talked about different types of multithreading. We talked about non-multithreaded processors, so we have a superscalar where we're trying to fill in the empty slots here. We can do fine-grained multithreading, though that's not simultaneous multithreading. We can do coarse-grained multithreading; we can even possibly think about trying to do this in some fast software layer. You could think about just cutting your cluster in half and using half your ALUs for one thread and half your ALUs for another thread. That is technically a version of multithreading; it is executing at the same time, but it's not necessarily using all the resources. And then you could think about actual simultaneous multithreading, where you're issuing instructions from different threads into different functional units at the same time.

So, we talked about this last time, and one of the things that came up was, how do you go about getting parallelism? If you have thread-level parallelism, you can mix and match different instruction slots from different threads and get parallelism that way. And one of the great things about simultaneous multithreading is that if you have a fast superscalar machine and you try to use it for simultaneous multithreading, you could think about reducing the number of threads if you don't have enough thread-level parallelism, and actually go after instruction-level parallelism. So here we see these blue crosshatched block-zero instructions, and this is the same thread running in the same sort of time period. And if you have lots of threads running, you could actually intermix the different threads and run the blue crosshatched thread slower than if you were to use the parallel resources to run that one thread faster. And this is one of the big insights here that simultaneous multithreading lets us go after.
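To make that slot-filling idea concrete, here is a minimal sketch (my own illustration, not from the lecture) of a 4-wide issue stage. Stalls are modeled as None entries in each thread's instruction stream, and the greedy issue loop fills slots one thread leaves empty with ready work from another; the names and the policy are assumptions for illustration.

```python
ISSUE_WIDTH = 4  # hypothetical 4-wide machine

def cycles_to_drain(threads):
    """Greedy SMT issue: each cycle, fill up to ISSUE_WIDTH slots,
    drawing from whichever threads are not currently stalled."""
    queues = [list(t) for t in threads]
    cycles = 0
    while any(queues):
        issued = 0
        for q in queues:
            # A None marks a stall (e.g., a cache miss): this thread
            # gives up its remaining slots for this cycle.
            while q and issued < ISSUE_WIDTH:
                if q[0] is None:
                    q.pop(0)
                    break
                q.pop(0)
                issued += 1
        cycles += 1
    return cycles

# Two threads with frequent stalls: run alone, each wastes issue slots
# on stall cycles; run together, one thread's work fills the other's holes.
t1 = [1, 2, None, 3, 4, None, 5, 6, None, 7]
t2 = list(t1)
print(cycles_to_drain([t1]))      # one thread alone: 4 cycles
print(cycles_to_drain([t1, t2]))  # both threads sharing: 6 cycles, not 8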
So we briefly talked about this, and flew through it at the end of lecture last time, because we talked about what the cost is of implementing simultaneous multithreading. Conveniently, there is actually a really good example of this: the Power4 by IBM and the Power5 by IBM. They're very similar architectures, except the Power5 has two-way simultaneous multithreading, and the Power4 is almost the same architecture without the two-way simultaneous multithreading.

So you look at these two pictures. One of the big things they added is more fetch bandwidth, so you can fetch from two different program counters at the same time. You can decode more instructions. And then they added an extra pipe stage here, where they do what they call group formation, which is basically the scheduling of the two different threads' instructions together by the time they reach the execution units. Also, over here, as we've talked about before, even without simultaneous multithreading it's pretty useful to have more physical registers and more architectural registers. You're minimally going to need more architectural registers; more physical registers is probably a good thing for performance.

So, we're talking about implementation details, and one of the questions that comes up is, what is the cost of actually implementing this simultaneous multithreading on this processor? And one of the questions that also comes up here is, why two threads instead of four threads, or eight threads, or more? It has costs. One of the things is that they would actually have to start replicating more data structures. For the pipeline that they had roughly left over from the Power4, they had enough compute and bandwidth through their execution units to be able to handle two threads. But if they wanted to try to go to more threads, they basically bottleneck somewhere in their architecture. I don't know exactly where they bottlenecked; the paper talks about this, and it basically says that they bottleneck somewhere in their execution. I guess they leave it a little vague. So they decided that it wasn't worth adding more than two threads, because it's not going to give you any performance increase unless you have threads that have a lot of stalls.
If you have threads with enough stalls, you probably have enough holes to try to intermix another thread in there.

So, some of the changes in the hardware they went about making: well, they actually increased their cache sizes, and they increased the associativity of their caches. Their last-level cache here went to 1.92 megabytes versus 1.44, for the L2 and L3 together. And you can see here one of the interesting things: they start to actually separate data structures, because if you have multiple threads running and you try to mix them all into one big hardware data structure, one of the problems that comes up is that maybe one of the threads is going to hog the data structure. So threading actually introduces a lot of complexity having to do with figuring out how to be fair in the usage of the structures. One solution is you just replicate: anywhere it was hard to get it right, or where there were a lot of conflicts, you could just replicate. So they added separate instruction fetch, prefetch, and buffering. And then they added more physical registers; they call them virtual registers here, but it's effectively more physical registers, so more places to schedule into.

So there were some added hardware costs to this. To give you an idea of the area: a 24% area increase, and a lot of this is just due to the cache, because they made the cache so much larger. So you have to ask yourself, does it make sense to use 24% more area to run two threads, or at what point does it make sense just to plop down a second processor? It's a tough trade there. Well, if it's 24% bigger and you get a two-times performance boost, it was a good trade. If it's 24% bigger and you get less than a 24% performance boost, it's pretty questionable.
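That break-even rule of thumb is easy to check with arithmetic. Here is a tiny sketch (my own framing, not a formula from the lecture): if the speedup factor exceeds the relative area growth, the feature beats simply spending that silicon elsewhere at equal efficiency.

```python
def perf_per_area(speedup_factor, area_increase):
    """Throughput per unit of silicon, relative to the baseline core.
    > 1.0 means the feature pays for its area; < 1.0 means it doesn't."""
    return speedup_factor / (1.0 + area_increase)

print(perf_per_area(2.00, 0.24))  # 2x boost for 24% area: ~1.61, a good trade
print(perf_per_area(1.20, 0.24))  # 20% boost for 24% area: ~0.97, questionable
```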
So let's look at another example, and then we'll wrap up on the performance of this and see if they did get their 24% performance boost. The first implementation of simultaneous multithreading (there have been many implementations of multithreading, but, excuse me, this was the first simultaneous multithreading) was the Pentium 4 processor. Now, they called it hyperthreading; that's an Intel marketing term. It is simultaneous multithreading. But in the Pentium 4 they didn't actually replicate very many structures; they didn't really change the processor very much. And in fact there are basically modes to turn off the simultaneous multithreading they're paying for, and it was not a flagship feature when the processor first shipped. So the basic idea in the Pentium 4 is that, as the little bit of extra hardware, they did duplicate some resources here. Overall, they increased their die area by about 5%, so a very, very small amount. Now the question is, what performance boost can they get from this?

One of the interesting things here is that a big problem started to come up with the Pentium 4: when you were trying to run one thread on it, you weren't guaranteed that the performance of that one thread would be equal to what you'd get if you turned off hyperthreading, or sorry, simultaneous multithreading. So they had a mode switch they could turn off, where they would take some structures that they partition 50/50 when they're in multithreading mode, so you didn't have deadlock problems and you didn't have resource contention problems on the shared structures. And when you turned this mode switch off, some programs got faster. To some extent, what that meant is they didn't size those structures correctly to be running in multithreading mode at all times. And the switch was not a little switch; it was a big switch, as in reboot the computer and change the switch. It was a boot-time parameter to the chip. And this left such a bad taste in Intel's mouth that simultaneous multithreading got kicked out of Intel land for a few generations. The Pentium M, Core Duo, and Core 2 Duo all kicked out simultaneous multithreading, and it didn't work its way back in until the Nehalem Core i5, i7 sort of processors relatively recently. So it's interesting to see that you can get by with a little bit of hardware, but you have to be careful.

Now, what was the biggest problem that they had here? The biggest problem was that the load-store queue on the Pentium 4 was split in half when they were running in two-thread mode. The first Pentium 4 architecture that shipped with simultaneous multithreading turned on just split the structure in half, and it didn't have enough bandwidth, or didn't have enough entries, for lots of programs when it was cut in half. It was sized to run one thread; they cut it in half statically, and when they tried to run two threads it effectively did not provide enough performance.
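Here is a minimal sketch of that static-split problem; the entry counts and names are made up for illustration, not the actual Pentium 4 sizes. With a hard 50/50 partition, a memory-hungry thread stalls as soon as its half fills, even if the other half is sitting empty.

```python
class StaticSplitLSQ:
    """Illustrative model: a load-store queue sized for one thread,
    cut in half per thread when two-thread mode is on."""

    def __init__(self, total_entries=48, n_threads=2):
        self.limit = total_entries // n_threads  # fixed private share
        self.in_flight = [0] * n_threads

    def try_issue_load(self, tid):
        # A load gets an entry only from its own partition.
        if self.in_flight[tid] < self.limit:
            self.in_flight[tid] += 1
            return True
        return False  # stall, even if the other partition is empty

    def complete_load(self, tid):
        self.in_flight[tid] -= 1
```

A dynamically shared queue would instead admit a load whenever total occupancy is below the full size, which recovers the single-thread case but then needs exactly the fairness and anti-starvation logic this lecture keeps coming back to.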
So they couldn't get enough loads out to the memory system from one thread when they cut that structure in half. It really starts to bring up the question of what the right allocation of resources is. Should you use some sort of round-robin scheme? Should you use dynamic allocation? Should you use static allocation? Should you replicate? If you replicate, it costs more area. These are some of the big challenges here.

Okay, so let's look at some of the performance here. The numbers were pretty abysmal for the Pentium 4, even when everything was going well. For example, the Pentium 4 Extreme Edition with simultaneous multithreading gave you a 1% speedup when running two threads on SPECint rate. So you're running multiple copies of the same program; that's how SPEC rate is done. And that was 1% for the integer suite. For the floating point, it did a little bit better: a 7% improvement. Now, that's not horrible considering it was only a 5% area increase, but still, that's not saying a whole lot. I mean, you have to question whether this whole level of complexity was really worth it, and whether that 5% of area could have been used for something else to get a few percent performance increase. And this was, to some extent, pretty typical across applications for the Pentium 4.

One of the interesting things about the Pentium 4 was that because the cache wasn't huge (especially the L1 cache, which was relatively small because they wanted it to operate so fast at the very high clock frequencies), you get a lot of pollution in there from the two threads, which would actually destructively interfere in the level-one cache. So that was one of the big things they had to try to make up for. And this is a problem with any multithreading, not just simultaneous multithreading: the threads can actually fight for space in the cache and end up with both capacity and conflict misses.

The Power5 did a lot better here. They thought a little bit harder about sharing structures, duplicating structures, and the allocation of the different resources in there. So on SPECint rate they were getting a 23% improvement for their roughly 24% area increase. Okay, well, this might actually be a good idea. It might have some value.
And for the floating point version, they're not doing great, but they're doing better here. The floating point apps had a lot of cache conflicts; the SPECfp apps, if you go look inside them, typically have very large data sets, so they were basically conflicting in their larger, last-level caches and things like that.

So finally, I wanted to give a little bit of color to this idea of picking fairly and not having starvation in the different resources of a simultaneous multithreading pipeline. One processor that is sort of the famous simultaneous multithreading processor that was never built is the EV8, Digital Equipment Corporation's last Alpha. This last Alpha, the EV8, also known as the 21464, was an eight-way-threaded design. Note this was never built, but it was superscalar and it had eight-way simultaneous multithreading. It was also a very aggressive processor. And when I say it was never built, I mean it was never shipped commercially; they went pretty far into the design of it.

So one of the questions that they came upon here is, what is the right way, in an out-of-order pipeline, to make sure that you're getting good utilization of the pipeline when you're issuing instructions from different threads? In this example here, we have, we'll say, four different threads, and you need to choose which thread to pick from to go into your multi-issue out-of-order pipeline. By definition, you want to try to fill in the holes of one of the threads with work from another thread. So complete fairness here, or lockstep round-robin choosing, for instance, is not what you want to do; it's completely the wrong thing to do, because you want to fill in, when one thread is stalled, with other threads' work. But you also want to make sure that a thread whose instructions, let's say, have no dependencies on each other, or that can run very fast, doesn't go to memory very often, just doesn't stall very often, doesn't simply hog the processor, because it's never going to introduce stall cycles on itself by itself.
So they came up with this idea called icount, and it was a choosing policy which basically looked at the number of instructions that were retiring out of the back of the pipeline. It had something like a moving window to try to estimate how many instructions from each thread were completing, and then it fed that back around to the front of the pipeline here to determine which thread to issue from. I just wanted to get across the idea that if you try to intermix different threads in a single data structure, it's relatively simple: you can either cut the data structure in half or have some round-robin allocation there. It gets much more complicated when you have a full processor pipeline that you need to figure out how to issue into, and then, you know, twenty stages later, if something happens, there's so much stuff in flight that you need to worry about this a little bit harder. So here, they basically were looking at how many instructions were in flight and how many instructions had recently committed from a thread to determine what to do. And on average, they were hoping to get some fairness between the threads.
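As a rough sketch of how such a choosing policy might look (my illustration of the idea as described, not the actual EV8 logic), you can track each thread's in-flight count, fetched minus committed, and fetch from the thread occupying the least of the pipeline. A stalled thread's count grows, so it stops receiving fetch slots; a thread that commits steadily keeps earning slots without starving the others.

```python
class IcountStyleChooser:
    """Illustrative icount-style fetch policy: prefer the thread with
    the fewest instructions currently in flight."""

    def __init__(self, n_threads):
        self.in_flight = [0] * n_threads

    def on_fetch(self, tid, n=1):
        self.in_flight[tid] += n   # instructions enter the pipeline

    def on_commit(self, tid, n=1):
        self.in_flight[tid] -= n   # instructions retire out the back

    def choose(self, runnable):
        # Fetch next from the runnable thread clogging the pipe least.
        return min(runnable, key=lambda t: self.in_flight[t])
```

For example, if thread 0 has thirty instructions stuck behind a cache miss and thread 1 has five in flight, choose([0, 1]) returns 1, steering fetch bandwidth away from the clogged thread, which is exactly the feedback-to-the-front-of-the-pipeline behavior described above.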