Okay, so that was coarse-grained multithreading. We're going to move on to figuring out how to mix instructions from different threads in the pipeline at the same time, and issue instructions from two different threads, in the same cycle, to different pipelines. The idea here is called simultaneous multithreading, and let me start off with a picture of what this looks like. Let's work backwards in the slide deck for a second. We have a four-issue processor. This is our issue width, and this is time, increasing going down. Each of the different patterns represents a different thread. The idea in simultaneous multithreading is that you can execute instructions from different threads in different pipelines simultaneously. Now, this gets quite a bit harder than the basic design, because now we actually have to read the register file for four different threads simultaneously, and we have to fetch code from different program counters simultaneously in this processor.

So where simultaneous multithreading came from was this: people were building complex out-of-order superscalars, and these complex out-of-order superscalars had all this logic to track dependencies between different instructions, and to basically restart subportions of instruction sequences. So, for example, you take a branch mispredict.
You had to kill all the instructions that were dependent on the mispredicted branch, and leave the other ones alone. When you have this out-of-order mechanism, you have all this extra logic there to figure that out. Dean Tullsen, Susan Eggers, and Hank Levy came up with this idea: what if we try to utilize all the dead slots in our out-of-order superscalar, and intermix different threads in there simultaneously to fill the time? They did this study, from ISCA in '95, and ran a bunch of different applications. The rightmost bar here is our composite, or our average. What you should take away from this is that the black bar on the bottom here is how long the processor is busy, actually doing work, and the rest is different reasons the processor was stalled. We were stalled on instruction cache misses, branch mispredictions, load delays, plain pipeline interlocking, memory conflicts, and other sorts of things, and we're only using this processor less than twenty percent of the time.

To show this a different way: we have our multi-issue processor, and we have time here. We have all these purple boxes, which are just dead time, and we might be able to use subsets of them. This cycle is very good; we actually issued four instructions. But here, we only issued two. Here, we issued one.
Here, we issued two to these two pipelines in the middle; maybe these are the two ALU pipes, and there's a load pipe here and a branch pipe there, or something like that. This is kind of a disaster from an IPC perspective. Can we try to reuse that hardware? We talked about coarse-grained multithreading, which was effectively temporally slicing up the cycles: you run one thread, switch to a different thread, run that thread, and you can temporally switch between the threads.

But there's another approach, and this was actually done in the failed Sun Millennium processor, what was going to be UltraSPARC V. They had a clustered superscalar, and they had a mode where you could flip a bit and basically cut the two clusters apart and run them as two separate processors, running two different threads. So let's say you had this width-four processor, and instead you split it in half, and you have two functional units running one thread and two functional units running another thread. Well, this has some good effects. You don't actually have to have really high IPC, really high instructions per clock. You don't ever have to reach four in this design, because we've narrowed the two pipeline widths.
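As an aside (this is my own toy model, not anything from the lecture or from Sun's actual design), here's a small Python sketch of that trade-off: splitting a 4-wide core into two 2-wide halves caps each thread's issue at two per cycle, but two threads running at once can fill far more total slots when per-thread ILP is modest. All the numbers are made up for illustration.

```python
# Toy model: per-cycle issue for a 4-wide core. Each thread has some
# available ILP per cycle; a pipe (or half-pipe) is capped at its width.

def issued(width, ilp_per_cycle):
    """Instructions issued each cycle: limited by width and by ILP."""
    return [min(width, ilp) for ilp in ilp_per_cycle]

# One thread with modest ILP on the full 4-wide machine:
thread_a = [2, 1, 3, 1, 2, 1]          # hypothetical available ILP
unified = sum(issued(4, thread_a))     # one thread, width 4

# Same thread plus a second one, each on a 2-wide half:
thread_b = [1, 2, 2, 1, 1, 2]
split = sum(issued(2, thread_a)) + sum(issued(2, thread_b))

print(unified, split)   # split mode fills more slots overall
```

Note that in split mode, thread A's one high-ILP cycle (three ready instructions) gets clipped to two, which is exactly the "one thread can't use all the resources" downside discussed next.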
And you could think about trying to put them together. This is what the Millennium processor, the UltraSPARC V, tried to do: you can actually switch between this mode and this mode. But you still have a lot of open slots here that you can't go use, so it still leaves some vertical waste. Another downside is that you can't easily have one thread use all of the resources in this mode. You can't have, let's say, thread one use this functional unit over here, in this very static design.

So this brings us to full simultaneous multithreading, or SMT. In SMT we are able to mix and match all these different instructions, but this is going to change our processor pipeline quite a bit. Okay, so let's look at what this does to a processor pipeline. Well, all of a sudden we need to be fetching multiple instructions at a time, from different threads. That's definitely a harder thing to do. Conveniently, we can actually use our instruction queue, or our reservation stations, if you will, to go find different instructions to execute at the issue stage of our out-of-order processor. What we can do is tag the different threads independently, and have that tag be part of the instruction issue logic, such that you know that thread one's register one is different from thread two's register one.
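To make that thread-tagging idea concrete, here's a hedged sketch (my own illustration, not a real machine's issue logic): if every register name in the issue queue carries its thread ID, one shared wakeup/select network can serve all threads without confusing thread 1's r1 with thread 2's r1.

```python
# Sketch of thread-tagged issue logic. Register operands are keyed by
# (thread id, register name), so identically named registers from
# different threads never alias in the shared queue.

from collections import namedtuple

Instr = namedtuple("Instr", ["tid", "dest", "srcs"])

def select(queue, ready_regs, width):
    """Pick up to `width` instructions whose (tid, reg) sources are ready."""
    picked = []
    for instr in queue:
        if len(picked) == width:
            break
        if all((instr.tid, s) in ready_regs for s in instr.srcs):
            picked.append(instr)
    for instr in picked:
        queue.remove(instr)
        ready_regs.add((instr.tid, instr.dest))   # result now available
    return picked

# Two threads sharing one queue; both use a register named r1.
ready = {(1, "r1"), (2, "r1")}
q = [Instr(1, "r2", ["r1"]),   # thread 1: depends on thread 1's r1
     Instr(2, "r2", ["r1"]),   # thread 2: depends on thread 2's r1
     Instr(1, "r3", ["r9"])]   # not ready: thread 1's r9 not produced yet
picked = select(q, ready, width=4)
print([(i.tid, i.dest) for i in picked])
```

Both ready instructions, one from each thread, issue in the same cycle, while the instruction waiting on thread 1's r9 stays in the queue.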
And then you use the same logic that we've had in our out-of-order superscalars, in our issue queue, to go find instructions from multiple threads to stick down the pipe, or the respective multiple pipes. Conveniently, we have most of this logic already. What's also nice about this is that if you're only running one thread at a time, when you go to look in your instruction queue, or your issue window, only that one thread will show up. So you know you don't need to go look at the other threads, and that one thread can use the full machine.

This allows you to use the same machine such that if you have lots of thread parallelism, you can run different threads and fill all the slots. And if you don't have that parallelism, and you want to exploit instruction-level parallelism, you can use the slots in a different allocation and just accept the dead slots, the purple slots. You can see here: here we are issuing three instructions from the checkerboard blue thread, and here we are issuing four. Let's say it's the exact same program, executing on the same cycle. The machine can issue different amounts of instruction-level parallelism from a thread, depending on whether there are multiple threads running. And this is what these simultaneous multithreading machines attempt to do.
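Here's one more toy comparison of my own (made-up numbers, not measurements): coarse-grained multithreading only ever issues from one thread per cycle, so it attacks vertical waste, while SMT packs instructions from several threads into the same cycle up to the issue width, attacking horizontal waste too.

```python
# Toy slot-count comparison: coarse-grained switching vs. SMT on the
# same hypothetical workload. WIDTH is the machine's issue width.

WIDTH = 4
# Per-cycle issuable instructions for three hypothetical threads:
threads = [[2, 0, 1, 3], [1, 2, 0, 1], [1, 1, 2, 0]]

def coarse_grained(threads):
    """Round-robin: cycle i runs thread (i mod N) alone."""
    return sum(min(WIDTH, threads[i % len(threads)][i])
               for i in range(len(threads[0])))

def smt(threads):
    """Each cycle, greedily fill the WIDTH slots from every thread."""
    total = 0
    for cycle in range(len(threads[0])):
        slots = WIDTH
        for t in threads:
            take = min(slots, t[cycle])
            total += take
            slots -= take
    return total

print(coarse_grained(threads), smt(threads))
```

On these numbers the SMT machine issues noticeably more instructions over the same four cycles, because whenever the "current" thread stalls, the other threads' ready instructions fill the leftover slots.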
There's one other point I want to make here. When you're doing this simultaneous multithreading, you have to worry about priorities between threads, because you want some sort of equal-progress guarantee, and round-robin is probably not good enough. You need to somehow figure out how to not have one thread hog the machine, and have some fairness between the different threads.

So I wanted to show a few examples here. On the top we have the IBM POWER4 processor pipeline, and this was a non-multithreaded machine. The POWER5 architecture actually looks very similar to the POWER4 architecture, except they added a second hardware thread. What they had to do is add more fetch bandwidth, and then they had this notion of group formation. Group formation was the picking, out of the two threads, so they could figure out what they could actually execute simultaneously, and they effectively had extra pipe stages out in front to go do that. And at the end here, you have to commit instructions from two different program counters at the same time. You can see there is definitely complexity in building this: all of a sudden, you have to basically go into the reorder buffer and kill one thread's instructions when it branch mispredicts, but not the other thread's.
And we already talked about how to do that in a superscalar, where we can kill a subsection of instructions and not have to kill the whole instruction sequence.

Here's a different view of the POWER5, in a physical chip view. One of the interesting things to see is that they went with two threads, because if they went to four threads, they figured out they would actually be using the resources too much, and bottlenecking in different locations in the pipe. Basically, they didn't have enough resources over here for it to pay off; those structures would be filled if you had more than two threads executing anyway, so there were diminishing returns past that point.

We're almost done, and I think we're nearly out of time, so let me skip forward here and just summarize all the different types of multithreading. You have your superscalar processor. You have very fine-grained multithreading, where you switch on each cycle. You have coarser-grained multithreading, where you switch every few cycles. You can think about cutting the processor in half, and using some of the functional units for one thread and some for another thread. And then we have our full simultaneous multithreading.
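On the fairness point raised earlier: one well-known policy from the SMT literature (Tullsen et al.'s follow-up work) is ICOUNT, which fetches for the thread with the fewest instructions already in the front of the machine. This is a simplified sketch of the selection rule, with made-up occupancy numbers, not a description of any shipping processor.

```python
# ICOUNT-style fetch priority, heavily simplified: favor the thread
# with the least front-end occupancy, so no single thread can fill
# the shared issue queue and starve the others.

def pick_fetch_thread(in_flight):
    """in_flight: dict of thread id -> instruction count in the pipeline."""
    return min(in_flight, key=in_flight.get)

counts = {0: 12, 1: 3, 2: 7}      # hypothetical occupancy per thread
print(pick_fetch_thread(counts))  # the least-represented thread fetches next
```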
So we'll pick up on this next time and finish up with some of the implementations, the Pentium 4 and how they did multithreading. But we'll stop here for today.