Okay, so that was coarse-grained multithreading. We're going to move on to figuring out how to mix instructions from different threads in the pipeline at the same time, and issue instructions from two different threads, in the same cycle, to different pipelines. The idea here is called simultaneous multithreading, and let me start off with a picture of what this looks like. Let's work backwards in the slide deck for a second. We have a four-issue processor. This is our issue width, and this is time, increasing going down. Each of the different patterns represents a different thread. The idea in simultaneous multithreading is that you can execute instructions from different threads in different pipelines simultaneously. Now, this gets quite a bit harder than the basic design, because now we actually have to read the register file for four different threads simultaneously, and we have to fetch code from different program counters simultaneously in this processor.

So where simultaneous multithreading came from was this: people were building complex out-of-order superscalars, and these complex out-of-order superscalars had all this logic to track dependencies between different instructions, and to basically restart subportions of instruction sequences. So, for example, you take a branch mispredict.
You had to kill all the instructions that were dependent on the mispredicted branch, and leave the other ones alone. When you have this out-of-order mechanism, you have all this extra logic there to figure that out. Dean Tullsen, Susan Eggers, and Hank Levy came up with this idea: what if we try to utilize all the dead slots in our out-of-order superscalar, and intermix different threads in there simultaneously to fill the time? They did this study, from ISCA in '95, and ran a bunch of different applications. The rightmost bar here is our composite, or our average. What you should take away from this is that the black bar on the bottom here is how long the processor is busy, actually doing work, and the rest is different reasons the processor was stalled. We were stalled on instruction cache misses, branch mispredictions, load delays, plain pipeline interlocking, memory conflicts, and other sorts of things, and we're only using this processor less than twenty percent of the time.

To show this a different way: we have our multi-issue processor, and we have time here. We have all these purple boxes, which are just dead time, and we might be able to use subsets of them. This cycle is very good; we actually issued four instructions. But here, we only issued two. Here, we issued one.
Here, we issued two to these two pipelines in the middle; maybe these are the two ALU pipes, and there's a load pipe here and a branch pipe there, or something like that. This is kind of a disaster from an IPC perspective. Can we try to reuse that hardware? We talked about coarse-grained multithreading, which was effectively temporally slicing up the cycles: you run one thread, switch to a different thread, run that thread, and you can temporally switch between the threads.

But there's another approach, and this was actually done in the failed Sun Millennium processor, what was going to be UltraSPARC V. They had a clustered superscalar, and they had a mode where you could flip a bit and basically cut the two clusters apart and run them as two separate processors, running two different threads. So let's say you had this width-four processor, and instead you split it in half, and you have two functional units running one thread and two functional units running another thread. Well, this has some good effects. You don't actually have to have really high IPC, really high instructions per clock. You don't ever have to reach four in this design, because we've narrowed the two pipeline widths.
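As an aside (this is my own toy model, not anything from the lecture or from Sun's actual design), here's a small Python sketch of that trade-off: splitting a 4-wide core into two 2-wide halves caps each thread's issue at two per cycle, but two threads running at once can fill far more total slots when per-thread ILP is modest. All the numbers are made up for illustration.

```python
# Toy model: per-cycle issue for a 4-wide core. Each thread has some
# available ILP per cycle; a pipe (or half-pipe) is capped at its width.

def issued(width, ilp_per_cycle):
    """Instructions issued each cycle: limited by width and by ILP."""
    return [min(width, ilp) for ilp in ilp_per_cycle]

# One thread with modest ILP on the full 4-wide machine:
thread_a = [2, 1, 3, 1, 2, 1]          # hypothetical available ILP
unified = sum(issued(4, thread_a))     # one thread, width 4

# Same thread plus a second one, each on a 2-wide half:
thread_b = [1, 2, 2, 1, 1, 2]
split = sum(issued(2, thread_a)) + sum(issued(2, thread_b))

print(unified, split)   # split mode fills more slots overall
```

Note that in split mode, thread A's one high-ILP cycle (three ready instructions) gets clipped to two, which is exactly the "one thread can't use all the resources" downside discussed next.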
And you could think about trying to put them together. This is what the Millennium processor, the UltraSPARC V, tried to do: you can actually switch between this mode and this mode. But you still have a lot of open slots here that you can't go use, so it still leaves some vertical waste. Another downside is that you can't easily have one thread use all of the resources in this mode. You can't have, let's say, thread one use this functional unit over here, in this very static design.

So this brings us to full simultaneous multithreading, or SMT. In SMT we are able to mix and match all these different instructions, but this is going to change our processor pipeline quite a bit. Okay, so let's look at what this does to a processor pipeline. Well, all of a sudden we need to be fetching multiple instructions at a time, from different threads. That's definitely a harder thing to do. Conveniently, we can actually use our instruction queue, or our reservation stations, if you will, to go find different instructions to execute at the issue stage of our out-of-order processor. What we can do is tag the different threads independently, and have that tag be part of the instruction issue logic, such that you know that thread one's register one is different from thread two's register one.
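To make that thread-tagging idea concrete, here's a hedged sketch (my own illustration, not a real machine's issue logic): if every register name in the issue queue carries its thread ID, one shared wakeup/select network can serve all threads without confusing thread 1's r1 with thread 2's r1.

```python
# Sketch of thread-tagged issue logic. Register operands are keyed by
# (thread id, register name), so identically named registers from
# different threads never alias in the shared queue.

from collections import namedtuple

Instr = namedtuple("Instr", ["tid", "dest", "srcs"])

def select(queue, ready_regs, width):
    """Pick up to `width` instructions whose (tid, reg) sources are ready."""
    picked = []
    for instr in queue:
        if len(picked) == width:
            break
        if all((instr.tid, s) in ready_regs for s in instr.srcs):
            picked.append(instr)
    for instr in picked:
        queue.remove(instr)
        ready_regs.add((instr.tid, instr.dest))   # result now available
    return picked

# Two threads sharing one queue; both use a register named r1.
ready = {(1, "r1"), (2, "r1")}
q = [Instr(1, "r2", ["r1"]),   # thread 1: depends on thread 1's r1
     Instr(2, "r2", ["r1"]),   # thread 2: depends on thread 2's r1
     Instr(1, "r3", ["r9"])]   # not ready: thread 1's r9 not produced yet
picked = select(q, ready, width=4)
print([(i.tid, i.dest) for i in picked])
```

Both ready instructions, one from each thread, issue in the same cycle, while the instruction waiting on thread 1's r9 stays in the queue.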
And then you use the same logic that we've had in our out-of-order superscalars, in our issue queue, to go find instructions from multiple threads to stick down the pipe, or the respective multiple pipes. Conveniently, we have most of this logic already. What's also nice about this is that if you're only running one thread at a time, when you go to look in your instruction queue, or your issue window, only that one thread will show up. So you know you don't need to go look at the other threads, and that one thread can use the full machine.

This allows you to use the same machine such that if you have lots of thread parallelism, you can run different threads and fill all the slots. And if you don't have that parallelism, and you want to exploit instruction-level parallelism, you can use the slots in a different allocation and just accept the dead slots, the purple slots. You can see here: here we are issuing three instructions from the checkerboard blue thread, and here we are issuing four. Let's say it's the exact same program, executing on the same cycle. The machine can issue different amounts of instruction-level parallelism from a thread, depending on whether there are multiple threads running. And this is what these simultaneous multithreading machines attempt to do.
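Here's one more toy comparison of my own (made-up numbers, not measurements): coarse-grained multithreading only ever issues from one thread per cycle, so it attacks vertical waste, while SMT packs instructions from several threads into the same cycle up to the issue width, attacking horizontal waste too.

```python
# Toy slot-count comparison: coarse-grained switching vs. SMT on the
# same hypothetical workload. WIDTH is the machine's issue width.

WIDTH = 4
# Per-cycle issuable instructions for three hypothetical threads:
threads = [[2, 0, 1, 3], [1, 2, 0, 1], [1, 1, 2, 0]]

def coarse_grained(threads):
    """Round-robin: cycle i runs thread (i mod N) alone."""
    return sum(min(WIDTH, threads[i % len(threads)][i])
               for i in range(len(threads[0])))

def smt(threads):
    """Each cycle, greedily fill the WIDTH slots from every thread."""
    total = 0
    for cycle in range(len(threads[0])):
        slots = WIDTH
        for t in threads:
            take = min(slots, t[cycle])
            total += take
            slots -= take
    return total

print(coarse_grained(threads), smt(threads))
```

On these numbers the SMT machine issues noticeably more instructions over the same four cycles, because whenever the "current" thread stalls, the other threads' ready instructions fill the leftover slots.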
There's one other point I want to make here. When you're doing this simultaneous multithreading, you have to worry about priorities between threads, because you want some sort of equal-progress guarantee, and round-robin is probably not good enough. You need to somehow figure out how to not have one thread hog the machine, and have some fairness between the different threads.

So I wanted to show a few examples here. On the top we have the IBM POWER4 processor pipeline, and this was a non-multithreaded machine. The POWER5 architecture actually looks very similar to the POWER4 architecture, except they added a second hardware thread. What they had to do is add more fetch bandwidth, and then they had this notion of group formation. Group formation was the picking, out of the two threads, so they could figure out what they could actually execute simultaneously, and they effectively had extra pipe stages out in front to go do that. And at the end here, you have to commit instructions from two different program counters at the same time. You can see there is definitely complexity in building this: all of a sudden, you have to basically go into the reorder buffer and kill one thread's instructions when it branch mispredicts, but not the other thread's.
And we already talked about how to do that in a superscalar, where we can kill a subsection of instructions and not have to kill the whole instruction sequence.

Here's a different view of the POWER5, in a physical chip view. One of the interesting things to see is that they went with two threads, because if they went to four threads, they figured out they would actually be using the resources too much, and bottlenecking in different locations in the pipe. Basically, they didn't have enough resources over here for it to pay off; those structures would be filled if you had more than two threads executing anyway, so there were diminishing returns past that point.

We're almost done, and I think we're nearly out of time, so let me skip forward here and just summarize all the different types of multithreading. You have your superscalar processor. You have very fine-grained multithreading, where you switch on each cycle. You have coarser-grained multithreading, where you switch every few cycles. You can think about cutting the processor in half, and using some of the functional units for one thread and some for another thread. And then we have our full simultaneous multithreading.
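On the fairness point raised earlier: one well-known policy from the SMT literature (Tullsen et al.'s follow-up work) is ICOUNT, which fetches for the thread with the fewest instructions already in the front of the machine. This is a simplified sketch of the selection rule, with made-up occupancy numbers, not a description of any shipping processor.

```python
# ICOUNT-style fetch priority, heavily simplified: favor the thread
# with the least front-end occupancy, so no single thread can fill
# the shared issue queue and starve the others.

def pick_fetch_thread(in_flight):
    """in_flight: dict of thread id -> instruction count in the pipeline."""
    return min(in_flight, key=in_flight.get)

counts = {0: 12, 1: 3, 2: 7}      # hypothetical occupancy per thread
print(pick_fetch_thread(counts))  # the least-represented thread fetches next
```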
So we'll pick up on this next time and finish up with some of the implementations, the Pentium 4 and how they did multithreading. But we'll stop here for today.