We left off last time talking about some implementation issues around multithreading. So, a little bit of background on what we're doing today: we're going to be finishing up multithreading. We're going to be using cutlery, and you'll see why we're using cutlery in a minute. That's right, a whole box of forks. And we'll be talking about synchronization and synchronization primitives.

So let's go back to multithreading and look at actual simultaneous multithreading, which is where we left off. Just to recap, we talked about different types of multithreading. We talked about non-multithreaded processors, so we have a superscalar where we're trying to fill in the empty slots here. We can do fine-grained multithreading, though that's not simultaneous multithreading. We can do coarse-grained multithreading; we can even possibly think about trying to do this in some fast software layer. You could think about just cutting your cluster in half and using half your ALUs for one thread and half your ALUs for another thread. That is technically a version of multithreading; it is executing at the same time, but it's not necessarily using all the resources. And then you could think about actual simultaneous multithreading, where you're issuing instructions from different threads into different functional units at the same time.

So, we talked about this last time, and one of the things that came up was, how do you go about getting parallelism? If you have thread-level parallelism, you can mix and match different instruction slots from different threads and get parallelism that way. And one of the great things about simultaneous multithreading is that if you have a fast superscalar machine and you try to use it for simultaneous multithreading, you could think about reducing the number of threads if you don't have enough thread-level parallelism, and actually go after instruction-level parallelism. So here we see these blue crosshatched block-zero instructions, and this is the same thread running in the same sort of time period. And if you have lots of threads running, you could actually intermix the different threads and run the blue crosshatched thread slower than if you were to use the parallel resources to run that one thread faster. And this is one of the big insights here that simultaneous multithreading lets us go after.
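To make that slot-filling idea concrete, here is a minimal sketch (my own illustration, not from the lecture) of a 4-wide issue stage. Stalls are modeled as None entries in each thread's instruction stream, and the greedy issue loop fills slots one thread leaves empty with ready work from another; the names and the policy are assumptions for illustration.

```python
ISSUE_WIDTH = 4  # hypothetical 4-wide machine

def cycles_to_drain(threads):
    """Greedy SMT issue: each cycle, fill up to ISSUE_WIDTH slots,
    drawing from whichever threads are not currently stalled."""
    queues = [list(t) for t in threads]
    cycles = 0
    while any(queues):
        issued = 0
        for q in queues:
            # A None marks a stall (e.g., a cache miss): this thread
            # gives up its remaining slots for this cycle.
            while q and issued < ISSUE_WIDTH:
                if q[0] is None:
                    q.pop(0)
                    break
                q.pop(0)
                issued += 1
        cycles += 1
    return cycles

# Two threads with frequent stalls: run alone, each wastes issue slots
# on stall cycles; run together, one thread's work fills the other's holes.
t1 = [1, 2, None, 3, 4, None, 5, 6, None, 7]
t2 = list(t1)
print(cycles_to_drain([t1]))      # one thread alone: 4 cycles
print(cycles_to_drain([t1, t2]))  # both threads sharing: 6 cycles, not 8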
So we briefly talked about this, and flew through it at the end of lecture last time, because we talked about what the cost is of implementing simultaneous multithreading. Conveniently, there is actually a really good example of this: the Power4 by IBM and the Power5 by IBM. They're very similar architectures, except the Power5 has two-way simultaneous multithreading, and the Power4 is almost the same architecture without the two-way simultaneous multithreading.

So you look at these two pictures. One of the big things they added is more fetch bandwidth, so you can fetch from two different program counters at the same time. You can decode more instructions. And then they added an extra pipe stage here, where they do what they call group formation, which is basically the scheduling of the two different threads' instructions together by the time they reach the execution units. Also, over here, as we've talked about before, even without simultaneous multithreading it's pretty useful to have more physical registers and more architectural registers. You're minimally going to need more architectural registers; more physical registers is probably a good thing for performance.

So, we're talking about implementation details, and one of the questions that comes up is, what is the cost of actually implementing this simultaneous multithreading on this processor? And one of the questions that also comes up here is, why two threads instead of four threads, or eight threads, or more? It has costs. One of the things is that they would actually have to start replicating more data structures. For the pipeline that they had roughly left over from the Power4, they had enough compute and bandwidth through their execution units to be able to handle two threads. But if they wanted to try to go to more threads, they basically bottleneck somewhere in their architecture. I don't know exactly where they bottlenecked; the paper talks about this, and it basically says that they bottleneck somewhere in their execution. I guess they leave it a little vague. So they decided that it wasn't worth adding more than two threads, because it's not going to give you any performance increase unless you have threads that have a lot of stalls.
If you have threads with enough stalls, you probably have enough holes to try to intermix another thread in there.

So, some of the changes in the hardware they went about making: well, they actually increased their cache sizes, and they increased the associativity of their caches. Their last-level cache here went to 1.92 megabytes versus 1.44, for the L2 and L3 together. And you can see here one of the interesting things: they start to actually separate data structures, because if you have multiple threads running and you try to mix them all into one big hardware data structure, one of the problems that comes up is that maybe one of the threads is going to hog the data structure. So threading actually introduces a lot of complexity having to do with figuring out how to be fair in the usage of the structures. One solution is you just replicate: anywhere it was hard to get it right, or where there were a lot of conflicts, you could just replicate. So they added separate instruction fetch, prefetch, and buffering. And then they added more physical registers; they call them virtual registers here, but it's effectively more physical registers, so more places to schedule into.

So there were some added hardware costs to this. To give you an idea of the area: a 24% area increase, and a lot of this is just due to the cache, because they made the cache so much larger. So you have to ask yourself, does it make sense to use 24% more area to run two threads, or at what point does it make sense just to plop down a second processor? It's a tough trade there. Well, if it's 24% bigger and you get a two-times performance boost, it was a good trade. If it's 24% bigger and you get less than a 24% performance boost, it's pretty questionable.
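That break-even rule of thumb is easy to check with arithmetic. Here is a tiny sketch (my own framing, not a formula from the lecture): if the speedup factor exceeds the relative area growth, the feature beats simply spending that silicon elsewhere at equal efficiency.

```python
def perf_per_area(speedup_factor, area_increase):
    """Throughput per unit of silicon, relative to the baseline core.
    > 1.0 means the feature pays for its area; < 1.0 means it doesn't."""
    return speedup_factor / (1.0 + area_increase)

print(perf_per_area(2.00, 0.24))  # 2x boost for 24% area: ~1.61, a good trade
print(perf_per_area(1.20, 0.24))  # 20% boost for 24% area: ~0.97, questionable
```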
So let's look at another example, and then we'll wrap up on the performance of this and see if they did get their 24% performance boost. The first implementation of simultaneous multithreading (there have been many implementations of multithreading, but, excuse me, this was the first simultaneous multithreading) was the Pentium 4 processor. Now, they called it hyperthreading; that's an Intel marketing term. It is simultaneous multithreading. But in the Pentium 4 they didn't actually replicate very many structures; they didn't really change the processor very much. And in fact there are basically modes to turn off the simultaneous multithreading they're paying for, and it was not a flagship feature when the processor first shipped. So the basic idea in the Pentium 4 is that, as the little bit of extra hardware, they did duplicate some resources here. Overall, they increased their die area by about 5%, so a very, very small amount. Now the question is, what performance boost can they get from this?

One of the interesting things here is that a big problem started to come up with the Pentium 4: when you were trying to run one thread on it, you weren't guaranteed that the performance of that one thread would be equal to what you'd get if you turned off hyperthreading, or sorry, simultaneous multithreading. So they had a mode switch they could turn off, where they would take some structures that they partition 50/50 when they're in multithreading mode, so you didn't have deadlock problems and you didn't have resource contention problems on the shared structures. And when you turned this mode switch off, some programs got faster. To some extent, what that meant is they didn't size those structures correctly to be running in multithreading mode at all times. And the switch was not a little switch; it was a big switch, as in reboot the computer and change the switch. It was a boot-time parameter to the chip. And this left such a bad taste in Intel's mouth that simultaneous multithreading got kicked out of Intel land for a few generations. The Pentium M, Core Duo, and Core 2 Duo all kicked out simultaneous multithreading, and it didn't work its way back in until the Nehalem Core i5, i7 sort of processors relatively recently. So it's interesting to see that you can get by with a little bit of hardware, but you have to be careful.

Now, what was the biggest problem that they had here? The biggest problem was that the load-store queue on the Pentium 4 was split in half when they were running in two-thread mode. The first Pentium 4 architecture that shipped with simultaneous multithreading turned on just split the structure in half, and it didn't have enough bandwidth, or didn't have enough entries, for lots of programs when it was cut in half. It was sized to run one thread; they cut it in half statically, and when they tried to run two threads it effectively did not provide enough performance.
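Here is a minimal sketch of that static-split problem; the entry counts and names are made up for illustration, not the actual Pentium 4 sizes. With a hard 50/50 partition, a memory-hungry thread stalls as soon as its half fills, even if the other half is sitting empty.

```python
class StaticSplitLSQ:
    """Illustrative model: a load-store queue sized for one thread,
    cut in half per thread when two-thread mode is on."""

    def __init__(self, total_entries=48, n_threads=2):
        self.limit = total_entries // n_threads  # fixed private share
        self.in_flight = [0] * n_threads

    def try_issue_load(self, tid):
        # A load gets an entry only from its own partition.
        if self.in_flight[tid] < self.limit:
            self.in_flight[tid] += 1
            return True
        return False  # stall, even if the other partition is empty

    def complete_load(self, tid):
        self.in_flight[tid] -= 1
```

A dynamically shared queue would instead admit a load whenever total occupancy is below the full size, which recovers the single-thread case but then needs exactly the fairness and anti-starvation logic this lecture keeps coming back to.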
So they couldn't get enough loads out to the memory system from one thread when they cut that structure in half. It really starts to bring up the question of what the right allocation of resources is. Should you use some sort of round-robin scheme? Should you use dynamic allocation? Should you use static allocation? Should you replicate? If you replicate, it costs more area. These are some of the big challenges here.

Okay, so let's look at some of the performance here. The numbers were pretty abysmal for the Pentium 4, even when everything was going well. For example, the Pentium 4 Extreme Edition with simultaneous multithreading gave you a 1% speedup when running two threads on SPECint rate. So you're running multiple copies of the same program; that's how SPEC rate is done. And that was 1% for the integer suite. For the floating point, it did a little bit better: a 7% improvement. Now, that's not horrible considering it was only a 5% area increase, but still, that's not saying a whole lot. I mean, you have to question whether this whole level of complexity was really worth it, and whether that 5% of area could have been used for something else to get a few percent performance increase. And this was, to some extent, pretty typical across applications for the Pentium 4.

One of the interesting things about the Pentium 4 was that because the cache wasn't huge (especially the L1 cache, which was relatively small because they wanted it to operate so fast at the very high clock frequencies), you get a lot of pollution in there from the two threads, which would actually destructively interfere in the level-one cache. So that was one of the big things they had to try to make up for. And this is a problem with any multithreading, not just simultaneous multithreading: the threads can actually fight for space in the cache and end up with both capacity and conflict misses.

The Power5 did a lot better here. They thought a little bit harder about sharing structures, duplicating structures, and the allocation of the different resources in there. So on SPECint rate they were getting a 23% improvement for their roughly 24% area increase. Okay, well, this might actually be a good idea. It might have some value.
And for the floating point version, they're not doing great, but they're doing better here. The floating point apps had a lot of cache conflicts; the SPECfp apps, if you go look inside them, typically have very large data sets, so they were basically conflicting in their larger, last-level caches and things like that.

So finally, I wanted to give a little bit of color to this idea of picking fairly and not having starvation in the different resources of a simultaneous multithreading pipeline. One processor that is sort of the famous simultaneous multithreading processor that was never built is the EV8, Digital Equipment Corporation's last Alpha. This last Alpha, the EV8, also known as the 21464, was an eight-way-threaded design. Note this was never built, but it was superscalar and it had eight-way simultaneous multithreading. It was also a very aggressive processor. And when I say it was never built, I mean it was never shipped commercially; they went pretty far into the design of it.

So one of the questions that they came upon here is, what is the right way, in an out-of-order pipeline, to make sure that you're getting good utilization of the pipeline when you're issuing instructions from different threads? In this example here, we have, we'll say, four different threads, and you need to choose which thread to pick from to go into your multi-issue out-of-order pipeline. By definition, you want to try to fill in the holes of one of the threads with work from another thread. So complete fairness here, or lockstep round-robin choosing, for instance, is not what you want to do; it's completely the wrong thing to do, because you want to fill in, when one thread is stalled, with other threads' work. But you also want to make sure that a thread whose instructions, let's say, have no dependencies on each other, or that can run very fast, doesn't go to memory very often, just doesn't stall very often, doesn't simply hog the processor, because it's never going to introduce stall cycles on itself by itself.
So they came up with this idea called icount, and it was a choosing policy which basically looked at the number of instructions that were retiring out of the back of the pipeline. It had something like a moving window to try to estimate how many instructions from each thread were completing, and then it fed that back around to the front of the pipeline here to determine which thread to issue from. I just wanted to get across the idea that if you try to intermix different threads in a single data structure, it's relatively simple: you can either cut the data structure in half or have some round-robin allocation there. It gets much more complicated when you have a full processor pipeline that you need to figure out how to issue into, and then, you know, twenty stages later, if something happens, there's so much stuff in flight that you need to worry about this a little bit harder. So here, they basically were looking at how many instructions were in flight and how many instructions had recently committed from a thread to determine what to do. And on average, they were hoping to get some fairness between the threads.
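As a rough sketch of how such a choosing policy might look (my illustration of the idea as described, not the actual EV8 logic), you can track each thread's in-flight count, fetched minus committed, and fetch from the thread occupying the least of the pipeline. A stalled thread's count grows, so it stops receiving fetch slots; a thread that commits steadily keeps earning slots without starving the others.

```python
class IcountStyleChooser:
    """Illustrative icount-style fetch policy: prefer the thread with
    the fewest instructions currently in flight."""

    def __init__(self, n_threads):
        self.in_flight = [0] * n_threads

    def on_fetch(self, tid, n=1):
        self.in_flight[tid] += n   # instructions enter the pipeline

    def on_commit(self, tid, n=1):
        self.in_flight[tid] -= n   # instructions retire out the back

    def choose(self, runnable):
        # Fetch next from the runnable thread clogging the pipe least.
        return min(runnable, key=lambda t: self.in_flight[t])
```

For example, if thread 0 has thirty instructions stuck behind a cache miss and thread 1 has five in flight, choose([0, 1]) returns 1, steering fetch bandwidth away from the clogged thread, which is exactly the feedback-to-the-front-of-the-pipeline behavior described above.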