So the idea here is that if we can guarantee there's no dependence between instructions, we can execute those instructions in parallel. And one really good way to have guaranteed independence is to have completely different programs, or completely different threads. The naive approach here is to interleave instructions from different threads to cover latency. So we send a load out, and while this load is off resolving in the memory system (this is our thread one that we had here before; we load into R1 and then go to use the result, also in thread one), by the time the data comes back from memory it's all ready to go, and in the meantime we've put other work into our processor pipeline. This is called multithreading.

Now, one of the big insights here is that you need to have enough other work to do, and these other threads add some complexity to your processor design. They may also cause some locality challenges. If you're trying to exploit temporal locality in, let's say, a cache or a translation lookaside buffer or some other buffer, and you start putting other data accesses or other operations in the middle, this might destroy your locality. But let's think about this in a happy way for right now: if you can find other work to do, you can just shove it into the slots while you're waiting for this load to come back.

The most basic version of multithreading is going to be some sort of very basic interleaving. We run thread one, then thread two, then thread three, then thread four; thread one, thread two, thread three, thread four. We don't pay attention to the latencies or dependencies of the instructions when we go to schedule. And if you come around to a thread's slot and there's nothing to do, let's say this load missed in the cache and took a thousand cycles to come back, you just don't schedule anything; you have a dead cycle there. One of the nice things about multithreading that you can take advantage of is that you don't have to worry about bypassing anymore. If you know that you're not going to execute another instruction from the same thread until several cycles later, why do you need a fast bypass from the ALU back to itself? You just don't care.
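To make that fixed interleaving concrete, here's a minimal sketch, not from the lecture, with a made-up thread count and miss latency: a round-robin thread selector issues one instruction per thread per rotation, and if a thread's load hasn't returned when its slot comes up, the slot is simply a dead cycle.

```c
#include <stdio.h>

#define NUM_THREADS 4

/* Cycle at which each thread's outstanding load will be ready.
 * 0 means the thread has no load in flight and can issue immediately. */
static unsigned long load_ready_cycle[NUM_THREADS];

int main(void)
{
    unsigned tid = 0;

    /* Pretend thread 0 issued a load that misses and is ready at cycle 9. */
    load_ready_cycle[0] = 9;

    /* Fixed round-robin interleave: one issue slot per thread per rotation.
     * No dependence or latency information is consulted when scheduling. */
    for (unsigned long cycle = 0; cycle < 16; cycle++) {
        if (load_ready_cycle[tid] > cycle) {
            /* The thread's slot came up but its load hasn't returned yet:
             * nothing is scheduled, and the slot is simply a dead cycle. */
            printf("cycle %2lu: thread %u waiting on memory (dead cycle)\n",
                   cycle, tid);
        } else {
            load_ready_cycle[tid] = 0;
            printf("cycle %2lu: issue one instruction from thread %u\n",
                   cycle, tid);
        }
        tid = (tid + 1) % NUM_THREADS;   /* the simple incrementing thread select */
    }
    return 0;
}
```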
So, let's see, do we actually have an example of that? Here's one: say you have a write to R1 and then a read of R1. What value does thread two get here? The first versions of multithreading machines tried to partition the register file, so you compiled your programs differently. That's not really common anymore; we'll talk about an example where they actually did that. That was actually also a limiter on the number of threads for a while, until people came up with the idea of what you just suggested, which is to somehow change the namespace of the registers. The way that was actually done, in the first implementation, which we'll talk about in a little bit, was on SPARC. They used the register windows on SPARC, which gave you different namespaces for different registers. That finally evolved into actually having a thread identifier and having copies of the entire register space, so each thread had a different register set. And that's what we're going to show in this picture here.

As shown here, you have to copy the general-purpose register file four times, and you have to copy the program counter four times. Then you have an incrementer out on the front of the pipe, which chooses the thread ID, or the thread select. In our simple case here, we're just going to keep incrementing this and choose one, two, three, four; one, two, three, four; and so on. Likewise, that's an index into which general-purpose register file we're supposed to be accessing, or which logical general-purpose register file, because most of these architectures will actually put this together into one larger general-purpose register file and then have this just be some addressing bits that are not architecturally visible.

But you bring up a good point: there are a couple of userland threading libraries out there. They use this notion of threading, but not on a multithreaded computer architecture, so you don't have this replicated state. There are still some advantages to that, because you can still swap between different threads and try to cover memory latency, since your load-to-use distance will be longer.
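As a small aside before we continue, here's a rough sketch of the replicated state just described; the names and sizes are illustrative rather than anything from the slides. The PC and the general-purpose register file are indexed by the hardware thread ID, which is also why thread two reading R1 never sees thread one's write to R1.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 4   /* hardware thread contexts */
#define NUM_GPRS    32  /* architectural registers per thread */

/* Per-thread architectural state: one PC and one logical GPR file each.
 * Physically this could be one larger register file in which the thread ID
 * supplies extra addressing bits that are not architecturally visible. */
struct hw_thread_state {
    uint64_t pc;
    uint64_t gpr[NUM_GPRS];
};

static struct hw_thread_state ctx[NUM_THREADS];

int main(void)
{
    /* Thread 0 writes R1; thread 1 then reads R1.  Because the register file
     * is replicated (indexed by thread ID), thread 1 sees its own R1, not
     * thread 0's value. */
    ctx[0].gpr[1] = 42;
    printf("thread 0 R1 = %llu\n", (unsigned long long)ctx[0].gpr[1]);
    printf("thread 1 R1 = %llu\n", (unsigned long long)ctx[1].gpr[1]);

    /* Thread select at the front of the pipe: just an incrementing counter
     * that picks which PC to fetch from and which logical GPR file to use. */
    unsigned thread_select = 0;
    for (int i = 0; i < 8; i++) {
        printf("fetch from PC of thread %u (pc=%llu)\n",
               thread_select, (unsigned long long)ctx[thread_select].pc);
        thread_select = (thread_select + 1) % NUM_THREADS;
    }
    return 0;
}
```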
Coming back to those userland threading libraries: they largely do still try to partition the register file space. You can actually go download one of these on, you know, Linux and go run it; people still use these userland thread libraries to do that. The other way to do it is to swap out the entire register space, but people typically try not to do that, because if you're trying to do fine-grained interleaving and you want to save your entire register space to memory, that's very expensive. Instead, these threading libraries work together with the compiler and tell the compiler: don't use all the registers for thread one. Use half the registers for thread one, and half the registers, let's say, for thread two.

So this is our simple multithreading pipeline. It's relatively small changes so far: replicate the register file, replicate the PC. And this can help us recover utilization on our ALU, for instance. What we wanted to point out is what this looks like to software: it looks like multiple processors. If you go use something like a modern-day Core i7, those have two hardware threads per core, but when you go to look at it, it'll look like there are twice as many cores in the machine as you think there are. So, for instance, if you have a four-core Core i7 and you open up the process management dialog box in Windows, you'll see eight little bars in there, because there are two virtual processors per physical core in the machine. So they actually look like multiple slower CPUs.

Okay, so what are the costs? The easy costs are replicating the program counter and the general-purpose registers. Things that start to become harder to replicate, but which you're also going to have to replicate if you want full isolation of these virtual CPUs, are the system state: things like page tables, page table base registers, and all the different system state around interrupt handling, such as the exception PC. This actually gets a little bit hard to do, and some processors have a fair amount of system state. If you look at x86, there's a lot of system state.
MIPS is relatively minimal, because it's basically the exception PC and the interrupt mask register. The TLB is software-maintained, so you don't even have a virtual memory page table base register. But it can still be hard, and the thing to think about is that you have to replicate all of that state per thread. So there is some cost to it, but you still get to hide latencies to memory, and you still get to reuse your ALUs and increase their efficiency. Personally, I think these are the smaller issues: the first one is very small, and the second is bigger than the first but mostly just harder to do.

The big issue is when you start to have temporal locality conflicts and spatial locality conflicts. If you have four threads that are all executing and they're fighting for capacity in your cache, this can be a problem. Or, let's say you have sixteen hardware threads and an 8-way set-associative cache, and they all want to access, let's say, index zero in the cache. This is a common thing; your stack always starts there, or something like that. Well, you only have 8-way associativity, and if you're time-multiplexing threads one through sixteen and then coming back around to thread one, you're guaranteed that by the time you come back to thread one, because you only have an 8-way set-associative cache, you've bumped out the data that it needed. So you get no temporal locality; by the time you come back, it's not in your cache, and you have a conflict going on in your cache.

Okay, so that's the negative side: the threads can conflict with each other. But if two threads are working together, you can actually get positive interference, or constructive interference, in the cache, and it definitely happens. That was actually one of the reasons people originally did this multithreading: if you have a program where different threads are working on roughly the same data, this is actually very good, because the threads will be pre-caching data for each other.
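To make the destructive case concrete, here's a small illustrative sketch, not from the lecture: sixteen round-robin threads each touch their own line, all of which map to the same 8-way set, and with LRU replacement every single access misses, because a thread's line has been bumped out before its next turn comes around.

```c
#include <stdio.h>

#define WAYS     8      /* one set of an 8-way set-associative cache */
#define THREADS 16      /* sixteen hardware threads, fixed round-robin */

static int           line_owner[WAYS];  /* which thread's line sits in each way */
static unsigned long last_use[WAYS];    /* 0 = way never used (empty) */
static unsigned long now = 1;

/* Each thread touches its own line, but all the lines map to this one set. */
static int access_set(int tid)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++)
        if (last_use[w] != 0 && line_owner[w] == tid) {
            last_use[w] = now++;
            return 1;                                  /* hit */
        }
    for (int w = 1; w < WAYS; w++)                     /* miss: pick LRU or empty way */
        if (last_use[w] < last_use[victim])
            victim = w;
    line_owner[victim] = tid;
    last_use[victim] = now++;
    return 0;
}

int main(void)
{
    int hits = 0, misses = 0;
    for (int i = 0; i < 64; i++) {
        int tid = i % THREADS;                         /* cycle through all 16 threads */
        if (access_set(tid)) hits++; else misses++;
    }
    /* With 16 threads sharing only 8 ways, every access misses: by the time a
     * thread's slot comes around again, its line has already been evicted. */
    printf("hits=%d misses=%d\n", hits, misses);
    return 0;
}
```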
Well, you have to make sure; it's a fine line to walk there, between when the threads fight for shared resources and when they collaborate on pulling in shared data and exploiting the locality structures in the processor. The cache is one example; another example is translation lookaside buffer entries. One solution to this is that you can actually partition the cache, or partition the TLB, and then you can guarantee that the threads do not fight. Unfortunately, if you go to do this and you have a fixed-size cache, you effectively cut the cache in half, or by however many threads you have, by N for N threads. So typically people don't try to statically partition these things, but you could think about doing that if you want no interference between the different threads.

Okay, so let's look at a couple of different ways that you can try to schedule threads. They've gone from simple, to a little more complicated, to even more complicated over time. Simple is what we've been talking about up to this point: there's some fixed interleaving of N hardware threads, and you basically execute one instruction from each thread and then change to the next thread. That's our first case. This fixed interleaving dates back to the 60s, and it's pretty simple. What's nice about it is that you can remove some of the interlocking and some of the bypassing from your processor.

The next thing you can think about is trying to allocate different locations in your scheduling quantum to different threads. For instance, say we know the cyan thread here is the highest-priority thread; it needs the most slots. So we can do an uneven distribution and say, over our slots here, the cyan thread gets every other one, and then the orange one gets a smaller percentage and the blue one gets a smaller percentage. So it's still a fixed interleaving to some extent, or a fixed ordering, but it's controllable by software, or by the operating system, depending on the priorities of the different threads.
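Here's a minimal sketch of that software-programmable interleaving; the table contents and priorities are made up for illustration. The OS fills in a fixed schedule of thread IDs for one scheduling quantum, and the hardware walks the table instead of just incrementing a counter.

```c
#include <stdio.h>

#define QUANTUM 8   /* issue slots per scheduling quantum */

/* Slot table written by software (the OS) to reflect thread priorities.  Here
 * the high-priority "cyan" thread (0) gets every other slot, "orange" (1)
 * gets two slots, and threads 2 and 3 get one slot each. */
static unsigned slot_table[QUANTUM] = { 0, 1, 0, 2, 0, 1, 0, 3 };

/* Thread select: instead of a free-running counter, walk the table. */
static unsigned next_thread(void)
{
    static unsigned slot = 0;
    unsigned tid = slot_table[slot];
    slot = (slot + 1) % QUANTUM;
    return tid;
}

int main(void)
{
    /* First quantum with the allocation above. */
    for (int i = 0; i < QUANTUM; i++)
        printf("slot %d -> thread %u\n", i, next_thread());

    /* The OS can rewrite the table for the next quantum, for example to make
     * "orange" the high-priority thread instead of "cyan". */
    unsigned new_table[QUANTUM] = { 1, 0, 1, 2, 1, 0, 1, 3 };
    for (int i = 0; i < QUANTUM; i++) slot_table[i] = new_table[i];
    for (int i = 0; i < QUANTUM; i++)
        printf("slot %d -> thread %u\n", i, next_thread());
    return 0;
}
```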
So it's a little bit more flexible in hardware than the completely fixed interleave design, and it does require a little bit of change here, because you can't just have this counter incrementing your thread ID. Instead, you need something else here that acts as a picker for which thread you go execute, but it's still relatively simple. And because it's software programmable, you can actually choose a time and then reallocate, so the OS can change the allocation for a different time period and, let's say, make orange a higher priority instead of cyan.

Then we start to go to something a little more complicated; the schemes so far still have relatively fixed priorities. You could think about something where the hardware actually makes decisions about which thread to execute. You could even have it go as far as determining that the thread it's executing has hit a long-latency operation, and switching to another thread at that point. So it purposely backfills work from other threads when one thread gets swapped out, or when a thread goes off to do something that's a long-latency operation. You can think about designs like that, and the HEP processor, which we'll be talking about in a minute or two, was one of the first early examples of that.

So just to recap here: we're going to call this coarse-grain multithreading. The reason we call it coarse-grain is that in any one cycle, only one thread is running. You might scratch your head and say, how could you have multiple threads running in one cycle? Well, we'll show that if you go to a superscalar, or multi-issue, processor, you could think about having the different pipelines executing instructions from different threads simultaneously. That's called simultaneous multithreading, and we'll talk about that in a second. But in our coarse-grain approach here, just to recap, you can swap threads on hardware misses, and you can take advantage of bubbles in the processor to do other work.

Okay, so a little bit of a history lesson here. The HEP processor was Burton Smith's machine; he's now at Microsoft Research, and he was the chief architect of this back in the 80s.
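Before we get into the HEP itself, here's a rough sketch of that hardware-driven switch-on-long-latency policy. It's purely illustrative, not the HEP's actual mechanism: the picker stays with the current thread until that thread issues a long-latency operation, then backfills with any other thread that's ready to run.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4

static bool waiting_on_memory[NUM_THREADS];  /* set while a thread's load is outstanding */
static unsigned current = 0;

/* Called each cycle: stay with the current thread until it hits a long-latency
 * event, then backfill with any other thread that is ready to run. */
static unsigned pick_thread(void)
{
    if (!waiting_on_memory[current])
        return current;                      /* no reason to switch */
    for (unsigned i = 1; i <= NUM_THREADS; i++) {
        unsigned candidate = (current + i) % NUM_THREADS;
        if (!waiting_on_memory[candidate])
            return current = candidate;      /* switch to a ready thread */
    }
    return current;                          /* every thread is stalled: dead cycle */
}

int main(void)
{
    for (int cycle = 0; cycle < 8; cycle++) {
        unsigned tid = pick_thread();
        printf("cycle %d: run thread %u\n", cycle, tid);
        if (cycle == 2)
            waiting_on_memory[tid] = true;   /* the running thread misses at cycle 2 */
    }
    return 0;
}
```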
This processor is actually pretty interesting, because there were lots and lots of threads and a small number of processors. There were 120 threads per processor in this machine, and a relatively modest clock rate for the 80s, but what they were trying to do was deal with memory latency. This machine had a very high-bandwidth memory system, and effectively you were allowed to have a load or a store in every instruction, and your performance would never degrade even if you did. Because this machine had 120 threads per processor and the memory latency was less than 120 cycles, if you had a load in every instruction and enough bandwidth to feed all those loads, you could execute a load from each thread, and each thread was independent of the others. You could basically go out to the memory system, pipeline the latency out to memory, and pipeline the memory access, and by the time you came back around to execute the next instruction from that thread, 120 cycles later, the memory result would be there.

This insight was carried forward: Burton went and started a company called Tera, or Tera Systems, which later went on to buy Cray, strangely enough, because I always thought Cray was a bigger company than Tera, and it definitely was. We'll call it a merger, and I think that's what they called it, but in reality Tera acquired Cray. Tera had a similar sort of idea here, with more processors; it was a further evolution of the HEP processor: 128 active threads per processor, and lots of processors, 256 of them. So there are lots of active programs, and you had to find enough thread-level parallelism in your program. This architecture has no caches, because it doesn't need them: if you have enough bandwidth out to your memory system, you can issue a load every cycle and cover the latency with other threads.

Why do you care? Some interesting things about this: it may not be good for power. You're not exploiting locality in your data references at all, and you're going out to the memory system every cycle, and that could be far away. And it is far away in this machine.
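To pin down the arithmetic behind that HEP claim, here's a tiny helper, purely illustrative: with round-robin issue, the gap between consecutive instructions of one thread is the thread count, so a memory operation is fully hidden whenever the memory latency fits in that gap.

```c
#include <stdbool.h>
#include <stdio.h>

/* With fixed round-robin issue, consecutive instructions from one thread are
 * num_threads cycles apart, so a memory operation started on one turn has
 * completed by the thread's next turn whenever the latency fits in that gap. */
static bool memory_latency_hidden(unsigned num_threads, unsigned latency_cycles)
{
    return latency_cycles <= num_threads;
}

int main(void)
{
    /* HEP: 120 threads per processor, memory latency under 120 cycles. */
    printf("HEP load hidden: %s\n", memory_latency_hidden(120, 119) ? "yes" : "no");
    return 0;
}
```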
So you've got to be a little careful about these machines. The second point here is that you have to come up with lots of threads. We're now talking about numbers of threads similar to modern-day many-core machines, but it's still a fair number of threads to be able to use the machine to its best performance. One thing I wanted to say about this architecture: where it mostly got used was in applications that had no temporal or spatial locality in their data references anyway. Good examples of this were things like data mining: huge data sets, arbitrary access to the data sets, where you couldn't have a prefetcher predict where your next memory reference is going to be. So if you were basically not going to be able to fit in a cache or use a cache effectively anyway, remove the cache and go multithreaded. That was the idea behind these machines, and they actually saw some big speedups for applications with data sets like that.

This still lives on today in the Cray XMT, the extreme multithreaded architecture. You can go buy this machine from what's nowadays called Cray Computer Corp., which was Tera eating Cray, or buying Cray. It's a modest clock speed by modern-day standards, only 500 MHz, but they can intermix these with Opterons and other processors, because they standardized on the AMD bus protocol, so you can plug AMD chips and these XMT chips into the same system. Just a recap of what their memory system looked like: their instruction pipelines were very long, and they didn't have to bypass, because you could never execute an instruction and then a subsequent instruction that would use its result. The memory latencies were about 150 cycles and they had 128 threads, so they were typically not waiting for the memory system and could effectively pipeline their memory operations.

Another little tidbit of history here. This is the machine I was talking about that's my academic lineage a little bit. When I first showed up at MIT, there was this machine in the corner called the MIT Alewife processor, which was a multiprocessor machine. It went up to 128 nodes, which was a lot in 1990.
One of the little tidbits about this machine is that it had SPARC processors, and a SPARC processor has this notion of a register window. What a register window is: every time you do a function call, you basically change your addressing into the register file. You have a larger register file and a smaller window onto it, and you slide this window across the register file on function calls and function returns. The MIT Alewife processor used this, and they extended the register window idea so that they could have a special instruction which would change which part of the register file they were looking at at a time. That way they could swap the entire register set very quickly, and this allowed them to multithread very effectively with four threads. This was one of the early multithreaded processors, and they introduced this notion of thread switch on remote memory access. If you had a memory access that had to go to one of the other nodes in the machine, say you were on this node here and you needed to go down to that one, in a multiprocessor you actually had to send a message and get a response back, and that took a long time. So you would switch threads at that time.

Multithreading lives on today, and especially this coarse-grain multithreading lives on today. A good example is the Oracle, formerly Sun, and before that Afara, Niagara processors. Afara was the name of the startup company that made this; then Sun acquired them, and then Oracle acquired Sun. The Sun Niagara processor was designed for throughput computing, so that you could have lots of threads running, all independent, on a multithreaded processor. The Niagara 1 had eight cores and four threads per core, and they've turned that up over time. Now, with the Niagara 3 (so the first was the Sun T1, then the Sun T2, and this is the Sun T3 now), they have sixteen cores with eight threads per core. So there's a lot of parallelism going on here, and this coarse-grain multithreading goes on. Here's a die photo of the Niagara 3 processor, where we can see the different cores.
You might look at this picture and say there appear to be sixteen cores, and I do say sixteen cores on the slide here, but one of the interesting things is that they actually conjoin two cores together internally. It's a strange design idea: they mix the threads and the cores together, and they share, I think it was the floating-point unit, between the two cores. So these cores are, to some extent, conjoined together.