1 00:00:03,340 --> 00:00:09,400 Okay, so let's switch topics here. We've finished our vector processors, 2 00:00:11,740 --> 00:00:14,800 and we're going to start talking about a technique 3 00:00:15,100 --> 00:00:20,009 that looks at how to exploit a different form of parallelism, 4 00:00:20,009 --> 00:00:24,660 thread-level parallelism, than we've been looking at up to this point. 5 00:00:25,080 --> 00:00:31,020 So this is not data-level parallelism. This is actual thread-level parallelism or 6 00:00:31,020 --> 00:00:39,757 process-level parallelism in your machine. So what's the motivation for changing your 7 00:00:39,757 --> 00:00:43,773 computer architecture to run threads effectively? 8 00:00:43,773 --> 00:00:48,979 So we've talked about threading or multiprogramming in the past, where you 9 00:00:48,979 --> 00:00:54,907 time-slice between different processes. That's not what this is about. 10 00:00:54,907 --> 00:01:00,594 This is about either executing multiple programs at the same 11 00:01:00,594 --> 00:01:05,512 time, or time-multiplexing between processes at the granularity of 12 00:01:05,512 --> 00:01:10,520 a single instruction. So where this really came from is that 13 00:01:10,980 --> 00:01:15,823 sometimes you can't really extract instruction-level 14 00:01:15,823 --> 00:01:20,950 parallelism, which is what out-of-order processors 15 00:01:20,950 --> 00:01:25,788 and superscalars try to exploit. And sometimes you can't necessarily 16 00:01:25,788 --> 00:01:31,117 exploit data-level parallelism in a program, because the data-level parallelism 17 00:01:31,117 --> 00:01:37,300 doesn't necessarily work out, or you don't have the dense-matrix sorts of 18 00:01:37,300 --> 00:01:41,279 codes in which it's very easy to find data-level parallelism. 19 00:01:41,279 --> 00:01:45,756 Instead, sometimes your workloads have thread-level parallelism, 20 00:01:45,756 --> 00:01:51,037 and what we mean by thread-level parallelism is that there are 21 00:01:51,037 --> 00:01:54,742 independent sequential jobs that need to happen. 22 00:01:54,742 --> 00:02:00,300 An example would be that you have to operate on a million things, 23 00:02:00,680 --> 00:02:04,995 but the control flow for each of those million things is wildly different. 24 00:02:04,995 --> 00:02:09,776 So the most efficient way to go about doing this is just to actually have 25 00:02:09,776 --> 00:02:15,097 a process or a job per work unit that you need to work on. 26 00:02:15,097 --> 00:02:18,953 A good example of this is network processing. 27 00:02:18,953 --> 00:02:25,300 You are trying to build a firewall, and a packet comes in on your network card. 28 00:02:26,040 --> 00:02:30,652 Your firewall wants to inspect this packet and look for, I don't know, 29 00:02:30,652 --> 00:02:34,127 malicious attacks. It wants to 30 00:02:34,127 --> 00:02:39,435 look for something like viruses, or attack code sequences, or 31 00:02:39,435 --> 00:02:48,180 trojans, or something else like that. This isn't data-level parallelism, because 32 00:02:48,180 --> 00:02:53,430 you can't have each of the operations working on these packets be the 33 00:02:53,430 --> 00:02:56,045 same. Depending on the type of packet, you're 34 00:02:56,045 --> 00:02:59,049 going to make wildly different control-flow decisions.
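
[Editor's sketch] To make the thread-per-packet idea concrete, here is a minimal sketch using POSIX threads. The packet structure and the inspection rules are invented for illustration; they are not from the lecture, which continues with the TCP/UDP decision below.

    /* One thread per incoming packet; control flow diverges freely
       per packet, which is exactly what lockstep data-parallel
       hardware cannot do cheaply. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    enum proto { TCP, UDP };

    struct packet {
        enum proto proto;
        int        port;
        /* payload omitted for brevity */
    };

    static void *inspect(void *arg)
    {
        struct packet *p = arg;

        if (p->proto == TCP) {
            if (p->port == 22)
                puts("TCP/22: apply SSH rules");
            else if (p->port == 80)
                puts("TCP/80: scan HTTP payload for attack signatures");
            else
                puts("TCP: generic stateful inspection");
        } else {
            puts("UDP: stateless filtering");
        }
        return NULL;
    }

    int main(void)
    {
        struct packet pkts[] = { {TCP, 22}, {UDP, 53}, {TCP, 80} };
        pthread_t tid[3];

        for (int i = 0; i < 3; i++)      /* one thread per work unit */
            pthread_create(&tid[i], NULL, inspect, &pkts[i]);
        for (int i = 0; i < 3; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Each packet gets its own thread of control, so the TCP path and the UDP path can take wildly different branches without slowing each other down.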
35 00:02:59,049 --> 00:03:01,720 So for instance, whether the packet is TCP versus UDP 36 00:03:01,720 --> 00:03:04,921 is a big choice right at the beginning. 37 00:03:04,921 --> 00:03:10,856 And then, what port is it on? Is it SSH, is it HTTP, is it web traffic? 38 00:03:10,856 --> 00:03:16,478 You know, you're going to do lots of different processing in your firewall 39 00:03:16,478 --> 00:03:21,944 based on that decision tree. So typically the most efficient way to go 40 00:03:21,944 --> 00:03:27,801 out and attack this is to just have a job or a process or possibly a 41 00:03:27,801 --> 00:03:35,766 thread per data element that comes in. Another reason you might want to do this 42 00:03:35,766 --> 00:03:40,972 is that there are examples of applications where having parallel threads 43 00:03:40,972 --> 00:03:46,189 will actually solve the problem faster. So, believe it or not, take the traditional 44 00:03:46,189 --> 00:03:50,080 traveling salesman problem. If you throw threads at that, 45 00:03:50,080 --> 00:03:54,804 you can actually get super-linear speed-ups, because you can 46 00:03:54,804 --> 00:03:59,004 have other threads basically pruning off the search tree faster. 47 00:03:59,004 --> 00:04:03,378 The threads can share data with each other, and when another 48 00:04:03,378 --> 00:04:07,519 thread comes to a certain point, it doesn't have to recompute a certain 49 00:04:07,519 --> 00:04:12,186 location. It can just use the previous result of another thread that already got 50 00:04:12,186 --> 00:04:16,268 to that point and know that it doesn't have to go down a certain path. 51 00:04:16,268 --> 00:04:20,526 So this is kind of like dynamic programming, if you will, but for some 52 00:04:20,526 --> 00:04:24,575 sort of large search. That's one very plain 53 00:04:24,575 --> 00:04:28,704 example of having threads get you super-linear speed-up. 54 00:04:28,938 --> 00:04:35,171 But you can also just use threads to go after certain types of parallelism that 55 00:04:35,171 --> 00:04:41,248 are hard to get at with data-level parallelism. Also, you can actually hide 56 00:04:41,248 --> 00:04:46,780 latencies by using threads. So as an example of hiding latency using threads, 57 00:04:46,780 --> 00:04:52,780 let's say you have a program that you know typically misses in your cache, 58 00:04:53,320 --> 00:04:59,275 but you have parallelism in this program. One way to attack it is to cut up the 59 00:04:59,275 --> 00:05:05,081 problem and have a thread per sub-problem, and when one of the threads 60 00:05:05,081 --> 00:05:08,700 blocks on a cache miss, switch to another thread. 61 00:05:09,180 --> 00:05:13,580 So while you're waiting for memory to respond, you can do some useful work. 62 00:05:16,520 --> 00:05:22,425 Okay, so let's look at this from a pipeline perspective and look at ways to 63 00:05:22,425 --> 00:05:25,755 recover cycles. So here we have some loads. 64 00:05:25,755 --> 00:05:32,177 A load, then a load that's dependent on 65 00:05:32,177 --> 00:05:36,854 the first load, and an add that's dependent on the second load. 66 00:05:36,854 --> 00:05:41,135 And then finally, a store which is dependent on the add. 67 00:05:41,135 --> 00:05:47,002 So the downside of doing this is you've got all this dead time. 68 00:05:47,002 --> 00:05:51,680 All this purple time here on the slide is dead time on the processor.
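
[Editor's sketch] Stepping back briefly before the pipeline walkthrough continues: here is a minimal sketch of the shared-bound pruning behind those super-linear traveling-salesman speed-ups. The bounds, costs, and the search stub are hypothetical; only the sharing pattern comes from the lecture.

    /* Worker threads explore disjoint parts of a search tree and
       publish the best tour cost seen so far, so every other thread
       can prune subtrees that cannot beat it. Compile with -pthread. */
    #include <limits.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    static long best = LONG_MAX;       /* best tour cost found by any thread */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Publish a completed tour's cost if it beats the global best. */
    static void publish(long cost)
    {
        pthread_mutex_lock(&lock);
        if (cost < best)
            best = cost;
        pthread_mutex_unlock(&lock);
    }

    /* A subtree is worth exploring only if its lower bound can still
       beat the best tour some other thread has already found. */
    static int worth_exploring(long lower_bound)
    {
        pthread_mutex_lock(&lock);
        long b = best;
        pthread_mutex_unlock(&lock);
        return lower_bound < b;
    }

    /* Stand-in for a real branch-and-bound search over one subtree. */
    static void *search_subtree(void *arg)
    {
        long bound = (long)(intptr_t)arg;   /* hypothetical lower bound */
        if (worth_exploring(bound))
            publish(bound + 42);            /* pretend we found a tour  */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, search_subtree,
                           (void *)(intptr_t)(i * 100));
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("best tour cost: %ld\n", best);
        return 0;
    }

As soon as one thread publishes a cheap tour, every other thread's worth_exploring test starts cutting off subtrees it would otherwise have searched, which is where the super-linear behavior comes from.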
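[Editor's sketch] And a sketch of the latency-hiding idea, again assuming POSIX threads: the array size and the two-way split are arbitrary, and the pay-off depends on hardware that can overlap one thread's cache misses with another thread's work, which is exactly where the lecture is headed.

    /* Cut the problem in two and give each half its own thread.
       When one thread stalls on a cache miss, a multithreaded core
       can issue instructions from the other, so useful work covers
       the memory latency. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define N (1L << 22)   /* large enough that walks miss in the cache */

    static long data[N];

    struct half { long lo, hi, sum; };

    static void *sum_half(void *arg)
    {
        struct half *h = arg;
        long s = 0;
        for (long i = h->lo; i < h->hi; i++)
            s += data[i];
        h->sum = s;
        return NULL;
    }

    int main(void)
    {
        struct half halves[2] = { {0, N / 2, 0}, {N / 2, N, 0} };
        pthread_t t[2];

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, sum_half, &halves[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        printf("sum = %ld\n", halves[0].sum + halves[1].sum);
        return 0;
    }

Now, back to the pipeline.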
69 00:05:54,040 --> 00:05:58,047 We could throw an out-of-order superscalar at this. 70 00:05:58,047 --> 00:06:08,986 It's not going to do any better. Well, yes? So you're right: 71 00:06:08,986 --> 00:06:13,723 if we're doing bypassing, we can pull these one cycle earlier. 72 00:06:13,723 --> 00:06:21,091 But we still have a lot of dead time. And if you were to have an out-of-order 73 00:06:21,091 --> 00:06:25,093 superscalar, out-of-order would not actually help you 74 00:06:25,093 --> 00:06:31,022 out in this case either, because there is an actual dependency chain through all these 75 00:06:31,022 --> 00:06:36,940 instructions. Hmm, now that's not great. 76 00:06:38,080 --> 00:06:46,117 So can we come up with some ideas to cope 77 00:06:46,117 --> 00:06:48,852 with this? So one thing we said is we can add 78 00:06:48,852 --> 00:06:53,174 bypassing to decrease the time here. But an out-of-order superscalar is 79 00:06:53,174 --> 00:06:58,314 not going to make this go faster, so that technique doesn't work. What other 80 00:06:58,314 --> 00:07:00,948 techniques have we talked about? Vector processors? 81 00:07:00,948 --> 00:07:04,342 Well, there are no vectors of data here. We could go wide. 82 00:07:04,342 --> 00:07:08,323 Well, that doesn't help. We're still going to be executing only one 83 00:07:08,323 --> 00:07:10,196 instruction per cycle. 84 00:07:10,196 --> 00:07:12,888 We can try VLIW. That doesn't help either. 85 00:07:12,888 --> 00:07:17,161 If an out-of-order superscalar can't do it, VLIW probably can't do it. 86 00:07:17,161 --> 00:07:21,317 So we have all these dead cycles, and we want to try to recover some of these 87 00:07:21,317 --> 00:07:22,020 dead cycles.
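
[Editor's sketch] To see why no amount of single-thread cleverness helps, here is the dependency chain from the slide written out as C. The names are made up; what matters is that each operation needs the previous one's result.

    /* load -> dependent load -> dependent add -> dependent store:
       a serial chain, so out-of-order issue, wider superscalar, and
       VLIW all have nothing independent to run during the load-miss
       latency. A multithreaded pipeline could instead fill those
       dead cycles with another thread's instructions. */
    long chain(long **head, long *out)
    {
        long *p = *head;   /* load 1                              */
        long  v = *p;      /* load 2: address depends on load 1   */
        long  s = v + 1;   /* add: depends on load 2's value      */
        *out = s;          /* store: depends on the add           */
        return s;
    }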