1 00:00:03,420 --> 00:00:08,195 Okay, so this brings us to the end of interconnection networks. And what I 2 00:00:08,195 --> 00:00:15,070 wanted to talk about was Deadlock. So let's assume that you have three nodes 3 00:00:15,070 --> 00:00:22,252 labeled one, two, and three here. It's on a 1D, or excuse me, yeah, oh 1D tourist, 4 00:00:22,252 --> 00:00:31,016 that is unidirectional. So links only flow in the clockwise direction here. Now, all 5 00:00:31,016 --> 00:00:37,658 of a sudden, you have node one wants to send to node three, node two wants to send 6 00:00:37,658 --> 00:00:44,774 to node one, and node three wants to send to, to two. And we're going to look at a 7 00:00:44,774 --> 00:00:49,283 wormhole routing network. So the wormhole routing network you start to send the 8 00:00:49,283 --> 00:00:54,532 header words and has not, potentially not even been injected into the network at 9 00:00:54,532 --> 00:00:59,041 this point. Let's say they all start at exactly the same time. So one is going to 10 00:00:59,041 --> 00:01:03,647 start sending to three. Two's going to start sending to one. Three's going to 11 00:01:04,034 --> 00:01:08,616 start sending to two. And if you'll notice, is each set of these has one 12 00:01:08,616 --> 00:01:13,651 overlapping link with the previous one. So there's going to be some contention on 13 00:01:13,651 --> 00:01:18,492 that link. So very quickly, you can see that. If we were to all, if we were to 14 00:01:18,492 --> 00:01:23,397 launch these three messages exactly at same time, and the message length was 15 00:01:23,397 --> 00:01:30,644 long. None of them would actually get to the receivers. One would use this link, or 16 00:01:30,644 --> 00:01:35,220 one sending to three would use this link, two sending to one would use this link, 17 00:01:35,220 --> 00:01:39,683 three sending to two would use this wraparound link. But then all of a sudden, 18 00:01:39,683 --> 00:01:44,259 they would try to acquire the next link but it's already been reserved. But they 19 00:01:44,259 --> 00:01:49,064 can't give up the link cuz they've already acquired the link, so all of a sudden you 20 00:01:49,064 --> 00:01:56,495 have a deadlock. So these numbers get pretty tricky. and one way that we go 21 00:01:56,495 --> 00:02:01,843 about trying to see if our routing protocol is deadlock free or actually can 22 00:02:01,843 --> 00:02:07,398 deadlock is we start to do something called Waits-for-analysis or waits-for and 23 00:02:07,398 --> 00:02:13,902 Holds analysis. And waits for, and Waits-for and Holds analysis, the basic 24 00:02:13,902 --> 00:02:21,266 idea is, so this is looking at our example that we saw before where we have three 25 00:02:21,266 --> 00:02:28,539 routers. What's happening here is we're injecting into A and this A is acquiring 26 00:02:28,539 --> 00:02:37,713 this link or this here, and then it needs to acquire the next one. So let's draw 27 00:02:37,713 --> 00:02:51,194 this as a diagram. So A holds TQ2, so A holds TQ2. This one here, this inbound Q 28 00:02:51,194 --> 00:02:59,029 here. And when we draw Waits-for and Hold analysis diagrams. When we are waiting on 29 00:02:59,029 --> 00:03:05,140 something we draw a edge from the thing that is the, the actor to the resource 30 00:03:05,140 --> 00:03:10,874 that we're waiting on, and the holds actually goes in the opposite direction. 31 00:03:10,874 --> 00:03:19,124 So if a holds TQ2 here we actually draw the edge the other direction. all of a 32 00:03:19,124 --> 00:03:27,838 sudden, A is holding TQ2 two, and it's waiting to acquire TQ3, this one here. Cuz 33 00:03:27,838 --> 00:03:38,806 it needs to transit this link to deliver it to TQ3. But at the same time, B has 34 00:03:38,806 --> 00:03:50,614 acquired TQ3, so it's holding that and it's waiting for TQ1. But C currently 35 00:03:51,080 --> 00:04:00,791 holding TQ1 and its waiting on TQ2. This is how we do deadlock analysis from a 36 00:04:01,207 --> 00:04:06,963 proof perspective. We actually write these dependency graphs or Waits-for and Hods 37 00:04:06,963 --> 00:04:12,025 graphs and if you find a cycle in your protocol, your protocol can have a 38 00:04:12,025 --> 00:04:17,781 deadlock. It need a couple of some way to color link your, color, color edge in this 39 00:04:17,781 --> 00:04:22,539 graph or you need to really design your protocol. But this is the sort of 40 00:04:22,539 --> 00:04:27,128 canonical way that you can do this and if you start to look at different routing 41 00:04:27,128 --> 00:04:31,831 protocols some of them could possibly deadlock and some of'em can, you can prove 42 00:04:31,831 --> 00:04:36,420 statically to all possible weights for of whole graphs you can come up with will 43 00:04:36,420 --> 00:04:41,696 never have cycles. So I, I think I mentioned last time, if you look at 44 00:04:41,696 --> 00:04:47,222 dimension order routing with one more routing. All possible inputs and outputs 45 00:04:47,435 --> 00:04:53,483 on that you never actually going to have a deadlock, you never call for cycle in the 46 00:04:53,483 --> 00:04:59,317 graph. And the reason for this is the only ever quire X routers and then Y routers, 47 00:04:59,317 --> 00:05:04,866 will say if you have the dimension that you acquire X and quire Y. But you never 48 00:05:04,866 --> 00:05:10,630 going into have effectively something that is holding Y trying to acquire something 49 00:05:10,630 --> 00:05:15,041 on the X axis. So, you never going to have a, a cycle in that Waits-for and Holds 50 00:05:15,325 --> 00:05:24,956 whole analysis graph. I just briefly wanted to some up here that Deadlock is 51 00:05:24,956 --> 00:05:32,015 actually not an enemy. You can try to use deadlock, to your benefit or, or just live 52 00:05:32,015 --> 00:05:36,762 with deadlock. Just cuz your protocol has deadlock does not mean you can't try to 53 00:05:36,762 --> 00:05:42,842 recover from it. So what I mean by this is deadlock avoidance can be very expensive. 54 00:05:42,842 --> 00:05:47,769 So trying to draw all possible graphs here, you should think about, you know, 55 00:05:47,769 --> 00:05:52,763 where your deadlocks happen in your systems, and plan them out and make sure 56 00:05:52,763 --> 00:05:58,289 they don't happen often. But some of the things you need to do, to try and cut 57 00:05:58,289 --> 00:06:02,827 edges here, might be too restrictive. So for instance, dimensional routing may only 58 00:06:02,827 --> 00:06:07,141 route in the X dimension and then in the Y dimension, could be very expensive 59 00:06:07,141 --> 00:06:12,071 relative to an adaptive routing scheme, or you try to route around congestion in your 60 00:06:12,071 --> 00:06:17,417 network. So this deadlock avoidance, you know, you'll, you'll design the protocol 61 00:06:17,417 --> 00:06:22,227 to never deadlock, this could be quite expensive. So an alternative, and this is 62 00:06:22,227 --> 00:06:27,161 something you should not necessarily be afraid of, but you have to be careful of 63 00:06:27,161 --> 00:06:32,204 this is that you find the deadlock and then you try to recover from it. So what I 64 00:06:32,204 --> 00:06:37,853 mean by this is on-chip network or a, a, multi-chip network you can actually have a 65 00:06:37,853 --> 00:06:43,010 deadlock occur, notice that a deadlock occurred. And either try to roll back the 66 00:06:43,010 --> 00:06:47,303 state and somehow jitter the state so the deadlock won't happen a second time. Or 67 00:06:47,303 --> 00:06:52,211 many times these deadlocks happen because your dependent on one particular buffer in 68 00:06:52,211 --> 00:06:56,483 the system and two people trying to acquire that buffer. So you can try to 69 00:06:56,483 --> 00:07:01,160 virtualize that buffer. Effectively saving the state of that buffer in memory and 70 00:07:01,160 --> 00:07:05,780 adding more, more states. So you, a lot of these protocols, if you were to add more 71 00:07:05,780 --> 00:07:09,883 FIFO entries into the system, it would actually break the deadlock. So, sort of 72 00:07:09,883 --> 00:07:14,360 going back into this picture here, if all of a sudden if I added another, let's say, 73 00:07:14,360 --> 00:07:18,730 FIFO here that the inbound traffic went into and then this other traffic tried to 74 00:07:18,730 --> 00:07:23,047 bypass it, it would actually cut one of these edges, and all of a sudden you would 75 00:07:23,047 --> 00:07:28,066 not be having a deadlock. So you can think about trying to have a deadlock recovery 76 00:07:28,066 --> 00:07:33,253 mechanism that's an example of it. So a good example this is actually in the raw 77 00:07:33,253 --> 00:07:38,252 micro pro cessor from an IT where we I, you can actually have that much on the 78 00:07:38,252 --> 00:07:43,377 on-chip network. Its dimensional routed, but you can still have message depended 79 00:07:43,377 --> 00:07:47,001 deadlocks. So what I mean by that is while the network itself is not going to 80 00:07:47,563 --> 00:07:52,438 deadlock, the traffic flows you have on top of it that from since your memory 81 00:07:52,438 --> 00:07:57,806 coherence protocol on top of that could be potentially deadlock. So how do you 82 00:07:57,806 --> 00:08:02,073 recover from that? Well you can actually have a counter, a timer which goes off 83 00:08:02,073 --> 00:08:06,396 when you determine that the network has not moved for a 1000 cycles. So all the 84 00:08:06,396 --> 00:08:10,992 sudden you're running along and no traffic moves on your network. Well it's a pretty 85 00:08:10,992 --> 00:08:15,041 good indicator that you have a deadlock condition, cuz something should be 86 00:08:15,041 --> 00:08:19,555 flowing. There should be some forward progress guarantees. And if you notice 87 00:08:19,555 --> 00:08:23,903 that, an interrupt will go off. So this timer will go off saying nothing has moved 88 00:08:23,903 --> 00:08:28,036 on my network for 1,000 cycles, this is probably a deadlock. So that, you could 89 00:08:28,036 --> 00:08:32,545 take an interrupt and then software can go look at all this data in the network and 90 00:08:32,545 --> 00:08:38,239 introduce more buffering into the network. So, effectively through software try to 91 00:08:38,239 --> 00:08:44,083 virtualize a particular FIFO entry and that can break the deadlock. So there are 92 00:08:44,083 --> 00:08:49,457 ways to recover from deadlocks. If your network is for instance used for message 93 00:08:49,457 --> 00:08:54,831 passing network, you can just use memory to go do that. So actually on the raw 94 00:08:54,831 --> 00:09:00,204 processor, our memory coherence protocol, we used deadlock avoidance and then our 95 00:09:00,204 --> 00:09:05,310 message passing protocol, our message passing networks use deadlock recovery. 96 00:09:05,470 --> 00:09:09,475 later on in, in Tylera actually we have something depending on which memory 97 00:09:09,475 --> 00:09:13,800 network we have mixtures of these two. And you might say that sounds scary, but if 98 00:09:13,800 --> 00:09:18,126 you walk through the proofs and are very careful about it and you are guaranteed 99 00:09:18,126 --> 00:09:22,558 that you don't need anymore resources to go break the deadlock you may be okay. But 100 00:09:22,558 --> 00:09:26,670 yeah it's, its just some say it's playing with fire. You've got to be a little 101 00:09:26,670 --> 00:09:30,889 careful of these, these sorts of solutions when trying to play with deadlock 102 00:09:30,889 --> 00:09:33,942 recovery. But if you're guaranteed that you're not goi ng to need any more 103 00:09:33,942 --> 00:09:39,558 resources you can just resolve it right then. And oh, so the reason you would want 104 00:09:39,558 --> 00:09:45,090 to, to use the lock recovery is it does not restrict your protocol. And by doing 105 00:09:45,090 --> 00:09:49,061 that, you can have the common case of your protocol be very, very fast. And, you 106 00:09:49,061 --> 00:09:53,083 could make it so that the deadlock almost never happens. Or never, in practical 107 00:09:53,083 --> 00:09:57,105 concerns ever happens. Then, when the deadlock does happen oh, you take a little 108 00:09:57,105 --> 00:10:01,334 bit of performance bump, or a performance hit there. Because you have to virtualize 109 00:10:01,334 --> 00:10:04,480 it in software. But the probability of it happening is so low.