In-order fetch. So, we have an in-order frontend, and we have out-of-order issue, writeback, and commit. What this is going to try to do is fix some of the problems we saw earlier, where we had, let's say, this add waiting to issue because our issue was in order. That instruction could issue: all of its inputs are ready. R11 was written right there, or something like that, so R11 is ready. It's ready to issue, but because we have in-order issue, by definition we can't issue out of order. So now let's look at a machine where we can issue out of order, where we move instructions to the execute units, and do our register fetch, out of order. To do that, we have to add another data structure, which we're going to call the issue queue. This is going to be something like an associative buffer, but it's not going to be a FIFO: we're going to put instructions into it in order, because our fetch, or frontend, is in order, but we're going to pull them out of it out of order. We have to be a little careful, though, because when we go to pull out of that data structure, we have to make sure that all of the appropriate registers are ready. So all of our dependencies are ready, or at least somewhere in the bypass, so we can pick the value off the bypass by the time the instruction actually needs it. Let's take a look here. We still have an architectural register file. We got rid of all that complex reorder buffer stuff in this example, so we don't have a reorder buffer, we don't have a store buffer, and we have out-of-order commit. We're going to have the same problems we had before with precise exceptions in this pipe, but it's going to give us an easy introduction to understanding what the issue queue looks like.
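To make "insert in order, pull out out of order" concrete, here is a minimal Python sketch of the issue queue's behavior. All the names (`IssueQueue`, the dictionary fields, the register names) are made up for illustration; the real structure is flip-flops and combinational select logic, not a Python list. The select here scans oldest-first and picks the first instruction whose sources are all ready, with a single issue per cycle.

```python
class IssueQueue:
    """Toy issue queue: in-order insert, out-of-order select."""

    def __init__(self):
        self.entries = []  # kept in program order

    def insert(self, instr):
        # The frontend is in order, so entries arrive in program order.
        self.entries.append(instr)

    def select(self, ready_regs):
        # Scan oldest-first; pick the first instruction whose source
        # registers are all ready (single issue per cycle).
        for instr in self.entries:
            if all(src in ready_regs for src in instr["srcs"]):
                self.entries.remove(instr)
                return instr
        return None  # nothing ready this cycle

iq = IssueQueue()
iq.insert({"op": "add", "srcs": ["r5", "r3"], "dst": "r6"})  # waits on r5
iq.insert({"op": "sub", "srcs": ["r7", "r8"], "dst": "r9"})  # independent

# r5 is still in flight, so the younger sub issues ahead of the add.
picked = iq.select(ready_regs={"r3", "r7", "r8"})
```

With in-order issue the sub would be stuck behind the waiting add; here it goes around it, which is exactly the case the lecture's earlier pipeline could not handle.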
So, the issue queue gets written when instructions enter it, at the decode stage. The queue itself is a register structure, so it's flip-flops plus a bunch of logic, and you can go from this stage to this stage in the pipe without any pipeline registers in between. It gets read there, and things update into it: when values actually end up in the architectural register file, we're going to mark bits in the issue queue saying, oh, that register you were waiting on is now ready. And if, let's say, you're dependent on two registers and both of them become ready, you can issue, and we can issue out of order. Okay. So, I have both read and write here in the issue stage. Why do I have that? Well, when you actually go to issue an instruction, you want to mark that you did issue it. Sometimes people build this as actually removing the instruction from the issue queue. In this basic case, we're going to leave it in the issue queue until the end of the pipe, because we need some way to track the liveness of the instruction. But in the next processor we're basically going to think of it as, I won't say a FIFO, because it's not strictly first-in first-out, but a data structure where you can put stuff into and then remove stuff out of: a multi-entry buffer. Okay, let's take a look inside the issue queue. This is assuming something like MIPS. In our issue queue, first of all, we're going to have the opcode, because we're going to have multiple instructions ganging up in here. It's possible there could be, in this case, one, two, three, four, five instructions just sitting in this data structure. It's effectively a buffer, a bit of slack, in the front of your pipe.
You might need the immediate values, because those are part of the instruction, and then we're going to have fields for the three different registers. What we're going to put in here are the register identifiers. Let's look at the sources first, because that makes a little more sense. The V bit says that this instruction needs that source. As we said, some instructions, like immediate instructions, don't read both source operands, so for an immediate only one of these valid bits is going to be set and the other is going to be zero. The P bit is pending. What that means is that somewhere later in the pipe there is an in-flight instruction that writes to that register, so we track with this bit that the value is still pending. Now, the reason we have to keep the pending information and the validity information separate, and why we need both of them, is that multiple instructions could light up simultaneously as being ready to go. Say you have two instructions that are both dependent only on register five, and register five is not ready because it's the destination of a multiply. When register five gets written, that clears the pending bits, and register five might appear in multiple places, so it might be here and here, or maybe even there. It clears the pending bit for whoever is trying to read register five. But because, let's say, we only have single issue on this processor, you can only pull one thing out at a time. So you pull one thing out, and you need some way to leave the other instruction in the issue queue with all of its operands marked ready. That's how you do that. Okay, so how do we figure out if an instruction is ready?
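The entry format described above can be sketched as a small record: opcode, optional immediate, a destination field with its own valid bit, and two sources each carrying a V (this instruction uses the operand) and P (an in-flight instruction still writes it) bit plus a register identifier. This is a Python sketch with assumed field names, not the actual bit layout of any real machine.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SrcOperand:
    v: bool        # V: this instruction actually uses this source
    p: bool        # P: an in-flight instruction still writes this register
    reg: int = 0   # architectural register identifier

@dataclass
class IQEntry:
    opcode: str
    imm: Optional[int]   # immediate value, if the instruction has one
    dst_v: bool          # does this instruction write a destination?
    dst_reg: int
    src1: SrcOperand
    src2: SrcOperand

# Hypothetical "addi r6, r5, 4": an immediate instruction reads only one
# source, so src2's V bit is clear; src1 is still waiting on an in-flight
# write to r5, so its P bit is set.
entry = IQEntry("addi", 4, True, 6,
                SrcOperand(v=True, p=True, reg=5),
                SrcOperand(v=False, p=False))
```

Keeping V and P as separate bits is the point made above: V is fixed at decode time, while P changes as producers complete.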
Well, for each operand we check whether the instruction actually uses that particular operand identifier, and whether it's pending; and we have to check both inputs. Then we also have to make sure there's no structural hazard somewhere else in the pipe, for example when we're trying to schedule write ports, and we're going to use the scoreboard to do that. The other thing is, if we want high performance, we probably don't want to have to wait for values to get to the end of the pipe. So in reality this is going to get even more complicated, because it's going to be these bits plus information coming from the scoreboard about when a particular register is ready. Let's talk about the destination here. This just tracks whether an instruction writes a destination or not. Like I said, in these basic pipelines we're going to leave instructions in the issue queue until they get to the end of the pipe, and when an instruction gets to the end of the pipe, this is where we check which other locations we need to clear out. So if there's an instruction sitting here, which has the valid bit set and register five as its destination, when that instruction commits we're going to clear it out of the issue queue and say, well, it wrote register five: let's scan through all these other entries and look for ones that say register five. If they say register five, we flip the pending bit from pending to not pending, and we remove the committing instruction from the issue queue. Okay, so, centralized versus distributed issue queues. In a logical sense, in a perfect world, it's probably nice to have one big centralized issue queue: you can scan over all the instructions.
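The ready check and the clear-on-writeback scan just described can be written out directly. A source blocks issue only if it is both used (V set) and still pending (P set); when a destination register is written, we broadcast its identifier to every entry and clear matching pending bits, possibly waking several instructions at once. This is a minimal sketch with assumed names, ignoring the scoreboard and structural hazard checks mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Src:
    v: bool    # uses this operand
    p: bool    # an in-flight instruction still writes it
    reg: int = 0

@dataclass
class Entry:
    op: str
    src1: Src
    src2: Src

def src_ready(s):
    # A source blocks issue only if it is used (V) and still pending (P).
    return (not s.v) or (not s.p)

def ready(e):
    return src_ready(e.src1) and src_ready(e.src2)

def writeback(entries, written_reg):
    # Broadcast the completed destination register: every entry waiting
    # on it clears that source's pending bit.
    for e in entries:
        for s in (e.src1, e.src2):
            if s.v and s.p and s.reg == written_reg:
                s.p = False

# Two instructions both waiting on r5 (the multiply's destination):
a = Entry("add", Src(True, True, 5), Src(True, False, 3))
b = Entry("sub", Src(True, True, 5), Src(False, False))
writeback([a, b], 5)   # r5 completes: both pending bits clear at once
# Both are now ready; with single issue we pull one out and the other
# stays in the queue with its operands marked ready.
```

This is why P is a stored bit rather than a transient signal: the instruction that loses the single issue slot keeps its cleared pending bits and can issue on a later cycle.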
You don't have to look around, but this can sometimes be harder to implement, because you have to put all of your instructions in one location, and sometimes the units don't need to cross-communicate very much. Floating-point units, for example: floating-point registers and the integer register file are different types, and they don't need a whole lot of communication between them. So you could have distributed instruction queues, where you steer, say, the ALU and memory ops to one set of functional units, and have a different instruction queue just for multiplies or something like that. I'm going to put a paper on the website for you all to read: it's Tomasulo's algorithm. It's a very famous paper, and in it they talk about distributed instruction queues. Strangely enough, the first place in the literature that these instruction queues showed up was around that time, in that paper, and they go straight to the distributed version and skip the centralized version, which I always found a little odd, but lots of people build centralized ones today. One of the reasons, I think, that the Tomasulo people went for the distributed one first is that they were implementing it across multiple discrete chips, so they had an issue queue per chip: a floating-point chip and an integer chip, and they steered the instructions. They didn't want one data structure that crossed two chips. Today we have lots of integration, so we don't have to worry about that as much. Okay, I just wanted to briefly mention, since this is a question that came up last time, scoreboards, and when you have to worry about write-after-write hazards in the pipe. For the things we're talking about today, we didn't really have to worry about that, but next time we're going to have to think about better scoreboards.
Such a scoreboard looks the same as the previous scoreboard, but you might need to keep the functional unit number, or some bits that represent the functional unit, and then those march down the pipe instead of 1s marching down the pipe: you have different numbers, like one, two, three, marching down. But that's only if you have to track write-after-write hazards. Okay, that was a quick aside. Let's look at this in-order-fetch, out-of-order-issue, out-of-order-writeback, out-of-order-commit processor in a pipeline diagram. I'm going to show a new little thing in our pipeline diagram here: a lowercase i, which means the instruction enters the issue queue. Then, when it exits the issue queue, it starts going down into the issue stage of the pipe. Here I'm actually showing what is in the issue queue. I have three issue queue slots, because this is a relatively simplistic pipe. For this first instruction, which goes to execute, register two and register three are already ready, so we don't have to worry about it. An example of something more interesting: this instruction will enter the issue queue, but register twelve does not become ready until late. Once it becomes ready, we can start pulling things out of the issue queue and issue that instruction. Let's see if we can show that here: we're waiting for register twelve, for this add. This is interesting: that value actually becomes ready early.
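The richer scoreboard mentioned above can be sketched as a shift register per architectural register where, instead of a 1 marching down the pipe, the identifier of the producing functional unit marches down. The unit numbering, pipeline depth, and register count here are all assumptions for illustration.

```python
# Scoreboard sketch: one shift pipeline per architectural register.
# Instead of a 1 marching down the pipe, we march down the id of the
# functional unit (assume 1 = ALU, 2 = multiplier) that will write it;
# 0 means no pending write at that stage.
DEPTH = 4  # assumed number of stages between issue and writeback

scoreboard = {r: [0] * DEPTH for r in range(8)}

def issue_write(reg, fu):
    # Record a new in-flight write to `reg` by functional unit `fu`.
    scoreboard[reg][0] = fu

def tick():
    # One cycle passes: every pending write moves one stage closer to
    # writeback; the oldest entry falls off the end.
    for pipe in scoreboard.values():
        pipe.pop()
        pipe.insert(0, 0)

issue_write(5, 2)   # a multiply will write r5
tick()
issue_write(5, 1)   # a later ALU op also writes r5: a WAW hazard
# The scoreboard now distinguishes the two in-flight writers of r5,
# so the machine can keep their writebacks in the right order.
```

With plain 1s you could only see *that* r5 has pending writes; with unit numbers you can see *which* write is younger, which is what tracking write-after-write hazards requires.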
But what's actually happening here is that the value becomes ready at the end of this execute stage, and the instruction can't go to issue that cycle because something else is issuing. So it has to be pushed out, since we're only doing a single issue per cycle. If we had multiple issue, you could imagine something more interesting happening here. This is actually a really complicated case, because if we pull this back, let's say to here, we get a writeback hazard: the W would conflict with the other W. And if we try to issue here instead, we get a structural hazard on issue. So it actually gets pushed way out there, which is kind of a bummer. Sometimes you just lose. Okay, so here's an interesting case. Let's assume that all the instructions are preloaded into the issue queue. Showing this, we have fetch, decode, issue, issue, issue, and this thing is just sitting in the issue queue, and then we start looking: what happens? Does the performance get better than in the previous example? Interestingly enough, even if you issue everything into the pipe early and fill up your issue queue, pulling out of the issue queue can still be a limiter, and the number of ALUs can still be a limiter. There are other structural hazards. The performance of this code on this pipeline does not actually get any better, which is sort of interesting. You'd think, oh, well, I don't have to wait for things to dribble into my instruction queue, I can just preload it; but the performance really doesn't end up being better. So this motivates going to out-of-order execution mixed with duplication of structures: maybe we can issue two instructions at the same time, and maybe we can have two ALUs. So this motivates having a superscalar.