Is microprogramming ever used in RISC processors?

11 Scalar and superscalar architectures


11.1 Scalar architectures and instruction pipelining

In a simple CISC processor, an instruction passes sequentially through the necessary processing stages, e.g. 1st cycle: instruction fetch (opcode fetch), 2nd cycle: decode, 3rd cycle: operand fetch, 4th cycle: execute, 5th cycle: write back. The situation is shown in Fig. 11.1. Each processing stage is busy for one cycle and then idle for four cycles, so the utilization is 20%. A microprogrammed instruction may spend several cycles in the execution stage, in which case the utilization of the processing stages is even worse.

Figure 11.1: A simple instruction is processed sequentially on a simple processor. Execution needs at least as many cycles as there are processing stages; the processing stages are poorly utilized.

The basic idea of instruction pipelining (assembly-line processing) is to pass instructions through the processing stages overlapped and in parallel. When an instruction moves on to the second processing stage, the following instruction can already enter the first stage, and so on. With each cycle,

1. the last instruction is removed from the pipeline after its results have been written back,
2. every other instruction is moved to the next processing stage,
3. a new instruction is taken into the first stage of the pipeline.

The execution of an instruction takes as many cycles as the pipeline has stages; this time is called the latency. The latency is only noticeable while the pipeline is being filled. When the pipeline is full and working without disturbance, the utilization of the processing stages is 100% and one instruction is completed with each cycle. This would achieve the goal of scalarity. The situation can be compared with an assembly line, e.g. in automobile production: it takes several hours to manufacture a car, but a finished car rolls off the line every few minutes. Fig. 11.2 shows the same processing stages, now operated as an ideal five-stage pipeline.

Figure 11.2: An ideal 5-stage pipeline. After the pipeline has been filled (latency), one instruction per cycle is completed.

In practice, 100% utilization cannot be achieved because there are various pipeline obstacles (pipeline interlocks). A simple pipeline obstacle is a main memory access, which can only be carried out in one cycle if the L1 cache runs at full processor clock and the datum is available there. Every L1 cache miss leads to waiting times; the pipeline then has to wait and runs empty for a few cycles. Fetching the instructions is less critical, because the code is always loaded into the L1 code cache block by block ahead of time. It is already clear here that caching is an integral concept of RISC processors.

Resource conflicts are another obstacle to the pipeline. They arise because different processing stages in the pipeline want to access the same piece of hardware at the same time.
Resource conflicts usually only arise from structural bottlenecks, e.g. a common bus for data and code.
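
The cycle counts behind the 20% and the (nearly) 100% utilization figures above can be checked with a few lines of Python. This is a minimal sketch under idealized assumptions (every stage takes exactly one cycle, no interlocks); the function names are chosen here for illustration and do not come from the text.

```python
def sequential_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles needed when each instruction runs through all stages alone."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once (latency), then one
    instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def stage_utilization(n_instructions: int, n_stages: int, cycles: int) -> float:
    """Busy stage-cycles divided by all available stage-cycles."""
    busy = n_instructions * n_stages      # each instruction occupies each stage once
    available = n_stages * cycles
    return busy / available


if __name__ == "__main__":
    n, k = 1000, 5
    seq = sequential_cycles(n, k)
    pipe = pipelined_cycles(n, k)
    print("sequential:", seq, "cycles, utilization", stage_utilization(n, k, seq))
    print("pipelined: ", pipe, "cycles, utilization", stage_utilization(n, k, pipe))
```

For long instruction sequences the fill time becomes negligible, which is why the latency "is only noticeable while the pipeline is being filled".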

Another obstacle is data dependencies (data hazards). Data dependencies arise when instructions are linked via the contents of registers. If, for example, a machine instruction reads a register that the previous machine instruction writes, it must wait until the register has been written. This common situation is called a read-after-write hazard (RAW hazard). Let us consider the following example:

    ; Example of a read-after-write hazard
    ADD R1, R1, R2   ; R1 = R1 + R2
    SUB R3, R3, R1   ; R3 = R3 - R1
    XOR R2, R2, R4   ; R2 = R2 xor R4

If this instruction sequence is loaded into our five-stage sample pipeline, a problem arises in the fifth cycle: in the same cycle in which the ADD instruction writes its result back to R1, the SUB instruction is already being executed and uses the outdated content of register R1, a typical RAW hazard (Fig. 11.3).

Figure 11.3: A problem caused by a read-after-write data dependency: in cycle 5 the ADD instruction writes its result to R1, and in the same cycle the subsequent SUB instruction already reads R1. Without a correction, incorrect results are produced.

A simple solution is to stall the first four stages of the pipeline for one cycle. The same effect can be achieved in software by inserting an empty instruction (no operation, NOP) between the ADD and SUB instructions. When SUB is then executed, the result of the ADD instruction is already in the register (Fig. 11.4).

Figure 11.4: By inserting an empty instruction (NOP), the RAW hazard from Fig. 11.3 can be resolved at the expense of an empty cycle.
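
The software variant of this fix can be sketched in Python. The sketch assumes, as in the example above, that a hazard only exists between directly adjacent instructions and that a single NOP is enough; the `Instr` type and the function names are hypothetical and only serve the illustration.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "ADD"
    dest: str               # destination register ("" if none)
    srcs: Tuple[str, ...]   # source registers


NOP = Instr("NOP", "", ())


def has_raw_hazard(producer: Instr, consumer: Instr) -> bool:
    """True if the consumer reads the register the producer writes."""
    return producer.dest != "" and producer.dest in consumer.srcs


def insert_nops(program: List[Instr]) -> List[Instr]:
    """Insert one NOP between adjacent instructions with a RAW hazard.

    Assumption: one idle cycle resolves the hazard, as in the
    five-stage example pipeline of the text.
    """
    result: List[Instr] = []
    for instr in program:
        if result and has_raw_hazard(result[-1], instr):
            result.append(NOP)
        result.append(instr)
    return result


program = [
    Instr("ADD", "R1", ("R1", "R2")),
    Instr("SUB", "R3", ("R3", "R1")),
    Instr("XOR", "R2", ("R2", "R4")),
]
patched = insert_nops(program)

if __name__ == "__main__":
    for instr in patched:
        print(instr.op, instr.dest, ", ".join(instr.srcs))
```

In the example sequence only the pair ADD/SUB is affected; XOR does not read R3 and is left where it is.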

However, both solutions lead to an idle cycle and thus to a loss of performance. To avoid this idle cycle, inventive solutions have been developed: via a register bypass, an operand is transferred simultaneously to the target register and to the inputs of the ALU. Since the operand is a result from the ALU, this is called result forwarding (Fig. 11.5).

Figure 11.5: A register bypass enables load forwarding or result forwarding. At the same time as it is written to the register, the new content is also available at the ALU inputs, which saves one cycle.

A similar problem arises if, in the above code example, a LOAD instruction with destination R1 or R3 stands immediately before the SUB instruction. As a rule, the new value has not yet been loaded into the register when the SUB instruction accesses it. A register bypass can help here as well: the value is transferred from the internal bus not only to the target register but also, at the same time, to the ALU inputs. This is called load forwarding. Forwarding is not supported by all RISC processors because the control logic is very complex.

There is another solution to the present case. An optimizing compiler could recognize that the following XOR instruction is not affected by a data dependency and can be moved forward. In this way, the problem is solved elegantly at the software level without causing an idle cycle (Fig. 11.6). It would have to be checked that the XOR instruction does not write back its result too early, so that the ADD instruction works with the wrong content of R2 (write-after-read hazard); in general, however, this cannot happen, since the operands are already read in the decode phase. One can see very clearly here that with RISC processors, hardware and software form a unit and that an optimizing compiler is very important for the performance of the processor.

Figure 11.6: By rearranging the instructions, an optimizing compiler can, under certain circumstances, resolve the data dependency without loss of time.

Jump instructions are a big problem for pipelines. When a jump is executed, processing deviates from the sequential order and continues at a completely different point (the jump destination). The jump does not take effect until the last stage (write back), when the new value is written into the program counter. At this point, however, several instructions in the pipeline have already been partially processed. The pipeline now has to be completely reloaded and the latency is incurred again, i.e. several idle cycles (Fig. 11.7). The longer the pipeline, the greater the loss of time.

Figure 11.7: Branching caused by a conditional jump instruction. As an example, it is assumed here that the instruction BREQ (branch if equal) branches to instruction 80. The instructions that have already been preprocessed (here instructions 2 to 5) must be discarded. The pipeline is refilled, and the latency is incurred again.

Unconditional jumps can already be intercepted in the decode phase. Conditional jumps are more difficult, because the decision whether the jump is executed is only made after a condition has been evaluated, i.e. in the execution stage. Various solutions have been developed for this problem area. With the delayed branch technique, the processor always executes the machine instruction that follows the jump instruction (in the so-called delay slot). This is illogical, but it simplifies the processor design. The compiler will try to place an instruction there that must be executed regardless of the jump.
This is usually an instruction that precedes the jump instruction and can be moved. If this does not succeed, an empty instruction (NOP) must be placed in the delay slot.

Another measure is to detect the jump as early as possible. For this purpose, the calculation of the jump destination can be moved forward from the execution stage to the decode stage, as can the evaluation of the jump condition. This reduces the loss of time, but requires some additional hardware.

A branch prediction simply makes an assumption as to whether the jump is taken or not, and execution continues accordingly. One speaks of speculative execution. If the evaluation of the branch condition shows that the assumption was correct, the processor has speculated correctly and continues without losing any time. If the assumption was wrong, the speculation has failed. Then all instructions that follow the jump have to be removed from the pipeline and the pipeline has to be refilled.

Static branch prediction is based on fixed rules. One possible rule is: backward jumps are always taken, forward jumps never. This rule is based on the fact that loops are built with backward jumps and are usually run through several times. Forward jumps serve, for example, to terminate loops or to react to error conditions, both of which occur somewhat less often. Of course these rules are primitive and produce many mispredictions, but they are still better than chance.

Modern processors work with dynamic branch prediction. The jump instructions are observed during the current program run, and statistics on the jumps are kept in special hardware, the branch history table. A branch history table can contain the following components:

1. the address of the jump instruction or of the block in which it is located,
2. a bit that indicates the validity of the entry (valid bit),
3. one or more history bits that hold the recent experience with this jump instruction (jump taken / not taken),
4. the address of the jump destination.

Figure 11.8: A history table with two history bits and four states.

A prediction logic with two history bits and four states is presented in Fig. 11.8. If a jump was not taken twice in a row, we are in state (0,0) on the far left. The prediction is then, of course, that the jump is not taken. If the jump is taken the next time, the logic initially keeps its prediction (state (0,1)). If the jump is then taken again, it changes to state (1,1) because the jump has now been taken twice in a row. If it is not taken the next time, we get into state (1,0), etc. (footnote 1). Unlike a 1-bit history, this type of history table does not only take the last jump into account. Elaborate dynamic branch predictions achieve hit rates of 99%. The address of the jump destination can of course also be taken from the jump instruction; however, it is available earlier in the history table.

For a RISC processor to perform really well, machine code must be generated that is well suited to instruction pipelining. Optimizing compilers are therefore firmly planned in; assembler programming is usually not used.

11.2 Superscalar architectures

Multiple parallel hardware units

By scalarity we mean the ability to execute, in principle, one instruction per cycle. Superscalar architectures achieve throughputs of more than one instruction per cycle. This is only possible through real parallelization that goes beyond instruction pipelining. Superscalar architectures have multiple decoders and multiple execution units that can work in parallel.

As an example, consider the architecture in Fig. 11.9. It has two decoders, two integer units, a floating point unit and a load/store unit. The two decoders limit the performance of the architecture to a maximum of two instructions per cycle. Separate integer registers R0 to R31 and floating point registers F0 to F15 are available. The two decoders store the decoded instructions (micro-operations) in a small buffer memory with two places. From there the instructions are issued to execution units as soon as these are available. Of course, any data dependency must be taken into account. If places become free in the buffer memory, new instructions are decoded and stored in the buffer memory in the same cycle. We assume that prefetching always reads in enough code in advance and passes it on to the decoders. The instructions are completed by the retirement and completion unit.
For STORE instructions, this unit takes the results from the registers and transfers them to the load/store unit. We assume that integer instructions take three cycles to execute after being issued to the integer unit, while floating point instructions take four cycles. The execution time of load and store instructions depends on the content of the L1 cache, but is assumed to be at least four cycles (L1 cache hit). We initially assume that all instructions must be issued in program order (in-order issue). Issuing consecutive instructions in parallel is therefore allowed, but not issuing an instruction that appears in the code after instructions that have not yet been issued. We also assume that there is no bypassing and that, in the case of a data dependency, we wait until the relevant register has been written. We now want to follow the operation of this example architecture using a small, hypothetical section of code.

1 The management of the four states is a finite automaton.
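
Footnote 1 points out that the four prediction states form a finite automaton. A minimal Python sketch of such an automaton for the two-bit prediction logic of Fig. 11.8 is given here. The transitions (0,0) to (0,1), (0,1) to (1,1) and (1,1) to (1,0) and the predictions in the first three of these states are taken from the text; all other transitions, and the prediction "taken" in state (1,0), are assumptions (marked in the code) following the common scheme in which the prediction only flips after two mispredictions in a row. The class and function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

State = Tuple[int, int]

# (state, jump taken?) -> next state
TRANSITIONS: Dict[Tuple[State, bool], State] = {
    ((0, 0), True): (0, 1),    # from the text
    ((0, 0), False): (0, 0),   # assumed
    ((0, 1), True): (1, 1),    # from the text
    ((0, 1), False): (0, 0),   # assumed
    ((1, 1), True): (1, 1),    # assumed
    ((1, 1), False): (1, 0),   # from the text
    ((1, 0), True): (1, 1),    # assumed
    ((1, 0), False): (0, 0),   # assumed
}


class TwoBitPredictor:
    def __init__(self, state: State = (0, 0)) -> None:
        self.state = state

    def predict(self) -> bool:
        """Predict 'taken' in states (1,1) and (1,0); (1,0) is an assumption."""
        return self.state[0] == 1

    def update(self, taken: bool) -> None:
        """Move to the next state once the real outcome is known."""
        self.state = TRANSITIONS[(self.state, taken)]


def count_mispredictions(outcomes: Iterable[bool]) -> int:
    """Run a fresh predictor over a sequence of outcomes of one jump."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # A loop-closing backward jump: taken 9 times, then not taken; run twice.
    loop = ([True] * 9 + [False]) * 2
    print(count_mispredictions(loop), "mispredictions in", len(loop), "jumps")
```

With this automaton, a single loop exit does not destroy the prediction "taken", so the second run of the loop is predicted correctly from its first iteration on.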

Figure 11.9: Example of a superscalar architecture.

    ; Example instruction sequence for the superscalar architecture of Fig. 11.9
    AND  R1, R2, R3   ; R1 = R2 AND R3
    INC  R0           ; increment R0
    SUB  R5, R4, R1   ; R5 = R4 - R1
    ADD  R6, R5, R1   ; R6 = R5 + R1
    COPY R5, 1200h    ; R5 = 1200h
    LOAD R7, [R5]     ; load with register-indirect addressing
    FADD F0, F1, F2   ; F0 = F1 + F2
    FSUB F0, F0, F3   ; F0 = F0 - F3

This instruction sequence is now read in sequentially from the L1 code cache and processed, as shown in Fig. 11.10.

Cycle 1: The AND and INC instructions are decoded and stored in the buffer memory.
Cycle 2: The AND and INC instructions are free of data dependencies and are issued in parallel to integer units 1 and 2; the SUB and ADD instructions are decoded.
Cycles 3, 4: Parallel execution of the AND and INC instructions.
Cycle 5: The SUB instruction has a data dependency (read-after-write) on the AND instruction via R1, but can be issued because the AND instruction has been completed. The ADD instruction cannot be issued because it has to wait for the result in R5 (read-after-write data dependency); COPY is decoded.
Cycles 6, 7: Further execution of SUB.

Figure 11.10: Execution of the example instruction sequence if the processor issues the instructions in program order. "..." stands for a subsequent instruction.

Cycle 8: ADD is issued, LOAD is decoded. The COPY instruction cannot be issued in parallel because of a data dependency (write-after-read) on register R5.
Cycles 9, 10: Further execution of ADD.
Cycle 11: Since register R5 is now free to be written, the COPY instruction is issued; FADD is decoded.
Cycles 12, 13: Further execution of COPY.
Cycle 14: Since the result of the COPY instruction is now in R5, the LOAD instruction can be issued. The FADD instruction uses the floating point registers and is independent of the integer instructions; it is issued in parallel to the floating point unit. The two decoders decode the FSUB instruction and its successor in parallel.
Cycles 15, 16, 17: Further execution of FADD and LOAD.
Cycle 18: FSUB is issued; another instruction is decoded.
Cycles 19, 20, 21: Further execution of FSUB.

A scoreboard can be used to monitor compliance with the data dependencies. For each register it records how many read accesses are still pending and whether a write access is still pending [38]. A new instruction may only be issued if

1. the registers that the instruction reads are no longer reserved for writing (no RAW dependency),
2. the registers that it writes are no longer reserved for reading (no WAR dependency),
3. the registers that it writes are no longer reserved for writing (no WAW dependency).

Out-of-order execution

Overall, despite considerable hardware effort, we have not yet achieved a high degree of parallel processing; the code example of the last section was only finished after 21 cycles. This is due not only to the many data dependencies in the code; execution in program order also acts as a stumbling block. The FADD instruction and the subsequent FSUB instruction are completely independent of the preceding instructions, but cannot be issued to the floating point unit because the processor adheres to the order in the code. Many processors have therefore moved on to changing the order of execution when this is beneficial.

Execution in a changed order (out-of-order execution) can of course only be carried out if, in the end, all results are as if the instructions had been executed in program order. Issue in the original order can be dispensed with as long as the results appear in the correct order and all data dependencies are respected. If a result is available too early, it can be buffered and reintroduced into the process at the right time. That is the task of the retirement and completion unit, which is much more complex here. Let us take the above code as an example. If, with out-of-order execution, the result of the 5th instruction (COPY) were available before that of the 3rd instruction (SUB), it must not be written back, because the SUB instruction would then write back later and an incorrect value would remain in R5 (write-after-write). Here, too, the scoreboard is an important tool.

Let us look again at the execution of the above code section after introducing out-of-order execution into our example architecture. To make this possible at all, we enlarge the buffer memory so that it can hold six decoded instructions.
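
Returning briefly to the scoreboard: its three issue rules can be sketched in Python as follows. This is a simplified model under assumptions of our own (one counter of pending reads and one pending-write flag per register, reservations released only when an instruction completes); the class and method names are hypothetical and not taken from [38].

```python
from collections import defaultdict
from typing import Optional, Sequence


class Scoreboard:
    def __init__(self) -> None:
        self.pending_reads = defaultdict(int)  # register -> outstanding reads
        self.pending_write = set()             # registers with outstanding write

    def can_issue(self, dest: Optional[str], srcs: Sequence[str]) -> bool:
        """Check the three rules for an instruction that has not been issued yet."""
        no_raw = all(r not in self.pending_write for r in srcs)
        no_war = dest is None or self.pending_reads[dest] == 0
        no_waw = dest is None or dest not in self.pending_write
        return no_raw and no_war and no_waw

    def issue(self, dest: Optional[str], srcs: Sequence[str]) -> None:
        """Reserve the source registers for reading and the destination for writing."""
        for r in srcs:
            self.pending_reads[r] += 1
        if dest is not None:
            self.pending_write.add(dest)

    def complete(self, dest: Optional[str], srcs: Sequence[str]) -> None:
        """Release the reservations made by issue()."""
        for r in srcs:
            self.pending_reads[r] -= 1
        if dest is not None:
            self.pending_write.discard(dest)


if __name__ == "__main__":
    sb = Scoreboard()
    sb.issue("R6", ["R5", "R1"])            # ADD R6, R5, R1 is in flight
    print(sb.can_issue("R5", []))           # COPY R5, 1200h: blocked (WAR on R5)
    sb.complete("R6", ["R5", "R1"])         # ADD has finished
    print(sb.can_issue("R5", []))           # COPY may now be issued
    sb.issue("R5", [])
    print(sb.can_issue("R7", ["R5"]))       # LOAD R7, [R5]: blocked (RAW on R5)
```

The usage part replays the situation of cycles 8 to 14 in the in-order example: COPY has to wait for ADD, and LOAD has to wait for COPY.
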
The course of processing is shown in Fig. 11.11. The last instruction of the sequence is now completed in cycle 17.

Cycle 1: The AND and INC instructions are decoded and stored in the buffer memory.
Cycle 2: The AND and INC instructions are free of data dependencies and are issued in parallel to integer units 1 and 2; the SUB and ADD instructions are decoded.
Cycle 3: Parallel execution of the AND and INC instructions; the COPY and LOAD instructions are decoded.
Cycle 4: Completion of the AND and INC instructions; the FADD and FSUB instructions are decoded, the buffer memory is full.
Cycle 5: The SUB instruction is issued because the AND instruction has been completed. The ADD instruction has to wait because of the RAW dependency on the SUB instruction, as do the COPY instruction (WAR dependency on ADD) and the LOAD instruction (RAW dependency on COPY). The FADD instruction is not affected by data dependencies; it is now moved forward and issued. The decoders fill the two vacant places in the buffer memory.
Cycles 6, 7: Further execution of SUB and FADD.
Cycle 8: ADD is issued, FADD is completed and a new instruction is decoded.

Figure 11.11: Execution of the example instruction sequence when the processor issues the instructions out of order.

Cycle 9: FSUB is issued; another instruction is decoded.
Cycle 10: Further execution of ADD and FSUB.
Cycle 11: Since register R5 is now free to be written, the COPY instruction is issued; a new instruction is decoded.
Cycle 12: FSUB is completed.
Cycle 13: COPY is completed.
Cycle 14: Since the value in R5 is now available, the LOAD instruction is issued.
Cycles 15, 16, 17: Further execution of LOAD.

Out-of-order execution has gained us four cycles and was apparently relatively unproblematic. The hardware required should not be underestimated, however, and a new problem also arises: a superscalar architecture must, of course, also react to interrupts. Let us imagine, for example, that in Fig. 11.11 an interrupt request is registered in the 8th cycle. It would take too much time to first process all instructions completely and only then service the interrupt. But if execution is interrupted after the 8th cycle, the situation is remarkable: the ADD instruction has been partially processed, the FADD instruction, which comes much later in the code, has already been completed, and the COPY and LOAD instructions in between have only been decoded. Restoring the system state after the interrupt has been handled is correspondingly laborious (the problem of precise interrupts).

Register renaming

The execution of the above code can be improved further. A critical look at the code shows that some of the data dependencies are avoidable. The COPY instruction loads the address 1200h into R5 so that the following LOAD instruction can use R5 for addressing. Any other register could just as well have been used in these two instructions. The choice of R5 was very unfortunate: it produces an unnecessary data dependency (WAR) on the preceding ADD instruction, and an optimizing compiler would have made a better choice.

One way of eliminating the problem by hardware means is register renaming. The ISA registers referenced in the code are mapped to a larger number of background registers via an allocation table.

Example: The program contains the instruction ADD R1, R2, R3, i.e. the sum of R2 and R3 is to be calculated and stored in R1. The allocation table states that R2 can be found in background register H13 and R3 in H19. The processor thus adds H13 and H19 in a free integer unit. It stores the result in a free background register, let us say H5. Subsequent read accesses to R1 are now always routed to H5. If R1 is written again later, H5 need not be used again. On the contrary, it is advantageous to choose a different background register this time because this removes the data dependency. In concrete terms, this means that a read access to the old value of R1 can still be carried out although a subsequent instruction that writes to R1 has already been executed. This is of course very useful for out-of-order execution. The background registers can be organized linearly or as a ring buffer.

The decode unit (which now has to be built with more intelligence) would recognize the data dependency in our code example and, in the two instructions COPY and LOAD, use an internal register instead of R5 that is completely free at this point in time. This assignment is retained until R5 is overwritten by another instruction. The retirement and completion unit is responsible for the correct assignment when the instructions are completed.
The issue of COPY and LOAD no longer needs to be held back until the ADD instruction has been completed, because register renaming has eliminated the data dependency. The result is shown in Fig. 11.12. The last instruction of the sequence is already completed after 12 cycles.

Figure 11.12: Execution of the example instruction sequence out of order and with register renaming.
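
The allocation table from the ADD R1, R2, R3 example can be sketched in a few lines of Python. The sketch is deliberately reduced: it assumes a simple list of free background registers and leaves out the retirement step that would return registers to that list; the class name, the initial mapping of R1 to H2 and the second free register H7 are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


class RenameTable:
    def __init__(self, mapping: Dict[str, str], free: Sequence[str]) -> None:
        self.mapping = dict(mapping)   # ISA register -> background register
        self.free = list(free)         # free background registers, next one first

    def rename(self, dest: str, srcs: Sequence[str]) -> Tuple[str, List[str]]:
        """Translate one instruction to background registers.

        Sources are looked up first, so an instruction that reads and
        writes the same ISA register still reads the old value.
        """
        phys_srcs = [self.mapping[r] for r in srcs]
        phys_dest = self.free.pop(0)       # always take a fresh register
        self.mapping[dest] = phys_dest     # later reads of dest go here
        return phys_dest, phys_srcs


if __name__ == "__main__":
    table = RenameTable({"R1": "H2", "R2": "H13", "R3": "H19"}, ["H5", "H7"])
    # ADD R1, R2, R3 reads H13 and H19 and writes H5, as in the example.
    print(table.rename("R1", ["R2", "R3"]))
    # A later instruction that writes R1 again gets a different register,
    # so pending reads of the old value in H5 are not disturbed.
    print(table.rename("R1", ["R1", "R2"]))
```

Because every write receives a fresh background register, WAR and WAW dependencies between ISA register names disappear; only the true RAW dependencies remain.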

Pipeline length, speculative execution

Since an instruction still passes sequentially through the processing stages and sub-stages, superscalar structures are still referred to as a pipeline, although it is no longer linear but branched. If we include the instruction fetch cycle, our example architecture corresponds to a branched five- to six-stage pipeline. The question of the optimal pipeline length is interesting. A pipeline with many stages can be driven at a high processor clock because each stage only solves a small sub-task and is simple in structure. This results in a high throughput of instructions per second. Processors with more than 10 pipeline stages are also called super-pipelined. The benefits of superpipelining are limited by two effects: firstly, accesses to memory modules are difficult to subdivide, and secondly, the latency increases with the length of the pipeline. Nevertheless, there has been a certain tendency towards long pipelines in recent years (see the corresponding section).

Conditional jump instructions remain a problem for superscalar architectures as well, especially for those with a long pipeline and therefore high latency. A mispredicted jump can mean that all units have to be flushed. In addition to branch prediction, further measures are therefore often taken here. With sufficient system resources, speculative execution of multiple branches can be afforded. In the case of a conditional jump instruction, the instructions of both branches are fed into the pipeline. Once the jump decision has been made, one part is invalidated and the other is pursued. This requires sophisticated register renaming, since from the program's point of view both branches access the same register set. Some architectures are even able to follow both branches again when a further branch occurs in the speculative branch.

However, speculative execution has its pitfalls. If, for example, a load instruction that leads to a cache miss is executed in a speculative branch, many processor cycles may be sacrificed in order to reload a cache block that is then not needed after all. If the memory access leads to a page fault followed by a hard disk access, the problem becomes even worse. Many superscalar processors therefore work with speculative loading, in which the loading of a data item is aborted if it cannot be found in the L1 cache. An even bigger problem could arise when the following C code is translated:

    if (x > 0) z = y / x;

In the generated machine code, the division is carried out speculatively before the jump condition is evaluated, i.e. even if x is equal to 0. Speculative execution thus leads to exactly the division error exception that the programmer correctly wanted to avoid. One possible solution is the so-called poison bit, which is set in this situation instead of triggering the exception. If it then turns out that this branch is actually taken, the exception can still be raised [38].

Superscalar processors can also be supported by the compiler in the task of parallelization. One measure is to create blocks without a jump instruction that are as long as possible (basic blocks). This can be done, for example, by unrolling loops (loop unrolling) or by expanding subroutine calls inline. It is even better if the architecture offers predication, as Intel's IA-64 does (see Section 11.4). Predicated instructions are instructions that are only executed when a specified condition is met. If the condition is not met, they remain without effect (NOP). Predicated instructions can reduce the number of branches.

VLIW processors

Another way is parallelization by the compiler. The code is already analyzed during translation and, with precise knowledge of the hardware, it is decided which machine instructions can be executed in parallel. The translated program then already contains the directions for parallel execution, and the processor is relieved of this task. This is also called explicit parallelization. The parallelism is usually expressed with the help of a very long instruction word (VLIW). A VLIW contains several instructions that were already bundled at compile time. VLIW processors have advantages and disadvantages. One advantage is the simpler structure of the hardware, since the complicated task of parallelization is transferred to the compiler. One disadvantage is that if the architecture is changed, e.g. in a successor processor, all programs have to be recompiled. The IA-64 architecture, which is presented in Sect. 11.4, works with a three-way parallel VLIW.

Dual-core processors

In recent years it has become apparent that the clock frequency of processors can no longer be increased as easily as was the case for a long time. The gigahertz curve is rising only slowly. One reason for this is the increasing electromagnetic interference radiated from conductor tracks onto other conductor tracks on the chip. Interference that is too strong can flip a bit. The distances between the lines have become smaller and smaller due to the constant shrinking of the structures. Production is currently being switched from a structure width of 90 nm to 65 nm, and the goal of 45 nm production is already in sight.
In addition, the electromagnetic radiation increases rapidly with the clock frequency. The problem of heat generation, which also increases with the clock frequency, is even more serious. A processor that is clocked too fast can no longer dissipate its waste heat; its temperature rises and it destroys itself or at least shortens its service life. On the other hand, the computer user community is demanding and expects regular increases in computing power.

In this situation the old idea of multiprocessor computers was taken up again. If a computer has two processors, it can theoretically perform twice as many calculations in the same time as with one processor, since both processors can work in parallel. However, multiprocessor systems are complex and expensive. Parallelization had to be achieved with less effort. Single Instruction Multiple Data (SIMD), already introduced in PC processors, does not execute several instructions in parallel, but applies one instruction to several data units (see Chapter 12). Intel took another step towards parallelization with Hyperthreading (HT), first in the Xeon processors, later in the Pentium 4-HT. Hyperthreading comprises a duplicated register set, while execution units and caches are still present only once. The duplicated register set supports the efficient parallel execution of two independent threads [41].

A thread is an independent part of a process that has its own program flow, including program counter, register set and stack. However, a thread does not have its own address space, but shares the remaining address space with the other threads of the process. Some programming languages support multithreading directly. From the point of view of software and operating system, HT processors represent a dual-processor system (virtual dual-processor system) that is prepared in hardware for the parallel execution of two threads. The benefit of the method is that two completely independent threads can be parallelized much better than a single thread with many dependencies. If, for example, a waiting time arises in the first thread because of a data dependency or a main memory access, the second thread can be processed during this time instead of waiting idly. Switching over takes hardly any time, since the second register set contains all the current data of the second thread, so nothing needs to be saved or loaded.

Programs that only have one thread, however, do not benefit from hyperthreading at all. With hyperthreading, the execution units are utilized better, but their number is not increased. A program with two threads and few internal waiting times is therefore not significantly accelerated by hyperthreading. The next step towards parallelization is consequently to duplicate not only the register set but the entire processor core including the execution units, so that we have a dual-core processor. If both processor cores are accommodated on the same die, only one package and one socket are required for this dual-core processor and the additional costs are not too high.
This path has been taken for PCs, and Intel has already announced that it will soon be producing only dual-core processors. In which situations does one benefit from dual-core processors? In short, in the following situations:

- when several processes are active (multitasking),
- when processes with several threads are active (multithreading).

Single-core processors implement multitasking (multi-process operation) by switching quickly between the processes, so that the impression of simultaneity is created. The processing time of the processor must, however, be divided between the processes. A dual-core processor can really handle two processes at the same time. Several processes are active, for example, when a user program is running in the foreground and another process is running in the background. The foreground process can be, e.g., a word processor and the background process a virus scanner or media playback. If the background process has its own processor core, it will not slow down the foreground process. Another example is the translation of a program package consisting of several files. If the translation of each file is started as a separate process, two translations can always run at the same time, independently of one another. Of course, the dual core also has advantages when an application carries out a lengthy operation, e.g. burning a DVD, and the user switches to another application in the meantime.

But when can an application be split across multiple threads? This is the old key question of parallelizability. Conditions are always favorable when several calculations can be carried out independently of each other. For example, if a graphics program renders an image according to a certain calculation rule, the image can be split into two halves and each half can be computed by a separate thread. The same applies to many kinds of image and video editing. Multithreading can also be implemented in many games if image composition and game logic are located in completely separate program parts and each is given its own thread.

A further increase is possible if the two cores of a dual-core processor each also support hyperthreading. In this case, four threads can be processed at the same time, with two threads sharing each core. Processors with four cores are also in preparation.

Caching requires special measures in dual-core processors, because each core has its own L1 and L2 cache on the chip. These caches must always be kept consistent, e.g. via the MESI or the MOESI protocol. It is important that the query and update mechanisms required by these protocols are fast and efficient; in some designs the data bus is used for this, in others dedicated lines have been provided.

Both Intel and AMD already offer several dual-core processors for desktop PCs. Software applications benefit from dual cores only if the software is multithreaded accordingly. In this respect, a large part of the software currently in use still has to be revised.

11.3 Case study: Intel Pentium and Core Architecture

Processors of the x86 series have been able to maintain their leading role in the market. The series evolved into a typical CISC architecture with many and sometimes quite complex instructions. However, Intel managed surprisingly well to integrate modern RISC techniques into the x86 architecture, which was actually rather unsuitable for this. The whole series is now quite modern and powerful. At the same time, strict attention was paid to full backward compatibility. A machine program that was compiled for an 8086 more than 20 years ago can still run on a Pentium 4 without any changes. Because producing software is complex and expensive, programs are often used for a surprisingly long time, and the backward compatibility of Intel's processors has been an important advantage. There is now a huge amount of software for the x86 architecture, and one wonders whether it will ever be superseded.

When the 80386 was released, it had already been recognized that the future belongs to RISC technology. Its successor, the 80486, already had some RISC features: a five-stage pipeline, more instructions implemented directly in hardware, an L1 cache on the chip, etc. Its successor, the Pentium, was already superscalar with two parallel pipelines. Pentium Pro, Pentium II and Pentium III have the P6 microarchitecture, a three-way superscalar architecture with out-of-order execution. The P6 structure is similar to that of our superscalar example architecture. There are separate L1 data and L1 code caches on the chip. Three decoders pass the generated micro-operations to a buffer (here: the reservation station) with 20 entries. Register renaming maps the eight ISA registers onto 40 background registers by means of an allocation table. The execution units comprise three LOAD/STORE units and two mixed ALUs for integer, SIMD and floating-point processing (see footnote 2). The Pentium III pipeline has 10 to 11 stages.

The main problem of the P6 architecture is the continued support for the complex instructions of the instruction set. The simple, RISC-like instructions are converted directly into one micro-operation each by the decoders. Some of the complex instructions are still processed in microcode. For this purpose, one of the decoders is equipped with a microcode ROM and generates the corresponding sequences of micro-operations. The powerful SIMD units of the Pentium processors are described in section 12.

AMD's competing Athlon is also superscalar and has a total of six decoders that fill a buffer for 72 micro-operations. From there the micro-operations are assigned to three integer units, three LOAD/STORE units and three floating-point units for execution. The Athlon's caches are also generous: a 64 KByte code cache, a 64 KByte data cache and a 256 KByte shared L2 cache, all on-chip. So with today's PC processors we are dealing with a mixed architecture that cannot be clearly classified as CISC or RISC.

Superscalarity through parallel instruction execution can easily be tried out on a PC if a C compiler that allows inline assembler is installed. The Pentium processors (as well as the Athlon) have a time stamp counter that is 64 bits wide and is incremented by one with each processor cycle.
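To get a feeling for the width of this counter, the following minimal C sketch estimates how long a counter that is incremented once per clock cycle runs before it wraps around. The function name `wrap_seconds` and the clock rate of 1 GHz used in the note below are assumptions chosen for illustration, not values from the text.

```c
#include <assert.h>

/* Seconds until a counter that is `bits` wide and is incremented once per
   clock cycle wraps around, for a clock frequency of `clock_hz` hertz.
   Hypothetical helper for illustration only. */
double wrap_seconds(unsigned bits, double clock_hz)
{
    assert(bits >= 1 && bits <= 64 && clock_hz > 0.0);

    double states = 1.0;
    for (unsigned i = 0; i < bits; i++)
        states *= 2.0;              /* states = 2^bits, exact in a double */

    return states / clock_hz;
}
```

With an assumed clock of 1 GHz, `wrap_seconds(32, 1e9)` is about 4.3 seconds, whereas `wrap_seconds(64, 1e9)` corresponds to more than 500 years. The full 64-bit counter therefore practically never overflows, while the lower 32 bits alone are sufficient only for short measurements.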
The time stamp counter is part of the so-called performance monitor, which can be used to obtain data about the execution behavior of the processor. The RDTSC (read time stamp counter) instruction transfers the time stamp counter to the registers EDX and EAX. If the time stamp counter is read twice, the difference between the two readings is the number of processor cycles used by the instruction sequence in between. In the following example, four simple instructions are inserted between the RDTSC instructions. The execution time increases by only one processor cycle, which demonstrates the parallel work in the processor very nicely (see footnote 3).

    int anzahltakte;                /* receives the measured number of clock cycles */

    _asm {
        rdtsc                       ; read time stamp counter
        mov  esi, eax               ; save counter reading in ESI
        ; *****************************************************
        ; The following four simple integer instructions extend
        ; the execution time by only one processor cycle
        or   ecx, edi
        xor  eax, eax
        and  ebx, ecx
    ende:
        inc  edi
        ; *****************************************************
        rdtsc                       ; read time stamp counter
        sub  eax, esi               ; difference between the counter readings
        mov  anzahltakte, eax       ; transfer it to the variable
    }
    printf("The execution of the command sequence needed %i processor clocks\n", anzahltakte);

Footnote 2: Pentium Pro and Pentium II differ slightly in the mixed ALUs.
Footnote 3: In this program fragment only the lower 32 bits of the time stamp counter are used, since an overflow of these bits during a measurement is extremely rare.

We now replace the xor instruction with the conditional jump instruction JNZ ende (jump if not zero to ende). The conditional jump instruction can cause the pipeline to be flushed and refilled. In fact, on the author's PC (Pentium III) the conditional jump instruction extends the execution time by a further 11 cycles, which corresponds exactly to the number of pipeline stages.

    int anzahltakte;                /* receives the measured number of clock cycles */

    _asm {
        rdtsc                       ; read time stamp counter
        mov  esi, eax               ; save counter reading in ESI
        ; *****************************************************
        ; The following four instructions extend the
        ; execution time by several processor cycles
        or   ecx, edi
        jnz  ende                   ; jump if not zero to ende
        and  ebx, ecx
    ende:
        inc  edi
        ; *****************************************************
        rdtsc                       ; read time stamp counter
        sub  eax, esi               ; difference between the counter readings
        mov  anzahltakte, eax       ; transfer it to the variable
    }
    printf("The execution of the command sequence needed %i processor clocks\n", anzahltakte);

It should be pointed out that these small examples are not exact measurements, because the inserted instructions are also executed partially overlapping with the RDTSC instruction [19].

Pentium 4

With the Pentium 4 came the NetBurst microarchitecture. Here the L1 code cache is moved behind the decoding stage as an execution trace cache [19], [36]. It holds 12 K micro-operations, which are stored as chained sequences (trace sequences) in the order in which they were last executed, across all jumps and branches.
If the same sequence occurs again, which is often the case, the trace sequence can simply be taken from the execution trace cache and passed on to the execution units again (Fig. 11.13). This completely saves the data traffic to the memory hierarchy and the effort of decoding again. Cache blocks that are logically split by a jump instruction also no longer have to be kept completely in the L1 cache.

Figure 11.13: The execution of a code sequence that is completely contained in the execution trace cache of the Pentium 4 does not require decoding again.

The branch prediction has been improved and cooperates with the execution trace cache. A bypassing network is located parallel to the execution units. The core of the Pentium 4 allows out-of-order execution over a window of as many as 126 instructions. Two integer units, a LOAD unit and a STORE unit (Address Generation Units, AGU) are available for processing, all of which operate at twice the processor frequency (double pumped). There are also a slow integer unit, a floating-point unit and a mixed floating-point/SIMD unit. Parallel to the background registers there is a bypassing network that supports the forwarding of results and loading. The retirement unit retires up to 3 micro-operations per cycle.

The further development to the Pentium 4E brought the following changes:

- extension of the pipeline to 32 stages,
- an additional multiplication unit in the slow integer unit,
- the SIMD extension SSE3,
- enlargement of the caches and other small changes.

The Pentium 4E contains the enormous number of 125 million transistors. With its hyper-long pipeline, it was originally intended for further development up to 10 GHz. However, its large power requirement of up to 150 W at full load and the associated heat generation are a problem. This may also be the reason why the Pentium 4 and its NetBurst architecture are no longer being developed further at Intel.

The 64-bit extension

The development of microprocessors has always gone hand in hand with a steady widening of the working registers. The first microprocessor, the Intel 4004, still worked with a width of 4 bits, but after the turn of the millennium even 32 bits were apparently no longer enough. A change to 64 bits was called for, although there are not that many applications that make good use of such wide registers. At that time, Intel relied on a completely new beginning with the Itanium processor and the IA-64 architecture (see Sect. 11.4). It offered 64-bit processing and a completely new instruction set, with which Intel wanted to start into the future free of legacy ballast. However, the Itanium gained acceptance only slowly, and in the meantime competitor AMD took a different path. The register set of the Athlon was extended to 64 bits, but the processor was kept compatible with its predecessors. Existing software can therefore continue to be used, an important argument that has already been mentioned. The 64-bit extension provides not only registers widened to 64 bits but also 8 additional registers. A prefix byte in front of the respective instruction switches between the 32-bit environment and the 64-bit environment in the running code. The 64-bit environment extends the general-purpose registers EAX, EBX, ECX, EDX, EDI, ESI, EBP and ESP to 64-bit registers, which are now called RAX, RBX, RCX, RDX, RDI, RSI, RBP and RSP.

Figure 11.14: The 64-bit extension of the AMD Athlon 64. On the left the 32-bit environment, on the right the 64-bit environment.

In addition, 8 new registers R8 to R15, also 64 bits wide, are available. In the 64-bit environment, these registers can also be accessed with 32, 16 or even only 8 bits. One can also use the 64-bit extension simply to have twice as many general-purpose registers available. Interestingly, even 8-bit partial registers of SI, DI, SP and BP can be used here, which was not possible before. The status register EFlags is extended to 64 bits, with the upper 32 bits initially unused. The extended instruction pointer (EIP, the program counter) is extended to a 64-bit RIP so that 64-bit pointers can also be used. This enlarges the address space. In addition, eight further 128-bit registers are available for the SSE extension: XMM8 to XMM15. In order to be able to address the 64-bit registers, a number of instructions have been extended to a width of 64 bits. Examples are LODSQ (load string quadword) and MOVZX (move with zero-extension, now also into a 64-bit register). The 64-bit extension was eventually adopted by Intel as well and is available in the newer processors under the name EM64T. In order to use the 64 bits, the chipset on the motherboard and the operating system must also support the extension.

Figure 11.15: The historical development of the processing width, illustrated by some Intel processors.

Core architecture

The P6 architecture of the Pentium III evidently still had development potential and was continued, especially with regard to energy-saving processors for mobile computers. This produced the successful models Pentium M, Core Solo and the dual-core processor Core Duo [22]. Only minor modifications were necessary: adding a pipeline stage, enlarging the cache and adding the SSE instructions.
In a separate line of development, a new architecture has now been derived from this, which is intended to lead into the future as the Core architecture. A number of changes and additions have been made to this end. The most spectacular change was the extension of the registers to 64 bits, which follows the model of the Athlon processors (EM64T). There have been many small but intelligent improvements in the microarchitecture: The decoding stage received a fourth decoder, so that the design