Similar to the K8 architecture, Core pre-decodes instructions that are fetched. Pre-decode information includes instruction length and decode boundaries.

A first for the x86 world, the Core architecture is equipped with four x86 decoders: 3 simple decoders and 1 complex decoder. The task of the decoders - for all current x86 CPUs - is not only to decipher the incoming instruction (opcode, addresses), but also to translate the 1 to 15 byte variable-length x86 instructions into fixed-length, easier-to-schedule-and-execute RISC-like instructions, called micro-ops.

The most common x86 instructions are translated into a single micro-op by the 3 simple decoders. The complex decoder is responsible for the instructions that produce up to 4 micro-ops. The really long and complex x86 instructions are handled by a microcode sequencer. This way of handling the most complex, CISC-y instructions has been adopted by all modern x86 CPU designs, including the P6, the Athlon (XP and 64), and the Pentium 4.

There is still more to the Core decoders. The first clever technique is macro-op fusion. It makes it possible for two relatively common x86 instructions to be fused into a single instruction. For example, the x86 compare instruction (CMP) is fused with a jump (JNE TARG). These instructions are typically the assembler result of a compiled if-then-else statement. The result is that on average, in a typical x86 program, for every 10 instructions, two x86 instructions (called macro-ops by Intel) are fused together. When two x86 instructions are fused, the 4 decoders can decode 5 instructions in one cycle. The fused instruction travels down the pipeline as a single entity, and this has other advantages: more decode bandwidth, less space taken in the Out of Order (OoO) buffers, and less scheduling overhead. If Intel's "1 out of 10" claims are accurate, macro-op fusion alone should account for an 11% performance boost relative to architectures that lack the technology.

The second clever technique already exists in the current P-M CPUs. There are a few x86 instructions which are pretty complex to perform, but which are at the same time very typical and common x86 instructions. We are talking, for example, about mathematical operations where an address is referenced instead of a register: ADD [mem], EAX. This means: add the content of register EAX to the content of a certain memory location, and store the result back at that memory address. Store instructions, which get broken down into a store-address and a store-data operation, are another example. In earlier designs such as the P6 (Pentium Pro, PII, PIII) architecture, these instructions would have been broken up into two or even three micro-ops. Remember that the whole philosophy behind all modern x86 CPUs, since the P6, is to decode x86 instructions into RISC-y micro-ops which are then fed to a fast RISC backend; the backend then schedules, issues, executes and retires the instructions in a smooth RISC way. There is no way you could feed such an instruction (ADD [mem], EAX) to RISC execution units: RISC designs all load their data into registers and then perform the necessary calculation on the registers. The instruction is therefore broken up into three micro-ops:

Load the contents of [mem] into a register (MOV EBX, [mem])
An ALU operation, ADD the two registers together (ADD EBX, EAX)
Store the result back to memory (MOV [mem], EBX)

Since Banias, the ALU and the load operation are kept together in one micro-op. This is no small feat: in older designs, keeping the load and ALU operation together would result in pipeline stages that take much longer and thus lower the maximum clock frequency. (In CPU designs, the maximum clock speed is essentially determined by the slowest possible pipeline stage execution time.) Only by using bigger, smarter circuitry that can do a lot in parallel is micro-op fusion possible without lowering the clock speed significantly. The pre-decode stage recognizes the macro-ops (or x86 instructions) that should be kept together. In the decoding phase, ADD [mem], EAX then results in one micro-op. Again, this means that the CPU can stuff more instructions into the same OoO buffers, increasing efficiency and improving performance.

All very nice, but let us take a look at what really matters: how do the 3 simple + 1 complex decoders of Core compare to the 3 complex decoders of AMD's K8 architecture?
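The decode-bandwidth argument above can be sketched with a toy model. The snippet below is a deliberate simplification, not Intel's actual fusion logic: it assumes any CMP or TEST immediately followed by a conditional jump fuses into one decode slot, and it ignores the real hardware's additional pairing restrictions. The function names and the instruction list are illustrative inventions.

```python
# Toy model of a 4-wide decode stage with macro-op fusion.
# Hypothetical simplification: any CMP/TEST followed directly by a
# conditional jump fuses and consumes a single decode slot.

FUSIBLE_FIRST = {"CMP", "TEST"}
COND_JUMPS = {"JNE", "JE", "JL", "JG"}

def decode_cycles(instructions, width=4, fusion=True):
    """Count the decode cycles needed for a list of x86 mnemonics."""
    slots = []  # decode slots actually consumed
    i = 0
    while i < len(instructions):
        if (fusion and i + 1 < len(instructions)
                and instructions[i] in FUSIBLE_FIRST
                and instructions[i + 1] in COND_JUMPS):
            slots.append((instructions[i], instructions[i + 1]))  # fused pair
            i += 2
        else:
            slots.append((instructions[i],))
            i += 1
    # Each cycle, the decoders fill at most `width` slots.
    return -(-len(slots) // width)  # ceiling division

# A compiled if-then-else typically ends in a CMP + Jcc pair:
program = ["MOV", "ADD", "SUB", "CMP", "JNE"]  # 5 x86 instructions

print(decode_cycles(program, fusion=False))  # -> 2 cycles without fusion
print(decode_cycles(program, fusion=True))   # -> 1 cycle: CMP+JNE share a slot
```

With fusion enabled, the five-instruction sequence fits into four decode slots, which is exactly the "4 decoders can decode 5 instructions in one cycle" claim from the text.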