Dependency and Exception Handling in an Asynchronous Microprocessor

A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Science and Engineering

1997

David Alan Gilbert
Department of Computer Science

Contents

Contents
List of figures
List of tables
Abstract
Declaration
Copyright and intellectual property rights
Acknowledgments
The author

1 Introduction
  1.1 Synchronous and asynchronous design
  1.2 Arguments for asynchronous design
    1.2.1 Clock skew avoidance
    1.2.2 Better than worst case execution time
    1.2.3 Power considerations
    1.2.4 Electromagnetic compatibility (EMC)
    1.2.5 Modularity of design
  1.3 Problems with asynchronous design
    1.3.1 Control logic complexity
    1.3.2 Testability
    1.3.3 The risk of deadlock
    1.3.4 The loss of implied knowledge
  1.4 Asynchronous handshaking
  1.5 Micropipelines
  1.6 The ARM Microprocessor
  1.7 Existing AMULET processors
  1.8 Aims of AMULET3
  1.9 Other asynchronous microprocessors
  1.10 The structure of this thesis

2 Dependencies
  2.1 Types of dependency
    2.1.1 Procedural dependencies
    2.1.2 Read after write (RAW) data dependencies
    2.1.3 Write after write (WAW) data dependencies
    2.1.4 Write after read (WAR) dependencies
    2.1.5 Resource contention
    2.1.6 Summary of dependency types
  2.2 Enforcing dependencies
    2.2.1 Procedural dependencies
    2.2.2 RAW dependencies
    2.2.3 WAW dependencies
    2.2.4 WAR dependencies
    2.2.5 Resource contention
    2.2.6 Summary of dependency enforcement techniques
  2.3 Reducing the effect of dependencies
    2.3.1 Procedural dependencies
    2.3.2 RAW dependencies
    2.3.3 WAR and WAW dependencies
    2.3.4 Out of order issue
  2.4 Dependencies and external state
  2.5 Novel approaches to dependency resolution
    2.5.1 SCALP
    2.5.2 The counterflow pipeline processor architecture
    2.5.3 Hades
    2.5.4 The Micronet-based asynchronous processor (MAP)
  2.6 Special registers
  2.7 Summary

3 Exceptions
  3.1 Causes of exceptions
    3.1.1 External interrupts
    3.1.2 Arithmetic errors
    3.1.3 Undefined/unimplemented instructions
    3.1.4 Memory access errors
    3.1.5 Software interrupts
    3.1.6 Unpredicted/mispredicted branches
    3.1.7 Breakpoints
    3.1.8 Reset
  3.2 The effect of exceptions
  3.3 Cost and frequency of different types of exception
    3.3.1 Frequency of exceptions
    3.3.2 Cost of exceptions
  3.4 Mechanisms for saving state
    3.4.1 In-order, lookahead and architectural state
    3.4.2 Saving state using checkpoints
    3.4.3 The history buffer
    3.4.4 The reorder buffer
    3.4.5 Reorder buffer with forwarding paths
    3.4.6 The future file
  3.5 Exceptions in the AMULET1 processor
  3.6 Exceptions in the Fred processor
  3.7 Exceptions and external state
    3.7.1 Exceptions in a pipelined memory
    3.7.2 Multiple forms of external state
  3.8 Summary

4 Issues in implementing the ARM architecture
  4.1 Processor modes
  4.2 Registers
  4.3 The Program counter
  4.4 The CPSR
  4.5 The SPSRs and their role in exception entry
  4.6 External state
    4.6.1 Memory
    4.6.2 Coprocessors
  4.7 Exceptions on the ARM
  4.8 Conditional execution
    4.8.1 Conditional execution and the use of future files
    4.8.2 Conditional execution and the Hades forwarding mechanism
  4.9 Load/store multiple instructions
    4.9.1 LDM with base register in the transfer list
    4.9.2 LDM with PC in the transfer list
    4.9.3 Conditional LDM/STM instructions
    4.9.4 User mode register access
  4.10 Summary

5 An Asynchronous Reorder Buffer
  5.1 Dependency and exception handling in AMULET2
  5.2 A new pipeline model
  5.3 Parallel access FIFOs
  5.4 Three process view of the parallel FIFO buffer
  5.5 Five process view of the buffer
  5.6 Operation with the five process model
    5.6.1 The pipeline stages
    5.6.2 A data operation
    5.6.3 Memory operations
    5.6.4 Data validity in the read process
    5.6.5 Lack of synchronisation between the read and allocate processes
  5.7 Process synchronisation
  5.8 The reorder buffer entries
  5.9 Summary of constraints
  5.10 Summary

6 Auxiliary mechanisms
  6.1 The instruction colour
  6.2 The program counter
    6.2.1 Reading the program counter
    6.2.2 Changing the program counter
    6.2.3 Loading PC via the fetch unit
    6.2.4 Discarding instructions at the decode stage
  6.3 The CPSR
    6.3.1 The Xpipe
  6.4 The SPSRs
    6.4.1 The interaction of the SPSRs and data aborts
  6.5 Base restoration
  6.6 Exceptions in the memory
    6.6.1 Pipelined memory and reorder buffer size
  6.7 Interrupts
  6.8 Long multiplication
  6.9 Summary

7 Simulation and results
  7.1 The simulation environment
    7.1.1 Trace based simulation
    7.1.2 Behavioural simulation
  7.2 Benchmarks
  7.3 Results
    7.3.1 Results from the trace based simulation
    7.3.2 The benefits of the reorder buffer
    7.3.3 Loading the PC via the instruction fetch mechanism
    7.3.4 The penalty of SPSR locking
  7.4 Absolute performance
    7.4.1 Cost of stream changes
    7.4.2 Load latency
    7.4.3 Stage complexity
  7.5 Summary

8 Conclusions and Future Work
  8.1 Summary
  8.2 Conclusions
  8.3 Advantages and disadvantages of the proposed architecture
  8.4 Future work
    8.4.1 Coprocessors
    8.4.2 Thumb
    8.4.3 PC change prediction
  8.5 The asynchronous future

References

A The ARM instruction set
  A.1 Conditional execution
  A.2 Normal data processing operations
    A.2.1 Data processing with PC write
  A.3 Branch
    A.3.1 Branch with link
  A.4 Single value memory transfer
    A.4.1 Halfword and signed byte accesses
    A.4.2 Swap
  A.5 Multiple value memory transfer
  A.6 Access to the CPSR/SPSR
  A.7 Multiplication
  A.8 Coprocessor instructions
    A.8.1 Coprocessor data operations
    A.8.2 Coprocessor data transfers
    A.8.3 Coprocessor register transfers
  A.9 Software interrupts
  A.10 Undefined instructions
  A.11 Summary

B The structure of existing ARMs and AMULETs
  B.1 ARM 2/3/6/7
  B.2 ARM 8
  B.3 StrongARM
  B.4 The AMULET1 and AMULET2
  B.5 Summary

List of figures

Figure 1.1 Global synchronisation
Figure 1.2 A two phase handshake
Figure 1.3 A four phase handshake
Figure 1.4 Variations on the 4 phase handshake bundled data protocol
Figure 1.5 Micropipeline
Figure 1.6 Micropipeline with logic
Figure 2.1 A simple pipeline
Figure 2.2 Five stage pipeline
Figure 2.3 Pipeline with out of line memory unit
Figure 2.4 Pipeline with out of order issue
Figure 2.5 The AMULET1 Lock FIFO
Figure 2.6 Arbitration for a shared register write bus
Figure 2.7 Instruction fetch unit with BTC
Figure 2.8 Pipeline with result forwarding
Figure 2.9 Tomasulo’s algorithm
Figure 2.10 A simplified model of the CFPP
Figure 3.1 Checkpoints for exception recovery
Figure 3.2 The history buffer
Figure 3.3 Format of a history buffer entry
Figure 3.4 Processor organisation with a reorder buffer
Figure 3.5 Format of the reorder buffer entries
Figure 3.6 Processor organisation with a reorder buffer with forwarding
Figure 3.7 Processor organisation with a future file
Figure 4.1 The ARM Register set
Figure 4.2 Organisation of the CPSR and SPSRs
Figure 5.1 Initial pipeline model
Figure 5.2 Micropipeline and parallel FIFO implementations
Figure 5.3 Parallel FIFO with forwarding
Figure 5.4 Pipeline with reorder buffer
Figure 6.1 The commit block in the pipeline
Figure 6.2 Adding the offset to the program counter
Figure 6.3 Writing to the PC via the reorder buffer
Figure 6.4 Writing to the PC via the commit block
Figure 6.5 Reading the PC from memory via the fetch unit
Figure 6.6 The relationship of the execute unit and the commit block
Figure 6.7 Duplicate copies of the SPSRs
Figure 6.8 The Base Restore Pipe (BRP)
Figure 6.9 Architecture overview
Figure 7.1 Model as simulated: Reorder buffer without forwarding
Figure 7.2 Model as simulated: No reorder buffer
Figure 7.3 Operand age distribution
Figure 7.4 Relative execution times for the JPEG benchmark
Figure 7.5 Summary of execution times for reorder buffer without forwarding
Figure 7.6 Percentage of reorder buffer allocations which had to stall (no forwarding)
Figure 7.7 Summary of execution times for reorder buffer with forwarding
Figure 7.8 Percentage of reorder buffer allocations which had to stall (with forwarding)
Figure B.1 ARM 2/3/6/7 organization
Figure B.2 ARM8 Integer Unit organisation
Figure B.3 StrongARM pipeline core organization
Figure B.4 AMULET Internal organisation

List of tables

Table 1.1: An overview of the ARM instruction set
Table 4.1: Mapping of variables to storage
Table 5.1: Example forwarding key for R1
Table 7.1: Instruction set usage (dynamic)
Table 7.2: Relative execution time: PC load via data interface/PC load via fetch interface
Table 7.3: Absolute execution times (in ms)
Table 7.4: Percentage slower than StrongARM (ROB 4 entry)
Table 7.5: Branch costs on StrongARM and the VHDL model (cycles)
Table 7.6: Branch type occurrence in benchmarks
Table 7.7: Branch costs for benchmarks
Table 7.8: Benchmark performance having compensated for branch costs
Table A.1: ARM Condition codes
Table A.2: ARM data processing operations
Table A.3: ARM Comparison operations

Abstract

Dependency and exception handling mechanisms are an important part of modern high-performance microprocessors. In a pipelined microprocessor, dependency and exception handling require different stages of the pipeline to interact with each other to determine the current state of the processor as a whole. In a synchronous processor interactions between separate pipeline stages are managed using a global clock. Communication between non-neighbouring pipeline stages is more complex in an asynchronous microprocessor which does not have a global clock. This thesis describes a solution to this problem in the context of a third generation asynchronous implementation of the ARM instruction set architecture. The architecture described provides powerful and efficient dependency resolution while simultaneously providing a flexible, low overhead exception handling mechanism.
The mechanism provides the basis for the architecture of the AMULET3 microprocessor. Existing exception handling and dependency resolution mechanisms are re-evaluated in the context of asynchronous implementation and the ARM architecture. The Reorder Buffer is chosen as the basis of the architecture and novel enhancements are proposed which enable its use in an asynchronous environment. Simulation results are presented that show that the proposed architecture is significantly faster and more flexible than comparable architectures while still providing complete compatibility with the ARM instruction set architecture.