I-Unit Pipelining Examples - Summary
Non-Pipelining:
The non-pipelined example takes 111 cycles. Sixteen instructions at 6 cycles
each account for 96 cycles. Add 7 cycles for the first cache miss, 5 cycles
for the second cache miss, 1 extra cycle for the multiply in "2 ( a + b + 4 )",
and 2 extra cycles for the multiply in "3 ( a + b + 6 )". The total is 111
cycles:
16 # Number of Instructions
* 6 # Number of cycles per Instruction
96
7 # Cache Miss for line at 3000
5 # Cache Miss for line at 2000 (2 less because stores are released early)
1 # Extra cycle to calculate first multiply
+ 2 # Extra cycles to calculate second multiply
111 # Total number of cycles
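The same tally can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the variable names are illustrative and are not part of the original example.

```python
# Figures are taken from the tally above; the names are illustrative only.
instructions = 16
cycles_per_instruction = 6

# Extra cycles on top of the base cost of each instruction.
penalties = {
    "cache miss, line at 3000": 7,
    "cache miss, line at 2000": 5,  # 2 less because stores are released early
    "first multiply": 1,
    "second multiply": 2,
}

base_cycles = instructions * cycles_per_instruction
non_pipelined_total = base_cycles + sum(penalties.values())

print(base_cycles)          # 96
print(non_pipelined_total)  # 111
```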
Pipelining:
The pipelined example took 60 cycles. Ideally, with the instructions
overlapped in the pipeline, one instruction would complete in each cycle,
and the sequence would finish in the following number of cycles:
16 # Number of Instructions
5 # Number of cycles it takes to get the Pipeline started
7 # Cache Miss for line at 3000
5 # Cache Miss for line at 2000 (2 less because stores are released early)
1 # Extra cycle to calculate first multiply
+ 2 # Extra cycles to calculate second multiply
36 # Total number of cycles
But it didn't. It took 60 cycles instead of 36 because some of the
instructions depended on the results of instructions that were still being
executed. For example, the second instruction, at cycle 004, had to wait in
the A-Stage for 3 cycles. It waited because its address generation depended
on the value of GPR1, which was being modified by an instruction further
down the pipe. The instruction in the A-Stage had to wait until GPR1 was
updated with the new value.
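The ideal count and its gap to the observed 60 cycles can be reproduced with a short Python sketch. The figures come from the text; the variable names are illustrative, and treating the whole gap as dependency stalls follows the explanation given above rather than a cycle-by-cycle trace.

```python
# Figures are taken from the text; the names are illustrative only.
instructions = 16          # ideally one instruction completes per cycle
pipeline_fill = 5          # cycles to get the pipeline started
penalties = [7, 5, 1, 2]   # cache misses (lines 3000, 2000) and the two multiplies

ideal_pipelined_total = instructions + pipeline_fill + sum(penalties)
observed_pipelined_total = 60

# Cycles the text attributes to instructions waiting on earlier results.
dependency_stall_cycles = observed_pipelined_total - ideal_pipelined_total

print(ideal_pipelined_total)    # 36
print(dependency_stall_cycles)  # 24
```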
Comparison:
Since a 6-stage pipeline was used, the ideal improvement in performance
would be a factor of 6. In practice, performance improved by a factor of
only 1.85 (111 cycles versus 60). That is nowhere near the factor of 6 we'd
like to get, but overall the addition of this single concept almost doubled
the performance. Don't worry, with bypassing the performance will improve
to a factor of 3.08 over the non-pipelined example, which is still nowhere
near the factor of 6. Ideally we would have also included an example of
EPIC, which for this program would have gotten us near or perhaps even
beyond the factor of 6.
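The speedup figures can be checked the same way. In this sketch the cycle counts and the 3.08 factor come from the text. The cycle count for the bypassing example is not stated here, so the value below is only what the 3.08 factor implies; it is an inference, not a measured result.

```python
# Cycle counts and the 3.08 factor are from the text; names are illustrative.
non_pipelined_cycles = 111
pipelined_cycles = 60
pipeline_stages = 6        # the ideal speedup for a 6-stage pipeline

speedup = non_pipelined_cycles / pipelined_cycles
print(round(speedup, 2))   # 1.85

# Fraction of the ideal factor of 6 that was actually achieved.
print(round(speedup / pipeline_stages, 2))  # 0.31

# Assumption: back out the cycle count implied by the quoted 3.08 factor.
bypass_speedup = 3.08
implied_bypass_cycles = round(non_pipelined_cycles / bypass_speedup)
print(implied_bypass_cycles)  # 36
```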