Understanding Simulator Timing
Running a program in Simics is not the same thing as running it on real
hardware. Simics should be viewed as one possible implementation of the
architecture, where physical hardware is another. Both comply with the hardware
architecture as defined by its documentation, but the implementations differ —
and one of the most important differences is timing.
One can differentiate between the function and the timing of a system. The
function of a system is the state changes it goes through, given the current
state and inputs. These state changes are usually described in a programmer’s
reference manual, and they need to be modeled correctly by a simulator. The
timing, on the other hand, is when these state changes occur. It is common
that timing is not fully defined by the architecture, and thus there is some
headroom for the simulator to make its own choices.
The timing may differ between hardware and simulation model for a number of
reasons:
- The timing of the system modeled may be undocumented.
- To increase simulator performance.
- To simplify the simulation timing model, so that a user fully understands it
should they want to alter it.
In Simics, the most evident timing difference compared to real hardware is
instruction timing. In real hardware, instruction timing depends on memory
access latencies, the complexity of the operation, execution resource
contention, and so on. These things are not simulated by Simics by default.
Instead, one instruction is normally simulated each cycle. The simulation of the
execution of one instruction is called a step.
It is possible to configure a Simics processor to execute more or less than one
step per cycle. It is also possible to have two or more processors in one
simulation, where each processor has a unique clock frequency. In multiprocessor
simulation, time advances on one processor at a time in a round-robin fashion.
Each processor advances a fixed amount of simulated time, called a quantum,
before switching to the next processor. Even though each quantum has the same
duration in simulated time, processors may simulate a different number of steps,
depending on their frequency.
Simics offers the possibility to stall memory transactions. This is a common way
to affect instruction execution timing and make the simulation more realistic.
Memory transactions are usually initiated by a processor and sent to a memory
space. The memory space decides where the transaction should go based on its
mappings.
Each memory space provides the memory hierarchy interface. This interface
provides a way to inspect and even alter a memory transaction passing through
the memory space, both before and after the actual transaction has executed. In
addition, the memory hierarchy interface lets a user stall the transaction. This
is how caches are modeled in Simics.
It is important to know that not all memory transactions are visible through the
memory hierarchy interface by default. During normal operation, instruction
fetches are not visible at all, and some data accesses are not sent to the
memory space. Simics caches the result of a memory space lookup so that further
accesses to the same area do not have to repeat the lookup. This is done to
increase performance. However, Simics can be configured so that all memory
transactions are visible through the memory hierarchy interface.
Simics is an event driven simulator with a maximum time resolution of a clock
cycle. In a single-processor system, the length (in seconds) of a clock cycle
is easily defined as 1 divided by the processor frequency set by the user. As
described later in this chapter, Simics can also handle multiprocessor systems
with different clock frequencies, but we will focus on single-processor systems
for the rest of this section.
As Simics makes the simulated time progress, cycle by cycle, events are
triggered and executed. Events include device interrupts, internal state
updates, as well as step executions. A step is the unit in which Simics
divides the execution of a flow of instructions going through a processor; it is
defined as the execution of an instruction, an instruction resulting in an
exception, or an interrupt. Steps are by far the most common events.
Steps and cycles are fundamental Simics concepts. Simics exposes two types of
events to users: events linked to a specific step, and events linked to a
specific cycle. Step events are useful for debugging (execute 1 step at a time,
stop after executing 3 steps, etc.), whereas time events are rather independent
from the flow of execution (sector read operation on the hard disk will finish
in 1 ms).
For each executed cycle, events are triggered in the following order:
- All events posted for this specific cycle, except step execution.
- For each step scheduled for execution on this cycle:
- All events scheduled for that specific step.
- A step execution.
Events belonging to the same category are executed in FIFO order: posted first
is executed first.
In the default model, the execution of a step takes no time by itself, and steps
are run in program order. This is called the Simics in-order model. It
implements the basic instruction set abstraction that instructions execute
discretely and sequentially. This minimalistic approach makes simulation fast
but does not attempt to model execution timing in any way.
Normally one step is executed every cycle, so that the step and cycle counts are
the same. See the section Changing the Step Rate for how to change this.
The in-order model can be extended by adding timing models to control the
timing of memory operations, typically using the memory hierarchy
interface.
When timing models are introduced, steps are no longer atomic operations taking
no time. A step performing a memory operation (whether an instruction fetch or a
data transaction) can stall for a number of cycles. Cycle events are performed
during the stalling period as time goes forward. Step events are performed just
before the step execution, as in the default model. Simics executes one step at
a time, but with varying timing for each step, so the simulation is still
performing an in-order execution of the instruction flow. The basic step rate
can also be changed; see the section Changing the Step Rate below.
Choosing an execution mode is matter of trade-off between performance and
accuracy. The stalling mode is notably slower than in-order, but allows for
memory timing models to operate correctly and, as a consequence, also allows for
inspection of all memory transactions. By using checkpoints, it is possible to
switch between the two modes, since a checkpoint created in the in-order mode
can be loaded in stall mode. The simple in-order mode can thus be used to reach
interesting parts of the simulation quickly, and the stall mode is used to
simulate those parts.
The step rate is the number of steps executed each cycle, disregarding any
stalling. It is expressed as the quotient q/p. By default, p=q=1; this schedules
one step to run in each cycle. This can be changed by using the
<processor>.set-step-rate command. For example,
simics> cpu0.set-step-rate "3/4"
will set the step rate of cpu0 to 3/4; that is, three steps every four cycles.
If q<p, then some cycles will execute no step at all; if q>p, then some cycles
will execute more than one step. The step rate parameters are currently limited
to 1 ≤ p ≤ 128 with p=2k for some integer k, and 1 ≤ q ≤ 128.
Setting a non-unity step rate can be used to better approximate the timing of a
target machine averaging more or less than one instruction per cycle. It can
also be used to compensate for Simics running instructions slower than actual
hardware when it is desirable to have the simulated time match real time;
specifying a lower step rate will cause simulated time go faster. Finally, a
lower step rate may improve simulator performance by reducing the number of
instructions executed between consecutive simulated timer interrupts. This is,
however, to a high degree depending on the workload in question. Some workloads
even benefit from a larger number of steps between timer interrupts.
The step rate is sometimes called IPC (instructions per cycle), and its inverse,
the cycle rate, may be called CPI (cycles per instruction). The actual rates
will depend on how many extra cycles are added by stalling.
Let us look at an example using the QSP-x86 platform. We will first run 1
million steps with the default settings:
simics> c 1_000_000
simics> print-time
┌──────────────────────┬─────────┬─────────┬────────┐
│ Processor │ Steps │ Cycles │Time (s)│
├──────────────────────┼─────────┼─────────┼────────┤
│qsp.mb.cpu0.core[0][0]│1_000_000│1_400_000│ 0.001│
└──────────────────────┴─────────┴─────────┴────────┘
The processor has run 1 million steps, taking 1.4 million cycles to execute
them. (The larger number is because QSP is configured so that certain operations
will stall the core for 20,000 cycles, see Stalling above, as well as the
output from the qsp.mb.cpu0.core[0][0].info command.) Let us now set the cycle
rate to the value mentioned above, 3 steps for every 4 cycles:
simics> qsp.mb.cpu0.core[0][0].set-step-rate "3/4"
Setting step rate to 3/4 steps/cycle
simics> bp.cycle.break 1_200_000
Breakpoint 2: qsp.mb.cpu0.core[0][0] will break at cycle 2600000
simics> c
[bp.cycle] Breakpoint 2: qsp.mb.cpu0.core[0][0] reached cycle 2600000
simics> print-time
┌──────────────────────┬─────────┬─────────┬────────┐
│ Processor │ Steps │ Cycles │Time (s)│
├──────────────────────┼─────────┼─────────┼────────┤
│qsp.mb.cpu0.core[0][0]│1_900_000│2_600_000│ 0.001│
└──────────────────────┴─────────┴─────────┴────────┘
When running the next 1.2 million cycles, the processor executes only 900000
steps, which corresponds to the 3/4 rate that we configured.
It is possible to set the step rate to infinity, or equivalently, to suspend
simulated time while executing steps. This is done by setting the
step_per_cycle_mode processor attribute to one of the following values:
"constant" Steps are executed at the constant and
finite rate specified in the step_rate attribute
"infinite"
Steps are executed with no progress in simulated time
While time is suspended, the cycle counter does not advance, nor are any time
events run. To the simulated machine this appears as if all instructions are run
infinitely fast. Using the same example as above, we set the step per cycle mode
to “infinite” to prevent the simulated time from advancing:
simics> qsp.mb.cpu0.core[0][0]->step_per_cycle_mode = "infinite"
simics> c 1_000_000
simics> print-time
┌──────────────────────┬─────────┬──────┬────────┐
│ Processor │ Steps │Cycles│Time (s)│
├──────────────────────┼─────────┼──────┼────────┤
│qsp.mb.cpu0.core[0][0]│1_000_000│ 0│ 0.000│
└──────────────────────┴─────────┴──────┴────────┘
The processor has executed 1 million steps but the simulated time has not
advanced. Note that setting this mode would probably prevent many machines from
booting since many hardware events (like interrupts) are time-based.
Conversely, it is possible set the step rate to zero, thus suspending execution
while letting simulated time pass. This can be done by stalling the processor
for a finite time (see Stalling above) or by disabling the processor for
an indefinite time. Disabling and re-enabling processors is done with the
commands <processor>.enable and <processor>.disable.
Simics can model systems with several processors, each with their own clock
frequency. In this case the definition of how long a cycle is becomes
processor-dependent. Ideally, Simics would make time progress and execute one
cycle at a time, scheduling processors according to their frequency. However,
perfect synchronization is exceedingly slow, so Simics serializes execution to
improve performance.
Simics does this by dividing time into segments and serializing the execution of
separate processors within a segment. The length of these segments is referred
to as the quantum and is specified in seconds. This is similar to the way
operating systems implement multitasking on a single-processor machine: each
process is given access to the processor and runs for a certain time quantum.
The processors are scheduled in a round-robin fashion, and when a particular
processor P has finished its quantum, all other processors will finish their
quanta before execution returns to P. The length of the time quantum can be
set by using the command set-time-quantum. The argument to set-time-quantum
is specified in either cycles referring to the first processor in the system or
seconds.
As an example, consider a dual-processor system where the first processor runs
at 4 MHz and the second at 1 MHz. Setting set-time-quantum to 10 will give a
quantum of 2.5 simulated microseconds. During each quantum, the first processor
will execute 10 steps, and the second 2 or 3 steps. During the following quanta
the second processor will continue to execute 2 or 3 steps each quantum, but it
is not defined exactly how many steps will be executed in each quantum. Stopping
the simulation does not affect this schedule, so human interaction with Simics
remains non-intrusive.
Note that if you are single-stepping (step-instruction) on a processor P,
which has just executed the last cycle of a quantum, the next single-step will
cause all other processors to advance an entire quantum and then P will stop
after one step. This behavior makes it convenient to follow the execution of
instructions on a particular processor. You can use the <processor>.print-time
command to see the current time on each particular processor in the simulated
machine.
For a multi-processor simulation to run efficiently, the quantum should not be
set too low, since a CPU switch causes simulator overhead. It should not be set
below 10, and should preferably be set to 50 or higher. The default value is
1000. For perfectly synchronized simulation, set the switch time to 1 (which
gives a very slow simulation but is useful for detailed cache studies, for
example).
Note that all of the above remains essentially the same when running a
distributed simulation.
Time events in Simics are executed when the processor on which they were posted
runs the triggering cycle during its quantum. However, it is possible to post
synchronizing time events that will ensure that all processors have the same
local time when the event is executed, independently of the time quantum.
Synchronizing events cannot be posted less than one time quantum in the future
unless the simulation is already synchronized.
Let us have a look at a 2-machines setup containing two SPARC SunFire machines
(with one processor each) to illustrate multiprocessor simulation. The processor
in the first machine runs at 168MHz; the other runs at 56MHz (equal to 168/3).
The time quantum (configured via the set-time-quantum command) is 1000 cycles
of the first processor, or 6 microseconds.
simics> d1_cpu0->freq_mhz
168
simics> d2_cpu0->freq_mhz
56
simics> sim->cpu-switch-time
Current CPU switch time: 1000 cycles (0.000006 seconds)
simics> c 10_000
simics> print-time -all
processor steps cycles time [s]
d1_cpu0 10000 10000 0.000
d2_cpu0 3333 3333 0.000
While the first processor executed 10000 steps, the second processor completed
3333 steps, which corresponds to the ratio between the two frequencies (168MHz
compared to 56MHz). Let us now examine the effects of the time quantum:
simics> c 30
simics> print-time -all
processor steps cycles time [s]
d1_cpu0 10030 10030 0.000
d2_cpu0 3333 3333 0.000
Although the first processor ran 30 steps further, the second processor has not
run the 10 steps that we would expect, and the frequency ratio is not respected
anymore. This is the effect of the 1000 cycles time quantum: the first processor
is scheduled for the next 1000 cycles and no other processor will be run until
the quantum is finished. If we switch to the second processor and try to make it
run one step further, we will observe the following:
simics> pselect d2_cpu0
simics> c 1
simics> print-time -all
processor steps cycles time [s]
d1_cpu0 11000 11000 0.000
d2_cpu0 3334 3334 0.000
The second processor has run 1 step further as requested, but the first had to
finish its time quantum before the second processor could be allowed to run,
which explains its step count of 11000 compared to 10030 before. Let us now set
the time quantum to 1 cycle:
simics> set-time-quantum count = 1 unit = "cycles"
The switch time will change to 1 cycles (for CPU-0) once all processors have synchronized.
simics> c 1
simics> print-time -all
processor steps cycles time [s]
d1_cpu0 11000 11000 0.000
d2_cpu0 3335 3335 0.000
Note that the new time quantum length will only become valid once all processors
have finished their current time quantum. This is why stepping one more step
forward with the second processor has not affected the first yet. Now let us
select the first processor again, and run three steps:
simics> pselect d1_cpu0
simics> c 3
simics> print-time -all
processor steps cycles time [s]
d1_cpu0 11003 11003 0.000
d2_cpu0 3668 3668 0.000
simics> c 3
simics> print-time -all
processor steps cycles time [s]
d1_cpu0 11006 11006 0.000
d2_cpu0 3669 3669 0.000
>
All processors finished their 1000 cycles time quantum and started to run with
the new 1 cycle value, which means that they are now advancing in lockstep. For
every 3 steps performed by the first processor, the second executes 1.