For an introduction to hypersimulation, see the Performance chapter of the Simics User Guide.
Simics has a module called hypersim-pattern-matcher that manages a number of specific idle-loop machine code patterns, called hypersim-patterns. Each pattern registers itself in a global database, from which the matcher selects the patterns that match the simulated architecture. The matcher then inserts hooks into those processors to detect when the binary code matches any of the patterns. If a match is found, an execute breakpoint is set at that address so that the pattern can fast-forward the simulation every time the CPU reaches it.
The Automatic Hypersimulation feature in Simics 6 reduces the need to write manual patterns. The ability to detect loops that can be hypersimulated is part of the JIT compiler for PPC, ARM and MIPS processor models. A loop can be detected only if it fits into one compilation unit and contains no store instruction in the loop path. Automatically detected loops have lower insertion overhead and simulate faster than loops handled by the hypersim-pattern-matcher. Consequently, old manually written patterns that describe simple loops might now be detected automatically, with better performance. Such manual patterns should therefore be removed.
By idle loop we mean any kind of loop that does not calculate any value; this includes spin-locks, busy-wait loops, variable polling, and timing loops.
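As a rough illustration, target software often contains loops of the following shapes. This is a hypothetical C sketch, not code taken from any particular target; the names ready_flag, wait_for_flag and delay_loop are invented for the example.

```c
/* Hypothetical target-side code illustrating two kinds of idle loop. */

/* Assumed to be set by an interrupt handler or another processor. */
static volatile int ready_flag;

/* Variable polling: spins until the flag changes, computing nothing. */
static void
wait_for_flag(void)
{
        while (!ready_flag)
                ;
}

/* Timing loop: burns a fixed number of iterations as a crude delay. */
static unsigned
delay_loop(unsigned iterations)
{
        volatile unsigned i = 0;
        while (i < iterations)
                i++;
        return i;
}
```

In both cases every iteration either changes nothing or changes state in a fully predictable way, which is what makes it possible to skip ahead.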
The easiest way to locate the idle loop is to pause the simulation when it appears to do nothing but still takes time to simulate. Chances are high that the CPU is stopped somewhere in the idle loop. You can then use the step-instruction command to single-step and inspect the instructions being executed. If you see a repetitive pattern, there is a small loop that could possibly be optimized. The -r flag to step-instruction prints register updates, which helps show whether a counter is involved in the loop.
The next step is to verify that the loop only accesses RAM at predictable addresses. A loop that polls a device cannot be hypersimulated unless the device accesses are free from side effects. To inspect this, you can insert a tracer using new-tracer and trace-io.
Each pattern should register itself as a Simics Class with certain interfaces and attributes. Please see section 47.2.4 for detailed information.
The pattern matcher accepts patterns in two different formats. The generic format is a hexadecimal or binary string. However, since architectures with constant-width 4-byte instructions are common, there is a simpler format: a list of (value, mask) pairs. This form specifies the instruction opcodes and masks to match in 32-bit big-endian format.
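To make the (value, mask) idea concrete, here is a minimal sketch of how such pairs could be compared against code bytes. It is not the matcher's implementation: the type insn_match_t and the functions load_be32 and code_matches are hypothetical names chosen for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pair type for this sketch; not the matcher's own type. */
typedef struct {
        uint32_t value;  /* expected bits */
        uint32_t mask;   /* 1 = bit must match, 0 = don't care */
} insn_match_t;

/* Read one 32-bit big-endian instruction word from a byte buffer. */
static uint32_t
load_be32(const uint8_t *p)
{
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
                | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return true if each of the n instructions in code agrees with its
   (value, mask) pair on every bit selected by the mask. */
static bool
code_matches(const uint8_t *code, const insn_match_t *pat, size_t n)
{
        for (size_t i = 0; i < n; i++) {
                uint32_t insn = load_be32(code + 4 * i);
                if ((insn & pat[i].mask) != (pat[i].value & pat[i].mask))
                        return false;
        }
        return true;
}
```

For a three-instruction PowerPC loop (nop, addi rX,rY,1, branch back by 8 bytes), the pairs could be {0x60000000, 0xffffffff}, {0x38000001, 0xfc00ffff} and {0x4bfffff8, 0xffffffff}, where the zero bits in the second mask leave both register fields of the addi free.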
The simplest option is to convert the loop into a pattern that must match exactly. More general patterns are also possible, for example by allowing any register or by leaving out instructions that are not part of the loop. However, we recommend keeping the pattern exact until you actually encounter another loop that could have matched too. See section 47.2.1 for how to match with wildcards.
If the loop consists of two separate parts, you can use a main pattern and a sub-pattern. See section 47.2.2 for more information on how to match sub-patterns.
The next step is to determine what should happen every time the identified loop is entered. First, the necessary preconditions should be checked.
When all preconditions are fulfilled, the simulation can be fast-forwarded. To make it easier to update the state after fast-forwarding, we only fast-forward a whole number of iterations and let the processor simulate the last iteration. The pattern then does not need to do any state updates by itself.
If the pattern gives no limit on how far a processor can be fast-forwarded, the pattern matcher fast-forwards it to the next posted event. The advance() function in the step interface is used to let the CPU actually fast-forward.
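The arithmetic behind "a whole number of iterations, with the last one simulated normally" can be sketched as follows. This only illustrates the idea and is not the matcher's code; the function name steps_to_skip is hypothetical, and the number of steps to the next event is assumed to be known already.

```c
#include <stdint.h>

/* Given the number of steps until the next event and the number of
   instructions in one loop iteration, return how many steps could be
   skipped: a whole number of iterations, leaving the final iteration
   before the event to be simulated instruction by instruction. */
static uint64_t
steps_to_skip(uint64_t steps_to_event, uint64_t loop_len)
{
        if (loop_len == 0)
                return 0;
        uint64_t iterations = steps_to_event / loop_len;
        if (iterations == 0)
                return 0;       /* event arrives within this iteration */
        return (iterations - 1) * loop_len;
}
```

With 10 steps to the next event and a 3-instruction loop, 3 whole iterations fit, so 2 iterations (6 steps) are skipped and the third runs normally.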
If the pattern needs to perform side effects, it gets the actual number of steps forwarded as the return value from the ffwd_time() function. Here is an example that needs to update a loop register:
static bool
pattern_triggered(conf_object_t *obj, pattern_t *handle, void *loop_data,
                  conf_object_t *cpu, physical_address_t paddr)
{
        test_pattern_t *tp = (test_pattern_t *)obj;
        struct per_loop_data *loop = loop_data;
        if (SIM_object_class(cpu) != loop->cached_class) {
                /* Either there is no cached interface pointer or it was cached
                   for another cpu class, so read them out again. */
                if (cache_interfaces(obj, loop, cpu) == 0)
                        return false;
        }
        int steps = tp->matcher.iface->ffwd_time(
                tp->matcher.obj,
                handle,
                cpu,
                LOOP_INSTR,        /* Minimal step count */
                0,                 /* Maximum step count */
                0,                 /* Steps to keep */
                COUNT_AS_IDLE);
        /* Advance the loop register by the number of skipped iterations */
        int regno = loop->cpuclass_regno;
        loop->reg_iface->write(cpu, regno,
                               loop->reg_iface->read(cpu, regno)
                               + steps / LOOP_INSTR);
        return true;
}
The quickest way to test that patterns are deterministic is to run your machine with and without hypersim-patterns installed and check whether the state differs.
The impact of hypersimulating idle patterns can be measured with the system-perfmeter tool. The hypersim-status command also shows how many steps have been skipped.
If the performance did not improve as much as expected, it might be that the pattern matched often but the examine_pattern() method rejected it, or that the preconditions were not fulfilled often enough in the pattern_triggered() call. The percentage of successful calls can be listed with the hypersim-status -v command.
Often, recompiling the software leads to small differences in the bit pattern of the idle loop. A pattern can be made to match many variants by placing wildcards in it. A wildcard is written by replacing a digit in the pattern with a + sign or, in the specialized pattern style, with zeros in the mask. One of our test patterns has wildcards on the register numbers.
static const char * const pattern[] = {
        "0x60000000",                   /* A. nop */
        "0b0011_10++_++++_++++ 0x0001", /* B. addi rX,rY,1 */
        "0x4bfffff8",                   /* C. b A */
        NULL
};
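To show how the string form relates to the (value, mask) form, here is a small sketch that converts one pattern string into a 32-bit value and mask. The syntax rules it applies are inferred from the pattern strings above and are assumptions, not a description of the matcher's parser: groups are separated by spaces, each group starts with 0x or 0b, an underscore is a visual separator, and a + stands for one wildcard digit (4 bits in a hexadecimal group, 1 bit in a binary group). The function name parse_insn_pattern is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch only: parse one 32-bit instruction pattern string into a value
   and a mask. Returns false on a syntax error or if the string does not
   describe exactly 32 bits. */
static bool
parse_insn_pattern(const char *s, uint32_t *value, uint32_t *mask)
{
        uint32_t v = 0, m = 0;
        int bits = 0;   /* bits consumed so far */
        int width = 0;  /* bits per digit in current group, 0 = no group */

        for (; *s != '\0'; s++) {
                char c = *s;
                if (c == ' ') {         /* a space ends the current group */
                        width = 0;
                        continue;
                }
                if (c == '_')           /* visual separator, ignored */
                        continue;
                if (width == 0) {       /* expect a 0x or 0b prefix */
                        if (c != '0' || (s[1] != 'x' && s[1] != 'b'))
                                return false;
                        width = (s[1] == 'x') ? 4 : 1;
                        s++;            /* skip the radix letter */
                        continue;
                }
                uint32_t digit = 0;
                uint32_t digit_mask = (1u << width) - 1;
                if (c == '+')
                        digit_mask = 0; /* wildcard: these bits are free */
                else if (c >= '0' && c <= '9')
                        digit = (uint32_t)(c - '0');
                else if (c >= 'a' && c <= 'f')
                        digit = (uint32_t)(c - 'a') + 10;
                else if (c >= 'A' && c <= 'F')
                        digit = (uint32_t)(c - 'A') + 10;
                else
                        return false;
                if (digit >> width)     /* e.g. '2' inside a binary group */
                        return false;
                bits += width;
                if (bits > 32)
                        return false;
                v = (v << width) | digit;
                m = (m << width) | digit_mask;
        }
        if (bits != 32)
                return false;
        *value = v;
        *mask = m;
        return true;
}
```

Under these assumptions, the addi line of the pattern yields the value 0x38000001 with mask 0xfc00ffff: the opcode and the immediate are fixed, and both register fields are free.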
If there are constraints on the wildcards, they need to be checked in the examine function. To continue the previous example, here is how it verifies that the same register is used as source and destination in the addi instruction.
static void *
examine_pattern(conf_object_t *obj, pattern_t *handle, conf_object_t *cpu,
                logical_address_t vaddr, physical_address_t paddr)
{
        uint32 addi_insn = SIM_read_phys_memory(cpu, paddr + 4, 4);
        if (SIM_clear_exception() != SimExc_No_Exception)
                return NULL;
        int reg = (addi_insn >> 16) & 31;
        if (((addi_insn >> 21) & 31) != reg) {
                /* addi rX,rY,1 where X != Y means no match */
                return NULL;
        }
        struct per_loop_data *loop = MM_ZALLOC(1, struct per_loop_data);
        loop->incremented_register = reg;
        return loop;
}
If the idle loop contains subroutine calls, the subroutine needs to be checked against a pattern too. This is done in the examine function. An example from the mpc-u-boot-hypersim-patterns module is given below.
/* This loop busy polls the timebase, waiting for it to reach a certain value */
static const char * const pattern[] = {
        "0b010010++ 0x++++ 0b++++++01", // 0: @ bl (get_ticks)
        "0x7c84 0b+++++000 0x10",       // 1: @ subfc r4,r4,rX
        "0x7c63 0b+++++001 0x11",       // 2: @ subfe. r3,r3,rY
        "0x4080fff4",                   // 3: @ bge
        NULL
};
/* Function used by main loop - reads the timebase */
static const char * const sub_pattern[] = {
        "0x7c6d42 0b1+100110",          // 0: @ mfspr r3,269 # utbu
        "0x7c8c42 0b1+100110",          // 1: @ mfspr r4,268 # utbl
        "0x7cad42 0b1+100110",          // 2: @ mfspr r5,269 # utbu
        "0x7c032800",                   // 3: @ cmpw r3,r5
        "0x4082fff0",                   // 4: @ bne
        "0x4e800020",                   // 5: @ blr
        NULL
};
static void *
examine_pattern(conf_object_t *obj, pattern_t *handle, conf_object_t *cpu,
                logical_address_t vaddr, physical_address_t paddr)
{
        wait_ticks_pattern_t *pat = SIM_object_data(obj);
        /* Extract address of sub-pattern */
        uint32 insn = SIM_read_phys_memory(cpu, paddr + (BL_IDX * 4), 4);
        if (SIM_clear_exception() != SimExc_No_Exception) {
                SIM_LOG_ERROR(pat->obj, 0,
                              "failed to load branch-and-link opcode");
                return 0;
        }
        int32 li = (int32)((insn & 0x3ffffffc) << 6) >> 6; // sign-extend
        uint32 bl_addr = vaddr + BL_IDX * 4 + li;
        /* Match sub pattern and get physical address of it */
        physical_address_t sub_pattern_paddr;
        if (!pat->matcher.iface->check_pattern(pat->matcher.obj,
                                               cpu,
                                               bl_addr,
                                               pat->sub_pattern,
                                               &sub_pattern_paddr)) {
                SIM_LOG_INFO(2, pat->obj, 0,
                             "sub-pattern mismatch at v:0x%x", bl_addr);
                return 0;
        }
        pat->matcher.iface->protect_region(pat->matcher.obj,
                                           handle,
                                           paddr,
                                           sub_pattern_paddr,
                                           SUB_LOOP_INSTR);
        /* Get input parameter register numbers */
        uint32 subfe = SIM_read_phys_memory(cpu, paddr + (SUBFE_IDX * 4), 4);
        if (SIM_clear_exception() != SimExc_No_Exception) {
                SIM_LOG_ERROR(pat->obj, 0, "failed to load subfe opcode");
                return 0;
        }
        int subfe_rb = get_rb_from_xo_form(subfe);
        uint32 subfc = SIM_read_phys_memory(cpu, paddr + (SUBFC_IDX * 4), 4);
        if (SIM_clear_exception() != SimExc_No_Exception) {
                SIM_LOG_ERROR(pat->obj, 0, "failed to load subfc opcode");
                return 0;
        }
        int subfc_rb = get_rb_from_xo_form(subfc);
        SIM_LOG_INFO(2, pat->obj, 0,
                     "Examine pattern:\n"
                     " Pattern @ 0x%llx\n"
                     " Sub-pattern @ 0x%llx\n"
                     " input-params in r%d/r%d",
                     paddr, sub_pattern_paddr, subfe_rb, subfc_rb);
        return add_pattern_info(pat, paddr, subfe_rb, subfc_rb);
}
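The branch-target computation in the examine function above relies on right-shifting a negative signed value, which is implementation-defined in C. As a hedged alternative, the following sketch computes the target of a relative PowerPC branch (b or bl with the AA bit clear) using unsigned arithmetic only. The function name branch_target is hypothetical and not part of the module.

```c
#include <stdint.h>

/* Sketch: target address of a relative PowerPC I-form branch.
   insn_vaddr is the virtual address of the branch instruction itself. */
static uint32_t
branch_target(uint32_t insn_vaddr, uint32_t insn)
{
        /* The LI field occupies bits 2..25 (bit 0 = least significant);
           masking off the opcode and the AA/LK bits leaves a 26-bit
           signed byte offset. */
        uint32_t li = insn & 0x03fffffcu;
        if (li & 0x02000000u)           /* sign bit of the 26-bit offset */
                li |= 0xfc000000u;      /* sign-extend to 32 bits */
        return insn_vaddr + li;         /* wraps modulo 2^32 */
}
```

For instance, the instruction 0x4bfffff8 placed at offset 8 branches back to offset 0, which matches the "b A" line in the wildcard example earlier in this chapter.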
Sub-patterns need to be registered with the hypersim-pattern-matcher before they can be matched against. This call parses the opcode strings and creates a more efficient internal format. Do this in the finalize function.
static void
finalize_instance(conf_object_t *obj)
{
        wait_ticks_pattern_t *pat = SIM_object_data(obj);
        pat->handle = pat->matcher.iface->
                install_pattern(pat->matcher.obj,
                                obj,
                                "mpc-u-boot-wait-ticks",
                                pattern,
                                4 * MAIN_LOOP_INSTR);
        pat->sub_pattern = pat->matcher.iface->
                register_sub_pattern(pat->matcher.obj,
                                     pat->handle,
                                     sub_pattern);
}
Sub-patterns need to be unregistered when the object is deleted.
static void
pre_delete_instance(conf_object_t *obj)
{
        wait_ticks_pattern_t *pat = SIM_object_data(obj);
        pat->matcher.iface->uninstall_pattern(pat->matcher.obj, pat->handle);
}
A loop that polls a cycle counter is called a time-dependent loop. One example is a loop that reads the timebase register on a PPC processor; its pattern is given in section 47.2.2. Since steps and cycles do not correspond 1-to-1 in Simics, we need to take the conversion factor into account. In the case of the timebase, the conversion is given by the equation:
steps = timebase cycles × cpu frequency × step-rate / timebase frequency
Below is an example from [simics]/src/extensions/mpc-u-boot-hypersim-patterns/wait-ticks.c. In this example, the step-rate is not taken into account, so it only works for a 1:1 step-rate.
static bool
pattern_triggered(conf_object_t *obj, pattern_t *handle, void *loop_data,
                  conf_object_t *cpu, physical_address_t paddr)
{
        wait_ticks_pattern_t *pat = SIM_object_data(obj);
        cpu_info_t *c = get_cpu_info(pat, cpu);
        pattern_info_t *p = loop_data;
        const int_register_interface_t *int_reg_iface = c->int_iface;
        /* Get stop value of timebase */
        uint32 tb_high = int_reg_iface->read(cpu, c->r0 + p->input_high);
        uint32 tb_low = int_reg_iface->read(cpu, c->r0 + p->input_low);
        cycles_t tb_stop = (cycles_t)tb_high << 32 | tb_low;
        /* Get current timebase value */
        uint32 tbu = int_reg_iface->read(cpu, c->tbu);
        uint32 tbl = int_reg_iface->read(cpu, c->tbl);
        cycles_t tb_start = (cycles_t)tbu << 32 | tbl;
        if (tb_stop == tb_start)
                return true;
        /* Remove one tb-tick since we might have partially executed the
           next timebase cycle (tb.remainder != 0) */
        uint64 tb_diff = tb_stop - tb_start - 1;
        uint64 c_cycles_low, c_cycles_high;
        uint64 cpu_ticks = 0, rem;
        unsigned_multiply_64_by_64(tb_diff, c->cpu_freq,
                                   &c_cycles_high, &c_cycles_low);
        unsigned_divide_128_by_64(c_cycles_high, c_cycles_low, c->tb_freq,
                                  &cpu_ticks, &rem);
        if (cpu_ticks == 0)
                return true;
        pat->matcher.iface->ffwd_time(
                pat->matcher.obj,
                handle,
                cpu,
                TOTAL_LOOP_INSTR, /* Minimal step count */
                cpu_ticks,        /* Maximum step count */
                TOTAL_LOOP_INSTR, /* Steps to keep (run last iter for real) */
                0);               /* Not an idle loop */
        return true;
}
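The example above leaves out the step-rate factor from the conversion equation. A minimal sketch of the full conversion could look as follows. It is an illustration only and makes several assumptions: the step-rate is represented as a hypothetical steps-per-cycle fraction step_num/step_den, the function names are invented, and the 128-bit intermediate relies on the unsigned __int128 compiler extension (available in GCC and Clang) rather than on the Simics helper functions used above.

```c
#include <stdint.h>

/* Saturate a 128-bit value to 64 bits. */
static uint64_t
clamp_u64(unsigned __int128 x)
{
        return x > UINT64_MAX ? UINT64_MAX : (uint64_t)x;
}

/* steps = tb_cycles * cpu_freq * step-rate / tb_freq, where the
   step-rate is assumed to be step_num / step_den steps per cycle.
   Both divisions round down, so the result never overshoots. */
static uint64_t
tb_cycles_to_steps(uint64_t tb_cycles, uint64_t cpu_freq, uint64_t tb_freq,
                   uint64_t step_num, uint64_t step_den)
{
        if (tb_freq == 0 || step_den == 0)
                return 0;
        /* Timebase cycles to CPU cycles, using a 128-bit product. */
        uint64_t cpu_cycles = clamp_u64(
                (unsigned __int128)tb_cycles * cpu_freq / tb_freq);
        /* CPU cycles to steps. */
        return clamp_u64((unsigned __int128)cpu_cycles * step_num / step_den);
}
```

Rounding down is the conservative choice here: fast-forwarding slightly too little only costs a few extra simulated iterations, whereas overshooting would move the processor past the point where the loop should exit.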
When Simics is started, the individual pattern modules announce their existence by calling the Python function hypersim_patterns.add() from their simics_start.py files.
When a machine configuration is later loaded, its CPU architectures are checked against the registered patterns.
If there are patterns matching an architecture, an object of type hypersim-pattern-matcher is created and attached to the physical memory-space of that processor. This matcher object will, in turn, create one object for each registered pattern.
The reason a pattern needs to register both in Python code and later in C code is to avoid loading modules for hypersim patterns unrelated to the architecture being simulated.
The hypersim-pattern-matcher creates patterns with the SIM_create_object() call and sets the "matcher" attribute to point back to itself. That means there will be one pattern instance for each physical memory-space in the system.
For more complete documentation on related classes and commands such as enable-hypersim, disable-hypersim, hypersim-status and list-hypersim-patterns, please refer to the Reference Manual.
The hypersim_pattern_matcher and hypersim_pattern interfaces are documented in the API Reference Manual.