16.3 Distributed Simulation

Distributed simulation is used for connecting multiple Simics processes, possibly running on different host machines, to run synchronously and exchange data in a reliable and deterministic way.

16.3.1 Configuration

The Simics processes taking part in the combined simulation, here called nodes, are configured and managed individually. Each node sets up and runs its own configuration, has its own namespace, and is controlled by its own command-line or graphical interface.

Nodes are strung together by letting the local top-level synchronization domain in one node have a domain in another node as parent. Typically, there will be one global top-level domain in one node controlling the domains in all other nodes:

       D0                         global top-level domain
        |
        +---------+---------+
        |         |         |
       D1        D2        D3     local domains
        |         |         |
        +-+-+     +-+-+     +-+-+
        |   |     |   |     |   |
        C11 C12   C21 C22   C31 C32  cells
        |--------|---------|--------|
        node 1    node 2   node 3

In the above diagram, D0-D3 are synchronization domains and C11-C32 cells. D1, C11 and C12 are all in node 1, and so on. The top-level domain D0 could be placed either in a node of its own, without any cells, or in one of the other nodes. We will here assume it is located in node 1, the server node; the other nodes are then clients.

Domains in different nodes connect via proxies, which communicate over the network. The relation between D0 and D3 above is set up as follows:

           /     D0           sync_domain
   node 1 |      |
           \     D3_proxy     remote_sync_node
                 :
              (network connection)
                 :
           /     D0_proxy     remote_sync_domain
   node 3 |      |
           \     D3           sync_domain

The remote_sync_domain in the client node, D0_proxy, is created explicitly in the configuration for that node. The remote_sync_node in the server node is created automatically by a special server object when D0_proxy connects to the server node.

When a node has finished its configuration, it must inform the server so that other clients are allowed to connect. This is done by setting the finished attribute of the remote_sync_domain object (or of the remote_sync_server object in the server node) to None. As a result, nodes are configured in sequence.

The default domain used by cells is default_sync_domain, so by using this as the local domain name, existing non-distributed configurations can be reused. It is also a good idea to give the remote_sync_domain the same name as the actual top-level domain it is a proxy for; that way, it matters less in which node the top-level domain is placed.

The configuration script for a single node could look like the following Python fragment:

srv_host = "serverhost"  # machine the server node runs on
srv_port = 4567          # TCP port the server listens on

# Start by creating the global and/or local domain objects.
# this_is_the_server_node is a placeholder: it should be True only
# in the node chosen to host the top-level domain.

if this_is_the_server_node:
    topdom = SIM_create_object("sync_domain", "top_domain",
                               [["min_latency", 0.04]])
    rss = SIM_create_object("remote_sync_server", "rss",
                            [["domain", topdom], ["port", srv_port]])
else:
    # Client nodes: create a proxy for the top-level domain.
    # This will initiate a connection to the server.
    topdom = SIM_create_object("remote_sync_domain", "top_domain",
                               [["server",
                                 "%s:%d" % (srv_host, srv_port)]])
# create a local domain to be a parent for the cells in this node
SIM_create_object("sync_domain", "default_sync_domain",
                  [["sync_domain", topdom], ["min_latency", 0.01]])

# --- Here the rest of the node should be configured. ---

if this_is_the_server_node:
    rss.finished = None     # let clients connect to the server
else:
    topdom.finished = None  # let other clients connect to the server

At the end of this script, the configuration is finished for that node. Note that other nodes may not have finished theirs yet—the simulation cannot start until the entire system has been set up. The user can just wait for this to happen, or write a mechanism to block until the system is ready; see the section about global messages below.

16.3.2 Links

Links work across nodes in the same way as in a single process simulation. Using the same global ID for links in two different nodes ensures that they are considered as the same link in the distributed simulation. The global ID for a link is set using the global_id attribute when the link is created.
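
For illustration, a minimal sketch of what this could look like, assuming a link class from the Simics link library such as datagram_link (the class name, object name, and latency value here are examples):

# Run in every node: local link objects that share a global ID are
# treated as one and the same link in the distributed simulation.
link = SIM_create_object("datagram_link", "link0",
                         [["goal_latency", 1e-6],
                          ["global_id", "distributed_link0"]])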

There is one important aspect of link distribution that should be taken into account when creating distributed configurations.

When creating a single-session configuration, Simics provides a single object namespace, which means that every object in the session has a unique name. This property is used to keep link message delivery deterministic when no other way of comparing the messages is available. More precisely, messages arriving from different senders to the same receiver in the same cycle are sorted according to the pair (sender name, sender port).

In distributed sessions, however, Simics does not impose a single object namespace. This makes it possible for several objects with the same name to be connected to the same distributed link. As a consequence, message delivery as described in the previous paragraph may become nondeterministic, since different senders may report the same (sender name, sender port) pair. Distributed links report an error if such a configuration is detected.

The solution is to give unique names to the boards or machines that make up the complete distributed configuration.
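
As a sketch, assuming every node runs the same setup script, unique names can be derived from a per-node identifier (node_id below is a hypothetical variable that is given a different value in each node):

node_id = 1  # hypothetical: a different value in every node
# Deriving all board names from the node ID keeps the
# (sender name, sender port) pairs unique across nodes.
board_name = "node%d_board" % node_id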

Note: Deleting a distributed link is not supported.

16.3.3 Global Messages

There is a supporting mechanism for sending simple messages to all nodes in the combined system: SIM_trigger_global_message(msg) will trigger the global notifier Sim_Global_Notify_Message, whose callbacks can use SIM_get_global_message to obtain the message. A notifier listener could look like:

def global_message_callback(_, ref):
    print("got message {0}".format(SIM_get_global_message(ref)))

# my_ref is an arbitrary reference object chosen by the caller; it is
# passed unchanged to the callback as its second argument.
SIM_add_global_notifier(Sim_Global_Notify_Message, None,
                        global_message_callback, my_ref)

Global messages will arrive and be processed during a call to SIM_process_work(). This is useful for blocking further execution of a script until a certain message has arrived.
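
As a sketch of such a blocking mechanism, assume (hypothetically) that the nodes agree on the server broadcasting the string "all-configured" once every node has been set up; a client can then block until it has seen that message:

all_configured = False
my_ref = object()   # reference object identifying this listener

def config_done_callback(_, ref):
    global all_configured
    if SIM_get_global_message(ref) == "all-configured":
        all_configured = True

SIM_add_global_notifier(Sim_Global_Notify_Message, None,
                        config_done_callback, my_ref)

# Global messages are delivered during SIM_process_work, so this
# call returns once the "all-configured" message has arrived.
SIM_process_work(lambda arg: all_configured, None)

The server node would call SIM_trigger_global_message("all-configured") at the point where its own configuration is complete.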

Global messages can be used to assist in configuring and running a distributed system. A typical use is to let the server node broadcast a message once all nodes have finished their configuration, so that each node knows when it is safe to start the simulation (see section 16.3.1).

This facility is not intended for simulation use; message delivery is reliable but not deterministic in timing. It should be regarded as a low-level mechanism to be used as a building block for higher-level communication.

16.3.4 Running the simulation

Each node will have to start the simulation for the combined system to make progress. If one node stops the simulation—by user intervention or because of a coded breakpoint—the remaining nodes will be unable to continue, because the top-level synchronization domain keeps cells in different nodes from diverging.

Each node can only access the objects it has defined locally. This means that inspection and interaction with the system must be done separately for each node. A combined front-end interface is not available for Simics at this time.

When one Simics process terminates, the other nodes will automatically exit as well.

16.3.5 Saving and restoring checkpoints

The state of a distributed simulation can be saved by issuing write-configuration in each node separately. To restore such a checkpoint, start an equal number of (empty) Simics processes and read the saved configuration for each node.

Note that the server host name and port number are saved in the configuration files. These have to be edited if the checkpoint is restored with a different server host, or if the port number needs to be changed. Similarly, if SimicsFS is mounted to a file system, it will be saved in the checkpoint. Not all connections to real network or file systems can be restored at a later time.

Note as well that the configurations must again be read in sequence, using the finished attribute to control which session is allowed to connect. The order in which the client configurations are read does not matter, as long as the server node is started first.
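
As an illustration, saving and restoring could look like the following sketch, which drives the standard write-configuration and read-configuration commands from Python (the checkpoint file name is an example):

from cli import run_command

# In each node, save that node's part of the distributed state.
run_command("write-configuration ckpt-node1")

# To restore: start one empty Simics process per node, read the
# server node's configuration first, and then each client in turn,
# using the finished attribute to serialize the connections.
run_command("read-configuration ckpt-node1")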

16.3.6 Security

The distributed simulation facility has no authentication mechanism and is not security-hardened. It should therefore only be used in a trusted network environment.
