10. Current status and plans

TCF is currently used to drive hardware testing of the Zephyr OS inside Intel; while it was designed with that use in mind, it is by no means restricted to running Zephyr testcases.

Our current deployment looks like:

  • six target servers spread around two continents
  • Jenkins launching the runs on six different slaves in a matrix build
  • around 70 MCU targets of 16 different types, covering x86, ARC, ARM and NIOS2 architectures, plus PC-class machines and network setups (Ethernet and Bluetooth)
  • power switches, YKUSH power switching hubs, serial consoles, etc
  • Ansible used to keep all the servers set up correctly

A run covers all of Zephyr Sanity Check plus a number of other internal testcases, samples and combinations (about 14000 testcases in total). Depending on how many runs are competing for the hardware, it usually completes in about 95 min (~150 TCs/min).

This gives us almost real-time feedback in a continuous integration fashion. We are looking into extending this to commit verification, to give developers feedback on their proposed changes.

10.1. Challenges

Achieving high target-per-server density to reduce cost is possible; the main problems are:

  • USB bandwidth gets consumed rapidly, so more USB Host Controllers need to be added to the system

  • this rate of activity places a huge strain on the USB host controllers (the system is basically plugging, unplugging and resetting multiple devices many times per minute) and some of them will just die without giving any feedback about it. All that is left is a USB serial port that seems connected but provides no output, with no signs in the kernel to tell what is going on.

    When this happens, re-enumerating the driver sometimes helps; other times only a server power-cycle will fix it. Enterprise-class USB host controller hardware seems to cope better, but YMMV. The alternative is to reduce density.
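
One way to force a driver re-enumeration on Linux is to unbind the host controller from its driver and bind it again through sysfs. The sketch below is a minimal illustration of that idea, not something TCF ships: the function name, the settle delay and the example PCI address are assumptions, and the driver directory depends on the controller in use (xhci_hcd is the usual Linux driver for USB 3 host controllers). It needs root privileges when pointed at the real sysfs tree.

```python
import os
import time


def rebind_driver(device_id, driver_dir, settle_seconds=1.0):
    """Unbind a device from its driver and bind it again (hypothetical helper).

    device_id  -- bus address of the host controller, for example a PCI
                  address such as "0000:00:14.0" (assumed, check lspci)
    driver_dir -- sysfs directory of the driver, for example
                  "/sys/bus/pci/drivers/xhci_hcd"
    """
    for action in ("unbind", "bind"):
        # Writing the device address to the driver's "unbind" / "bind"
        # file detaches / re-attaches the device.
        with open(os.path.join(driver_dir, action), "w") as f:
            f.write(device_id)
        # Give the kernel a moment to tear down / re-enumerate the devices.
        time.sleep(settle_seconds)
```

A call would look like rebind_driver("0000:00:14.0", "/sys/bus/pci/drivers/xhci_hcd"). Every device hanging off that controller disappears and comes back, so targets connected to it should be idle when this is done.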

10.2. The code

The code base has evolved a lot over time and has multiple places that could use improvement, refactoring and rethinking.

Main areas of improvement:

  • the web server (Tornado) is hardwired to start N servers, and requests for any target bounce among them; it would be a good idea to make N dynamic, or even to bind a server to a single target, since most usage models access a target sequentially.

  • the target acquisition mechanism is currently based on a very simplistic random poll mechanism, which puts an unnecessary strain on the network when there is a lot of contention. There is also no way to configure prioritization (a developer over Jenkins, for example). [tbdl.tt.targets_assign]

    This needs to be replaced with a simple queue and event mechanism that lets the client know when its request for assignment has completed and it can start operating on the target. This will require changes in both server and client.

  • the code that generates target groups for testcases / testgroup pairing needs to be broken up and extended to take advantage of more possibilities (in the case of target groups with interconnects) [tcfl.tt._run_on_targets]

  • it would also make sense to abstract the ttbd_client access layer into something that allows the client to run testcases on other target servers, not just the ttbd server.
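
The queue and event mechanism proposed for target acquisition could be sketched as below. This is only an in-process illustration under stated assumptions: the class name TargetQueue, its methods and the numeric priorities are all hypothetical, and a real implementation would live in the server and notify remote clients rather than set a local threading.Event. It does show the two properties asked for: waiters are served by priority (then in arrival order), and nobody has to poll.

```python
import heapq
import itertools
import threading


class TargetQueue:
    """Hypothetical waiting queue for one target.

    Lower priority numbers are served first; requests with the same
    priority are served in arrival order.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = []                  # heap of (priority, seq, owner, event)
        self._sequence = itertools.count()  # tie-breaker that keeps FIFO order
        self.owner = None                   # who currently holds the target

    def request(self, owner, priority):
        """Queue a request; the returned Event is set once owner holds the target."""
        granted = threading.Event()
        with self._lock:
            entry = (priority, next(self._sequence), owner, granted)
            heapq.heappush(self._waiters, entry)
            self._assign_next()
        return granted

    def release(self, owner):
        """Give the target back and hand it to the best waiting request."""
        with self._lock:
            if self.owner != owner:
                raise ValueError("%s does not hold the target" % owner)
            self.owner = None
            self._assign_next()

    def _assign_next(self):
        # Called with the lock held: if the target is free, wake the next waiter.
        if self.owner is None and self._waiters:
            _, _, owner, granted = heapq.heappop(self._waiters)
            self.owner = owner
            granted.set()


queue = TargetQueue()
ci_run = queue.request("jenkins-1", priority=10)    # target is free: granted at once
other_ci = queue.request("jenkins-2", priority=10)  # waits behind jenkins-1
developer = queue.request("developer", priority=1)  # waits, but ahead of jenkins-2
queue.release("jenkins-1")                          # the developer is served next
```

A client would block on the returned event (granted.wait(timeout)) instead of polling the server at random intervals, which is where the network savings under contention would come from.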