Scanning for Patterns

Hyperscan provides three different scanning modes, each with its own scan function beginning with hs_scan. In addition, streaming mode has a number of other API functions for managing stream state.

Handling Matches

All of these functions will call a user-supplied callback function when a match is found. This function has the following signature:

typedef int (*match_event_handler)(unsigned int id, unsigned long long from, unsigned long long to, unsigned int flags, void *context)

The id argument will be set to the identifier for the matching expression provided at compile time, and the to argument will be set to the end-offset of the match. If SOM was requested for the pattern (see Start of Match), the from argument will be set to the leftmost possible start-offset for the match.

The match callback function has the capability to halt scanning by returning a non-zero value.

See match_event_handler for more information.

Streaming Mode

The core of the Hyperscan streaming runtime API consists of functions to open, scan, and close Hyperscan data streams:

  • hs_open_stream(): allocates and initializes a new stream for scanning.

  • hs_scan_stream(): scans a block of data in a given stream, raising matches as they are detected.

  • hs_close_stream(): completes scanning of a given stream (raising any matches that occur at the end of the stream) and frees the stream state. After a call to hs_close_stream(), the stream handle is invalid and should not be used again for any purpose.

Any matches detected in the data as it is scanned are returned to the calling application via a function pointer callback.

The match callback function has the capability to halt scanning of the current data stream by returning a non-zero value. In streaming mode, the result of this is that the stream is then left in a state where no more data can be scanned, and any subsequent calls to hs_scan_stream() for that stream will return immediately with HS_SCAN_TERMINATED. The caller must still call hs_close_stream() to complete the clean-up process for that stream.

Streams exist in the Hyperscan library so that pattern matching state can be maintained across multiple blocks of target data – without maintaining this state, it would not be possible to detect patterns that span these blocks of data. This, however, does come at the cost of requiring an amount of storage per-stream (the size of this storage is fixed at compile time), and a slight performance penalty in some cases to manage the state.

While Hyperscan does always support a strict ordering of multiple matches, streaming matches will not be delivered at offsets before the current stream write, with the exception of zero-width asserts, where constructs such as \b and $ can cause a match on the final character of a stream write to be delayed until the next stream write or stream close operation.

Stream Management

In addition to hs_open_stream(), hs_scan_stream(), and hs_close_stream(), the Hyperscan API provides a number of other functions for the management of streams:

Stream Compression

A stream object is allocated as a fixed size region of memory which has been sized to ensure that no memory allocations are required during scan operations. When the system is under memory pressure, it may be useful to reduce the memory consumed by streams that are not expected to be used soon. The Hyperscan API provides calls for translating a stream to and from a compressed representation for this purpose. The compressed representation differs from the full stream object as it does not reserve space for components which are not required given the current stream state. The Hyperscan API functions for this functionality are:

  • hs_compress_stream(): fills the provided buffer with a compressed representation of the stream and returns the number of bytes consumed by the compressed representation. If the buffer is not large enough to hold the compressed representation, HS_INSUFFICIENT_SPACE is returned along with the required size. This call does not modify the original stream in any way: it may still be written to with hs_scan_stream(), used as part of the various reset calls to reinitialise its state, or hs_close_stream() may be called to free its resources.

  • hs_expand_stream(): creates a new stream based on a buffer containing a compressed representation.

  • hs_reset_and_expand_stream(): constructs a stream based on a buffer containing a compressed representation on top of an existing stream, resetting the existing stream first. This call avoids the allocation done by hs_expand_stream().

Note: it is not recommended to use stream compression between every call to scan for performance reasons as it takes time to convert between the compressed representation and a standard stream.

Block Mode

The block mode runtime API consists of a single function: hs_scan(). Using the compiled patterns this function identifies matches in the target data, using a function pointer callback to communicate with the application.

This single hs_scan() function is essentially equivalent to calling hs_open_stream(), making a single call to hs_scan_stream(), and then hs_close_stream(), except that block mode operation does not incur all the stream related overhead.

Vectored Mode

The vectored mode runtime API, like the block mode API, consists of a single function: hs_scan_vector(). This function accepts an array of data pointers and lengths, facilitating the scanning in sequence of a set of data blocks that are not contiguous in memory.

From the caller’s perspective, this mode will produce the same matches as if the set of data blocks were (a) scanned in sequence with a series of streaming mode scans, or (b) copied in sequence into a single block of memory and then scanned in block mode.

Scratch Space

While scanning data, Hyperscan needs a small amount of temporary memory to store on-the-fly internal data. This amount is unfortunately too large to fit on the stack, particularly for embedded applications, and allocating memory dynamically is too expensive, so a pre-allocated “scratch” space must be provided to the scanning functions.

The function hs_alloc_scratch() allocates a large enough region of scratch space to support a given database. If the application uses multiple databases, only a single scratch region is necessary: in this case, calling hs_alloc_scratch() on each database (with the same scratch pointer) will ensure that the scratch space is large enough to support scanning against any of the given databases.

While the Hyperscan library is re-entrant, the use of scratch spaces is not. For example, if by design it is deemed necessary to run recursive or nested scanning (say, from the match callback function), then an additional scratch space is required for that context.

In the absence of recursive scanning, only one such space is required per thread and can (and indeed should) be allocated before data scanning is to commence.

In a scenario where a set of expressions are compiled by a single “main” thread and data will be scanned by multiple “worker” threads, the convenience function hs_clone_scratch() allows multiple copies of an existing scratch space to be made for each thread (rather than forcing the caller to pass all the compiled databases through hs_alloc_scratch() multiple times).

For example:

hs_error_t err;
hs_scratch_t *scratch_prototype = NULL;
err = hs_alloc_scratch(db, &scratch_prototype);
if (err != HS_SUCCESS) {
    printf("hs_alloc_scratch failed!");
    exit(1);
}

hs_scratch_t *scratch_thread1 = NULL;
hs_scratch_t *scratch_thread2 = NULL;

err = hs_clone_scratch(scratch_prototype, &scratch_thread1);
if (err != HS_SUCCESS) {
    printf("hs_clone_scratch failed!");
    exit(1);
}
err = hs_clone_scratch(scratch_prototype, &scratch_thread2);
if (err != HS_SUCCESS) {
    printf("hs_clone_scratch failed!");
    exit(1);
}

hs_free_scratch(scratch_prototype);

/* Now two threads can both scan against database db,
   each with its own scratch space. */

Custom Allocators

By default, structures used by Hyperscan at runtime (scratch space, stream state, etc) are allocated with the default system allocators, usually malloc() and free().

The Hyperscan API provides a facility for changing this behaviour to support applications that use custom memory allocators.

These functions are:

The hs_set_allocator() function can be used to set all of the custom allocators to the same allocate/free pair.