Flat Index

Main Flat Class

template<data::ImmutableMemoryDataset Data, typename Dist, typename Ownership = OwnsMembers>
class FlatIndex

Implementation of the Flat index.

The mid-level implementation for the flat index that uses exhaustive search to find the exact nearest neighbors (within the limitations of possibly quantization error for the dataset or floating-point error for some distance functors).

NOTE: This method is not as performant as other index methods. It is meant to return the exact rather than approximate nearest neighbors and thus must exhaustively search the whole dataset.

Template Parameters:
  • Data – The full type of the dataset being indexed.

  • Dist – The distance functor used to compare queries with the elements of the dataset.

  • Ownership – Implementation detail and may be omitted for most use cases.

Public Types

using distance_type = Dist

The type of the distance functor.

using data_type = Data

The type of dataset.

Public Functions

inline svs::logging::logger_ptr get_logger() const

Getter method for logger.

template<typename ThreadPoolProto>
inline FlatIndex(Data data, Dist distance, ThreadPoolProto threadpool_proto, svs::logging::logger_ptr logger = svs::logging::get())

Construct a new index from constituent parts.

ThreadPool

An acceptable thread pool should implement two methods:

  • size_t size(). This method should return the number of threads used in the thread pool.

  • void parallel_for(std::function<void(size_t)> f, size_t n). This method should execute f. Here, f(i) represents a task on the i^th partition, and n represents the number of partitions that need to be executed.

Parameters:
  • data – The data to use for the index. The resulting index will take ownership of the passed argument.

  • distance – The distance functor to use to compare queries with dataset elements.

  • threadpool_proto – Precursor for the thread pool to use. Can either be an acceptable thread pool instance or an integer specifying the number of threads to use. In the latter case, a new default thread pool will be constructed using threadpool_proto as the number of threads to create.

  • logger_ – Spd logger for per-index logging customization.

inline size_t size() const

Return the number of independent entries in the index.

inline size_t dimensions() const

Return the logical number of dimensions of the indexed vectors.

template<typename QueryType, typename Pred = lib::Returns<lib::Const<true>>>
inline void search(QueryResultView<size_t> result, const data::ConstSimpleDataView<QueryType> &queries, const search_parameters_type &search_parameters, const lib::DefaultPredicate &cancel = lib::Returns(lib::Const<false>()), Pred predicate = lib::Returns(lib::Const<true>()))

Fill the result with the num_neighbors nearest neighbors for each query.

Preconditions:

The following pre-conditions must hold. Otherwise, the behavior is undefined.

  • result.n_queries() == queries.size()

  • result.n_neighbors() == num_neighbors.

  • The value type of queries is compatible with the value type of the index dataset with respect to the stored distance functor.

Implementation Details

The internal call stack looks something like this.

search: Prepare scratch space and perform tiling over the dataset.
  |
  +-> search_subset: multi-threaded search of all queries over the current subset
      of the dataset. Partitions up the queries according to query batch size
      and dynamically load balances query partition among worker threads.
        |
        +-> search_patch: Bottom level routine meant to run on a single thread.
            Compute the distances between a subset of the queries and a subset
            of the data and maintines the `num_neighbors` best results seen so far.
{}

Template Parameters:
  • Queries – The full type of the queries.

  • Pred – The type of the optional predicate.

Parameters:
  • result – The result data structure to populate. Row i in the result corresponds to the neighbors for the ith query. Neighbors within each row are ordered from nearest to furthest. num_neighbors is computed from the number of columns in result.

  • queries – A dense collection of queries in R^n.

  • search_parameters – search parameters to use for the search.

  • cancel – A predicate called during the search to determine if the search should be cancelled.

  • predicate – A predicate functor that can be used to exclude certain dataset elements from consideration. This functor must implement bool operator()(size_t) where the size_t argument is an index in [0, data.size()). If the predicate returns true, that dataset element will be considered.

inline size_t get_num_threads() const

Return the current number of threads used for search.

See also

set_num_threads

template<threads::ThreadPool Pool>
inline void set_threadpool(Pool threadpool)

Destroy the original thread pool and set to the provided one.

ThreadPool

An acceptable thread pool should implement two methods:

  • size_t size(). This method should return the number of threads used in the thread pool.

  • void parallel_for(std::function<void(size_t)> f, size_t n). This method should execute f. Here, f(i) represents a task on the i^th partition, and n represents the number of partitions that need to be executed.

Parameters:

threadpool – An acceptable thread pool.

inline threads::ThreadPoolHandle &get_threadpool_handle()

Return the current thread pool handle.

template<typename Query>
inline double get_distance(size_t id, const Query &query) const

Compute the distance between an external vector and a vector in the index.

Flat Entry Point

template<typename DataProto, typename Distance, typename ThreadPoolProto>
auto svs::index::flat::auto_assemble(DataProto &&data_proto, Distance distance, ThreadPoolProto threadpool_proto, svs::logging::logger_ptr logger = svs::logging::get())

Entry point for loading a Flat index.

This method provides much of the heavy lifting for constructing a Flat index from a data file on disk or a dataset in memory.

data_loader

The data loader should be any object loadable via svs::detail::dispatch_load returning a Vamana compatible dataset. Concrete examples include:

ThreadPool

An acceptable thread pool should implement two methods:

  • size_t size(). This method should return the number of threads used in the thread pool.

  • void parallel_for(std::function<void(size_t)> f, size_t n). This method should execute f. Here, f(i) represents a task on the i^th partition, and n represents the number of partitions that need to be executed.

Parameters:
  • data_proto – Data prototype. See expanded notes.

  • distance – The distance functor to use to compare queries with elements of the dataset.

  • threadpool_proto – Precursor for the thread pool to use. Can either be an acceptable thread pool instance or an integer specifying the number of threads to use. In the latter case, a new default thread pool will be constructed using threadpool_proto as the number of threads to create.

  • logger_ – Spd logger for per-index logging customization.