.. _howtos:

How-Tos
##########

.. contents::
   :local:
   :depth: 1

.. _how_to_run_dynamic_indexing:

How to Do Dynamic Indexing
===========================

This tutorial shows how to create a dynamic index, add and remove vectors, search the
index, and save and reload it.

Generating test data
********************

We generate a sample dataset using the :py:func:`svs.generate_test_dataset` generation
function. This function generates a data file, a query file, and the ground truth. Note
that this data is randomly generated, with no semantic meaning for the elements within it.

We first load svs and the other modules required for this example.

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [imports]
   :end-before: [imports]

Then we proceed to generate the test dataset.

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [generate-dataset]
   :end-before: [generate-dataset]
   :dedent: 4

Building the Dynamic Index
**************************

To construct the index, we first need to define the hyper-parameters for the graph
construction (see :ref:`graph-build-param-setting` for details).

**In Python**

This is done by creating an instance of :py:class:`svs.VamanaBuildParameters`.

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [build-parameters]
   :end-before: [build-parameters]
   :dedent: 4

Now that we've established our hyper-parameters, it is time to construct the index. For
this, we load the data and build the dynamic index with the first 9k vectors of the
dataset.

**In Python**

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [build-index]
   :end-before: [build-index]
   :dedent: 4

Updating the index
******************

Once we've built the initial dynamic index, we can add and remove vectors.

**In Python**

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [add-vectors]
   :end-before: [add-vectors]
   :dedent: 4

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [remove-vectors]
   :end-before: [remove-vectors]
   :dedent: 4

Deletions are performed lazily to avoid excessive compute overhead. When a vector is
deleted, it is added to a list of deleted elements but is not immediately removed from the
index. At search time, a deleted vector is still used during graph traversal, but it is
filtered out of the nearest-neighbor results. Once a sufficient number of deletions has
accumulated, the ``consolidate()`` and ``compact()`` functions should be run to actually
remove the vectors from the index.

**In Python**

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [consolidate-index]
   :end-before: [consolidate-index]
   :dedent: 4
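To make the lifecycle concrete, here is a minimal sketch of the lazy-deletion workflow.
The ``consolidate()`` and ``compact()`` calls are the functions referenced above; the
``delete`` method name and the use of numpy arrays of external ids are assumptions to
check against the example file and your installed version of ``svs``.

.. code-block:: python

   import numpy as np

   # Assumption: `index` is the dynamic index built above, and vectors are
   # addressed by the external ids supplied when they were added.
   ids_to_delete = np.arange(100, 200, dtype=np.uint64)
   index.delete(ids_to_delete)

   # The deleted vectors no longer appear in search results, but they are
   # still stored and still traversed during graph search.

   # Once enough deletions have accumulated, remove them for real:
   index.consolidate()  # unlink deleted vectors from the graph
   index.compact()      # reclaim their storage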
Searching the Index
********************

First, we load the queries and the computed ground truth for our example dataset.

**In Python**

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [load-aux]
   :end-before: [load-aux]
   :dedent: 4

Then, we run the search in the same fashion as for the static graph.

**In Python**

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [perform-queries]
   :end-before: [perform-queries]
   :dedent: 4

Saving the Index
****************

If you are satisfied with the performance of the generated index, you can save it to disk
to avoid rebuilding it in the future.

**In Python**

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [saving-results]
   :end-before: [saving-results]
   :dedent: 4

.. note::

   The index-saving function currently uses three folders, all of which are needed to
   reload the index:

   * One folder for the graph.
   * One folder for the data.
   * One folder for metadata.

   This is subject to change in the future.

Reloading a Saved Index
***********************

To reload the index from file, use the corresponding constructor with the three folder
names used to save the index. Performing queries is identical to before.

**In Python**

.. literalinclude:: ../examples/python/example_vamana_dynamic.py
   :language: python
   :start-after: [loading]
   :end-before: [loading]
   :dedent: 4

Note that the second argument, the one corresponding to the folder for the data, requires
a :py:class:`svs.VectorDataLoader` with the corresponding data type.

|

.. _graph-build-param-setting:

How to Choose Graph Building Hyper-parameters
=============================================

The optimal values for the graph building hyper-parameters depend on the dataset and on
the required trade-off between performance and accuracy. We suggest commonly used values
here and provide some guidance on how to adjust them; an example of putting these values
together follows the list. See :ref:`graph-building-details` for more details about graph
building.

* ``graph_max_degree``: Maximum out-degree of the graph. A larger ``graph_max_degree``
  implies more distance computations per hop but a potentially shorter graph traversal
  path, so it can lead to higher search performance. High-dimensional datasets or
  datasets with a large number of points usually require a larger ``graph_max_degree``
  to reach very high search accuracy. Keep in mind that the graph size in bytes is given
  by 4 times ``graph_max_degree`` (each neighbor id in the graph adjacency lists is
  represented with 4 bytes) times the number of points in the dataset, so a larger
  ``graph_max_degree`` implies a larger memory footprint. For example, 10 million points
  with ``graph_max_degree`` = 64 yield a graph of 4 * 64 * 10^7 bytes, roughly 2.5 GB.
  Commonly used values for ``graph_max_degree`` are 32, 64, or 128.

* ``alpha``: Threshold for the graph adjacency lists :ref:`pruning rule <graph-pruning>`
  during the second pass over the dataset. For distance types favoring minimization, set
  this to a number greater than 1.0 to build a denser graph (typically, 1.2 is
  sufficient). For distance types favoring maximization, set it to a value less than 1.0
  to build a denser graph (such as 0.95).

* ``window_size``: Sets the ``search_window_size`` for the graph search conducted to add
  new points to the graph. This parameter controls the quality of
  :ref:`graph construction <graph-building-pseudocode>`. A larger window size yields a
  higher-quality index at the cost of longer construction time. It should be larger than
  ``graph_max_degree``.

* ``max_candidate_pool_size``: Limit on the number of candidates considered for the graph
  adjacency lists :ref:`pruning rule <graph-pruning>`. It should be larger than
  ``window_size``.

* ``num_threads``: The number of threads to use for index construction. The indexing
  process is highly parallelizable, so using as many threads as possible is usually
  better.
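As a starting point, the sketch below fills in the commonly used values discussed above.
The keyword names mirror the hyper-parameters listed here, but treat their exact spelling
as an assumption to verify against the :py:class:`svs.VamanaBuildParameters` reference for
your installed version of ``svs``.

.. code-block:: python

   import svs

   # Commonly used starting values; tune per dataset as described above.
   build_params = svs.VamanaBuildParameters(
       graph_max_degree=64,          # 32, 64, or 128 are common choices
       alpha=1.2,                    # > 1.0 for minimization distances (e.g., L2)
       window_size=128,              # should exceed graph_max_degree
       max_candidate_pool_size=750,  # should exceed window_size
       num_threads=8,                # index construction parallelizes well
   )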
.. _search-window-size-setting:

How to Set the Search Window Size
==================================

The ``search_window_size`` is the knob controlling the trade-off between performance and
accuracy for the graph search. A larger ``search_window_size`` implies exploring more
vectors, improving the accuracy at the cost of a longer search path. See
:ref:`graph-search-details` for more details about graph search.

One simple way to set the ``search_window_size`` is to run searches with multiple values
of the parameter and print the recall, to identify the ``search_window_size`` required
for the chosen accuracy level.

**In Python**

.. literalinclude:: ../examples/python/example_vamana.py
   :language: python
   :start-after: [search-window-size]
   :end-before: [search-window-size]
   :dedent: 4

**In C++**

.. collapse:: Click to display

   .. literalinclude:: ../examples/cpp/vamana.cpp
      :language: cpp
      :start-after: [Search Window Size]
      :end-before: [Search Window Size]
      :dedent: 4

|

.. _compression-setting:

How to Choose Compression Parameters
=====================================

Number of bits per level
************************

LVQ compression [ABHT23]_ comes in two flavors: one or two levels. One-level LVQ, or
LVQ-B, uses B bits to encode each vector component using a scalar quantization with
per-vector scaling factors. Two-level LVQ, or LVQ-B1xB2, uses LVQ-B1 to encode the
vectors and a modification of LVQ to encode the residuals using B2 bits. Whether to use
one or two levels, and how many bits, depends on the dataset and on the trade-off between
performance and accuracy that needs to be achieved.

When using **two-level LVQ**, the graph search is conducted using vectors compressed with
LVQ-B1, and a final re-ranking step is performed using the residuals compressed with B2
bits to improve the search recall. This decoupled strategy is particularly beneficial for
**high-dimensional datasets** (>200 dimensions), as LVQ achieves up to ~8x bandwidth
reduction (B1=4) compared to a float32-valued vector. The number of bits for the
residuals (4 or 8) should be chosen depending on the desired search accuracy. Suggested
configurations for high-dimensional vectors are LVQ-4x8 or LVQ-4x4, depending on the
desired accuracy.

For **lower-dimensional datasets** (<200 dimensions), **one-level** LVQ-8 is often a good
choice. If higher recall is required, and a slightly larger memory footprint is allowed,
then LVQ-8x4 or LVQ-8x8 should be used.

These are general guidelines, but the best option will depend on the dataset. If you are
willing to optimize the search for a particular dataset and use case, we suggest trying
different LVQ options. See
:ref:`SVS + Vector compression (large scale datasets) <benchs-compression-evaluation>`
and
:ref:`SVS + Vector compression (small scale datasets) <benchs-compression-evaluation_small_scale>`
for benchmarking results of the different LVQ settings on standard datasets.

.. _lvq_strategy:

LVQ implementation strategy
***************************

The ``strategy`` argument of :py:class:`svs.LVQLoader` is of type
:py:class:`svs.LVQStrategy` and selects the low-level implementation strategy for LVQ:
Turbo or Sequential. Turbo is an optimized implementation that brings further performance
over the default (Sequential) implementation [AHBW24]_. Turbo can be used when 4 bits are
used for the primary LVQ level, and it is enabled by default for that setting.

Padding
*******

LVQ-compressed vectors can be padded to a multiple of 32 or 64 bytes to be aligned with
half or full cache lines. This improves search performance at a low cost in overall
memory footprint (e.g., 5% and 12% larger footprints for
`Deep <http://sites.skoltech.ru/compvision/noimi/>`_ with ``graph_max_degree`` = 128 and
32, respectively). A value of 0 (default) implies no special alignment.
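The sketch below shows how these choices might come together when loading a
high-dimensional dataset. It is a minimal sketch: :py:class:`svs.LVQLoader` and
:py:class:`svs.LVQStrategy` are the classes named above, but the dataset path is
hypothetical, and the keyword names ``primary``, ``residual``, ``strategy``, and
``padding`` are assumptions to verify against your installed version of ``svs``.

.. code-block:: python

   import svs

   # Two-level LVQ-4x8 for a high-dimensional dataset (>200 dimensions):
   # 4 primary bits for fast graph traversal, 8 residual bits for re-ranking.
   loader = svs.LVQLoader(
       "data/my_dataset",               # hypothetical path to the saved dataset
       primary=4,                       # bits for the first level (B1)
       residual=8,                      # bits for the residuals (B2)
       strategy=svs.LVQStrategy.Turbo,  # Turbo is available (and default) for B1=4
       padding=32,                      # align vectors to half cache lines
   )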
For details on the C++ implementation, see :ref:`cpp_quantization_lvq`.

How to Prepare Your Own Vector Dataset
======================================

Preparing your own vector dataset is simple with our Python API ``svs``, which can
directly use embeddings encoded as ``numpy`` arrays. Starting from the original data
format (e.g., images or text), the main steps are:

0. Select an embedding model
1. Preprocess the data
2. Embed the data to generate vectors
3. Use or save the embeddings

We will walk through a simple example below. For complete examples, please see our
`VectorSearchDatasets repository <https://github.com/IntelLabs/VectorSearchDatasets>`_,
which contains code to generate compatible vector embeddings for common datasets such as
`open-images <https://storage.googleapis.com/openimages/web/index.html>`_.

Example: vector embeddings of images
************************************

This simplified example is derived from our
`VectorSearchDatasets <https://github.com/IntelLabs/VectorSearchDatasets/tree/main/openimages>`_
code.

Select embedding model
----------------------

Many users will be interested in using deep learning models to embed their data. For an
image, this could be a multimodal vision-language model such as
`CLIP <https://github.com/OpenAI/CLIP>`_. This model is available through the
`Hugging Face Transformers library <https://huggingface.co/docs/transformers/en/model_doc/clip>`_,
which we import here in order to load the model. We also import PyTorch and some data
processing tools that will appear in the next step.

.. code-block:: py

   import svs
   import torch
   from transformers import AutoProcessor, CLIPProcessor, CLIPModel

   model_str = "openai/clip-vit-base-patch32"
   model = CLIPModel.from_pretrained(model_str)

Preprocess the data
-------------------

You will need to prepare your data so that it can be processed by the embedding model.
You should also apply any other preprocessing here, such as cropping images or removing
extraneous text. For our example, let's assume that the OpenImages dataset has been
downloaded and we wish to preprocess the first 24 images. Here we use the Hugging Face
AutoProcessor to pre-process the dataset into the format required by CLIP.

.. collapse:: Click to display

   .. code-block:: py

      from PIL import Image

      # Placeholder: directory where the OpenImages data was downloaded.
      oi_base_dir = 'open-images'

      processor = AutoProcessor.from_pretrained(model_str)

      image_list = []
      n_img_to_process = 24
      for img_id in range(0, n_img_to_process):
          image_fname = f'{oi_base_dir}/images/{img_id}.jpg'
          image = Image.open(image_fname)
          image_list.append(image)

      print(f'Pre-processing {n_img_to_process} images')
      inputs = processor(images=image_list, return_tensors="pt")

|

Embed the data to generate vectors
----------------------------------

We then call the CLIP method to grab the features, or vector embeddings, associated with
the images in our list. We also convert these embeddings from torch tensors to numpy
arrays to enable their use in svs.

.. code-block:: py

   with torch.no_grad():
       print('Generating embeddings')
       embeddings = model.get_image_features(**inputs).detach().numpy()

Use or save the embeddings
--------------------------

Now that you have the embeddings as numpy arrays, you can directly pass them as inputs to
svs functions. An example call to build a search index from the embeddings is shown
below.

.. code-block:: py

   index = svs.Vamana.build(parameters, embeddings, svs.DistanceType.L2)
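Once the index is built, the embeddings themselves can serve as queries. The following is
a minimal sketch under stated assumptions: ``parameters`` is a
:py:class:`svs.VamanaBuildParameters` instance as in the build call above, and the index
is assumed to expose a ``search`` method that takes a batch of query vectors and a
neighbor count and returns arrays of neighbor ids and distances; verify the method name
and return values against your installed version of ``svs``.

.. code-block:: py

   # Hypothetical query batch: here we simply reuse the first five embeddings.
   query_embeddings = embeddings[:5]

   # Assumption: search(queries, n_neighbors) returns (ids, distances).
   neighbor_ids, distances = index.search(query_embeddings, 10)
   print(neighbor_ids.shape)  # expected: (5, 10)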
You may also save the embeddings in the commonly used vector file format ``*vecs`` with
:py:func:`svs.write_vecs`. A description of the ``*vecs`` file format is given
`here <http://corpus-texmex.irisa.fr/>`_.

.. code-block:: py

   svs.write_vecs(embeddings, out_file)

Other data format helper functions are described in our
`I/O and Conversion Tools <https://intel.github.io/ScalableVectorSearch/io.html>`_
documentation.
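To sanity-check a saved file, you can read it back and compare it to the in-memory array.
This sketch assumes a reader named :py:func:`svs.read_vecs` as the counterpart to
:py:func:`svs.write_vecs`; check the I/O and Conversion Tools documentation linked above
for the exact helper names.

.. code-block:: py

   import numpy as np

   out_file = 'embeddings.fvecs'  # hypothetical output path
   svs.write_vecs(embeddings, out_file)

   # Assumption: read_vecs is the reading counterpart of write_vecs.
   reloaded = svs.read_vecs(out_file)
   assert np.allclose(embeddings, reloaded)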