Experimental Setup
Datasets
To cover a wide range of use cases, we evaluate SVS on standard datasets of diverse dimensionalities (d = 25 to 960), cardinalities (n = 1M to 1B), encodings (float32 and uint8), and similarity functions (L2, cosine similarity, and inner product), summarized in the table below.
Dataset | d   | n    | Encoding | Similarity        | n queries | Space (GiB)
--------|-----|------|----------|-------------------|-----------|------------
—       | 960 | 1M   | float32  | L2                | 1000      | 3.6
—       | 128 | 1M   | float32  | L2                | 10000     | 0.5
—       | 96  | 10M  | float32  | cosine similarity | 10000     | 3.6
—       | 50  | 1.2M | float32  | cosine similarity | 10000     | 0.2
—       | 25  | 1.2M | float32  | cosine similarity | 10000     | 0.1
—       | 200 | 100M | float32  | inner product     | 10000     | 74.5
—       | 96  | 100M | float32  | cosine similarity | 10000     | 35.8
—       | 96  | 1B   | float32  | cosine similarity | 10000     | 357.6
—       | 128 | 1B   | uint8    | L2                | 10000     | 119.2
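The Space column follows directly from the raw footprint n · d · bytes-per-element (4 bytes for float32, 1 byte for uint8). A minimal sanity check, assuming the usual GiB = 2^30 bytes convention:

```python
# Consistency check for the Space (GiB) column:
# raw footprint = n * d * bytes-per-element.
ELEMENT_BYTES = {"float32": 4, "uint8": 1}

rows = [  # (d, n, encoding, reported GiB) -- a few rows from the table above
    (960, 1_000_000, "float32", 3.6),
    (96, 1_000_000_000, "float32", 357.6),
    (128, 1_000_000_000, "uint8", 119.2),
]

for d, n, encoding, reported in rows:
    computed = n * d * ELEMENT_BYTES[encoding] / 2**30  # bytes -> GiB
    print(f"d={d:3d}  n={n:>13,}  {encoding:7s}  "
          f"computed={computed:5.1f} GiB  reported={reported}")
```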
Evaluation Metrics
In all benchmarks and experimental results, search accuracy is measured by k-recall at k (k-recall@k), defined as |S ∩ G_t| / k, where S is the set of ids of the k retrieved neighbors and G_t is the ground truth, i.e., the ids of the k true nearest neighbors.
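As a concrete reference, here is a minimal NumPy sketch of this metric (the function name and array layout are our own illustration):

```python
import numpy as np

def k_recall_at_k(retrieved: np.ndarray, groundtruth: np.ndarray, k: int) -> float:
    """Mean k-recall@k over a query batch.

    retrieved:   (n_queries, >=k) ids returned by the index, best first.
    groundtruth: (n_queries, >=k) ids of the true k nearest neighbors.
    """
    per_query = [
        len(set(retrieved[q, :k]) & set(groundtruth[q, :k])) / k
        for q in range(retrieved.shape[0])
    ]
    return float(np.mean(per_query))
```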
Query Batch Size
The query batch size, which depends on the use case, can have a significant impact on performance. We therefore evaluate three batch sizes: 1 (one query at a time, i.e., single query), 128 (a typical batch size for deep learning training and inference), and full batch (all queries at once, so the batch size equals the number of queries in the dataset, see Datasets).
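A hedged sketch of this measurement protocol, where `index.search(batch, k)` is a hypothetical stand-in for the search call of the library under test, not an actual SVS API:

```python
import time
import numpy as np

def throughput(index, queries: np.ndarray, k: int, batch_size: int) -> float:
    """Queries per second when submitting `queries` in batches of `batch_size`.

    `index.search(batch, k)` is a hypothetical placeholder for the
    library-specific search call.
    """
    start = time.perf_counter()
    for i in range(0, len(queries), batch_size):
        index.search(queries[i : i + batch_size], k)
    return len(queries) / (time.perf_counter() - start)

# The three regimes evaluated (k=10 is illustrative only):
# for bs in (1, 128, len(queries)):
#     print(bs, throughput(index, queries, k=10, batch_size=bs))
```

With batch_size=1 this measures single-query behavior, while the full-batch setting exposes how well the implementation exploits parallelism across queries.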