Introduction
With FAISS, the database followed a schema in which the vectors and the corresponding raw data are stored as separate arrays.
The two arrays are linked by the array index. Qdrant, however, follows an entirely different schema.
It represents each vector as a separate entity called a point (search results come back as ScoredPoint objects).
The points are stored in a collection. A collection is like a table for vectors. One can create multiple collections on the same Qdrant client.
First, we set up Qdrant. The Python client can be installed using pip install qdrant-client.
Qdrant also offers its own embedding support through FastEmbed, which can be installed along with the client as pip install qdrant-client[fastembed].
Creating Collection
After installing the client, we import QdrantClient and instantiate it.
from qdrant_client import QdrantClient
database_path = "path/to/database/"
client = QdrantClient(path=database_path)
This initializes a client object connected to a local database stored at the given path. To connect to a hosted Qdrant instance, one can pass its url instead.
The method create_collection() creates a collection. A collection is identified by its name.
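from qdrant_client import models
collection_name = "cat_facts"
dim = 3  # embedding dimension; the outputs in this post use 3-dimensional vectors
# Create the collection only if it does not exist yet
if not client.collection_exists(collection_name=collection_name):
    client.create_collection(collection_name=collection_name,
                             vectors_config=models.VectorParams(size=dim, distance=models.Distance.COSINE))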
Here, we first check whether the collection already exists and create it only if it does not. To delete an existing collection, we can use client.delete_collection(collection_name=collection_name).
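The check-then-create pattern can be wrapped in a small helper. The sketch below is only an illustration: ensure_collection is a hypothetical name, not part of qdrant-client, and it relies solely on the collection_exists() and create_collection() calls used above.

```python
def ensure_collection(client, collection_name, vectors_config):
    """Create the collection only when it is missing.

    `ensure_collection` is a hypothetical helper, not a qdrant-client API.
    `client` is expected to be a QdrantClient, or any object exposing
    collection_exists() and create_collection() with these keyword arguments.
    Returns True if a collection was created, False if it already existed.
    """
    if client.collection_exists(collection_name=collection_name):
        return False
    client.create_collection(collection_name=collection_name,
                             vectors_config=vectors_config)
    return True
```

With the client created earlier, it would be called as ensure_collection(client, collection_name, models.VectorParams(size=dim, distance=models.Distance.COSINE)). Calling it twice is harmless because the second call finds the collection and does nothing.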
Vector datatypes
One can also define datatypes for the vectors. This is useful when the embedding vector is in quantized format.
For example, for uint8 embeddings, we can pass the datatype as models.VectorParams(size=text_dim, distance=models.Distance.COSINE, datatype=models.Datatype.UINT8).
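To make "quantized format" concrete, here is a minimal sketch of one common way float embeddings end up as uint8 values: a linear mapping of a fixed range onto 0..255. The function name quantize_uint8 and the range [-1, 1] are assumptions made for illustration; an embedding model that emits uint8 vectors may use a different scheme.

```python
def quantize_uint8(vector, lo=-1.0, hi=1.0):
    """Linearly map floats in [lo, hi] to integers in [0, 255].

    Values outside [lo, hi] are clipped first. The range is an assumption
    chosen for this sketch, not something Qdrant prescribes.
    """
    scale = 255.0 / (hi - lo)
    quantized = []
    for x in vector:
        clipped = min(max(x, lo), hi)
        quantized.append(int(round((clipped - lo) * scale)))
    return quantized


codes = quantize_uint8([-1.0, -0.5, 0.5, 1.0])
print(codes)
```

Each component now fits in a single byte, which is the kind of vector a uint8 collection is meant to hold.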
Collection metadata
Collection metadata can be specified during collection creation.
if not client.collection_exists(collection_name=collection_name):
    client.create_collection(collection_name=collection_name,
                             vectors_config=models.VectorParams(size=dim, distance=models.Distance.COSINE),
                             metadata={"type": "text-embedding",
                                       "title": "The Fascinating World of Cats: From Origins to Modern Day",
                                       "description": "Summary of facts about domestic cats including their behavior, anatomy, lifespan, and history."})
After the collection has been created, metadata can be added by passing it to update_collection().
Points
The basic entity in Qdrant is a point. A point consists of the following elements.
- ID: An unsigned integer or a UUID that identifies the point. Some bulk upload helpers generate IDs when none are provided.
- Vector: The embedding vector.
- Payload: An optional JSON object that contains the raw data corresponding to the Vector.
The structured payload makes it possible to attach other information about the data corresponding to a particular vector. This allows search results to be filtered on keys in the payload; for example, a query can be restricted to data dated after a certain date. Standard vector search as in FAISS does not support such functionality, whereas Qdrant's point schema supports it out of the box.
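The idea of combining a payload condition with vector similarity can be sketched in plain Python. This is a toy model under stated assumptions: the points, the published key, and the filtered_search function are all made up for illustration, and Qdrant evaluates filters inside its own engine rather than with a Python loop.

```python
import math
from datetime import date


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy points: each carries an id, a vector, and a payload (values are made up).
points = [
    {"id": 0, "vector": [1.0, 0.0], "payload": {"published": date(2020, 1, 1)}},
    {"id": 1, "vector": [0.9, 0.1], "payload": {"published": date(2024, 6, 1)}},
    {"id": 2, "vector": [0.0, 1.0], "payload": {"published": date(2025, 1, 1)}},
]


def filtered_search(query, after, limit=2):
    """Keep points published after `after`, then rank them by similarity."""
    candidates = [p for p in points if p["payload"]["published"] > after]
    ranked = sorted(candidates, key=lambda p: cosine(query, p["vector"]), reverse=True)
    return [p["id"] for p in ranked[:limit]]


result = filtered_search([1.0, 0.0], after=date(2023, 1, 1))
print(result)
```

Point 0 is the closest match to the query, but it is excluded because its payload fails the date condition. That is the behavior a plain index of vectors cannot express on its own.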
Vector types
Qdrant supports different types of vectors.
Dense vectors
The dense vector is the most common vector type. It is represented as a one-dimensional array of numbers, which is what most embedding models produce as output. A collection for dense vectors is created in the usual way described above.
When adding vectors one by one, a dense vector is passed as a list of its components. When adding vectors in bulk, Qdrant accepts the dense vectors as a list of vectors or as an array with the vectors stored row-wise.
Since the basic object in a collection is a point, we upload vectors by wrapping them in points and then uploading the points to the collection. Here is how we create points and add them to the collection.
import numpy as np
# Three points with random vectors and empty payloads
points = [models.PointStruct(id=ix, vector=np.random.random(dim).tolist(), payload={}) for ix in range(3)]
client.upsert(collection_name=collection_name, points=points)
Output
UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
The class PointStruct creates a point object with the vector. We can add payload to the point.
The upsert() method adds a list of points to the collection.
We can fetch points by their IDs using the client.retrieve() method as below.
client.retrieve(collection_name=collection_name, ids=[1,2], with_vectors=True)
Points
[Record(id=1, payload={}, vector=[0.6323352456092834, 0.4621959924697876, 0.6217129230499268], shard_key=None, order_value=None), Record(id=2, payload={}, vector=[0.7779846787452698, 0.6282112002372742, 0.009519333951175213], shard_key=None, order_value=None)]
By default, the methods retrieve(), search(), and scroll() return points with only their payload, if one exists. To return the vectors too, we need to pass with_vectors=True.
To search for points whose vectors are similar to a given vector, we use query_points().
The query_points() method searches for the vectors nearest to the passed vector.
client.query_points(collection_name=collection_name,
query=[0.6323352456092834, 0.4621959924697876, 0.6217129230499268],
with_vectors=True)
Queried vectors
QueryResponse(points=[ScoredPoint(id=1, version=0, score=0.9999999784910842, payload={}, vector=[0.6323352456092834, 0.4621959924697876, 0.6217129230499268], shard_key=None, order_value=None), ScoredPoint(id=2, version=0, score=0.7882221419790302, payload={}, vector=[0.7779846787452698, 0.6282112002372742, 0.009519333951175213], shard_key=None, order_value=None), ScoredPoint(id=0, version=0, score=0.7698541166251136, payload={}, vector=[0.014260887168347836, 0.7366276383399963, 0.6761482357978821], shard_key=None, order_value=None)])
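The scores in the output are cosine similarities, because the collection was created with Distance.COSINE. As a sanity check, the short sketch below recomputes the similarity between the query and the stored vector of point 2 in plain Python. The function cosine_similarity is written here for illustration and is not a qdrant-client call.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors copied from the query and from the output shown in this post.
query = [0.6323352456092834, 0.4621959924697876, 0.6217129230499268]
point_2 = [0.7779846787452698, 0.6282112002372742, 0.009519333951175213]

score_self = cosine_similarity(query, query)
score_2 = cosine_similarity(query, point_2)
print(round(score_self, 4), round(score_2, 4))
```

The query is the stored vector of point 1, so its similarity to itself is 1 up to floating-point precision. The value for point 2 agrees with the reported score of about 0.788.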
Named vectors
In some cases, one may need to store multiple related vectors together, such as an image and its associated text.
Qdrant allows storing multiple related vectors in a single point as named vectors.
To store multiple vectors per point, we need to pass the configuration of each vector as a dictionary to vectors_config argument.
# One VectorParams entry per named vector
if not client.collection_exists(collection_name=collection_name):
    client.create_collection(collection_name=collection_name,
                             vectors_config={"text": models.VectorParams(size=text_dim, distance=models.Distance.COSINE),
                                             "image": models.VectorParams(size=image_dim, distance=models.Distance.DOT)})
To add named vectors to the collection, one creates PointStruct objects whose vector is a dictionary keyed by the vector names, and then uploads them.
points = [models.PointStruct(id=ix, vector={'text': np.random.random(text_dim).tolist(),
                                            'image': np.random.random(image_dim).tolist()}) for ix in range(4)]
client.upsert(collection_name=collection_name, points=points)
Querying named vectors is similar. The query is a single vector, and the using argument names the stored vector it should be compared against.
query = np.random.random(text_dim).tolist()
client.query_points(collection_name=collection_name, query=query, using='text', with_vectors=True)
Sparse vectors
Sparse vectors are vectors in which most of the elements are zero, so storing them as a full array is inefficient. Instead, a sparse vector is stored as an object consisting of the indices of the non-zero elements and their corresponding values.
client.create_collection(collection_name=collection_name,
                         sparse_vectors_config={'text_sparse': models.SparseVectorParams()})
Note that sparse vectors are always named vectors. Unlike dense vectors, Qdrant does not allow unnamed sparse vectors. Moreover, sparse vectors are compared only with the dot product, so there is no distance to choose. The vector size is also not needed for sparse vectors.
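The indices-and-values representation and the dot product over it can be sketched in plain Python. The helpers to_sparse and sparse_dot are hypothetical names written for this illustration. They show the idea only and say nothing about how Qdrant implements sparse search internally.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as parallel index and value lists."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return indices, values


def sparse_dot(a, b):
    """Dot product of two (indices, values) pairs; only shared indices contribute."""
    a_indices, a_values = a
    b_lookup = dict(zip(*b))  # index -> value for the second vector
    return sum(v * b_lookup[i] for i, v in zip(a_indices, a_values) if i in b_lookup)


doc = to_sparse([0.0, 0.5, 0.0, 2.0, 0.0, 1.0])
query = to_sparse([0.0, 0.0, 0.0, 1.0, 0.0, 4.0])
print(doc)
print(sparse_dot(doc, query))
```

Only indices 3 and 5 appear in both vectors, so the score is 2.0 * 1.0 + 1.0 * 4.0 = 6.0. No fixed vector size is involved anywhere, which is consistent with the size not being needed in the collection configuration.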
The sparse vectors can be added to the collection using the same PointStruct and passing it the indices and the corresponding values.
points = [models.PointStruct(id=ix,
vector={'text_sparse': models.SparseVector(indices=np.random.choice(np.arange(0,10),4, replace=False).tolist(),
values=np.random.random(4).tolist())},
payload={}) for ix in range(3)]
client.upsert(collection_name=collection_name, points=points)
Sparse vectors are the only vectors for which a separate vector class, models.SparseVector, exists.
The class accepts, as in the above snippet, the indices and the corresponding values.
For other vectors, a list of values or an array is passed.
Querying sparse vectors is the same as querying a named vector. In this case, however, instead of a list for the vector, we pass a SparseVector holding the indices and values.
vector = models.SparseVector(indices=np.random.choice(np.arange(0,10),4, replace=False).tolist(),
values=np.random.random(4).tolist())
client.query_points(collection_name=collection_name, query=vector,
using='text_sparse', with_vectors=True, limit=2)
Output
QueryResponse(points=[ScoredPoint(id=0, version=0, score=0.6526923179626465, payload={}, vector={'text_sparse': SparseVector(indices=[1, 3, 5, 6], values=[0.11813525177377737, 0.7320883617640301, 0.2657800217399512, 0.9593644767815215])}, shard_key=None, order_value=None), ScoredPoint(id=1, version=0, score=0.3416300117969513, payload={}, vector={'text_sparse': SparseVector(indices=[4, 5, 8, 9], values=[0.272710627171461, 0.7391453100815311, 0.11007852520994543, 0.1801180842060507])}, shard_key=None, order_value=None)])
Multivectors
So far, we have considered cases where a single vector is defined for a single chunk of data. We may instead have multiple vectors for the same piece of data, in which case we can define the configuration as follows.
if not client.collection_exists(collection_name=collection_name):
    client.create_collection(collection_name=collection_name,
                             vectors_config=models.VectorParams(size=dim, distance=models.Distance.COSINE,
                                                                multivector_config=models.MultiVectorConfig(comparator=models.MultiVectorComparator.MAX_SIM)))
We can also create the collection by passing a dictionary of arguments to vectors_config.
if not client.collection_exists(collection_name=collection_name):
    client.create_collection(collection_name=collection_name,
                             vectors_config={'size': dim,
                                             'distance': 'Cosine',
                                             'multivector_config': {'comparator': 'max_sim'}})
To add a multivector to the collection, we create a point whose vector is a list of vectors and upload it to the collection.
points = [models.PointStruct(id=ix, vector=[np.random.random(dim).tolist() for i in range(3)]) for ix in range(4)]
client.upsert(collection_name=collection_name, points=points)
Querying multivectors works the same way as querying dense vectors, except that the query is itself a list of vectors.
client.query_points(collection_name=collection_name,
query=[np.random.random(dim).tolist() for i in range(3)],
with_vectors=True)
Output
QueryResponse(points=[ScoredPoint(id=2, version=0, score=2.982257692056449, payload={}, vector=[[0.18267421424388885, 0.5387661457061768, 0.8224117755889893], [0.8703754544258118, 0.4385502338409424, 0.22387555241584778], [0.624049186706543, 0.4725625216960907, 0.6222919821739197]], shard_key=None, order_value=None), ScoredPoint(id=0, version=0, score=2.8350925115193615, payload={}, vector=[[0.1981322020292282, 0.9796438217163086, 0.03227436542510986], [0.08620001375675201, 0.17250984907150269, 0.9812288284301758], [0.6063098311424255, 0.3430255651473999, 0.7174411416053772]], shard_key=None, order_value=None), ScoredPoint(id=3, version=0, score=2.7701239551534997, payload={}, vector=[[0.563003420829773, 0.19666346907615662, 0.8027145862579346], [0.6861861944198608, 0.6232661604881287, 0.3750837743282318], [0.2924101650714874, 0.8869141936302185, 0.3576025366783142]], shard_key=None, order_value=None), ScoredPoint(id=1, version=0, score=2.517101773174179, payload={}, vector=[[0.7019085884094238, 0.6607173085212708, 0.2660396993160248], [0.891431450843811, 0.36677613854408264, 0.2661301791667938], [0.8020596504211426, 0.5900818705558777, 0.09221615642309189]], shard_key=None, order_value=None)])
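The scores above exceed 1 even though the distance is cosine. With the MAX_SIM comparator, each query vector is matched to its most similar vector in the point and those best similarities are summed, so a query of three vectors can score up to 3. The plain-Python sketch below illustrates that computation. The function max_sim is a name chosen here for illustration, not a qdrant-client call.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_sim(query_vectors, point_vectors):
    """Sum, over the query vectors, of the best similarity among the point's vectors."""
    return sum(max(cosine(q, p) for p in point_vectors) for q in query_vectors)


query_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
same = max_sim(query_vectors, query_vectors)         # every query vector finds an exact match
partial = max_sim(query_vectors, [[1.0, 0.0, 0.0]])  # only the first query vector matches
print(same, partial)
```

A point that contains a good match for every query vector scores close to the number of query vectors, while a point that matches only one of them scores much lower.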
Adding vectors in bulk to the Collection
Once the collection has been created, vectors can be added to it either in bulk or one by one. In either case, there is an option to provide the payload corresponding to each vector.
There are four methods for bulk uploading; they differ in how they handle the data.
upsert(): It is a low-level building block that other methods use for uploading points.
upsert() takes a list of points and uploads them to the collection.
upload_collection()
upload_points()
add()
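One task a bulk helper built on upsert() typically has to handle is splitting a long sequence of points into smaller batches, each of which can be handed to upsert() in turn. The sketch below shows only that batching step under stated assumptions: batched is a hypothetical helper and is not how qdrant-client is necessarily implemented.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Integers stand in for points here; the last batch may be shorter.
batches = list(batched(range(7), batch_size=3))
print(batches)
```

With real points, each batch would be passed on as client.upsert(collection_name=collection_name, points=batch).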