πŸ’» Semantic NLP search with FAISS and VectorHub

Assumed Knowledge: Vectors
Target Audience: Data scientists, Python developers
Reading Time: 3 minutes

The following guide uses VectorHub and FAISS (Facebook AI Similarity Search) to walk through an example of using vectors for search.

Step 0) Getting the right Python and requirements

Here, we use Python 3.6 or Python 3.7. We have tested the code on Colab to ensure it works even if you do not have your own Python installation. If you are interested in running the code in Colab, click here.
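If you are running locally rather than on Colab, a quick way to confirm which interpreter you are on (a minimal sketch):

import sys

# The guide targets Python 3.6 / 3.7; print the running version to confirm
print(sys.version)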

Step 1) Encoding Data With Vectors

First, we install VectorHub so that we can encode data easily. We install the encoders-text-tfhub extra requirement because we want to use VectorHub's BERT model. You can find out more about the BERT model here. BERT is a model released by Google that provides bi-directional encoding with attention layers, which led to a significant improvement in NLP performance.

%%capture
!pip install vectorhub[encoders-text-tfhub]

Then, we instantiate our model and start encoding. VectorHub abstracts the dependency requirements away into simple installation steps like the one above, and uses the best model and default pooling strategies based on our own tests. You can read more about Bert2Vec on the VectorHub model card here.

from vectorhub.encoders.text.tfhub import Bert2Vec
bert_enc = Bert2Vec()
# Sentences to encode and index
words = [
    'How can I design my own post-graduate education?', 
    'How could water be produced on Mars?', 
    'How can I fall in love?', 
    'How can India improve in corruption?'
]
vectors = []
# This can be optimised using a list comprehension, but
# we keep it easy to read for demo purposes
for word in words:
    vector = bert_enc.encode(word)
    vectors.append(vector)
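Each call to encode returns one dense vector per sentence. Before indexing, it is worth sanity-checking that every vector has the same dimensionality; the exact length depends on the underlying BERT model (a minimal sketch, assuming the encoding loop above has run):

# One vector per sentence; the exact dimensionality (e.g. 768) depends on the BERT model
print(len(vectors))     # number of encoded sentences
print(len(vectors[0]))  # dimensionality of each vector
assert all(len(v) == len(vectors[0]) for v in vectors)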

Step 2) Building An Index

We then add our vectors to the FAISS index (the associated words stay in our Python list; FAISS returns the positions of the matching vectors, which we use to look the words back up). The FAISS index can be instantiated in a number of different ways; in this case, we instantiate it with the L2 (Euclidean) distance and then add the vectors. Because FAISS expects float32 numpy arrays, we convert our vectors before inserting them into the index.

import numpy as np
import faiss
vector_length = len(vectors[0]) # dimensionality of our BERT vectors
index = faiss.IndexFlatL2(vector_length) # build the index using L2 as the distance
index.add(np.array(vectors).astype('float32')) # FAISS expects float32 numpy arrays
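You can confirm the vectors were stored by inspecting the index (a minimal sketch using FAISS's standard attributes):

# Flat indexes need no training; ntotal counts the stored vectors
print(index.is_trained)  # True for IndexFlatL2
print(index.ntotal)      # 4, one entry per encoded sentence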

Step 3) Searching Our Index

Once the index is built, we encode the search query into a vector and use it to locate the closest vectors in the index.

num_of_results = 3 # Number of results 
# Encode the search query
search_term = 'Building a better government'
query_vector = bert_enc.encode(search_term)
D, I = index.search(np.array([query_vector]).astype('float32'), num_of_results) # D = distances, I = indices of the nearest vectors
# Print the results in order of relevance
for i in range(num_of_results):
    print(words[I[0][i]])
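Because index.search also returns the L2 distances in D, you can print them alongside each result to see how close each match is (a short sketch reusing D and I from above; smaller distances mean closer matches):

# Pair each result with its L2 distance (smaller = closer match)
for dist, idx in zip(D[0], I[0]):
    print(f'{dist:.4f}\t{words[idx]}')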

Voila! You have built a very basic semantic search with FAISS. From here, you can add more vectors to the index, build improved search, or plug in your own datasets. FAISS, however, is limited in its support for more advanced search options (searching with filters, multi-vector search, personalised search). For these additional requirements (as well as online storage), we recommend reading Vector Search With Vector AI, which is our cloud-based vector search solution.
