The world of search is changing.
Traditional search systems use inverted indexes to find relevant content for a query. These systems rely on lexical or word matching to rank documents. For example, if I'm searching for "laptop," traditional search systems will only return documents that include the literal word token "laptop."
But popular search engines, such as Google and Bing, have begun augmenting their search systems with transformer-based semantic search in recent years. In semantic search systems, the meaning of the document and the user query is encoded and matched, avoiding the limitations of word matching. For example, the query "I want a laptop" will also surface documents that include the tokens "MacBook" or "computers."
We're only scratching the surface of the potential of semantic search. Knowledge-storing services (e.g., Confluence) still rely on lexical search, and employees often have to type exact keywords to find content. In the future, it should be possible for people to search in much more intuitive ways.
Let's look at how you can get started with semantic search using an open-source vector index, Faiss, and a pre-trained semantic search model, MPNet.
Introduction to semantic search
Semantic search works by encoding text — paragraphs and documents — into dense vectors and then indexing those vectors. At runtime, we encode the query using the same encoding mechanism and use the vector index to find the vectors closest to the query vector. Closeness between vectors is usually defined by the cosine similarity metric, which is the inner product of two normalized vectors. You can find more information on word embeddings and dense vectors here.
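To make that last point concrete, here's a tiny sketch showing that the inner product of two L2-normalized vectors equals their cosine similarity (the vectors below are made up purely for illustration):

```python
import numpy as np

a = np.array([0.3, 0.7, 0.1])
b = np.array([0.2, 0.9, 0.4])

# Cosine similarity computed directly
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of the L2-normalized vectors gives the same number
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
inner = np.dot(a_norm, b_norm)

print(cosine, inner)  # both print the same value (~0.95 for these vectors)
```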
First, we'll convert the text in the documents to vectors using the MPNet model and add these vectors to an empty Faiss index. During query time, we'll encode the query using the same model and then compare this vector to the vectors in the Faiss index to find the most similar ones.
Setting up the environment
The two libraries I'll use — Faiss and Transformers — are open-source and can be installed in Deepnote using pip. We'll use the CPU-based Faiss installation for this notebook, but you can find information on GPU installations for faster indexing and search here. The exclamation mark in the cell below tells Deepnote that the line is a Bash (Linux) command, not a Python command.
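A minimal install cell might look like the following (faiss-cpu and transformers are the PyPI package names; torch is installed as well since the model runs on PyTorch):

```
!pip install faiss-cpu transformers torch
```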
Setting up the sentence embedding model
In the following code snippet, we set up a SemanticEmbedding class, which uses the pre-trained MPNet model to encode text into vectors.
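Below is a minimal sketch of such a class. The checkpoint name (sentence-transformers/all-mpnet-base-v2) and the plain mean pooling are assumptions on my part; the important detail is that the output is L2-normalized, so the inner product used later behaves like cosine similarity.

```python
import torch
from transformers import AutoTokenizer, AutoModel


class SemanticEmbedding:
    def __init__(self, model_name="sentence-transformers/all-mpnet-base-v2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def get_embedding(self, text):
        # Tokenize the input text and run it through the MPNet encoder
        inputs = self.tokenizer(text, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            output = self.model(**inputs)
        # Mean-pool the token embeddings into a single vector, then
        # L2-normalize so that inner product == cosine similarity
        embedding = output.last_hidden_state.mean(dim=1)
        embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)
        return embedding.detach().numpy()
```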
Testing the model
Next, let's test the model by embedding a test phrase: "I love playing football." Note that the output is a 768-dimensional dense vector.
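For example, using the SemanticEmbedding class sketched above:

```python
model = SemanticEmbedding()

embedding = model.get_embedding("I love playing football")
print(embedding.shape)   # (1, 768)
print(embedding[:, :5])  # first few values of the dense vector
```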
Setting up the Faiss index
Faiss is an open-source framework developed by Facebook AI that enables us to perform semantic search. It does this by indexing the document vectors you give it and providing an API for finding the vectors closest to a query vector.

While we can index vectors with Faiss, we must store the mapping from document vectors back to documents in a separate data structure and maintain that structure ourselves. For simplicity, I'm using a dictionary of {id: text}, but if you want to store more data, you can use a dictionary of {id: dict()} as well.

Faiss provides many different index types depending on the functionality you need. I'm using IndexFlatIP for this demo because its distance mechanism is the inner product, which for normalized embeddings is the same as cosine similarity.
In the code below, I create a FaissIdx class that initializes our index with the embedding vector size (768, in this case), along with a counter to keep track of document ids. Note that the Faiss index starts out empty; it doesn't contain any document vectors yet.
I then add two methods to the class to add and search documents. To implement them, I use Faiss's API to add the embedding of a document to the index and to search the index with a query vector. Note that the search API, self.index.search, also takes a parameter k, which defines how many document vectors to return. In this case, we ask for the top three and use the doc_map data structure we defined to map each result back to a human-readable document.
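A minimal sketch of the class might look like this (the method names add_doc and search_doc are my own; the structure follows the description above):

```python
import faiss


class FaissIdx:
    def __init__(self, model, dim=768):
        # IndexFlatIP scores by inner product, which equals cosine
        # similarity when the embeddings are L2-normalized
        self.index = faiss.IndexFlatIP(dim)
        # Mapping from integer id back to the original document text
        self.doc_map = dict()
        self.model = model
        self.ctr = 0

    def add_doc(self, document_text):
        self.index.add(self.model.get_embedding(document_text))
        self.doc_map[self.ctr] = document_text
        self.ctr += 1

    def search_doc(self, query, k=3):
        D, I = self.index.search(self.model.get_embedding(query), k)
        return [{self.doc_map[idx]: score} for idx, score in zip(I[0], D[0]) if idx in self.doc_map]
```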
Testing the index
Once you've set up the index, you can add documents and run searches against it. In the code below, we add the documents "laptop computers" and "doctor's office," then search for "PC computer." Note that "laptop computers" has a high similarity while "doctor's office" has a low similarity, which makes sense. Remember that cosine similarity ranges from -1 to 1, with 1 meaning the two vectors point in exactly the same direction.
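For example, continuing with the classes sketched above (the exact scores will vary with the model):

```python
index = FaissIdx(model)

index.add_doc("laptop computers")
index.add_doc("doctor's office")

print(index.search_doc("PC computer"))
# "laptop computers" should come back with a noticeably higher score than "doctor's office"
```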
Checking out the final setup
In the above tests, we're using a very small index with only two documents, "laptop computers" and "doctor's office." This isn't very realistic. Most semantic search indexes in the real world would have millions of documents indexed. To give some approximation of this, let's download some sample data and use it to populate our index.
I'll use a subset of the STS data set. To start, let's download the sample data and take a look at it.
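Assuming the sample is saved as a CSV file with a sentence_A column (the file name below is a placeholder for wherever you downloaded it), loading it could look like this:

```python
import pandas as pd

# Placeholder path; point this at the downloaded sample of the data set
data = pd.read_csv("sts_sample.csv")
data.head()
```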
I can ignore most of the columns — I just need the sentence_A column, which I'll add as documents to the Faiss index I've set up above. I'll add the documents one by one instead of in a single pass to mimic real-life knowledge sources, which are generally built incrementally over time and must be added in the order they come.
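A simple loop over the column is enough for this sketch:

```python
# Add each sentence to the index one at a time,
# mimicking an incrementally built knowledge source
for sentence in data["sentence_A"]:
    index.add_doc(sentence)

print(f"Indexed {index.ctr} documents")
```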
You can explore the contents of the data below to check the relatedness of sentences and play around a bit while you're waiting.
Now that we've finished adding all the documents to the index, let's try some queries.
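For instance (these queries are only illustrative; the exact results depend on the sentences in your sample):

```python
print(index.search_doc("two kids kicking something around in a park"))
print(index.search_doc("someone jogging outdoors"))
```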
As you can see, semantic search can return results containing "ball" and "running" even though these terms were not present in the query. This is why semantic search is better than lexical search: we can match the meaning encoded in the indexed documents and the query without relying on exact word matches. Semantic search still has shortcomings, such as struggling with unseen words or not effectively encoding long documents, but there is active research addressing these issues.
Semantic search relies on computing dense embeddings for documents and queries, plus an index that can store document vectors and search over them using cosine similarity as the distance metric. Faiss, as an example index, scales easily to 10 million documents and can return results in less than 100 milliseconds.
Want to dig even deeper into semantic search? Find more information here.