Searching through an organization’s documents can feel like looking for a needle in a haystack. Except the haystack is the size of Mount Everest.
The simplest text search method is “lexical”. In basic terms, the search engine tries to match the keywords in your query to keywords in the documents. This is fast and works well enough in most cases. Unfortunately, the English language is full of synonyms, homonyms, and ambiguity.
What if a search engine could understand the contextual meaning and intent behind our query? This is where semantic search comes in. In this article we will be talking specifically about text embedding vector-based search. That’s quite the mouthful. Let’s see if we can break it down.
To understand how vector embeddings support data retrieval, let’s look at an example of how they allow machine learning algorithms to capture the underlying relationships and patterns in data, leading to more accurate predictions and insights.
In a traditional database you can perform filters like this:
WHERE [ID] = 35356323
WHERE [SALES] > 200000 AND [REGION] != 'NORTHWEST'
WHERE [SALES_DATE] > '2023-01-01'
These kinds of queries work well enough for structured data. But how do you search unstructured data like text or images? For example, how would you index text like this:
The Maine Coon is a large, domesticated cat breed. It is one of the oldest natural breeds in North America.
The breed originated in the U.S. state of Maine, where it is the official state cat.
The Maine Coon is a large and social cat, which may be why it has earned the nickname "the gentle giant."
The Maine Coon is predominantly known for its size and dense coat of fur which helps the large feline to survive in the harsh climate of Maine.
The Maine Coon is often cited as having "dog-like" characteristics.
This is a common problem and one that search engines have been dealing with for decades. The simplest approaches involve keywords. You search for ‘cat’ and return all documents with the word ‘cat’ in them.
What if you search for ‘big, long-haired kitty’? Clearly this article would be relevant to that search, but none of the search words are found in the article. You could use synonym lookups, but those are difficult to maintain and error-prone. Semantic search goes beyond simple keywords and strives to capture the intent and contextual meaning of the query.
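The limitation is easy to see in code. The sketch below is a minimal lexical matcher over two hypothetical documents (the filenames and text are made up for illustration): a query like ‘cat’ matches, but ‘big long-haired kitty’ matches nothing, even though the Maine Coon article is clearly relevant.

```python
# Minimal keyword (lexical) search: a document matches only if it
# shares at least one term with the query. Documents are hypothetical.
documents = {
    "Maine Coons.docx": "The Maine Coon is a large domesticated cat breed",
    "Governance Plan.pdf": "Board roles and responsibilities",
}

def keyword_search(query, docs):
    terms = set(query.lower().split())
    return [name for name, text in docs.items()
            if terms & set(text.lower().split())]

print(keyword_search("cat", documents))
# -> ['Maine Coons.docx']
print(keyword_search("big long-haired kitty", documents))
# -> []  (relevant document missed: no literal word overlap)
```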
Vector embedding works by analyzing enormous amounts of text data to identify patterns and relationships between words. We can then use this analysis to convert words and text into an array of numbers called a vector. These vectors encode the meaning of the text and are much easier for computers to work with. Let’s look at a simplified example.
In our simple embedding we can see that words that are similar to each other are grouped close together. Words that are not similar are far away. We can measure these distances with a simple cosine similarity formula. The distances between words can be interpreted as how semantically close they are.
Embeddings can be thought of as coordinates in an abstract semantic space. In our simplified example we are just using X, Y coordinates.
In real use these embedding vectors would be much larger. OpenAI’s text-embedding-ada-002 model produces vectors with 1,536 dimensions. Our example is just using single words, but we can also convert phrases, whole documents, images, and more to embeddings.
Let’s go back to our original example and assign each document in our library to an embedding vector.
Maine Coons.docx = [12, 12]
2023 Financial Report.xlsx = [5, 5]
Governance Plan.pdf = [-6, -6]
When we run the search “big, long-haired kitty” through our model, we get an embedding of [12, 11]. It is easy to see that Maine Coons.docx is the closest match.
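We can verify this with a quick calculation over the toy coordinates from the example above, ranking the three documents by Euclidean distance to the query embedding:

```python
import math

# The three toy document embeddings from the example, plus the
# query embedding for "big, long-haired kitty".
library = {
    "Maine Coons.docx": (12, 12),
    "2023 Financial Report.xlsx": (5, 5),
    "Governance Plan.pdf": (-6, -6),
}
query = (12, 11)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rank documents by distance to the query; the smallest distance wins.
best = min(library, key=lambda name: euclidean(library[name], query))
print(best)  # -> Maine Coons.docx (distance 1.0 vs ~9.2 and ~24.8)
```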
Searching is easy if you only have three documents, but what if we have millions or even billions of documents? Checking the distance from the search embedding to each document embedding would be computationally expensive. This is where vector databases come in. These databases have developed efficient ways of storing and searching vector embeddings.
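To see why a naive scan does not scale, here is a brute-force nearest-neighbor sketch over randomly generated vectors (the dimensions and collection size are arbitrary). Every query must touch every stored vector, which is why vector databases instead build approximate nearest-neighbor (ANN) indexes such as HNSW that trade a little accuracy for dramatically fewer comparisons.

```python
import math
import random

# Brute-force nearest-neighbor search: compare the query against
# every stored vector. Cost grows linearly with the collection size,
# which becomes prohibitive at millions or billions of documents.
random.seed(0)
DIM = 64
vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(10_000)]
query = [random.gauss(0, 1) for _ in range(DIM)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# O(n * d) scan over the entire collection for a single query.
nearest = min(range(len(vectors)),
              key=lambda i: squared_distance(vectors[i], query))
print(nearest)
```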
Vector databases have seen a surge in popularity because they pair well with large language models like ChatGPT. ChatGPT plug-ins combined with a vector database allow users to safely query their own data without feeding it into the larger model.
Some examples of vector databases: Weaviate, Pinecone, and Zilliz.
In conclusion, semantic search powered by vector embeddings offers a more sophisticated and accurate approach to information retrieval than traditional keyword-based methods. By capturing the contextual meaning and intent behind a query, semantic search can better handle unstructured data like text and images.
As technology continues to evolve, we can expect further advancements in semantic search, opening new possibilities for data analysis and insights.
A Brief History of Word Embeddings