How search engines work: the absolute basics of indexing and retrieval.
Imagine you’re looking for a single sentence in a library containing every book ever written. Sounds impossible, right? Yet, that’s essentially what search engines do every day, looking through billions of web pages to find exactly what you’re looking for in a fraction of a second. It’s like having a super-powered librarian who can read every book simultaneously and instantly point you to the exact page you need.
But how do they work? Let’s break down how a simple search engine would work, step by step.
Step 1. Document Crawling and Preprocessing
In data engineering, ETL stands for Extract, Transform, and Load. These are essentially the first few steps of a search engine’s pipeline.
Step 1.1 Extract
Search engines like Google and Bing systematically browse and download billions of web pages. They use programs known as web crawlers to follow links and discover these pages. The most basic search engines extract only text from these pages. Modern search engines like Google also use computer vision to parse information from the images on those pages. Additionally, metadata such as meta descriptions and titles may be extracted.
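The Extract stage can be sketched in a few lines. The snippet below is a toy sketch using Python’s standard html.parser: it pulls out the two things a crawler cares about, the page’s visible text (to index) and its links (to crawl next). Actually fetching pages over HTTP, and respecting robots.txt, is omitted here.

```python
# A minimal sketch of the "Extract" step, assuming the HTML has
# already been downloaded. A real crawler would fetch URLs over HTTP
# and obey robots.txt; here we only parse a page.
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collects outgoing links (to crawl next) and visible text (to index)."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Every <a href="..."> is a candidate page to crawl next.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Text between tags is what gets indexed.
        if data.strip():
            self.text_parts.append(data.strip())

html = '<html><body><h1>Hello</h1><a href="/about">About us</a></body></html>'
extractor = PageExtractor()
extractor.feed(html)
print(extractor.links)       # ['/about']
print(extractor.text_parts)  # ['Hello', 'About us']
```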
Step 1.2 Transform

In this stage the extracted text is normalized. The following are some of the processes performed here.
- Tokenization: Breaking text into individual words or terms.
- Lowercasing: Converting all text to lowercase for consistency.
- Stopword removal: Removing common words like “the” or “and” that don’t carry significant meaning.
- Stemming or lemmatization: Reducing words to their root form (e.g., “running” to “run”). Lemmatization is much more effective than stemming, although it also requires significantly more processing power and time.
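The steps above can be sketched as a small pipeline. The stemmer below is deliberately naive (it just strips a few common suffixes); real engines use a proper algorithm like Porter’s stemmer or a dictionary-backed lemmatizer, and the stopword list here is a made-up fragment.

```python
# A sketch of the "Transform" step: tokenization, lowercasing,
# stopword removal, and a crude suffix-stripping stemmer.
import re

STOPWORDS = {"the", "and", "a", "an", "is", "to", "of"}

def stem(word):
    # Very naive stemming: strip a few common English suffixes,
    # then collapse a doubled final consonant ("runn" -> "run").
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2]:
                word = word[:-1]
            return word
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenize + lowercase
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The dog is running to the park"))
# ['dog', 'run', 'park']
```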
Step 2. Load
In the load stage, the parsed data is stored in a format optimized for quick retrieval.
Step 2.1 Inverted Indexing
Instead of iterating through every document looking for a word, it’s much quicker to keep track of every document in which a certain word appears. An inverted index does exactly this: it keeps a record of every unique term and the documents it appears in. Here is an example of an inverted index table:
word_1 | doc_1,doc_99,doc_6789
word_2 | doc_99,doc_890,doc_7789
word_3 | doc_1,doc_99,doc_100
word_4 | doc_4,doc_6,doc_80
If the user runs a query like “word_1 word_3”, then the search algorithm knows it only needs to retrieve information about doc_1, doc_99, doc_6789 and doc_100, instead of examining every other document.
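The table above maps naturally onto a dictionary of sets. Here is a minimal sketch, with a made-up four-document toy corpus whose IDs echo the example table:

```python
# A minimal inverted index: maps each term to the set of document IDs
# that contain it. Assumes documents are already preprocessed token lists.
from collections import defaultdict

docs = {
    "doc_1": ["word_1", "word_3"],
    "doc_99": ["word_1", "word_2", "word_3"],
    "doc_6789": ["word_1"],
    "doc_100": ["word_3"],
}

index = defaultdict(set)
for doc_id, tokens in docs.items():
    for token in tokens:
        index[token].add(doc_id)

# For the query "word_1 word_3", only the union of the two posting
# lists needs to be examined -- not every document in the collection.
candidates = index["word_1"] | index["word_3"]
print(sorted(candidates))  # ['doc_1', 'doc_100', 'doc_6789', 'doc_99']
```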
Step 2.2 Term Frequency-Inverse Document Frequency (TF-IDF)
This fancy set of words refers to a technique that lets us determine how important a word is within a collection of documents.
Term Frequency (TF): How often a term appears in a document. The formula for TF is as follows:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
Inverse Document Frequency (IDF): Measures how common or rare a term is across all documents. The formula is as follows:

IDF(t) = log(N / number of documents containing term t)

where N is the total number of documents.
Corpus is a word used to refer to the entire collection of documents.
TF-IDF is calculated by simply multiplying these two values:

TF-IDF(t, d) = TF(t, d) × IDF(t)
What these search engine algorithms do is pre-calculate and store these TF-IDF values for every unique word in the entire database. That way, the values can be retrieved extremely fast and used to find and rank documents relevant to a query.
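This pre-computation can be sketched as follows; the tiny three-document corpus is made up purely for illustration, and TF and IDF use the simple definitions given above.

```python
# A sketch of pre-computing TF-IDF for every (term, document) pair,
# with TF = count / doc_length and IDF = log(N / document_frequency).
import math

docs = {
    "doc_1": ["cat", "sat", "mat"],
    "doc_2": ["cat", "cat", "dog"],
    "doc_3": ["dog", "barked"],
}

N = len(docs)

# Document frequency: in how many documents does each term appear?
df = {}
for tokens in docs.values():
    for term in set(tokens):
        df[term] = df.get(term, 0) + 1

def tf_idf(term, doc_id):
    tokens = docs[doc_id]
    tf = tokens.count(term) / len(tokens)
    idf = math.log(N / df[term])
    return tf * idf

# "cat" appears twice in doc_2, so it scores higher there than in doc_1.
print(tf_idf("cat", "doc_2"))
print(tf_idf("cat", "doc_1"))
```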
At this point, everything that needs to be stored to make the search engine work is in place. In other words, for this set of documents, the Load stage is complete.
Step 3. Document Ranking
This is where the math gets a bit confusing, but I’ll try my best to explain it.
If the query is just a single word, it’s as easy as:
- Use the inverted index table to look up every document in which that word appears.
- Sort the results by TF-IDF score for that particular term and return the top N documents.
(To be clear, this is how a very basic search engine would do it.)
When the query consists of multiple terms, things get more complicated.
Step 3.1 Query Processing
When the user enters a query, the search algorithm first performs Step 1.2 on the query in order to clean it up. Then a TF value is calculated for each query term. An adjusted formula looks something like this:

TF(t, q) = (number of times term t appears in query q) / (total number of terms in query q)
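As a small sketch of Step 3.1, here is the TF calculation applied to a made-up query that has already been through the Step 1.2 preprocessing:

```python
# A sketch of Step 3.1: compute a term-frequency value for each term
# in an (already preprocessed) query. The query tokens are invented
# for illustration.
query_tokens = ["cheap", "flight", "cheap", "hotel"]

counts = {}
for term in query_tokens:
    counts[term] = counts.get(term, 0) + 1

total = len(query_tokens)
query_tf = {term: count / total for term, count in counts.items()}

print(query_tf)  # {'cheap': 0.5, 'flight': 0.25, 'hotel': 0.25}
```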
Step 3.2 Vector Representation
The purpose of this step is to bring together all the values we calculated so far and use them all to mathematically find the most relevant documents. To do this, search algorithms use vector mathematics.
Imagine a document as a unique point in a high-dimensional space. Each dimension (axis) corresponds to a unique keyword in our entire document collection (Corpus).
A document’s coordinate along a particular axis is the TF-IDF value of that term for the document. Higher TF-IDF scores translate to higher values along that dimension in the vector space.

In a two-dimensional example, only two words are considered; in a real system this would be a very high-dimensional space.
Now imagine the query’s TF-IDF values, discussed earlier in Step 3.1, also represented in this same space.

Step 3.3 Cosine Similarity
Now, with both documents and queries represented as vectors, cosine similarity comes into play. The formula for cosine similarity between two vectors, DocVector (representing a document) and QueryVector (representing the user’s query), is as follows:

cosine_similarity(DocVector, QueryVector) = (DocVector · QueryVector) / (‖DocVector‖ × ‖QueryVector‖)

Here, · is the dot product and ‖·‖ is the magnitude (length) of a vector.
In this way, cosine similarity is calculated between the query and each candidate document.
Documents with higher cosine similarity scores are considered more relevant to the query.
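Putting Steps 3.2 and 3.3 together, here is a toy sketch. The TF-IDF coordinates are hard-coded, illustrative values rather than computed from real documents, and the three-term vocabulary is invented.

```python
# A sketch of Steps 3.2-3.3: documents and the query as TF-IDF vectors
# over the vocabulary, ranked by cosine similarity.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Axes: ("apple", "pie", "recipe") -- one dimension per vocabulary term.
doc_vectors = {
    "doc_1": [0.9, 0.8, 0.1],   # mostly about apple pie
    "doc_2": [0.1, 0.0, 0.9],   # mostly about recipes in general
    "doc_3": [0.5, 0.5, 0.5],
}
query_vector = [0.7, 0.7, 0.0]  # query: "apple pie"

ranked = sorted(
    doc_vectors,
    key=lambda d: cosine_similarity(doc_vectors[d], query_vector),
    reverse=True,
)
print(ranked)  # most relevant document first
```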
Step 4. Result Presentation
And that leads us to the final step. It’s as simple as sorting those cosine similarity scores from high to low and presenting a particular number of results. 🎉
Now, it’s important to understand that modern search engines like Google and Bing have evolved far beyond the basic example we’ve discussed. They employ numerous advanced techniques such as machine learning, personalization, and knowledge graphs to provide more accurate and relevant results.
Thank you for reading.