This content originally appeared on HackerNoon and was authored by Ritesh Modi
When I first started working with text data years ago, the whole concept of embeddings seemed unnecessarily complex. I was comfortable with my bag-of-words approaches and TF-IDF vectors: they were straightforward, easy to understand, and got the job done.
I didn't seriously consider embeddings until I encountered a challenge in a sentiment analysis project. I was working with product reviews, and my traditional models kept misclassifying reviews with sarcasm or nuanced language. The problem became clear: my models didn't understand that "this product is sick" could actually be positive or that "worked exactly as expected" could be neutral or negative depending on context.
That's when I discovered what embeddings actually solve. They're not just a fancy new technique; they address fundamental limitations in how machines understand language.
The Old Days: Life Before Embeddings
Let's briefly review the text representations that came before embeddings, without getting into the details of each method; they are well covered elsewhere.
One-Hot Encoding
One-hot encoding represented each word as a massive, sparse vector: all zeros except for a single "1" at the position corresponding to that word in the vocabulary, so every word got its own dimension. If your vocabulary had 100,000 words (which is modest), each word vector had 99,999 zeros and a single 1. These representations told us absolutely nothing about meaning. The words "excellent" and "fantastic" were mathematically just as different as "excellent" and "terrible", completely missing the obvious semantic relationships.
\ "cat" → [1, 0, 0, 0, …, 0] (position 5432 in vocabulary) "dog" → [0, 1, 0, 0, …, 0] (position 8921 in vocabulary)
\
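To make this concrete, here is a minimal one-hot sketch over a made-up five-word vocabulary (the vocabulary and indices are purely illustrative):
import numpy as np
# Toy vocabulary; real vocabularies run into the hundreds of thousands of words
vocab = ["cat", "dog", "mat", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}
def one_hot(word: str) -> np.ndarray:
    """Return a vector of all zeros with a single 1 at the word's index."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector
print(one_hot("cat"))  # [1. 0. 0. 0. 0.]
print(one_hot("dog"))  # [0. 1. 0. 0. 0.]
# Every pair of distinct words is equally far apart - no notion of similarity
print(np.dot(one_hot("cat"), one_hot("dog")))  # 0.0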
Limitations:
- Dimensionality explosion: Vectors had as many dimensions as vocabulary size (often 100,000+)
- No semantic relationships: "cat" and "kitten" were as different as "cat" and "airplane" (all equidistant)
- Computational inefficiency: Multiplying these sparse matrices was extremely resource-intensive
- No generalization: The system couldn't understand words outside its original vocabulary
Bag-of-Words Approach
It counted word occurrences in documents, sometimes weighted by their importance, and treated each document as an unordered collection of words, completely throwing away word order. "The dog bit the man" and "The man bit the dog" had identical representations.
\ Document: "The cat sat on the mat" BoW: {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
Limitations:
- Loss of word order: "Dog bites man" and "Man bites dog" had identical representations
- Sparse high-dimensional vectors: Still required vocabulary-sized vectors
- No semantic understanding: Synonyms were represented as completely different features
- No contextual meaning: Each word had a fixed representation regardless of context
N-grams
To capture some word order, we started using n-grams: sequences of N consecutive words that preserve a little local context.
With unigrams (single words), you might have a vocabulary of 100,000. With bigrams (word pairs), suddenly you're looking at potentially millions of features. With trigrams? Billions, theoretically. Even with aggressive pruning, the dimensionality became unmanageable.
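You can watch the feature count grow by fitting CountVectorizer with increasing n-gram sizes; the tiny corpus below is only for illustration:
from sklearn.feature_extraction.text import CountVectorizer
docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
for n in (1, 2, 3):
    vectorizer = CountVectorizer(ngram_range=(n, n))
    vectorizer.fit(docs)
    print(f"{n}-grams: {len(vectorizer.get_feature_names_out())} features")
# Even on three short sentences the feature count grows quickly;
# on a real corpus bigrams and trigrams explode into the millions.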
Limitations:
- Combinatorial explosion: The number of possible n-grams grows exponentially
- Data sparsity: Most possible n-grams never appear in training data
- Limited context window: Only captured relationships within small windows (typically 2-5 words)
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF improved things by weighting words based on how important they were to a specific document relative to the corpus. But it still treated "amazing" and "excellent" as completely unrelated terms.
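As a quick illustration, here is a sketch using scikit-learn's TfidfVectorizer on a made-up three-review corpus; the exact weights are not the point, the separate, unrelated columns for "amazing", "excellent", and "terrible" are:
from sklearn.feature_extraction.text import TfidfVectorizer
docs = [
    "this product is amazing",
    "this product is excellent",
    "this product is terrible",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
# "amazing", "excellent", and "terrible" each get their own column with no
# notion that the first two are near-synonyms and the third is their opposite.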
Limitations:
- No semantic meaning: importance is driven purely by word counts and document frequencies, so synonyms still look completely unrelated
The Embedding Revolution: What Changed?
The transition to embeddings wasn't just an incremental improvement; it was a paradigm shift in how we represent language. Here's what made the difference:
Meaning Through Context
The fundamental insight behind embeddings is deceptively simple: words that appear in similar contexts likely have similar meanings. If you see "dog" and "cat" appearing around the same kinds of words ("pet," "food," "fur"), they're probably semantically related.
Early embedding models like Word2Vec captured this by training neural networks to predict either:
- A word based on its surrounding context (Continuous Bag of Words)
- The surrounding context based on a word (Skip-gram)
The hidden-layer weights from these models became our word vectors, encoding semantic relationships in the geometric properties of the vector space.
When I first plotted word vectors and saw that "king" - "man" + "woman" ≈ "queen", I knew we were onto something revolutionary. These weren't just arbitrary numbers; they captured meaningful semantic relationships.
The next big leap came with contextual embeddings. Early models like Word2Vec and GloVe gave each word a single vector regardless of context. But the same word can mean different things in different contexts:
"I need to bank the money" vs. "I'll meet you by the river bank"
Models like BERT and GPT solved this by generating different embeddings for the same word depending on its surrounding context. This was a game-changer for tasks like named entity recognition and sentiment analysis, where context determines meaning.
With that history in mind, let's look at what embeddings are and how they address the limitations of the earlier approaches.
What Are Embeddings?
Embeddings are numerical representations of data (text, images, audio, etc.) in a continuous vector space. For text, embeddings capture semantic relationships between words or documents, allowing machines to understand meaning in a way that's mathematically processable.
Key Concepts:
- Vectors: Ordered lists of numbers representing a point in multi-dimensional space
- Dimensions: The number of values in each vector (e.g., 768-dim, 1024-dim)
- Vector Space: The mathematical space where embeddings exist
- Semantic Similarity: Measured by distance or angle between vectors (closer = more similar)
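In practice, semantic similarity is most often measured with cosine similarity. Here is a minimal sketch with invented toy vectors; the numbers and the helper function are purely illustrative:
import numpy as np
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Toy 4-dimensional "embeddings" - real ones have hundreds of dimensions
cat = np.array([0.2, -0.4, 0.1, -0.8])
kitten = np.array([0.19, -0.38, 0.15, -0.75])
airplane = np.array([-0.6, 0.9, 0.5, 0.3])
print(cosine_similarity(cat, kitten))    # close to 1.0 - semantically related
print(cosine_similarity(cat, airplane))  # much lower - unrelated concepts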
What Do Dimensions Represent?
Each dimension in an embedding vector represents a learnt feature or aspect of the data. Unlike classical feature engineering where humans define what each dimension means, in modern embedding models:
- Dimensions emerge during training to represent abstract "concepts"
- Individual dimensions often lack specific human-interpretable meaning
- The complete vector, however, captures semantic information holistically
- Some dimensions might correspond to sentiment, formality, topic, or syntax, but most represent complex combinations of features
Why We Need Embeddings
Computers fundamentally work with numbers, not words. When processing language, we need to convert text into numerical representations that:
- Capture semantic relationships – similar concepts should have similar representations
- Preserve contextual meaning – The same word can mean different things in different contexts
- Enable mathematical operations – Like finding similarities or performing analogies
- Work efficiently at scale – Process large volumes of text without computational explosion
Embeddings solve these problems by representing words, phrases, or documents as dense vectors in a continuous space where semantic relationships are preserved as geometric relationships.
Foundations of Embeddings
Dense Vector Representation
Instead of sparse vectors with thousands or millions of dimensions, embeddings use a few hundred dense dimensions where each dimension contributes to meaning.
\ "cat" → [0.2, -0.4, 0.1, -0.8, …, 0.3] (300 dimensions)
"kitten" → [0.19, -0.38, 0.15, -0.75, …, 0.29] (similar to "cat")
\ This makes computation orders of magnitude more efficient while enabling richer semantic representation.
Distributional Semantics
Embeddings are built on the principle that "you shall know a word by the company it keeps" (J.R. Firth). By analyzing what words appear in similar contexts, embeddings capture semantic relationships automatically.
\ For example, "king" and "queen" will have similar contexts, so they'll have similar embeddings, even though they rarely appear in the exact same position.
Mathematical Properties
Embedding spaces have remarkable mathematical properties:
\ vector("king") - vector("man") + vector("woman") ≈ vector("queen")
\ This allows for analogical reasoning and semantic operations directly in the vector space.
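You can try this yourself with pre-trained vectors. The sketch below assumes gensim and its downloader are available and uses the small bundled GloVe model named in the code, which downloads on first use:
import gensim.downloader as api
# Load small pre-trained GloVe vectors (the file downloads on first run)
vectors = api.load("glove-wiki-gigaword-50")
# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # "queen" is typically at or near the top of the list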
Transfer Learning
Pre-trained embeddings capture general language knowledge that can be fine-tuned for specific tasks, dramatically reducing the data needed for new applications.
Contextual Understanding
Modern contextual embeddings (like those from BERT, GPT, etc.) represent the same word differently based on context:
\ "I'll deposit money in the bank" → "bank" relates to finance
"I'll sit by the river bank" → "bank" relates to geography
With the history and the intuition covered, it's time to get down to actually using embeddings.
Using LLM/SLM Models to Generate Embeddings
Various research teams have developed embedding models trained on diverse datasets spanning multiple languages and domains. This diversity results in models with vastly different vocabularies and semantic understanding capabilities. For instance, models trained predominantly on English scientific literature will encode technical concepts differently than those trained on multilingual social media content. This specialisation allows practitioners to select embedding models that best align with their specific use cases.
The practical implementation of embeddings has been greatly simplified by libraries like the SentenceTransformers package from Hugging Face, which provides a comprehensive SDK for working with various embedding models. Similarly, OpenAI's SDK offers straightforward access to their embedding models, which have shown impressive performance across many benchmarks. These tools, and there are many more, have democratised access to state-of-the-art embedding technologies, allowing developers to integrate semantic understanding into applications without needing to train models from scratch.
For the purposes of this article, the models should be treated as black boxes that take sentences as input and return their corresponding vector representations. The details of how an embedding model works internally are beyond the scope of this post.
Using SentenceTransformers Library for Embeddings
The simplest way to generate embeddings using SentenceTransformer:
from sentence_transformers import SentenceTransformer
# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2') # 384 dimensions
# Generate embeddings
texts = ["This is an example sentence", "Each sentence becomes a vector"]
embeddings = model.encode(texts)
print(f"Shape: {embeddings.shape}") # (2, 384)
max_seq_length = model.max_seq_length
print(max_seq_length) # 256
The model "all-MiniLM-L6-v2", available from Hugging Face, produces 384-dimensional embeddings. That means it captures 384 learned features, or nuances, of a given word or sentence. The maximum sequence length of this model is 256 tokens. During the embedding process, the tokenizer breaks sentences into tokens, which are often subword pieces rather than whole words. For English text, the number of tokens is typically higher than the number of words, often by roughly 25% to 40%, though the exact ratio depends on the text and the tokenizer.
Sequence length denotes the maximum number of tokens the model can process as input. Shorter inputs are padded up to the required length, and anything beyond the limit is truncated. There are implications of this, and we will discuss them in another post.
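You can check the word-to-token ratio for your own text directly, assuming the model exposes its underlying Hugging Face tokenizer via model.tokenizer (as the earlier snippet already relies on); the exact pieces you see depend on the tokenizer:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
text = "Tokenization often splits longer or rarer words into several subword pieces."
tokens = model.tokenizer.tokenize(text)
print(f"Words:  {len(text.split())}")  # whitespace-separated word count
print(f"Tokens: {len(tokens)}")        # usually larger than the word count
print(tokens)                          # continuation pieces are marked with '##' by this tokenizer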
The encode method of the SentenceTransformer class is a convenience wrapper that runs the model in PyTorch inference mode. The code above is roughly equivalent to wrapping the encode call in torch.no_grad explicitly:
from sentence_transformers import SentenceTransformer
import torch
# Load the model directly with SentenceTransformer
model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")
# Input text
texts = ["This is an example sentence", "Each sentence becomes a vector"]
# Get embedding directly
with torch.no_grad():
    embedding = model.encode(texts, convert_to_tensor=True)
print(embedding)
Here, torch.no_grad ensures that no gradients are tracked during the forward pass, since we are only running inference and never back-propagating.
Another, more generic way to generate embeddings is to use the Hugging Face transformers library with PyTorch directly:
import torch
from transformers import AutoModel, AutoTokenizer
# Load the model
model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
# Get the tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
# Tokenize input
text = ["This is an example sentence"]
encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
# Get embedding of the [CLS] token
with torch.no_grad():
    outputs = model(**encoded_input, return_dict=True)
    cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding)
The difference from the earlier snippets is that the encode call has been replaced with explicit use of the tokenizer and the model.
Another difference is that we use outputs.last_hidden_state[:, 0] to retrieve the vector for the [CLS] token. This special token is added at the beginning of every sentence and accumulates information about the entire sentence. The encode method handles this pooling for us automatically.
Note that the [CLS]-token approach applies only to certain transformer architectures, such as BERT, its variants, and other encoder-only transformers. There are also architectures, like SBERT, that produce embeddings for entire sentences without relying on this technique.
Best for: Classification and sequence-level prediction tasks
Why it works: The [CLS] token in BERT-style models is specifically trained to aggregate information from the entire sequence during pre-training. It functions as a "summary" token that captures the overall meaning.
When to choose:
- When using BERT, RoBERTa, or similar models for classification
- When you need a single vector representing an entire sequence
- When your downstream task involves predicting a property of the whole text
CLS pooling is just one way to capture a sentence-level embedding. There are several other methods as well.
Mean Pooling
Taking the average of all token embeddings is surprisingly effective for many tasks. It's my go-to method when I'm using embeddings for similarity or retrieval tasks.
Best for: Semantic similarity, retrieval, and general-purpose representations
Why it works: By averaging across all token representations, mean pooling captures the collective semantic content while reducing noise. It gives equal weight to all meaningful tokens.
When to choose:
- For document similarity or semantic search applications
- When you need robust representations that aren't dominated by any single token
- When empirical testing shows it outperforms other methods (it often does for similarity tasks)
import torch
from transformers import AutoTokenizer, AutoModel
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Tokenize input
texts = ["This is an example sentence", "Each sentence becomes a vector"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Mean pooling
with torch.no_grad():
    outputs = model(**inputs)
# Get attention mask to ignore padding tokens
attention_mask = inputs['attention_mask']
# Sum token embeddings and divide by the number of tokens
sum_embeddings = torch.sum(outputs.last_hidden_state * attention_mask.unsqueeze(-1), dim=1)
count_tokens = torch.sum(attention_mask, dim=1, keepdim=True)
mean_embeddings = sum_embeddings / count_tokens
print(f"Shape: {mean_embeddings.shape}") # (2, 768)
Max Pooling
Max pooling takes the maximum value for each dimension across all tokens. It's surprisingly good at capturing important features regardless of where they appear in the text.
Best for: Feature detection and information extraction tasks
Why it works: Max pooling selects the strongest activation for each dimension across all tokens, effectively capturing the most salient features regardless of where they appear in the text.
When to choose:
- When specific features matter more than their frequency or position
- When looking for the presence of particular concepts or entities
- When dealing with long texts where important signals might get diluted in averaging
import torch
from transformers import AutoTokenizer, AutoModel
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Tokenize input
texts = ["This is an example sentence", "Each sentence becomes a vector"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Max pooling
with torch.no_grad():
    outputs = model(**inputs)
# Create a mask to ignore padding tokens for max pooling
attention_mask = inputs['attention_mask'].unsqueeze(-1)
# Replace padding token representations with a large negative value so they're never selected as max
token_embeddings = outputs.last_hidden_state.masked_fill(attention_mask == 0, -1e9)
# Take max over the token dimension
max_embeddings = torch.max(token_embeddings, dim=1)[0]
print(f"Shape: {max_embeddings.shape}") # (2, 768)
Weighted Mean Pooling
Not all words contribute equally to meaning. The weighted pooling method tries to give more weight to more important tokens based on position (e.g., giving more weight to later tokens).
Best for: Tasks where different parts of the input have different importance
Why it works: Not all words contribute equally to meaning. Weighted pooling allows you to emphasize certain tokens based on their position, attention scores, or other relevance metrics.
When to choose:
- When sequence order matters (e.g., giving more weight to later tokens)
- When certain tokens are inherently more informative (e.g., nouns and verbs vs. articles)
- When you have a specific importance heuristic that makes sense for your task
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Tokenize input
texts = ["This is an example sentence", "Each sentence becomes a vector"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Weighted mean pooling - more weight to later tokens
with torch.no_grad():
    outputs = model(**inputs)
# Get token embeddings and attention mask
token_embeddings = outputs.last_hidden_state
attention_mask = inputs['attention_mask']
# Create position-based weights (later positions get higher weights)
input_lengths = torch.sum(attention_mask, dim=1).unsqueeze(-1)
position_indices = torch.arange(token_embeddings.size(1)).unsqueeze(0).expand_as(attention_mask)
position_weights = position_indices.float() / input_lengths.float()
position_weights = position_weights * attention_mask
# Normalize weights to sum to 1
position_weights = position_weights / torch.sum(position_weights, dim=1, keepdim=True)
# Apply weights and sum
weighted_embeddings = torch.sum(token_embeddings * position_weights.unsqueeze(-1), dim=1)
print(f"Shape: {weighted_embeddings.shape}") # (2, 768)
Last Token Pooling
Last token pooling is a technique for creating a single embedding vector from a sequence of token embeddings by selecting only the final token's representation.
Best for: Autoregressive models and sequential processing
Why it works: In left-to-right models like GPT, the final token contains the accumulated context from the entire sequence, making it information-rich for certain tasks.
When to choose:
- When using GPT or other decoder-only models
- When working with tasks that depend heavily on the full preceding context
- For text generation or completion tasks
import torch
from transformers import AutoTokenizer, AutoModel
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Tokenize input
texts = ["This is an example sentence", "Each sentence becomes a vector"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Last token pooling
with torch.no_grad():
    outputs = model(**inputs)
# Get the last non-padding token for each sequence
# (with a BERT tokenizer this is the [SEP] token; the method is most natural for decoder-only models like GPT)
attention_mask = inputs['attention_mask']
last_token_indices = torch.sum(attention_mask, dim=1) - 1
batch_indices = torch.arange(attention_mask.size(0))
# Extract the last token embedding for each sequence
last_token_embeddings = outputs.last_hidden_state[batch_indices, last_token_indices]
print(f"Shape: {last_token_embeddings.shape}") # (2, 768)
There are many more approaches, and these methods can also be combined to create custom pooling strategies (one illustrative sketch follows below). This was just the beginning: an introduction to embeddings as a concept, along with basic implementations for generating them using different techniques.
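As one hypothetical example of such a combination, the sketch below concatenates mean-pooled and max-pooled vectors into a single, wider representation; the model choice and the idea of concatenation are illustrative, not a recommendation:
import torch
from transformers import AutoTokenizer, AutoModel
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
texts = ["This is an example sentence", "Each sentence becomes a vector"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1)
# Mean pooling over non-padding tokens
mean_pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
# Max pooling with padding positions pushed to a large negative value
max_pooled = token_embeddings.masked_fill(mask == 0, -1e9).max(dim=1).values
# Concatenate the two views into one combined embedding
combined = torch.cat([mean_pooled, max_pooled], dim=1)
print(f"Shape: {combined.shape}") # (2, 1536) - twice the hidden size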
Looking Forward: Where Embeddings Are Headed
The embedding space (pun intended) continues to evolve:
Multimodal embeddings are breaking down barriers between text, images, audio, and video. Models like CLIP and DALL-E use embeddings to create a shared semantic space between different modalities.
More efficient architectures like MobileBERT and DistilBERT are making it possible to use powerful embeddings on edge devices with limited resources.
Domain-specific embeddings pre-trained on specialized corpora are pushing the state-of-the-art in fields like medicine, law, and finance.
I'm particularly excited about composition-aware embeddings that better capture how meaning is constructed from smaller units, which could finally solve long-standing challenges with negation and compositional phrases.
Final Thoughts
Embeddings aren't just another NLP technique – they're a fundamental shift in how machines understand and process language. They've moved us from treating text as arbitrary symbols to capturing the rich, complex web of meanings and relationships that humans intuitively understand.
Whatever NLP task you're working on, chances are that thoughtfully applied embeddings can make it better. The key is understanding not just how to generate them but when and why to use different approaches.
And if you're still using bag-of-words or one-hot encoding for text analysis…
Well, there's a whole world of possibilities waiting for you.