Nov 19, 2024
The Rise of RAG in NLP Applications: In recent years, there has been a significant shift in the way machine learning models are being used in real-world applications. One such trend is the growing use of Retrieval-Augmented Generation (RAG). This approach combines the power of large language models with external knowledge retrieval, allowing applications to produce more accurate and contextually relevant outputs by retrieving specific pieces of information.
Why Sentence Similarity Matters: In the pursuit of building smarter applications with approaches like RAG, one key challenge is capturing how similar two pieces of text are, especially when the model needs to understand nuanced differences in meaning. This led me to explore different ways to measure sentence similarity — a foundational aspect of understanding relationships between textual data.
As I started diving into this space, I wanted to better understand the effectiveness of different sentence similarity measures. How do we know if two sentences, with possibly very different structures, still convey the same meaning? In this article, I’ll share my experiments and findings around comparing sentence embeddings using cosine similarity.
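To make this concrete, here is a minimal sketch of computing cosine similarity between two sentence embeddings. The model (all-MiniLM-L6-v2) and the example sentences are illustrative choices, not the exact setup from my experiments.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product of the vectors divided by the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative model and sentences; any sentence-embedding model could be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb_a, emb_b = model.encode([
    "She holds a master's degree in computer science.",
    "Her highest qualification is an MSc in computing.",
])

print(f"cosine similarity: {cosine_similarity(emb_a, emb_b):.3f}")
```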
When I mapped the cosine similarities between my validation and training sentences, I expected to find some clear threshold for sentence similarity. However, as the heatmap below shows, the results were more varied than anticipated. This variance made it difficult to establish a clear cut-off point for classifying sentences as similar.
I also experimented with the elbow cutoff method and tried gathering other empirical evidence to find a more definitive approach for determining the optimal cut-off. However, despite these efforts, I wasn’t able to arrive at a very clear or consistent solution across the dataset. The complexity of the data and the variation in results made it challenging to draw a firm conclusion.
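For reference, one simple way to look for an elbow is to sort the similarity scores in descending order and pick the point that deviates most from the straight line joining the first and last scores. The sketch below uses that heuristic with made-up scores; it illustrates the idea rather than the exact procedure I followed.

```python
import numpy as np

def elbow_cutoff(similarities: np.ndarray) -> float:
    """Pick a similarity threshold at the 'elbow' of the sorted score curve.

    Heuristic: sort scores in descending order, draw a line from the first
    to the last point, and return the score that deviates most (vertically)
    from that line.
    """
    scores = np.sort(similarities)[::-1]
    n = len(scores)
    x = np.arange(n)
    # Straight line from the first point (0, scores[0]) to the last point (n-1, scores[-1]).
    line = scores[0] + (scores[-1] - scores[0]) * x / (n - 1)
    # The elbow is where the sorted curve is farthest from this line.
    elbow_idx = int(np.argmax(np.abs(scores - line)))
    return float(scores[elbow_idx])

# Example with made-up cosine similarity scores.
sims = np.array([0.92, 0.90, 0.88, 0.85, 0.61, 0.58, 0.55, 0.40, 0.38, 0.35])
print(f"elbow threshold: {elbow_cutoff(sims):.2f}")
```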
There’s certainly more to explore on this topic, and perhaps in a future article, I’ll dive deeper into how methods like the elbow cutoff can be used in different contexts.
Hands-On Experiments with Sentence Similarity
1. Labelled Data of Already Tagged Sentences
To ensure my experiments had a controlled setup, I used a dataset of labelled sentences where the relationships between sentences were already tagged. These tags helped me compare the actual similarity between sentences to the cosine similarity scores generated by the models. The dataset included sentences that mimicked real-world sentence-matching tasks, such as comparing educational qualifications or job descriptions.
Having this labelled data allowed me to objectively evaluate how well the models performed, as I could compare the cosine similarity scores against the expected relationships between sentences. This gave me a baseline to assess whether the models were effectively capturing the meaning behind the sentences or just identifying superficial token-level similarities.
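To illustrate the kind of check this enables, here is a minimal sketch that scores a handful of hypothetical labelled pairs and measures how well the cosine similarities rank similar pairs above dissimilar ones (via ROC AUC). The pairs, labels, and model choice are placeholders, not my actual dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sentence_transformers import SentenceTransformer

# Hypothetical labelled pairs: (sentence_a, sentence_b, label), where label 1 = similar.
pairs = [
    ("BSc in Computer Science", "Bachelor's degree in computing", 1),
    ("5 years of Java experience", "Senior Java developer role", 1),
    ("BSc in Computer Science", "Fluent in French and Spanish", 0),
    ("Project management certification", "Enjoys hiking on weekends", 0),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice of model

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

labels, scores = [], []
for sent_a, sent_b, label in pairs:
    emb_a, emb_b = model.encode([sent_a, sent_b])
    labels.append(label)
    scores.append(cosine(emb_a, emb_b))

# ROC AUC indicates how well the raw similarity scores rank similar pairs
# above dissimilar ones, independent of any particular cut-off.
print(f"ROC AUC: {roc_auc_score(labels, scores):.3f}")
```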
2. Pooling Method
After generating embeddings using models like BERT, OpenAI’s text-embedding-ada-002, and SentenceTransformers models (such as all-MiniLM-L6-v2, stsb-roberta-large, and all-distilroberta-v1), I handled the embeddings differently depending on the model.
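To make the setup concrete, here is a rough sketch of how embeddings can be obtained from each family of models. The model choices mirror the ones above, but the client code is illustrative: the OpenAI call assumes the openai Python package (v1+) with an API key configured, and the exact code I used may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
from openai import OpenAI

sentence = "Master's degree in data science with three years of NLP experience."

# 1. BERT via Hugging Face: returns one embedding per token,
#    so a pooling step is still needed to get a sentence-level vector.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    inputs = tokenizer(sentence, return_tensors="pt")
    token_embeddings = bert(**inputs).last_hidden_state  # shape: (1, num_tokens, 768)

# 2. SentenceTransformers: encode() applies the model's own pooling and
#    returns a single vector; token-level outputs can also be requested
#    when experimenting with custom pooling.
st_model = SentenceTransformer("all-MiniLM-L6-v2")
st_embedding = st_model.encode(sentence)  # shape: (384,)

# 3. OpenAI text-embedding-ada-002: the API returns a sentence-level
#    embedding directly, so no pooling is required on our side.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.embeddings.create(model="text-embedding-ada-002", input=sentence)
openai_embedding = response.data[0].embedding  # list of 1536 floats
```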
For BERT and the SentenceTransformers models, which produce embeddings for each token (word or sub-word) in a sentence, I experimented with different pooling techniques to combine these token-level embeddings into a single sentence-level representation. In contrast, OpenAI’s text-embedding-ada-002 directly generates sentence-level embeddings, so no additional pooling was required. I then assessed how these different approaches impacted the resulting cosine similarity scores; a short code sketch of the three pooling strategies follows the list below.
Average Pooling: This method takes the average of all token embeddings in the sentence, giving equal weight to every token. It worked well for sentences where the overall structure mattered more than individual words.
Max Pooling: Here, I used the maximum value for each dimension from the token embeddings. This method helps when you want to highlight the most prominent or important words in a sentence, but it didn’t always perform as well when sentence structure was important.
CLS Pooling: In models like BERT, the first token in the sequence (the CLS token) is meant to represent the entire sentence. I tested whether using this CLS token alone provided better results than averaging or max pooling. Interestingly, while it worked well in some cases, it didn’t consistently outperform the other pooling strategies.
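Here is a minimal sketch of the three pooling strategies applied to BERT token embeddings. The masking of padding tokens is my addition for correctness when batching; the exact implementation I used may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def pooled_embeddings(sentences: list[str]) -> dict[str, torch.Tensor]:
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state  # (batch, tokens, 768)

    mask = inputs["attention_mask"].unsqueeze(-1)  # 1 for real tokens, 0 for padding

    # Average pooling: mean over real tokens only (padding masked out).
    avg = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

    # Max pooling: per-dimension max; padding positions are pushed to -inf first.
    masked = token_embeddings.masked_fill(mask == 0, float("-inf"))
    mx = masked.max(dim=1).values

    # CLS pooling: take the embedding of the first ([CLS]) token as the sentence vector.
    cls = token_embeddings[:, 0, :]

    return {"average": avg, "max": mx, "cls": cls}

embs = pooled_embeddings([
    "Holds an MBA in finance.",
    "Completed a master's in business administration.",
])
for name, vectors in embs.items():
    sim = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
    print(f"{name:8s} pooling -> cosine similarity {sim.item():.3f}")
```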
Results: Pooling and Model Comparisons
After applying the different pooling strategies and comparing the cosine similarity scores, I observed some interesting patterns across the models.
BERT and OpenAI: The average pooling method consistently worked better for both BERT and OpenAI models. These models were able to capture the overall semantic meaning of sentences more effectively, leading to higher cosine similarity scores when using average pooling. As seen in the heatmaps below, both BERT and OpenAI maintained relatively high and stable similarity scores across all test sentences.
SentenceTransformers: Surprisingly, SentenceTransformers didn’t perform as well in this setup. Despite their optimization for sentence similarity tasks, the results were more varied, and average pooling didn’t produce the same level of consistency observed in the other models. However, I’m still in the process of experimenting with these models to better understand their behavior and possibly identify conditions under which they might perform better.
Cosine Similarity Heatmap:
Below is a visual representation of the cosine similarity scores across test sentences for OpenAI, BERT, and SentenceTransformers using average pooling:
In the heatmaps displayed, the same color scale is applied across all models (OpenAI, BERT, and SentenceTransformers), so that a value such as 0.85 maps to the same color in every panel. This makes it possible to compare cosine similarity directly across the different embedding techniques.
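For reference, here is a minimal sketch of how a shared color scale can be enforced with seaborn and matplotlib; the similarity matrices below are random placeholders standing in for the real per-model scores.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder similarity matrices standing in for the real per-model scores.
rng = np.random.default_rng(0)
matrices = {
    name: rng.uniform(0.3, 1.0, size=(6, 6))
    for name in ("OpenAI", "BERT", "SentenceTransformers")
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, sims) in zip(axes, matrices.items()):
    # vmin/vmax pin the color scale, so a value like 0.85 maps to the same color in every panel.
    sns.heatmap(sims, ax=ax, vmin=0.0, vmax=1.0, cmap="viridis", cbar=ax is axes[-1])
    ax.set_title(name)

plt.tight_layout()
plt.show()
```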
Conclusion
In this exploration of sentence similarity models, I set out to understand how different models — BERT, OpenAI’s text-embedding-ada-002, and SentenceTransformers — perform when tasked with measuring the similarity between sentences using cosine similarity. Through various experiments, I found that average pooling consistently produced the best results for both BERT and OpenAI models, offering a reliable way to capture the overall meaning of sentences. This suggests that these models are highly effective at generating sentence-level embeddings that align semantically.
However, when applying the same approach to SentenceTransformers, the results were less consistent. Despite these models being fine-tuned for sentence similarity tasks, they didn’t perform as well in this specific experimental setup. I’m continuing my work with these models to better understand how they behave under different conditions and pooling strategies.
In the end, this journey highlights the complexities of measuring sentence similarity and underscores the importance of selecting the appropriate pooling strategy for each model. While there’s no universal solution, experimenting with different models and techniques offers valuable insights into how well they capture meaning. As I continue refining my approach, I’m excited to share further insights on enhancing sentence similarity metrics in future explorations.
What’s Next?
This exploration of sentence similarity models has opened up several exciting paths for further investigation. In addition to refining cosine similarity experiments, I plan to explore more advanced techniques like attention-weighted pooling and other methods to see if they can better capture sentence-level meaning compared to traditional pooling approaches.
- Animesh Srivastava, Co-Founder, Researchify