Semantic similarity is quickly becoming a mainstream technique for working with natural language data; embedding new examples into a latent space uncovers the relationship between them. (among others) are doing a great job of producing content that describes how semantic similarity works under the hood.

We can compare the distance between any two pieces of data that have been embedded in a latent space, and this tells us how semantically similar they are. This unlocks all types of search use-cases.

For example, very basic (yet powerful) NLU, such as the kind that powers a chatbot, can be built using semantic search: by simply embedding the chatbot answers themselves, and then comparing the semantic similarity between those answers and any given user input, you get pretty good results.
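As a rough sketch of that idea, the snippet below ranks a handful of answers against a user question by cosine similarity. The `embed` function is a toy bag-of-words stand-in that I'm assuming purely so the example runs without any model or API; a real setup would swap in embeddings from a language model. The names `embed`, `cosine_similarity` and `best_answer` are illustrative, not part of any product or library.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_answer(question, answers):
    """Return (answer, score) for the answer most similar to the question."""
    question_vec = embed(question)
    scored = [(cosine_similarity(question_vec, embed(a)), a) for a in answers]
    score, answer = max(scored)
    return answer, score


ANSWERS = [
    "The hotel is a 4 star rated hotel",
    "The postal code is G8L 9F5",
    "Guests with disabilities can bring their service animal in the hotel",
]

answer, score = best_answer("what is the postal code", ANSWERS)
print(answer, round(score, 2))
```

Because the toy vectors only capture shared words, a question like "am I allowed to bring pets" would not match the service animal fact here; closing that gap is exactly what a real latent space is for.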

It took me a few minutes to prepare and embed the 140 hotel facts provided by Botpress in their white paper for the OpenBook NLU (e.g. "The hotel is a 4 star rated hotel", "The postal code is G8L 9F5", "Guests with disabilities can bring their service animal in the hotel", etc.), and then test the semantic similarity against typical user questions you could expect in a chat (e.g. "am I allowed to bring pets", "is there any free wifi in the rooms", etc.).

The results are quite good off the bat:

Of course, with further tweaking you could get even more impressive results. For example, you could split inputs that contain multiple questions (e.g. "we'd like to book the hotel for a wedding, but want to make sure that it's close to downtown") into sub-parts, run semantic similarity on each sub-part, and then merge the answers together to provide a response that confirms 1) that weddings are supported, and 2) that the hotel is close to downtown.
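Here is a minimal sketch of that split-and-merge idea. Everything in it is an assumption made for illustration: the splitter is a naive regex, the facts in `FACTS` are made up (they are not taken from the Botpress data), `min_score` is an arbitrary cutoff, and `embed` is the same kind of toy bag-of-words stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_questions(text):
    """Naive splitter: break on sentence punctuation and on ', but' / ', and'."""
    parts = re.split(r"[.?!;]|,\s*(?:but|and)\s+", text)
    return [p.strip() for p in parts if p.strip()]


def answer_compound(text, facts, min_score=0.2):
    """Find the best fact for each sub-part and merge them, without duplicates."""
    answers = []
    for part in split_questions(text):
        part_vec = embed(part)
        score, fact = max((cosine_similarity(part_vec, embed(f)), f) for f in facts)
        if score >= min_score and fact not in answers:
            answers.append(fact)
    return answers


# Hypothetical facts, invented for this example.
FACTS = [
    "The hotel hosts wedding receptions",
    "The hotel is close to downtown",
    "The postal code is G8L 9F5",
]

QUESTION = (
    "we'd like to book the hotel for a wedding, "
    "but want to make sure that it's close to downtown"
)
answers = answer_compound(QUESTION, FACTS)
print(answers)
```

In practice the splitting step is the fragile part; a regex like this one will mis-split plenty of real sentences, so treat it as a placeholder for whatever segmentation you trust.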

The release of large language models and access to their underlying latent spaces is making the development of this kind of use-case almost trivial: Co:here and Pinecone in particular are democratizing the infrastructure and APIs required to build on latent spaces.

What's exciting is that larger latent spaces (with higher dimensionality) should translate to better results without the need for any training (despite HuggingFace recently showing that initial commercially available embeddings from OpenAI didn't outperform existing open models for the purposes of semantic similarity).

HumanFirst uses a transformer-based latent space to power semantic similarity throughout the product. However, we can plug in other latent spaces quite easily, so I'm very keen to compare the outputs of different open and commercially available latent spaces (like Co:here's) within our product soon.

Another application of semantic similarity is clustering

Data science teams have been "dividing and conquering" large amounts of unstructured data using clustering techniques for a long time now. Semantic similarity provides a simple and powerful signal to bucket things together in clusters. Typically, teams use this to:

  • Label data (e.g. intents, entities and training examples for NLU)
  • Extract information (e.g. product or customer insights)

Jay Alammar recently published a great post on the Co:here blog describing the process of applying clustering based on semantic similarity to the top 10,000 Hacker News titles: the resulting visualizations are intuitive and fun to play around with.

The purpose of clustering is to assist the human in transforming the unstructured data into clean "labeled" data, whereby the clusters help identify the labels (and accompanying data) that relate to the business problem at hand.
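To make the "bucketing" concrete, here is a small sketch of clustering driven purely by a similarity signal. It uses a greedy single-pass scheme (each item joins the first cluster whose first member is similar enough, otherwise it starts a new cluster), which is my own simplification and far cruder than the algorithms used in practice. The utterances, the `threshold` value and the toy `embed` function are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering: the first item of a cluster is its leader."""
    clusters = []
    for text in texts:
        vec = embed(text)
        for members in clusters:
            if cosine_similarity(vec, embed(members[0])) >= threshold:
                members.append(text)
                break
        else:
            # No existing leader was similar enough, so start a new cluster.
            clusters.append([text])
    return clusters


# Made-up customer support utterances.
UTTERANCES = [
    "my credit card was charged twice",
    "credit card surcharge on my bill",
    "reset my password please",
    "i forgot my password",
    "why was my credit card declined",
]

clusters = cluster(UTTERANCES)
for members in clusters:
    print(members)
```

Note how the outcome depends entirely on the threshold and on the order of the input, neither of which knows anything about the labels you actually care about.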

The visual / spatial UX seems to be the preferred approach to explore and work with clusters:

The graphical / spatial UX paradigm for exploring clusters definitely has a "coolness" factor to it; however, it's often disconnected from the subsequent data engineering that's required to get production-ready data.

Unsupervised clustering has inherent limitations:

  1. It's difficult to ensure that the generated clusters capture the scope that matters to your business use-case. For example, in a customer support scenario where you're looking to triage incoming calls, you might be interested in clustering together anything that represents a high-level "problem with credit cards"; however, the clusters might present more specific requests such as "credit card surcharges". The clustering algorithms ultimately don't know what matters to you. Changing the clusters requires fine-tuning and tweaking the number and sizes of clusters, which is typically very tedious.

  2. The scope of a given label can overlap many clusters that are not in proximity to each other (e.g. in the below example, you might be looking for a cluster that represents "question about your favorite X", to explore the different values of X).

  3. Clusters will naturally contain (a lot of) items that don't "fit" with the desired label and need to be removed...

  4. ... and conversely, the high overlap between adjacent clusters often requires "pulling in" data from other clusters to compose your labeled dataset.

Because of this, most practical applications require a lot of additional human-driven data exploration, cherry-picking and curation to translate the insights and value provided by clustering into datasets that are business-ready.

Excel is where this cleanup and curation work tends to happen.

The raw output of clustering algorithms often needs to be loaded into Excel

In addition, the clustering user experience itself can introduce bias into the exploration and label discovery:

  1. Because clusters are typically presented as a "static" snapshot that can be explored, it's easy to fall into the trap of following the clustering recommendations (often delimited by colour codings) when building the list of labels. This can affect the quality and scope of the labels, and prevent finding labels that map as correctly as possible to the business problem at hand.
  2. Clustering UXs often rely on listing the repeating keywords (using techniques like TF-IDF) to give the user a "preview" of a cluster's content. This introduces another level of bias, since a few keywords very rarely capture the full semantic richness (and variety) within a cluster's content that can lead to insights: it's easy to fall into the trap of "assuming" we understand what the cluster contains and labeling it inappropriately.
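To see how little a keyword preview can tell you, here is a small sketch that computes TF-IDF keywords per cluster. It treats each cluster as one document and uses a smoothed IDF, which is one common variant and an assumption on my part rather than what any particular clustering tool does. The clusters and the `top_keywords` helper are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(clusters, k=2):
    """Return the k highest-scoring TF-IDF terms for each cluster."""
    # Term counts per cluster: each cluster is treated as a single document.
    counts = [Counter(t for text in c for t in tokenize(text)) for c in clusters]
    n = len(counts)
    # Document frequency: in how many clusters does each term appear?
    df = Counter(term for c in counts for term in c)
    previews = []
    for c in counts:
        total = sum(c.values())
        scores = {
            term: (cnt / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, cnt in c.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        previews.append(ranked[:k])
    return previews


# Made-up clusters of customer support utterances.
CLUSTERS = [
    [
        "my credit card was charged twice",
        "credit card surcharge on my bill",
        "why was my credit card declined",
    ],
    ["reset my password please", "i forgot my password"],
]

previews = top_keywords(CLUSTERS)
print(previews)
```

The first cluster's preview boils down to "card" and "credit", which hides the fact that it mixes double charges, surcharges and declined payments: three different problems that you would probably want to label separately.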

At HumanFirst, we learned that semantic search and clustering become far more valuable when they are part of the data engineering flow itself.

Concretely, this means that instead of making clustering the "step 1" of a waterfall process (i.e. where "step 2" is manually cleaning and organizing the output of "step 1"), HumanFirst exposes clustering and semantic similarity capabilities as simple, real-time filters throughout the product and data engineering workflows (think something as ubiquitous as CTRL+F, but on steroids).

This reduces "exploration -> business-ready data" to a single step (assisted by clustering and similarity search), with an instant feedback loop that empowers the domain expertise, curiosity and creativity of the human behind the wheel.

An iterative human-driven feedback-loop to build production-ready data: users select interesting item(s), search for semantically similar items, cluster those results, select interesting clusters, find semantically similar clusters to those selected etc.
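The "select, then search for similar items" part of that loop can be sketched as follows. This is only an illustration of the general idea, not how HumanFirst implements it: the selected items are summed into a single query vector, the remaining pool is ranked against it, and the hits are fed back into the selection for the next round. The utterances, the `threshold` value and the toy `embed` function are assumptions.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_these(selected, pool, threshold=0.35):
    """Rank unselected items by similarity to the combined selection."""
    query = Counter()
    for item in selected:
        # Summing vectors gives the same direction as their average (centroid).
        query.update(embed(item))
    scored = [
        (cosine_similarity(query, embed(candidate)), candidate)
        for candidate in pool
        if candidate not in selected
    ]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]


# Made-up customer support utterances.
POOL = [
    "my credit card was charged twice",
    "credit card surcharge on my bill",
    "reset my password please",
    "i forgot my password",
    "why was my credit card declined",
    "why is there a surcharge on my bill",
]

selected = ["my credit card was charged twice"]
round_one = more_like_these(selected, POOL)
selected = selected + round_one
round_two = more_like_these(selected, POOL)
print(round_one)
print(round_two)
```

In this toy data the second round surfaces an item that the first round missed, because the enlarged selection pulls the query towards billing vocabulary. That is the kind of instant feedback the loop relies on.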

Let's look at how this iterative process plays out for both 1) building labeling data for an NLU use-case, and 2) extracting information from text:

Labeling training data

Extracting information from text  

In a future article I'll dig more into:

  1. The importance of having the hierarchical output of a clustering algorithm like HDBSCAN to further improve the experience by controlling where sub-sampling happens.
  2. How a model trained from the labeled data can accelerate and improve the quality of exploration of future data.
