Simple and Scalable Machine Learning Solutions for Unstructured Data on the Lakehouse

8 min read · Oct 10, 2022

When you think of the ideal vacation, does it involve machine learning? Are you looking for fun new ways to do similarity searches on embeddings you’ve generated using state-of-the-art models on the lakehouse?

Are you wondering what a lakehouse is? Never heard of an embedding before? If you answered yes to any of those, I have good news for you: you’ve come to the right place.

This blog will introduce you to the process of using machine learning to generate embeddings for unstructured data in an automated and performant manner. In future posts we’ll use this framework to assist us with common use cases a retail data scientist may face.

While this particular example may have a retail flavor, the methodology highlighted here is applicable to other industries and use cases.

Business Problem

You knew this day was coming. Summer was a blast, but now fall is here. Out with the old and in with the new. Your design team has been working hard on the next release of the hottest items for the upcoming season. Consumers are ready to ditch their swimwear for loungewear and iced lattes for pumpkin spice lattes.

The problem is, how do you forecast demand for these new products? Chances are, you may be applying a list of rules or business logic in order to narrow down the category your product best fits into. Is there a better way to understand how well they’ll perform?

Another way you might be thinking about these products is from a design perspective. How alike are the new products to your previous release? Are they too similar? Maybe the designs aren’t as innovative and forward thinking as they could be.

Or are these products too different from previous generations? Will customers be disappointed with a design they were loyal to that strayed a little too far?

From a sheer volume perspective, how are you incorporating these new SKUs into production?

These are some of the questions that we’ll address in this post while leveraging a Databricks lakehouse.

Envisioning a Modern Approach

In order to have a modern and open architecture, what are some of the components you’ll need? Every customer has different priorities, but common needs include low cost, great performance, flexibility for multiple types of workloads, and a unified governance model.

A lakehouse architecture on Databricks addresses all of these needs. A lakehouse takes the best aspects of a data warehouse, such as highly performant queries and security controls, and marries those with the benefits of a data lake, such as cheap, scalable storage and support for unstructured data to be used in machine learning. See these posts for more details on implementing a lakehouse architecture:

To get started with building our lakehouse, we’ll be using Databricks to work with our images and text on blob storage. We’ll be creating an open-source framework by using Towhee to generate embeddings for unstructured data and inserting the resulting embeddings into a vector database called Milvus to perform our similarity lookups. In order to make this easier to productionalize, we’ll show you how to wrap this process up into a Delta Live Tables (DLT) pipeline as shown below so that newly arriving unstructured data will be processed automatically.

Open-Source Embedding Framework on the Lakehouse

This means that as your business grows and data volumes increase, you can reliably scale your ability to extract the semantics of your unstructured data, whether images, text, audio, or video, and apply them to downstream machine learning use cases. This process can be productionalized further by implementing a DevOps process that takes advantage of CI/CD workflows with Databricks Repos.

Ingesting Unstructured Data

One of the key benefits of the Databricks lakehouse is the ability to support multiple data types such as structured, unstructured, semi-structured, and streaming. In this case, we’ll be transforming our raw image and text data into Delta tables which can be used downstream when we apply SQL queries and Machine Learning models to them.

Let’s take a look at our raw image data as it sits in our blob storage using this simple Spark command:
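A minimal sketch of what such a command could look like, assuming the images are JPEG files and using Spark's built-in binaryFile data source. The function name, the glob pattern, and the path in the usage comment are illustrative assumptions rather than the values used for this post.

```python
def load_raw_images(spark, source_path, glob="*.jpg"):
    """Read raw image files into a DataFrame with Spark's binaryFile source.

    The resulting columns are path, modificationTime, length, and content.
    `source_path` and `glob` are illustrative; point them at your own storage.
    """
    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", glob)  # only pick up files matching the pattern
        .load(source_path)
    )


# In a Databricks notebook, where `spark` is predefined, usage might look like:
# display(load_raw_images(spark, "/mnt/<your-container>/images/"))
```

Each row holds one file, with the raw bytes in the content column, which is what the embedding step consumes later on.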

Querying Unstructured Data on the Lakehouse

There are multiple options when it comes to loading your data into Delta tables, the most notable ones being the “Copy Into” command and Autoloader. “Copy Into” is best suited for bulk conversion of data while Autoloader is best suited to handle newly arriving files in your cloud storage location.

~~~~~~~~~~~~~~~~~~~~~~~~See Code [HERE]!~~~~~~~~~~~~~~~~~~~~~~~~

In our example we’ll be leveraging Autoloader, also known as “cloudFiles”, in the context of a DLT pipeline. Let’s define our bronze DLT table (see medallion architecture) for the raw image data:
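A minimal sketch of such a bronze definition, assuming the raw images are ingested as binary files. The helper name, table name, comment text, and path are hypothetical. The DLT registration is shown as comments because the dlt module is only available inside a running pipeline.

```python
def read_image_stream(spark, source_path):
    """Incrementally read newly arriving image files with Autoloader."""
    return (
        spark.readStream.format("cloudFiles")       # Autoloader source
        .option("cloudFiles.format", "binaryFile")  # underlying file format
        .load(source_path)
    )


# Inside a DLT pipeline notebook, the bronze table could then be declared as:
#
#   import dlt
#
#   @dlt.table(name="bronze_images", comment="Raw image files from cloud storage")
#   def bronze_images():
#       return read_image_stream(spark, "/mnt/<your-container>/images/")
```

Keeping the reader in a plain function makes it easy to exercise outside the pipeline, while the decorated function stays a thin wrapper.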

What are Embeddings?

At the heart of the approach we’re going to outline are the image and text embeddings. OpenAI defines embeddings as “numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts”.

Embeddings not only make it easier to process image and text data at scale, but also make it more computationally efficient to do similarity comparisons for multiple entries. While we are using embeddings to do lookups, that’s not all that they are good for. You can also use embeddings as a means of extracting features from your unstructured data that can be used in downstream models.
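To make the idea of a similarity comparison concrete, here is a small sketch using cosine similarity on toy three-dimensional vectors. Real embeddings have hundreds of dimensions, and the product names and values below are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values, not produced by a real model).
swim_shorts = [0.9, 0.1, 0.0]
board_shorts = [0.8, 0.2, 0.1]
hoodie = [0.1, 0.9, 0.3]

print(cosine_similarity(swim_shorts, board_shorts))  # closer to 1: similar items
print(cosine_similarity(swim_shorts, hoodie))        # closer to 0: dissimilar items
```

Because the comparison is plain arithmetic on fixed-length vectors, it is cheap to repeat across many pairs of items.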

Building models using this approach allows you to take full advantage of the data you’ve generated, not just that which is in tabular form. Luckily, instead of coming up with our own embedding models, we’re going to take advantage of the openness that a lakehouse offers and use the Towhee package.

Generating Embeddings with Towhee

Once Towhee is installed, we can use its built-in model pipelines to derive embeddings from our unstructured data. In this example we’ll be leveraging BLIP, short for “Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”.

BLIP seeks to address gaps in other types of image and text processing models by offering up a “Multimodal mixture of Encoder-Decoder (MED)”. This architecture provides a flexible transfer learning framework that can be used for image-text contrastive learning, image-text matching, and image-conditioned language modeling. BLIP also pairs a captioner, which generates synthetic captions, with a filter, which removes noisy ones, so that noisy caption data is more easily identified and removed, leading to improvements in model performance.

We can wrap the BLIP model into a Pandas UDF which can then be used to enable image and text embedding generation in a streaming DLT pipeline. Here’s an example of the Pandas UDF for the image embedding generation:
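A minimal sketch of the pattern such a UDF could follow, written in plain Python so the core logic can run anywhere. The key idea is to load the model once per worker and reuse it for every batch. The names `embed_batches` and `load_blip_image_model` are hypothetical, and the model loader is injected rather than calling Towhee directly, since the exact Towhee operator and model name depend on the installed version.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Embedding = List[float]


def embed_batches(
    batches: Iterable[Sequence[bytes]],
    load_model: Callable[[], Callable[[bytes], Embedding]],
) -> Iterator[List[Embedding]]:
    """Yield one list of embeddings per incoming batch of image bytes."""
    model = load_model()  # loaded once, then reused across all batches
    for batch in batches:
        yield [model(image_bytes) for image_bytes in batch]


# With PySpark and pandas available, the same pattern maps onto an
# iterator-style pandas UDF, roughly:
#
#   import pandas as pd
#   from pyspark.sql.functions import pandas_udf
#
#   @pandas_udf("array<float>")
#   def image_embedding_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
#       for embeddings in embed_batches(batches, load_blip_image_model):
#           yield pd.Series(embeddings)
#
# `load_blip_image_model` is a hypothetical function that returns a callable
# mapping raw image bytes to a BLIP embedding (for example, built with Towhee).
```

The iterator form matters for heavy models: initialization cost is paid once per worker instead of once per batch.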

Automating this Process with DLT

We’ll take the UDF we previously defined and apply it in our DLT pipeline to generate our silver table with the derived embeddings. We’ll also generate an ID column using a GENERATED ALWAYS AS expression in order to make it easier to join with our text data:

Now that we’ve generated our silver layer embeddings for our image data, we can repeat the process for our text data. We’ll use these two silver tables to define our aggregated gold layer that will be inserted into our vector database:

Performant Vector Searches

Thanks to our lakehouse, we’ve been able to generate embeddings on our unstructured data. Each time Autoloader processes newly arriving unstructured data to generate an embedding, we can capture that vector as well as return the top “N” most similar vectors of interest.

Going back to our original business problem, the ability to get the most similar vectors goes a long way in allowing us to establish a baseline for customer demand or serve as a starting point for us to create relevant descriptions for our images.

So how do we go about doing so? We need somewhere to store those embeddings so we can maintain a list for our future lookups. Enter our vector database, Milvus.

Milvus not only provides us with a place to store our embeddings, but also allows us to apply similarity metrics. A similarity metric quantifies how similar two vectors are to one another. This has its advantages over tags or keywords, which only offer a binary match for any given business logic. Check out the “Hello Milvus” notebook for how to get started or this post by my colleague for an even more in-depth look at integration with Databricks.

Implementing the Embedding Framework

So where do we go from here? Let’s revisit those earlier questions and how you could apply this embedding framework.

Identifying Similar Products

Upsert generated embeddings into Milvus
Score newly arriving unstructured data
Find top “N” similar items
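The three steps above can be sketched with a small in-memory stand-in for the vector database. This is not the Milvus API: the class name, its methods, and the SKU values are hypothetical and only illustrate the upsert, score, and top “N” flow using cosine similarity.

```python
import heapq
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class InMemoryVectorIndex:
    """Toy stand-in for a vector database, keyed by item id."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, item_id, vector):
        # Insert a new embedding or overwrite the existing one for this id.
        self._vectors[item_id] = list(vector)

    def top_n(self, query, n=3):
        # Score every stored vector against the query and keep the n best.
        scored = ((_cosine(query, vec), item_id) for item_id, vec in self._vectors.items())
        return [(item_id, score) for score, item_id in heapq.nlargest(n, scored)]


index = InMemoryVectorIndex()
index.upsert("sku-001", [0.9, 0.1, 0.0])
index.upsert("sku-002", [0.1, 0.9, 0.3])
index.upsert("sku-003", [0.8, 0.2, 0.1])

new_item = [0.85, 0.15, 0.05]  # embedding of a newly arriving product
print(index.top_n(new_item, n=2))
```

A real vector database replaces the brute-force scan with an approximate nearest neighbor index, which is what keeps lookups fast as the catalog grows.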

Forecasting for New Products

Find top “N” similar items (see above)
Generate weighted demand trends and seasonality
Scale the magnitude of the demand accordingly
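One simple way to turn these steps into numbers is a similarity-weighted average of the historical demand of the most similar items, followed by a scaling factor. The sketch below assumes that approach. The function name, the weighting scheme, and all figures are illustrative assumptions, not a recommended forecasting model.

```python
def weighted_demand_forecast(neighbors, scale=1.0):
    """Blend the demand curves of similar items into a baseline forecast.

    neighbors: list of (similarity, demand_series) pairs for the top N items.
    scale: multiplier to adjust the overall magnitude of the forecast.
    """
    total_weight = sum(similarity for similarity, _ in neighbors)
    if total_weight <= 0:
        raise ValueError("need at least one neighbor with positive similarity")
    horizon = min(len(series) for _, series in neighbors)
    return [
        scale * sum(similarity * series[t] for similarity, series in neighbors) / total_weight
        for t in range(horizon)
    ]


# Hypothetical weekly unit sales of the two most similar existing products.
similar_items = [
    (0.95, [120, 150, 180, 160]),
    (0.80, [90, 110, 140, 130]),
]

# Assume the new product launches in roughly 70% as many stores.
print(weighted_demand_forecast(similar_items, scale=0.7))
```

The weighted curve carries the trend and seasonality of the closest existing products, and the scale factor adjusts the magnitude for the new launch.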

Designing Innovative Products

Create multiple new product designs
Generate embeddings for each
Evaluate against a baseline to rank order similarities
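The evaluation step could be sketched as follows: score each candidate design against a baseline embedding, rank by similarity, and flag designs that fall outside an acceptable band. The thresholds, names, and vectors are hypothetical and would need tuning against your own catalog.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_designs(baseline, candidates, too_similar=0.98, too_different=0.60):
    """Return (name, score, label) tuples sorted from most to least similar."""
    results = []
    for name, vector in candidates.items():
        score = cosine(baseline, vector)
        if score >= too_similar:
            label = "too similar"      # barely distinguishable from the baseline
        elif score <= too_different:
            label = "too different"    # may alienate loyal customers
        else:
            label = "in range"
        results.append((name, score, label))
    return sorted(results, key=lambda item: item[1], reverse=True)


last_season = [1.0, 0.0]  # toy baseline embedding
designs = {
    "design_a": [1.0, 0.0],
    "design_b": [1.0, 1.0],
    "design_c": [0.0, 1.0],
}
for name, score, label in rank_designs(last_season, designs):
    print(name, round(score, 3), label)
```

The band makes both earlier concerns explicit: designs that are too close to the previous release and designs that have strayed too far from it.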

Quickly Increasing the Range of Input Modalities for Your ML Models

Use embeddings + other features for traditional ML models to enhance regression, classification and other tasks which might benefit from multi-modal inputs.
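As a rough sketch of what combining embeddings with other features can look like, the helper below concatenates an embedding with tabular attributes in a fixed order to form a single model input row. The function name and feature names are hypothetical.

```python
def build_feature_vector(embedding, tabular_features, feature_order):
    """Concatenate an embedding with tabular features in a fixed column order."""
    return list(embedding) + [float(tabular_features[name]) for name in feature_order]


# Hypothetical product: a 3-dimensional toy embedding plus two tabular attributes.
row = build_feature_vector(
    embedding=[0.12, -0.40, 0.88],
    tabular_features={"price": 49.99, "is_new_season": 1},
    feature_order=["price", "is_new_season"],
)
print(row)
```

Rows built this way can be passed to any regression or classification model that accepts numeric feature vectors. Fixing the feature order keeps columns aligned between training and scoring.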

There’s so much more you can do with this framework. Stay tuned as we dive into more specific use cases in future posts!
