
Reverse Image Search System for E-Commerce Using CLIP, Qdrant and Gradio

Learn how to use reverse image search for e-commerce using CLIP, Qdrant, and Gradio


Reverse image search is a tool widely used in e-commerce that allows users to search for products using both text and images. The system works by first converting any uploaded image into embeddings, which are numerical representations of the image's features. It then uses these embeddings to perform a similarity search, matching the image against similar products in the database. This approach helps users find exactly what they are looking for, even if they don't know the product's name or specific details.

The benefits of this tool are numerous. Firstly, it enhances the shopping experience by making it easier for users to find products that match their preferences. Instead of struggling with the right keywords, users can simply upload a photo of the product they are interested in. This is particularly useful for finding visually similar items, such as clothing, home decor, or accessories. Secondly, it can save users a significant amount of time, as they no longer need to browse through endless product listings. For businesses, reverse image search can increase customer satisfaction and potentially boost sales by helping customers find what they are looking for more efficiently.

This tool can be integrated into any e-commerce platform. When a user uploads an image, the system quickly processes it and returns a list of visually similar products. This can be combined with text search for even more precise results. For example, if a user uploads a photo of a red dress and types "summer dress," the system will prioritize showing red summer dresses. This combination of image and text search offers a powerful and user-friendly way to navigate e-commerce platforms, making shopping both intuitive and enjoyable.

CLIP: An Overview

CLIP (Contrastive Language-Image Pre-Training) is a neural network model by OpenAI that learns from 400 million image-text pairs, enabling it to understand visual concepts through natural language supervision. This model excels in image-text similarity scoring and zero-shot image classification, and performs well in tasks like OCR, action recognition in videos, geo-localization, and fine-grained object classification.

CLIP encodes text and images into a common vector space by using a contrastive learning objective. Here's how it works:

Dual Encoder Architecture: CLIP employs two separate neural networks: one for encoding images and another for encoding text. The image encoder is typically a Vision Transformer (ViT) or a ResNet, while the text encoder is a Transformer similar to those used in NLP tasks.

Contrastive Objective: The main idea behind CLIP’s training is to bring the embeddings of matching text-image pairs closer together in the vector space while pushing the embeddings of non-matching pairs further apart. This is achieved through a contrastive loss function.

Common Vector Space: As a result of this training process, the image and text embeddings are mapped into a shared high-dimensional space where semantically related images and text descriptions are close to each other. This enables direct associations between text and image content.

Zero-Shot Learning: Because CLIP can encode any text description and any image into this shared space, it can perform zero-shot learning. For any new classification task, you can provide CLIP with text descriptions of the classes, and it can classify images by finding which text description has the highest similarity to the image embedding.
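
This zero-shot behavior is easy to see in code. Below is a minimal sketch, with placeholder class labels and image path, that scores one image against a few candidate descriptions using the Hugging Face transformers implementation of CLIP:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder labels and image path; replace with your own
labels = ["a red summer dress", "a leather handbag", "a pair of sneakers"]
image = Image.open("example_product.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax turns them into probabilities
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p.item():.3f}")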

Let’s Code

Load the dataset. You have the option to use your own dataset.

Here’s the link to the dataset I have used: https://www.kaggle.com/datasets/vikashrajluhaniwal/fashion-images

import pandas as pd
from datasets import Dataset, Features, Value, Image

csv_file_path = 'Path_to_csv_file'  
df = pd.read_csv(csv_file_path)

# Combine the relevant columns into a single text column
df['text'] = df.apply(lambda row: f"Gender: {row['Gender']}, Category: {row['Category']}, SubCategory: {row['SubCategory']}, ProductType: {row['ProductType']}, Colour: {row['Colour']}, Usage: {row['Usage']}, ProductTitle: {row['ProductTitle']}", axis=1)
df['image'] = df['ProductId'].apply(lambda x: f'path_to_image_folder/{x}.jpg')  # Build the image path for each product

# Define the features of the dataset
features = Features({
    'text': Value('string'),
    'image': Image()
})

# Create the dataset
data = Dataset.from_pandas(df[['text', 'image']], features=features)
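
As a quick sanity check (assuming the image paths point to real files), you can inspect the first record to confirm the text and image were paired correctly:

# 'image' is decoded into a PIL image by the datasets library
print(data[0]['text'])
data[0]['image'].show()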

We are now ready to load the embedding model. We are using CLIP as both the text and the image encoder.

import torch
import numpy as np
from transformers import CLIPModel, CLIPProcessor

# Load the CLIP model and its processor
model_id = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

Now, let's define the embedding function that will generate embeddings for both text and images.

# Function to compute embeddings
def compute_embeddings(texts=None, images=None):
    text_emb = img_emb = None
    if texts:
        text_inputs = processor(
            text=texts,
            padding=True,
            return_tensors='pt'
        ).to(device)

        with torch.no_grad():
            text_emb = model.get_text_features(**text_inputs)
            text_emb = text_emb.detach().cpu().numpy()
            text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    if images:
        image_inputs = processor(
            images=images,
            return_tensors='pt'
        ).to(device)

        with torch.no_grad():
            img_emb = model.get_image_features(pixel_values=image_inputs['pixel_values'])
            img_emb = img_emb.detach().cpu().numpy()
            img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)

    return text_emb, img_emb

# Compute embeddings for every product's text and image
text_emb, img_emb = compute_embeddings(texts=data['text'], images=data['image'])
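
Embedding the entire catalog in one call works for small datasets but can exhaust GPU memory on larger ones. A minimal batched alternative for the image embeddings (batch size chosen arbitrarily; the same idea applies to the text column) looks like this:

batch_size = 64
img_emb_batches = []
for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]  # Slicing a Dataset returns a dict of lists
    _, batch_emb = compute_embeddings(images=batch['image'])
    img_emb_batches.append(batch_emb)

img_emb = np.concatenate(img_emb_batches, axis=0)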

Next, let's set up a vector database to store the embeddings. We are using Qdrant.

To begin using Qdrant, simply visit https://cloud.qdrant.io/login, where you can create a new account. Upon signing in, navigate to the “Clusters” section on the left-hand side. Here, you can choose between the free or paid versions depending on your requirements. Once you’ve created your cluster successfully, it will appear on your dashboard. 

The next step involves generating an API Key. Click on the “API key” option below your Cluster, and copy the generated key for future reference. 

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(
    url="URL", 
    api_key="Qdrant_API_KEY",
)
# Create a Qdrant collection sized for CLIP's 512-dimensional embeddings
client.recreate_collection(
    collection_name="fashion_img_db",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

Now we will wrap each embedding in a record whose payload stores the image path as metadata.

from qdrant_client.http import models
from tqdm import tqdm

records = []
for idx, (image_path, image_embedding) in tqdm(enumerate(zip(df['image'], img_emb)), total=len(df)):
    record = models.Record(
        id=idx,
        vector=image_embedding.tolist(),
        payload={"image_path": image_path}
    )
    records.append(record)

We will now upload the records to the vector database.

client.upload_records(
    collection_name="fashion_img_db",
    records=records,
)
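
As a quick check, you can ask Qdrant how many points the collection now holds; it should match the number of products in the dataframe (the count may take a moment to reflect a large upload):

print(client.count(collection_name="fashion_img_db"))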

Next, we'll define two similarity search functions, one for text queries and one for image queries, which help us find similar products.

from PIL import Image  # PIL's Image for opening result files (shadows the datasets Image feature, which is no longer needed)

def find_similar_images_from_text(input_text):
    input_emb, _ = compute_embeddings(texts=[input_text])
    input_emb = input_emb[0]

    results = client.search(
        collection_name="fashion_img_db",
        query_vector=input_emb.tolist(),
        limit=5
    )

    image_paths = [result.payload['image_path'] for result in results]
    images = [Image.open(image_path) for image_path in image_paths]
    return images


def find_similar_images_from_image(input_image):
    _, input_emb = compute_embeddings(images=[input_image])
    input_emb = input_emb[0] 

    results = client.search(
        collection_name="fashion_img_db", 
        query_vector=input_emb.tolist(),
        limit=5
    )

    image_paths = [result.payload['image_path'] for result in results]
    images = [Image.open(image_path) for image_path in image_paths]
    return images
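
The introduction mentioned combining an uploaded image with a text query (for example, a photo of a red dress plus the words "summer dress"). One simple way to sketch that, assuming the functions and client defined above, is to average the two normalized CLIP embeddings and search with the result; this is just one possible fusion strategy, not the only one:

def find_similar_images_from_text_and_image(input_text, input_image):
    # Hypothetical helper: fuses a text query and an image query by averaging their embeddings
    text_emb, _ = compute_embeddings(texts=[input_text])
    _, img_emb = compute_embeddings(images=[input_image])

    combined = (text_emb[0] + img_emb[0]) / 2
    combined = combined / np.linalg.norm(combined)  # Re-normalize for cosine search

    results = client.search(
        collection_name="fashion_img_db",
        query_vector=combined.tolist(),
        limit=5
    )
    image_paths = [result.payload['image_path'] for result in results]
    return [Image.open(image_path) for image_path in image_paths]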

Now we will build a Gradio interface with two tabs: text-to-image search and image-to-image search.

import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("# Image Search Engine")

    with gr.Tab("Text to Image Search"):
        text_input = gr.Textbox(label="Enter text to find similar images")
        text_output = gr.Gallery(label="Similar Images")
        text_input.change(fn=find_similar_images_from_text, inputs=text_input, outputs=text_output)

    with gr.Tab("Image to Image Search"):
        image_input = gr.Image(label="Upload an image to find similar images")
        image_output = gr.Gallery(label="Similar Images")
        image_input.change(fn=find_similar_images_from_image, inputs=image_input, outputs=image_output)

demo.launch(share=True)

Conclusion

Integrating a reverse image search tool powered by CLIP technology enhances e-commerce listings by enabling intuitive product discovery through both images and text. This innovative approach improves user experience, saves time, and boosts customer satisfaction, ultimately benefiting businesses by facilitating more efficient product searches and potentially increasing sales.

Reference 

https://openai.com/index/clip/
