Efficient Image Labeling with OpenAI CLIP in Python

Image labeling is a crucial step in training machine learning models for computer vision tasks. The process can be time-consuming and expensive, especially when dealing with large datasets. In this blog post, we will discuss how to leverage OpenAI's CLIP model, which scores how well an image matches a piece of text, to efficiently label images in Python and speed up your machine learning workflows.

Prerequisites

Before getting started, ensure you have the following installed:

Python 3.6 or higher
OpenAI CLIP (follow the installation instructions in the openai/CLIP GitHub repository)
torch
torchvision
Pillow

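You can install the required packages using pip. CLIP itself is installed from its GitHub repository, together with the few extra packages it depends on:

pip install torch torchvision Pillow
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git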
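Before moving on, it can help to confirm that the packages are importable. The following is a minimal sketch, not part of CLIP: the helper name missing_modules and the REQUIRED_MODULES list are made up for this post. It only checks that a module with each name can be found, not that it is the right version.

```python
import importlib.util

# Import names can differ from pip package names: Pillow is imported as PIL.
REQUIRED_MODULES = ["torch", "torchvision", "PIL", "clip"]

def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

If anything is reported as missing, rerun the matching pip command before continuing.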

Loading CLIP Model

First, we need to load the pre-trained CLIP model together with its image preprocessing transform. Text prompts are tokenized later with clip.tokenize, so no separate tokenizer object is needed. The following code snippet shows how to do this:

import torch
import clip

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the pre-trained CLIP model and the matching image preprocessing transform
model, preprocess = clip.load('ViT-B/32', device=device)

Creating a Labeling Function

Now that we have the CLIP model loaded, we can create a function to label images. The function builds one text prompt per label, compares the image embedding against every prompt embedding, and applies a softmax to the scaled similarities so that the scores form probabilities over the supplied labels. Here's a simple function to do just that:

from PIL import Image

def label_image(image_path, labels, top_k=5):
    """Labels an image using OpenAI's CLIP model.

    Args:
        image_path (str): Path to the input image.
        labels (list): List of possible labels.
        top_k (int): Number of top labels to return.

    Returns:
        list: List of tuples with the top-k labels and their probabilities.
    """
    # Never ask for more labels than were supplied
    top_k = min(top_k, len(labels))

    # Load and preprocess the image, adding a batch dimension
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

    # Tokenize the prompts; clip.tokenize adds the start/end tokens and pads
    # every prompt to the context length that encode_text expects
    label_tokens = clip.tokenize([f"This is a {label}" for label in labels]).to(device)

    # Calculate the image and label embeddings
    with torch.no_grad():
        image_emb = model.encode_image(image)
        label_embs = model.encode_text(label_tokens)

    # Normalize the embeddings so their dot product is the cosine similarity
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    label_embs = label_embs / label_embs.norm(dim=-1, keepdim=True)

    # Scale the similarities and convert them to probabilities over the labels
    probs = (100.0 * image_emb @ label_embs.T).softmax(dim=-1)[0]

    # Find the top-k labels with the highest probability
    top_probs, top_indices = probs.topk(top_k)

    # Return the top-k labels and their probabilities
    return [(labels[i], float(p)) for p, i in zip(top_probs, top_indices.tolist())]
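A note on scores: raw cosine similarities between CLIP embeddings tend to sit in a narrow range, so they do not read as probabilities on their own. A common conversion, used in the zero-shot example in the CLIP README, is to scale the similarities by 100 and apply a softmax. The sketch below shows the arithmetic in plain Python so it runs without a model. The function name similarities_to_probs and the similarity values are hypothetical, chosen only for illustration.

```python
import math

def similarities_to_probs(similarities, scale=100.0):
    """Apply a softmax to scaled cosine similarities."""
    scaled = [scale * s for s in similarities]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up cosine similarities for the prompts "cat", "dog", "car"
sims = [0.28, 0.25, 0.18]
probs = similarities_to_probs(sims)

for name, p in zip(["cat", "dog", "car"], probs):
    print(f"{name}: {p * 100:.2f}%")
```

With these made-up inputs, a gap of only 0.03 in cosine similarity becomes roughly a 95% to 5% split between the top two labels, which is why the scaling step matters. The probabilities are relative to the labels supplied; they do not say whether any of the labels is actually correct.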

Labeling Images

With the label_image function defined, we can now label images using the CLIP model. Here's an example of how to use the function:

# Define the possible labels
labels = ['cat', 'dog', 'car', 'truck', 'building']

# Label an image
image_path = 'path/to/your/image.jpg'
top_labels = label_image(image_path, labels, top_k=3)

# Print the results
print("Top labels for the image:")
for label, prob in top_labels:
    print(f"{label}: {prob * 100:.2f}%")

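This will output the top 3 labels for the given image along with their probabilities. Keep in mind that CLIP only ranks the labels you supply, so an image that matches none of them still receives a best guess; choose the candidate list with that in mind.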
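When labeling many images, one option is to accept only confident predictions automatically and send the rest to a human reviewer. The following is a rough sketch under stated assumptions: triage_images, its label_fn parameter, and the 0.8 threshold are all hypothetical names and values introduced here, and the threshold would need tuning against a sample of hand-labeled images. The stand-in function fake_label_fn replaces label_image so the example runs without a model.

```python
def triage_images(image_paths, labels, label_fn, threshold=0.8):
    """Split images into auto-labeled ones and ones needing human review.

    label_fn must have the same call shape as label_image:
    label_fn(image_path, labels, top_k) -> list of (label, score) tuples.
    """
    auto_labeled = {}
    needs_review = []
    for path in image_paths:
        best_label, best_score = label_fn(path, labels, top_k=1)[0]
        if best_score >= threshold:
            auto_labeled[path] = best_label
        else:
            needs_review.append(path)
    return auto_labeled, needs_review

# Stand-in for label_image with made-up scores (assumption for this demo)
def fake_label_fn(image_path, labels, top_k=1):
    scores = {"a.jpg": ("cat", 0.97), "b.jpg": ("dog", 0.55)}
    return [scores[image_path]]

auto, review = triage_images(["a.jpg", "b.jpg"], ["cat", "dog"], fake_label_fn)
print("Auto-labeled:", auto)
print("Needs review:", review)
```

In real use you would pass label_image as label_fn. Passing the labeling function as a parameter keeps the triage logic easy to test separately from the model.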

Conclusion

In this blog post, we have demonstrated how to leverage OpenAI's CLIP model to efficiently label images in Python. This can be a valuable tool for speeding up your machine learning workflows, particularly as a first pass whose suggestions you then spot-check. Give it a try and see how it can help you streamline your image labeling tasks!