Understanding OpenAI CLIP & Its Applications

Anshu Kumar
Nov 19, 2022

Hands-On examples of Image Search & Reverse Image Search

In this article I will cover the following:

  • CLIP Architecture and Embedding.
  • Core Implementation, Key Ingredients and Training Approach
  • Application of CLIP Embeddings to build Image Search and Reverse Image Search
  • Key Takeaways and Limitations

Let’s get Started!!

CLIP: (Contrastive Language–Image Pre-training)

CLIP learns visual concepts from natural language supervision.

Supervised computer vision systems are trained or fine-tuned on a fixed set of labels. This limits the model's capability, because the model must be retrained every time a new label is encountered.

CLIP builds on the prior work of VirTex[1] and ConVIRT[2].

Image descriptions act as proxy labels for training CLIP. CLIP is also used as a helper model for DALL-E.

CLIP can tell whether a text description matches an image or not.

CLIP is zero-shot by design, so it is not restricted to a fixed number of labels. It does not constrain the model to reduce an image to a single concept or label, which helps CLIP learn the multiple concepts present in an image. As a result, CLIP performs reasonably well on unseen image datasets and text descriptions.

What are the key ingredients of CLIP?

1. Large Dataset:

CLIP is trained on WebImageText (WIT), a diverse dataset of 400M image-text pairs crawled from the internet. More data is better.

2. Contrastive Pre-Training:

The I and T vectors represent the embeddings of an image batch and a text batch. I_i·T_i is the dot product of a matched image and text embedding; the off-diagonal elements correspond to non-matching image and text descriptions.

In contrastive learning we try to maximise the diagonal values (I_1·T_1), (I_2·T_2), …, (I_N·T_N) while minimising the off-diagonal elements.

Most of the learning comes from the negative image-text pairs: in a batch of 32,768, each image has only one positive text pair. In other words, CLIP learns mostly from what an image is not about, i.e. from minimising the off-diagonal values.

Core Implementation of CLIP

The core of CLIP is this symmetric contrastive loss, computed over a batch of image and text embeddings.
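To make this concrete, here is a minimal PyTorch sketch of that loss, loosely following the pseudocode in the CLIP paper. It assumes the two encoders have already projected the batch into a shared d-dimensional space; the temperature value is only illustrative.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched image-text pairs."""
    # L2-normalise so that dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)   # [N, d]
    text_features = F.normalize(text_features, dim=-1)     # [N, d]

    # [N, N] similarity matrix: logits[i, j] = I_i . T_j / temperature
    logits = image_features @ text_features.t() / temperature

    # The matched pairs sit on the diagonal, so the "class" of row i is i
    labels = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.t(), labels)
    return (loss_i + loss_t) / 2
```

Maximising the diagonal terms and minimising the off-diagonal terms falls out naturally from the two cross-entropy terms.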

3. Computational Efficiency

a. A contrastive objective for connecting text with images.

b. Adoption of the Vision Transformer as the image encoder, which gained roughly 3x compute efficiency compared to a ResNet model.

c. Web-scale supervision seems to surpass manually curated datasets.

Major Problems in Deep Learning and How CLIP Helps

  1. Costly Datasets: Methods that try to mitigate the costly-dataset problem include
  • Self Supervised Learning — [3]
  • Contrastive Methods — [4]
  • Self Training Approaches — [5]
  • Generative Modelling — [6]

2. Narrow Models: They perform well only on the training-dataset domain and generalise weakly beyond the training examples. To apply CLIP to a new task, we just need to encode the task description as text. Accuracy on such tasks is often as good as the supervised counterparts, and sometimes better.

3. Generalisation: Current deep-learning methods look better on benchmarks than in the real world, because they are both trained and evaluated on the benchmark datasets.

CLIP model can be evaluated without being trained on the benchmark dataset.

How CLIP Is Used for Prediction Tasks

CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the dataset.
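For a prediction task such as zero-shot classification, we embed each candidate label as a short text prompt and pick the label whose embedding is most similar to the image embedding. Here is a minimal sketch, assuming the sentence-transformers CLIP checkpoint 'clip-ViT-B-32'; the image path and prompts are illustrative.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')  # encodes both images and text

# Candidate labels written as natural-language prompts (illustrative)
labels = ['a photo of a dog', 'a photo of a cat', 'a photo of a bicycle']

img_emb = model.encode(Image.open('example.jpg'), convert_to_tensor=True)  # illustrative path
text_emb = model.encode(labels, convert_to_tensor=True)

# Cosine similarity between the image and each prompt; the highest score wins
scores = util.cos_sim(img_emb, text_emb)[0]
best = int(scores.argmax())
print(labels[best], float(scores[best]))
```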

Key Takeaways from CLIP

  1. CLIP is highly efficient:
  • Contrastive objective for connecting text with images.
  • Adoption of the Vision Transformer. Gained 3x compute efficiency compared to ResNet model.

2. CLIP is generalisable

  • The best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student EfficientNet-L2[7], on 20 out of 26 different transfer datasets.

3. Limited performance on fine-grained and systematic tasks.

Applications of CLIP

CLIP encodes text and images in the same embedding space. This gives us the opportunity to work at the intersection of the image and text modalities.

A few applications that could be built with CLIP are:

  • Zero-Shot Image Classification: Use CLIP embeddings out of the box to do zero-shot image classification.
  • Fine-Tuned Image Classification: Add a classification head and fine-tune it for a specific fine-grained or systematic image classification task.
  • Semantic Image Retrieval: Text-to-image search and reverse image search are both possible with the rich CLIP embeddings.
  • Content Moderation: If you prompt CLIP correctly, you can filter out graphic or NSFW images out of the box.
  • Image Ranking: It is not just factual representations that are encoded in CLIP's memory; it also knows about qualitative concepts.
  • Image Captioning: Feature vectors from CLIP have been wired into GPT-2 to output an English description for a given image. Have a look at this repo for implementation details. [8]
  • Deciphering Blurred Images: In "Inverse Problems Leveraging Pre-Trained Contrastive Representations", researchers have shown how CLIP can be used to interpret extremely distorted or corrupted images.

1. Image Search Using CLIP Embeddings

The steps are:

  • Load the text encoder using SentenceTransformer.
  • Load the image encoder using SentenceTransformer.
  • Run image search over the CLIP embeddings.
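A minimal sketch of those steps, assuming a local photos/ folder of JPEG images (the folder path and the example query are illustrative). In sentence-transformers, the single 'clip-ViT-B-32' model encodes both the images and the text query into the shared CLIP space.

```python
import glob
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

# Embed the image collection once (illustrative folder path)
image_paths = glob.glob('photos/*.jpg')
img_embeddings = model.encode([Image.open(p) for p in image_paths],
                              batch_size=32, convert_to_tensor=True,
                              show_progress_bar=True)

def image_search(query, top_k=3):
    """Return the top_k images whose CLIP embeddings are closest to the text query."""
    query_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, img_embeddings, top_k=top_k)[0]
    return [(image_paths[hit['corpus_id']], hit['score']) for hit in hits]

print(image_search('two dogs playing in the snow'))
```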

Head over to the GitHub repo to find the complete example.

2. Reverse Image Search Using CLIP Embeddings

We can also build reverse image search, which retrieves images given an image as the query. Google Lens is an example of reverse image search.

I am demoing a scalable way to build reverse image search using the Annoy indexer. Annoy performs approximate nearest-neighbour search over embeddings. Read more about it here: https://github.com/spotify/annoy

The steps are:

  • Load the CLIP image encoder, plus functions to read image data and convert it to embeddings.
  • Index the image embeddings using Annoy.
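A minimal sketch of the indexing and query steps, assuming the same 'clip-ViT-B-32' encoder (512-dimensional embeddings) and an illustrative photos/ folder.

```python
import glob
from PIL import Image
from annoy import AnnoyIndex
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('clip-ViT-B-32')
EMB_DIM = 512                              # dimensionality of clip-ViT-B-32 embeddings
image_paths = glob.glob('photos/*.jpg')    # illustrative local image folder

# Build an Annoy index over the CLIP image embeddings ('angular' ~ cosine distance)
index = AnnoyIndex(EMB_DIM, 'angular')
for i, path in enumerate(image_paths):
    index.add_item(i, model.encode(Image.open(path)))
index.build(10)                            # 10 trees; more trees -> better recall, slower build
index.save('images.ann')

def reverse_image_search(query_path, top_k=5):
    """Return the top_k indexed images closest to the query image."""
    query_emb = model.encode(Image.open(query_path))
    ids, dists = index.get_nns_by_vector(query_emb, top_k, include_distances=True)
    return [(image_paths[i], d) for i, d in zip(ids, dists)]
```

Annoy builds the forest of trees once and memory-maps the saved index, which is what lets this approach scale to much larger collections than in-memory cosine similarity.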
Given the query image above, the following results were obtained.

I had a small dataset of around ~1K Flickr images. We can still see how the trekking, scenery, and bicycle concepts are captured.

For the full code, please head over to: https://github.com/akgeni/applied_clip/blob/main/scalable_reverse_image_search/scalable_reverse_image_search_clip.ipynb

CLIP applications are not limited to the examples shown above; a broader list of applications was given earlier. Feel free to experiment with them.

I know the article has become a little long, but I won't let you go without mentioning the limitations of CLIP. Here they are:

CLIP Limitations

  • Fails to count the number of objects in an image.
  • Fails at fine-grained image classification such as celebrity identification, car model identification, flower species, etc.
  • Performs weakly on handwritten digits (MNIST): 88% zero-shot accuracy.

For any doubts, clarifications, or corrections, feel free to connect with me on LinkedIn: https://www.linkedin.com/in/anshu19/

Resources:

Thank you for reading this article. I hope you found it useful.

