Zero-shot Classification

Anshu Kumar
4 min read · Sep 16, 2021

Machine learning with no data and without training

What is Zero-shot?

A machine learning technique used to classify data with very few or even no labeled examples, which means classifying on the fly.

Zero-shot is also a variant of transfer learning: it is pattern recognition with no examples, using semantic transfer.

Zero-shot learning (ZSL) most often refers to a fairly specific type of task: learn a classifier on one set of labels and then evaluate it on a different set of labels that the classifier has never seen before.

Why Zero-shot?

  • No data, or only a very small amount of data, is available for training (e.g. intent detection without any data provided by the user).
  • The number of classes/labels is very high (many thousands).
  • An out-of-the-box classifier reduces cost in terms of infrastructure and development.

How does Zero-shot work?

There are a few approaches to Zero-shot learning.

Latent embedding approach

We have a sequence embedding model Φ(sent) and a set of possible class names C. We classify a given sequence X according to

ĉ = argmax_{c ∈ C} cos(Φ(X), Φ(c))

that is, we pick the class whose name embedding is closest (by cosine similarity) to the embedding of the sequence, as the snippet below does.

!pip install transformers torch
from transformers import AutoTokenizer, AutoModel
from torch.nn import functional as F

tokenizer = AutoTokenizer.from_pretrained('deepset/sentence_bert')
model = AutoModel.from_pretrained('deepset/sentence_bert')

sentence = 'Who are you voting for in 2020?'
labels = ['business', 'art & culture', 'politics']

# run inputs through model and mean-pool over the sequence
# dimension to get sequence-level representations
inputs = tokenizer.batch_encode_plus([sentence] + labels,
                                     return_tensors='pt',
                                     padding=True)  # pad_to_max_length is deprecated
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
output = model(input_ids, attention_mask=attention_mask)[0]
sentence_rep = output[:1].mean(dim=1)  # embedding of the sentence
label_reps = output[1:].mean(dim=1)    # embeddings of the label names

# now find the labels with the highest cosine similarities to
# the sentence
similarities = F.cosine_similarity(sentence_rep, label_reps)
closest = similarities.argsort(descending=True)
for ind in closest:
    print(f'label: {labels[ind]} \t similarity: {similarities[ind]}')

Classification as Natural Language Inference(NLI)

NLI considers two sentences: a “premise” and a “hypothesis”. The task is to determine whether the hypothesis is true (entailment) or false (contradiction) given the premise.

The approach, proposed by Yin et al. (2019), uses a pre-trained MNLI sequence-pair classifier as an out-of-the-box zero-shot text classifier that actually works pretty well.
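
Hugging Face exposes this approach directly through its zero-shot-classification pipeline. Below is a minimal sketch, assuming the facebook/bart-large-mnli checkpoint (any MNLI-trained sequence-pair model should work):

from transformers import pipeline

# the pipeline turns each candidate label into an NLI hypothesis,
# e.g. "This example is about politics.", and uses the entailment
# probability as the label score
classifier = pipeline('zero-shot-classification',
                      model='facebook/bart-large-mnli')

sentence = 'Who are you voting for in 2020?'
labels = ['business', 'art & culture', 'politics']

result = classifier(sentence, candidate_labels=labels)
for label, score in zip(result['labels'], result['scores']):
    print(f'label: {label} \t score: {score:.4f}')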

Important Practical Considerations

Zero-shot models are large and compute-heavy. To take them to production, a few practical aspects should be considered.

  • Zero-shot doesn’t work as well when the topic is a more abstract term in relation to the text.
  • Labels should have proper semantics.
  • Zero-shot can work as a multi-label classifier.
  • The BART MNLI + Yahoo Answers model works better for general use cases.
  • Descriptive and meaningful labels are needed for ZSL to work.
  • Performance of zero-shot classifiers: F1 scores of 0.68 and 0.72 for unseen and seen labels respectively.

Zero-shot performance compared with a fine-tuned SOTA model

Limitation of Zero-shot Approach

  • Validation is a challenge, as in any unsupervised learning situation.
  • Meaningful labels are a necessity.
  • It probably won’t beat supervised methods.
  • Inference time becomes impractical once the label set grows beyond roughly 20 labels.
  • GPU hungry!!

I will describe how inference time varies as we increase the label size in the next post.

How to handle a large number of labels

  • Try out one of the community-uploaded distilled models on the Hugging Face Hub. Their accuracy is comparable to the larger models, but they are smaller in size and have faster inference. Start with valhalla/distilbart-mnli-12-3 (a model can be specified by passing e.g. pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3") when you construct the pipeline); see the sketch after this list.
  • On GPU, make sure to pass device=0 to the pipeline factory to utilize CUDA.
  • On CPU, try running the pipeline with ONNX Runtime; you should get a speed boost. There are projects that let you use HF pipelines with ORT automatically.
  • If you have a lot of candidate labels, try to be clever about passing only the most likely ones to the pipeline. Passing a large number of labels for each sentence will really slow you down, since each sentence/label pair has to be passed through the model together. If you have 100 possible labels but can use some kind of heuristic or a simpler model to narrow them down, that will help a lot.
  • Use mixed precision. This is pretty easy if you are using PyTorch 1.6+.
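
A minimal sketch combining a few of these tips (a distilled checkpoint, GPU device when available, a pre-narrowed shortlist, and multi-label scoring); the shortlist here is an illustrative assumption:

import torch
from transformers import pipeline

# distilled MNLI checkpoint: accuracy comparable to bart-large-mnli,
# but smaller and faster at inference
classifier = pipeline('zero-shot-classification',
                      model='valhalla/distilbart-mnli-12-3',
                      device=0 if torch.cuda.is_available() else -1)

sentence = 'Who are you voting for in 2020?'
# assume a cheap heuristic has already narrowed ~100 candidate
# labels down to this shortlist
shortlist = ['politics', 'elections', 'business', 'sports']

# multi_label=True scores each label independently, so several
# labels can score high at the same time
result = classifier(sentence, candidate_labels=shortlist, multi_label=True)
for label, score in zip(result['labels'], result['scores']):
    print(f'label: {label} \t score: {score:.4f}')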

Other Use-cases with ZSL

  • Zero-shot as intent detection.
  • Zero-shot as sentiment classification (see the sketch after this list).
  • Zero-shot as emotion classification.
  • Zero-shot as topic classification.
  • Zero-shot as multi-label classification.
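
All of these use-cases reduce to the same pipeline call with different candidate labels and, optionally, a custom hypothesis template. A sketch for sentiment classification, reusing the bart-large-mnli checkpoint from above (the review text is an illustrative assumption):

from transformers import pipeline

classifier = pipeline('zero-shot-classification',
                      model='facebook/bart-large-mnli')

# sentiment: just swap in sentiment words as the candidate labels
# and phrase the NLI hypothesis accordingly
result = classifier('I loved this movie, the acting was brilliant!',
                    candidate_labels=['positive', 'negative'],
                    hypothesis_template='The sentiment of this review is {}.')
print(result['labels'][0], result['scores'][0])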

Domains of Application

  • Natural language processing
  • Image classification
  • Semantic segmentation
  • Image generation
  • Object detection

In this post I tried to summarise what Zero-shot is, along with its practical usages and considerations.

In the next post I will dive deeper into analysis and practical usage. I will extend this post with the following topics:

  1. Analysis of inference time across:
  • Hardware (CPU vs GPU)
  • Label size

  2. How to handle a large label set:
  • Usage of GPUs
  • Use of the ONNX Runtime for CPU-based inference

etc.

Please stay tuned, and do put some claps if you liked the post. Thank You!!

For performance comparisons across different zero-shot methods, see the references.

References:

  • Yin, W., Hay, J., & Roth, D. (2019). Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach. EMNLP 2019.


Anshu Kumar

Data Scientist, Author. Building Semantic Search and Recommendation Engine.