Zero-Shot Classification Performance Evaluation
Previous post on Zero-Shot Classification
In this post we will analyse the performance of zero-shot classification across different models and label sizes.
Specifically, we will compare three approaches:
- Cosine similarity on Sentence Transformers
- Large, complex models like BART
- Smaller distilled models like DistilRoBERTa
These models will be evaluated for inference speed across different label sizes: 5, 10, 20, 50, and 100.
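To make the setup concrete, here is a minimal sketch of how the three approaches can be loaded. The checkpoint names (`facebook/bart-large-mnli`, `cross-encoder/nli-distilroberta-base`, `all-MiniLM-L6-v2`) are common public choices and an assumption on my part, not necessarily the exact models benchmarked here.

```python
# Sketch: loading the three zero-shot approaches.
# Checkpoint names are illustrative public choices, not necessarily
# the exact models used in this post.
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# 1. Large, complex NLI model (BART) via the zero-shot pipeline
bart_clf = pipeline("zero-shot-classification",
                    model="facebook/bart-large-mnli")

# 2. Smaller distilled NLI model (DistilRoBERTa)
distilroberta_clf = pipeline("zero-shot-classification",
                             model="cross-encoder/nli-distilroberta-base")

# 3. Sentence-Transformer encoder for the cosine-similarity method
encoder = SentenceTransformer("all-MiniLM-L6-v2")

LABEL_SIZES = [5, 10, 20, 50, 100]
```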
Inference Performance (Speed)
Let's see how inference speed varies across these models as the number of labels grows.
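A simple way to measure this is to time one classification call per label size. The sketch below uses a hypothetical input text and placeholder label names just to illustrate the measurement; absolute numbers will depend on your hardware.

```python
import time

# Hypothetical input and placeholder label names, for illustration only;
# real runs should use meaningful label phrases.
text = "The new graphics card delivers excellent gaming performance."
all_labels = [f"label_{i}" for i in range(100)]

for n in LABEL_SIZES:
    start = time.perf_counter()
    bart_clf(text, candidate_labels=all_labels[:n])
    print(f"{n:>3} labels: {time.perf_counter() - start:.2f}s")
```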
We can see that for a complex model like BART with a label size of 100, inference becomes impractically slow.
Let's see how these models perform on GPU machines.
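Moving the same pipeline to a GPU is a one-argument change. A minimal sketch, assuming a single CUDA GPU is available:

```python
import torch

# device=0 selects the first CUDA GPU; device=-1 falls back to CPU.
device = 0 if torch.cuda.is_available() else -1
bart_clf_gpu = pipeline("zero-shot-classification",
                        model="facebook/bart-large-mnli",
                        device=device)
```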
The interesting thing to note here is that choosing DistilBART over BART gives roughly a 2x speed-up. The important question then is how the distilled version fares on accuracy; we will see that as well.
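Swapping in the distilled model is just a checkpoint change. `valhalla/distilbart-mnli-12-1` is one publicly available distilled BART-MNLI checkpoint, used here as an assumption:

```python
# Same pipeline API; only the checkpoint changes.
distilbart_clf = pipeline("zero-shot-classification",
                          model="valhalla/distilbart-mnli-12-1",
                          device=device)
```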
For completeness, let's also compare the performance of the Sentence-BERT cosine-similarity method on CPU and GPU machines.
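As a reference point, the cosine-similarity method embeds the text and the candidate labels with the Sentence-Transformer encoder and picks the closest label. A minimal sketch, with hypothetical labels:

```python
from sentence_transformers import util

def cosine_zero_shot(text, labels, encoder):
    """Pick the label whose embedding is closest to the text embedding."""
    text_emb = encoder.encode(text, convert_to_tensor=True)
    label_embs = encoder.encode(labels, convert_to_tensor=True)
    scores = util.cos_sim(text_emb, label_embs)[0]
    best = int(scores.argmax())
    return labels[best], float(scores[best])

label, score = cosine_zero_shot(
    "The new graphics card delivers excellent gaming performance.",
    ["technology", "sports", "politics", "finance", "travel"],
    encoder,
)
print(label, round(score, 3))
```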
Predictive Performance (Accuracy)
Now let's see how a complex model like DeBERTa scores across different label sizes.
For the same dataset, we get the following accuracies with DistilRoBERTa:
If we compare the accuracy numbers from DeBERTa and DistilRoBERTa, we see only a marginal increase for 20 labels. So deciding which model to use depends heavily on business requirements: whether to optimize for cost and speed or for accuracy.
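To reproduce this kind of comparison, accuracy can be computed with a standard loop. The sketch below assumes a hypothetical `dataset` of `(text, true_label)` pairs and scores the pipeline's top prediction:

```python
from sklearn.metrics import accuracy_score

def evaluate(clf, dataset, labels):
    """Accuracy of a zero-shot pipeline over (text, true_label) pairs."""
    preds = [clf(text, candidate_labels=labels)["labels"][0]
             for text, _ in dataset]
    truth = [true_label for _, true_label in dataset]
    return accuracy_score(truth, preds)

# Hypothetical usage: same dataset, two models, varying label counts.
# for n in LABEL_SIZES:
#     print(n, evaluate(bart_clf, dataset, all_labels[:n]))
#     print(n, evaluate(distilroberta_clf, dataset, all_labels[:n]))
```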
Summary:
- We have seen the impact of label count on prediction speed.
- GPUs can be a key requirement for zero-shot classification.
- The impact of label size on accuracy.
- A comparison of large and distilled versions of zero-shot models.
- For handling large label sets, we can use distilled models and GPUs.
Soon I will be sharing all the code for the comparison. Please stay tuned.
In the next posts we will look at other methods of handling large label sets, such as ONNX and funneling/hierarchical label reduction.
Now is the time to put in some claps ;)