Fine-grained Visual Classification with VLMs

Team

Christian Ickler, Aishwarya Venkataramanan

Overview

Fine-grained visual classification (FGVC) aims to distinguish between highly specific subcategories within a broader category, such as differentiating between various car models or bird species. This task is inherently challenging due to the subtle differences between similar classes. Moreover, visual data for highly specific classes is often scarce, creating the need for auxiliary information. One promising source is textual descriptions, for example those gathered from field guides.

Aligning visual and textual representations within a shared feature space is a non-trivial challenge. Recent advances in multimodal learning have led to the emergence of large vision–language models (VLMs), trained on massive amounts of internet-scale data. Despite their broad pretraining distribution, these models often exhibit limited performance in highly specialized domains, such as distinguishing between similar-looking bird or moth species. This project therefore aims to develop a model with domain-specific image–text alignment that leverages discriminative information from textual data to support classification decisions. Target applications include the classification of plant stress types and the recognition of aircraft models.
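To illustrate how text can support a classification decision in a shared feature space, the following is a minimal sketch and not the project's actual model. It assumes that image and class descriptions have already been embedded by some encoder; the toy vectors, the class names `species_a` and `species_b`, the function names, and the temperature value are all hypothetical. Each class is scored by the cosine similarity between the image embedding and that class's text embedding, and the scores are turned into probabilities with a softmax.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(image_emb, text_embs, temperature=0.07):
    """Return class probabilities from image-text similarities.

    The temperature is an assumed illustrative value, not a tuned one.
    """
    names = list(text_embs)
    logits = [cosine(image_emb, text_embs[n]) / temperature for n in names]
    return dict(zip(names, softmax(logits)))

# Toy embeddings standing in for encoder outputs (hypothetical values).
text_embs = {
    "species_a": [1.0, 0.0, 0.0],
    "species_b": [0.0, 1.0, 0.0],
}
image_emb = [0.9, 0.1, 0.0]

probs = classify(image_emb, text_embs)
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

In this setup, adding a new class only requires a new text embedding, which is why textual descriptions are attractive when images of a class are scarce.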

A critical yet underexplored dimension of FGVC in such specialized settings is the reliable quantification of model uncertainty. Misclassifications in plant disease detection can have significant agricultural consequences, while errors in aircraft classification carry safety-critical implications. This makes it essential that model predictions are accompanied by well-calibrated confidence estimates. This project thus integrates uncertainty quantification (UQ) methods into the VLM-based FGVC pipeline, enabling models to distinguish between cases of genuine visual ambiguity, insufficient textual grounding, and out-of-distribution inputs.
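One common way to check whether confidence estimates are well calibrated is the expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The sketch below is an illustration under stated assumptions and not necessarily the UQ method used in this project; the function name, the equal-width binning, and the toy inputs are hypothetical choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(h for _, h in items) / len(items)
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece

# Toy example: the model claims 90% confidence but is right 3 times out of 4.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Toy example: full confidence and always correct.
ece_perfect = expected_calibration_error([1.0, 1.0], [1, 1])
print(ece_overconfident, ece_perfect)
```

A lower value indicates that stated confidence tracks observed accuracy more closely. ECE alone does not separate visual ambiguity from out-of-distribution inputs, so it would complement rather than replace other uncertainty measures.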

Publications

2025

Christian Ickler, Aishwarya Venkataramanan, Joachim Denzler:
Text-Assisted Zero-Shot Classification of Fine-Grained Animal Species.
International Workshop Series on Camera Traps, AI, & Ecology (CamTrapAI). 2025.