Fine-Grained Visual Classification of Aircraft

Andrea Vedaldi; Esa Rahtu; Juho Kannala; Matthew Blaschko; Subhransu Maji

arXiv: 1306.5151 · v1 · submitted 2013-06-21 · cs.CV

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, Andrea Vedaldi

Pith reviewed 2026-05-11 17:35 UTC · model grok-4.3

classification: cs.CV

keywords: fine-grained visual classification, aircraft dataset, FGVC-Aircraft, image classification, computer vision, benchmark dataset, rigid objects, object recognition

The pith

The paper introduces FGVC-Aircraft, a dataset of 10,000 images across 100 aircraft models organized in a three-level hierarchy for fine-grained visual classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes FGVC-Aircraft as a benchmark dataset for fine-grained visual classification by providing 10,000 images of 100 aircraft models organized hierarchically. The authors show that model differences are subtle but visually measurable, creating challenging yet solvable tasks distinct from those with deformable objects like animals. They supply evaluation protocols and baseline results while noting that enthusiast contributions enabled the dataset and could apply to other classes. Aircraft variations include purpose, size, designation, structure, historical style, and branding, offering new modes of variation for study.

Core claim

The central contribution is the FGVC-Aircraft dataset itself, which contains 10,000 images of aircraft from 100 models arranged in a three-level hierarchy. At the finest level, differences between models are often subtle but always visually measurable, making visual recognition challenging but possible. Corresponding classification tasks and evaluation protocols are defined, with baseline results presented. The dataset's creation leverages work by aircraft enthusiasts, a method extendable to other object classes. Compared to typical fine-grained domains like animals, aircraft are rigid and less deformable but exhibit interesting variations in purpose, size, designation, structure, historical style, and branding.

What carries the argument

The FGVC-Aircraft dataset, a hierarchically organized collection of 10,000 aircraft images across 100 models that enables definition of fine-grained classification tasks.

If this is right

Defines specific classification tasks and evaluation protocols based on the hierarchy.
Provides baseline performance results for standard classification methods on the dataset.
Shows that enthusiast-sourced data can construct useful fine-grained datasets for other object classes.
Identifies unique variation modes in aircraft such as historical style and branding that differ from animal domains.
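
The hierarchy-based tasks listed above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the level names (`variant`, `family`, `manufacturer`), the toy label table, and the helpers `lift` and `accuracy_at_level` are hypothetical and are not taken from the paper's protocol. The sketch only shows how one set of finest-level predictions could be scored at each of three levels by mapping labels upward.

```python
# Toy three-level hierarchy: finest label -> (middle label, coarsest label).
# Level names and entries are illustrative assumptions, not the dataset's files.
HIERARCHY = {
    "Boeing 737-700": ("Boeing 737", "Boeing"),
    "Boeing 737-800": ("Boeing 737", "Boeing"),
    "Boeing 747-400": ("Boeing 747", "Boeing"),
    "Airbus A320": ("Airbus A320", "Airbus"),
}
LEVELS = ("variant", "family", "manufacturer")


def lift(label, level):
    """Map a finest-level label to the requested hierarchy level."""
    if level == "variant":
        return label
    family, manufacturer = HIERARCHY[label]
    return family if level == "family" else manufacturer


def accuracy_at_level(y_true, y_pred, level):
    """Fraction of predictions that agree with the truth after lifting both."""
    hits = sum(lift(t, level) == lift(p, level) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


y_true = ["Boeing 737-700", "Boeing 737-800", "Boeing 747-400", "Airbus A320"]
y_pred = ["Boeing 737-800", "Boeing 737-800", "Boeing 737-700", "Airbus A320"]

scores = {level: accuracy_at_level(y_true, y_pred, level) for level in LEVELS}
print(scores)  # {'variant': 0.5, 'family': 0.75, 'manufacturer': 1.0}
```

In this toy case a confusion between two variants of the same family costs accuracy only at the finest level, which is the kind of multi-granularity reading a hierarchy makes possible.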

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Algorithms developed on this dataset might transfer to practical applications like automatic aircraft type identification at airports.
The three-level hierarchy could support hierarchical classification approaches that improve accuracy by leveraging coarser categories first.
Future work might compare results here to other FGVC datasets to understand the impact of object rigidity on recognition difficulty.
Extending the enthusiast-contribution method could rapidly create benchmarks for other vehicle or manufactured object classes.

Load-bearing premise

That the visual differences between the 100 aircraft models are always measurable from the images and that the three-level hierarchy provides a useful structure for the classification tasks.

What would settle it

A demonstration that certain pairs of aircraft models cannot be reliably distinguished by visual inspection of the dataset images, or that the provided baselines fail to exceed random guessing, would falsify the claim that the dataset enables meaningful fine-grained classification.
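
The "random guessing" threshold above can be made concrete: uniform guessing over 100 classes has an expected accuracy of 1/100, or 1%. The sketch below computes that reference level alongside a class-averaged accuracy. Class-averaged accuracy is a common choice for fine-grained benchmarks, but it is an assumption here rather than the paper's confirmed metric, and all function names are hypothetical.

```python
from collections import defaultdict


def chance_accuracy(num_classes):
    """Expected accuracy of uniform random guessing over num_classes labels."""
    return 1.0 / num_classes


def class_averaged_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def beats_chance(accuracy, num_classes):
    """True if the measured accuracy is strictly above the chance level."""
    return accuracy > chance_accuracy(num_classes)


# Toy labels: class 0 has 3 samples (2 correct), class 1 has 1 sample (correct).
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
acc = class_averaged_accuracy(y_true, y_pred)

print(chance_accuracy(100))    # 0.01
print(round(acc, 4))           # 0.8333
print(beats_chance(acc, 100))  # True
```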

Original abstract

This paper introduces FGVC-Aircraft, a new dataset containing 10,000 images of aircraft spanning 100 aircraft models, organised in a three-level hierarchy. At the finer level, differences between models are often subtle but always visually measurable, making visual recognition challenging but possible. A benchmark is obtained by defining corresponding classification tasks and evaluation protocols, and baseline results are presented. The construction of this dataset was made possible by the work of aircraft enthusiasts, a strategy that can extend to the study of a number of other object classes. Compared to the domains usually considered in fine-grained visual classification (FGVC), for example animals, aircraft are rigid and hence less deformable. They, however, present other interesting modes of variation, including purpose, size, designation, structure, historical style, and branding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FGVC-Aircraft, a new dataset of 10,000 images spanning 100 aircraft models organized in a three-level hierarchy. It defines corresponding classification tasks and evaluation protocols at different hierarchy levels and presents baseline results obtained with standard methods. The construction relies on contributions from aircraft enthusiasts, and the paper notes that aircraft are rigid objects presenting modes of variation such as purpose, size, designation, structure, historical style, and branding.

Significance. If the labels and splits are reliable, the dataset supplies a useful benchmark for fine-grained visual classification on rigid objects whose inter-class differences are often subtle. The three-level hierarchy supports multi-granularity experiments, and the enthusiast-sourcing approach offers a scalable template for other domains. Baseline numbers establish an initial reference point for future method comparisons.

major comments (2)

[Dataset construction and annotation] The abstract asserts that 'differences between models are often subtle but always visually measurable' and that the hierarchy is useful, yet the manuscript provides no dedicated section or table quantifying inter-annotator agreement, label verification procedure, or the fraction of model pairs whose visual separability was explicitly checked. This verification step is load-bearing for the claim that the 100-class task is 'challenging but possible.'
[Tasks, protocols, and baselines] The evaluation protocols are described at a high level, but the paper does not report the exact train/validation/test splits per hierarchy level or the number of images per model. Without these numbers (or a supplementary table), it is difficult to reproduce the baselines or assess class balance.

minor comments (2)

[Figures] Figure 1 (example images) would benefit from captions that explicitly indicate the three hierarchy levels for each shown aircraft.
[Introduction] The related-work discussion could cite the exact prior FGVC datasets (e.g., CUB-200-2011) when contrasting deformable vs. rigid object challenges.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the FGVC-Aircraft dataset as a benchmark for fine-grained classification of rigid objects. We address the major comments point by point below and will revise the manuscript accordingly to improve clarity and reproducibility.

Referee: [Dataset construction and annotation] The abstract asserts that 'differences between models are often subtle but always visually measurable' and that the hierarchy is useful, yet the manuscript provides no dedicated section or table quantifying inter-annotator agreement, label verification procedure, or the fraction of model pairs whose visual separability was explicitly checked. This verification step is load-bearing for the claim that the 100-class task is 'challenging but possible.'

Authors: We acknowledge that the manuscript would benefit from greater transparency on the annotation process. The dataset was constructed through contributions by aircraft enthusiasts possessing domain expertise, which guided the selection of 100 models where inter-model differences are visually measurable (as asserted in the abstract). However, we did not include a dedicated section quantifying inter-annotator agreement or explicit pairwise separability checks. In the revised version, we will add a section on dataset construction that describes the label collection and verification procedures employed, thereby supporting the claim that the 100-class task is challenging but possible.
revision: yes

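
One standard way to quantify the inter-annotator agreement discussed in this exchange is Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is illustrative only: no such statistic is reported in the text here, the annotator labels are made up, and `cohens_kappa` is a hypothetical helper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Made-up labels for four images from two hypothetical annotators.
annotator_a = ["737-700", "737-700", "737-800", "737-800"]
annotator_b = ["737-700", "737-800", "737-800", "737-800"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(kappa)  # 0.5
```
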
Referee: [Tasks, protocols, and baselines] The evaluation protocols are described at a high level, but the paper does not report the exact train/validation/test splits per hierarchy level or the number of images per model. Without these numbers (or a supplementary table), it is difficult to reproduce the baselines or assess class balance.

Authors: We agree that the exact splits and per-model image counts are necessary for full reproducibility and class-balance assessment. While the manuscript states the overall dataset size (10,000 images across 100 models) and describes the evaluation protocols at a high level, it does not tabulate the precise train/validation/test splits per hierarchy level or the image counts per model. We will add a supplementary table (or expanded section) providing these details in the revised manuscript.
revision: yes
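
The per-model counts and split sizes requested here are straightforward to tabulate once annotations are available. The sketch below assumes a hypothetical record layout of `(image_id, label, split)` tuples; the layout, the toy records, and the helper names are illustrative and do not reflect the dataset's actual file format.

```python
from collections import Counter, defaultdict


def count_per_class(records):
    """Return {split: Counter(label -> image count)} from (id, label, split) rows."""
    table = defaultdict(Counter)
    for _image_id, label, split in records:
        table[split][label] += 1
    return table


def imbalance_ratio(counts):
    """Largest class size divided by smallest; 1.0 means perfectly balanced."""
    return max(counts.values()) / min(counts.values())


# Toy annotation rows, invented for this example.
records = [
    ("img0", "model_a", "train"), ("img1", "model_a", "train"),
    ("img2", "model_b", "train"), ("img3", "model_a", "val"),
    ("img4", "model_b", "val"), ("img5", "model_a", "test"),
    ("img6", "model_b", "test"),
]

table = count_per_class(records)
for split in ("train", "val", "test"):
    print(split, dict(table[split]), imbalance_ratio(table[split]))
```

A ratio well above 1.0 in any split would indicate the kind of class imbalance the referee wants to be able to assess.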

Circularity Check

0 steps flagged

No significant circularity

The paper is a dataset introduction paper whose central contribution is the release of FGVC-Aircraft (10k images, 100 models, three-level hierarchy) together with task definitions and baselines. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. Claims about subtle but visually measurable differences and hierarchy usefulness are stated as descriptive properties of the collected data rather than derived results. The enthusiast-sourcing strategy is presented only as an extensible construction method, not as a self-referential proof. No self-citations or ansatzes are invoked to support load-bearing steps, so the derivation chain (such as it is) is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the domain assumption that aircraft images can be hierarchically organized and that visual differences are measurable; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Differences between aircraft models are often subtle but always visually measurable.
Stated in the abstract as the basis for the classification challenge.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "This paper introduces FGVC-Aircraft, a new dataset containing 10,000 images of aircraft spanning 100 aircraft models, organised in a three-level hierarchy. At the finer level, differences between models are often subtle but always visually measurable."

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "A benchmark is obtained by defining corresponding classification tasks and evaluation protocols, and baseline results are presented."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PERL: Parameter Efficient Reasoning in CLIP Latent Space
cs.CV 2026-05 unverdicted novelty 7.0

PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance...
Neutral-Reference Prompting for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base...
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
cs.CV 2026-05 unverdicted novelty 7.0

SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
cs.CV 2026-05 conditional novelty 7.0

FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.
FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
cs.CV 2026-05 unverdicted novelty 7.0

FIKA-Bench is a leakage-aware benchmark of 311 instances showing that even the best large multimodal models and tool-equipped agents reach only 25.1% accuracy on fine-grained recognition questions that require externa...
Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering
cs.CV 2026-05 conditional novelty 7.0

GSEC uses MLLM-generated semantic guidance and bi-layer ensemble learning to reduce bias and variance, outperforming 18 prior methods on six image clustering benchmarks.
Online Continual Learning with Dynamic Label Hierarchies
cs.LG 2026-05 unverdicted novelty 7.0

HALO improves online continual learning under evolving label hierarchies by adaptively combining classification heads regularized with organized learnable prototypes for better adaptation and reduced forgetting.
MC-RFM: Geometry-Aware Few-Shot Adaptation via Mixed-Curvature Riemannian Flow Matching
cs.CV 2026-05 unverdicted novelty 7.0

MC-RFM achieves superior few-shot adaptation by representing features on a mixed hyperbolic-Euclidean manifold and learning task-conditioned continuous transport via Riemannian flow matching to hybrid prototypes.
SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data
cs.LG 2026-05 unverdicted novelty 7.0

SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships ...
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
cs.LG 2026-05 unverdicted novelty 7.0

HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.
Hierarchically Robust Zero-shot Vision-language Models
cs.CV 2026-04 unverdicted novelty 7.0

A hierarchical adversarial fine-tuning method for VLMs aligns image and text embeddings at multiple hierarchy depths with theoretical margin connections to boost robustness to leaf and superclass attacks while using m...
Improving Sparse Autoencoder with Dynamic Attention
cs.LG 2026-04 unverdicted novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
cs.CV 2026-04 unverdicted novelty 7.0

CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.
CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
cs.CR 2026-04 unverdicted novelty 7.0

CLIP-Inspector reconstructs OOD triggers to detect backdoors in prompt-tuned CLIP models with 94% accuracy and higher AUROC than baselines, plus a repair step via fine-tuning.
FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
cs.CV 2026-04 conditional novelty 7.0

FORGE benchmark shows domain-specific knowledge, not visual grounding, is the main bottleneck for MLLMs in manufacturing, with SFT on a 3B model delivering up to 90.8% relative accuracy improvement on held-out scenarios.
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
cs.CV 2025-11 conditional novelty 7.0

PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.
Exploring Cross-Modal Flows for Few-Shot Learning
cs.CV 2025-10 unverdicted novelty 7.0

FMA introduces flow matching for multi-step cross-modal feature alignment in few-shot learning, using fixed coupling, noise augmentation, and early-stopping to outperform one-step PEFT methods.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
cs.CV 2025-06 unverdicted novelty 7.0

AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
Effective Prompt Pool Learning for Continual Category Discovery
cs.CV 2024-07 unverdicted novelty 7.0

PromptCCD uses Gaussian Mixture Prompts for global class prototypes and PromptCCD++ adds part-level prompt pools for finer representations in continual category discovery from unlabeled streams.
NetTailor: Tuning the Architecture, Not Just the Weights
cs.CV 2019-06 unverdicted novelty 7.0

NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for s...
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
cs.LG 2019-05 accept novelty 7.0

EfficientNet scales network depth, width, and resolution uniformly via a compound coefficient to deliver state-of-the-art accuracy and efficiency on image classification.
Temporal Aware Pruning for Efficient Diffusion-based Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

TAPE introduces temporal-aware token pruning for diffusion-based video generation, using frame smoothing, layer reselection, and timestep budgets to achieve speedups while maintaining visual fidelity and coherence.
TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

TAME uses a Mixture-of-Experts prompt bank with input-dependent routing and three unsupervised objectives to adaptively defend CLIP against adversarial attacks at inference time, achieving at least 49.1% robustness ga...
How to Choose Your Teacher for Fine Grained Image Recognition
cs.CV 2026-05 conditional novelty 6.0

Proposes Ratio 1-2 metric for teacher selection in knowledge distillation for fine-grained image recognition, validated across 1000+ experiments showing 18% better selection and up to 17% student accuracy gains.
Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
cs.SD 2026-05 conditional novelty 6.0

In moderate-sized fine-grained bioacoustics, pretraining scale of masked autoencoders on diverse general audio dominates over domain-specific objectives or data curation for transfer performance.
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning
cs.CV 2026-05 unverdicted novelty 6.0

A3B2 introduces an adaptive asymmetric adapter with uncertainty-aware dampening to reduce branch bias in few-shot vision-language image classification and outperforms standard adapter and prompt methods.
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning
cs.CV 2026-05 unverdicted novelty 6.0

A3B2 adds uncertainty-aware dampening and asymmetric MoE-style adapters to balance image and text branches, outperforming 11 baselines on 11 few-shot datasets.
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.
Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery
cs.CV 2026-05 unverdicted novelty 6.0

Relational Pattern Consistency improves generalized category discovery by using invariant relational patterns between novel samples and known-class prototypes for bidirectional knowledge transfer.
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
cs.LG 2026-05 unverdicted novelty 6.0

HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
ModelLens: Finding the Best for Your Task from Myriads of Models
cs.LG 2026-05 unverdicted novelty 6.0

ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
cs.CV 2026-05 unverdicted novelty 6.0

DINORANKCLIP outperforms CLIP and RANKCLIP on fine-grained and out-of-distribution tasks by injecting DINOv3 local structure and using third-order ranking consistency trained on Conceptual Captions 3M.
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
cs.CV 2026-05 unverdicted novelty 6.0

CAKI generates class-specific prompts from few-shot samples of the same class, stores them in a knowledge bank, and uses query-key matching to inject relevant class knowledge into test instance predictions for improve...
SpecPL: Disentangling Spectral Granularity for Prompt Learning
cs.CV 2026-05 unverdicted novelty 6.0

SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
cs.CV 2026-05 unverdicted novelty 6.0

IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
cs.AI 2026-05 unverdicted novelty 6.0

JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Three frameworks adapt foundation models for generalized category discovery under domain shifts via disentanglement and prompt tuning, showing gains on synthetic and real multi-domain data.
Prototype-Based Test-Time Adaptation of Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

PTA adapts VLMs at test time by maintaining and updating class-specific knowledge prototypes from test samples, achieving higher accuracy than cache-based methods with far less speed loss.
Prototype-Based Test-Time Adaptation of Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

PTA adapts VLMs at test time via adaptively weighted class prototypes that accumulate test-sample features, delivering higher accuracy than cache-based TTA while preserving nearly full inference speed.
HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
cs.CV 2026-04 unverdicted novelty 6.0

HyCal mitigates Domain Gravity in cross-discipline imbalanced few-shot class-incremental learning by calibrating prototypes with complementary directional and covariance-aware distances on frozen CLIP embeddings.
Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning
cs.CV 2026-04 unverdicted novelty 6.0

Dual-modality anchors from text descriptions and test-time image statistics filter views and ensemble predictions to improve test-time prompt tuning, achieving SOTA on 15 datasets.
Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models
cs.CV 2026-04 unverdicted novelty 6.0

AdvFLYP finetunes CLIP on web image-text pairs using adversarial contrastive learning and regularization to boost zero-shot adversarial robustness across domains better than prior proxy-dataset methods.
Visual prompting reimagined: The power of the Activation Prompts
cs.CV 2026-04 unverdicted novelty 6.0

Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.
Rényi Attention Entropy for Patch Pruning
cs.CV 2026-04 unverdicted novelty 6.0

Rényi entropy of attention maps serves as a tunable criterion for pruning redundant patches in vision transformers, reducing compute with preserved accuracy on image recognition.
The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
cs.LG 2026-03 unverdicted novelty 6.0

EAGC mitigates gradient entanglement in GCD by anchoring supervised gradients and adaptively projecting unlabeled ones, boosting existing methods to new state-of-the-art performance.
Specificity-aware reinforcement learning for fine-grained open-world classification
cs.CV 2026-03 unverdicted novelty 6.0

SpeciaRL applies a dynamic verifier-based reward in reinforcement learning to steer reasoning LMMs toward correct and specific predictions on fine-grained open-world image classification tasks.
Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning
cs.CV 2026-02 unverdicted novelty 6.0

Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained catego...
SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models
cs.CV 2025-12 conditional novelty 6.0

SigLino distills SigLIP2 and DINOv3 into efficient vision models via asymmetric relation-knowledge distillation, token-balanced batching, and hierarchical data sampling on a new 200M-image corpus, yielding better tran...
ATAC: Augmentation-Based Test-Time Adversarial Correction for CLIP
cs.CV 2025-11 unverdicted novelty 6.0

ATAC boosts CLIP robustness to adversarial image changes by correcting embeddings via augmentation-induced drift vectors and angular consistency checks, beating prior test-time methods by nearly 50 percent on average ...
On the Provable Importance of Gradients for Language-Assisted Image Clustering
cs.CV 2025-10 unverdicted novelty 6.0

GradNorm selects positive nouns via gradient magnitudes from cross-entropy loss, with an error bound proving it subsumes prior CLIP methods and delivers SOTA clustering results.
SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
cs.CV 2025-09 unverdicted novelty 6.0

SeMoBridge projects images into the text modality via a semantic bridge to reduce CLIP's intra-modal misalignment and improve few-shot performance.
Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts
cs.LG 2025-06 unverdicted novelty 6.0

MoRAM frames continual learning as incremental addition of rank-1 adapters viewed as self-activating key-value associative memory units in a mixture-of-experts setup.
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
cs.CV 2025-04 unverdicted novelty 6.0

Introduces FG-BMK benchmark and evaluates twelve LVLMs on fine-grained semantic recognition and feature tasks, identifying influences from training paradigms and perturbation sensitivity.
Perception Encoder: The best visual embeddings are not at the output of the network
cs.CV 2025-04 unverdicted novelty 6.0

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
Visual-RFT: Visual Reinforcement Fine-Tuning
cs.CV 2025-03 conditional novelty 6.0

Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
LLaVA-Video: Video Instruction Tuning With Synthetic Data
cs.CV 2024-10 unverdicted novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
Robust Adaptation of Foundation Models with Black-Box Visual Prompting
cs.CV 2024-07 unverdicted novelty 6.0

BlackVIP adapts foundation models via a Coordinator for input-dependent visual prompts and SPSA-GC for gradient estimation, enabling robust transfer on 19 datasets with low memory use and a link to randomized smoothin...
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
cs.CV 2024-03 unverdicted novelty 6.0

RAR combines CLIP retrieval with MLLM ranking to improve few-shot and zero-shot fine-grained visual recognition on 5 benchmarks, 11 few-shot datasets, and 2 detection tasks.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · cited by 81 Pith papers

[1] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In Proc. BMVC, 2011.

[2] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In CVPR Workshop on Fine-Grained Visual Categorization, 2011.

[3] J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur. Dog breed classification using part localization. In Proc. ECCV, 2012.

[4] O. Parkhi, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Cats vs dogs. In Proc. CVPR, 2012.

[5] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
