My work at Multitel has centered on fine-grained visual classification (FGVC). In particular, I worked on building a robust car classifier that can run in real time on edge devices. This post is part of what may become a small series of reflections on that experience. I'm writing to share some of the lessons I learned, but also to organize and consolidate them. At the same time, I hope it gives a sense of the kind of high-level engineering and applied research we do at Multitel, work that blends academic rigor with real-world constraints. Whether you're a fellow researcher, a curious engineer, or someone considering joining our team, I hope this post offers both insight and inspiration.
1. The problem:
We needed a system that could identify specific car models, not just "this is a BMW," but which BMW model and year. And it needed to run in real time on resource-constrained edge devices alongside other models. This kind of task falls under what is known as fine-grained visual classification (FGVC).
FGVC aims to recognize images belonging to multiple subordinate categories of a super-category (e.g. species of animals or plants, models of cars, etc.). The challenge lies in understanding fine-grained visual differences that sufficiently discriminate between objects that are highly similar in overall appearance but differ in fine-grained features [2].

What makes FGVC particularly difficult?
- Small inter-class variation: The visual differences between classes can be extremely subtle.
- Large intra-class variation: At the same time, instances within the same class may vary considerably due to changes in lighting, pose, background, or other environmental factors.
- The subtle visual differences can easily be overwhelmed by other factors such as pose and viewpoint.
- Long-tailed distributions: Datasets typically have a few classes with many samples and many classes with only a few examples. For instance, you might have only a couple of images of a rare spider species found in a remote region, while common species have thousands of images. This imbalance makes it difficult for models to learn equally well across all categories.

2. The landscape:
When we first started tackling this problem, we naturally turned to the literature. We dove into academic papers, examined benchmark datasets, and explored state-of-the-art FGVC methods. And at first, the problem seemed much more complicated than it actually turned out to be, at least in our specific context.
FGVC has been actively researched for years, and there is no shortage of approaches that introduce increasingly complex architectures and pipelines. Many early works, for example, proposed two-stage models: a localization subnetwork would first identify discriminative object parts, and then a second network would classify based on those parts. Others focused on custom loss functions, high-order feature interactions, or label dependency modeling using hierarchical structures.
All of these methods were designed to tackle the subtle visual distinctions that make FGVC so challenging. If you're curious about the evolution of these approaches, Wei et al. [2] provide a solid survey that covers many of them in depth.

When we looked closer at recent benchmark results (archived from Papers with Code), many of the top-performing solutions were based on transformer architectures. These models often reached state-of-the-art accuracy, but with little to no discussion of inference time or deployment constraints. Given our requirements, we were fairly certain that these models wouldn't hold up in real time on an edge device already running several models in parallel.
At the time of this work, the best reported result on Stanford Cars was 97.1% accuracy, achieved by CMAL-Net.
3. Our approach:
Instead of starting with the most complex or specialized solutions, we took the opposite approach: could a model that we already knew would meet our real-time and deployment constraints perform well enough on the task? Specifically, we asked whether a strong general-purpose architecture could get us close to the performance of newer, heavier models, if trained properly.
That line of thinking led us to a paper by Ross Wightman et al., "ResNet Strikes Back: An Improved Training Procedure in timm." In it, Wightman makes a compelling argument: most new architectures are trained using the latest advances and techniques but are then compared against older baselines trained with outdated recipes. Wightman argues that ResNet-50, which is frequently used as a benchmark, is often not given the benefit of these modern improvements. His paper proposes a refined training procedure and shows that, when trained properly, even a vanilla ResNet-50 can achieve surprisingly strong results, including on several FGVC benchmarks.
With these constraints and goals in mind, we set out to build our own strong, reusable training procedure, one that could deliver high performance on FGVC tasks without relying on architecture-specific tricks. The idea was simple: start with a known, efficient backbone like ResNet-50 and focus entirely on improving the training pipeline rather than modifying the model itself. That way, the same recipe could later be applied to other architectures with minimal adjustments.
We began collecting ideas, techniques, and training refinements from multiple sources, compounding best practices into a single, cohesive pipeline. In particular, we drew from four key resources:
- Bag of Tricks for Image Classification with Convolutional Neural Networks (He et al.)
- Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network (Lee et al.)
- ResNet Strikes Back: An Improved Training Procedure in timm (Wightman et al.)
- How to Train State-of-the-Art Models Using TorchVision's Latest Primitives (Vryniotis)
Our goal was to create a robust training pipeline that didn't rely on model-specific tweaks. That meant focusing on techniques that are broadly applicable across architectures.
To test and validate our training pipeline, we used the Stanford Cars dataset [9], a widely used fine-grained classification benchmark that closely aligns with our real-world use case. The dataset contains 196 car categories and 16,185 images, all taken from the rear to emphasize subtle inter-class differences. The data is nearly evenly split into 8,144 training images and 8,041 testing images. To simulate our deployment scenario, where the classification model operates downstream of an object detection system, we crop each image to its annotated bounding box before training and evaluation.
While the original hosting site for the dataset is no longer accessible, it remains available through curated repositories such as Kaggle and Hugging Face. The dataset is distributed under the BSD-3-Clause license, which permits both commercial and non-commercial use. In this work, it was used solely in a research context to produce the results presented here.

Building the Recipe
What follows is the distilled training recipe we arrived at, built through experimentation, iteration, and careful aggregation of ideas from the works mentioned above. The idea is to show that by simply applying modern training best practices, without any architecture-specific hacks, we could get a general-purpose model like ResNet-50 to perform competitively on a fine-grained benchmark.
We'll start with a vanilla ResNet-50 trained using a basic setup and progressively introduce improvements, one step at a time.
With each technique, we'll report:
- The individual performance gain
- The cumulative gain when added to the pipeline
While many of the techniques used are probably familiar, our intent is to highlight how powerful they can be when compounded deliberately. Benchmarks often obscure this by comparing new architectures trained with the latest advances against old baselines trained with outdated recipes. Here, we want to flip that and show what's possible with a carefully tuned recipe applied to a widely available, efficient backbone.
We also acknowledge that many of these techniques interact with one another. So, in practice, we tuned some combinations through greedy or grid search to account for synergies and interdependencies.
The Base Recipe:
Before diving into optimizations, we start with a clean, simple baseline.
We train a ResNet-50 model pretrained on ImageNet on the Stanford Cars dataset. Each model is trained for 600 epochs on a single RTX 4090 GPU, with early stopping based on validation accuracy using a patience of 200 epochs.
We use:
- Nesterov Accelerated Gradient (NAG) for optimization
- Learning rate: 0.01
- Batch size: 32
- Momentum: 0.9
- Loss function: Cross-entropy
All training and validation images are cropped to their bounding boxes and resized to 224×224 pixels. We start with the same standard augmentation policy as in [5].
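For concreteness, here is a minimal sketch of how such a baseline can be wired up in PyTorch. The dataset and dataloader plumbing is omitted, and details like the exact pretrained-weights enum are assumptions on our part rather than a verbatim copy of our training code.

```python
import torch
import torchvision

# Minimal baseline sketch: ResNet-50 pretrained on ImageNet, adapted to the
# 196 Stanford Cars classes, trained with NAG (SGD with Nesterov momentum).
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Linear(model.fc.in_features, 196)  # 196 car categories

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,        # base learning rate
    momentum=0.9,
    nesterov=True,  # Nesterov Accelerated Gradient
)
```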
Here's a summary of the base training configuration and its performance:
| Model | Pretrain | Optimizer | Learning rate | Momentum | Batch size |
|---|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.01 | 0.9 | 32 |

| Loss function | Image size | Epochs | Patience | Augmentation | Accuracy |
|---|---|---|---|---|---|
| Cross-entropy | 224×224 | 600 | 200 | Standard | 88.22 |
We fix the random seed across runs to ensure reproducibility and reduce variance between experiments. To assess the true effect of a change in the recipe, we follow best practices and average results over multiple runs (typically 3 to 5).
We'll now build on top of this baseline step by step, introducing one technique at a time and tracking its impact on accuracy. The goal is to isolate what each component contributes and how they compound when applied together.
Large batch training:
In mini-batch SGD, gradient descent is a stochastic process because the examples are randomly chosen in each batch. Increasing the batch size does not change the expectation of the stochastic gradient, but it reduces its variance. Using a large batch size, however, may slow down training progress: for the same number of epochs, training with a large batch size results in a model with degraded validation accuracy compared to one trained with a smaller batch size.
He et al. [5] argue that linearly increasing the learning rate with the batch size works empirically for ResNet-50 training.
To improve both the accuracy and the speed of our training, we change the batch size to 128 and the learning rate to 0.1. We add a StepLR scheduler that decays the learning rate of each parameter group by 0.1 every 30 epochs.
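Continuing the sketch above, this step only touches the optimizer settings and adds a scheduler. The training-loop helper and variables (`train_one_epoch`, `train_loader`, `num_epochs`) are assumed names used for illustration.

```python
from torch.optim.lr_scheduler import StepLR

# Batch size 128 is set on the DataLoader; the learning rate is raised to 0.1.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# Decay the learning rate of every parameter group by 0.1 every 30 epochs.
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

num_epochs = 600
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, criterion, optimizer)  # assumed helper
    scheduler.step()  # stepped once per epoch
```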
Learning rate warmup:
Since at the start of training all parameters are typically random values, using too large a learning rate can result in numerical instability.
With the warmup heuristic, we use a small learning rate at the beginning and then switch back to the initial learning rate once the training process is stable. We use a gradual warmup strategy that increases the learning rate from 0 to the initial value linearly.
We add a linear warmup phase of 5 epochs.
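One way to express this in PyTorch is to chain a linear warmup with the step decay via SequentialLR. Note that LinearLR cannot start at exactly zero, so the small start factor below is an approximation we chose, not a value from the original recipe.

```python
from torch.optim.lr_scheduler import LinearLR, SequentialLR, StepLR

# Ramp the learning rate (almost) from 0 up to the base value over 5 epochs,
# then hand over to the usual step decay. Schedulers are stepped once per epoch.
warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=5)
decay = StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[5])
```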

| Model | Pretrain | Optimizer | Learning rate | Momentum |
|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 |

| Batch size | Loss function | Image size | Epochs | Patience |
|---|---|---|---|---|
| 128 | Cross-entropy | 224×224 | 600 | 200 |

| Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method |
|---|---|---|---|---|
| Standard | StepLR | 30 | 0.1 | Linear |

| Warmup epochs | Warmup decay | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|---|
| 5 | 0.01 | 89.21 | +0.99 | +0.99 |
TrivialAugment:
To explore the impact of stronger data augmentation, we replaced the baseline augmentation with TrivialAugment. TrivialAugment works as follows: it takes an image x and a set of augmentations A as input. It simply samples an augmentation from A uniformly at random and applies it to the given image x with a strength m, sampled uniformly at random from the set of possible strengths {0, ..., 30}, and returns the augmented image.
What makes TrivialAugment especially attractive is that it is completely parameter-free: it requires no search or tuning, making it a simple yet effective drop-in replacement that reduces experimental complexity.
While it may seem counterintuitive that such a generic, randomized strategy would outperform augmentations specifically tailored to the dataset or more sophisticated automated augmentation methods, we tried a variety of alternatives, and TrivialAugment consistently delivered strong results across runs. Its simplicity, stability, and surprisingly high effectiveness make it a compelling default choice.
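With torchvision this is essentially a one-line change in the training transform. The resize and ImageNet normalization values below are our assumptions about the surrounding pipeline, not something prescribed by TrivialAugment itself.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),  # parameter-free: random op, random strength in {0, ..., 30}
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```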

| Model | Pretrain | Optimizer | Learning rate | Momentum |
|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 |

| Batch size | Loss function | Image size | Epochs | Patience |
|---|---|---|---|---|
| 128 | Cross-entropy | 224×224 | 600 | 200 |

| Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method |
|---|---|---|---|---|
| TrivialAugment | StepLR | 30 | 0.1 | Linear |

| Warmup epochs | Warmup decay | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|---|
| 5 | 0.01 | 92.66 | +3.45 | +4.44 |
Cosine Learning Rate Decay:
Next, we explored modifying the learning rate schedule. We switched to a cosine annealing strategy, which decreases the learning rate from the initial value to 0 by following the cosine function. A big advantage of cosine decay is that there are no hyperparameters to optimize, which again cuts down our search space.
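A sketch of the scheduler change, keeping the 5-epoch linear warmup in front. Exactly how warmup and cosine decay are chained is our choice here, not a detail spelled out above.

```python
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

num_epochs = 600
warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=num_epochs - 5, eta_min=0.0)  # decay to 0
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])
```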

| Model | Pretrain | Optimizer | Learning rate | Momentum |
|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 |

| Batch size | Loss function | Image size | Epochs | Patience |
|---|---|---|---|---|
| 128 | Cross-entropy | 224×224 | 600 | 200 |

| Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method |
|---|---|---|---|---|
| TrivialAugment | Cosine | – | – | Linear |

| Warmup epochs | Warmup decay | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|---|
| 5 | 0.01 | 93.22 | +0.56 | +5 |
Label Smoothing:
A good way to reduce overfitting is to stop the model from becoming overconfident. This can be achieved by softening the ground truth using label smoothing. The idea is to change the construction of the true label to:
\[
q_i =
\begin{cases}
1 - \varepsilon, & \text{if } i = y, \\
\frac{\varepsilon}{K - 1}, & \text{otherwise}.
\end{cases}
\]
There is a single parameter ε, which controls the degree of smoothing (the higher, the stronger), that we need to specify. We used a smoothing factor of ε = 0.1, the standard value proposed in the original paper and widely adopted in the literature.
Interestingly, we found empirically that adding label smoothing reduced gradient variance during training. This allowed us to safely increase the learning rate without destabilizing training. As a result, we increased the initial learning rate from 0.1 to 0.4.
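In PyTorch this boils down to a single argument on the loss, plus the higher learning rate mentioned above; the snippet continues the earlier sketch.

```python
# Soften the targets with epsilon = 0.1 and raise the base learning rate to 0.4.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9, nesterov=True)
```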
| Model | Pretrain | Optimizer | Learning rate | Momentum |
|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 |

| Batch size | Loss function | Image size | Epochs | Patience |
|---|---|---|---|---|
| 128 | Cross-entropy | 224×224 | 600 | 200 |

| Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method |
|---|---|---|---|---|
| TrivialAugment | StepLR | 30 | 0.1 | Linear |

| Warmup epochs | Warmup decay | Label smoothing | Accuracy | Incremental improvement |
|---|---|---|---|---|
| 5 | 0.01 | 0.1 | 94.5 | +1.28 |

| Absolute improvement |
|---|
| +6.28 |
Random Erasing:
As an additional form of regularization, we introduced Random Erasing into the training pipeline. This technique randomly selects a rectangular region within an image and replaces its pixels with random values, with a fixed probability.
Often paired with automatic augmentation methods, it usually yields additional accuracy improvements thanks to its regularization effect. We added Random Erasing with a probability of 0.1.
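Since torchvision's RandomErasing operates on tensors, it goes after ToTensor/Normalize in the transform stack; the rest of the pipeline shown is the same assumed one as before.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.1),  # erase a random rectangle in 10% of images
])
```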

| Model | Pretrain | Optimizer | Learning rate | Momentum |
|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 |

| Batch size | Loss function | Image size | Epochs | Patience |
|---|---|---|---|---|
| 128 | Cross-entropy | 224×224 | 600 | 200 |

| Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method |
|---|---|---|---|---|
| TrivialAugment | StepLR | 30 | 0.1 | Linear |

| Warmup epochs | Warmup decay | Label smoothing | Random erasing | Accuracy |
|---|---|---|---|---|
| 5 | 0.01 | 0.1 | 0.1 | 94.93 |

| Incremental improvement | Absolute improvement |
|---|---|
| +0.43 | +6.71 |
Exponential Moving Average (EMA):
Training a neural network on mini-batches introduces noise and less accurate gradients when gradient descent updates the model parameters from batch to batch. An exponential moving average of the weights is used when training deep neural networks to improve their stability and generalization.
Instead of just using the raw weights learned directly during training, EMA maintains a running average of the model weights, updated at each training step as a weighted average of the current weights and the previous EMA values.
Specifically, at each training step, the EMA weights are updated using:
\[
\theta_{\mathrm{EMA}} \leftarrow \alpha\,\theta_{\mathrm{EMA}} + (1 - \alpha)\,\theta
\]
where θ are the current model weights and α is a decay factor controlling how much weight is given to the past.
By evaluating the EMA weights rather than the raw ones at test time, we found improved consistency in performance across runs, especially in the later stages of training.
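Here is a minimal sketch of what maintaining EMA weights can look like. The class below is purely illustrative (PyTorch also ships utilities such as `torch.optim.swa_utils.AveragedModel`), and the decay and update cadence mirror the values reported in the table that follows.

```python
import copy
import torch

class ModelEMA:
    """Exponential moving average of model weights: theta_ema <- a*theta_ema + (1-a)*theta."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.994):
        self.decay = decay
        self.ema = copy.deepcopy(model).eval()
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for ema_v, v in zip(self.ema.state_dict().values(), model.state_dict().values()):
            if ema_v.dtype.is_floating_point:
                ema_v.mul_(self.decay).add_(v, alpha=1.0 - self.decay)
            else:
                ema_v.copy_(v)  # integer buffers (e.g. num_batches_tracked) are copied as-is

# Usage sketch: ema = ModelEMA(model); call ema.update(model) every few optimizer
# steps (every 32 steps in our setup), and evaluate ema.ema instead of model at test time.
```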
| Model | Pretrain | Optimizer | Learning rate | Momentum |
|---|---|---|---|---|
| ResNet50 | ImageNet | NAG | 0.1 | 0.9 |

| Batch size | Loss function | Image size | Epochs | Patience |
|---|---|---|---|---|
| 128 | Cross-entropy | 224×224 | 600 | 200 |

| Augmentation | Scheduler | Scheduler step size | Scheduler gamma | Warmup method |
|---|---|---|---|---|
| TrivialAugment | StepLR | 30 | 0.1 | Linear |

| Warmup epochs | Warmup decay | Label smoothing | Random erasing | EMA steps |
|---|---|---|---|---|
| 5 | 0.01 | 0.1 | 0.1 | 32 |

| EMA decay | Accuracy | Incremental improvement | Absolute improvement |
|---|---|---|---|
| 0.994 | 94.93 | 0 | +6.71 |
We tested EMA in isolation and found that it led to notable improvements in both training stability and validation performance. But when we integrated EMA into the full recipe alongside the other techniques, it didn't provide further improvement. The results seemed to plateau, suggesting that most of the gains had already been captured by the other components.
Because our goal is to develop a general-purpose training recipe rather than one overly tailored to a single dataset, we chose to keep EMA in the final setup. Its benefits may be more pronounced in other scenarios, and its low overhead makes it a safe inclusion.
Optimizations we tested but didn't adopt:
We also explored a range of additional techniques that are commonly effective in other image classification tasks, but found that they either didn't lead to significant improvements or, in some cases, slightly regressed performance on the Stanford Cars dataset:
- Weight decay: Adds L2 regularization to discourage large weights during training. We experimented extensively with weight decay in our use case, but it consistently regressed performance.
- CutMix/MixUp: CutMix replaces random patches between images and mixes the corresponding labels; MixUp creates new training samples by linearly combining pairs of images and labels. We tried applying either CutMix or MixUp randomly with equal probability during training, but this approach regressed results.
- AutoAugment: Delivered strong results and competitive accuracy, but we found TrivialAugment to be better. More importantly, TrivialAugment is completely parameter-free, which cuts down our search space and simplifies tuning.
- Other optimizers and schedulers: We experimented with a range of optimizers and learning rate schedules. Nesterov Accelerated Gradient (NAG) consistently gave us the best performance among optimizers, and cosine annealing stood out as the best scheduler, delivering strong results with no additional hyperparameters to tune.
4. Conclusion:
The graph below summarizes the improvements as we progressively built up our training recipe:

Using just a standard ResNet-50, we were able to achieve strong performance on the Stanford Cars dataset, demonstrating that careful tuning of a few simple techniques can go a long way in fine-grained classification.
However, it's important to keep this in perspective. These results mainly show that we can train a model to distinguish between fine-grained, well-represented classes in a clean, curated dataset. The Stanford Cars dataset is nearly class-balanced, with high-quality, mostly frontal images and no major occlusion or real-world noise. It does not address challenges like long-tailed distributions, domain shift, or recognition of unseen classes.
In practice, you'll never have a dataset that covers every car model, especially one that's updated daily as new models appear. Real-world systems need to handle distribution shifts, open-set recognition, and imperfect inputs.
So while this served as a strong baseline and proof of concept, there was still significant work to be done to build something robust and production-ready.
References:
[1] Krause, Deng, et al. Collecting a Large-Scale Dataset of Fine-Grained Cars.
[2] Wei, et al. Fine-Grained Image Analysis with Deep Learning: A Survey.
[3] Reslan, Farou. Automatic Fine-grained Classification of Bird Species Using Deep Learning.
[4] Zhao, et al. A Survey on Deep Learning-based Fine-grained Object Classification and Semantic Segmentation.
[5] He, et al. Bag of Tricks for Image Classification with Convolutional Neural Networks.
[6] Lee, et al. Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network.
[7] Wightman, et al. ResNet Strikes Back: An Improved Training Procedure in timm.
[8] Vryniotis. How to Train State-of-the-Art Models Using TorchVision's Latest Primitives.
[9] Krause, et al. 3D Object Representations for Fine-Grained Categorization.
[10] Müller, Hutter. TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation.
[11] Zhong, et al. Random Erasing Data Augmentation.