Authors: Antoni Olbrysz, Karol Struniawski, Tomasz Wierzbicki
Table of Contents
- Introduction
- New Dataset of Pollen Images
- Extraction of Individual Pollen Images
- Classification of Individual Pollen Images
- Conclusions
- Acknowledgement
1. Introduction
Pollen classification is a fascinating area of visual image recognition, with a broad range of use cases across ecology and biotechnology, such as studies of plant populations, climate change, and pollen structure. Despite this, the topic is relatively unexplored: few datasets of pollen images have been composed, and those that exist are often lackluster or otherwise insufficient for training a proper visual classifier or object detector, especially for images containing mixtures of various pollens. Besides providing a sophisticated visual identification model, our project aims to fill this gap with a custom dataset. Visual pollen classification is often difficult to solve without machine vision, as even modern biologists are often unable to differentiate between pollen of different plant species based on images alone. This makes the task of quickly and efficiently recognising harvested pollens extremely challenging when the source of the pollen particles is unknown beforehand.
1.1 Available Datasets of Pollen Images
This section highlights the parameters of several freely available datasets and compares them to the properties of our custom set.
Dataset 1
Link: https://www.kaggle.com/datasets/emresebatiyolal/pollen-image-dataset
Number of classes: 193
Images per class: 1-16
Image quality: Separated, clear images, often with text labels
Image colour: Various
Notes: The dataset appears to be composed of incongruent images taken from multiple sources. While broad in classes, each contains only a few pictures, insufficient for training any image detection model.
Dataset 2
Link: https://www.kaggle.com/datasets/andrewmvd/pollen-grain-image-classification
Number of classes: 23
Images per class: 35 (one class has 20)
Image quality: Well separated, slightly blurry, no text on images
Image colour: Uncoloured, consistent
Notes: Localised, well-prepared dataset for the classification of Brazilian Savannah pollen. Consistent image source, but the number of images per class may pose issues when aiming for high accuracy.
Dataset 3
Link: https://www.kaggle.com/datasets/nataliakhanzhina/pollen20ldet
Number of classes: 20
Images per class: Abundant
Image quality: Clear images, with both separated and joined pollen grains
Image colour: Dyed, consistent
Notes: An enormous number of well-labeled, consistent, high-quality images makes this the highest-quality dataset of the three. However, the dye colouring may be an issue in specific applications. Additionally, the magnification and the tendency of the pollen grains to overlap may pose problems in mixed-pollen scenarios.
2. New Dataset of Pollen Images
Our dataset is a collection of high-quality microscope images of four different classes of pollen belonging to common fruit crops: the European gooseberry, the haskap berry, the blackcurrant, and the shadbush. These plant species have not been part of any previous dataset, so ours contributes new data towards visual pollen classification.
Each class contains 200 images of multiple grains of pollen, all without dye. The data was obtained in collaboration with the National Institute of Horticultural Research in Skierniewice, Poland.
Number of classes: 5 (4 pollens + mixed)
Images per class: ~200
Image quality: Clear images, each containing multiple pollen grains; mixed images present
Image colour: Undyed, consistent
Our dataset focuses on locally available pollens, class balance, and an abundance of images to train on, all without added dye, which could make a classifier unsuitable for some tasks. Additionally, our proposed solution contains images with mixtures of different pollen types, aiding the training of detection models for field-collection applications. Example images from the dataset are shown in Figures 1-4.



The full dataset is available from the corresponding author on reasonable request. The data acquisition steps comprised sample preparation and taking the microscopic images, carried out by Professor Agnieszka Marasek-Ciołakowska and Ms. Aleksandra Machlańska from the Department of Applied Biology at the National Institute of Horticultural Research, for which we are very grateful. Their efforts have proven invaluable to the success of our project.
3. Extraction of Individual Pollen Images
To train various models to recognise pollens, we first extracted images of individual pollen grains from the pictures in our dataset. Each of those pictures contained multiple pollens as well as other lifeforms and contaminants, making identification of the pollen species far harder. We used YOLOv12, a state-of-the-art, attention-centric real-time object detection model available through the Ultralytics framework.
3.1 Fine-Tuning YOLOv12
Thanks to YOLOv12's innovative nature, it can be trained even on tiny datasets; we experienced this firsthand. To prepare our own dataset, we manually labeled the pollen locations on ten images in each of the four classes of our dataset using CVAT, later exporting the labels into .txt files corresponding to individual images. Then we organized our data into a YOLOv12-compatible format: a training set (7 image-label pairs per class, 28 in total) and a validation set (3 image-label pairs per class, 12 in total), and added a .yaml file pointing towards our dataset. The dataset was indeed very small. We downloaded the model (YOLO12s) from the YOLOv12 site and started the training. The resulting image in prediction mode, with detected individual pollens and the confidence overlay, is shown in Fig. 5.

The model detected pollens with very high accuracy, but there was one more thing to consider: the model's confidence. For every detected pollen, the model also outputs a value indicating how certain its prediction is. We had to decide whether to use a lower confidence threshold (more images, but a higher risk of malformed or non-pollen crops) or a higher one (fewer images, but a lower chance of non-pollen crops). We eventually settled on trying out two thresholds, 0.8 and 0.9, to evaluate which would work better when training classification models.
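The trade-off can be illustrated with a simple filter; the detections here are plain (label, confidence, box) tuples, a hypothetical stand-in for the model's real output objects:

```python
def filter_detections(detections, conf_threshold):
    """Keep only detections whose confidence meets the threshold.

    `detections` is a list of (label, confidence, box) tuples --
    a simplified stand-in for the detector's actual output format.
    """
    return [d for d in detections if d[1] >= conf_threshold]

detections = [
    ("pollen", 0.95, (10, 10, 60, 58)),
    ("pollen", 0.84, (80, 20, 130, 72)),
    ("pollen", 0.55, (200, 40, 230, 90)),  # likely debris: low confidence
]

print(len(filter_detections(detections, 0.8)))  # 2 detections pass at 0.8
print(len(filter_detections(detections, 0.9)))  # 1 detection passes at 0.9
```

A lower threshold keeps more crops at the cost of letting more dubious ones through, which is exactly the choice we faced.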
3.2 Exporting the individual-pollen datasets
To do this, we ran the model's prediction on all the class-specific images in our dataset. This worked very well, but after exporting we encountered another issue: some crops were clipped, even at higher thresholds. For this reason, we added another step before exporting the individual pollens: we eliminated images with a disproportionate aspect ratio (see example in Fig. 6). Specifically, we required the smaller dimension divided by the larger one to be at least 0.8.
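The aspect-ratio check reduces to a one-line predicate, sketched here for clarity:

```python
def keeps_aspect_ratio(width, height, min_ratio=0.8):
    """Reject crops that are too elongated: the shorter side divided
    by the longer side must be at least `min_ratio` (0.8 in our case)."""
    return min(width, height) / max(width, height) >= min_ratio

print(keeps_aspect_ratio(100, 95))  # True: ratio 0.95, a plausible whole grain
print(keeps_aspect_ratio(100, 60))  # False: ratio 0.6, likely a clipped crop
```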

Then we resized all the images to 224×224 pixels, the standard input size for deep learning models.
3.3 Individual-pollen datasets: a short evaluation
We ended up with two datasets, one made with a confidence threshold of 0.8 and the other with 0.9:
- 0.8 Threshold:
- gooseberry: 7788 images
- haskap berry: 3582 images
- blackcurrant: 4637 images
- shadbush: 4140 images
Total: 20147 images
- 0.9 Threshold:
- gooseberry: 2301 images
- haskap berry: 2912 images
- blackcurrant: 2438 images
- shadbush: 1432 images
Total: 9083 images
A quick look at the numbers shows that the 0.9-threshold dataset is less than half the size of the 0.8-threshold one. Neither dataset is balanced: the 0.8 one due to an excess of gooseberry images, and the 0.9 one due to a shortage of shadbush images.
YOLOv12 proved an effective tool for segmenting our images into two single-pollen image datasets, even though we encountered some difficulties. The newly created datasets may be unbalanced, but their size should compensate for this drawback, especially since every class is extensively represented. They hold a lot of potential for future training of classification models, but we shall see for ourselves.
4. Classification of Individual Pollen Images
4.1 An Overview of Model Evaluation Metrics
To properly approach training models, whether classical ones working on statistical features or more complex approaches such as convolutional neural networks or vision transformers, one must devise metrics to measure performance. Over the years, many methods have been devised for this task: from statistical measures such as F1, precision, and recall, to more visual metrics such as GradCAM, which allow a deeper insight into the model's inner workings. This section explores the grading methods used by our models, without going into unnecessary detail.
Recall
Recall is the ratio of correctly identified members of a class to the total number of items actually belonging to that class (see Eq. 1). It measures what percentage of the images of a class the model manages to find. Because it is computed per class, it remains useful on both balanced and imbalanced datasets.
Eq. 1 — Formula for recall.
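The equation image is not reproduced here; in standard notation, with TP and FN denoting true positives and false negatives for a given class, recall is:

```latex
\mathrm{Recall} = \frac{TP}{TP + FN}
```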
Precision
In contrast to recall, precision is the percentage of correctly classified items among all items the model assigned to the class (see Eq. 2). It measures how many of the model's predictions for a class were correct. This metric behaves similarly to recall with respect to class imbalance.
Eq. 2 — Formula for precision.
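The equation image is not reproduced here; in standard notation, with TP and FP denoting true positives and false positives for a given class, precision is:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}
```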
F1 Rating
The F1 score is simply the harmonic mean of precision and recall (see Eq. 3). It combines the two into a single concise measurement and, like them, still performs well even on unbalanced datasets.
Eq. 3 — Formula for F1.
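The equation image is not reproduced here; with P and R denoting precision and recall, the F1 score is their harmonic mean:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```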
Confusion Matrix
The confusion matrix is a visual measure comparing the number of predictions made for each class to the actual number of images in that class. It helps illustrate the mistakes made by the model, which may struggle only with specific classes (see Fig. 7).
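As a minimal illustration (the label lists here are hypothetical), a confusion matrix can be built by counting (true, predicted) pairs:

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows correspond to the true classes, columns to the predicted ones."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

classes = ["gooseberry", "haskap", "blackcurrant", "shadbush"]
y_true = ["gooseberry", "gooseberry", "haskap", "blackcurrant"]
y_pred = ["gooseberry", "blackcurrant", "haskap", "blackcurrant"]

for row in confusion_matrix(y_true, y_pred, classes):
    print(row)
# A perfect classifier leaves every off-diagonal entry at zero;
# here one gooseberry image was confused with blackcurrant.
```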

GradCAM
GradCAM is a method for inspecting CNN performance that visualises which areas of the image influence the prediction. To do this, the method computes the gradients flowing into one convolutional layer and derives an activation map that is visually overlaid on top of the image. It greatly aids in understanding and explaining the model's "reasons" for labelling a particular image as a given class (see example in Fig. 8).

These metrics are only a few of a vast sea of measurements and visualisation methods used in machine learning. Yet, they have proven sufficient for measuring the performance of our models. In later sections, metrics will be brought up as new classifiers are used and introduced in the project.
4.2 Individual Pollen Classification with Standard Models
With our images preprocessed, we could move on to the next stage: classifying individual pollen grains into species. We tried three approaches: standard, simple classifiers based on features extracted from images; Convolutional Neural Networks; and Vision Transformers. This section outlines our work on standard models, including the kNN classifier, SVMs, MLPs, and Random Forests.
Feature extraction
To make our classifiers work, we first had to obtain features on which they could base their predictions. We settled on two main types of features. One was statistical measures based on the distribution of pixel values in each colour channel (from the RGB model): the mean, standard deviation, median, quantiles, skew, and kurtosis, extracted for every colour layer. The other was GLCM (Gray Level Co-occurrence Matrix) features: contrast, dissimilarity, homogeneity, energy, and correlation, obtained from grayscale-converted images and extracted at different angles. Every single image thus had 21 statistical features and 20 GLCM-based features, which amounts to 41 features per image.
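The statistical half of the feature vector can be sketched as follows. This is an illustrative, standard-library-only version, under the assumption that "quantiles" refers to the first and third quartiles, which yields seven features per channel and 21 across the three RGB layers:

```python
import statistics

def channel_stats(values):
    """Statistical features for one colour channel's pixel values:
    mean, standard deviation, median, Q1, Q3, skewness, excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    q1, med, q3 = statistics.quantiles(values, n=4)
    centred = [v - mean for v in values]
    m2 = sum(c ** 2 for c in centred) / n
    m3 = sum(c ** 3 for c in centred) / n
    m4 = sum(c ** 4 for c in centred) / n
    skew = m3 / m2 ** 1.5 if m2 else 0.0
    kurtosis = m4 / m2 ** 2 - 3 if m2 else 0.0
    return [mean, std, med, q1, q3, skew, kurtosis]

# Seven features per channel, three channels -> 21 statistical features.
pixels_red = [12, 40, 41, 43, 45, 48, 200]  # toy red-channel values
features = channel_stats(pixels_red)
print(len(features))  # 7
```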
k-Nearest-Neighbors (kNN)
kNN is a classifier that uses a spatial representation of the data, predicting a sample's label from the labels of its k nearest neighbours in feature space. This classifier is fast, yet other methods outperform it.
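The core idea can be sketched in a few lines of standard-library Python, with toy 2-D features standing in for our 41-dimensional vectors:

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest
    training points (Euclidean distance in feature space)."""
    distances = sorted(
        (math.dist(x, query), label)
        for x, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors standing in for the real 41-dimensional ones.
X = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.1), (7.9, 8.3)]
y = ["gooseberry", "gooseberry", "shadbush", "shadbush"]
print(knn_predict(X, y, (1.1, 1.0), k=3))  # gooseberry
```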
kNN Metrics:
0.8 Dataset:
F1: 0.6454
Precision: 0.6734
Recall: 0.6441
0.9 Dataset:
F1: 0.6961
Precision: 0.7197
Recall: 0.7151
Support Vector Machine (SVM)
Like the kNN, the SVM represents data as points in a multi-dimensional space. However, instead of finding nearest neighbours, it algorithmically separates the data with a hyperplane. This yields better results than the kNN, but introduces randomness and is still outclassed by other solutions.
SVM Metrics:
0.8 Dataset:
F1: 0.6952
Precision: 0.7601
Recall: 0.7025
0.9 Dataset:
F1: 0.8556
Precision: 0.8687
Recall: 0.8597
Multi-Layered Perceptron (MLP)
The Multi-Layered Perceptron is a model inspired by the human brain and its neurons. It passes inputs through a network of layers of neurons, each with its own individual weights, which are altered during training. When well optimized, this model can often achieve great results for a standard classifier. However, pollen recognition was not one of those cases: it performed poorly compared to other solutions and was not consistent.
MLP Metrics:
0.8 Dataset:
F1: 0.8131
Precision: 0.8171
Recall: 0.8173
0.9 Dataset:
F1: 0.7841
Precision: 0.8095
Recall: 0.7940
Random Forest
The random forest is a model well known for its explainability: it is based on decision trees, which classify data using thresholds that humans can analyze far more easily than, for instance, weights in neural networks. The Random Forest performed fairly well and consistently; we found that 200 trees was optimal. However, it was outclassed by more complex classifiers.
RF Metrics:
0.8 Dataset:
F1: 0.8211
Precision: 0.8210
Recall: 0.8233
0.9 Dataset:
F1: 0.8150
Precision: 0.8202
Recall: 0.8216
The classical models exhibited varied degrees of performance: some performed worse than expected, while others delivered fairly good metrics. However, this is not yet the end. We still have advanced deep learning models to try out, such as Convolutional Neural Networks and Vision Transformers, and we expect them to perform considerably better.
4.3 Individual Pollen Classification with Convolutional Neural Networks
Classical models such as MLPs, Random Forests, and SVMs yielded mediocre to fairly good results in individual pollen classification. The next approach we decided to try was Convolutional Neural Networks (CNNs): models that learn features directly from images and are well known for their effectiveness.
Instead of training CNNs from scratch, we used a transfer learning approach: we took pre-trained models, specifically ResNet50 and ResNet152, and fine-tuned them on our dataset. This approach makes training considerably faster and less resource-demanding. It also allows for much easier classification, as the models have already been trained on large datasets. Before training, we also had to normalize the images.
In terms of metrics, we used Grad-CAM, a method that attempts to highlight the areas of an image that influenced a model's prediction the most, alongside standard metrics such as F1 score, precision, and recall. We also included confusion matrices to see whether our CNNs struggle with any particular class.
ResNet50
ResNet50 is a CNN architecture developed by Microsoft Research in 2015, which was a significant step towards creating far deeper and more efficient neural networks. It is a residual network (hence the name ResNet) that uses skip connections to allow direct data flow, which in turn mitigates the vanishing gradient problem.
We expected this model to perform worse than ResNet152. Our expectations were quickly subverted, as the model delivered predictions at the same level as ResNet152 on both datasets, as shown in the metrics and confusion matrices below (see Fig. 9 and Fig. 10), as well as the Grad-CAM visualization (see Fig. 11).
ResNet50 Metrics:
0.8 Dataset:
F1: 0.98
Precision: 0.98
Recall: 0.98
0.9 Dataset:
F1: 0.99
Precision: 0.99
Recall: 0.99



Regarding Grad-CAM, it did not provide any useful insights into the model's inner workings: the highlighted zones included the background and seemingly random locations. Given that it achieves very high accuracy, the network appears to pick up on patterns undetectable by the human eye.
ResNet152
Also a development of Microsoft's researchers, ResNet152 is a residual network and a CNN architecture of significant depth, with deep learning capabilities far exceeding those of ResNet50.
Therefore, our expectations for this model were higher than for ResNet50. We were disappointed to see that it merely performed on par with it, though it still performed excellently (see Fig. 12 and Fig. 13 for confusion matrices and Fig. 14 for Grad-CAM visualizations).
ResNet152 Metrics:
0.8 Dataset:
F1: 0.98
Precision: 0.98
Recall: 0.98
0.9 Dataset:
F1: 0.99
Precision: 0.99
Recall: 0.99



Grad-CAM was not helpful for ResNet152 either: we experienced the enigmatic nature of deep learning models, which achieve high accuracy but cannot be explained easily.
We were surprised that the more complex ResNet152 did not outperform ResNet50 on the 0.9 dataset. Both achieved the highest metrics of any models we had tried so far: they trumped the classical models, with the difference between the best classical model and the CNNs exceeding 10 percentage points. It is time to test the most innovative model: the Vision Transformer.
4.4 Individual Pollen Classification with Vision Transformers
For individual pollen classification, we first tried simple models, which provided varied levels of performance, from insufficient to satisfactory. Then we applied convolutional neural networks, which completely trumped their performance. Now it is time to try out the innovative model known as the Vision Transformer.
Transformers in general originate from the well-known 2017 paper "Attention Is All You Need" by researchers at Google, but they were initially used primarily for natural language processing. In 2020, the transformer architecture was applied to computer vision, yielding the ViT (Vision Transformer). Its excellent performance marked the beginning of the end of Convolutional Neural Networks' reign in the area.
Our approach here was similar to the one we used when training CNNs. We imported a pre-trained model, vit-base-patch16-224-in21k, trained on ImageNet-21k. Then we normalized our dataset images, fine-tuned the model, and noted down the resulting metrics and confusion matrices (see Fig. 15 and Fig. 16).
vit-base-patch16-224-in21k results:
0.8 Dataset:
F1: 0.98
Precision: 0.98
Recall: 0.98
0.9 Dataset:
F1: 1.00
Precision: 1.00
Recall: 1.00


On the 0.8 dataset, the Vision Transformer delivered a level of performance that did not exceed that of the Residual Networks, and it struggled with similar problems: it misclassified gooseberry as blackcurrant. However, on the 0.9 dataset it achieved a nearly perfect score. We witnessed innovation conquer more dated solutions, which prompted us to save the model and designate it as our model of choice for more demanding tasks.
4.5 Comparison of Metrics for Various Models
For our pollen classification tasks, we have used many models: traditional models, including the kNN, SVM, MLP, and Random Forest; Convolutional Neural Networks (ResNet50 and ResNet152); and a Vision Transformer (vit-base-patch16-224-in21k). This section serves as an overview and a performance ranking (see Tab. 1).

Ranking
6. kNN (k-Nearest-Neighbors)
The simplest classifier. As expected, it trained quickly but performed the worst.
5. MLP (Multi-Layered Perceptron)
The model's architecture is based on the human nervous system. The MLP was outperformed by the other standard models, which we did not expect.
4. RF (Random Forest)
The Random Forest classifier performed with the highest consistency of all models, but its metrics were far from ideal.
3. SVM (Assist Vector Machine)
The unexpected winner among the conventional classifiers. Its performance was somewhat random, but it yielded good results for a standard classifier on the 0.9 dataset.
2. ResNet50 and ResNet152 (Residual Networks)
Both architectures achieved the same high results thanks to their complexity, far exceeding the capabilities of any standard classifier on both datasets.
1. ViT (Imaginative and prescient Transformer)
The most innovative solution we tried trumped the classical models and caught up with the Residual Networks on the 0.8 dataset. Yet the real challenge was the 0.9 dataset, where the CNNs reached a seemingly insurmountable accuracy of 0.99. To our surprise, the Vision Transformer's results were so high that they were rounded to 1.00, a perfect score. Its results are a true testament to the power of innovation.
Note: the classification report rounded up the model's metrics; they are not exactly equal to 1, as that would mean that every single image was classified correctly. We settled for this value because only a marginal 5 images (0.27%) were misclassified.
By evaluating different classifiers in the domain of visual pollen recognition, we were able to experience the history and evolution of machine learning personally. We tested models with varying degrees of novelty, from the simplest classifiers to the attention-based Vision Transformer, and observed how their results improved along with their novelty. Based on this comparison, we unanimously elected the ViT as our model of choice for working with pollen.
5. Conclusions
The task of visually classifying pollen, which has eluded biologists around the world and lay beyond the grasp of human ability, has finally been proven possible thanks to the power of machine learning. The models presented in our publication have all shown the potential to classify the pollens, with varying degrees of accuracy. Some models, such as the CNNs and the Vision Transformer, have come close to perfection, recognising pollen with precision unseen in humans.
To better understand why this accomplishment is so impressive, we illustrate it in Fig. 17.

It is highly likely that most readers cannot correctly classify these images into the four classes mentioned previously. Nevertheless, our models have proven able to recognise them with almost perfect accuracy, reaching a top F1 score of over 99%.
One may wonder what such a classifier could be used for, or why it was trained in the first place. The applications of this technology are numerous, from monitoring plant populations to measuring airborne allergen levels on a local scale. We built the models not only to provide a tool for palynologists to classify the pollen they collect, but also to offer a research platform for other machine learning enthusiasts to build on, and to demonstrate the ever-expanding applications of this field.
On that note, this is the end of this publication. We sincerely hope the reader finds this information valuable in their research endeavours and that our articles have sparked ideas for projects using this technology.
6. Acknowledgments
We are very grateful to Professor Agnieszka Marasek-Ciołakowska from the National Institute of Horticultural Research, Skierniewice, Poland, for preparing samples and taking microscopic images of them using the Keyence VHX-5000 microscope. The authors possess the complete, non-restricted copyrights to the dataset used in this research and all images used within this article.