Have you ever heard this:
We have these very large values in this time series, but they are just “outliers” and only occur [insert small number here]% of the time
Throughout my Data Science years, I’ve heard that sentence a lot: in conferences, during product reviews, and on calls with clients. It is said to reassure everyone that some very large (or small) unwanted values that might appear are not “standard”, that they don’t belong to the “normal process” (all vague terms I’ve heard used), and that they don’t represent a problem for the system we are trying to build (for reasons X, Y, and Z).
In production settings, these very small or very large values (known as extreme values) are accompanied by guardrails to “fail gracefully” in case an extreme value is measured. That is usually enough for cases where you just need your system to work, and you want to be sure it works at all times, even when the unwanted/non-standard/rude, crazy, and annoying extreme values happen.
However, when analyzing a time series, we can do something more than “fixing” the extreme value with guardrails and if/else thresholds: we can actually monitor extreme values so that we can understand them.
In a time series, extreme values actually represent something about the system of interest. For example, if your time series describes the energy consumption of a city, an unreasonably high energy level might indicate worrisome energy consumption in a specific area, which may require action. If you are dealing with financial data, the highs and lows have an obvious, crucial meaning, and understanding their behavior is extremely important.
In this blog post, we will be dealing with weather data, where the time series represents the temperature (in Kelvin). The data, as we’ll see, contains multiple cities, and every city yields a time series. If you pick a specific city, you get a time series like the one you see below:
So, in this kind of dataset, it’s quite important to model the maxima and minima, because they mean it’s either as hot as a furnace or extremely cold. Hopefully, at this point, you are asking yourself:
“What do we mean by modeling the maxima and minima?”
When you are dealing with a time series dataset, it’s reasonable to expect a Gaussian-like distribution, like the one you see here:

But if you consider only the extreme values, the distribution is far from that. And as you will see in a moment, there are several layers of complexity associated with extracting the distribution of extreme values. Three of them are:
- Defining an extreme value: how do you know something is extreme? We will define our working definition of an extreme value as a first step.
- Defining the distributions that describe these events: there are several possible distributions of extreme values. The three distributions treated in this blog post are the Generalized Extreme Value (GEV), Weibull, and Gumbel (a special case of the GEV) distributions.
- Choosing the best distribution: there are several metrics we can use to determine the “best fitting” distribution. We will focus on the Akaike Information Criterion, the log-likelihood, and the Bayesian Information Criterion.
All things we’ll talk about in this article 🥹
Looks like we have a lot of ground to cover. Let’s get started.
0. Data and Script Source
The language we’ll use is Python. The source code can be found in this PieroPaialungaAI/RareEvents folder. The data source can be found in this open source Kaggle Dataset. Mind you, if you clone the GitHub folder, you won’t need to download the dataset: it is contained in the RawData folder inside the RareEvents GitHub main folder (you’re welcome 😉).
1. Preliminary Data Exploration
To keep everything simple during the exploration phase, and to give us maximum flexibility in the notebook without writing hundreds of lines of code, I put all the data handling in a dedicated script. The code that does that [data.py] is the following:
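I don’t reproduce the repository’s exact data.py here; the following is a minimal sketch of what such a helper could look like. The class name, the CSV path, and every method except `.clean_and_preprocess()` (which is used later in the post) are assumptions on my part:

```python
# Hypothetical sketch of a data.py helper; names other than clean_and_preprocess() are assumptions.
import pandas as pd
import matplotlib.pyplot as plt


class WeatherData:
    def __init__(self, path="RawData/temperature.csv"):
        # Load the raw Kaggle temperature dataset (one column per city)
        self.data = pd.read_csv(path)

    def head(self, columns=None, n=5):
        # Display the first rows, optionally for a subset of cities
        cols = columns if columns is not None else self.data.columns
        return self.data[list(cols)].head(n)

    def plot_city(self, city):
        # Plot the temperature time series of a single city
        plt.plot(pd.to_datetime(self.data["datetime"]), self.data[city])
        plt.xlabel("datetime")
        plt.ylabel(f"{city} temperature (K)")
        plt.show()

    def clean_and_preprocess(self):
        # Parse the datetime column and expose year/month/day as separate columns
        self.data["datetime"] = pd.to_datetime(self.data["datetime"])
        self.data["year"] = self.data["datetime"].dt.year
        self.data["month"] = self.data["datetime"].dt.month
        self.data["day"] = self.data["datetime"].dt.day
        # Fill the few missing temperature readings by interpolation, then drop leftovers
        city_cols = self.data.columns.difference(["datetime", "year", "month", "day"])
        self.data[city_cols] = self.data[city_cols].interpolate()
        self.data = self.data.dropna()
        return self.data
```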
This code does all the dirty work data-wise, so we can perform all the following steps in just a few lines of code.
The first thing we can do is simply display a few rows of the dataset. We can do it with this code:
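A usage sketch, assuming the helper class above (the four cities shown are my pick; the original table may display different ones):

```python
# Load the data and peek at a handful of city columns
data = WeatherData()
data.head(columns=["datetime", "Vancouver", "Portland", "San Francisco", "Seattle"])
```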
Notice that there are 36 columns/cities in the dataset, not just 4. I displayed 4 to keep the table nicely formatted. 🙂
A few things to notice:
- Every column, except “datetime”, is a city and represents a time series, where every value corresponds to the entry in the datetime column, which represents the time axis.
- Every value in a city column represents the temperature in Kelvin at the date in the corresponding datetime column. For example, index = 3 for column = ‘Vancouver’ tells us that, at datetime 2012-10-01 15:00:00, the temperature was 284.627 K.
I also developed a function that allows you to plot a city column. For example, if you want to peek at what happens in New York, you can use this:
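A sketch of that call, assuming the plotting helper defined above:

```python
# Plot the New York temperature time series
data.plot_city("New York")
```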

Now, the datetime column is just a string column, but it would actually be useful to have the specific month, day, and year in separate columns. Also, we have some NaN values that we should handle. All these boring preprocessing steps are contained in `.clean_and_preprocess()`:
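A sketch of that step, again assuming the class above:

```python
# Run the preprocessing (datetime parsing, year/month/day columns, NaN handling)
clean_data = data.clean_and_preprocess()
clean_data.head()
```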
This is the output:
2. Detecting Extreme Events
Now, a crucial question:
What is an extreme event? And how are we going to detect it?
There are two main ways to define an “extreme event”. For example, if we want to identify the maxima, we can apply:
- The first definition: Peak Over Threshold (POT). Given a threshold, everything above that threshold is a maximum point (extreme event).
- The second definition: extreme within a region (block maxima). Given a window, we define the maximum value of the window as an extreme event. A toy comparison of the two definitions follows below.
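This little sketch is my own illustration (not from the original repository); it applies both definitions to the same synthetic hourly series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# 30 days of synthetic hourly "temperatures" around 290 K
toy = pd.Series(
    290 + 5 * rng.standard_normal(24 * 30),
    index=pd.date_range("2012-10-01", periods=24 * 30, freq="h"),
)

# Definition 1: Peak Over Threshold -> every value above the threshold is an extreme
threshold = toy.quantile(0.95)
pot_extremes = toy[toy > threshold]

# Definition 2: block maxima -> one extreme per window (here: one per day)
daily_maxima = toy.resample("D").max()

print(len(pot_extremes), "POT extremes vs.", len(daily_maxima), "daily block maxima")
```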
In this blog post, we’ll use the second approach. For instance, if we use daily windows, we scan through the dataset and extract the highest value for each day. This gives you a lot of points, as our dataset spans more than 5 years. Or we could do it with monthly or yearly windows. This gives you fewer points, but perhaps richer information.
That is exactly the power of this method: we have control over the number of points and their “quality”. For this study, arguably the best window size is the daily one. For another dataset, feel free to adjust based on the volume of your points; for example, you might want to reduce the window size if you have a very short natural window (e.g., you collect data every second), or increase it if you have a very large dataset (e.g., you have 50+ years of data and a weekly window is more appropriate).
This definition of maximum value is implemented in the RareEventsToolbox class, in the [rare_events_toolbox.py] script (look at the extract_block_max function).
And we can quickly display the distribution of rare events at different window sizes using the following block of code:
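I’m not reproducing the exact signature of the repository’s extract_block_max here; the sketch below uses a standalone helper that captures the same idea and plots the block maxima histograms for a few window sizes:

```python
import matplotlib.pyplot as plt

def extract_block_max(df, city, window="D"):
    # Maximum temperature per block ("D" = daily, "W" = weekly, "MS" = monthly)
    series = df.set_index("datetime")[city]
    return series.resample(window).max().dropna()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, window in zip(axes, ["D", "W", "MS"]):
    block_max = extract_block_max(clean_data, "New York", window=window)
    ax.hist(block_max, bins=30)
    ax.set_title(f"Block maxima, window = {window}")
    ax.set_xlabel("Temperature (K)")
plt.tight_layout()
plt.show()
```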

3. Extreme Event Distributions
Before diving into code, let’s take a step back. In general, extreme value distributions do not exhibit the beautiful Gaussian bell behavior that you have seen earlier (the Gaussian distribution for San Francisco). From a theoretical perspective, the two distributions to know are the Generalized Extreme Value (GEV) distribution and the Weibull distribution.
GEV (Generalized Extreme Value)
- The GEV is the foundation of extreme value theory and provides a family of distributions tailored to modeling block maxima or minima. A special case is the Gumbel distribution.
- Its flexibility comes from a shape parameter that determines the “tail behavior.” Depending on this parameter, the GEV can mimic different kinds of extremes (e.g., moderate, heavy-tailed).
- The justification of the GEV distribution is very elegant: just as the Central Limit Theorem (CLT) says, “if you average a bunch of i.i.d. random variables, the distribution of the average tends to a Gaussian”, Extreme Value Theory (EVT) says, “if you take the maximum (or minimum) of a bunch of i.i.d. random variables, the distribution of that maximum tends to a GEV.” The short simulation sketch below illustrates this.
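This quick simulation is my own illustration (not part of the original post): it draws block maxima of Gaussian samples and fits a GEV to them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# 2000 blocks, each the maximum of 365 i.i.d. standard normal draws
block_maxima = rng.standard_normal((2000, 365)).max(axis=1)

shape, loc, scale = stats.genextreme.fit(block_maxima)
# For Gaussian tails the limiting case is Gumbel, i.e. a GEV shape close to 0
print(f"Fitted GEV shape: {shape:.3f}")
```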
Weibull
- The Weibull is one of the most widely used distributions in reliability engineering, meteorology, and environmental modeling.
- It is especially useful for describing data where there is a sense of “bounded” or tapered-off extremes.
- Unlike the GEV distribution(s), the Weibull formulation is empirical. Waloddi Weibull, a Swedish engineer, first proposed the distribution in 1939 to model the breaking strength of materials.
So we have three possibilities: GEV, Gumbel, and Weibull. Now, which one is the best? The short answer is “it depends,” and another short answer is “it’s best just to try all of them and see which one performs best”.
So now we have another question:
How do we evaluate the quality of the fit between a distribution function and a set of data?
The three metrics we will use are the following:
- Log-Likelihood (LL). It measures how likely the observed data is under the fitted distribution: higher is better.
$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$
where f is the probability density (or mass) function of the distribution with parameters θ and x_i is the i-th observed data point.
- Akaike Information Criterion (AIC). The AIC balances two forces: fit quality (via the log-likelihood L) and simplicity (it penalizes models with too many parameters, number of parameters = k).
$$\mathrm{AIC} = 2k - 2\mathcal{L}$$
- Bayesian Information Criterion (BIC). Similar in spirit to the AIC, but harsher on complexity (dataset size = n).
$$\mathrm{BIC} = k \ln n - 2\mathcal{L}$$
The standard recommendation is to use one of AIC or BIC, as they account for both the log-likelihood and the complexity.
The implementation of the three distribution functions, and the corresponding L, AIC, and BIC values, is the following:
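I don’t know the repository’s exact code for this step, so here is a minimal SciPy sketch of what the fitting and scoring could look like (function and variable names are mine):

```python
import numpy as np
from scipy import stats

# Candidate extreme value families discussed above
CANDIDATES = {
    "gev": stats.genextreme,
    "gumbel": stats.gumbel_r,
    "weibull_min": stats.weibull_min,
}

def fit_and_score(block_maxima):
    data = np.asarray(block_maxima)
    n = len(data)
    results = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(data)                              # maximum likelihood fit
        log_likelihood = np.sum(dist.logpdf(data, *params))  # LL of the fitted model
        k = len(params)                                      # number of fitted parameters
        results[name] = {
            "param": params,
            "dist": dist,
            "metrics": {
                "log_likelihood": log_likelihood,
                "aic": 2 * k - 2 * log_likelihood,
                "bic": k * np.log(n) - 2 * log_likelihood,
            },
        }
    # Pick the family with the lowest AIC as the "best" one
    best_name = min(results, key=lambda name: results[name]["metrics"]["aic"])
    return best_name, results
```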
And then we can display our distribution using the following:
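A plotting sketch, building on the helpers assumed above:

```python
import numpy as np
import matplotlib.pyplot as plt

block_max = extract_block_max(clean_data, "New York", window="D")
best_name, results = fit_and_score(block_max)
dist, params = results[best_name]["dist"], results[best_name]["param"]

# Histogram of the daily block maxima with the best-fitting PDF on top
x = np.linspace(block_max.min(), block_max.max(), 300)
plt.hist(block_max, bins=40, density=True, alpha=0.5, label="Daily block maxima")
plt.plot(x, dist.pdf(x, *params), label=f"Fitted {best_name}")
plt.xlabel("Temperature (K)")
plt.legend()
plt.show()
```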

Pretty good fit, right? While it visually looks good, we can be a little more quantitative and look at the Q-Q plot, which displays the quantile match between the data and the fitted distribution:
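One way to produce such a Q-Q plot (a sketch, assuming the fitted dist and params from the previous snippet) is scipy.stats.probplot:

```python
from scipy import stats
import matplotlib.pyplot as plt

# Compare the empirical quantiles of the block maxima with the fitted distribution
stats.probplot(block_max, dist=dist, sparams=params, plot=plt)
plt.title("Q-Q plot: block maxima vs. fitted distribution")
plt.show()
```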

This shows that our distribution matches the provided dataset very well. Now, notice how, if you had tried a standard distribution (e.g., a Gaussian curve), you would indeed have failed: the distribution of the data is heavily skewed (as expected, because we are dealing with extreme values, and that requires extreme value distributions; this feels weirdly motivational 😁).
Now the cool thing is that, since we made it structured, we can also run this for every city in the dataset using the following block of code:
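A sketch of that loop, again with the helpers assumed above (the original runs over all 36 cities; here I list only the four that appear in the output below):

```python
best_fits = {}
for city in ["Dallas", "Pittsburgh", "New York", "Kansas City"]:
    block_max = extract_block_max(clean_data, city, window="D")
    best_name, results = fit_and_score(block_max)
    best_fits[city] = {"dist_type": best_name, **results[best_name]}

print(best_fits)
```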
And the output will look like this:
{'Dallas': {'dist_type': 'gev',
'param': (0.5006578789482107, 296.2415220841758, 9.140132853556741),
'dist': <scipy.stats._continuous_distns.genextreme_gen at 0x13e9fa290>,
'metrics': {'log_likelihood': -6602.222429209462,
'aic': 13210.444858418923,
'bic': 13227.07308905503}},
'Pittsburgh': {'dist_type': 'gev',
'param': (0.5847547512518895, 287.21064374616327, 11.190557085335278),
'dist': <scipy.stats._continuous_distns.genextreme_gen at 0x13e9fa290>,
'metrics': {'log_likelihood': -6904.563305593636,
'aic': 13815.126611187272,
'bic': 13831.754841823378}},
'New York': {'dist_type': 'weibull_min',
'param': (6.0505720895039445, 238.93568735311248, 55.21556483095677),
'dist': <scipy.stats._continuous_distns.weibull_min_gen at 0x13e9cd390>,
'metrics': {'log_likelihood': -6870.265288196851,
'aic': 13746.530576393701,
'bic': 13763.10587863208}},
'Kansas City': {'dist_type': 'gev',
'param': (0.5483246490879885, 290.4564464294219, 11.284265203196664),
'dist': <scipy.stats._continuous_distns.genextreme_gen at 0x13e9fa290>,
'metrics': {'log_likelihood': -6949.785968553707,
'aic': 13905.571937107414,
'bic': 13922.20016774352}}}
4. Summary
Thank you for spending time with me so far, it means a lot ❤️
Let’s recap what we did. Instead of hand-waving away “outliers,” we treated extremes as first-class signals. In particular:
- We took a dataset representing the temperature of cities around the world.
- We defined our extreme events using block maxima on a fixed window.
- We modeled city-level temperature highs with three candidate families (GEV, Gumbel, and Weibull).
- We selected the best fit using log-likelihood, AIC, and BIC, then verified the fits with Q-Q plots.
The results show that the “best” distribution varies by city: for example, Dallas, Pittsburgh, and Kansas City leaned towards the GEV, while New York fit a Weibull.
This kind of approach is essential when extreme values are of high importance in your system and you need to show how it statistically behaves under rare and extreme conditions.
5. Conclusions
Thank you again for your time. It means a lot ❤️
My name is Piero Paialunga, and I’m this guy right here:

I’m a Ph.D. candidate in the Aerospace Engineering Department of the University of Cincinnati. I talk about AI and Machine Learning in my blog posts, on LinkedIn, and here on TDS. If you liked the article and want to know more about machine learning and follow my studies, you can:
A. Follow me on LinkedIn, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email at piero.paialunga@hotmail