In data science, we try to improve the less-than-desirable performance of our model as we fit the data at hand. We try techniques ranging from changing model complexity to data massaging and preprocessing. More often than not, however, we are advised to "just" get more data. Aside from that being easier said than done, perhaps we should pause and question the conventional wisdom. In other words,
Does adding more data always yield better performance?
In this article, let's put this adage to the test using real data and a tool I built for such inquiry. We will shed light on the subtleties associated with data collection and expansion, challenging the notion that such endeavors automatically improve performance and calling for a more conscious and strategic practice.
What Does More Data Mean?
Let's first define what exactly we mean by "more data". In the most general setting, we commonly think of data as tabular. And when the idea of acquiring more data is suggested, adding more rows to our data frame (i.e., more data points or samples) is what first comes to mind.
However, an alternative approach would be adding more columns (i.e., more attributes or features). The first approach expands the data vertically, while the second does so horizontally.
We will next consider the commonalities and peculiarities of the two approaches.
Case 1: More Samples
Let's consider the first case of adding more samples. Does adding more samples necessarily improve model performance?
In an attempt to unravel this question, I created a tool hosted as a HuggingFace space. The tool lets you experiment with the effects of adjusting the attribute set, the sample size, and/or the model complexity when analyzing the UCI Irvine – Predict Students' Dropout and Academic Success dataset [1] with a decision tree. While both the tool and the dataset are intended for educational purposes, we will still be able to derive useful insights that generalize beyond this basic setting.

…


Say the university's dean hands you some student records and asks you to identify the factors that predict student dropout in order to address the issue. You are given 1500 data points to start with. You create a 700-data-point held-out test set and use the rest for training. The data furnished to you contains the students' nationalities and parents' occupations, as well as the GDP and the inflation and unemployment rates.
However, the results do not seem impressive. The F1 score is low. So, naturally, you ask your dean to pull some strings to procure more student records (perhaps from prior years or other schools), which they do over a few weeks. You rerun the experiment each time you get a new batch of records. Conventional wisdom suggests that adding more data steadily improves the modeling process (the test F1 score should increase monotonically), but that is not what you see. The performance fluctuates erratically as more data comes in. You are confused. Why would more data ever hurt performance? Why did the F1 score drop from 46% down to 39% when one of the batches was added? Shouldn't the relationship be causal?
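The batch-by-batch experiment can be sketched as below. This is a minimal, self-contained sketch: it uses synthetic stand-in data rather than the actual student records, and the batch size, tree depth, and split sizes are illustrative assumptions, not the tool's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the student records so the sketch runs end to end
X, y = make_classification(n_samples=3000, n_features=10, n_informative=4,
                           random_state=0)

# Hold out a fixed 700-point test set once; never touch it again
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=700, random_state=0)

# Grow the training set one "batch" at a time and re-evaluate each time
scores = []
for n in range(500, len(X_pool) + 1, 500):
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X_pool[:n], y_pool[:n])
    scores.append((n, f1_score(y_test, model.predict(X_test))))

for n, s in scores:
    print(f"train size {n}: test F1 = {s:.3f}")
```

With real, messy batches, the printed curve need not be monotone, which is exactly the fluctuation described above.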

Well, the question is really whether more samples necessarily provide more information. Let's first consider the nature of these additional samples:
- They could be false (i.e., a bug in data collection)
- They could be biased (e.g., over-representing a specific case that does not align with the true distribution as represented by the test set)
- The test set itself may be biased…
- Spurious patterns may be introduced by some batches and later cancelled out by other batches.
- The attributes collected establish little to no correlation or causation with the target (i.e., there are lurking variables unaccounted for). So, no matter how many samples you add, they are not going to get you anywhere!
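The first two failure modes are easy to demonstrate on synthetic data: below, a "new batch" whose labels are partly corrupted (a stand-in for a collection bug or a biased source) is appended to a clean training set, and the held-out F1 is compared before and after. All numbers and data here are synthetic assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=2400, n_features=8, n_informative=5,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=700, random_state=1)

def held_out_f1(X_tr, y_tr):
    """Train a fixed-depth tree and score it on the untouched test set."""
    tree = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)
    return f1_score(y_test, tree.predict(X_test))

base = held_out_f1(X_train, y_train)

# "More data": a fresh batch of 500 points, but with 40% of labels flipped,
# simulating a faulty or biased collection process
X_batch, y_batch = make_classification(n_samples=500, n_features=8,
                                       n_informative=5, random_state=1)
flip = np.random.RandomState(1).rand(len(y_batch)) < 0.4
y_batch[flip] = 1 - y_batch[flip]

more = held_out_f1(np.vstack([X_train, X_batch]),
                   np.concatenate([y_train, y_batch]))
print(f"F1 before batch: {base:.3f}, after corrupted batch: {more:.3f}")
```

In runs like this, the corrupted batch tends to drag the test score down, mirroring the dean scenario above.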
So, yes, adding more data can be a good idea, but we must pay attention to inconsistencies in the data (e.g., two students of the same nationality and social status may end up on different paths due to other factors). We must also carefully assess the usefulness of the available attributes (e.g., perhaps GDP has nothing to do with the student dropout rate).
Some may argue that this should not be an issue when you have lots of real data (after all, this is a relatively small dataset). There is merit to that argument, but only if the data is well homogenized and accounts for the different variabilities and "degrees of freedom" of the attribute set (i.e., the range of values each attribute can take and the possible combinations of those values as seen in the real world). Research has shown cases in which large datasets considered gold standard exhibit biases in interesting and obscure ways that were not easy to spot at first glance, causing misleading reports of high accuracy [2].
Case 2: More Attributes
Now, speaking of attributes, let's consider an alternative scenario in which your dean fails to acquire more student records. However, they come to you and say, "Hey you… I wasn't able to get more student records… but I was able to use some SQL to get more attributes for your data… I'm sure you can improve your performance now. Right?… Right?!"

Well, let's put that to the test. Let's look at the following example, where we incrementally add more attributes, expanding the students' profiles to include their marital, financial, and immigration statuses. Each time we add an attribute, we retrain the tree and evaluate its performance. As you can see, while some increments improve performance, others actually hurt it. But again, why?
Looking at the attribute set more closely, we find that not all attributes actually carry useful information. The real world is messy… Some attributes (e.g., Gender) might introduce noise or false correlations in the training set that will not generalize well to the test set (overfitting).
Also, while common wisdom says that as you add more data you should increase your model's complexity, this practice does not always yield the best result. Sometimes, when adding an attribute, decreasing the model complexity may help with overfitting (e.g., when Course was introduced to the mix).
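The attribute-increment loop looks roughly like this. The columns here are synthetic stand-ins (two informative attributes plus one pure-noise column standing in for something like Gender); the real dataset's columns and the tree depth used by the tool may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
n = 2000

# Two informative attributes and one pure-noise attribute
attr_a = rng.randn(n)
attr_b = rng.randn(n)
attr_noise = rng.randn(n)
y = (attr_a + 0.5 * attr_b + 0.3 * rng.randn(n) > 0).astype(int)

X_full = np.column_stack([attr_a, attr_b, attr_noise])
names = ["attr_a", "attr_b", "attr_noise"]

X_tr, X_te, y_tr, y_te = train_test_split(X_full, y, test_size=0.3,
                                          random_state=0)

# Add one attribute at a time, retrain, and re-evaluate
for k in range(1, X_full.shape[1] + 1):
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X_tr[:, :k], y_tr)
    score = f1_score(y_te, tree.predict(X_te[:, :k]))
    print(f"using {names[:k]}: test F1 = {score:.3f}")
```

Adding the noise column gives the tree more ways to overfit the training set, so the last increment can hold steady or even hurt the test score, as in the scenario above.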

Conclusion
Taking a step back and looking at the big picture, we see that while collecting more data is a noble cause, we should be careful not to automatically assume that performance will improve. There are two forces at play here: how well the model fits the training data, and how reliably that fit generalizes to unseen data.
Let's summarize how each type of "more data" influences these forces, depending on whether the added data is good (representative, consistent, informative) or bad (biased, noisy, inconsistent):
| | If data quality is good… | If data quality is poor… |
| --- | --- | --- |
| More samples (rows) | • Training error may rise slightly (more variation makes the data harder to fit).<br>• Test error usually drops; the model becomes more stable and confident. | • Training error may fluctuate due to conflicting examples.<br>• Test error often rises. |
| More attributes (columns) | • Training error usually drops (more signal leads to a richer representation).<br>• Test error drops as the attributes encode true, generalizable patterns. | • Training error usually drops (the model memorizes noisy patterns).<br>• Test error rises due to spurious correlations. |
Generalization is not just about quantity; it is also about quality and the right level of model complexity.
To wrap up, the next time someone suggests that you should "simply" get more data to magically improve accuracy, discuss the intricacies of such a plan with them. Talk about the characteristics of the procured data in terms of nature, size, and quality. Point out the nuanced interplay between data and model complexity. This will help make their effort worthwhile!
Lessons to Internalize:
- Whenever possible, don't take others' (or my) word for it. Experiment yourself!
- When adding more data points for training, ask yourself: Do these samples represent the phenomenon you are modeling? Are they showing the model more interesting, realistic cases, or are they biased and/or inconsistent?
- When adding more attributes, ask yourself: Are these attributes hypothesized to carry information that enhances our ability to make better predictions, or are they mostly noise?
- Finally, conduct hyperparameter tuning and proper validation to eliminate doubts when assessing how informative the new training data is.
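That last point can be as simple as re-tuning the tree's depth by cross-validation every time the data changes, instead of keeping the complexity fixed. A minimal sketch, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (possibly expanded) training data
X, y = make_classification(n_samples=1500, n_features=12, n_informative=6,
                           random_state=0)

# Cross-validated search over tree depth, scored by F1
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 8, None]},
    scoring="f1",
    cv=5,
)
search.fit(X, y)
print("best depth:", search.best_params_["max_depth"],
      f"| CV F1 = {search.best_score_:.3f}")
```

Rerunning a search like this after each new batch or attribute separates "the data got worse" from "the model complexity is no longer right for the data".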
Try it yourself!
If you would like to explore the dynamics showcased in this article yourself, I host the interactive tool here. As you experiment by adjusting the sample size, the number of attributes, and/or the model depth, you will observe the impact of these adjustments on model performance. Such experimentation enriches your perspective and understanding of the mechanisms underlying data science and analytics.
References:
[1] M. V. Martins, D. Tolledo, J. Machado, L. M. T. Baptista, and V. Realinho (2021), "Early Prediction of Student's Performance in Higher Education: A Case Study," Trends and Applications in Information Systems and Technologies, vol. 1, Advances in Intelligent Systems and Computing series, Springer. DOI: 10.1007/978-3-030-72657-7_16. This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which allows sharing and adaptation of the dataset for any purpose, provided that appropriate credit is given.
[2] Z. Liu and K. He, A Decade's Battle on Dataset Bias: Are We There Yet? (2024), arXiv: https://arxiv.org/abs/2403.08632
