
    How to Set the Number of Trees in Random Forest

    By ProfitlyAI, May 16, 2025


    Scientific publication

    T. M. Lange, M. Gültas, A. O. Schmitt & F. Heinrich (2025). optRF: Optimising random forest stability by determining the optimal number of trees. BMC Bioinformatics, 26(1), 95.

    Follow this LINK to the original publication.

    Random Forest — A Powerful Tool for Anyone Working With Data

    What’s Random Forest?

    Have you ever wished you could make better decisions using data — like predicting the risk of diseases, crop yields, or spotting patterns in customer behaviour? That's where machine learning comes in, and one of the most accessible and powerful tools in this field is something called Random Forest.

    So why is random forest so popular? For one, it's incredibly versatile. It works well with many kinds of data, whether numbers, categories, or both. It's also widely used in many fields — from predicting patient outcomes in healthcare to detecting fraud in finance, from improving shopping experiences online to optimising agricultural practices.

    Despite the name, random forest has nothing to do with trees in a forest — but it does use something called Decision Trees to make smart predictions. You can think of a decision tree as a flowchart that guides you through a series of yes/no questions based on the data you give it. A random forest creates a whole bunch of these trees (hence the "forest"), each slightly different, and then combines their results to make one final decision. It's a bit like asking a group of experts for their opinion and then going with the majority vote, as the small sketch below shows.
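    To make the majority-vote idea concrete, here is a toy R snippet (our own illustration, not part of the publication) in which three hypothetical trees each cast a vote and the forest returns the most common answer:

    > votes = c(tree_1 = "yes", tree_2 = "no", tree_3 = "yes")
    > names(which.max(table(votes)))   # the majority vote wins
      [1] "yes"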

    But until recently, one question remained unanswered: How many decision trees do I actually need? If every decision tree can lead to different results, averaging many trees should lead to better and more reliable results. But how many are enough? Fortunately, the optRF package answers this question!

    So let’s take a look at easy methods to optimise Random Forest for predictions and variable choice!

    Making Predictions with Random Forests

    To optimise and use random forest for making predictions, we can use the open-source statistics programme R. Once we open R, we have to install the two R packages "ranger", which allows us to use random forests in R, and "optRF", to optimise random forests. Both packages are open source and available via the official R repository CRAN. To install and load these packages, the following lines of R code can be run:

    > install.packages("ranger")
    > install.packages("optRF")
    > library(ranger)
    > library(optRF)

    Now that the packages are installed and loaded into the library, we can use the functions that these packages contain. Furthermore, we can also use the data set included in the optRF package, which is free to use under the GPL licence (just like the optRF package itself). This data set, called SNPdata, contains in the first column the yield of 250 wheat plants as well as 5,000 genomic markers (so-called single nucleotide polymorphisms, or SNPs) that can contain either the value 0 or 2.

    > SNPdata[1:5,1:5]
                Yield SNP_0001 SNP_0002 SNP_0003 SNP_0004
      ID_001 670.7588        0        0        0        0
      ID_002 542.5611        0        2        0        0
      ID_003 591.6631        2        2        0        2
      ID_004 476.3727        0        0        0        0
      ID_005 635.9814        2        2        0        2

    This data set is an example of genomic data and can be used for genomic prediction, which is an important tool for breeding high-yielding crops and, thus, for fighting world hunger. The idea is to predict the yield of crops using genomic markers. And exactly for this purpose, random forest can be used! That means that a random forest model is used to describe the relationship between the yield and the genomic markers. Afterwards, we can predict the yield of wheat plants where we only have genomic markers.

    Therefore, let's imagine that we have 200 wheat plants where we know both the yield and the genomic markers. This is the so-called training data set. Let's further assume that we have 50 wheat plants where we know the genomic markers but not their yield. This is the so-called test data set. Thus, we split the data frame SNPdata so that the first 200 rows are saved as training data and the last 50 rows, without their yield, are saved as test data:

    > Training = SNPdata[1:200,]
    > Test = SNPdata[201:250,-1]

    With these data sets, we can now look at how to make predictions using random forests!

    First, we have to calculate the optimal number of trees for random forest. Since we want to make predictions, we use the function opt_prediction from the optRF package. Into this function we have to insert the response from the training data set (in this case the yield), the predictors from the training data set (in this case the genomic markers), and the predictors from the test data set. Before we run this function, we can use the set.seed function to ensure reproducibility, even though this is not necessary (we will see later why reproducibility is an issue here):

    > set.seed(123)
    > optRF_result = opt_prediction(y = Training[,1], 
    +                               X = Training[,-1], 
    +                               X_Test = Test)
      Recommended number of trees: 19000

    All the results from the opt_prediction function are now saved in the object optRF_result; however, the most important information was already printed in the console: for this data set, we should use 19,000 trees.
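    If you are curious about what else was computed, the result object can be inspected with standard R tools (the exact component names depend on the optRF version, so treat this as a sketch):

    > str(optRF_result)   # list all components of the opt_prediction result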

    With this information, we can now use random forest to make predictions. Therefore, we use the ranger function to derive a random forest model that describes the relationship between the genomic markers and the yield in the training data set. Also here, we have to insert the response in the y argument and the predictors in the x argument. Furthermore, we can set the write.forest argument to TRUE and insert the optimal number of trees in the num.trees argument:

    > RF_model = ranger(y = Training[,1], x = Training[,-1], 
    +                   write.forest = TRUE, num.trees = 19000)

    And that’s it! The article RF_model comprises the random forest mannequin that describes the connection between the genomic markers and the yield. With this mannequin, we are able to now predict the yield for the 50 crops within the check knowledge set the place we’ve the genomic markers however we don’t know the yield:

    > predictions = predict(RF_model, data=Test)$predictions
    > predicted_Test = data.frame(ID = row.names(Test), predicted_yield = predictions)

    The data frame predicted_Test now contains the IDs of the wheat plants together with their predicted yield:

    > head(predicted_Test)
          ID predicted_yield
      ID_201        593.6063
      ID_202        596.8615
      ID_203        591.3695
      ID_204        589.3909
      ID_205        599.5155
      ID_206        608.1031
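    In a breeding context, such predictions are typically used to pick the most promising plants. A minimal sketch (our addition, using only objects defined above) that ranks the test plants by predicted yield:

    > ranked = predicted_Test[order(predicted_Test$predicted_yield, decreasing=TRUE),]
    > head(ranked, 5)   # the five plants with the highest predicted yield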

    Variable Selection with Random Forests

    A different approach to analysing such a data set would be to find out which variables are most important for predicting the response. In this case, the question would be which genomic markers are most important for predicting the yield. This, too, can be done with random forests!

    If we tackle such a task, we don't need a training and a test data set. We can simply use the entire data set SNPdata and see which of the variables are the most important ones. But before we do that, we should again determine the optimal number of trees using the optRF package. Since we are interested in calculating the variable importance, we use the function opt_importance:

    > set.seed(123)
    > optRF_result = opt_importance(y=SNPdata[,1], 
    +                               X=SNPdata[,-1])
      Recommended number of trees: 40000

    One can see that the optimal number of trees is now higher than it was for predictions. This is actually often the case. However, with this number of trees, we can now use the ranger function to calculate the importance of the variables. Therefore, we use the ranger function as before, but we change the number of trees in the num.trees argument to 40,000 and we set the importance argument to "permutation" (other options are "impurity" and "impurity_corrected").

    > set.seed(123) 
    > RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], 
    +                   write.forest = TRUE, num.trees = 40000,
    +                   importance="permutation")
    > D_VI = data.frame(variable = names(SNPdata)[-1], 
    +                   importance = RF_model$variable.importance)
    > D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]

    The data frame D_VI now contains all the variables, i.e. all the genomic markers, and, next to them, their importance. Also, we have directly ordered this data frame so that the most important markers are at the top and the least important markers are at the bottom. This means that we can look at the most important variables using the head function:

    > head(D_VI)
      variable importance
      SNP_0020   45.75302
      SNP_0004   38.65594
      SNP_0019   36.81254
      SNP_0050   34.56292
      SNP_0033   30.47347
      SNP_0043   28.54312
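    For a quick visual impression of these scores, the top markers can be plotted with base R (a simple sketch of ours, not taken from the publication):

    > top10 = head(D_VI, 10)
    > barplot(top10$importance, names.arg=top10$variable, las=2,
    +         main="Top 10 markers by permutation importance")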

    And that’s it! We’ve got used random forest to make predictions and to estimate an important variables in an information set. Moreover, we’ve optimised random forest utilizing the optRF package deal!

    Why Do We Need Optimisation?

    Now that we’ve seen how simple it’s to make use of random forest and the way rapidly it may be optimised, it’s time to take a more in-depth take a look at what’s occurring behind the scenes. Particularly, we’ll discover how random forest works and why the outcomes may change from one run to a different.

    To do that, we’ll use random forest to calculate the significance of every genomic marker however as a substitute of optimising the variety of bushes beforehand, we’ll stick to the default settings within the ranger operate. By default, ranger makes use of 500 resolution bushes. Let’s strive it out:

    > set.seed(123) 
    > RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], 
    +                   write.forest = TRUE, importance="permutation")
    > D_VI = data.frame(variable = names(SNPdata)[-1], 
    +                   importance = RF_model$variable.importance)
    > D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]
    > head(D_VI)
      variable importance
      SNP_0020   80.22909
      SNP_0019   60.37387
      SNP_0043   50.52367
      SNP_0005   43.47999
      SNP_0034   38.52494
      SNP_0015   34.88654

    As expected, everything runs smoothly — and quickly! In fact, this run was considerably faster than when we previously used 40,000 trees. But what happens if we run the very same code again, this time with a different seed?

    > set.seed(321) 
    > RF_model2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], 
    +                    write.forest = TRUE, importance="permutation")
    > D_VI2 = data.frame(variable = names(SNPdata)[-1], 
    +                    importance = RF_model2$variable.importance)
    > D_VI2 = D_VI2[order(D_VI2$importance, decreasing=TRUE),]
    > head(D_VI2)
      variable importance
      SNP_0050   60.64051
      SNP_0043   58.59175
      SNP_0033   52.15701
      SNP_0020   51.10561
      SNP_0015   34.86162
      SNP_0019   34.21317

    Once again, everything seems to work fine, but take a closer look at the results. In the first run, SNP_0020 had the highest importance score at 80.23, but in the second run, SNP_0050 takes the top spot and SNP_0020 drops to fourth place with a much lower importance score of 51.11. That's a significant shift! So what changed?

    The answer lies in something called non-determinism. Random forest, as the name suggests, involves a lot of randomness: it randomly selects data samples and subsets of variables at various points during training. This randomness helps prevent overfitting, but it also means that results can vary slightly every time you run the algorithm — even with the very same data set. That's where the set.seed() function comes in. It acts like a bookmark in a shuffled deck of cards. By setting the same seed, you ensure that the random choices made by the algorithm follow the same sequence every time you run the code. But when you change the seed, you're effectively changing the random path the algorithm follows. That's why, in our example, the most important genomic markers came out differently in each run. This behaviour — where the same process can yield different results due to internal randomness — is a classic example of non-determinism in machine learning.
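    You can see this bookmark effect with a two-line experiment that is independent of our data set:

    > set.seed(123); sample(1:10, 3)   # some random triple
    > set.seed(123); sample(1:10, 3)   # the exact same triple: same seed, same path
    > set.seed(321); sample(1:10, 3)   # a different triple: new seed, new path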

    As we just saw, random forest models can produce slightly different results every time you run them, even when using the same data, because of the algorithm's built-in randomness. So, how can we reduce this randomness and make our results more stable?

    One of the simplest and most effective ways is to increase the number of trees. Each tree in a random forest is trained on a random subset of the data and variables, so the more trees we add, the better the model can "average out" the noise caused by individual trees. Think of it like asking 10 people for their opinion versus asking 1,000 — you're more likely to get a reliable answer from the larger group.
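    One way to check whether more trees actually help is to put a number on the stability. Since we already have the two importance tables D_VI and D_VI2 from above, a simple score (our own sketch, not necessarily the stability measure optRF uses internally) is the rank correlation of the importance values between the two runs; values close to 1 indicate stable rankings:

    > merged = merge(D_VI, D_VI2, by="variable")
    > cor(merged$importance.x, merged$importance.y, method="spearman")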

    With more trees, the model's predictions and variable importance rankings tend to become more stable and reproducible, even without setting a specific seed. In other words, adding more trees helps to tame the randomness. However, there's a catch. More trees also mean more computation time. Training a random forest with 500 trees might take a few seconds, but training one with 40,000 trees could take several minutes or more, depending on the size of your data set and your computer's performance.

    However, the relationship between the stability and the computation time of random forest is non-linear. While going from 500 to 1,000 trees can significantly improve stability, going from 5,000 to 10,000 trees might only provide a tiny improvement in stability while doubling the computation time. At some point, you hit a plateau where adding more trees gives diminishing returns — you pay more in computation time but gain very little in stability. That's why it's essential to find the right balance: enough trees to ensure stable results, but not so many that your analysis becomes unnecessarily slow.
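    To see the cost side of this trade-off on your own machine, you can time ranger at a few different tree counts (a quick sketch; the absolute numbers depend entirely on your hardware):

    > for (n in c(500, 5000, 20000)) {
    +   t = system.time(ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=n))
    +   cat(n, "trees:", round(t["elapsed"], 1), "seconds\n")
    + }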

    And this is exactly what the optRF package does: it analyses the relationship between the stability and the number of trees in random forests and uses this relationship to determine the optimal number of trees — the point that leads to stable results and beyond which adding more trees would unnecessarily increase the computation time.

    Above, we’ve already used the opt_importance operate and saved the outcomes as optRF_result. This object comprises the details about the optimum variety of bushes nevertheless it additionally comprises details about the connection between the steadiness and the variety of bushes. Utilizing the plot_stability operate, we are able to visualise this relationship. Due to this fact, we’ve to insert the title of the optRF object, which measure we’re eager about (right here, we have an interest within the “significance”), the interval we need to visualise on the X axis, and if the really useful variety of bushes needs to be added:

    > plot_stability(optRF_result, measure="importance", 
    +                from=0, to=50000, add_recommendation=FALSE)
    [Figure: stability of random forest as a function of the number of decision trees, as produced by the plot_stability function.]

    This plot clearly shows the non-linear relationship between stability and the number of trees. With 500 trees, random forest only reaches a stability of around 0.2, which explains why the results changed drastically when random forest was repeated with a different seed. With the recommended 40,000 trees, however, the stability is close to 1 (which indicates perfect stability). Adding more than 40,000 trees would bring the stability even closer to 1, but this increase would be only very small, while the computation time would increase further. That is why 40,000 trees is the optimal number of trees for this data set.
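    If you want the recommendation drawn directly into the plot, the same call can be repeated with the last argument switched on:

    > plot_stability(optRF_result, measure="importance", 
    +                from=0, to=50000, add_recommendation=TRUE)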

    The Takeaway: Optimise Random Forest to Get the Most Out of It

    Random forest is a powerful ally for anyone working with data — whether you're a researcher, analyst, student, or data scientist. It's easy to use, remarkably flexible, and highly effective across a wide range of applications. But like any tool, using it well means understanding what's happening under the hood. In this post, we've uncovered one of its hidden quirks: the randomness that makes it robust can also make it unstable if not carefully managed. Fortunately, with the optRF package, we can strike the right balance between stability and performance, ensuring we get reliable results without wasting computational resources. Whether you're working in genomics, medicine, economics, agriculture, or any other data-rich field, mastering this balance will help you make smarter, more confident decisions based on your data.


