How to Reduce Your Power BI Model Size by 90%

what makes Power Bi so quick and highly effective relating to efficiency? So highly effective, that it performs complicated calculations over thousands and thousands of rows within the blink of an eye fixed.

On this article, we’ll dig deep to find what’s “beneath the hood” of Energy BI, how your information is being saved, compressed, queried, and eventually, introduced again to your report. When you end studying, I hope that you’ll get a greater understanding of the laborious work occurring within the background and admire the significance of making an optimum information mannequin to get most efficiency from the Energy BI engine.

First look beneath the hood — Method Engine and Storage Engine

First, I would like you to satisfy the VertiPaq engine, “mind & muscle tissues” of the system behind not solely Energy BI, but in addition Evaluation Companies Tabular and Excel Energy Pivot. Fact to be stated, VertiPaq represents just one a part of the storage engine throughout the Tabular mannequin, apart from DirectQuery, which we’ll talk about individually in one of many subsequent articles.

While you ship the question to get information to your Energy BI report, here’s what occurs:

Method Engine (FE) accepts the request, processes it, generates the question plan, and eventually executes it
Storage Engine (SE) pulls the information out of the Tabular mannequin to fulfill the request issued throughout the question generated by the Method Engine

Storage Engine works in two alternative ways to retrieve requested information: VertiPaq retains a snapshot of the information in reminiscence. This snapshot could be refreshed every so often from the unique information supply.

Quite the opposite, DirectQuery doesn’t retailer any information. It simply forwards the question straight to the information supply for each single request.

Photo by RKTW extend on Unsplash

Knowledge within the Tabular mannequin is often saved both as an in-memory snapshot (VertiPaq) or in DirectQuery mode. Nonetheless, there may be additionally a chance of implementing a hybrid Composite mannequin, which depends on each architectures in parallel.

Method Engine — “Mind” of Energy BI

As I already harassed, Method Engine accepts the question, and because it’s capable of “perceive” DAX (and MDX additionally, however it’s out of the scope of this collection), it “interprets” DAX into a selected question plan, consisting of bodily operations that have to be executed to get outcomes again.

These bodily operations could be joins between a number of tables, filtering, or aggregations. It’s necessary to know that Method Engine works in a single-threaded approach, which implies that requests to Storage Engine are all the time being despatched sequentially.

Storage Engine — “Muscle tissues” of Energy BI

As soon as the question has been generated and executed by the Method Engine, the Storage Engine comes into the scene. It bodily goes by the information saved throughout the Tabular mannequin (VertiPaq) or goes on to a unique information supply (SQL Server, for instance, if DirectQuery storage mode is in place).

Relating to specifying the storage engine for the desk, there are three doable choices to select from:

Import mode — primarily based on VertiPaq. Desk information is being saved in reminiscence as a snapshot. Knowledge could be refreshed periodically
DirectQuery mode — information is being retrieved from the information supply at question time. Knowledge resides in its unique supply earlier than, throughout, and after the question execution
Twin mode — a mix of the primary two choices. Knowledge from the desk is being loaded into reminiscence, however at question time it will also be retrieved immediately from the supply

Versus Method Engine, which doesn’t assist parallelism, the Storage Engine can work asynchronously.

Meet VertiPaq Storage Engine

As we’ve drawn a giant image beforehand, let me clarify in additional element what VertiPaq does within the background to spice up the efficiency of our Energy BI stories.

Once we select Import mode for our Energy BI tables, VertiPaq performs the next actions:

Reads the information supply, transforms information right into a columnar construction, encodes, and compresses information inside every of the columns
Establishes a dictionary and index for every of the columns
Prepares and establishes relationships
Computes all calculated columns and calculated tables, and compresses them

The 2 foremost traits of VertiPaq are:

VertiPaq is a columnar database
VertiPaq is an in-memory database

As you may see within the illustration above, columnar databases retailer and compress information another way from conventional row-store databases. Columnar databases are optimized for vertical information scanning, which implies that each column is structured in its personal approach and bodily separated from different columns!

With out going into deep evaluation about benefits and downsides between row-store vs column-store databases, since it might require a separate collection of articles, let me simply pinpoint just a few key differentials when it comes to efficiency.

With columnar databases, single-column entry is quick and efficient. As soon as the computation begins to contain a number of columns, issues change into extra complicated, because the middleman steps’ outcomes have to be briefly saved in a roundabout way.

Merely stated, columnar databases are extra CPU-intensive, whereas row-store databases enhance I/O, due to many scans of ineffective information.

To date, we painted a giant image of the structure that allows Energy BI to totally shine as an final BI device. Now, we’re able to dive deeper into particular architectural options and consequently leverage this information to profit from our Energy BI stories, by tuning our information mannequin to extract the utmost from the underlying engine.

Inside VertiPaq in Energy BI — Compress for fulfillment!

As you would possibly recall from the earlier a part of this text, we scratched the floor of VertiPaq, a strong storage engine, which is “accountable” for the blazing-fast efficiency of most of your Energy BI stories (each time you’re utilizing Import mode or Composite mannequin).

3, 2, 1…Fasten your seatbelts!

One of many key traits of the VertiPaq is that it’s a columnar database. We realized that columnar databases retailer information optimized for vertical scanning, which implies that each column has its personal construction and is bodily separated from different columns.

That reality allows VertiPaq to use various kinds of compression to every of the columns independently, selecting the optimum compression algorithm primarily based on the values in that particular column.

Compression is being achieved by encoding the values throughout the column. However, earlier than we dive deeper into an in depth overview of encoding strategies, simply take into account that this structure is just not completely associated to Energy BI — within the background is a Tabular mannequin, which can be “beneath the hood” of Evaluation Companies Tabular and Excel Energy Pivot.

Worth Encoding

That is essentially the most fascinating worth encoding kind since it really works completely with integers and, subsequently, requires much less reminiscence than, for instance, when working with textual content values.

How does this look in actuality? Let’s say we’ve a column containing quite a few cellphone calls per day, and the worth on this column varies from 4.000 to five.000. What the VertiPaq would do, is to seek out the minimal worth on this vary (which is 4.000) as a place to begin, then calculate the distinction between this worth and all the opposite values within the column, storing this distinction as a brand new worth.

At first look, 3 bits per worth may not appear like a major saving, however multiply this by thousands and thousands and even billions of rows and you’ll admire the quantity of reminiscence saved.

As I already harassed, Worth Encoding is being utilized completely to integer information kind columns (forex information kind can be saved as an integer).

Hash Encoding (Dictionary Encoding)

That is most likely essentially the most regularly used compression kind by a VertiPaq. Utilizing Hash encoding, VertiPaq creates a dictionary of the distinct values inside one column and afterward replaces “actual” values with index values from the dictionary.

Right here is an instance to make issues clearer:

As you could discover, VertiPaq recognized distinct values throughout the Topics column, constructed a dictionary by assigning indexes to these values, and eventually saved index values as tips to “actual” values. I assume you’re conscious that integer values require approach much less reminiscence house than textual content, in order that’s the logic behind any such information compression.

Moreover, by having the ability to construct a dictionary for any information kind, VertiPaq is virtually information kind unbiased!

This brings us to a different key takeover: regardless of in case your column is of textual content, bigint or float information kind — from VertiPaq perspective it’s the identical — it must create a dictionary for every of these columns, which suggests that each one these columns will present the identical efficiency, each when it comes to velocity and reminiscence house allotted! After all, by assuming that there aren’t any important variations in dictionary sizes between these columns.

So, it’s a fable that the information kind of the column impacts its measurement throughout the information mannequin. Quite the opposite, the variety of distinct values throughout the column, which is named cardinality, largely influences column reminiscence consumption.

RLE (Run-Size-Encoding)

The third algorithm (RLE) creates a type of mapping desk, containing ranges of repeating values, avoiding storing each single (repeated) worth individually.

Once more, having a look at an instance will assist to raised perceive this idea:

In actual life, VertiPaq doesn’t retailer Begin values, as a result of it may possibly shortly calculate the place the following node begins by summing earlier Rely values.

As highly effective as it would have a look at first look, the RLE algorithm is extremely depending on the ordering throughout the column. If the information is saved the best way you see within the instance above, RLE will carry out nice. Nonetheless, in case your information buckets are smaller and rotate extra regularly, then RLE wouldn’t be an optimum resolution.

Yet one more factor to remember relating to RLE: In actuality, VertiPaq doesn’t retailer information the best way it’s proven within the illustration above. First, it performs Hash encoding and creates a dictionary of the themes, after which applies the RLE algorithm, so the ultimate logic, in its most simplified approach, can be one thing like this:

So, RLE happens after Worth or Hash Encoding, in these eventualities when VertiPaq “thinks” that it is smart to compress information moreover (when information is ordered in that approach that RLE would obtain higher compression).

Re-Encoding issues

Irrespective of how “good” VertiPaq is, it may possibly additionally make some unhealthy choices, primarily based on incorrect assumptions. Earlier than I clarify how re-encoding works, let me simply briefly iterate by the method of knowledge compression for a selected column:

VertiPaq scans a pattern of rows from the column
If the column information kind is just not an integer, it can look no additional and use Hash encoding
If the column is of integer information kind, some further parameters are evaluated: if the numbers within the pattern linearly enhance, VertiPaq assumes that it’s most likely a major key and chooses Worth encoding
If the numbers within the column are fairly shut to one another (the quantity vary is just not very large, like in our instance above with 4.000–5.000 cellphone calls per day), VertiPaq will use Worth encoding. Quite the opposite, when values fluctuate considerably throughout the vary (for instance between 1.000 and 1.000.000), then Worth encoding doesn’t make sense, and VertiPaq will apply the Hash algorithm

Nonetheless, it may possibly occur typically that VertiPaq comes to a decision about which algorithm to make use of primarily based on the pattern information, however then some outlier pops up and it must re-encode the column from scratch.

Let’s use our earlier instance for the variety of cellphone calls: VertiPaq scans the pattern and chooses to use Worth encoding. Then, after processing 10 million rows, hastily it discovered a 500.000 worth (it may be an error, or no matter). Now, VertiPaq re-evaluates the selection, and it may possibly resolve to re-encode the column utilizing the Hash algorithm as a substitute. Absolutely, that might influence the entire course of when it comes to the time wanted for reprocessing.

Lastly, right here is the checklist of parameters (so as of significance) that VertiPaq considers when selecting which algorithm to make use of:

Variety of distinct values within the column (Cardinality)
Knowledge distribution within the column — column with many repeating values could be higher compressed than one containing regularly altering values (RLE could be utilized)
Variety of rows within the desk
Column information kind — impacts solely the dictionary measurement

Lowering the information mannequin measurement by 90% — actual story!

After we laid the theoretical floor for understanding the structure behind the VertiPaq storage engine, and which sorts of compression it makes use of to optimize your Energy BI information mannequin, it’s the best second to get our fingers soiled and apply our information in a real-life case!

Start line = 776 MB

Our information mannequin is sort of easy, but memory-intensive. We now have a reality desk (factChat), which comprises information about reside assist chats and one dimension desk (dimProduct), which pertains to a reality desk. Our reality desk has round 9 million rows, which shouldn’t be a giant deal for Energy BI, however the desk was imported as it’s, with none further optimization or transformation.

Now, this pbix file consumes a whopping 777 MB!!! You may’t consider it? Simply have a look:

Simply keep in mind this image! After all, I don’t must inform you how a lot time this report must load or refresh, and the way our calculations are gradual due to the file measurement.

…and it’s even worse!

Moreover, it’s not simply 776 MBs that take our reminiscence, since reminiscence consumption is being calculated considering the next components:

PBIX file
Dictionary (you’ve realized in regards to the dictionary at first sections of this text)
Column hierarchies
Person-defined hierarchies
Relationships

Now, if I open Process Supervisor, go to the Particulars tab, and discover the msmdsrv.exe course of, I’ll see that it burns greater than 1 GB of reminiscence!

Oh, man, that basically hurts! And we haven’t even interacted with the report! So, let’s see what we are able to do to optimize our mannequin…

Rule #1 — Import solely these columns you really want

The primary and crucial rule is: hold in your information mannequin solely these columns you really want for the report!

That being stated, do I really want each the chatID column, which is a surrogate key, and the sourceID column, which is a major key from the supply system? Each of those values are distinctive, so even when I must rely the whole variety of chats, I might nonetheless be superb with solely certainly one of them.

So, I’ll take away the sourceID column and verify how the file seems now:

By eradicating only one pointless column, we saved greater than 100 MB!!! Let’s study additional what could be eliminated with out taking a deeper look (and we’ll come to this later, I promise).

Do we actually want each the unique begin time of the chat and UTC time, one saved as a Date/Time/Timezone kind, the opposite as Date/Time, and each going to a second degree of precision??!!

Let me eliminate the unique begin time column and hold solely UTC values.

One other 100 MB of wasted house gone! By eradicating simply two columns we don’t want, we decreased the dimensions of our file by 30%!

Now, that was with out even trying into the small print of the reminiscence consumption. Let’s now activate DAX Studio, my favourite device for troubleshooting Energy BI stories. As I already harassed just a few occasions, this device is a MUST in case you plan to work significantly with Energy BI — and it’s fully free!

One of many options in DAX Studio is a VertiPaq Analyzer, a really great tool constructed by Marco Russo and Alberto Ferrari from sqlbi.com. Once I hook up with my pbix file with DAX Studio, listed here are the numbers associated to my information mannequin measurement:

I can see right here what the most costly columns are in my information mannequin and resolve if I can discard a few of them, or if I must hold all of them.

At first look, I’ve few candidates for elimination — sessionReferrer and referrer columns have excessive cardinality and subsequently can’t be optimally compressed. Furthermore, as these are textual content columns and have to be encoded utilizing a Hash algorithm, you may see that their dictionary measurement is extraordinarily excessive! In the event you take a better look, you may discover that these two columns take nearly 40% of my desk measurement!

After checking with my report customers in the event that they want any of those columns, or perhaps solely certainly one of them, I’ve bought a affirmation that they don’t carry out any evaluation on these columns. So, why on Earth ought to we bloat our information mannequin with them??!!

One other robust candidate for elimination is the LastEditDate column. This column simply exhibits the date and time when the report was final edited within the information warehouse. Once more, I checked with the report customers, they usually didn’t even know that this column exists!

I eliminated these three columns, and the result’s:

Oh, God, we halved the dimensions of our information mannequin by simply eradicating just a few pointless columns.

Fact be instructed, there are just a few extra columns that could possibly be dismissed from the information mannequin, however let’s now deal with different strategies for information mannequin optimization.

Rule #2 — Scale back the column cardinality!

As you could recall from the earlier a part of the article, the rule of thumb is: the upper the cardinality of a column, the tougher for VertiPaq to optimally compress the information. Particularly if we’re not working with integer values.

Let’s take a deeper have a look at VertiPaq Analyzer outcomes:

As you see, even when the chatID column has greater cardinality than the datetmStartUTC column, it takes nearly 7 occasions much less reminiscence! Since it’s a surrogate key integer worth, VertiPaq applies Worth encoding, and the dimensions of a dictionary is irrelevant. Then again, Hash encoding is being utilized for the column of date/time information kind with excessive cardinality, so the dictionary measurement is enormously greater.

There are a number of strategies for decreasing the column cardinality, akin to splitting columns. Listed below are just a few examples of utilizing this method.

For Integer columns, you may break up them into two even columns utilizing division and modulo operations. In our case, it might be:

SELECT chatID/1000 AS chatID_div
,chatID % 1000 AS chatID_mod……….

This optimization approach should be carried out on the supply aspect (on this case, by writing a T-SQL assertion). If we use the calculated columns, there isn’t any profit in any respect, because the unique column must be saved within the information mannequin first.

The same approach can convey important financial savings when you could have decimal values within the column. You may merely break up values earlier than and after the decimal as defined in this article.

Since we don’t have any decimal values, let’s deal with our downside — optimizing the datetmStartUTC column. There are a number of legitimate choices to optimize this column. The primary is to verify in case your customers want granularity greater than the day degree (in different phrases, are you able to take away hours, minutes, and seconds out of your information).

Let’s verify what financial savings this resolution would convey:

The very first thing we discover is that our file is now 271 MB, so 1/3 of what we began with. VertiPaq Analyzer’s outcomes present that this column is now nearly completely optimized, going from taking on 62% of our information mannequin to simply barely over 2.5%! That’s huuuuge!

Nonetheless, it appeared that the day-level grain was not superb sufficient, and my customers wanted to investigate figures on the hour degree. OK, so we are able to at the least eliminate minutes and seconds, and that might additionally lower the cardinality of the column.

So, I’ve imported values rounded per hour:

SELECT chatID
                ,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
                ,customerID
                ,userID
                ,ipAddressID
                ,productID
                ,countryID
                ,userStatus
                ,isUnansweredChat
                ,totalMsgsOp
                ,totalMsgsUser
                ,userTimezone
                ,waitTimeSec
                ,waitTimeoutSec
                ,chatDurationSec
                ,sourceSystem
                ,topic
                ,usaccept
                ,transferUserID
                ,languageID
                ,waitFirstClick
            FROM factChat

It appeared that my customers additionally didn’t want a chatVariables column for evaluation, so I’ve additionally eliminated it from the information mannequin.

Lastly, after disabling Auto Date/Time in Choices for Knowledge Load, my information mannequin measurement was round 220 MB! Nonetheless, one factor nonetheless bothered me: the chatID column was nonetheless occupying nearly 1/3 of my desk. And that is only a surrogate key, which isn’t utilized in any of the relationships inside my information mannequin.

So, right here I used to be inspecting two totally different options: the primary was to easily take away this column and mixture the variety of chats, counting them utilizing the GROUP BY clause. Nonetheless, there can be no profit by protecting the chatID column in any respect, because it’s not getting used wherever in our information mannequin. As soon as I’ve eliminated it from the mannequin, one final time, let’s verify the pbix file measurement:

Please recall the quantity we began at: 776 MB! So, I’ve managed to scale back my information mannequin measurement by nearly 90%, making use of some easy strategies which enabled the VertiPaq storage engine to carry out extra optimum compression of the information.

And this was an actual use case, which I confronted over the past yr!

Common guidelines for decreasing information mannequin measurement

To conclude, right here is the checklist of normal guidelines it’s best to take into account when making an attempt to scale back the information mannequin measurement:

Preserve solely these columns your customers want within the report! Simply sticking with this one single rule will prevent an unbelievable quantity of house, I guarantee you…
Attempt to optimize column cardinality each time doable. The golden rule right here is: check, check, check…and if there’s a important profit from, for instance, splitting one column into two, or to substitute a decimal column with two complete quantity columns, then do it! However, additionally take into account that your measures have to be rewritten to deal with these structural modifications, with a purpose to show anticipated outcomes. So, in case your desk is just not huge, or if it’s important to rewrite tons of of measures, perhaps it’s not value splitting the column. As I stated, it is dependent upon your particular situation, and it’s best to rigorously consider which resolution makes extra sense
Identical as for columns, hold solely these rows you want: for instance, perhaps you don’t must import information from the final 10 years, however solely 5! That can even scale back your information mannequin measurement. Speak to your customers, ask them what they really want, earlier than blindly placing the whole lot inside your information mannequin
Mixture your information each time doable! Meaning — fewer rows, decrease cardinality, so all good issues you’re aiming to attain! In the event you don’t want hours, minutes, or seconds degree of granularity, don’t import them! Aggregations in Energy BI (and Tabular mannequin on the whole) are a vital and large subject, which is out of the scope of this collection, however I strongly advocate you verify Phil Seamark’s blog and his collection of posts on artistic aggregations utilization
Keep away from utilizing DAX calculated columns each time doable, since they aren’t being optimally compressed. As a substitute, attempt to push all calculations to a knowledge supply (SQL database, for instance) or carry out them utilizing the Energy Question editor
Use correct information varieties (for instance, in case your information granularity is on a day degree, there isn’t any want to make use of Date/Time information kind. Date information kind will suffice)
Disable Auto Date/Time possibility for information loading (it will take away a bunch of routinely created date tables within the background)

Conclusion

After you realized the fundamentals of the VertiPaq storage engine and totally different strategies it makes use of for information compression, I wished to wrap up this text by exhibiting you a real-life instance of how we are able to “assist” VertiPaq (and Energy BI consequently) to get the perfect out of report efficiency and optimum useful resource consumption.

Thanks for studying, hope that you simply loved the article!

Source link

The Math That’s Killing Your AI Agent

Building Robust Credit Scoring Models (Part 3)

How to Measure AI Value

How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API To Dashboard

Google integerar Gemini Nano i Chrome för att identifiera bedrägerier

AI Papers to Read in 2025

Deep Reinforcement Learning: The Actor-Critic Method

DataRobot + Nebius: An enterprise-ready AI Factory optimized for agents

Most Popular

AI Medical Record Summarization: Definition, Challenges, And Best Practices

How to Analyze and Optimize Your LLMs in 3 Steps

Taking ResNet to the Next Level

Our Picks