Take a Deep Dive into Filtering in DAX

We all the time use filters when creating DAX expressions, similar to DAX measures, or when writing DAX queries.

However what occurs precisely once we apply filters?

This piece is precisely about this query.

I’ll begin with easy queries and add variants to discover what occurs below the hood.

I exploit DAX Studio and the choice to point out server timings for every question.

In case you wish to be taught extra about this characteristic and how one can interpret the outcomes, learn the primary article within the References part on the finish of this piece.

Let’s begin with the bottom question:

EVALUATE
	CALCULATETABLE(
				SUMMARIZECOLUMNS('Product'[BrandName]
						,"On-line Gross sales", [Sum Online Sales]
						)
					)

Determine 1 – Executing the question and the outcomes. The outcomes themselves will not be essential. However the execution statistics and the SE queries are (Determine by the Writer)

Once we activate the server timings and execute the question, we get the execution statistics and the Storage Engines (SE) question/queries wanted to get the info:

Determine 2 – The execution statistics for the question. We see the Storage Engine question to retrieve the outcomes (Determine by the Writer)

As you’ll be able to see, we’d like just one Storage Engine (SE) question to retrieve the outcomes.

The question completes in solely 47 ms and is served virtually completely by the SE (95.7%).

The extra time the SE can spend on a question, the higher, as a result of it’s the element that retrieves knowledge from the info shops and tables.
Furthermore, the SE can use a number of CPU cores, whereas the System Engine (FE) can use just one. We can not look at precisely what occurs within the FE as simply as we are able to with SE queries.

You’ll be able to be taught extra concerning the distinction between these two engines within the article talked about above.

A brief word:

A number of months in the past, I wrote an article right here with a really comparable title. However, whereas that one was solely about date filters with Time Intelligence capabilities, this one goes one step deeper into the rabbit gap.

That is far more generic than that one.

For those who missed it, I added the article hyperlink and extra assets on the present matter to the References part beneath.

Add easy filters

Subsequent, we add a easy filter for the product coloration purple to the question:

EVALUATE
	CALCULATETABLE(
				SUMMARIZECOLUMNS('Product'[BrandName]
						,"On-line Gross sales", [Sum Online Sales]
						)
				,'Product'[ColorName] = "Purple"
				)

Right here is the question and the outcomes restricted to the product coloration purple:

Determine 3 – The question and the outcomes for the info restricted to the product coloration purple (Determine by the Writer)

Once we have a look at the question statistics, we see this:

Determine 4 – The question statistics and the SE question with the filter for purple merchandise (Determine by the Writer)

As you’ll be able to see, the whole question is executed in a single SE question.

The filter is within the question’s WHERE clause. Due to this fact, solely the restricted knowledge is retrieved.

That is seen within the “Rows” column, as solely 14 rows are returned from this question.

However what occurs once we use the FILTER() operate to filter the merchandise:

EVALUATE
	CALCULATETABLE(
				SUMMARIZECOLUMNS('Product'[BrandName]
						,"On-line Gross sales", [Sum Online Sales]
						)
				,FILTER('Product'
						,'Product'[ColorName]  = "Purple")
				)

As you may know, utilizing the FILTER() operate will not be really helpful as a consequence of the way it works.

You’ll be able to be taught extra about this matter within the second article linked within the References part beneath.

The end result doesn’t change:

Determine 5 – The question and the outcomes for the info restricted by the FILTER() operate to the product coloration purple (Determine by the Writer)

However how does it have an effect on the execution plan and the SE queries?

Determine 6 – The question statistics and the SE question with the filter for purple merchandise, however with the FILTER() operate (Determine by the Writer)

As you’ll be able to see, on this case, the SE optimizes the question, yielding the identical execution plan as earlier than.

However, as we alter our code, we’ll see that utilizing FILTER() isn’t all the time a good suggestion.

Add a number of filters

Now, what occurs once we add a number of filters to a question?

EVALUATE
    CALCULATETABLE(
                SUMMARIZECOLUMNS('Product'[BrandName]
                        ,"On-line Gross sales", [Sum Online Sales]
                        )
                ,'Product'[ColorName]  = "Purple"
                ,'Geography'[ContinentName] = "Europe"
                )

Whereas the end result will not be that attention-grabbing to us, let’s have a look at the question statistics:

Determine 7 – Question statistics when making use of a number of filters to an expression (Determine by the Writer)

Once more, the question could be served by a single SE question that accommodates each filters.

The question executes so shortly that the FE time share is comparatively excessive, but it nonetheless solely takes 6ms.

When altering the question to make use of the FILTER() operate, the SE question doesn’t change both:

Determine 8 – Question statistics when making use of a number of filters with FILTER (Determine by the Writer)

This exhibits that, with this type of question, the engine can optimize execution to search out probably the most environment friendly method to fulfill the DAX question.

Anyway, the end result doesn’t change. It’s similar in each circumstances, accurately, as a result of we don’t change the filter per se. However please be affected person with me; I’m getting again to the FILTER() operate and why it’s essential to grasp its results in a second.

Transferring filters into measures

Subsequent, let’s see what occurs when the filter is moved into the measure.

Till now, the question was constructed in order that the measure [Sum Online Sales] obtained its filter from outdoors.

Let’s do that:

DEFINE 
MEASURE 'All Measures'[Online Sales A. Datum] =
		CALCULATE(
			SUMX('On-line Gross sales', ( 'On-line Gross sales'[UnitPrice] * 'On-line Gross sales'[SalesQuantity]) - 'On-line Gross sales'[DiscountAmount] )
			,'Product'[BrandName] = "A. Datum"
			)


EVALUATE
	CALCULATETABLE(
				SUMMARIZECOLUMNS('Product'[BrandName]
						,"On-line Gross sales A. Datum", [Online Sales A. Datum]
						)
				)

As you’ll be able to see, the filter is utilized contained in the measure [Online Sales A. Datum].

In fact, the ensuing quantity is similar in every row of the end result, because the Model is ready as “A. Datum”:

Determine 9 – The identical end result for every model with the measure containing a filter (Determine by the Writer)

However the execution is barely completely different:

Determine 10 – Question statistics for the measure containing the filter (Determine by the Writer)

This time, now we have two SE queries.

The question to get the gross sales for the Model “A. Datum”. This question accommodates the filter for that model.
The second question is used to get the checklist for all manufacturers within the end result set.

The primary question is most essential to us, as a result of it nonetheless exhibits the filter for the model set inside the measure.

This question could be absolutely served by the SE with a easy filter in a really environment friendly method.

However, normally, we wish to add a number of measures to a question (or a visible in a report).

What occurs once we add the [Sum Online Sales] measure to the question?

The end result will not be significantly essential, because it exhibits one column with gross sales for every model and one other with gross sales for the filtered model.

However the question statistics are attention-grabbing:

Determine 11 – Question statistics for the question with one measure containing a filter and one other with out. The place is the filter set within the measure? (Determine by the Writer)

As you’ll be able to see within the red-marked line within the SE question, the Model filter is not current.

As a result of the engine acknowledges that the filter within the measure is utilized to the identical column because the one within the question, it strikes the filter to the FE and returns the end result.

Now, what occurs once we filter one other column within the measure, for instance, the colour:

DEFINE 
MEASURE 'All Measures'[Online Sales Red] =
        CALCULATE(
            SUMX('On-line Gross sales', ( 'On-line Gross sales'[UnitPrice] * 'On-line Gross sales'[SalesQuantity]) - 'On-line Gross sales'[DiscountAmount] )
            ,'Product'[ColorName] = "Purple"
            )


EVALUATE
    CALCULATETABLE(
                SUMMARIZECOLUMNS('Product'[BrandName]
                        ,"On-line Gross sales", [Sum Online Sales]
                        ,"On-line Gross sales Purple", [Online Sales Red]
                        )
                )

Once more, the end result will not be significantly attention-grabbing. We have an interest within the question statistics:

Determine 12 – Question statistics for 2 measures with one filtering a column not contained within the end result set (Determine by the Writer)

As you’ll be able to see, this time now we have two queries by BrandName. One with out and one with the filter for the colour.

Each queries return the identical variety of rows (14) – one for every Model.

The FE handles combining the 2 outcomes right into a single desk.

All the question remains to be served primarily by the SE, which is great.

However now, let’s add the FILTER() operate to the Filter:

For this instance, I alter the measure to filter for 2 values with the IN operator:

,'Product'[BrandName] IN { "A. Datum", "Journey Works" }

On this variant, the SE question is like those earlier than.

The filter is handed immediately into the question’s WHERE clause.

However what occurs once I change it to this:

,FILTER('Product'

       ,'Product'[BrandName] IN { "A. Datum", "Journey Works" }

       )

To start with, the end result modifications:

Determine 13 – Results of the measures with and with out FILTER(). In purple, the outcomes with the easy filter, and in blue, the outcomes for the measure with FILTER() (Determine by the Writer)

The reason being that FILTER() works utterly in a different way.

It retains the prevailing filter context and provides a brand new one.

I defined this conduct in one other article that I added because the second hyperlink within the References part beneath.

Furthermore, the SE can not deal with this in a single question anymore:

Determine 14 – The question statistics for utilizing the FILTER() operate within the measure. Now, a number of queries are wanted. (Determine by the Writer)

The primary two queries retrieve the values for the model to filter (See the queries marked in pink).

Discover the massive variety of rows (324 and a pair of’560) returned by the primary two queries. That is the materialization of intermediate outcomes wanted to carry out the calculation.

The third question makes use of these intermediate outcomes to filter the info (marked in purple).

The results of the third question is just two rows—the 2 rows we see within the general end result.

As described in my different article, FILTER() should be used with care.

Not solely is it significantly slower, but it surely additionally works utterly in a different way from a easy filter.

Anyway, I can restore the earlier conduct by including an ALL() within the FILTER() name:

Determine 15 – Including an ALL() inside the FILTER() name restores the semantics of the easy filter. However why would you do that? (Determine by the Writer)

I don’t wish to conceal the truth that this instance is particular, because the filter utilized impacts the identical column as used within the question.

When altering the question to filter the nation, the engine can optimize the execution and use the easy kind once more:

Determine 16 – Right here we see that when filtering columns completely different from the columns used within the DAX question, the engine can optimize the execution and fall again to use a easy filter. Within the blue inset, you see the outcomes (Determine by the Writer)

As you’ll be able to see, the engine optimizes the execution of the question and falls again to a easy filter when filtering columns that differ from these used within the DAX question. Within the blue inset, you see the outcomes.

I see this type of filtering fairly often when builders who will not be as proficient write DAX measures.

Utilizing the FILTER() operate appears intuitive, however it will probably yield incorrect or complicated outcomes and is slower than a easy filter. I strongly advocate studying my article linked beneath about this operate, in addition to the dax.guide documentation and the articles linked on SQLBI.com.

Moreover, I’ve to sort far more than when utilizing a easy filter.

As a lazy man, this is a vital purpose to not use FILTER() when it’s pointless.

Add a fancy filter

Lastly, I wish to present what occurs when making use of a filter utilizing a DAX operate, similar to CONTAINSSTRING().

EVALUATE
	CALCULATETABLE(
				SUMMARIZECOLUMNS('Product'[BrandName]
						,"On-line Gross sales", [Sum Online Sales]
						)
				,CONTAINSSTRING('On-line Gross sales'[SalesOrderNumber], "202402252C")
				)

Such a question is executed while you use a slicer in your report back to filter for a selected order and retrieve the manufacturers of the bought merchandise.

Because the end result will not be essential at this level, let’s immediately have a look at the question statistics:

Determine 17 – Question statistics for the question utilizing a DAX operate to filter the end result. You’ll be able to see that just about the whole lot of the execution is carried out within the FE (Determine by the Writer)

Whereas the question took greater than 6 seconds to finish, 99.6% of the time was spent by the FE executing the CONTAINSSTRING() operate to search out matching rows within the knowledge. This operation may be very CPU-intensive, because the FE can use just one core. Once I execute this question on my laptop computer, it takes greater than 2 seconds longer.

I intentionally selected a gradual operate to show its results.

However the SE was nonetheless in a position to execute the question with a single question. Nonetheless, the constructive impact of this reality is negligible on this case.

Conclusion

Whereas it isn’t my intention to provide you recommendation on what to do and what to not do, I needed to point out you the implications of the alternative ways to put in writing DAX code and apply filters in your measures or queries.

The DAX engine(s) are very environment friendly in optimizing the queries, however they’ve limitations.

Due to this fact, we should all the time take care when writing our DAX code.

If the efficiency is poor or the code written by another person appears unusual, we must always analyze it to find out how one can enhance it.

I needed to point out you how one can do it and what to search for when analyzing your DAX code.

Bear in mind:

The Storage engine (SE) can use a number of CPU cores.
The extra work is completed by the SE, the higher.
The SE can execute solely easy aggregations and simple arithmetic capabilities (like +, -, x, and /)
Attempt to scale back the workload on the System Engine (FE)
The FE can use just one CPU core.
Attempt to scale back the materialization of information (The Rows column within the question statistics).
Attempt to scale back the variety of SE queries.

I do know that the necessities will pressure us to put in writing DAX code, which isn’t optimum.

Even worse, the Report designers may add logic to the report that causes a poor efficiency.

In such circumstances, remove that logic and test the response time once more. It may be price exploring making a devoted measure for such circumstances. Do not forget that it’s doable to create native measures in a report that’s linked to a Semantic mannequin through a life connection.

However most significantly: Take your time when writing DAX code. You may save time by avoiding the necessity to optimize your DAX code, which was written in a rush. I communicate from expertise. It is a very dangerous feeling.

I hope you realized one thing new.

References

To be taught the main points about how one can interpret the outcomes of the Server Timings in DAX Studio, learn this piece:

Are you interested by how one can use the FILTER() operate accurately? Learn this:

One other DAX operate that may hurt efficiency is KEEPFILTERS(). To be taught extra concerning the KEEPFILTERS() operate, learn this piece:

Right here, the talked about piece about date filters:

An attention-grabbing weblog submit by Information Mozart concerning the Storage engine:

Like in my earlier articles, I exploit the Contoso pattern dataset. You’ll be able to obtain the ContosoRetailDW Dataset at no cost from Microsoft here.

The Contoso Information can be utilized freely below the MIT License, as described in this document. I modified the dataset to shift the info to up to date dates.

Source link

Designing Data and AI Systems That Hold Up in Production

A Generalizable MARL-LP Approach for Scheduling in Logistics

Detecting and Editing Visual Objects with Gemini

For healthy hearing, timing matters | MIT News

Celebrating an academic-industry collaboration to advance vehicle technology | MIT News

How to Design Machine Learning Experiments — the Right Way

The Total Derivative: Correcting the Misconception of Backpropagation’s Chain Rule

MIT Sea Grant students explore the intersection of technology and offshore aquaculture in Norway | MIT News

Most Popular

How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API To Dashboard

Beyond the Flat Table: Building an Enterprise-Grade Financial Model in Power BI

Rapid Prototyping of Chatbots with Streamlit and Chainlit

Our Picks