7 min read

A Hitchhiker's Guide to the FILTER Function in Oracle Analytics

Does it still make sense to write about the FILTER function these days, when the top trends for Oracle Analytics are mostly related to Machine Learning and fancy visualizations?

By looking at the title of this article you can easily guess that my answer is a very firm "Yes, absolutely"! The FILTER function is a must-know feature for any Oracle Analytics developer to create advanced calculations in their own reports. Despite already being available when I entered the working world in 2010, it's plagued by a bug and the one line documentation is not exhaustive at all: as a result, people are struggling with it and get wrong values that cannot be used in their Machine Learning models and fancy visualizations.

The Basics

The FILTER function computes the expression using the given preaggregate filter, as stated in the documentation. All clear, huh? ¯\_(ツ)_/¯

The function allows you to filter a measure column before applying aggregation rules to the data, and without affecting any other column included in the same logical query. Whether you use it in the Analysis Editor or in the metadata repository, the syntax is exactly the same:

FILTER(measure USING filter_expression)

As an example, let's suppose that you need to compare the revenue for a given product category (e.g. Hardware) with the revenue for all product categories. To achieve this goal, you can add the Revenue column twice to an analysis, and then edit the expression for one of them as follows:

FILTER("Measures"."Revenue" USING "Product"."Product Category" = 'Hardware')

It's important to note that a standard filter for the Product Category column is not a valid alternative to the FILTER function in this circumstance as it will affect all columns in the analysis.

Behind the Scenes

Now that you know the basics, you may wonder what happens behind the scenes when you use the FILTER function in your logical queries. To understand this, I'm going to run some analyses against a subject area based on the SH Sample Schema and analyze the query logs. The business model is really simple with a single fact table and four dimension tables (Figure 1).

Figure 1. The business model based on the SH Sample Schema used for testing

The result of each analysis will be compared against the table below (Figure 2) which summarizes the correct Revenue values by Product Category and Customer Gender.

Figure 2. The correct Revenue values by Product Category and Customer Gender

Test Case #1

The first analysis is straightforward: it contains only one column and the formula is exactly the same one that illustrates the basics of the FILTER function. When executed it allows me to get the revenue for the Hardware product category and the result (20,647,606) is correct (Figure 3).

Figure 3. The result of the test case #1

The filter expression part of the FILTER function is included in the WHERE clause of the physical query generated by the BI Server. The result set is restricted directly in the database before aggregating values as required, and a single row is returned to the user.

WITH SAWITH0 AS (select sum(T95.AMOUNT_SOLD) as c1 from SH.PRODUCTS T58, SH.SALES T95 where ( T58.PROD_CATEGORY = 'Hardware' and T58.PROD_ID = T95.PROD_ID ) ) select D1.c1 as c1, D1.c2 as c2 from ( select distinct 0 as c1, D1.c1 as c2 from SAWITH0 D1 ) D1 where rownum <= 125001

It's worth noting that if you replace the FILTER function with a column filter, the BI Server will generate exactly the same physical SQL query. It means that when your analysis contains one measure column and any number of attribute columns, there is no reason to use the FILTER function instead of a normal column filter.

Test Case #2

I'm now interested into looking at what happens when a logical query contains more than one calculated column based on the FILTER function. Let's extend the previous analysis to compare Hardware against Software/Other revenue with a new calculated column as follows:

FILTER("Measures"."Revenue" USING "Product"."Product Category" = 'Software/Other')

The result of the analysis - 20,647,606 for Hardware and 13,834,856 for Software/Other - is correct (Figure 4).

Figure 4. The result of the test case #2

The physical query still returns a single row, but it works in a slightly different way compared to the previous one. The filter expressions are combined together and embedded into the WHERE clause to restrict the result set, but CASE WHEN expressions are required to properly aggregate values this time.

WITH SAWITH0 AS (select sum(case when T58.PROD_CATEGORY = 'Hardware' then T95.AMOUNT_SOLD end ) as c1, sum(case when T58.PROD_CATEGORY = 'Software/Other' then T95.AMOUNT_SOLD end ) as c2 from SH.PRODUCTS T58, SH.SALES T95 where ( T58.PROD_ID = T95.PROD_ID and (T58.PROD_CATEGORY in ('Hardware', 'Software/Other')) ) ) select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3 from ( select distinct 0 as c1, D1.c1 as c2, D1.c2 as c3 from SAWITH0 D1 ) D1 where rownum <= 125001

Test Case #3

Let's add the "Measures"."Revenue" column to the first analysis I created. This allows us to compare the revenue for Hardware against the total for all product categories and understand what happens when measures with and without the FILTER function are combined in the same analysis.

The result of the analysis - 20,647,606 for Hardware and 98,205,831 for all product categories - is correct once again (Figure 5).

Figure 5. The result of the test case #3

The filter expression is no longer included in the WHERE clause of the physical query, but this is expected. When the database joins the SALES fact table with the PRODUCTS dimension table, the result set can't be restricted further since all rows are required to properly calculate the revenue for all product categories. A CASE WHEN expression is still required to aggregate values related to the column which is using the FILTER function in the analysis. The physical query still returns one row as in the previous test cases.

WITH SAWITH0 AS (select sum(T95.AMOUNT_SOLD) as c1, sum(case when T58.PROD_CATEGORY = 'Hardware' then T95.AMOUNT_SOLD end ) as c2 from SH.PRODUCTS T58, SH.SALES T95 where ( T58.PROD_ID = T95.PROD_ID ) ) select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3 from ( select distinct 0 as c1, D1.c1 as c2, D1.c2 as c3 from SAWITH0 D1 ) D1 where rownum <= 125001

The Root of All Evil

Why don't we spice things up? We could use the Expression Editor to transform an attribute into a measure and check whether it can be used successfully with the FILTER function. This may sound bizarre, but as an Oracle Analytics consultant I came through it several times, trust me!

Let's suppose that we have been asked to add the number of distinct female customers to an existing analysis which shows the grand total revenue. Despite the fact the subject area contains only Revenue and Quantity measures, I can easily calculate the number of distinct customers from the "Customer"."Customer ID" attribute as follows:

COUNT(DISTINCT "Customer"."Customer ID")

Since the expression above is based on an aggregation function, Oracle Analytics considers it as a measure, and allows us to use it in the FILTER function to get - in theory - the number of distinct female customers:

FILTER(COUNT(DISTINCT "Customer"."Customer ID") USING "Customer"."Customer Gender" = 'F')

In practice, it leads to an epic fail (Figure 6), several questions in the Analytics Software community forum, and probably tons of service requests!

Figure 6. The FILTER function applied to an aggregated attribute results in an epic fail

We were expecting to see 2,463 instead of 7,059 - the number of distinct customers regardless of the gender - according to Figure 2. The result suggests that the filter expression has been totally ignored and the physical query confirms our suspicions:

WITH SAWITH0 AS (select sum(T95.AMOUNT_SOLD) as c1, count(distinct T33.CUST_ID) as c2 from SH.CUSTOMERS T33, SH.SALES T95 where ( T33.CUST_ID = T95.CUST_ID ) ) select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3 from ( select 0 as c1, D1.c1 as c2, D1.c2 as c3 from SAWITH0 D1 ) D1 where rownum <= 125001

At Oracle, they have been aware of the issue for years, but they have categorised it as a product limitation rather than a bug (the justification provided in the Doc ID 2138129.1 is questionable at best, imo). It would be sufficient to include a simple CASE WHEN expression in the physical query to solve the issue (as highlited in bold in the code below), but this is apparently too difficult for the BI Server:

WITH SAWITH0 AS (select sum(T95.AMOUNT_SOLD) as c1, count(distinct (CASE WHEN T33.CUST_GENDER = 'F' THEN T33.CUST_ID END)) as c2 from SH.CUSTOMERS T33, SH.SALES T95 where ( T33.CUST_ID = T95.CUST_ID ) ) select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3 from ( select 0 as c1, D1.c1 as c2, D1.c2 as c3 from SAWITH0 D1 ) D1 where rownum <= 125001

Luckily, alternative solutions are available. To achieve our goal we can either add the number of distinct customers as a predefined measure to the business model (recommended) or replace the FILTER function with a CASE WHEN expression in the analysis:

COUNT(DISTINCT CASE WHEN "Customer"."Customer Gender" = 'F' THEN "Customer"."Customer ID" END)

If you opt for replacing the FILTER function with a CASE WHEN expression, it's interesting to note that the physical query generated by the BI Server is exactly the same as the one above that contains the fix highlighted in bold!

Conclusion

The FILTER function can really help you when designing your reports, but it's fundamental to know its limitations. This post aims at providing a complete guide to the FILTER function in order to implement successful analyses and avoid epic fails.

Do you feel everything is covered? Have a different opinion? Please do get in touch or DM me on Twitter @barretbse.