Partitioned User-Defined Aggregates

When you create or edit a user-defined aggregate (UDA), you can specify a dimensional attribute to use as a partitioning key. If you do so, instances of the aggregate will be partitioned, with one partition created for each key value. Partitioning an aggregate can result in faster query performance.

Note: AtScale supports aggregate partitioning on the following data warehouses: Google BigQuery, Microsoft SQL Server, DB2, Snowflake, Hive, Impala, Spark, and Databricks. As of AtScale 2021.2.0, AtScale supports partitioning Google BigQuery Aggregate tables by Integer. Older versions of AtScale support Google BigQuery aggregate table partitioning for Date and DateTime columns only. Data warehouse platforms that do not have aggregate table partitioning support with AtScale are PostgreSQL, Redshift, and Iris.

When to use partitioning

For example, suppose that you look at the SUM of Order Quantity for all product categories per Order Day, Order Month, and Order Year. In Tableau, you might have a visualization such as this one.

Figure 1. Tableau workbook visualization

/public/images/partitioned_udas_1.png

Because you need to issue the query for this visualization and similar queries often, looking at the data for particular months and years, you define an aggregate that includes these dimensions and measures:

Dimensions
Order Day, Order Month, Order Year, Product Line, Product Category, Product Name
Measures
Average Sales per Order, Order Quantity, Sales Amount

When you refresh this visualization, filtering in this case on Order Month = "April" and Order Year = "2008", the query uses the aggregate, but you notice that the refresh takes longer than you would like. Given the very large size of your fact dataset, which could contain records from thousands of days and potentially thousands of orders per day, instances of the aggregate that you defined could also be very large. The query issued by AtScale requires a full scan on the aggregate to find the records that meet the filtering criteria.

To try improving the speed of the query, you can edit the aggregate in the AtScale Design Center by selecting a column (in this case, Order Month) to use as a partitioning key. When the AtScale engine builds instances of the aggregate, a partition is created for each value in the key. So, in your aggregate, there would be one partition per month. After you republish the project that contains the cube and you refresh the data visualization in Tableau, the query runs against only the partition for the key value that you filter on, which in the current state of your visualization is "April". Because the partition contains only a subset of the data that is in the aggregate as a whole, the query should run faster. Of course, you'll notice the improvement more when you partition very large tables than when you partition relatively smaller ones.

Partitioning user-defined aggregates is a good idea when the combination of dimensions in an aggregate would lead to aggregate instances with very high cardinalities. If you decide to partition a UDA, ensure that queries against the aggregate always use a WHERE clause that filters only on the partitioning key.

Performance Best Practice: MDX Tools like Excel will specify filters using Attribute Key bindings. However, SQL and DAX-based tools like Tableau will specify filters using Attribute Name bindings. Data architects should be aware of their user community's tool usage so they may adopt a partitioning strategy that maximizes their query performance. If a cube services a mix of BI Tools, then it is recommended that Data Architects either partition by both name and key columns or use the same physical column for the AtScale Key and Name attribute bindings.

Procedure

  1. Make sure the following advanced engine settings are set to True (default value). They can be accessed by selecting Settings > Engine under Organization Settings:

    • TABLES.CREATE.PARTITIONS.ENABLED
    • AGGREGATES.CREATE.PARTITION.USERDEFINEDAGGREGATE.ENABLED
  2. Proceed with the steps described in Defining Aggregates Yourself.

  3. Make sure you go to the Partitions tab, and specify whether the corresponding partition should be defined on the key column, name column, or both.

Result

Another engine configuration setting (TABLE.CREATE.PARTITIONS.MAXIMUMESTIMATEDNUMBEROFPARTITIONS) specifies the maximum number of partitions to allow for a user-defined aggregate. The AtScale engine refers to this setting after you publish a project in which a partitioned UDA is defined on a cube. The engine estimates the cardinality of the instances that would be built from the aggregate definition. If the estimate is at or below the configured maximum allowed number of partitions, the engine will build the UDA. However, if the estimate is above the configured maximum, the engine will not build the UDA and will issue an error message to say that the build of the UDA failed.

If the engine fails to build one of your partitioned UDAs, there are three different actions that you can try for a successful build:

  • Use a partitioning key that has a cardinality that is lower than the configured maximum allowed number of partitions.
  • Increase the maximum allowed number of partitions. (However, AtScale recommends that you set the maximum no higher than 1000.)
  • Edit the user-defined aggregate by unselecting the partitioning key.

Instances of partitioned UDAs can be build by full rebuilds or by incremental rebuilds. If an attribute that is used as a partitioning key is updated, the AtScale engine performs a full rebuild of the aggregate, creating a new instance and new partitions, whether or not you have specified to use incremental builds.

To monitor partitioned UDAs, you can use the Aggregates page in the Design Center, as you would for all other aggregate types. There are no special considerations that you need to be aware of when monitoring. To make troubleshooting easier, however, the Aggregates page displays the partitioning key for each partitioned user-defined aggregate.