About AtScale Virtual Models

An AtScale virtual model is a metadata layer that overlays a multi-dimensional model format on top of the datasets stored in a connected data warehouse, such as Google BigQuery. The model is virtual because the data is not moved or processed up front. Instead, the model contains the logic about how to process and optimize the data at query runtime.

The Purpose of an OLAP System

The basic purpose of an Online Analytical Processing (OLAP) system is to let people access data, ask questions, and get answers quickly. OLAP is the foundation of business intelligence (BI), the broader set of tools and methodologies for gaining meaningful insights from raw transactional data.

OLAP systems take one of two major approaches to optimizing data for BI analysis: MOLAP (multi-dimensional OLAP) and ROLAP (relational OLAP).

In the MOLAP approach, aggregate (measure) data is pre-calculated for every dimension combination. The result is a multi-dimensional model, where each cell in the model represents an intersection of n dimension values.

/public/images/cube.png

The benefit of building MOLAP models is performance. Because measures are calculated ahead of time for every possible combination of dimensions, slicing and dicing multi-dimensional data is very fast. But this performance comes at a cost: redundant storage, long build times, time-to-insight latency, and the administrative overhead of managing data silos. MOLAP models also scale poorly to the dimension sizes found in big data, because build times keep increasing as the data grows. The latency between when data lands and when it is ready for analysis can prevent the business from making timely decisions.
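The trade-off described above can be illustrated with a minimal Python sketch. It is not how any particular MOLAP product stores its data; the fact rows, the `ALL` marker, and the `build_cube` helper are hypothetical names chosen for illustration. Each fact row is rolled up into every combination of its dimension values, so lookups are fast but the number of stored cells grows much faster than the number of rows.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (region, product, year, sales).
facts = [
    ("East", "Widget", 2023, 100),
    ("East", "Gadget", 2023, 50),
    ("West", "Widget", 2023, 70),
    ("West", "Widget", 2024, 30),
]

ALL = "*"  # marker meaning "aggregated over every value of this dimension"


def build_cube(rows):
    """Pre-calculate SUM(sales) for every combination of dimension values."""
    cube = defaultdict(int)
    for *dims, sales in rows:
        # With n dimensions, each row contributes to 2**n cells: every
        # dimension is either kept at its own value or rolled up to ALL.
        for key in product(*[(d, ALL) for d in dims]):
            cube[key] += sales
    return dict(cube)


cube = build_cube(facts)

# Slicing is now a dictionary lookup; the fact rows are not scanned again.
print(cube[("East", ALL, ALL)])      # total sales for the East region
print(cube[(ALL, "Widget", 2023)])   # Widget sales in 2023, all regions
print(len(cube), "cells stored for", len(facts), "fact rows")
```

Even in this toy example the cube holds several times more cells than there are fact rows, and the whole cube must be rebuilt or updated whenever new data arrives.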

In the ROLAP approach, data is not pre-computed. Instead, it is stored in relational tables that use a star (or snowflake) schema to model multi-dimensional data.

/public/images/star_schema.png

Analysis tools then query these tables using a query language such as SQL, and aggregate results are calculated on the fly for the requested measures and dimensions. The benefits of this approach are that it scales better to large data volumes and does not require off-loading the data into another storage system for analysis. But ROLAP engines are usually slower, because the data is not aggregated ahead of time. To offset this performance hit, it is often necessary to build separate summary tables that contain pre-aggregated data.
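As a rough illustration of the ROLAP approach, the following sketch uses Python's built-in `sqlite3` module as a stand-in for a data warehouse. The table and column names (`fact_sales`, `dim_product`, `dim_date`, `agg_sales_by_category_year`) are hypothetical. It runs the same question twice: once aggregated on the fly from the star schema, and once from a hand-built summary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny star schema: one fact table that references two dimension tables.
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_date    VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 11, 40.0),
                                   (2, 10, 25.0),  (2, 10, 35.0);
""")

AGGREGATE_QUERY = """
    SELECT p.category AS category, d.year AS year, SUM(f.amount) AS amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.year
"""

# ROLAP: the aggregate is computed at query time by joining and grouping.
on_the_fly = conn.execute(
    AGGREGATE_QUERY + " ORDER BY category, year"
).fetchall()

# Summary table: the same result is materialized once and then reused.
conn.execute("CREATE TABLE agg_sales_by_category_year AS " + AGGREGATE_QUERY)
from_summary = conn.execute(
    "SELECT category, year, amount FROM agg_sales_by_category_year "
    "ORDER BY category, year"
).fetchall()

print(on_the_fly)
print(from_summary)
```

Both queries return the same rows. The difference is that the second one reads a small pre-aggregated table instead of joining and scanning the fact table, which is where the performance gain comes from at warehouse scale.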

What is an AtScale Virtual Model?

AtScale virtual models are a hybrid of the MOLAP and ROLAP approaches. The AtScale platform combines the scalability of ROLAP with the ease-of-use of MOLAP-like data modeling.

Logically, the model looks like a MOLAP model to the business intelligence applications. The MOLAP model is easy for BI users to understand, because the data is presented as a simple list of measures and dimensions that can be used to build reports.

However, AtScale does not build MOLAP models. Instead, it overlays a virtual ROLAP schema on top of the datasets stored in the connected data warehouse.
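The idea of a metadata layer that presents measures and dimensions while leaving the data in relational tables can be sketched as follows. This is not AtScale's model format or API: the `MODEL` dictionary, its keys, and the `to_sql` helper are hypothetical and exist only to show how a request for named measures and dimensions could be translated into SQL at query time.

```python
# Hypothetical model metadata: business-friendly names mapped to the
# physical tables and columns of a star schema. No data is stored here.
MODEL = {
    "fact_table": "fact_sales",
    "measures": {"Total Sales": "SUM(f.amount)"},
    "dimensions": {
        "Category": {"table": "dim_product", "alias": "p",
                     "key": "product_id", "column": "category"},
        "Year": {"table": "dim_date", "alias": "d",
                 "key": "date_id", "column": "year"},
    },
}


def to_sql(model, measures, dimensions):
    """Translate a request for measures and dimensions into one SQL query."""
    select, joins, group_by = [], [], []
    for name in dimensions:
        dim = model["dimensions"][name]
        column = dim["alias"] + "." + dim["column"]
        select.append(column + ' AS "' + name + '"')
        group_by.append(column)
        joins.append(
            "JOIN {table} {alias} ON {alias}.{key} = f.{key}".format(**dim)
        )
    for name in measures:
        select.append(model["measures"][name] + ' AS "' + name + '"')

    sql = "SELECT " + ", ".join(select) + " FROM " + model["fact_table"] + " f"
    if joins:
        sql += " " + " ".join(joins)
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


# The BI user picks "Total Sales" by "Category"; the SQL is generated.
print(to_sql(MODEL, ["Total Sales"], ["Category"]))
```

The user only ever sees the names "Total Sales", "Category", and "Year", while the generated query runs against the underlying relational tables.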

How AtScale Optimizes OLAP Queries

BI applications send their queries to a model hosted on the AtScale engine. The engine uses the model's metadata to interpret the SQL queries sent by the BI tools, optimize them for the best performance, and then execute them directly against the corresponding data warehouse.

AtScale's cost-based query planner and optimizer dynamically builds and maintains aggregates (summary tables) based on the queries issued by BI users. Once aggregates exist, future queries can run against the aggregated data instead of the raw data, dramatically improving query performance.
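The general idea of routing a query to an aggregate instead of the raw data can be sketched as below. This is an illustrative assumption, not AtScale's actual planner: the `Aggregate` class, the `choose_source` function, and the use of row count as the only cost measure are all hypothetical. The sketch also assumes additive measures such as SUM, which can be rolled up further from a finer-grained aggregate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical description of one summary table."""
    table: str
    dimensions: frozenset
    measures: frozenset
    row_count: int


def choose_source(query_dims, query_measures, aggregates, raw_table="fact_sales"):
    """Pick the smallest aggregate that can answer the query, else the raw table.

    An aggregate qualifies when it contains every requested dimension and
    measure. Row count stands in for a real cost estimate.
    """
    candidates = [
        agg for agg in aggregates
        if set(query_dims) <= agg.dimensions
        and set(query_measures) <= agg.measures
    ]
    if not candidates:
        return raw_table
    return min(candidates, key=lambda agg: agg.row_count).table


aggregates = [
    Aggregate("agg_category_year", frozenset({"category", "year"}),
              frozenset({"amount"}), row_count=500),
    Aggregate("agg_category", frozenset({"category"}),
              frozenset({"amount"}), row_count=20),
]

print(choose_source(["category"], ["amount"], aggregates))  # smallest match
print(choose_source(["year"], ["amount"], aggregates))      # coarser rollup
print(choose_source(["region"], ["amount"], aggregates))    # no match
```

A query that no aggregate can answer falls back to the raw table. In a system that builds aggregates from observed queries, such a miss would be a natural signal that a new aggregate may be worth creating.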