Datasets

Dataset files define datasets to use in the repository. Each dataset file in your repository must correspond to either a physical table/view in your database, or the results of a SELECT statement.

Note: Dataset files must define all columns in the physical tables they reference, and can therefore be quite large. Because of this, AtScale recommends sharing these files across repositories.

Dataset files support the following properties.

unique_name

Type: string
Required: Y

The unique name of the dataset. This must be unique across all repositories and subrepositories.

object_type

Type: const
Required: Y

The type of object defined by the file. For datasets, the value of this property must be dataset.

label

Type: string
Required: Y

The name of the dataset, as it appears in AtScale. This value does not need to be unique.

connection_id

Type: string
Required: Y

The unique_name of the connection object that defines the database and schema in which the dataset is stored.

sql

Type: string
Required: Required if table is not defined

A SQL query used to pull data from a specific connection defined within the repository, similar to a database view. This determines whether the dataset file defines a query dataset.

table

Type: string
Required: Required if sql is not defined

The name of the table in the database that the dataset is based on.

columns

Type: array
Required: Y

Defines the columns available in the dataset.

Note: You should define all columns available in the dataset. This is especially important for dataset files that are shared across multiple repositories.

The columns property within a dataset file supports the following properties.

name

Type: string
Required: Y

The name of the column.

data_type

Type: string
Required: Required unless this column is a map

The data type of the values within the column.

Supported values:

string
int
long
bigint
tinyint
float
double
decimal
decimal(x,y)
number
number(x,y)
numeric(x,y)
boolean
date
datetime
timestamp

sql

Type: string
Required: N

Defines the column as a calculated column.

Calculated columns enable you to add simple data transformations to the dataset. These can be used as the basis of model attributes, just like any other dataset column. For more information, see Adding Calculated Columns To Datasets For Simple Data Transformations.

The value of this property should be a valid SQL statement that can be run as part of the SELECT list of a query.

The SQL statement is passed directly to the underlying database when the query runs, so it must be in a syntax that is supported by your chosen engine. If you want to run the query on other types of databases, use the dialects property to define additional dialects for it to run in.

dialects

Type: array
Required: N

Defines alternate dialects for the sql statement so that it can run on other types of databases. You can define as many alternate dialects as needed.

Supported properties:

dialect: String, required. An alternate SQL dialect.
Supported values:
- Snowflake
- Postgresql
- DatabricksSQL
- BigQuery
- Iris
sql: String, required. The alternate SQL statement.

map

Type: object
Required: N

Defines a map to use to create a calculated column.

Supported properties:

field_terminator: String, required. The delimiter used to separate the key:value pairs. This must be in quotes (").
key_terminator: String, required. The delimiter used to separate the individual keys from their values. This must be in quotes (").
key_type: String, required. The data type of the map's keys.
value_type: String, required. The data type of the map's values.

The mapped columns are defined as separate columns within the dataset file. Each of these must have the parent_column property.

For more information on maps in AtScale, see Extract Values From A Map.

parent_column

Type: string
Required: Required for mapped columns

For mapped columns only. Specifies the map column used to create this column.

description

Type: string
Required: N

A description of the dataset.

incremental

Type: object
Required: N

Enables incremental builds for the dataset. When the engine performs an incremental rebuild of an aggregate table, it appends new rows and updates existing rows that fall within a specified period of time (called the grace period). For more information, see About Incremental Rebuilds.

The incremental property supports the following properties.

column

Type: string
Required: Y

The name of the dataset column to use as an incremental indicator. This column must have values that increase monotonically, such as a numeric UNIX timestamp showing seconds since epoch, or a Timestamp/DateTime. The values in this column enable the AtScale engine both to append rows to an aggregate table and update rows during an incremental rebuild.

The values of this column should be of one of the following data types: Long, Integer, Decimal (38,0) (Snowflake only), Timestamp, DateTime.

If you do not have a column that meets this criteria, you may need to create a calculated column to transform string or datetime values into the right data type.

grace_period

Type: string
Required: Y

When the AtScale engine starts an incremental build, the grace_period determines how far back in time the engine looks for updates to rows in the dataset; for example, one week or 15 days.

The value of this property should be an integer, followed by the time unit. The time unit can be any of the following: s (second), m (minute), h (hour), d (day), w (week).

For example, setting the value to '100s' sets the grace period to 100 seconds. Setting it to '1w' sets the grace period to one week.

immutable

Type: boolean
Required: N

Determines whether the dataset changes often or not. The AtScale engine uses this information when running incremental builds of aggregates that use joins on dimensions that do not change often.