Create and Load Parquet Data Sources
As part of the Analysis Model customization process, new Parquet Data Sources can be added to the model being customized. To do so, create a Parquet Data Source based on Information Sources or other views, and then perform an explicit load.
Parquet Data Sources are loaded into the ADLS Gen 2 folder.
Follow the end-to-end steps below to create a Parquet Data Source.
1. Navigate to the Parquet Data Sources page.
2. Complete the assistant for creating a new Parquet Data Source.
Selecting the Add New Data Source action button launches the assistant for creating new Parquet Data Sources.
The New Data Source creation assistant consists of two steps:
i. Parquet Data Source Details tab.
ii. Column Selection tab.
Data Properties:
Property | Description |
---|---|
Source Selection - Data Source Details | Selects the type of origin when creating a Parquet Data Source. The drop-down shows two source origins: Information Sources (all Facts and Dimensions) and Any (all Information Sources and other available views). Source origin selection is mandatory. Once set, the selection can be changed if required, within the same step. |
Source Name - Data Source Details | Lists the available sources, filtered by the Source Selection. When Source Selection = Any, all Information Sources and other views that are available for Parquet Data Source creation are listed. When Source Selection = Information Sources, the Information Sources that are available for Parquet Data Source creation are listed. In both cases the selection is made from the drop-down. All Information Sources are based on the Oracle source _OL view. Source Name is a mandatory selection. Once set, the selection can be changed if required before loading. |
Source Type - Data Source Details | This property is auto-populated based on the Name property. When the source origin is an Information Source, the value (Dimension/Fact) is populated from the metadata and cannot be changed. When Any is selected, the Source Type defaults to Dimension and can be changed to another value from the list (Dimension/Fact/Other). |
Name | This property is populated based on the Source Name selection. Name determines the name of the folder in ADLS Gen 2 where the .parquet file will be created. Name stays in sync with the Source Name unless it is manually modified. |
Parquet file name template - Data Source Details | This property is auto-populated based on the Source Name selection. It is the file name, or file name template, of the parquet file that will be created when loading the Parquet Data Source. When Load Type is Incremental, the template contains _{0}, a placeholder that is replaced with the value of the partition column. Full load: [<<Name>>.parquet], for example FACT_ABSENCE_PERIOD.parquet. Incremental load: [<<Name>>_{0}.parquet], for example FACT_ABSENCE_PERIOD_{0}.parquet. |
Parquet file path - Data Source Details | This property is auto-populated based on the Name and the Area selection. It is the data lake file path where the .parquet file will be created, i.e. [<> / <<Name>>], for example HCM/FACT_ABSENCE_PERIOD. |
Area - Data Source Details | Area values are the main functional areas for each Parquet Data Source and Analysis Model. The functional area of a Parquet Data Source specifies the data lake folder into which it is loaded. The drop-down shows all the available Area values. This is a mandatory selection. Once set, the selection can be changed if required before loading. |
Max Age - Data Source Details | The maximum number of minutes that may pass for a Parquet Data Source before the next load is required. Set it by entering a numerical value. A Max Age value is mandatory. Once added, the value can be changed if required. |
Column Selection | Based on the Source selection, all columns of the related source are populated in a dual multi-selection box. Source Type = Fact: all columns are excluded by default. Source Type = Dimension: all columns are included by default. Source Type = Other: all columns are included by default. Columns can be included or excluded as required. |
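The naming rules in the table above can be illustrated with a minimal Python sketch. It only mirrors the documented patterns (`<<Name>>.parquet`, `<<Name>>_{0}.parquet`, and an Area folder followed by the Name). The function names and the partition value `2024` are hypothetical and are not part of the product.

```python
def parquet_file_name(name: str, load_type: str = "Full") -> str:
    """Build the file name (Full) or file name template (Incremental)."""
    if load_type == "Incremental":
        # The doubled braces keep a literal {0} placeholder in the result.
        return f"{name}_{{0}}.parquet"
    return f"{name}.parquet"


def parquet_file_path(area: str, name: str) -> str:
    """Build the data lake path from the Area and the Name."""
    return f"{area}/{name}"


def resolve_file_name(template: str, partition_value: str) -> str:
    """Fill the {0} placeholder with a partition column value (illustrative)."""
    return template.format(partition_value)


template = parquet_file_name("FACT_ABSENCE_PERIOD", "Incremental")
print(template)                              # FACT_ABSENCE_PERIOD_{0}.parquet
print(resolve_file_name(template, "2024"))   # FACT_ABSENCE_PERIOD_2024.parquet
print(parquet_file_path("HCM", "FACT_ABSENCE_PERIOD"))  # HCM/FACT_ABSENCE_PERIOD
```

How the product formats the partition value inside the file name is not stated here, so the substitution step should be read as an assumption.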
How to Create a Parquet Data Source
1. First, select the Source origin.
2. Select the Source Name. The desired source can be searched for and selected.
Selecting the Source Name where Source selection = Any (the drop-down list displays all the available tables and views)
When Source selection = Any, the default Source Type value is Dimension. The drop-down list provides three options: Dimension, Fact, and Other.
Selecting the Source Name where Source selection = Information Sources (the drop-down list displays all the available Dimensions and Facts)
3. The Name can be changed if needed.
Note
The modified Name value cannot be edited further during the Parquet Data Source Edit process.
4. Select an Area from the drop-down list of values.
The Parquet file path is populated after the Area value is selected.
5. The Load Type field is populated based on the Source Type. The default value is Full.
If a Dimension is selected, the Load Type field cannot be changed. If a Fact is selected, the default value (Full) can be changed to Incremental to enable Incremental Refresh.
6. Type in a Max Age value.
7. Move to the next step to select the columns. The default column selection depends on the Source Type.
7.1 Column selection in a Dimension (Load Type Full by default) - By default, all available columns of the selected source are included. Columns can be opted out as desired.
7.2 Column selection in a Fact - Load Type Full - By default, all available columns of the selected source are excluded. Columns can be opted in as desired.
7.3 Column selection in a Fact - Load Type Incremental - In addition to the column selection, the Incremental Load Type enables selecting the columns that facilitate Incremental Refresh.
Note
Column selection for Source Type = Other behaves similarly to how Dimensions behave.
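The default column states described in steps 7.1 and 7.2 and in the note above can be summarized in a short Python sketch. It is only an illustration of the documented defaults. The function name and the sample column names are hypothetical, and the assistant itself is operated through the user interface, not through code.

```python
def default_column_selection(source_type: str, columns: list[str]) -> dict[str, list[str]]:
    """Return the default included/excluded columns for a Source Type."""
    if source_type == "Fact":
        # Facts start with every column excluded; columns are opted in.
        return {"included": [], "excluded": list(columns)}
    if source_type in ("Dimension", "Other"):
        # Dimensions and Other start with every column included; columns are opted out.
        return {"included": list(columns), "excluded": []}
    raise ValueError(f"Unknown source type: {source_type}")


# Hypothetical column names, used only for illustration.
sample_columns = ["ID", "COMPANY", "LAST_MODIFIED_DATE"]
print(default_column_selection("Fact", sample_columns))
print(default_column_selection("Dimension", sample_columns))
```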
Data Properties:
Property | Description |
---|---|
Partition Column | The name of the column on which the data should be partitioned. The ‘order by’ for the query is added automatically when the Partition Column is set. Change detection is done per partition: the data is first queried using a group by on the Partition Column, taking, for example, the MAX (Detect Changes Function) value of last_modified_date (Detect Changes Column). Refer to the Incremental Refresh document for further details. |
Detect Changes Column | The name of the column that is used together with the Detect Changes Function to check whether a partition has changed. |
Detect Changes Function | The function applied to the Detect Changes Column to determine whether a partition has changed. The default value is MAX. MAX: returns the maximum value of the Detect Changes Column in the group (grouped by partition). SUM: returns the sum of the Detect Changes Column in the group (grouped by partition). |
Detect Changes Row Count | When Load Type is Incremental, this property indicates whether the count of rows within the partition is included in the detect changes query. If it is included, a change in row count also indicates that a partition has changed. |
To enable Incremental Refresh for a Fact, the Incremental Load detail section has to be configured.
- Partition Column - Select the partition column from the column list of the respective source. All the Parquet Data Source columns are listed in the drop-down list.
- Detect Changes Column - Select the column to be used to detect changes (one that can be aggregated, e.g. Cost, or that is a date, e.g. Reg_date) from the column list of the respective source. All the Parquet Data Source columns are listed in the drop-down list.
- Detect Changes Function - Select the MAX or SUM function used to determine whether the partition has changed.
- Include Row Count - Toggle ON/OFF to include/exclude the row count.
7.4 Column selection in Other type - Load Type Incremental.
Other - By default, all available columns are in an opt-in state. Columns can be opted out as desired.
8. Finally, load the Parquet Data Source.
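For Incremental loads, the per-partition change detection configured through the Partition Column, Detect Changes Column, Detect Changes Function, and row count settings can be pictured with the following Python sketch. It groups rows by the partition column, applies MAX or SUM to the detect changes column, and optionally appends the row count. Comparing the result with values from an earlier load is an assumption about how a changed partition is identified. The function names, the `period` column, and the sample rows are hypothetical.

```python
from collections import defaultdict


def partition_signatures(rows, partition_column, detect_column,
                         function="MAX", include_row_count=False):
    """Compute one signature per partition, like a GROUP BY on the partition column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_column]].append(row[detect_column])

    aggregate = {"MAX": max, "SUM": sum}[function]
    signatures = {}
    for key, values in groups.items():
        signature = (aggregate(values),)
        if include_row_count:
            # A changed row count then also marks the partition as changed.
            signature += (len(values),)
        signatures[key] = signature
    return signatures


def changed_partitions(previous, current):
    """Return partitions whose signature is new or differs from the earlier one."""
    return sorted(key for key in current if previous.get(key) != current[key])


# Hypothetical sample data: ISO date strings compare correctly with max().
before = [
    {"period": "2023", "last_modified_date": "2023-11-02"},
    {"period": "2023", "last_modified_date": "2023-12-15"},
    {"period": "2024", "last_modified_date": "2024-01-10"},
]
after = before + [{"period": "2024", "last_modified_date": "2024-02-01"}]

old = partition_signatures(before, "period", "last_modified_date")
new = partition_signatures(after, "period", "last_modified_date")
print(changed_partitions(old, new))  # ['2024']
```

With MAX alone, a row added with an older date would go unnoticed, which is the case the row count toggle covers. The sketch does not handle partitions that disappear from the source.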