DataFactory Deltalake – Dextrus

The DataFactory Deltalake sink node reads incoming data either from source/transformation nodes and writes data into the new or existing Delta lake table. This sink node provisions to CREATE a new table, APPEND data to an existing table, TRUNCATE removes all records in a table with columns retained, and finally, DROP AND RECREATE deletes the existing table and allows to recreate the same table.

The DataFactory Deltalake node has two ports. One input port and one output port.

Input Port -> Data that needs to be written is connected to the input port of the DataFactory Deltalake node.

Output Port -> The output port is the logical port. Which can be further fed into Analytics nodes if required.

Note: The output port of the DataFactory Deltalake node cannot be connected to any of the Source/Transformation nodes.

Configure the DataFactory Deltalake Node:

DataFactory Deltalake Node can be found in the Sink Palette. The node can also be found through the search box next to Palette.

Drag and drop the DataFactory Deltalake node onto the canvas.
The Configuration option (radio button) is enabled by default.
The Configuration menu consists of Target Details and Description tabs.
- Target Details display Table Information comprising Table Strategy, Catalog, Table Name, Partitioned and Surrogate Key options.
  - Table Strategy: Displays a drop-down list with CREATE, APPEND, TRUNCATE, and DROP AND RECREATE options.
  - Catalog: Displays a drop-down list with a built-in Search Catalog Names search box along with existing available catalog names.
  - Table Name: This option provisions the user to type a table name, if he selects the table strategy as “CREATE” or else it displays a drop-down list if either “DROP AND RECREATE” / “APPEND” / “TRUNCATE” option is selected as table strategy.
  - Partitioned: Partitioned checkbox is optional. Enabling this checkbox will prompt the user to apply a partition condition on columns that were fetched from the target node input port data source.
    - Partitioned Columns: A drop-down list along with a built-in search text box is provided to select required columns.
  - Surrogate Key: Surrogate key checkbox is optional. Enabling this checkbox will prompt the user to provide a Surrogate Key Name field and prompt to select the Surrogate Key Columns from the Catalog Names drop-down list.

Note: Surrogate Key Name is a mandatory field to be filled after enabling the Surrogate Key checkbox.

- - Surrogate Key Name: Type a valid key name.
  - Surrogate Columns: A drop-down list along with a built-in search text box is provided to select required surrogate key columns.

To demonstrate the DataFactory Deltalake sink node functionality, let us create a pipeline using “CSV – Remove Dups – DataFactory Deltalake” nodes. Where the data is read from the CSV data source, removes duplicates from the input datasource using Remove Dups transformation node and finally writes data into DataFactory Deltalake node.

The Target Details tab is highlighted with an orange color underline.

Note: On the right pane of target details the fields “Out Datatype”, “Out Length”, and “Out Precision” can be changed as per your requirement. By default, Out Datatype can be changed at the target from the drop-down list available but Out Length and Out Precision textboxes are only enabled for a few selected Out Datatypes.

In our use case, we changed the Out Datatype of Empid and Project column names to “integer” and “string” datatypes.
Select the “CREATE” option from the Table Strategy drop-down list.
Select the “rdtest” option from the Catalog drop-down list.
Type the table name as “csvinfo”.

Note: In our use case we haven’t enabled Partitioned and Surrogate Key checkboxes.

Before running the created pipeline, the pipeline should be saved. Click the Save button.
A Save Pipeline pop-up window is populated on the screen.
Name the pipeline with a unique name and click the Save button to save the pipeline.
Click the icon to save the pipeline in the desired folder path.

Click the Run button to compile the pipeline.
The moment when the compilation process begins the log window appears at the bottom of the screen with the Log radio button selected.
The status (green/red) of each compiled time log is captured and displayed on the screen.
The number of output records processed at each node is displayed on the top of the node connected in the pipeline.

Note: All seven records are successfully read from the “Company4new.CSV” file data source, passed through the Remove Dups node, where five records are obtained as distinct values, and finally these five records are written to the DataFactory Deltalake sink node with a table name “csvinfo”.

Select the Output radio button or else click the count displayed on the target node to view the final output.
The details of all five output records are displayed in the output section with column names (Empid and Project) on which Remove Dups transformation node was applied.

Description:

Packet Size and Parallelism can be maintained here to achieve better performance.
Annotation can be used to mention brief details of the functionality achieved in the considered node.
Description can be used to provide more details of the node conditions and can also be used to maintain a log or audit trail of all the changes done to the node over some time.
PreSQL and PostSQL query text boxes are available to type SQL queries to get desired results.