The Aggregation node performs aggregate calculations on the incoming source data. It can be applied to one or more columns of a table, using aggregate functions such as Average, Sum, and Min. Use Group By to perform the aggregate calculations on subsets of a table rather than on the table as a whole.
The Aggregation node has two ports: one input port and one output port.
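To illustrate the difference between aggregating a whole table and aggregating with Group By, here is a minimal plain-Python sketch. It is not the node's implementation; the sample rows, the column names `region` and `amount`, and the helper names `aggregate` and `aggregate_by` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"region": "East", "amount": 10.0},
    {"region": "East", "amount": 30.0},
    {"region": "West", "amount": 5.0},
]

def aggregate(records, column):
    """Apply Sum, Min, and Average to one column of the given records."""
    values = [r[column] for r in records]
    return {"sum": sum(values), "min": min(values), "avg": mean(values)}

def aggregate_by(records, group_column, column):
    """Split the records by group_column, then aggregate each subset."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_column]].append(r)
    return {key: aggregate(subset, column) for key, subset in groups.items()}

# Without Group By: one result for the whole table.
whole_table = aggregate(rows, "amount")

# With Group By: one result per distinct value of the grouping column.
per_region = aggregate_by(rows, "region", "amount")
```

Without grouping, the three rows collapse into a single result; with grouping on `region`, each distinct region gets its own result.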
Input Port -> Data that needs to be aggregated is connected to the input port of the Aggregation node.
Output Port -> A single output is generated based on the selected aggregation details. This output is fed to the target node for further processing.
Configure the Aggregation Node:
The Aggregation node can be found in the Transformations Palette, or through the search box next to the Palette.
- Drag and drop the Aggregation node onto the canvas.
- The Configuration option (radio button) is enabled by default.
- The Configuration menu consists of Aggregation Details and Options/Description tabs.
- Aggregation Details
- Column Name: A list of all column names of the selected table is retrieved and displayed.
- Alias: Renames the column. To edit the alias, double-tap the text field beside the column name.
Note: By default, each alias text field displays the column name.
- DataType: Displays the data type of the column.
- Aggregation: A drop-down list of all Aggregation functions is displayed.
Refer to https://spark.apache.org/docs/latest/api/sql/#length to learn more about aggregation functions.
For example, to aggregate order details by product id, apply aggregation functions to the selected columns and group the result as follows:
- Set the Count Aggregation on the Quantity column.
- Set the Max Aggregation on the Discount column.
- Set the node to group by the Productid.
- Select the Sample Output radio button to view the result. In this example, an output of 77 records is generated.
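The configuration in this example corresponds roughly to the computation sketched below in plain Python. This is an illustration under stated assumptions, not the node's actual implementation: the sample rows are invented, the helper `aggregate_orders` is hypothetical, and the output names `Quantity_count` and `Discount_max` stand in for whatever aliases are entered in the Alias field.

```python
from collections import defaultdict

# Hypothetical order-detail rows; only the column names come from the example.
order_details = [
    {"Productid": 1, "Quantity": 12, "Discount": 0.00},
    {"Productid": 1, "Quantity": 5, "Discount": 0.15},
    {"Productid": 2, "Quantity": 40, "Discount": 0.05},
]

def aggregate_orders(rows):
    """Count Quantity and take the maximum Discount for each Productid."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Productid"]].append(row)

    result = []
    for product_id, members in sorted(groups.items()):
        result.append({
            "Productid": product_id,
            # Count: number of non-null Quantity values in the group.
            "Quantity_count": sum(1 for m in members if m["Quantity"] is not None),
            # Max: largest Discount value in the group.
            "Discount_max": max(m["Discount"] for m in members),
        })
    return result

output = aggregate_orders(order_details)
```

The result contains one row per distinct Productid, so the number of output records depends on how many different product ids appear in the input.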
Options / Description:
- Packet Size and Parallelism can be set here to achieve better performance.
- Annotation can be used to briefly note the functionality achieved in the Aggregation node.
- Description can be used to provide more details of the aggregation settings and can also serve as a log or audit trail of the changes made to them over time.