The Rank node is an active transformation technique to generate ranks based on a ‘Partition By’ and ‘Order By’ criteria. To group the records, we use a partition on a selected column name (field), and then to generate a ranking system on the partitioned fields, we use order by on another column name (field).
Pre Filter and Post Filter options (optional) can be used to improvise the incoming data accurately and obtain results as desired. Pre and Post Filters are used to create filter conditions before and after rank columns generation.
Note: The newly created rank columns can be used to create a Post Filter condition.
The rank node has two ports. One input port and one output port.
Input Port -> Data that needs to be processed is connected to the input port of the rank node.
Output Port -> One output port is populated with updated table data (based on Partition and Order) is passed through the output port for further processing in the downstream nodes of the pipeline.
Configure the Rank Node:
Rank Node can be found in the Transformations Palette. The node can also be found through the search box next to Palette.
- Drag and drop the rank node onto the canvas.
- The Configuration option (radio button) is enabled by default.
- The Configuration menu consists of Rank Details, Pre Filter, Post Filter, and Options/Description.
- Rank Details: The rank details include Column Name, Alias, and Datatype.
-
-
- Column Name: A list of all column names of the selected table is retrieved and displayed.
- Alias: Alias provisions to rename the column name and this can be done by double-tapping on the text field beside the column name.
- DataType: Displays the data type of the column name.
-
Note: Pre-populated Column names cannot be edited by clicking icon, but alternatively the column name can be changed by the aliasing option.
- Plus icon: Click the
icon to create a new Column Name by selecting Rank, Input Fields, ORDER, and PARTITION.
- Click the
icon to edit the newly created Column Name, RANK, Input Fields, ORDER, PARTITION, and Expression.
- Delete icon: Click the
to delete the newly created column name.
Note: The Expression text field is automatically populated based on the Input Fields selected in ORDER and PARTITION sections.
The Rank Column pop-up window is displayed by clicking Plus icon.
RANK
The ranking is awarded on the newly created column name (field), based on the ORDER and PARTITION selected column names. The same rank is awarded for records whose values are identical. This in turn produces gaps in the ranking sequence.
DENSE RANK
The dense ranking is also awarded on the newly created column name (field), based on the ORDER and PARTITION selected column names. The same rank is awarded for records whose values are identical and the next rank is awarded for the following records without producing any gaps in the ranking sequence.
PERCENT RANK
The percent rank awards the percentage ranking on the newly created column name (field) records.
NTILE
The concept of Bucketing helps to evenly divide records and push them into the specified number of buckets. Every bucket is numbered starting from one.
For more information about RANK functions, refer to https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html.
For example, let us take an ORDERS table (obtained from Source node), where the list of all column names appears under the Rank Details tab.
- Click the
icon to create a new Column name.
- Enter a valid Column name of your choice in the Column Name text field.
- Select the Rank option from the Rank drop-down list.
- All the column names of the selected table appear under the Input Fields section.
- Drag and drop the column names into ORDER and PARTITION sections.
- Click the
icon to remove selected columns from the ORDER and PARTITION sections.
- The column names selected under ORDER and PARTITION sections automatically populate has an expression under the Expression section.
- Click the
icon to save the newly created Column name or else click the
icon to close the window.
- The newly created column name appears under the Rank Details tab along with pre-populated column names.
- Click the Run button to see the output.
- The output (records) is sorted based on the rank criteria.
- The ORDER_RANK output in a CSV file is displayed below.
Pre/Post Filter
The Pre/Post filters help to narrow down the results accurately. The configuration screen for Pre/Post filter is the same. The selected filter is highlighted with an orange color underline.
Note: A Post Filter condition can be only declared after creating a new column name (field). The newly created column name can be seen under the Input Fields section.
-
-
- Input Fields: The input fields consist of column IDs that are populated from incoming source node data.
-
-
-
- Operator: A list of all logical operators is displayed horizontally, where the user can choose any of them to apply an operator on the input field selected.
-
Note: Input operator allows to type value/string on the expression canvas.
-
-
- Functions: A list of functions is displayed. The usage of any function helps the user to refine the resultant data accurately. The Spark SQL functions are categorized under String, Numeric, Date&Time, and Custom functions.
-
For more information about Spark SQL functions, refer to Apache Spark SQL Functions documentation.
The Pre/Post filter details can also be defined using the expression editor. There are two modes in the expression editor:
- Designer Mode: This mode allows you to drag and drop/ double-click on the specific input fields, operator, and functions onto the expression canvas to create a Where clause condition.
- Expert Mode: This mode allows you to type in the queries in the query space.
Note: A toggle switch is available to switch from one mode to the other mode.
The Eraser option enables the user to clear the split conditions defined in the expression editor.
- Dragging and dropping the required Input fields, Operator, and Functions will define a filter condition. Alternatively, expression can also be typed in the Expression section.
Options / Description:
- Packet Size and Parallelism can be maintained here to achieve better performance.
- Annotation can be used to mention brief details of the functionality achieved in the filter node.
- Description can be used to provide more details of the filter conditions and can also be used to maintain a log or audit trail of all the changes done to the filter conditions over some time.