New Features and Bug Fixes in Version 1.4
Streaming Pipeline:
- AWS S3 as a source: Ability to stream new/existing files from an S3 bucket
- Azure as a source: Ability to stream new/existing files from Azure Blob Storage
- Architecture changes to the FTP/SFTP stage location: Replaced file-based staging with a Kafka staging topic to eliminate I/O issues and improve performance
- Ability to run the RDBMS source as multiple tasks to achieve parallel processing (cluster only)
- REST API as a source: Connects to streaming REST-enabled sources (e.g. Twitter, Facebook)
Bulk Streaming Pipeline:
- Log-based and query-based replication, as designed in the new UI, is implemented
- Enabled generic JDBC as a target to connect to various databases
- Enabled query-based replication in BSP
Batch Pipeline:
- REST API as a source (first version)
Users can consume a REST API as a source in the batch pipeline, using the GET and POST methods. This release supports JSON responses. The following REST API capabilities are enabled:
- Request Parameters
- Headers
- Pre-request API
- Request Body
- Pagination
- Exit criteria
JSON responses are flattened to the first level only. If the response JSON has multi-level nesting, flattening is applied up to the first level and the deeper nested objects are treated as strings.
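The first-level flattening described above can be illustrated with a minimal sketch. It is not the DataFactory implementation: the function name `flatten_first_level` and the `parent_child` column naming are hypothetical, and it assumes one reading of the rule, namely that a nested object is expanded one level and anything nested deeper is kept as a JSON string.

```python
import json


def flatten_first_level(record):
    """Expand nested objects one level; keep deeper values as JSON strings."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # First level of nesting: promote each child to its own column.
            for sub_key, sub_value in value.items():
                column = f"{key}_{sub_key}"  # hypothetical naming scheme
                if isinstance(sub_value, (dict, list)):
                    # Deeper nesting is not flattened; store it as a string.
                    flat[column] = json.dumps(sub_value)
                else:
                    flat[column] = sub_value
        elif isinstance(value, list):
            # Arrays are kept as strings in this sketch.
            flat[key] = json.dumps(value)
        else:
            flat[key] = value
    return flat


response = {"id": 1, "user": {"name": "a", "address": {"city": "x"}}}
print(flatten_first_level(response))
```

In this sketch `user.name` becomes its own column, while `user.address` stays a single string column because it sits below the first level.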
- SAP Lean Version
Work on the SAP lean version started in 1.4. Some groundwork has been done, but it is not yet in a demo-ready state.
We are targeting an initial, demo-ready version in the next sprint.
- CI/CD flow
The CI/CD flow enables a review and approval process for assets created in DataFactory. It allows assets in DataFactory to be versioned along the lines of a Git workflow.
In 1.4, the CI/CD flow has been started for the batch pipeline, and the following capabilities are complete:
1. Created all the UI screens
2. Accept review request flow
3. Review flow with the ability to Publish and Reject comments
- Metadata propagation
When building a pipeline, the user works with multiple nodes, and metadata is carried from each node to the next. With this enhancement, users can go back to earlier nodes and change the metadata. The effect of a metadata update on the subsequent steps is detected, and those steps are updated where possible. This is not fully automated: some updates are applied automatically, while others still need user intervention.
- Enhancements to file-based cloud widgets suggested in the UAT session
Ability to load data to single or multiple files in targets
Ability to ingest all files from a specific folder, provided that the schema matches for all files
Ability to load data into the target in a compressed format
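The folder-ingestion condition above (all files must share one schema) can be sketched as a simple header check. This is only an illustration under stated assumptions: it treats the CSV header row as the schema, and the function name `files_with_matching_schema` is hypothetical rather than part of DataFactory.

```python
import csv
from pathlib import Path


def files_with_matching_schema(folder, pattern="*.csv"):
    """Return the files in `folder` if all share one header row, else raise."""
    paths = sorted(Path(folder).glob(pattern))
    expected = None
    for path in paths:
        with path.open(newline="") as handle:
            # Assumption: the first row of each file is its schema.
            header = next(csv.reader(handle), [])
        if expected is None:
            expected = header
        elif header != expected:
            raise ValueError(
                f"{path.name}: header {header} does not match {expected}"
            )
    return paths
```

Rejecting the whole folder on the first mismatch is one possible design choice; skipping the mismatched file and reporting it would be another.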
- Licensing module for AWS AMI
To publish the DataFactory AWS AMI, a very lean version of the license module has been started. This capability restricts trial access to the DataFactory AWS AMI to 15 days. Users are notified on the DataFactory UI of the time left in the trial period. Once the trial period ends, the user can no longer access DataFactory.
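The trial rule above can be sketched as follows. This is an illustrative sketch, not the license module itself: the names `trial_days_left` and `can_access` are hypothetical, and it assumes the trial is counted in whole calendar days from a known start date.

```python
from datetime import date, timedelta

TRIAL_DAYS = 15  # trial length stated in these release notes


def trial_days_left(start, today):
    """Days remaining in the trial, never below zero."""
    end = start + timedelta(days=TRIAL_DAYS)
    return max((end - today).days, 0)


def can_access(start, today):
    """Access is allowed only while trial days remain."""
    return trial_days_left(start, today) > 0


start = date(2024, 1, 1)  # hypothetical trial start date
print(trial_days_left(start, date(2024, 1, 10)))
print(can_access(start, date(2024, 1, 16)))
```

The value returned by `trial_days_left` is what a UI notification could display, and `can_access` models the lockout once the period ends.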
BDP:
- Enhancements suggested by the leadership team from the UAT sessions have been implemented
- Delta Lake is implemented as a target for loading data from various data sources.
Jobs:
- Issues from AWS AMI and client: Bugs reported from Mingledorff's production system have been fixed
Data Wrangler:
- Scaling
Added multiple scaling techniques: standard scaler, min-max scaler, and robust scaler
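For reference, the three scaling techniques can be sketched with their common textbook definitions. This is a minimal standard-library sketch, not the Data Wrangler implementation: the function names are hypothetical, and the exact variants used by the product (for example, population versus sample standard deviation, or the quantile method) are assumptions.

```python
import statistics


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return [(v - mean) / std for v in values]


def minmax_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero
    return [(v - lo) / span for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid dividing by zero
    return [(v - median) / iqr for v in values]


data = [1, 2, 3, 4, 5]
print(minmax_scale(data))
print(robust_scale(data))
```

Robust scaling relies on the median and quartiles, so it is less affected by outliers than the other two techniques.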
- Replace cells
Replacing multiple cells with custom inputs is implemented
- Data Quality Score
After all the recipe steps have been executed, data quality metrics for the entire dataset are provided, with details such as:
- Valid values
- Empty or null values
- Invalid values
- Mismatched datatype
- Mismatched domain
- Data quality details at the Column level
Data quality metrics at the individual column level are implemented
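The column-level metrics above can be illustrated with a small sketch. These notes do not define how the data quality score is calculated, so everything here is an assumption made for illustration: the function name `column_quality`, treating "invalid" as the sum of datatype and domain mismatches, and scoring as the percentage of valid values.

```python
def column_quality(values, expected_type, domain=None):
    """Count quality categories for one column and derive a simple score."""
    counts = {
        "valid": 0,
        "empty": 0,
        "mismatched_datatype": 0,
        "mismatched_domain": 0,
    }
    for value in values:
        if value is None or value == "":
            counts["empty"] += 1
            continue
        try:
            typed = expected_type(value)  # e.g. int, float
        except (TypeError, ValueError):
            counts["mismatched_datatype"] += 1
            continue
        if domain is not None and not domain(typed):
            counts["mismatched_domain"] += 1
        else:
            counts["valid"] += 1
    # Assumption: invalid = datatype mismatches + domain mismatches.
    counts["invalid"] = counts["mismatched_datatype"] + counts["mismatched_domain"]
    total = len(values)
    # Assumption: score = share of valid values, as a percentage.
    counts["score"] = round(100 * counts["valid"] / total, 1) if total else 0.0
    return counts


ages = ["34", "", "abc", "-5", "41", None]
print(column_quality(ages, int, domain=lambda a: 0 <= a <= 120))
```

A dataset-level figure could then be obtained by aggregating the per-column results, for example by averaging the column scores.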
- Run session
Session history details such as job ID, input and output record counts, and data quality score are implemented
- Column details in the run recipe
Column details of the transformed dataset are implemented
AI-ML:
- Analyze the data page
Details such as correlation graphs, numerical statistics, categorical statistics, histograms, box plots, and scatter plots are enabled
- The target column stays pinned in the Preview Data screen after prediction