New Features and Bug Fixes in Version 1.4

Streaming Pipeline:

AWS S3 as a source: Ability to stream new/existing files from an S3 bucket
Azure as a source: Ability to stream new/existing files from Azure Blob Storage
Architecture changes to the FTP/SFTP stage location: File staging now goes through a Kafka staging topic instead of the file system, to eliminate I/O issues and improve performance
Ability to run the RDBMS source in multiple tasks to achieve parallel processing (cluster only)
REST API as a source: Ability to connect to streaming REST-enabled sources (for example, Twitter and Facebook)

Bulk Streaming Pipeline:

Log-based and query-based replication, as designed in the new UI, is implemented
Enabled generic JDBC as a target to connect to various databases
Enabled query-based replication in BSP

Batch Pipeline:

REST API as a source -- the first version

Users can consume a REST API as a source in the batch pipeline, using the GET and POST methods. This release supports JSON responses. The following REST API capabilities are enabled:

Request Parameters
Headers
Pre-request API
Request Body
Pagination
Exit criteria
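
As an illustration of how pagination and exit criteria fit together, here is a minimal sketch. It is not DataFactory's implementation: fetch_all, fake_fetch_page, PAGES and the "stop on an empty page" rule are all hypothetical, and the page fetcher is a stand-in so that no real endpoint is needed.

```python
# Hypothetical canned responses standing in for a paginated REST endpoint.
PAGES = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}


def fake_fetch_page(page):
    """Stand-in for a GET/POST call; returns the records of one page."""
    return PAGES.get(page, [])


def fetch_all(fetch_page, max_pages=100):
    """Collect records page by page until an exit criterion is met.

    Assumed exit criteria: an empty page, or max_pages reached.
    """
    records = []
    page = 1
    while page <= max_pages:
        batch = fetch_page(page)
        if not batch:  # exit criterion: the source has no more data
            break
        records.extend(batch)
        page += 1
    return records


all_records = fetch_all(fake_fetch_page)
```

The max_pages cap is a safety net, so that a source that never returns an empty page cannot keep the job running forever.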

When flattening the JSON response, only the first level is supported. If the response JSON has multi-level nesting, flattening is applied to the first level and the objects nested below it are treated as strings.
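
The sketch below shows one possible reading of first-level flattening: an object nested directly under a top-level key is expanded into columns, and anything nested deeper is kept as a JSON string. The function name flatten_first_level and the "parent_child" column naming are assumptions made for illustration, not the product's actual behavior.

```python
import json


def flatten_first_level(record):
    """Expand one level of nesting; deeper objects become JSON strings."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # First level of nesting: expand into "parent_child" columns.
            for sub_key, sub_value in value.items():
                column = f"{key}_{sub_key}"
                if isinstance(sub_value, (dict, list)):
                    # Deeper nesting is not flattened; keep it as a string.
                    flat[column] = json.dumps(sub_value)
                else:
                    flat[column] = sub_value
        elif isinstance(value, list):
            # Arrays are also kept as strings in this sketch (assumption).
            flat[key] = json.dumps(value)
        else:
            flat[key] = value
    return flat


row = flatten_first_level(
    {"id": 1, "user": {"name": "a", "address": {"city": "x"}}}
)
```

Here row has the columns id, user_name and user_address, and user_address holds the nested address object as a string.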

SAP Lean Version

Work on the SAP lean version started in 1.4. Some groundwork has been done, but it is not yet in a demo-ready state.
We are targeting an initial version, ready for demo purposes, in the next sprint.

CI/CD flow

The CI/CD flow enables a review and approval process for assets created in DataFactory. It allows assets in DataFactory to be versioned, along the lines of a Git workflow.

As part of 1.4, the CI/CD flow has been initiated for the batch pipeline, and the following capabilities are complete:

1. Created all the UI screens
2. Accept review request flow
3. Review flow with the ability to Publish and Reject comments

Metadata propagation

When building a pipeline, the user works with multiple nodes, and metadata flows from each node to the next. With this enhancement, users can go back to earlier nodes and change their metadata. DataFactory detects the effect of those metadata updates on the subsequent steps and updates them where it can. This is not fully automated: some updates are applied automatically, while others still need user intervention.
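
To make the idea concrete, here is a minimal sketch of detecting which downstream nodes are affected by a metadata change. It assumes that metadata is simply a list of column names and that each downstream node declares the columns it uses. The function find_broken_references and the node names are hypothetical, not DataFactory's API.

```python
def find_broken_references(upstream_columns, downstream_nodes):
    """Return, per downstream node, the columns it uses that no longer exist."""
    available = set(upstream_columns)
    broken = {}
    for node_name, used_columns in downstream_nodes.items():
        missing = sorted(set(used_columns) - available)
        if missing:
            # This node would need an automatic fix or user intervention.
            broken[node_name] = missing
    return broken


# The upstream node renamed "name" to "full_name".
affected = find_broken_references(
    ["id", "full_name"],
    {"filter": ["id"], "join": ["id", "name"]},
)
```

In this example only the join node is reported, because it still refers to the old column name.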

Enhancements to file-based cloud widgets suggested in the UAT session

Ability to load data to single or multiple files in targets
Ability to ingest all files from a specific folder, provided that the schema matches across all files
Ability to load data into the target in a compressed format

Licensing module for AWS AMI

For publishing the DataFactory AWS AMI, a very lean version of the license module has been initiated. With this capability, trial access to the DataFactory AWS AMI is restricted to 15 days. Users are notified in the DataFactory UI of the time left in the trial period. Once the trial period ends, the user can no longer access DataFactory.
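
The trial rule can be sketched as below. Only the 15-day length comes from these notes. The function names, and the choice to count whole calendar days from a start date, are assumptions made for illustration.

```python
from datetime import date, timedelta

TRIAL_DAYS = 15  # trial length stated in these notes


def trial_days_left(start, today):
    """Whole days remaining in the trial, never below zero."""
    end = start + timedelta(days=TRIAL_DAYS)
    return max((end - today).days, 0)


def access_allowed(start, today):
    """Access is allowed only while some trial time remains."""
    return trial_days_left(start, today) > 0


days_left = trial_days_left(date(2024, 1, 1), date(2024, 1, 11))
```

A UI banner could show the value returned by trial_days_left, while access_allowed decides whether the user can still get in.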

BDP:

Enhancements suggested by the leadership team from the UAT sessions have been implemented
Delta Lake is implemented as a target for loading data from various data sources

Jobs:

Issues from AWS AMI and client: Bugs reported from Mingledorff's production system have been fixed

Data Wrangler:

Scaling

Added multiple scaling techniques: Standard scaler, Min-max scaler, and Robust scaler
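
For reference, the textbook definitions of these three scalers can be sketched in plain Python as below. Whether Data Wrangler uses exactly these formulas (for example population standard deviation, or this quartile method) is an assumption. The function names are illustrative.

```python
import statistics


def standard_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return [(v - mean) / std for v in values]


def minmax_scale(values):
    """Map the minimum to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # constant column: keep everything at 0
    return [(v - low) / span for v in values]


def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0
    return [(v - median) / iqr for v in values]


data = [1, 2, 3, 4, 5]
scaled_minmax = minmax_scale(data)
scaled_robust = robust_scale(data)
scaled_standard = standard_scale(data)
```

The robust variant relies on the median and quartiles, so a single extreme value distorts it less than it distorts the other two.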

Replace cells

Replacing multiple cells with custom inputs is implemented

Data Quality Score

After all the recipe steps have been executed, data quality metrics for the entire dataset are provided, with details such as:

Valid values
Empty or null values
Invalid values
Mismatched datatype
Mismatched domain
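
A sketch of how such metrics could be computed for one column is shown below. The classification rules, the treatment of "invalid" as the sum of the two mismatch categories, and the score defined as valid values divided by total values are all assumptions. These notes do not define the actual formula.

```python
def column_quality(values, expected_type, domain=None):
    """Count value categories for one column and derive a simple score."""
    counts = {
        "valid": 0,
        "empty": 0,
        "mismatched_datatype": 0,
        "mismatched_domain": 0,
    }
    for value in values:
        if value is None or value == "":
            counts["empty"] += 1
        elif not isinstance(value, expected_type):
            counts["mismatched_datatype"] += 1
        elif domain is not None and value not in domain:
            counts["mismatched_domain"] += 1
        else:
            counts["valid"] += 1
    # Assumption: "invalid" covers both kinds of mismatch.
    counts["invalid"] = (
        counts["mismatched_datatype"] + counts["mismatched_domain"]
    )
    total = len(values)
    counts["score"] = counts["valid"] / total if total else 0.0
    return counts


report = column_quality([1, None, "x", 7, 2], int, domain={1, 2, 3})
```

In this example two of the five values are valid, so the hypothetical score is 0.4.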

Data quality details at the column level

Data quality metrics at the individual column level are implemented

Run session

Session history details such as job ID, number of input and output records, and data quality score are implemented

Column details in the run recipe

Column details of the transformed dataset are implemented

AI-ML:

Analyze the data page

Details such as correlation graphs, numerical statistics, categorical statistics, histograms, box plots, and scatter plots are enabled

The target column stays pinned (sticky) in the preview data screen after the prediction