ETL Step 1: EXT

EXT is the first step of the ETL process. During EXT, the staging table schemas align with the source entity schemas.

Flat files are the input of this step; *_EXT tables are the output.

The EXT step does the following:

  • Loading data from flat files to staging tables

    Data in the flat files created by Extractor Engine is loaded into the staging tables (naming convention: *_SRC and *_DEL_SRC); see the load sketch after this list.

  • Cleaning data with null/duplicated business key

    • The system checks the data type and data length of the fields in the flat files, and also checks that the data fields are not null. Rejected data is loaded into the VALF (validation fail) tables (naming convention: *_VALF). The VALF tables have the same structure as the staging tables, except that every field's data type is VARCHAR(50000), which ensures that rejected data is not rejected again.
    • The system checks for duplicated data based on the is_bk attribute defined in the source entity. Duplicated data is moved to the VALF tables as well; see the validation sketch after this list.

  • Generating hash codes for records

    The system generates a hash code for each record and stores it in the MD_HASH_CODE field. The system uses the hash code to compare tables and detect data updates; see the hash sketch after this list.

  • Prefilling data from snapshot tables to make sure that records can be joined in the SSI step

    To reduce the load on the PPM database, Extractor Engine extracts only changed data. In the TSNP step, the system puts all loaded data into snapshot tables (naming convention: *_TSNP). In the EXT step, the system pulls data back from the snapshot tables to make sure that all data can be joined in the SSI step; see the prefill sketch after this list.

  • Checking for updated, deleted, and inserted data

    • By comparing data against the snapshot tables (naming convention: *_TSNP), the system tags records with UPD for updates and NEW for inserts.
    • Extractor Engine extracts the business key fields of all records from the PPM database, and the system compares them with the snapshot tables to find deleted records; see the change-detection sketch after this list.
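
The sketches below illustrate the EXT substeps in SQL. All table, column, and file names (for example, FINANCE_FCOST_1_SRC and COST_ID) are hypothetical placeholders chosen for illustration, not names the product defines. First, the load sketch: copying a flat file into a *_SRC staging table, assuming a Vertica COPY statement.

    -- Hypothetical load: bulk-copy an Extractor Engine flat file into a
    -- *_SRC staging table. Path and delimiter are assumptions.
    COPY FINANCE_FCOST_1_SRC
    FROM LOCAL '/extract/FINANCE_FCOST_1.csv'
    DELIMITER ','
    REJECTED DATA '/extract/FINANCE_FCOST_1.rej';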
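
The validation sketch: a minimal version of the null and duplicate checks that route rejected rows to a *_VALF table, assuming COST_ID is the business key (is_bk) column and COST_NAME is the only other column.

    -- Reject rows whose business key is null.
    INSERT INTO FINANCE_FCOST_1_VALF (COST_ID, COST_NAME)
    SELECT COST_ID, COST_NAME
    FROM FINANCE_FCOST_1_SRC
    WHERE COST_ID IS NULL;

    -- Reject duplicated business keys, keeping the first occurrence.
    INSERT INTO FINANCE_FCOST_1_VALF (COST_ID, COST_NAME)
    SELECT COST_ID, COST_NAME
    FROM (SELECT COST_ID, COST_NAME,
                 ROW_NUMBER() OVER (PARTITION BY COST_ID
                                    ORDER BY COST_ID) AS rn
          FROM FINANCE_FCOST_1_SRC) dup
    WHERE dup.rn > 1;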
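
The hash sketch: the documentation does not name the hash function, so MD5 over a delimiter-joined list of payload fields is only one plausible implementation.

    -- Hash each record's payload fields so later runs can detect updates.
    -- COST_NAME and COST_AMOUNT are hypothetical VARCHAR columns.
    UPDATE FINANCE_FCOST_1_EXT
    SET MD_HASH_CODE = MD5(COALESCE(COST_NAME, '') || '|' ||
                           COALESCE(COST_AMOUNT, ''));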
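
The prefill sketch: an anti-join that copies back snapshot rows whose keys did not arrive in the latest extract, marking them with MD_PF_FLAG = 'P' as described above.

    -- Pull unchanged rows back from the snapshot so the SSI step can join
    -- against the complete data set.
    INSERT INTO FINANCE_FCOST_1_EXT
        (MD_BUSINESS_KEY, COST_NAME, MD_HASH_CODE, MD_PF_FLAG)
    SELECT t.MD_BUSINESS_KEY, t.COST_NAME, t.MD_HASH_CODE, 'P'
    FROM FINANCE_FCOST_1_TSNP t
    WHERE NOT EXISTS (SELECT 1
                      FROM FINANCE_FCOST_1_EXT e
                      WHERE e.MD_BUSINESS_KEY = t.MD_BUSINESS_KEY);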
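
The change-detection sketch: hash and key comparisons against the snapshot and the *_DEL_SRC key extract. The UPD and NEW flag values come from this section; 'DEL' for deletions is an assumption.

    -- NEW: business key not yet present in the snapshot.
    UPDATE FINANCE_FCOST_1_EXT e
    SET MD_FLAG = 'NEW'
    WHERE NOT EXISTS (SELECT 1
                      FROM FINANCE_FCOST_1_TSNP t
                      WHERE t.MD_BUSINESS_KEY = e.MD_BUSINESS_KEY);

    -- UPD: key exists in the snapshot but the hash code differs.
    UPDATE FINANCE_FCOST_1_EXT e
    SET MD_FLAG = 'UPD'
    FROM FINANCE_FCOST_1_TSNP t
    WHERE t.MD_BUSINESS_KEY = e.MD_BUSINESS_KEY
      AND t.MD_HASH_CODE <> e.MD_HASH_CODE;

    -- Deleted: key in the snapshot but absent from the full key extract
    -- ('DEL' is an assumed flag value).
    INSERT INTO FINANCE_FCOST_1_EXT (MD_BUSINESS_KEY, MD_FLAG)
    SELECT t.MD_BUSINESS_KEY, 'DEL'
    FROM FINANCE_FCOST_1_TSNP t
    WHERE NOT EXISTS (SELECT 1
                      FROM FINANCE_FCOST_1_DEL_SRC d
                      WHERE d.MD_BUSINESS_KEY = t.MD_BUSINESS_KEY);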

The following table lists the key staging tables.

Key Staging Tables

  <STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_SRC
    Stores data loaded from flat files for further processing.

  <STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_DEL_SRC
    Contains all business key fields; used for detecting deleted data.

  <STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_TSNP
    Contains all data that has been loaded, including the history of records.

  <STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_VALF
    Stores rejected data.

  <STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_EXT
    Contains data that has been processed in the EXT step and is passed to the SSI step.

Most of the staging tables in this step contain the following fields.

System Reserved Fields in Staging Tables

  MD_BATCH_ID
    Data batch ID. Extractor Engine puts data into flat files and assigns a batch ID to that data. Administrators can use this ID to track the data in the staging tables and the target tables, and to check the Extractor Engine and Flow Engine log files during troubleshooting.

  MD_PROCESS_ID
    ETL process ID. Flow Engine generates the process ID and stores it in the tables. Administrators can use this ID to check the Flow Engine log and the process status in the Derby database.

  MD_SOURCE_INSTANCE_ID
    Source instance ID. The system assigns an ID to a PPM instance when the administrator registers the instance with the system. Vertica for PPM users can use this ID to separate data from different PPM instances.

  MD_FLAG
    Indicates whether the row is new, is updated, or needs to be deleted.

  MD_PF_FLAG
    Indicates whether the row is prefilled. P means the data is prefilled from the snapshot tables; D means the data is loaded from flat files during the latest load.

  MD_BUSINESS_KEY
    Primary key that is mapped to one or more source columns. The system concatenates MD_SOURCE_INSTANCE_ID into MD_BUSINESS_KEY as part of the key; see the sketch after this table.

  MD_HASH_CODE
    Stores the system-generated hash code. The system compares this field with the snapshot tables to detect data updates.
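
As an illustration of the MD_BUSINESS_KEY construction described above, the key could be built by prefixing the mapped source column(s) with the source instance ID; the separator character and column names are assumptions.

    -- Hypothetical key construction: instance ID plus the mapped source key.
    SELECT MD_SOURCE_INSTANCE_ID::VARCHAR || ':' || COST_ID AS MD_BUSINESS_KEY
    FROM FINANCE_FCOST_1_SRC;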