ETL Step 1: EXT
EXT is the first step of ETL. During EXT, the staging table schema aligns with the source entity schema.
Flat files are input in this step, while *_EXT tables are output.
The EXT step does the following:
-
Loading data from flat files to staging tables
Flat files created by Extractor Engine are copied from flat files to staging tables (naming convention: *_SRC and *_DEL_SRC).
-
Cleaning data with null/duplicated business key
- The system checks the data type and data length from the flat files. The system also checks if the data field is null. It is required that the data is not null. Rejected data is loaded to the VALF (validation fail) tables (naming convention: *_VALF). The VALF tables have the same table structures but all fields’ data type is
VARCHAR (50000)
, which is to make sure that the data is not rejected again. -
The system checks if there is duplicated data based on the is_bk attribute defined in the source entity. Duplicated data is moved to the VALF tables as well.
- The system checks the data type and data length from the flat files. The system also checks if the data field is null. It is required that the data is not null. Rejected data is loaded to the VALF (validation fail) tables (naming convention: *_VALF). The VALF tables have the same table structures but all fields’ data type is
-
Generating hash codes for records
The system generates a hash code for each record and puts the hash code to the MD_HASH_CODE field. The system use the hash code to compare table for data updates.
-
Prefilling data from snapshot tables to make sure that records can be joined in the SSI step.
To reduce the load on the PPM database, Extractor Engine only extracts changed data. The system puts all data that was loaded to snapshot tables (naming convention: *_TSNP) in the TSNP step. In the EXT step, the system pulls data back from the snapshot tables, to make sure that all data can be joined in the SSI step.
-
Checking for updated, deleted, and inserted data
- Through comparing snapshot tables (naming convention: *_TSNP), the system tags data with
UPD
for update andNEW
for insert. -
Extractor Engine extracts the business key fields for all records from the PPM database and compare it with the snapshot tables for deleted records.
- Through comparing snapshot tables (naming convention: *_TSNP), the system tags data with
The following table lists the key staging tables.
Table Name |
Table Description |
---|---|
<STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_SRC |
Stores data loaded from flat files for further process. |
<STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_SRC |
Contains all business key fields, used for detecting deleted data. |
<STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_TSNP |
Contains all data that was loaded and the history of records. |
<STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_VALF |
Stores rejected data. |
<STREAM_NAME>_<SOURCE_ENTITY_NAME>_<MD_SOURCE_INSTANCE_ID>_EXT |
Contains data that has been processed in the EXT step and is going to be passed to the SSI step. |
Most of the staging tables in this step contains the following fields.
Field Name |
Field Description |
---|---|
MD_BATCH_ID |
Data Batch ID. Extractor Engine puts data to flat files and gives a batch ID for those data. Administrators can use this ID to track the data in the staging tables and the target tables, and also check Extractor Engine and Flow Engine log files during troubleshooting. |
MD_PROCESS_ID |
ETL process ID. Flow Engine generates process ID and stores the ID in tables. Administrators can use the ID to check the Flow Engine log and the process status in the Derby database. |
MD_SOURCE_INSTANCE_ID |
Source Instance ID. System assigns an ID for PPM instance when the administrator registers a new PPM instance to the system. Vertica for PPM users can use this ID to separate data from different PPM instances. |
MD_FLAG |
Indicates whether this row is new, is updated, or needs to be deleted. |
MD_PF_FLAG |
Indicates whether row is prefilled.
|
MD_BUSINESS_KEY |
Primary key that is mapped to one or more source columns. The system concatenates MD_SOURCE_INSTANCE_ID as part of MD_BUSINESS_KEY. |
MD_HASH_CODE |
Stores system-generated hash codes. The system compares this field with the snapshot tables for data update. |