Extract Transform Load


This vignette is a generic introduction to our ETL pipelines.

Extract

The first step in the ETL process is change data capture (CDC): a software process that tracks and records changes to data in a source database (e.g. DHIS2) and downloads those changes to our systems in near real-time.
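The CDC step can be sketched as a simple polling loop: ask the source for records modified since the last sync and advance a watermark. This is a minimal illustration, not our actual implementation; the record shapes are invented, though DHIS2 does expose a last-updated timestamp per record that this kind of query relies on.

```python
def capture_changes(source_rows, last_sync):
    """Return rows modified after `last_sync`, plus the new watermark."""
    changed = [r for r in source_rows if r["last_updated"] > last_sync]
    # advance the watermark to the newest change seen; keep it if nothing changed
    watermark = max((r["last_updated"] for r in changed), default=last_sync)
    return changed, watermark

# toy source table with per-record modification times
source = [
    {"id": "a", "last_updated": 3, "value": 1},
    {"id": "b", "last_updated": 7, "value": 2},
]
changed, watermark = capture_changes(source, last_sync=5)
# only record "b" was modified since the last sync; the watermark moves to 7
```

Each sync then stores the returned watermark so the next poll only fetches newer changes.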

The process of extracting data differs for each data asset, but the purpose is always the same: to update a raw database with new and modified data. No data are ever deleted. Instead, each observation is assigned a unique identifier, and when the value of an observation is updated at a source (e.g. DHIS2), the new value is date-stamped and added to the raw database under the same unique identifier but a new version number. The raw data are thus effectively version controlled, and we provide utilities to reconstruct the raw data as they looked at any point in the past.
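The append-only versioning scheme above can be illustrated with a small sketch. The function and field names here are hypothetical, but the logic mirrors the description: updates add a new date-stamped version under the same identifier, and a point-in-time view is rebuilt by keeping the latest version stamped on or before the requested date.

```python
import datetime as dt

def upsert(raw, uid, value, now):
    """Append `value` for observation `uid` with the next version number."""
    versions = [r for r in raw if r["uid"] == uid]
    raw.append({
        "uid": uid,
        "version": len(versions) + 1,  # older versions are never touched
        "value": value,
        "stamped": now,
    })

def as_of(raw, when):
    """Reconstruct the raw data as they looked at time `when`."""
    latest = {}
    for r in sorted(raw, key=lambda r: r["version"]):
        if r["stamped"] <= when:
            latest[r["uid"]] = r  # later versions overwrite earlier ones
    return list(latest.values())

raw = []
t1 = dt.datetime(2024, 1, 1, tzinfo=dt.timezone.utc)
t2 = dt.datetime(2024, 2, 1, tzinfo=dt.timezone.utc)
upsert(raw, "obs-1", 10, now=t1)
upsert(raw, "obs-1", 12, now=t2)  # source updated: same uid, new version
snapshot = as_of(raw, t1)         # historical view: version 1, value 10
```

Because nothing is ever deleted, `raw` still holds both versions after the update, and the historical snapshot is recovered purely by filtering on the date stamps.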

Transform

After the raw data are updated, a series of steps is run to derive a current database that holds clean, up-to-date data. To avoid storing an ever-increasing number of current data assets, the data warehouse holds only one version of the current data, regenerated from the raw data whenever the raw data are updated.
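At its core, deriving the current database from the versioned raw data means keeping only the latest version of each observation. A minimal sketch, assuming the raw rows carry the `uid` and `version` fields described above (the function name is invented):

```python
def build_current(raw):
    """Derive the single current view: the latest version of each observation."""
    latest = {}
    for row in raw:
        kept = latest.get(row["uid"])
        if kept is None or row["version"] > kept["version"]:
            latest[row["uid"]] = row  # a higher version supersedes the old one
    return {uid: r["value"] for uid, r in latest.items()}

raw = [
    {"uid": "obs-1", "version": 1, "value": 10},
    {"uid": "obs-1", "version": 2, "value": 12},  # update supersedes version 1
    {"uid": "obs-2", "version": 1, "value": 7},
]
current = build_current(raw)
```

In practice the real transform also cleans and reshapes the data, but because the current view is a pure function of the raw data, it can always be thrown away and regenerated.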

The version-controlled algorithms that generate the current database from the raw database are stored on GitHub.

Load

In the final step, the updated raw and current data assets are uploaded to the DHI data warehouse.