Harmonization Workflow • rddharmony

This article gives a brief description of the harmonization process ¹, from extracting the data from the UNPD database to producing clean output. The process is divided into four parts outlined below.

Part 1

TL;DR: In this section, the data is extracted from the UNPD database and harmonized to standard abridged and complete age groups, per series.

The term Counts by age and sex will be used in general, to refer to Counts by age of mother and sex of child (births) data and Counts by age and sex (population and deaths) data.

We begin by extracting the raw data from the UNPD database ². We then drop sub-national census records and discard data types that are not needed in the final output. A unique id is then generated by combining the LocID, LocName, DataProcess, type of data (births, deaths or population counts), TimeLabel, DataProcessType, DataSourceName, StatisticalConceptName, DataTypeName and DataReliabilityName. We later (for births and deaths) split the data into Counts by age and sex and Total counts by sex. The majority of the harmonisation process that follows is implemented on Counts by age and sex, so Total counts by sex data is merged back to the clean and harmonised data at the end of this process.

A major step that follows is the harmonization of the data into standard age groups ³. This step is implemented on each id and each series (abridged or complete). We begin by subsetting the data, to only be left with the final series, in a case where “Final” data status is available. For each series, we then check whether the series is a full series with all age groups represented and also check whether an open age (greater than 60 for the births data) exists in the data. A full series requires all closed age groups and either an open age group or a reported total value. If the data is abridged, then a full series also requires either both abridged child ages or both five-year child ages to be present. This condition sufficiently holds for both the births data and the population counts data, but is further enhanced for the deaths data as described in part 2. We also ensure that that we only have one “Total” record reported in the data. If more than one series exists, and the latest series is full, we only keep one, but if the latest series is not full, we then keep the latest data source record for each age label. In cases where an “Unknown” age label does not exist, we generate one and set its’ data value to 0. The same happens in cases where an “Unknown” age label exists, but is equivalent to the difference between the reported total and calculated total. This could be a case where age labels were later updated and these changes captured in the reported total. In cases where the age labels are grouped (abridged), we reconcile wide early age groups to abridged e.g in a case where 0-4 is missing and 0 and 1-4 are present, we sum 0 and 1-4 to obtain 0-4. We also check whether there are multiple open age groups and if so, we compute closed age groups from them and add them to the data if they are missing. The start age of the open age group needed to close the series is identified and a check is done to confirm whether this open age group exists in the series. Open age groups that do not close the series are dropped, a check is done to ensure that there are no missing age groups on the abridged series after which we compute all possible open age groups and append them to the data. If the total is not recorded and the series is otherwise complete, we compute it. In a case where we still have missing age labels or an open age group does not exist, a note is generated alerting the user that the series is missing data for one or more age groups. This marks the end of harmonization of the data into standard age groups.

Once the major harmonization step is done, we check whether the youngest age on abridged is lower than youngest age on complete (in the case where both abridged and complete series exist), and if so, we distribute (for births data) counts at ages 10-14 across single ages 10-14 using the Kirill’s Matlab regression method ⁴. We then reconcile abridged and complete series, as necessary. This involves generating abridged records from complete series , checking if the totals match, in cases where they exist, filling in the total of one series if it is missing but is present in the other series, appending the abridged series with the abridged records derived from complete series, computing all possible open age groups, dropping records for open age groups that do not close the series and finally in cases where totals for both series don’t match, generating a note alerting the user that different totals on abridged and complete preclude reconciliation ⁵. We proceed to fill in zeros for counts at young ages ⁶, if missing and in cases where we only have the complete series and not the abridged series, we generate the latter records from the former ⁷. We later populate missing age groups with zeros, as appropriate and finally assemble all of the series into a single dataset.

Part 2

In this section, two major things happen. One, we check if the data(series) is full ⁸ and if so, we keep it and two, for each id-sex combination of full series, we keep the reconciled series if it is available and discard the original abridged or complete series.

At this point, we also define a full series based on the availability of the individual sex data. In the deaths data, a full series is further defined as one where either both male and female data is available, or both sexes data and either the male data or female data is available.

Part 3

In this section, we validate the remaining series, making corrections on any instances where data for sex-age group combinations are missing and check whether the sum over age matches the reported totals ⁹. We also go a step further and check if counts by sex is equal to counts for both sexes and adjust in case of any difference ¹⁰ and distribute unknowns over age ¹¹. Finally, the most authoritative series is selected, in cases where there is more than one id for a given census year. This process is different for the different kinds of data.

For births and deaths ¹², we first check if there are multiple ids per time label in the dataset. If such records exist, we separate them from those that only have one id for each time label. On the former dataset, and for each time label, we begin by dropping records based on the StatisticalConceptName, preferring “Year of occurrence” over “Year of registration”. In cases where we have multiple types of DataStatus, we prefer the “Final” over “Provisional”. We then rank the data by DataProcess, preferring “Census” over “Vital Registration” data. We also prefer better DataReliability (where the data is ranked as 1: High Quality, 2: Fair Quality, 3: Low Quality and so on). We also have cases where there are multiple DataSourceYear values for the same time label in which case we subset the data to select the latest DataSourceYear for each of the time labels. In cases where we have some ids with both abridged and complete series, and others with only one of the two, we prefer the former to the latter. We also check on the maximum age groups for each of the ids, and keep the series with the highest open age group. We have several DataSourceNames present in the data, and in the case where these exist for the same time label, we prefer “Demographic Yearbook” over the rest. The final process of this part is checking for duplicates and dropping them, if they exist. The resulting data is then merged back with the records that only have one id for each time label and in cases where multiple ids still exist, we randomly drop the rest to be left with only one.

For population data ¹³, we also check if there are multiple ids and separate them from those that only have one id for each time label. Here, we begin by checking cases where both the abridged and complete series exist, and keep this over ids with only one of the series present. As with the births and deaths data, we also prefer better DataReliability but when it comes to ranking the StatisticalConceptName, we prefer “De-facto” over “De-jure”. We also check on the maximum age groups for each of the ids, keeping the series with the highest open age group and prefer “Demographic Yearbook” over the rest in cases where multiple DataSourceNames exist. If there are cases with “IPUMS” as a duplicate source, we drop it since many of these series are samples. We then subset the data to select the latest DataSourceYear for each of the time labels, drop duplicates where they exist, merge the dataset with the records that only have one id for each time label and randomly pick one id, in cases where multiple ids still exist in the data.

Part 4

This being the final part of the work flow, we look for years that are in the raw data but not in the output and append them to the output with a note. If there are series with non-standard age groups, we add these to the output as well. We then combine the harmonized data with the Total counts by sex data ensuring that, for each id, one, if Total counts by sex records exist and the harmonized data does not, we drop the former and two, if the Total counts by sex reported total value does not match the total value reported in the harmonized data, we drop it¹⁴. We also ensure that we have one id per time label and finally drop duplicate records (especially in cases where we may have duplicate Total counts by sex records).