Introduction

DDharmonize_validate_DeathCounts() implements a workflow for death records extracted from vital registration databases and censuses. The workflow extracts data from the UNPD (United Nations Population Division) database, harmonizes age groups, identifies full series, validates totals by age, and produces a clean, harmonized dataset for each location. See the harmonization workflow article for a detailed overview of this process.

The death records are grouped into two types of data:

  • Deaths by age and sex

  • Total deaths by sex

Function definition


# clean_df <- DDharmonize_validate_DeathCounts(locid = 404,
#                                              times = c(1950, 2000),
#                                              process = c("census", "vr"),
#                                              return_unique_ref_period = TRUE,
#                                              retainKeys = FALSE)

Function arguments

The function contains several arguments:

locid: A numeric location id. Run View(get_locations()) to list the valid location ids, which are stored in the PK_LocID variable. You can also run check_locid(insert locid here) to check whether a location id is valid, i.e. whether it is one of the locations in the UNPD database. With a valid id, check_locid() returns a message confirming that the id is valid and giving the corresponding location name. With an invalid id, it returns a message directing the user to run View(get_locations()) to get the list of valid location ids. See the example below.

## valid id
## check_locid(404)

## invalid id
## check_locid(202178)
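
The two outcomes described above can be mimicked without access to the database. The following is only a sketch: check_locid_sketch() is a hypothetical stand-in, not the package's check_locid(), and the toy table stands in for the output of get_locations(). Only PK_LocID is named in this article; the Name column is an assumption.

```r
# Toy stand-in for the table returned by get_locations().
# PK_LocID comes from the article; the Name column is assumed.
locations <- data.frame(
  PK_LocID = c(404, 752),
  Name = c("Kenya", "Sweden"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not the package function): reports whether
# `locid` appears in the lookup table and returns TRUE/FALSE invisibly.
check_locid_sketch <- function(locid, locations) {
  hit <- locations$PK_LocID == locid
  if (any(hit)) {
    message(locid, " is a valid location id: ", locations$Name[hit][1])
    return(invisible(TRUE))
  }
  message(locid, " is not a valid location id; ",
          "run View(get_locations()) to list valid ids")
  invisible(FALSE)
}

check_locid_sketch(404, locations)     # valid id
check_locid_sketch(202178, locations)  # invalid id
```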

times: The period of the data to be extracted. You can extract a single year, e.g. times = 2020, or a longer period, e.g. times = c(1950, 2020).

process: The process used to collect or obtain the data, i.e. either census ("census") or vital registration ("vr"). By default, the function pulls data obtained through both of these processes.

return_unique_ref_period: Specifies whether the returned data should contain one unique id (return_unique_ref_period == TRUE) or several ids (return_unique_ref_period == FALSE) per time label. An id is a unique identifier for each distinct set of records, based on LocID, LocName, DataProcess, ReferencePeriod, DataSourceName, StatisticalConceptName, DataTypeName and DataReliabilityName. The definitions of these variables are provided later in this article.
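
One way to see what "one unique id per time label" means is to count the distinct ids within each TimeLabel. The sketch below uses a toy data frame; the id strings are made up for illustration and do not reproduce the format the function generates.

```r
# Toy records: two competing series (ids) for 2000, one series for 2001.
# The id strings are invented placeholders.
records <- data.frame(
  id = c("series A", "series B", "series B"),
  TimeLabel = c("2000", "2000", "2001"),
  stringsAsFactors = FALSE
)

# Number of distinct ids within each TimeLabel.
ids_per_label <- vapply(
  split(records$id, records$TimeLabel),
  function(x) length(unique(x)),
  integer(1)
)

# TRUE only if every TimeLabel maps to exactly one id.
has_unique_ref_period <- all(ids_per_label == 1L)

print(ids_per_label)
print(has_unique_ref_period)
```

Applied to real output, a check like this would be expected to pass when return_unique_ref_period == TRUE, and may fail when it is FALSE.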

retainKeys: Specifies whether only a few (retainKeys == FALSE) or all (retainKeys == TRUE) variables should be retained in the output.

Output structure

The function returns clean data with 26 variables (when retainKeys == TRUE) which are defined below:

  • id: A unique id that is generated by combining the LocID, LocName, DataProcess, type of data (deaths), TimeLabel, DataProcessType, DataSourceName, StatisticalConceptName, DataTypeName and DataReliabilityName.

  • LocID: Location Id. This is a numerical location code (3-digit codes following the ISO 3166-1 numeric standard, i.e. UNSD M49 codes); see http://en.wikipedia.org/wiki/ISO_3166-1_numeric

  • LocName: Name of the country or territory identified by each Location Id, e.g. when LocID == 752, LocName == Sweden.

  • IndicatorName: Identifies the type of data i.e. Deaths by age and sex or Total deaths by sex.

  • IndicatorID: An id representing each indicator. IndicatorID == 194 where IndicatorName == Deaths by age and sex - abridged, IndicatorID == 195 where IndicatorName == Deaths by age and sex - complete, and IndicatorID == 188 where IndicatorName == Total deaths by sex.

  • TimeStart: Start date of the period of interest (e.g. 01/01/2000).

  • TimeLabel: The year of interest (e.g 2000).

  • TimeEnd: End date of the period of interest (e.g. 31/12/2000).

  • TimeMid: Mid-period based on TimeStart and TimeEnd (e.g 2000.500).

  • DataProcessType: Defines the process used to collect or to obtain the data.

  • DataSourceName: This defines the source of data (e.g. Demographic Year Book, World Health Organization records, etc).

  • StatisticalConceptName: Defines the concept under which individuals (or vital events) are recorded (e.g De-facto, Year of occurrence).

  • DataTypeName: Indicates the type of collected data or estimation process used to derive the data.

  • DataReliabilityName: Denotes the reliability of the data values, e.g. Error, typo or invalid value; Very low quality; Low; Fair; High quality (the default is unknown). The rating is either obtained from the data source or assigned by default during initial loading.

  • AgeLabel: Age bracket (where data is abridged) or single year of age (where the data is complete).

  • AgeStart: The start age of an age label, e.g. where the age label is 10-14, the start age is 10.

  • AgeEnd: The last age covered by an age label plus 1, e.g. where the age label is 10-14, the end age is 15.

  • AgeSpan: The width of the age group, i.e. AgeEnd minus AgeStart, e.g. where the age label is 10-14, the age span is 5.

  • AgeSort: Defines the order of a particular age label when the age labels are arranged in ascending order.

  • abridged: Identifies age labels that have an age span of 5 years, including open age groups and the total, but excluding the label 0-4 (IndicatorID = 194).

  • five_year: Identifies the same age labels as abridged, but also including the label 0-4.

  • complete: Identifies single year of age labels, including open age groups and the total (IndicatorID = 195).

  • non_standard: Identifies age labels that are not standard, e.g. those whose age span is neither 1 nor 5 and which are neither the total nor unknown.

  • SexID: ID for each of the sex groups (1: Males, 2: Females, 3: Both sexes).

  • DataValue: The numerical value of interest for a specific set of unique characteristics as defined by id.

  • note: Gives the user more information about a record where the data is not fully clean, e.g. the record may not have been harmonized because of non-standard age groups, or the series may be missing data for one or more age groups.
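
The age variables defined above can be illustrated with a few toy age labels. This is a minimal sketch that applies the definitions literally; it is not the package's own logic. Open age groups, the total and unknown ages are left out because this article does not say how they are coded.

```r
# Toy closed age labels: three 5-year groups, one single year, one 10-year group.
labels <- c("0-4", "5-9", "10-14", "7", "15-24")

# Split "10-14" into c("10", "14"); a single-year label like "7" stays as "7".
parts <- strsplit(labels, "-", fixed = TRUE)
age_start <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))
age_last  <- vapply(parts, function(p) as.numeric(p[length(p)]), numeric(1))

ages <- data.frame(
  AgeLabel = labels,
  AgeStart = age_start,
  AgeEnd   = age_last + 1,   # end age is the last age covered plus 1
  stringsAsFactors = FALSE
)
ages$AgeSpan <- ages$AgeEnd - ages$AgeStart

# Flags, following the definitions in the list above (closed groups only).
ages$five_year    <- ages$AgeSpan == 5
ages$abridged     <- ages$five_year & ages$AgeLabel != "0-4"
ages$complete     <- ages$AgeSpan == 1
ages$non_standard <- !(ages$AgeSpan %in% c(1, 5))

print(ages)
```

In this toy table, 10-14 gets AgeStart 10, AgeEnd 15 and AgeSpan 5, while 15-24 has a span of 10 and is therefore flagged as non_standard.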

Harmonization Workflow

For a detailed explanation of the harmonization workflow, see the harmonization workflow article.