Introduction

DDharmonize_validate_BirthCounts() implements a workflow for birth records extracted from vital registration databases and censuses. The workflow extracts data from the UNPD (United Nations Population Division) database, harmonizes age groups, identifies full series, validates totals by age, and produces clean, harmonized datasets for each location. See the harmonization workflow article for a detailed overview of this process.

The birth records are grouped into two types of data:

  • Births by age of mother and sex of child

  • Total births by sex of child

Function definition


# clean_df <- DDharmonize_validate_BirthCounts(locid, 
#                                              times, 
#                                              process = c("census", "vr"),
#                                              return_unique_ref_period = TRUE,
#                                              retainKeys = FALSE)

                                             
# example: extracting Sweden's data
# clean_df <- DDharmonize_validate_BirthCounts(locid = 752,
#                                              times = c(2010, 2011),
#                                              process = c("census", "vr"),
#                                              return_unique_ref_period = TRUE,
#                                              retainKeys = FALSE)

Function arguments

The function takes the following arguments:

locid: A numeric value giving the location id. Run View(get_locations()) to list the valid location ids, which are stored in the PK_LocID variable. You can also pass a location id to check_locid() to check whether it is valid, that is, whether it is one of the locations on the UNPD website. With a valid id, check_locid() returns a message confirming that the id is valid, together with the corresponding location name. With an invalid id, it returns a message directing the user to run View(get_locations()) for the list of valid location ids. See the example below.

## valid id
## check_locid(752)

## invalid id
## check_locid(2021)

times: The period of the data to be extracted. You can extract a single year of data, e.g. times = 2020, or a longer period, e.g. times = c(1950, 2020).

process: The process through which the data were collected or obtained, i.e. census or vital registration (vr). By default, the function pulls data obtained through both processes.

return_unique_ref_period: Specifies whether the returned data should contain one unique id (return_unique_ref_period == TRUE) or several ids (return_unique_ref_period == FALSE) per time label. An id is a unique identifier for each set of records, based on LocID, LocName, DataProcess, ReferencePeriod, DataSourceName, StatisticalConceptName, DataTypeName and DataReliabilityName. The definitions of these variables are provided later in this article.

retainKeys: Specifies whether only a few (retainKeys == FALSE) or all (retainKeys == TRUE) variables should be retained in the output.
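
One way to see what return_unique_ref_period controls is to count the distinct ids under each time label in a returned data frame. The sketch below uses base R only and a small made-up data frame (toy, with invented id strings and values) standing in for real function output; it does not call the package.

```r
# Toy stand-in for function output; ids and values are invented for illustration.
toy <- data.frame(
  id        = c("A - 2010", "A - 2010", "B - 2010", "A - 2011"),
  TimeLabel = c("2010", "2010", "2010", "2011"),
  DataValue = c(100, 120, 98, 105),
  stringsAsFactors = FALSE
)

# Number of distinct ids within each time label.
ids_per_label <- tapply(toy$id, toy$TimeLabel, function(x) length(unique(x)))

# TRUE only when every time label maps to exactly one id, which is the
# situation described for return_unique_ref_period == TRUE.
all_unique <- all(ids_per_label == 1)
```

In this toy data the 2010 label carries two ids, so all_unique is FALSE; that is the shape of output described for return_unique_ref_period == FALSE.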

Output structure

The function returns clean data with 26 variables (when retainKeys == TRUE) which are defined below:

  • id: A unique id that is generated by combining the LocID, LocName, DataProcess, type of data (births), TimeLabel, DataProcessType, DataSourceName, StatisticalConceptName, DataTypeName and DataReliabilityName.

  • LocID: Location id. This is a numeric location code (3-digit codes following the ISO 3166-1 numeric standard, i.e. UNSD M49 codes); see http://en.wikipedia.org/wiki/ISO_3166-1_numeric.

  • LocName: Name of the country or territory identified by the location id, e.g. when LocID == 404, LocName == Kenya.

  • IndicatorName: Identifies the type of data i.e. Births by age of mother and sex of child or Total births by sex of child.

  • IndicatorID: An id representing each indicator. IndicatorID = 170 where IndicatorName == Births by age of mother and sex of child and IndicatorID = 159 where IndicatorName == Total births by sex of child.

  • TimeStart: Start date of the period of interest (e.g. 01/01/2000).

  • TimeLabel: The year of interest (e.g. 2000).

  • TimeEnd: End date of the period of interest (e.g. 31/12/2000).

  • TimeMid: Mid-point of the period, based on TimeStart and TimeEnd (e.g. 2000.500).

  • DataProcessType: Defines the process used to collect or to obtain the data.

  • DataSourceName: The source of the data (e.g. Demographic Year Book, World Health Organization records, etc.).

  • StatisticalConceptName: Defines the concept under which individuals (or vital events) are recorded (e.g. De-facto, Year of occurrence).

  • DataTypeName: Indicates the type of collected data or estimation process used to derive the data.

  • DataReliabilityName: Denotes the reliability of the data values (default is unknown), e.g. Error, typo or invalid value; Very low quality; Low; Fair; High quality. The rating is either obtained from the data source or assigned by default to the data during initial loading.

  • AgeLabel: Age bracket (where data is abridged) or single year of age (where the data is complete).

  • AgeStart: The start age of an age label, e.g. where the age label is 10-14, the start age is 10.

  • AgeEnd: The end age of an age label plus 1, e.g. where the age label is 10-14, the end age is 15.

  • AgeSpan: The difference between AgeEnd and AgeStart, e.g. where the age label is 10-14, the age span is 5.

  • AgeSort: Defines the order of a particular age label when the age labels are arranged in ascending order.

  • abridged: Identifies age labels that have an age span of 5 years, including open age groups and the total, but excluding the label 0-4.

  • five_year: Identifies age labels that are abridged including 0-4.

  • complete: Identifies single year of age labels, including open age groups and the total.

  • non_standard: Identifies age labels that are not standard, i.e. those with an age span other than 1 or 5 that are neither the total nor unknown.

  • SexID: ID for each of the sex groups (1: Males, 2: Females, 3: Both sexes).

  • DataValue: The numerical value of interest for a specific set of unique characteristics as defined by id.

  • note: Gives the user more information about a record in cases where the data is not fully clean, e.g. the record may not have been harmonized because of non-standard age groups, or the series may be missing data for one or more age groups.
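
The relationship between AgeLabel, AgeStart, AgeEnd and AgeSpan described above can be illustrated with a small base R sketch. The labels and the parsing rule are assumptions made for illustration (in particular, treating a single-year label such as "20" as having an end age of 21); this is not the package's own code.

```r
# Toy age labels; the parsing rule below is an illustrative assumption,
# not the package's implementation.
age_label <- c("10-14", "15-19", "20")

parts     <- strsplit(age_label, "-", fixed = TRUE)
age_start <- as.numeric(vapply(parts, function(p) p[1], character(1)))
# Upper bound as written in the label; a single age is its own upper bound.
age_upper <- as.numeric(vapply(parts, function(p) p[length(p)], character(1)))

age_end  <- age_upper + 1        # AgeEnd is the label's end age plus 1
age_span <- age_end - age_start  # AgeSpan: 5 for "10-14", 1 for "20"

ages <- data.frame(AgeLabel = age_label, AgeStart = age_start,
                   AgeEnd = age_end, AgeSpan = age_span,
                   stringsAsFactors = FALSE)
```

Under these assumptions, rows with an AgeSpan of 5 correspond to abridged labels and rows with an AgeSpan of 1 to single years of age.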

Harmonization Workflow

For a detailed explanation of the harmonization workflow, see the harmonization workflow article.