The purpose of this process is to standardizes/harmonize five-year vital and population counts taken from the UNDP database.

There are two functions involved. DDharmonize_Vitals5 1 is implemented on births and deaths data while DDharmonize_Pop5 is implemented on population data.

The functions are defined as follows:

DDharmonize_Vitals5(indata, type = c(“births”,“deaths”))

DDharmonize_Pop5(indata)

In this vignette, we will use the vitals5_df dataset that is embedded on this package, and the DDharmonize_Vitals5() function to show how five-year birth counts are harmonized. This dataset represents births by age of mother and sex of child in Norway for the year 2002 and for cases where SexID == “Both sexes”.

## Load the packages required
library(rddharmony)
library(kableExtra)
library(dplyr)
library(purrr)

## Create a function to be used to generate the table output
tab_output <- function(tab) {
  kable(tab, booktabs = TRUE, align = "c", table.envir = "capctable", longtable = TRUE) %>%
    kable_styling() %>%
    row_spec(0, bold = T, color = "white", background = "#6ebed8") %>%
    kable_paper(html_font = "helvetica") %>%
    scroll_box(width = "100%", height = "300px")
}

The data, with only a few variables displayed, is shown below.

new_df <- vitals5_df %>%
  select(LocID, LocName, DataSourceYear, DataStatusName, SexName, SexID, DataSourceYear, TimeLabel, TimeMid, starts_with("Age"), DataValue, SeriesID) %>%
  select(-agesort) %>%
  arrange(AgeLabel)

new_df %>% tab_output()
LocID LocName DataSourceYear DataStatusName SexName SexID TimeLabel TimeMid AgeID AgeUnit AgeStart AgeEnd AgeSpan AgeMid AgeLabel AgeSort DataValue SeriesID
578 Norway 2005 Final Both sexes 3 2002 2002.5 3222 Year 0 15 15 7.5 0-14 9 4 1.106610e+14
578 Norway 2004 Final Both sexes 3 2002 2002.5 30504 Year 12 15 3 14.5 12-14 106 4 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 304 Year 15 20 5 17.5 15-19 126 1322 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 305 Year 20 25 5 22.5 20-24 172 8032 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 306 Year 25 30 5 27.5 25-29 195 18556 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 307 Year 30 35 5 32.5 30-34 211 18907 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 308 Year 35 40 5 37.5 35-39 232 7365 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 309 Year 40 45 5 42.5 40-44 249 1211 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 310 Year 45 50 5 47.5 45-49 263 37 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 55434 7.878205e+12
578 Norway 2005 Final Both sexes 3 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 4 1.106610e+14

The function should ideally be run over each of the specific SexIDs (Males (1), Females (2) and Both sexes (3)), but when it is applied to births data, only “Both sexes” counts are considered. It is also implemented in cases where we have at least one abridged age group.

Before running this function, we need to ensure that all the age labels in the data have an AgeSpan >= 5.

new_df <- new_df %>%
  dplyr::filter(AgeSpan %in% c(-2, -1) | AgeSpan >= 5)
new_df %>% tab_output()
LocID LocName DataSourceYear DataStatusName SexName SexID TimeLabel TimeMid AgeID AgeUnit AgeStart AgeEnd AgeSpan AgeMid AgeLabel AgeSort DataValue SeriesID
578 Norway 2005 Final Both sexes 3 2002 2002.5 3222 Year 0 15 15 7.5 0-14 9 4 1.106610e+14
578 Norway 2004 Final Both sexes 3 2002 2002.5 304 Year 15 20 5 17.5 15-19 126 1322 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 305 Year 20 25 5 22.5 20-24 172 8032 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 306 Year 25 30 5 27.5 25-29 195 18556 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 307 Year 30 35 5 32.5 30-34 211 18907 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 308 Year 35 40 5 37.5 35-39 232 7365 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 309 Year 40 45 5 42.5 40-44 249 1211 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 310 Year 45 50 5 47.5 45-49 263 37 7.878205e+12
578 Norway 2004 Final Both sexes 3 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 55434 7.878205e+12
578 Norway 2005 Final Both sexes 3 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 4 1.106610e+14

We begin by subsetting the data to ensure that we only have “Both sexes” data and none of the data values are missing.

new_df <- new_df %>%
  dplyr::filter(SexID == 3 & !is.na(DataValue)) %>%
  select(-SexID) %>%
  distinct()

new_df %>% tab_output()
LocID LocName DataSourceYear DataStatusName SexName TimeLabel TimeMid AgeID AgeUnit AgeStart AgeEnd AgeSpan AgeMid AgeLabel AgeSort DataValue SeriesID
578 Norway 2005 Final Both sexes 2002 2002.5 3222 Year 0 15 15 7.5 0-14 9 4 1.106610e+14
578 Norway 2004 Final Both sexes 2002 2002.5 304 Year 15 20 5 17.5 15-19 126 1322 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 305 Year 20 25 5 22.5 20-24 172 8032 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 306 Year 25 30 5 27.5 25-29 195 18556 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 307 Year 30 35 5 32.5 30-34 211 18907 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 308 Year 35 40 5 37.5 35-39 232 7365 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 309 Year 40 45 5 42.5 40-44 249 1211 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 310 Year 45 50 5 47.5 45-49 263 37 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 55434 7.878205e+12
578 Norway 2005 Final Both sexes 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 4 1.106610e+14

We then check if “Final” data status is available and if so, we prioritize this over the provisional records.

if ("Final" %in% unique(new_df$DataStatusName)) {
  new_df <- new_df %>%
    dplyr::filter(DataStatusName == "Final")
}

new_df %>% tab_output()
LocID LocName DataSourceYear DataStatusName SexName TimeLabel TimeMid AgeID AgeUnit AgeStart AgeEnd AgeSpan AgeMid AgeLabel AgeSort DataValue SeriesID
578 Norway 2005 Final Both sexes 2002 2002.5 3222 Year 0 15 15 7.5 0-14 9 4 1.106610e+14
578 Norway 2004 Final Both sexes 2002 2002.5 304 Year 15 20 5 17.5 15-19 126 1322 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 305 Year 20 25 5 22.5 20-24 172 8032 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 306 Year 25 30 5 27.5 25-29 195 18556 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 307 Year 30 35 5 32.5 30-34 211 18907 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 308 Year 35 40 5 37.5 35-39 232 7365 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 309 Year 40 45 5 42.5 40-44 249 1211 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 310 Year 45 50 5 47.5 45-49 263 37 7.878205e+12
578 Norway 2004 Final Both sexes 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 55434 7.878205e+12
578 Norway 2005 Final Both sexes 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 4 1.106610e+14

We proceed to check for multiple series ids and for each unique series id, we check whether it is a full series 2 with all age groups represented and an open age greater than 60.

ids_series <- unique(new_df$SeriesID)
n_series <- length(ids_series)

# for each unique series,
df_out <- NULL

for (i in 1:n_series) {

  ## Filter the data for a specific SeriesID
  df <- new_df %>% dplyr::filter(SeriesID == ids_series[i])

  ## Check whether it is a full series with all age groups represented and an open age greater than 60
  ## AgeSpan == -1 (Total), AgeSpan == -2 (Unknown), AgeSpan == 5(five-year)

  df_abr_std <- df[(df$AgeStart == 0 & df$AgeSpan == 1) |
    df$AgeSpan %in% c(-1, -2, 5), ]

  if (nrow(df_abr_std) > 13) {
    df$check_full <- dd_series_isfull(df_abr_std, abridged = TRUE)
  } else {
    df$check_full <- FALSE
  }
  df_out <- rbind(df_out, df)
}
new_df <- df_out
rm(df_out)

new_df %>% tab_output()
LocID LocName DataSourceYear DataStatusName SexName TimeLabel TimeMid AgeID AgeUnit AgeStart AgeEnd AgeSpan AgeMid AgeLabel AgeSort DataValue SeriesID check_full
578 Norway 2005 Final Both sexes 2002 2002.5 3222 Year 0 15 15 7.5 0-14 9 4 1.106610e+14 FALSE
578 Norway 2005 Final Both sexes 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 4 1.106610e+14 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 304 Year 15 20 5 17.5 15-19 126 1322 7.878205e+12 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 305 Year 20 25 5 22.5 20-24 172 8032 7.878205e+12 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 306 Year 25 30 5 27.5 25-29 195 18556 7.878205e+12 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 307 Year 30 35 5 32.5 30-34 211 18907 7.878205e+12 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 308 Year 35 40 5 37.5 35-39 232 7365 7.878205e+12 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 309 Year 40 45 5 42.5 40-44 249 1211 7.878205e+12 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 310 Year 45 50 5 47.5 45-49 263 37 7.878205e+12 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 55434 7.878205e+12 FALSE

As can be seen in this data, we have situations were we have multiple total labels for the same series. This may have occurred when a country updates an age label and reports the total for that update. If there exists more than one total reported and there is a one that is equal to the computed total (summation over the age labels), drop the others to be left with this one 3.

new_df <- new_df %>% dd_multiple_totals()
new_df %>% tab_output()
LocID LocName DataSourceYear DataStatusName SexName TimeLabel TimeMid AgeID AgeUnit AgeStart AgeEnd AgeSpan AgeMid AgeLabel AgeSort DataValue SeriesID check_full counter misplaced_oag
578 Norway 2005 Final Both sexes 2002 2002.5 3222 Year 0 15 15 7.5 0-14 9 4 1.106610e+14 FALSE 2 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 304 Year 15 20 5 17.5 15-19 126 1322 7.878205e+12 FALSE 1 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 305 Year 20 25 5 22.5 20-24 172 8032 7.878205e+12 FALSE 1 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 306 Year 25 30 5 27.5 25-29 195 18556 7.878205e+12 FALSE 1 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 307 Year 30 35 5 32.5 30-34 211 18907 7.878205e+12 FALSE 1 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 308 Year 35 40 5 37.5 35-39 232 7365 7.878205e+12 FALSE 1 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 309 Year 40 45 5 42.5 40-44 249 1211 7.878205e+12 FALSE 1 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 310 Year 45 50 5 47.5 45-49 263 37 7.878205e+12 FALSE 1 FALSE
578 Norway 2004 Final Both sexes 2002 2002.5 398 Year 0 -1 -1 -1.0 Total 999 55434 7.878205e+12 FALSE 2 FALSE

We then check the data to ensure that we have the latest data source year for each of the age labels 4.

# if there is more than one series ...
if (n_series > 1) {
  latest_source_year <- max(new_df$DataSourceYear)
  check_latest_full <- unique(new_df$check_full[new_df$DataSourceYear == latest_source_year])
  # ... and latest series is full then keep only that one
  if (any(check_latest_full)) {
    new_df <- new_df[new_df$DataSourceYear == latest_source_year, ]
  } else {
    # ... and latest series is not full, then keep the latest data source record for each age
    new_df <- new_df %>% dd_latest_source_year()
  }
}

We proceed to tidy up the data frame and in cases where we do not have an “Unknown” age label present, we generate one and set the data value to 0.

new_df <- new_df %>%
  select(DataSourceYear, AgeStart, AgeEnd, AgeLabel, AgeSpan, DataValue) %>%
  distinct()

if (!("Unknown" %in% new_df$AgeLabel)) {
  new_df <- new_df %>%
    bind_rows(data.frame(
      AgeStart = -2,
      AgeEnd = -2,
      AgeSpan = -2,
      AgeLabel = "Unknown",
      DataSourceYear = NA,
      DataValue = 0
    ))
}

We have situations where the “Unknown” value represents the difference between the reported total and the summation of the data values (computed totals) and in such cases, we also set the “Unknown” data value to 0 5.

new_df <- new_df %>% dd_drop_unknowns()

We also reconcile early age groups for births, for example in this case, we have 0-14 and since the minimum age should be 10, we set this to 10-14 6. We also reconcile ages 0-1, 1-4 and 0-4 such that if 0-4 is missing and 0-1 and 1-4 are present, we sum both to get 0-4 7.

new_df <- new_df %>%
  dd_firstages_compute_births() %>%
  select(-AgeSort)

new_df <- new_df %>% dd_firstages_compute()
new_df %>% tab_output()
AgeStart AgeEnd AgeLabel AgeSpan AgeSort DataSourceYear DataValue
10 15 10-14 5 5 NA 4
15 20 15-19 5 6 2004 1322
20 25 20-24 5 7 2004 8032
25 30 25-29 5 8 2004 18556
30 35 30-34 5 9 2004 18907
35 40 35-39 5 10 2004 7365
40 45 40-44 5 11 2004 1211
45 50 45-49 5 12 2004 37
0 -1 Total -1 184 2004 55434
-2 -2 Unknown -2 185 NA 0

We proceed to check whether there are multiple open age groups (TRUE/FALSE), and if so, compute closed age groups from multiple open age groups and add to the data if missing 8.

oag_multi <- new_df %>% dd_oag_multiple()

if (oag_multi) {
  add <- new_df %>%
    dd_oag2closed() %>%
    dplyr::filter(!(AgeLabel %in% new_df$AgeLabel[!is.na(new_df$DataValue)]))

  if (nrow(add > 0)) {
    new_df <- new_df %>%
      bind_rows(add) %>%
      arrange(AgeStart)
  }
}

new_df %>% tab_output()
AgeStart AgeEnd AgeLabel AgeSpan AgeSort DataSourceYear DataValue
10 15 10-14 5 5 NA 4
15 20 15-19 5 6 2004 1322
20 25 20-24 5 7 2004 8032
25 30 25-29 5 8 2004 18556
30 35 30-34 5 9 2004 18907
35 40 35-39 5 10 2004 7365
40 45 40-44 5 11 2004 1211
45 50 45-49 5 12 2004 37
0 -1 Total -1 184 2004 55434
-2 -2 Unknown -2 185 NA 0

We then identify the start age of the open age group needed to close the series 9, flag whether this open age group exists in the series and drop records for open age groups that do not close the series.


oag_start <- new_df %>% dd_oag_agestart()

oag_check <- paste0(oag_start, "+") %in% new_df$AgeLabel

if (!is_empty(oag_start)) {
  new_df <- new_df %>%
    dplyr::filter(!(AgeStart > 0 & AgeSpan == -1 & AgeStart != oag_start))
}

new_df %>% tab_output()
AgeStart AgeEnd AgeLabel AgeSpan AgeSort DataSourceYear DataValue
10 15 10-14 5 5 NA 4
15 20 15-19 5 6 2004 1322
20 25 20-24 5 7 2004 8032
25 30 25-29 5 8 2004 18556
30 35 30-34 5 9 2004 18907
35 40 35-39 5 10 2004 7365
40 45 40-44 5 11 2004 1211
45 50 45-49 5 12 2004 37
0 -1 Total -1 184 2004 55434
-2 -2 Unknown -2 185 NA 0

We also check that there are no missing age groups on the abridged series and if so, we compute all possible open age groups 10 given available input and append the open age group that completes the abridged series. We also check again to see whether any open age group exists.

## Check that there are no missing age groups on the abridged series
if (nrow(new_df[new_df$AgeStart >= 5, ]) > 0) {
  check_abr <- DemoTools::is_abridged(new_df$AgeStart[new_df$AgeStart >= 5])
} else {
  check_abr <- FALSE
}

if (check_abr == TRUE) {

  ## Compute all possible open age groups given available input
  abr_oag <- dd_oag_compute(new_df, age_span = 5)

  ## Append the oag that completes the abridged series
  new_df <- new_df %>%
    bind_rows(abr_oag[!(abr_oag$AgeLabel %in% new_df$AgeLabel) &
      abr_oag$AgeStart == oag_start, ]) %>%
    arrange(AgeSort)
}


## Check again whether any open age group exists
oag_check <- paste0(oag_start, "+") %in% new_df$AgeLabel
oag_check
#> [1] TRUE

new_df %>% tab_output()
AgeStart AgeEnd AgeLabel AgeSpan AgeSort DataSourceYear DataValue
10 15 10-14 5 5 NA 4
15 20 15-19 5 6 2004 1322
20 25 20-24 5 7 2004 8032
25 30 25-29 5 8 2004 18556
30 35 30-34 5 9 2004 18907
35 40 35-39 5 10 2004 7365
40 45 40-44 5 11 2004 1211
45 50 45-49 5 12 2004 37
50 0 50+ -1 167 NA 0
0 -1 Total -1 184 2004 55434
-2 -2 Unknown -2 185 NA 0

If “Total” is missing and the series is otherwise complete, we compute it .

if (!("Total" %in% new_df$AgeLabel) & "0-4" %in% new_df$AgeLabel & oag_check == TRUE) {
  new_df <- new_df %>%
    bind_rows(data.frame(
      AgeStart = 0,
      AgeEnd = -1,
      AgeLabel = "Total",
      AgeSpan = -1,
      AgeSort = 184,
      DataSourceYear = NA,
      DataValue = sum(new_df$DataValue[new_df$AgeSpan == 5]) +
        new_df$DataValue[new_df$AgeSpan == -1 & new_df$AgeStart == oag_start] +
        new_df$DataValue[new_df$AgeLabel == "Unknown"]
    ))
}

new_df %>% tab_output()
AgeStart AgeEnd AgeLabel AgeSpan AgeSort DataSourceYear DataValue
10 15 10-14 5 5 NA 4
15 20 15-19 5 6 2004 1322
20 25 20-24 5 7 2004 8032
25 30 25-29 5 8 2004 18556
30 35 30-34 5 9 2004 18907
35 40 35-39 5 10 2004 7365
40 45 40-44 5 11 2004 1211
45 50 45-49 5 12 2004 37
50 0 50+ -1 167 NA 0
0 -1 Total -1 184 2004 55434
-2 -2 Unknown -2 185 NA 0

Finally, a note is generated to alert the user about missing data and this process is repeated for all the sex ids (in the case of deaths data).

new_df$note <- NA
if (check_abr == FALSE | oag_check == FALSE) {
  abr$note <- "The abridged series is missing data for one or more age groups."
}
if (!("0" %in% new_df$AgeLabel & "1-4" %in% new_df$AgeLabel & "0-4" %in% new_df$AgeLabel)) {
  new_df$note <- "The abridged series is missing data for one or more age groups."
}

new_df %>% tab_output()
AgeStart AgeEnd AgeLabel AgeSpan AgeSort DataSourceYear DataValue note
10 15 10-14 5 5 NA 4 The abridged series is missing data for one or more age groups.
15 20 15-19 5 6 2004 1322 The abridged series is missing data for one or more age groups.
20 25 20-24 5 7 2004 8032 The abridged series is missing data for one or more age groups.
25 30 25-29 5 8 2004 18556 The abridged series is missing data for one or more age groups.
30 35 30-34 5 9 2004 18907 The abridged series is missing data for one or more age groups.
35 40 35-39 5 10 2004 7365 The abridged series is missing data for one or more age groups.
40 45 40-44 5 11 2004 1211 The abridged series is missing data for one or more age groups.
45 50 45-49 5 12 2004 37 The abridged series is missing data for one or more age groups.
50 0 50+ -1 167 NA 0 The abridged series is missing data for one or more age groups.
0 -1 Total -1 184 2004 55434 The abridged series is missing data for one or more age groups.
-2 -2 Unknown -2 185 NA 0 The abridged series is missing data for one or more age groups.