Harmonizing five-year counts • rddharmony

The purpose of this process is to standardizes/harmonize five-year vital and population counts taken from the UNDP database.

There are two functions involved. DDharmonize_Vitals5 ¹ is implemented on births and deaths data while DDharmonize_Pop5 is implemented on population data.

The functions are defined as follows:

DDharmonize_Vitals5(indata, type = c(“births”,“deaths”))

DDharmonize_Pop5(indata)

In this vignette, we will use the vitals5_df dataset that is embedded on this package, and the DDharmonize_Vitals5() function to show how five-year birth counts are harmonized. This dataset represents births by age of mother and sex of child in Norway for the year 2002 and for cases where SexID == “Both sexes”.

## Load the packages required
library(rddharmony)
library(kableExtra)
library(dplyr)
library(purrr)

## Create a function to be used to generate the table output
tab_output <- function(tab) {
  kable(tab, booktabs = TRUE, align = "c", table.envir = "capctable", longtable = TRUE) %>%
    kable_styling() %>%
    row_spec(0, bold = T, color = "white", background = "#6ebed8") %>%
    kable_paper(html_font = "helvetica") %>%
    scroll_box(width = "100%", height = "300px")
}

The data, with only a few variables displayed, is shown below.

new_df <- vitals5_df %>%
  select(LocID, LocName, DataSourceYear, DataStatusName, SexName, SexID, DataSourceYear, TimeLabel, TimeMid, starts_with("Age"), DataValue, SeriesID) %>%
  select(-agesort) %>%
  arrange(AgeLabel)

new_df %>% tab_output()

LocID	LocName	DataSourceYear	DataStatusName	SexName	SexID	TimeLabel	TimeMid	AgeID	AgeUnit	AgeStart	AgeEnd	AgeSpan	AgeMid	AgeLabel	AgeSort	DataValue	SeriesID
578	Norway	2005	Final	Both sexes	3	2002	2002.5	3222	Year	0	15	15	7.5	0-14	9	4	1.106610e+14
578	Norway	2004	Final	Both sexes	3	2002	2002.5	30504	Year	12	15	3	14.5	12-14	106	4	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	304	Year	15	20	5	17.5	15-19	126	1322	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	305	Year	20	25	5	22.5	20-24	172	8032	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	306	Year	25	30	5	27.5	25-29	195	18556	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	307	Year	30	35	5	32.5	30-34	211	18907	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	308	Year	35	40	5	37.5	35-39	232	7365	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	309	Year	40	45	5	42.5	40-44	249	1211	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	310	Year	45	50	5	47.5	45-49	263	37	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	55434	7.878205e+12
578	Norway	2005	Final	Both sexes	3	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	4	1.106610e+14

The function should ideally be run over each of the specific SexIDs (Males (1), Females (2) and Both sexes (3)), but when it is applied to births data, only “Both sexes” counts are considered. It is also implemented in cases where we have at least one abridged age group.

Before running this function, we need to ensure that all the age labels in the data have an AgeSpan >= 5.

new_df <- new_df %>%
  dplyr::filter(AgeSpan %in% c(-2, -1) | AgeSpan >= 5)
new_df %>% tab_output()

LocID	LocName	DataSourceYear	DataStatusName	SexName	SexID	TimeLabel	TimeMid	AgeID	AgeUnit	AgeStart	AgeEnd	AgeSpan	AgeMid	AgeLabel	AgeSort	DataValue	SeriesID
578	Norway	2005	Final	Both sexes	3	2002	2002.5	3222	Year	0	15	15	7.5	0-14	9	4	1.106610e+14
578	Norway	2004	Final	Both sexes	3	2002	2002.5	304	Year	15	20	5	17.5	15-19	126	1322	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	305	Year	20	25	5	22.5	20-24	172	8032	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	306	Year	25	30	5	27.5	25-29	195	18556	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	307	Year	30	35	5	32.5	30-34	211	18907	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	308	Year	35	40	5	37.5	35-39	232	7365	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	309	Year	40	45	5	42.5	40-44	249	1211	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	310	Year	45	50	5	47.5	45-49	263	37	7.878205e+12
578	Norway	2004	Final	Both sexes	3	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	55434	7.878205e+12
578	Norway	2005	Final	Both sexes	3	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	4	1.106610e+14

We begin by subsetting the data to ensure that we only have “Both sexes” data and none of the data values are missing.

new_df <- new_df %>%
  dplyr::filter(SexID == 3 & !is.na(DataValue)) %>%
  select(-SexID) %>%
  distinct()

new_df %>% tab_output()

LocID	LocName	DataSourceYear	DataStatusName	SexName	TimeLabel	TimeMid	AgeID	AgeUnit	AgeStart	AgeEnd	AgeSpan	AgeMid	AgeLabel	AgeSort	DataValue	SeriesID
578	Norway	2005	Final	Both sexes	2002	2002.5	3222	Year	0	15	15	7.5	0-14	9	4	1.106610e+14
578	Norway	2004	Final	Both sexes	2002	2002.5	304	Year	15	20	5	17.5	15-19	126	1322	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	305	Year	20	25	5	22.5	20-24	172	8032	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	306	Year	25	30	5	27.5	25-29	195	18556	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	307	Year	30	35	5	32.5	30-34	211	18907	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	308	Year	35	40	5	37.5	35-39	232	7365	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	309	Year	40	45	5	42.5	40-44	249	1211	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	310	Year	45	50	5	47.5	45-49	263	37	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	55434	7.878205e+12
578	Norway	2005	Final	Both sexes	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	4	1.106610e+14

We then check if “Final” data status is available and if so, we prioritize this over the provisional records.

if ("Final" %in% unique(new_df$DataStatusName)) {
  new_df <- new_df %>%
    dplyr::filter(DataStatusName == "Final")
}

new_df %>% tab_output()

LocID	LocName	DataSourceYear	DataStatusName	SexName	TimeLabel	TimeMid	AgeID	AgeUnit	AgeStart	AgeEnd	AgeSpan	AgeMid	AgeLabel	AgeSort	DataValue	SeriesID
578	Norway	2005	Final	Both sexes	2002	2002.5	3222	Year	0	15	15	7.5	0-14	9	4	1.106610e+14
578	Norway	2004	Final	Both sexes	2002	2002.5	304	Year	15	20	5	17.5	15-19	126	1322	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	305	Year	20	25	5	22.5	20-24	172	8032	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	306	Year	25	30	5	27.5	25-29	195	18556	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	307	Year	30	35	5	32.5	30-34	211	18907	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	308	Year	35	40	5	37.5	35-39	232	7365	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	309	Year	40	45	5	42.5	40-44	249	1211	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	310	Year	45	50	5	47.5	45-49	263	37	7.878205e+12
578	Norway	2004	Final	Both sexes	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	55434	7.878205e+12
578	Norway	2005	Final	Both sexes	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	4	1.106610e+14

We proceed to check for multiple series ids and for each unique series id, we check whether it is a full series ² with all age groups represented and an open age greater than 60.

ids_series <- unique(new_df$SeriesID)
n_series <- length(ids_series)

# for each unique series,
df_out <- NULL

for (i in 1:n_series) {

  ## Filter the data for a specific SeriesID
  df <- new_df %>% dplyr::filter(SeriesID == ids_series[i])

  ## Check whether it is a full series with all age groups represented and an open age greater than 60
  ## AgeSpan == -1 (Total), AgeSpan == -2 (Unknown), AgeSpan == 5(five-year)

  df_abr_std <- df[(df$AgeStart == 0 & df$AgeSpan == 1) |
    df$AgeSpan %in% c(-1, -2, 5), ]

  if (nrow(df_abr_std) > 13) {
    df$check_full <- dd_series_isfull(df_abr_std, abridged = TRUE)
  } else {
    df$check_full <- FALSE
  }
  df_out <- rbind(df_out, df)
}
new_df <- df_out
rm(df_out)

new_df %>% tab_output()

LocID	LocName	DataSourceYear	DataStatusName	SexName	TimeLabel	TimeMid	AgeID	AgeUnit	AgeStart	AgeEnd	AgeSpan	AgeMid	AgeLabel	AgeSort	DataValue	SeriesID	check_full
578	Norway	2005	Final	Both sexes	2002	2002.5	3222	Year	0	15	15	7.5	0-14	9	4	1.106610e+14	FALSE
578	Norway	2005	Final	Both sexes	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	4	1.106610e+14	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	304	Year	15	20	5	17.5	15-19	126	1322	7.878205e+12	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	305	Year	20	25	5	22.5	20-24	172	8032	7.878205e+12	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	306	Year	25	30	5	27.5	25-29	195	18556	7.878205e+12	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	307	Year	30	35	5	32.5	30-34	211	18907	7.878205e+12	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	308	Year	35	40	5	37.5	35-39	232	7365	7.878205e+12	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	309	Year	40	45	5	42.5	40-44	249	1211	7.878205e+12	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	310	Year	45	50	5	47.5	45-49	263	37	7.878205e+12	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	55434	7.878205e+12	FALSE

As can be seen in this data, we have situations were we have multiple total labels for the same series. This may have occurred when a country updates an age label and reports the total for that update. If there exists more than one total reported and there is a one that is equal to the computed total (summation over the age labels), drop the others to be left with this one ³.

new_df <- new_df %>% dd_multiple_totals()
new_df %>% tab_output()

LocID	LocName	DataSourceYear	DataStatusName	SexName	TimeLabel	TimeMid	AgeID	AgeUnit	AgeStart	AgeEnd	AgeSpan	AgeMid	AgeLabel	AgeSort	DataValue	SeriesID	check_full	counter	misplaced_oag
578	Norway	2005	Final	Both sexes	2002	2002.5	3222	Year	0	15	15	7.5	0-14	9	4	1.106610e+14	FALSE	2	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	304	Year	15	20	5	17.5	15-19	126	1322	7.878205e+12	FALSE	1	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	305	Year	20	25	5	22.5	20-24	172	8032	7.878205e+12	FALSE	1	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	306	Year	25	30	5	27.5	25-29	195	18556	7.878205e+12	FALSE	1	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	307	Year	30	35	5	32.5	30-34	211	18907	7.878205e+12	FALSE	1	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	308	Year	35	40	5	37.5	35-39	232	7365	7.878205e+12	FALSE	1	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	309	Year	40	45	5	42.5	40-44	249	1211	7.878205e+12	FALSE	1	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	310	Year	45	50	5	47.5	45-49	263	37	7.878205e+12	FALSE	1	FALSE
578	Norway	2004	Final	Both sexes	2002	2002.5	398	Year	0	-1	-1	-1.0	Total	999	55434	7.878205e+12	FALSE	2	FALSE

We then check the data to ensure that we have the latest data source year for each of the age labels ⁴.

# if there is more than one series ...
if (n_series > 1) {
  latest_source_year <- max(new_df$DataSourceYear)
  check_latest_full <- unique(new_df$check_full[new_df$DataSourceYear == latest_source_year])
  # ... and latest series is full then keep only that one
  if (any(check_latest_full)) {
    new_df <- new_df[new_df$DataSourceYear == latest_source_year, ]
  } else {
    # ... and latest series is not full, then keep the latest data source record for each age
    new_df <- new_df %>% dd_latest_source_year()
  }
}

We proceed to tidy up the data frame and in cases where we do not have an “Unknown” age label present, we generate one and set the data value to 0.

new_df <- new_df %>%
  select(DataSourceYear, AgeStart, AgeEnd, AgeLabel, AgeSpan, DataValue) %>%
  distinct()

if (!("Unknown" %in% new_df$AgeLabel)) {
  new_df <- new_df %>%
    bind_rows(data.frame(
      AgeStart = -2,
      AgeEnd = -2,
      AgeSpan = -2,
      AgeLabel = "Unknown",
      DataSourceYear = NA,
      DataValue = 0
    ))
}

We have situations where the “Unknown” value represents the difference between the reported total and the summation of the data values (computed totals) and in such cases, we also set the “Unknown” data value to 0 ⁵.

new_df <- new_df %>% dd_drop_unknowns()

We also reconcile early age groups for births, for example in this case, we have 0-14 and since the minimum age should be 10, we set this to 10-14 ⁶. We also reconcile ages 0-1, 1-4 and 0-4 such that if 0-4 is missing and 0-1 and 1-4 are present, we sum both to get 0-4 ⁷.

new_df <- new_df %>%
  dd_firstages_compute_births() %>%
  select(-AgeSort)

new_df <- new_df %>% dd_firstages_compute()
new_df %>% tab_output()

AgeStart	AgeEnd	AgeLabel	AgeSpan	AgeSort	DataSourceYear	DataValue
10	15	10-14	5	5	NA	4
15	20	15-19	5	6	2004	1322
20	25	20-24	5	7	2004	8032
25	30	25-29	5	8	2004	18556
30	35	30-34	5	9	2004	18907
35	40	35-39	5	10	2004	7365
40	45	40-44	5	11	2004	1211
45	50	45-49	5	12	2004	37
0	-1	Total	-1	184	2004	55434
-2	-2	Unknown	-2	185	NA	0

We proceed to check whether there are multiple open age groups (TRUE/FALSE), and if so, compute closed age groups from multiple open age groups and add to the data if missing ⁸.

oag_multi <- new_df %>% dd_oag_multiple()

if (oag_multi) {
  add <- new_df %>%
    dd_oag2closed() %>%
    dplyr::filter(!(AgeLabel %in% new_df$AgeLabel[!is.na(new_df$DataValue)]))

  if (nrow(add > 0)) {
    new_df <- new_df %>%
      bind_rows(add) %>%
      arrange(AgeStart)
  }
}

new_df %>% tab_output()

AgeStart	AgeEnd	AgeLabel	AgeSpan	AgeSort	DataSourceYear	DataValue
10	15	10-14	5	5	NA	4
15	20	15-19	5	6	2004	1322
20	25	20-24	5	7	2004	8032
25	30	25-29	5	8	2004	18556
30	35	30-34	5	9	2004	18907
35	40	35-39	5	10	2004	7365
40	45	40-44	5	11	2004	1211
45	50	45-49	5	12	2004	37
0	-1	Total	-1	184	2004	55434
-2	-2	Unknown	-2	185	NA	0

We then identify the start age of the open age group needed to close the series ⁹, flag whether this open age group exists in the series and drop records for open age groups that do not close the series.


oag_start <- new_df %>% dd_oag_agestart()

oag_check <- paste0(oag_start, "+") %in% new_df$AgeLabel

if (!is_empty(oag_start)) {
  new_df <- new_df %>%
    dplyr::filter(!(AgeStart > 0 & AgeSpan == -1 & AgeStart != oag_start))
}

new_df %>% tab_output()

AgeStart	AgeEnd	AgeLabel	AgeSpan	AgeSort	DataSourceYear	DataValue
10	15	10-14	5	5	NA	4
15	20	15-19	5	6	2004	1322
20	25	20-24	5	7	2004	8032
25	30	25-29	5	8	2004	18556
30	35	30-34	5	9	2004	18907
35	40	35-39	5	10	2004	7365
40	45	40-44	5	11	2004	1211
45	50	45-49	5	12	2004	37
0	-1	Total	-1	184	2004	55434
-2	-2	Unknown	-2	185	NA	0

We also check that there are no missing age groups on the abridged series and if so, we compute all possible open age groups ¹⁰ given available input and append the open age group that completes the abridged series. We also check again to see whether any open age group exists.

## Check that there are no missing age groups on the abridged series
if (nrow(new_df[new_df$AgeStart >= 5, ]) > 0) {
  check_abr <- DemoTools::is_abridged(new_df$AgeStart[new_df$AgeStart >= 5])
} else {
  check_abr <- FALSE
}

if (check_abr == TRUE) {

  ## Compute all possible open age groups given available input
  abr_oag <- dd_oag_compute(new_df, age_span = 5)

  ## Append the oag that completes the abridged series
  new_df <- new_df %>%
    bind_rows(abr_oag[!(abr_oag$AgeLabel %in% new_df$AgeLabel) &
      abr_oag$AgeStart == oag_start, ]) %>%
    arrange(AgeSort)
}


## Check again whether any open age group exists
oag_check <- paste0(oag_start, "+") %in% new_df$AgeLabel
oag_check
#> [1] TRUE

new_df %>% tab_output()

AgeStart	AgeEnd	AgeLabel	AgeSpan	AgeSort	DataSourceYear	DataValue
10	15	10-14	5	5	NA	4
15	20	15-19	5	6	2004	1322
20	25	20-24	5	7	2004	8032
25	30	25-29	5	8	2004	18556
30	35	30-34	5	9	2004	18907
35	40	35-39	5	10	2004	7365
40	45	40-44	5	11	2004	1211
45	50	45-49	5	12	2004	37
50	0	50+	-1	167	NA	0
0	-1	Total	-1	184	2004	55434
-2	-2	Unknown	-2	185	NA	0

If “Total” is missing and the series is otherwise complete, we compute it .

if (!("Total" %in% new_df$AgeLabel) & "0-4" %in% new_df$AgeLabel & oag_check == TRUE) {
  new_df <- new_df %>%
    bind_rows(data.frame(
      AgeStart = 0,
      AgeEnd = -1,
      AgeLabel = "Total",
      AgeSpan = -1,
      AgeSort = 184,
      DataSourceYear = NA,
      DataValue = sum(new_df$DataValue[new_df$AgeSpan == 5]) +
        new_df$DataValue[new_df$AgeSpan == -1 & new_df$AgeStart == oag_start] +
        new_df$DataValue[new_df$AgeLabel == "Unknown"]
    ))
}

new_df %>% tab_output()

AgeStart	AgeEnd	AgeLabel	AgeSpan	AgeSort	DataSourceYear	DataValue
10	15	10-14	5	5	NA	4
15	20	15-19	5	6	2004	1322
20	25	20-24	5	7	2004	8032
25	30	25-29	5	8	2004	18556
30	35	30-34	5	9	2004	18907
35	40	35-39	5	10	2004	7365
40	45	40-44	5	11	2004	1211
45	50	45-49	5	12	2004	37
50	0	50+	-1	167	NA	0
0	-1	Total	-1	184	2004	55434
-2	-2	Unknown	-2	185	NA	0

Finally, a note is generated to alert the user about missing data and this process is repeated for all the sex ids (in the case of deaths data).

new_df$note <- NA
if (check_abr == FALSE | oag_check == FALSE) {
  abr$note <- "The abridged series is missing data for one or more age groups."
}
if (!("0" %in% new_df$AgeLabel & "1-4" %in% new_df$AgeLabel & "0-4" %in% new_df$AgeLabel)) {
  new_df$note <- "The abridged series is missing data for one or more age groups."
}

new_df %>% tab_output()

AgeStart	AgeEnd	AgeLabel	AgeSpan	AgeSort	DataSourceYear	DataValue	note
10	15	10-14	5	5	NA	4	The abridged series is missing data for one or more age groups.
15	20	15-19	5	6	2004	1322	The abridged series is missing data for one or more age groups.
20	25	20-24	5	7	2004	8032	The abridged series is missing data for one or more age groups.
25	30	25-29	5	8	2004	18556	The abridged series is missing data for one or more age groups.
30	35	30-34	5	9	2004	18907	The abridged series is missing data for one or more age groups.
35	40	35-39	5	10	2004	7365	The abridged series is missing data for one or more age groups.
40	45	40-44	5	11	2004	1211	The abridged series is missing data for one or more age groups.
45	50	45-49	5	12	2004	37	The abridged series is missing data for one or more age groups.
50	0	50+	-1	167	NA	0	The abridged series is missing data for one or more age groups.
0	-1	Total	-1	184	2004	55434	The abridged series is missing data for one or more age groups.
-2	-2	Unknown	-2	185	NA	0	The abridged series is missing data for one or more age groups.