This function is used when both an abridged and a complete series are available. Its signature is:

DDharmonize_AbridgedAndComplete(data_abr, data_cpl_from_abr, data_cpl)

To show how it works, we shall use the data_abr and data_cpl datasets that are bundled with this package. For this vignette, we shall assume that data_cpl_from_abr is NULL, i.e. we do not have complete records obtained from the abridged series. This is very common in deaths data.

## Load the packages required
library(rddharmony)
library(kableExtra)
library(dplyr)
library(purrr)

## Create a function to be used to generate the table output
tab_output <- function(tab) {
  kable(tab, booktabs = TRUE, align = "c", table.envir = "capctable", longtable = TRUE, row.names = FALSE) %>%
    kable_styling() %>%
    row_spec(0, bold = TRUE, color = "white", background = "#6ebed8") %>%
    kable_paper(html_font = "helvetica") %>%
    scroll_box(width = "100%", height = "300px")
}

The abridged data consists of 5-year age groups ranging from “10-14” to “50+”, with a total value of 114870.

data_abr %>%
  select(-note) %>%
  tab_output()
AgeStart AgeEnd AgeLabel AgeSpan AgeSort DataSourceYear DataValue SexID abridged complete series
10 15 10-14 5 5 NA 4 3 TRUE FALSE abridged
15 20 15-19 5 6 2017 1124 3 TRUE FALSE abridged
20 25 20-24 5 7 2017 14048 3 TRUE FALSE abridged
25 30 25-29 5 8 2017 36525 3 TRUE FALSE abridged
30 35 30-34 5 9 2017 38273 3 TRUE FALSE abridged
35 40 35-39 5 10 2017 20046 3 TRUE FALSE abridged
40 45 40-44 5 11 2017 4544 3 TRUE FALSE abridged
45 50 45-49 5 12 2017 278 3 TRUE FALSE abridged
50 0 50+ -1 167 2017 27 3 TRUE FALSE abridged
0 -1 Total -1 184 2017 114870 3 TRUE FALSE abridged
-2 -2 Unknown -2 185 2017 1 3 TRUE FALSE abridged

The complete series consists of single years of age ranging from “10” to “50+”, also with a total value of 114870.

data_cpl %>%
  select(-note) %>%
  tab_output()
AgeStart AgeEnd AgeLabel AgeSpan AgeSort DataSourceYear DataValue SexID abridged complete series
10 11 10 1 43 NA 0.0 3 FALSE TRUE complete
11 12 11 1 43 NA 0.0 3 FALSE TRUE complete
12 13 12 1 43 NA 0.2 3 FALSE TRUE complete
13 14 13 1 43 NA 1.0 3 FALSE TRUE complete
14 15 14 1 43 NA 2.8 3 FALSE TRUE complete
15 16 15 1 43 2017 14.0 3 FALSE TRUE complete
16 17 16 1 44 2017 55.0 3 FALSE TRUE complete
17 18 17 1 45 2017 149.0 3 FALSE TRUE complete
18 19 18 1 46 2017 287.0 3 FALSE TRUE complete
19 20 19 1 47 2017 619.0 3 FALSE TRUE complete
20 21 20 1 48 2017 1174.0 3 FALSE TRUE complete
21 22 21 1 49 2017 1759.0 3 FALSE TRUE complete
22 23 22 1 50 2017 2717.0 3 FALSE TRUE complete
23 24 23 1 51 2017 3615.0 3 FALSE TRUE complete
24 25 24 1 52 2017 4783.0 3 FALSE TRUE complete
25 26 25 1 53 2017 6069.0 3 FALSE TRUE complete
26 27 26 1 54 2017 6803.0 3 FALSE TRUE complete
27 28 27 1 55 2017 7360.0 3 FALSE TRUE complete
28 29 28 1 56 2017 8029.0 3 FALSE TRUE complete
29 30 29 1 57 2017 8264.0 3 FALSE TRUE complete
30 31 30 1 58 2017 8482.0 3 FALSE TRUE complete
31 32 31 1 59 2017 8149.0 3 FALSE TRUE complete
32 33 32 1 60 2017 7756.0 3 FALSE TRUE complete
33 34 33 1 61 2017 7116.0 3 FALSE TRUE complete
34 35 34 1 62 2017 6770.0 3 FALSE TRUE complete
35 36 35 1 63 2017 5919.0 3 FALSE TRUE complete
36 37 36 1 64 2017 4805.0 3 FALSE TRUE complete
37 38 37 1 65 2017 3771.0 3 FALSE TRUE complete
38 39 38 1 66 2017 3187.0 3 FALSE TRUE complete
39 40 39 1 67 2017 2364.0 3 FALSE TRUE complete
40 41 40 1 68 2017 1853.0 3 FALSE TRUE complete
41 42 41 1 69 2017 1220.0 3 FALSE TRUE complete
42 43 42 1 70 2017 802.0 3 FALSE TRUE complete
43 44 43 1 71 2017 432.0 3 FALSE TRUE complete
44 45 44 1 72 2017 237.0 3 FALSE TRUE complete
45 46 45 1 73 2017 129.0 3 FALSE TRUE complete
46 47 46 1 74 2017 75.0 3 FALSE TRUE complete
47 48 47 1 75 2017 35.0 3 FALSE TRUE complete
48 49 48 1 76 2017 19.0 3 FALSE TRUE complete
49 50 49 1 77 2017 20.0 3 FALSE TRUE complete
50 0 50+ -1 167 2017 27.0 3 FALSE TRUE complete
0 -1 Total -1 184 2017 114870.0 3 FALSE TRUE complete
-2 -2 Unknown -2 185 2017 1.0 3 FALSE TRUE complete

As noted above, we assume that we do not have complete records obtained from abridged series, so we set this to NULL.

data_cpl_from_abr <- NULL

The function is meant to be run once for each SexID, but in this vignette we only handle SexID == 3 (both sexes).

sex <- 3
abr_sex <- NULL ## will contain the final abridged records
cpl_sex <- NULL ## will contain the final complete records

The first step in this function involves creating flags to check whether each of the datasets listed in the function arguments is available.


has_abr <- nrow(data_abr[!is.na(data_abr$DataValue) & data_abr$SexID == sex, ]) > 0
has_abr <- ifelse(is_empty(has_abr), FALSE, has_abr)
has_cpl_from_abr <- nrow(data_cpl_from_abr[!is.na(data_cpl_from_abr$DataValue) & data_cpl_from_abr$SexID == sex, ]) > 0
has_cpl_from_abr <- ifelse(is_empty(has_cpl_from_abr), FALSE, has_cpl_from_abr)
has_cpl <- nrow(data_cpl[!is.na(data_cpl$DataValue) & data_cpl$SexID == sex, ]) > 0
has_cpl <- ifelse(is_empty(has_cpl), FALSE, has_cpl)

cat("has_abr: ", has_abr, "\n")
#> has_abr:  TRUE
cat("has_cpl: ", has_cpl, "\n")
#> has_cpl:  TRUE
cat("has_cpl_from_abr: ", has_cpl_from_abr, "\n")
#> has_cpl_from_abr:  FALSE
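The three flags above repeat the same pattern, so the check can be sketched as a small helper. `has_records()` is a hypothetical name used here for illustration only; it is not part of rddharmony, and it assumes the input has `DataValue` and `SexID` columns as in the datasets above.

```r
# Hypothetical helper (not part of rddharmony): TRUE when `data` holds at least
# one non-missing DataValue for the requested sex, FALSE otherwise.
# A NULL or empty input is treated as "no records".
has_records <- function(data, sex) {
  if (is.null(data) || nrow(data) == 0) {
    return(FALSE)
  }
  any(!is.na(data$DataValue) & data$SexID == sex)
}

# Toy input, invented for this example
flag_demo <- data.frame(SexID = c(3, 3, 1), DataValue = c(10, NA, 5))

has_records(flag_demo, sex = 3) # TRUE: one non-missing value for SexID 3
has_records(flag_demo, sex = 2) # FALSE: no rows for SexID 2
has_records(NULL, sex = 3)      # FALSE: nothing supplied
```

Handling NULL up front avoids relying on `nrow(NULL)` returning NULL, which is what makes the extra `is_empty()` step necessary in the code above.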

If the abridged series exists, keep the records for the sex being processed and extract the data value for AgeLabel == “Total”.

if (has_abr) {
  df_abr <- data_abr %>%
    dplyr::filter(SexID == sex) %>%
    select(-note, -SexID)
  total_abr <- df_abr$DataValue[df_abr$AgeLabel == "Total"]
} else {
  df_abr <- NULL
}
total_abr
#> [1] 114870
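As a quick sanity check on the extracted total, the age-group values in the abridged table can be added up by hand. The sketch below hard-codes the `DataValue` entries shown earlier rather than reading them from `data_abr`, so it runs without the package.

```r
# DataValue entries copied from the abridged table ("10-14" through "50+")
age_groups <- c(4, 1124, 14048, 36525, 38273, 20046, 4544, 278, 27)
unknown <- 1            # the "Unknown" row
reported_total <- 114870 # the "Total" row

# The age groups plus the unknown count reproduce the reported total
sum(age_groups)                              # 114869
sum(age_groups) + unknown == reported_total  # TRUE
```

In this dataset the reported total therefore includes the “Unknown” record.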

If complete records obtained from abridged series exist, extract them for each of the Sex Ids. We know that this does not exist in our case so df_cpl_from_abr will be NULL.

if (has_cpl_from_abr) {
  df_cpl_from_abr <- data_cpl_from_abr %>%
    dplyr::filter(SexID == sex) %>%
    select(-note, -SexID)
} else {
  df_cpl_from_abr <- NULL
}

If the complete series exists, subset it to the records for the sex being processed, extract the data value for AgeLabel == “Total” and derive abridged records from it. If the complete series does not exist but complete records obtained from the abridged series (df_cpl_from_abr) do, use those as the complete series and generate abridged records from them.

if (has_cpl) {
  df_cpl <- data_cpl %>%
    dplyr::filter(SexID == sex) %>%
    select(-note, -SexID)
  total_cpl <- df_cpl$DataValue[df_cpl$AgeLabel == "Total"]
  df_abr_from_cpl <- df_cpl %>% dd_single2abridged()
} else { # if no data_cpl, df_cpl will be equal to df_cpl_from_abr (complete records obtained from abridged)
  if (has_cpl_from_abr) {
    # use data_cpl_from_abr
    if (all(data_cpl_from_abr$AgeSpan < 0)) {
      data_cpl_from_abr <- NULL
    }
    df_cpl <- df_cpl_from_abr
    if (!is.null(df_cpl)) {
      total_cpl <- df_cpl$DataValue[df_cpl$AgeLabel == "Total"]
      df_abr_from_cpl <- df_cpl %>% dd_single2abridged()
      has_cpl <- TRUE
    }
  } else {
    df_cpl <- NULL
    df_abr_from_cpl <- NULL
  }
}
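To make the idea of deriving abridged records from a complete series concrete, here is a minimal base R sketch that sums single-year values into 5-year groups. It is an illustration under simplifying assumptions, not the implementation of `dd_single2abridged()`: it only handles closed single-year ages and ignores the open age group, “Total” and “Unknown” records.

```r
# Single-year DataValue entries for ages 10 to 24, copied from the complete table
single <- data.frame(
  AgeStart  = 10:24,
  DataValue = c(0, 0, 0.2, 1, 2.8, 14, 55, 149, 287, 619,
                1174, 1759, 2717, 3615, 4783)
)

# Map each single age to the lower bound of its 5-year group, then sum by group
single$GroupStart <- 5 * (single$AgeStart %/% 5)
grouped <- aggregate(DataValue ~ GroupStart, data = single, FUN = sum)
grouped$AgeLabel <- paste0(grouped$GroupStart, "-", grouped$GroupStart + 4)

# Groups 10-14, 15-19 and 20-24 sum to 4, 1124 and 14048
grouped
```

These sums agree with the “10-14”, “15-19” and “20-24” rows of the abridged table, which is why the two series can be reconciled.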

Next, where both abridged and complete series exist and are not empty, we check if the totals match.

if (has_abr & has_cpl) {
  total_diff <- total_abr - total_cpl
  # total_match <- total_diff == 0
  total_match <- total_diff <= 0.5
  if (is_empty(total_match)) {
    total_match <- TRUE
  }
} else {
  total_match <- FALSE
}

if (exists("total_abr") & exists("total_cpl")) {
  total_match <- ifelse(is_empty(total_abr) & is_empty(total_cpl), FALSE, total_match)
}
total_match
#> [1] TRUE
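Note that the comparison `total_diff <= 0.5` shown above is one-sided: it is also TRUE whenever the abridged total is smaller than the complete total, however large the gap. If a symmetric tolerance is what is wanted, it could be sketched as below. `totals_agree()` is a hypothetical helper, not part of rddharmony, and it assumes both totals are single numbers.

```r
# Hypothetical helper: TRUE when two totals differ by at most `tol`
# in either direction. Assumes both inputs are single numeric values.
totals_agree <- function(total_a, total_b, tol = 0.5) {
  abs(total_a - total_b) <= tol
}

totals_agree(114870, 114870)   # TRUE: identical totals
totals_agree(114870, 114870.4) # TRUE: within the 0.5 tolerance
totals_agree(100, 114870)      # FALSE, whereas 100 - 114870 <= 0.5 is TRUE
```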

If they do, we reconcile the abridged series with records from the complete series. First, the abridged records derived from the complete series are appended to the abridged series, but only for age labels that the abridged series does not already contain, to avoid duplication. Open age groups that do not close the series are then dropped and all possible open age groups are computed. Records for open age groups that do not close the series are dropped once more, the series is checked for fullness, and if it is not full a note is added alerting the user that the series is missing some data.

if (total_match) {
  # append the abridged series with the abridged records derived from complete series, but only for the
  # age labels that do not exist in the former, to avoid duplication
  df_abr <- df_abr %>%
    bind_rows(df_abr_from_cpl %>% dplyr::filter(!(AgeLabel %in% df_abr$AgeLabel)))

  # drop records for open age groups that do not close the series
  oag_start_abr <- dd_oag_agestart(df_abr, multiple5 = TRUE)

  if (!is_empty(oag_start_abr)) {
    df_abr <- df_abr %>%
      dplyr::filter(!(AgeStart > 0 & AgeSpan == -1 & AgeStart != oag_start_abr))
  }

  # compute all possible open age groups
  oag_abr <- dd_oag_compute(df_abr, age_span = 5)
  if (!is.null(oag_abr)) {
    df_abr <- df_abr %>%
      bind_rows(oag_abr %>% dplyr::filter(!(AgeLabel %in% df_abr$AgeLabel)))
  }

  # drop records for open age groups that do not close the series
  oag_start_abr <- dd_oag_agestart(df_abr, multiple5 = TRUE)
  if (!is_empty(oag_start_abr)) {
    df_abr <- df_abr %>%
      dplyr::filter(!(AgeStart > 0 & AgeSpan == -1 & AgeStart != oag_start_abr)) %>%
      mutate(series = "abridged reconciled with complete") %>%
      arrange(AgeSort)
  }

  # check to see whether the series is full
  isfull_abr <- dd_series_isfull(df_abr, abridged = TRUE)

  df_abr$note <- ifelse(isfull_abr, NA, "The abridged series is missing data for one or more age groups.")
  df_abr$SexID <- sex
}
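The "append only the labels that are missing" step can be illustrated in isolation with base R. The two small data frames are cut-down stand-ins for `df_abr` and `df_abr_from_cpl`, with values taken from the tables above; the names `reported` and `derived` are made up for this example.

```r
# Cut-down stand-in for the reported abridged series
reported <- data.frame(
  AgeLabel = c("10-14", "15-19", "Total"),
  DataValue = c(4, 1124, 114870),
  stringsAsFactors = FALSE
)
# Cut-down stand-in for abridged records derived from the complete series
derived <- data.frame(
  AgeLabel = c("15-19", "20-24"),
  DataValue = c(1124, 14048),
  stringsAsFactors = FALSE
)

# Keep only derived rows whose label is not already reported, then stack
to_add <- derived[!(derived$AgeLabel %in% reported$AgeLabel), ]
combined <- rbind(reported, to_add)

combined$AgeLabel # "10-14" "15-19" "Total" "20-24"
```

Because “15-19” is already reported, only “20-24” is added, so reported values always take precedence over derived ones.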

Once that is done, we reconcile the complete series with records from the abridged series. We start by appending the complete records derived from the abridged series to the complete series, but only for age labels that do not already exist in the latter, to avoid duplication. Open age groups that do not close the series are dropped, all possible open age groups are computed, and the open age group whose starting age is a multiple of five is identified and kept, with single-year records at or above that age removed. A note is also generated alerting the user to any missing data. If the only remaining records in the complete series are “Unknown” or “Total”, the whole series is discarded.

if (total_match) {
  if (!is.null(df_cpl_from_abr)) {
    df_cpl <- df_cpl %>%
      bind_rows(df_cpl_from_abr %>% dplyr::filter(!(AgeLabel %in% df_cpl$AgeLabel)))
  }

  # only process if there are multiple closed age groups in the series
  if (nrow(df_cpl[df_cpl$AgeSpan == 1, ]) > 1) {

    # drop records for open age groups that do not close the series
    oag_start_cpl <- dd_oag_agestart(df_cpl, multiple5 = FALSE)
    df_cpl <- df_cpl %>%
      dplyr::filter(!(AgeStart > 0 & AgeSpan == -1 & AgeStart != oag_start_cpl))

    # compute all possible open age groups
    oag_cpl <- dd_oag_compute(df_cpl, age_span = 1)
    if (!is.null(oag_cpl)) {
      df_cpl <- df_cpl %>%
        bind_rows(oag_cpl %>% dplyr::filter(!(AgeLabel %in% df_cpl$AgeLabel)))
    }

    # identify the open age group that is a multiple of five
    oag_start_cpl <- dd_oag_agestart(df_cpl, multiple5 = TRUE)

    df_cpl <- df_cpl %>%
      dplyr::filter(!(AgeStart > 0 & AgeSpan == -1 & AgeStart != oag_start_cpl)) %>%
      dplyr::filter(!(AgeSpan == 1 & AgeStart >= oag_start_cpl)) %>%
      mutate(series = "complete reconciled with abridged") %>%
      arrange(AgeSort)

    isfull_cpl <- dd_series_isfull(df_cpl, abridged = FALSE)

    df_cpl$note <- ifelse(isfull_cpl, NA, "The complete series is missing data for one or more age groups.")
    df_cpl$SexID <- sex
  } else {
    df_cpl <- NULL
  }

  # if the only remaining record on complete is "Unknown" or "Total" then discard the whole series

  if (all(unique(df_cpl$AgeLabel) %in% c("Unknown", "Total"))) {
    df_cpl <- NULL
  }
}
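The open age group computation mentioned above can be pictured as summing every record at or above a chosen starting age. The sketch below shows that arithmetic only; `open_group_value()` is a hypothetical helper and should not be read as how `dd_oag_compute()` is implemented.

```r
# Last two closed groups and the reported open group, copied from the abridged table
tail_abr <- data.frame(
  AgeStart  = c(40, 45, 50),
  AgeSpan   = c(5, 5, -1),
  DataValue = c(4544, 278, 27)
)

# Hypothetical helper: value of the open age group starting at `start`,
# i.e. the sum of all records whose AgeStart is at or above `start`
open_group_value <- function(df, start) {
  sum(df$DataValue[df$AgeStart >= start])
}

open_group_value(tail_abr, 50) # 27, the reported "50+"
open_group_value(tail_abr, 45) # 278 + 27 = 305, a derived "45+"
open_group_value(tail_abr, 40) # 4544 + 278 + 27 = 4849, a derived "40+"
```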

In a case where totals do not match, a note is generated alerting the user about the mismatch.


if (!total_match) {
  if (!is.null(df_abr)) {
    df_abr$note <- "Different totals on abridged and complete preclude reconciliation"
    df_abr$SexID <- sex
  }
  if (!is.null(df_cpl)) {
    df_cpl$note <- "Different totals on abridged and complete preclude reconciliation"
    df_cpl$SexID <- sex
  }
}

This process is repeated for each SexID, and the abridged and complete indicator variables are then generated according to the type of series.

abr_sex <- rbind(abr_sex, df_abr)
cpl_sex <- rbind(cpl_sex, df_cpl)

## generate abridged and complete variables which are TRUE/FALSE depending on the series
if (!is.null(abr_sex)) {
  abr_sex <- abr_sex %>%
    mutate(
      abridged = TRUE,
      complete = FALSE
    )
}
if (!is.null(cpl_sex)) {
  cpl_sex <- cpl_sex %>%
    mutate(
      abridged = FALSE,
      complete = TRUE
    )
}
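The accumulation pattern used above, starting from NULL and calling `rbind()` once per sex, can be sketched as a loop. Everything in this block is illustrative: `reconcile_one_sex()` is a hypothetical stand-in for the reconciliation steps walked through in this vignette, and `sex_demo` is invented toy data.

```r
# Hypothetical stand-in for the per-sex steps: here it simply returns the
# rows for one sex, or NULL when there are none
reconcile_one_sex <- function(data, sex_id) {
  out <- data[data$SexID == sex_id, , drop = FALSE]
  if (nrow(out) == 0) {
    return(NULL)
  }
  out
}

# Toy input, invented for this example
sex_demo <- data.frame(SexID = c(1, 2, 3), DataValue = c(10, 20, 30))

# Start from NULL and append the result for each sex in turn;
# rbind() ignores NULL, so sexes without records are skipped silently
results <- NULL
for (s in c(1, 2, 3, 9)) {
  results <- rbind(results, reconcile_one_sex(sex_demo, s))
}

nrow(results) # 3: one row per sex present in the toy data
```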

Finally, both datasets (abridged records and complete records) are appended to form one master dataset.

outdata <- rbind(
  abr_sex,
  cpl_sex
)

outdata %>% tab_output()
AgeStart AgeEnd AgeLabel AgeSpan AgeSort DataSourceYear DataValue abridged complete series note SexID
10 15 10-14 5 5 NA 4.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
15 20 15-19 5 6 2017 1124.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
20 25 20-24 5 7 2017 14048.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
25 30 25-29 5 8 2017 36525.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
30 35 30-34 5 9 2017 38273.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
35 40 35-39 5 10 2017 20046.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
40 45 40-44 5 11 2017 4544.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
45 50 45-49 5 12 2017 278.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
50 0 50+ -1 167 2017 27.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
0 -1 Total -1 184 2017 114870.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
-2 -2 Unknown -2 185 2017 1.0 TRUE FALSE abridged reconciled with complete The abridged series is missing data for one or more age groups. 3
10 11 10 1 43 NA 0.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
11 12 11 1 43 NA 0.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
12 13 12 1 43 NA 0.2 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
13 14 13 1 43 NA 1.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
14 15 14 1 43 NA 2.8 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
15 16 15 1 43 2017 14.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
16 17 16 1 44 2017 55.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
17 18 17 1 45 2017 149.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
18 19 18 1 46 2017 287.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
19 20 19 1 47 2017 619.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
20 21 20 1 48 2017 1174.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
21 22 21 1 49 2017 1759.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
22 23 22 1 50 2017 2717.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
23 24 23 1 51 2017 3615.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
24 25 24 1 52 2017 4783.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
25 26 25 1 53 2017 6069.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
26 27 26 1 54 2017 6803.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
27 28 27 1 55 2017 7360.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
28 29 28 1 56 2017 8029.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
29 30 29 1 57 2017 8264.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
30 31 30 1 58 2017 8482.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
31 32 31 1 59 2017 8149.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
32 33 32 1 60 2017 7756.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
33 34 33 1 61 2017 7116.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
34 35 34 1 62 2017 6770.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
35 36 35 1 63 2017 5919.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
36 37 36 1 64 2017 4805.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
37 38 37 1 65 2017 3771.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
38 39 38 1 66 2017 3187.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
39 40 39 1 67 2017 2364.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
40 41 40 1 68 2017 1853.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
41 42 41 1 69 2017 1220.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
42 43 42 1 70 2017 802.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
43 44 43 1 71 2017 432.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
44 45 44 1 72 2017 237.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
45 46 45 1 73 2017 129.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
46 47 46 1 74 2017 75.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
47 48 47 1 75 2017 35.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
48 49 48 1 76 2017 19.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
49 50 49 1 77 2017 20.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
50 0 50+ -1 167 2017 27.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
0 -1 Total -1 184 2017 114870.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3
-2 -2 Unknown -2 185 2017 1.0 FALSE TRUE complete reconciled with abridged The complete series is missing data for one or more age groups. 3