vignettes/DDharmonize_5YearCounts.Rmd
DDharmonize_5YearCounts.Rmd
The purpose of this process is to standardizes/harmonize five-year vital and population counts taken from the UNDP database.
There are two functions involved. DDharmonize_Vitals5
1 is implemented on births and deaths data while DDharmonize_Pop5
is implemented on population data.
The functions are defined as follows:
DDharmonize_Vitals5(indata, type = c(“births”,“deaths”))
DDharmonize_Pop5(indata)
In this vignette, we will use the vitals5_df
dataset that is embedded on this package, and the DDharmonize_Vitals5()
function to show how five-year birth counts are harmonized. This dataset represents births by age of mother and sex of child in Norway for the year 2002 and for cases where SexID == “Both sexes”.
## Load the packages required
library(rddharmony)
library(kableExtra)
library(dplyr)
library(purrr)
## Create a function to be used to generate the table output
tab_output <- function(tab) {
kable(tab, booktabs = TRUE, align = "c", table.envir = "capctable", longtable = TRUE) %>%
kable_styling() %>%
row_spec(0, bold = T, color = "white", background = "#6ebed8") %>%
kable_paper(html_font = "helvetica") %>%
scroll_box(width = "100%", height = "300px")
}
The data, with only a few variables displayed, is shown below.
new_df <- vitals5_df %>%
select(LocID, LocName, DataSourceYear, DataStatusName, SexName, SexID, DataSourceYear, TimeLabel, TimeMid, starts_with("Age"), DataValue, SeriesID) %>%
select(-agesort) %>%
arrange(AgeLabel)
new_df %>% tab_output()
LocID | LocName | DataSourceYear | DataStatusName | SexName | SexID | TimeLabel | TimeMid | AgeID | AgeUnit | AgeStart | AgeEnd | AgeSpan | AgeMid | AgeLabel | AgeSort | DataValue | SeriesID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
578 | Norway | 2005 | Final | Both sexes | 3 | 2002 | 2002.5 | 3222 | Year | 0 | 15 | 15 | 7.5 | 0-14 | 9 | 4 | 1.106610e+14 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 30504 | Year | 12 | 15 | 3 | 14.5 | 12-14 | 106 | 4 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 304 | Year | 15 | 20 | 5 | 17.5 | 15-19 | 126 | 1322 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 305 | Year | 20 | 25 | 5 | 22.5 | 20-24 | 172 | 8032 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 306 | Year | 25 | 30 | 5 | 27.5 | 25-29 | 195 | 18556 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 307 | Year | 30 | 35 | 5 | 32.5 | 30-34 | 211 | 18907 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 308 | Year | 35 | 40 | 5 | 37.5 | 35-39 | 232 | 7365 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 309 | Year | 40 | 45 | 5 | 42.5 | 40-44 | 249 | 1211 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 310 | Year | 45 | 50 | 5 | 47.5 | 45-49 | 263 | 37 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 55434 | 7.878205e+12 |
578 | Norway | 2005 | Final | Both sexes | 3 | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 4 | 1.106610e+14 |
The function should ideally be run over each of the specific SexIDs (Males (1), Females (2) and Both sexes (3)), but when it is applied to births data, only “Both sexes” counts are considered. It is also implemented in cases where we have at least one abridged age group.
Before running this function, we need to ensure that all the age labels in the data have an AgeSpan
>= 5.
LocID | LocName | DataSourceYear | DataStatusName | SexName | SexID | TimeLabel | TimeMid | AgeID | AgeUnit | AgeStart | AgeEnd | AgeSpan | AgeMid | AgeLabel | AgeSort | DataValue | SeriesID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
578 | Norway | 2005 | Final | Both sexes | 3 | 2002 | 2002.5 | 3222 | Year | 0 | 15 | 15 | 7.5 | 0-14 | 9 | 4 | 1.106610e+14 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 304 | Year | 15 | 20 | 5 | 17.5 | 15-19 | 126 | 1322 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 305 | Year | 20 | 25 | 5 | 22.5 | 20-24 | 172 | 8032 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 306 | Year | 25 | 30 | 5 | 27.5 | 25-29 | 195 | 18556 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 307 | Year | 30 | 35 | 5 | 32.5 | 30-34 | 211 | 18907 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 308 | Year | 35 | 40 | 5 | 37.5 | 35-39 | 232 | 7365 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 309 | Year | 40 | 45 | 5 | 42.5 | 40-44 | 249 | 1211 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 310 | Year | 45 | 50 | 5 | 47.5 | 45-49 | 263 | 37 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 3 | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 55434 | 7.878205e+12 |
578 | Norway | 2005 | Final | Both sexes | 3 | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 4 | 1.106610e+14 |
We begin by subsetting the data to ensure that we only have “Both sexes” data and none of the data values are missing.
new_df <- new_df %>%
dplyr::filter(SexID == 3 & !is.na(DataValue)) %>%
select(-SexID) %>%
distinct()
new_df %>% tab_output()
LocID | LocName | DataSourceYear | DataStatusName | SexName | TimeLabel | TimeMid | AgeID | AgeUnit | AgeStart | AgeEnd | AgeSpan | AgeMid | AgeLabel | AgeSort | DataValue | SeriesID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
578 | Norway | 2005 | Final | Both sexes | 2002 | 2002.5 | 3222 | Year | 0 | 15 | 15 | 7.5 | 0-14 | 9 | 4 | 1.106610e+14 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 304 | Year | 15 | 20 | 5 | 17.5 | 15-19 | 126 | 1322 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 305 | Year | 20 | 25 | 5 | 22.5 | 20-24 | 172 | 8032 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 306 | Year | 25 | 30 | 5 | 27.5 | 25-29 | 195 | 18556 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 307 | Year | 30 | 35 | 5 | 32.5 | 30-34 | 211 | 18907 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 308 | Year | 35 | 40 | 5 | 37.5 | 35-39 | 232 | 7365 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 309 | Year | 40 | 45 | 5 | 42.5 | 40-44 | 249 | 1211 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 310 | Year | 45 | 50 | 5 | 47.5 | 45-49 | 263 | 37 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 55434 | 7.878205e+12 |
578 | Norway | 2005 | Final | Both sexes | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 4 | 1.106610e+14 |
We then check if “Final” data status is available and if so, we prioritize this over the provisional records.
if ("Final" %in% unique(new_df$DataStatusName)) {
new_df <- new_df %>%
dplyr::filter(DataStatusName == "Final")
}
new_df %>% tab_output()
LocID | LocName | DataSourceYear | DataStatusName | SexName | TimeLabel | TimeMid | AgeID | AgeUnit | AgeStart | AgeEnd | AgeSpan | AgeMid | AgeLabel | AgeSort | DataValue | SeriesID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
578 | Norway | 2005 | Final | Both sexes | 2002 | 2002.5 | 3222 | Year | 0 | 15 | 15 | 7.5 | 0-14 | 9 | 4 | 1.106610e+14 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 304 | Year | 15 | 20 | 5 | 17.5 | 15-19 | 126 | 1322 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 305 | Year | 20 | 25 | 5 | 22.5 | 20-24 | 172 | 8032 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 306 | Year | 25 | 30 | 5 | 27.5 | 25-29 | 195 | 18556 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 307 | Year | 30 | 35 | 5 | 32.5 | 30-34 | 211 | 18907 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 308 | Year | 35 | 40 | 5 | 37.5 | 35-39 | 232 | 7365 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 309 | Year | 40 | 45 | 5 | 42.5 | 40-44 | 249 | 1211 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 310 | Year | 45 | 50 | 5 | 47.5 | 45-49 | 263 | 37 | 7.878205e+12 |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 55434 | 7.878205e+12 |
578 | Norway | 2005 | Final | Both sexes | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 4 | 1.106610e+14 |
We proceed to check for multiple series ids and for each unique series id, we check whether it is a full series 2 with all age groups represented and an open age greater than 60.
ids_series <- unique(new_df$SeriesID)
n_series <- length(ids_series)
# for each unique series,
df_out <- NULL
for (i in 1:n_series) {
## Filter the data for a specific SeriesID
df <- new_df %>% dplyr::filter(SeriesID == ids_series[i])
## Check whether it is a full series with all age groups represented and an open age greater than 60
## AgeSpan == -1 (Total), AgeSpan == -2 (Unknown), AgeSpan == 5(five-year)
df_abr_std <- df[(df$AgeStart == 0 & df$AgeSpan == 1) |
df$AgeSpan %in% c(-1, -2, 5), ]
if (nrow(df_abr_std) > 13) {
df$check_full <- dd_series_isfull(df_abr_std, abridged = TRUE)
} else {
df$check_full <- FALSE
}
df_out <- rbind(df_out, df)
}
new_df <- df_out
rm(df_out)
new_df %>% tab_output()
LocID | LocName | DataSourceYear | DataStatusName | SexName | TimeLabel | TimeMid | AgeID | AgeUnit | AgeStart | AgeEnd | AgeSpan | AgeMid | AgeLabel | AgeSort | DataValue | SeriesID | check_full |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
578 | Norway | 2005 | Final | Both sexes | 2002 | 2002.5 | 3222 | Year | 0 | 15 | 15 | 7.5 | 0-14 | 9 | 4 | 1.106610e+14 | FALSE |
578 | Norway | 2005 | Final | Both sexes | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 4 | 1.106610e+14 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 304 | Year | 15 | 20 | 5 | 17.5 | 15-19 | 126 | 1322 | 7.878205e+12 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 305 | Year | 20 | 25 | 5 | 22.5 | 20-24 | 172 | 8032 | 7.878205e+12 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 306 | Year | 25 | 30 | 5 | 27.5 | 25-29 | 195 | 18556 | 7.878205e+12 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 307 | Year | 30 | 35 | 5 | 32.5 | 30-34 | 211 | 18907 | 7.878205e+12 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 308 | Year | 35 | 40 | 5 | 37.5 | 35-39 | 232 | 7365 | 7.878205e+12 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 309 | Year | 40 | 45 | 5 | 42.5 | 40-44 | 249 | 1211 | 7.878205e+12 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 310 | Year | 45 | 50 | 5 | 47.5 | 45-49 | 263 | 37 | 7.878205e+12 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 55434 | 7.878205e+12 | FALSE |
As can be seen in this data, we have situations were we have multiple total labels for the same series. This may have occurred when a country updates an age label and reports the total for that update. If there exists more than one total reported and there is a one that is equal to the computed total (summation over the age labels), drop the others to be left with this one 3.
new_df <- new_df %>% dd_multiple_totals()
new_df %>% tab_output()
LocID | LocName | DataSourceYear | DataStatusName | SexName | TimeLabel | TimeMid | AgeID | AgeUnit | AgeStart | AgeEnd | AgeSpan | AgeMid | AgeLabel | AgeSort | DataValue | SeriesID | check_full | counter | misplaced_oag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
578 | Norway | 2005 | Final | Both sexes | 2002 | 2002.5 | 3222 | Year | 0 | 15 | 15 | 7.5 | 0-14 | 9 | 4 | 1.106610e+14 | FALSE | 2 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 304 | Year | 15 | 20 | 5 | 17.5 | 15-19 | 126 | 1322 | 7.878205e+12 | FALSE | 1 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 305 | Year | 20 | 25 | 5 | 22.5 | 20-24 | 172 | 8032 | 7.878205e+12 | FALSE | 1 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 306 | Year | 25 | 30 | 5 | 27.5 | 25-29 | 195 | 18556 | 7.878205e+12 | FALSE | 1 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 307 | Year | 30 | 35 | 5 | 32.5 | 30-34 | 211 | 18907 | 7.878205e+12 | FALSE | 1 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 308 | Year | 35 | 40 | 5 | 37.5 | 35-39 | 232 | 7365 | 7.878205e+12 | FALSE | 1 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 309 | Year | 40 | 45 | 5 | 42.5 | 40-44 | 249 | 1211 | 7.878205e+12 | FALSE | 1 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 310 | Year | 45 | 50 | 5 | 47.5 | 45-49 | 263 | 37 | 7.878205e+12 | FALSE | 1 | FALSE |
578 | Norway | 2004 | Final | Both sexes | 2002 | 2002.5 | 398 | Year | 0 | -1 | -1 | -1.0 | Total | 999 | 55434 | 7.878205e+12 | FALSE | 2 | FALSE |
We then check the data to ensure that we have the latest data source year for each of the age labels 4.
# if there is more than one series ...
if (n_series > 1) {
latest_source_year <- max(new_df$DataSourceYear)
check_latest_full <- unique(new_df$check_full[new_df$DataSourceYear == latest_source_year])
# ... and latest series is full then keep only that one
if (any(check_latest_full)) {
new_df <- new_df[new_df$DataSourceYear == latest_source_year, ]
} else {
# ... and latest series is not full, then keep the latest data source record for each age
new_df <- new_df %>% dd_latest_source_year()
}
}
We proceed to tidy up the data frame and in cases where we do not have an “Unknown” age label present, we generate one and set the data value to 0.
new_df <- new_df %>%
select(DataSourceYear, AgeStart, AgeEnd, AgeLabel, AgeSpan, DataValue) %>%
distinct()
if (!("Unknown" %in% new_df$AgeLabel)) {
new_df <- new_df %>%
bind_rows(data.frame(
AgeStart = -2,
AgeEnd = -2,
AgeSpan = -2,
AgeLabel = "Unknown",
DataSourceYear = NA,
DataValue = 0
))
}
We have situations where the “Unknown” value represents the difference between the reported total and the summation of the data values (computed totals) and in such cases, we also set the “Unknown” data value to 0 5.
new_df <- new_df %>% dd_drop_unknowns()
We also reconcile early age groups for births, for example in this case, we have 0-14 and since the minimum age should be 10, we set this to 10-14 6. We also reconcile ages 0-1, 1-4 and 0-4 such that if 0-4 is missing and 0-1 and 1-4 are present, we sum both to get 0-4 7.
new_df <- new_df %>%
dd_firstages_compute_births() %>%
select(-AgeSort)
new_df <- new_df %>% dd_firstages_compute()
new_df %>% tab_output()
AgeStart | AgeEnd | AgeLabel | AgeSpan | AgeSort | DataSourceYear | DataValue |
---|---|---|---|---|---|---|
10 | 15 | 10-14 | 5 | 5 | NA | 4 |
15 | 20 | 15-19 | 5 | 6 | 2004 | 1322 |
20 | 25 | 20-24 | 5 | 7 | 2004 | 8032 |
25 | 30 | 25-29 | 5 | 8 | 2004 | 18556 |
30 | 35 | 30-34 | 5 | 9 | 2004 | 18907 |
35 | 40 | 35-39 | 5 | 10 | 2004 | 7365 |
40 | 45 | 40-44 | 5 | 11 | 2004 | 1211 |
45 | 50 | 45-49 | 5 | 12 | 2004 | 37 |
0 | -1 | Total | -1 | 184 | 2004 | 55434 |
-2 | -2 | Unknown | -2 | 185 | NA | 0 |
We proceed to check whether there are multiple open age groups (TRUE/FALSE), and if so, compute closed age groups from multiple open age groups and add to the data if missing 8.
oag_multi <- new_df %>% dd_oag_multiple()
if (oag_multi) {
add <- new_df %>%
dd_oag2closed() %>%
dplyr::filter(!(AgeLabel %in% new_df$AgeLabel[!is.na(new_df$DataValue)]))
if (nrow(add > 0)) {
new_df <- new_df %>%
bind_rows(add) %>%
arrange(AgeStart)
}
}
new_df %>% tab_output()
AgeStart | AgeEnd | AgeLabel | AgeSpan | AgeSort | DataSourceYear | DataValue |
---|---|---|---|---|---|---|
10 | 15 | 10-14 | 5 | 5 | NA | 4 |
15 | 20 | 15-19 | 5 | 6 | 2004 | 1322 |
20 | 25 | 20-24 | 5 | 7 | 2004 | 8032 |
25 | 30 | 25-29 | 5 | 8 | 2004 | 18556 |
30 | 35 | 30-34 | 5 | 9 | 2004 | 18907 |
35 | 40 | 35-39 | 5 | 10 | 2004 | 7365 |
40 | 45 | 40-44 | 5 | 11 | 2004 | 1211 |
45 | 50 | 45-49 | 5 | 12 | 2004 | 37 |
0 | -1 | Total | -1 | 184 | 2004 | 55434 |
-2 | -2 | Unknown | -2 | 185 | NA | 0 |
We then identify the start age of the open age group needed to close the series 9, flag whether this open age group exists in the series and drop records for open age groups that do not close the series.
oag_start <- new_df %>% dd_oag_agestart()
oag_check <- paste0(oag_start, "+") %in% new_df$AgeLabel
if (!is_empty(oag_start)) {
new_df <- new_df %>%
dplyr::filter(!(AgeStart > 0 & AgeSpan == -1 & AgeStart != oag_start))
}
new_df %>% tab_output()
AgeStart | AgeEnd | AgeLabel | AgeSpan | AgeSort | DataSourceYear | DataValue |
---|---|---|---|---|---|---|
10 | 15 | 10-14 | 5 | 5 | NA | 4 |
15 | 20 | 15-19 | 5 | 6 | 2004 | 1322 |
20 | 25 | 20-24 | 5 | 7 | 2004 | 8032 |
25 | 30 | 25-29 | 5 | 8 | 2004 | 18556 |
30 | 35 | 30-34 | 5 | 9 | 2004 | 18907 |
35 | 40 | 35-39 | 5 | 10 | 2004 | 7365 |
40 | 45 | 40-44 | 5 | 11 | 2004 | 1211 |
45 | 50 | 45-49 | 5 | 12 | 2004 | 37 |
0 | -1 | Total | -1 | 184 | 2004 | 55434 |
-2 | -2 | Unknown | -2 | 185 | NA | 0 |
We also check that there are no missing age groups on the abridged series and if so, we compute all possible open age groups 10 given available input and append the open age group that completes the abridged series. We also check again to see whether any open age group exists.
## Check that there are no missing age groups on the abridged series
if (nrow(new_df[new_df$AgeStart >= 5, ]) > 0) {
check_abr <- DemoTools::is_abridged(new_df$AgeStart[new_df$AgeStart >= 5])
} else {
check_abr <- FALSE
}
if (check_abr == TRUE) {
## Compute all possible open age groups given available input
abr_oag <- dd_oag_compute(new_df, age_span = 5)
## Append the oag that completes the abridged series
new_df <- new_df %>%
bind_rows(abr_oag[!(abr_oag$AgeLabel %in% new_df$AgeLabel) &
abr_oag$AgeStart == oag_start, ]) %>%
arrange(AgeSort)
}
## Check again whether any open age group exists
oag_check <- paste0(oag_start, "+") %in% new_df$AgeLabel
oag_check
#> [1] TRUE
new_df %>% tab_output()
AgeStart | AgeEnd | AgeLabel | AgeSpan | AgeSort | DataSourceYear | DataValue |
---|---|---|---|---|---|---|
10 | 15 | 10-14 | 5 | 5 | NA | 4 |
15 | 20 | 15-19 | 5 | 6 | 2004 | 1322 |
20 | 25 | 20-24 | 5 | 7 | 2004 | 8032 |
25 | 30 | 25-29 | 5 | 8 | 2004 | 18556 |
30 | 35 | 30-34 | 5 | 9 | 2004 | 18907 |
35 | 40 | 35-39 | 5 | 10 | 2004 | 7365 |
40 | 45 | 40-44 | 5 | 11 | 2004 | 1211 |
45 | 50 | 45-49 | 5 | 12 | 2004 | 37 |
50 | 0 | 50+ | -1 | 167 | NA | 0 |
0 | -1 | Total | -1 | 184 | 2004 | 55434 |
-2 | -2 | Unknown | -2 | 185 | NA | 0 |
If “Total” is missing and the series is otherwise complete, we compute it .
if (!("Total" %in% new_df$AgeLabel) & "0-4" %in% new_df$AgeLabel & oag_check == TRUE) {
new_df <- new_df %>%
bind_rows(data.frame(
AgeStart = 0,
AgeEnd = -1,
AgeLabel = "Total",
AgeSpan = -1,
AgeSort = 184,
DataSourceYear = NA,
DataValue = sum(new_df$DataValue[new_df$AgeSpan == 5]) +
new_df$DataValue[new_df$AgeSpan == -1 & new_df$AgeStart == oag_start] +
new_df$DataValue[new_df$AgeLabel == "Unknown"]
))
}
new_df %>% tab_output()
AgeStart | AgeEnd | AgeLabel | AgeSpan | AgeSort | DataSourceYear | DataValue |
---|---|---|---|---|---|---|
10 | 15 | 10-14 | 5 | 5 | NA | 4 |
15 | 20 | 15-19 | 5 | 6 | 2004 | 1322 |
20 | 25 | 20-24 | 5 | 7 | 2004 | 8032 |
25 | 30 | 25-29 | 5 | 8 | 2004 | 18556 |
30 | 35 | 30-34 | 5 | 9 | 2004 | 18907 |
35 | 40 | 35-39 | 5 | 10 | 2004 | 7365 |
40 | 45 | 40-44 | 5 | 11 | 2004 | 1211 |
45 | 50 | 45-49 | 5 | 12 | 2004 | 37 |
50 | 0 | 50+ | -1 | 167 | NA | 0 |
0 | -1 | Total | -1 | 184 | 2004 | 55434 |
-2 | -2 | Unknown | -2 | 185 | NA | 0 |
Finally, a note is generated to alert the user about missing data and this process is repeated for all the sex ids (in the case of deaths data).
new_df$note <- NA
if (check_abr == FALSE | oag_check == FALSE) {
abr$note <- "The abridged series is missing data for one or more age groups."
}
if (!("0" %in% new_df$AgeLabel & "1-4" %in% new_df$AgeLabel & "0-4" %in% new_df$AgeLabel)) {
new_df$note <- "The abridged series is missing data for one or more age groups."
}
new_df %>% tab_output()
AgeStart | AgeEnd | AgeLabel | AgeSpan | AgeSort | DataSourceYear | DataValue | note |
---|---|---|---|---|---|---|---|
10 | 15 | 10-14 | 5 | 5 | NA | 4 | The abridged series is missing data for one or more age groups. |
15 | 20 | 15-19 | 5 | 6 | 2004 | 1322 | The abridged series is missing data for one or more age groups. |
20 | 25 | 20-24 | 5 | 7 | 2004 | 8032 | The abridged series is missing data for one or more age groups. |
25 | 30 | 25-29 | 5 | 8 | 2004 | 18556 | The abridged series is missing data for one or more age groups. |
30 | 35 | 30-34 | 5 | 9 | 2004 | 18907 | The abridged series is missing data for one or more age groups. |
35 | 40 | 35-39 | 5 | 10 | 2004 | 7365 | The abridged series is missing data for one or more age groups. |
40 | 45 | 40-44 | 5 | 11 | 2004 | 1211 | The abridged series is missing data for one or more age groups. |
45 | 50 | 45-49 | 5 | 12 | 2004 | 37 | The abridged series is missing data for one or more age groups. |
50 | 0 | 50+ | -1 | 167 | NA | 0 | The abridged series is missing data for one or more age groups. |
0 | -1 | Total | -1 | 184 | 2004 | 55434 | The abridged series is missing data for one or more age groups. |
-2 | -2 | Unknown | -2 | 185 | NA | 0 | The abridged series is missing data for one or more age groups. |