Introduction to cleanepi

An overview

Data cleaning is a critical step of data analysis, especially considering the messy nature of real-world data, which often includes duplicates, errors, incomplete entries, irrelevant formats, etc. Addressing these issues is essential for producing accurate, reliable, and reproducible results. However, data cleaning can pose a substantial barrier in data analysis due to the time-consuming nature of the process.

{cleanepi} is an R package designed specifically to address this challenge by offering tools to clean, curate, and standardize datasets. Tailored specifically for epidemiological data and compatible with data frame-like structures, {cleanepi} offers a suite of functions designed to streamline common data cleaning tasks.

This vignette provides a comprehensive guide to the functionalities encapsulated within {cleanepi}. It provides users with detailed insights into each function’s purpose and practical usage, equipping them with the tools necessary to navigate and manipulate cluttered datasets effectively.

library("cleanepi")

General data cleaning tasks

The main function in {cleanepi} is clean_data(), which can perform the following tasks:

  1. Clean up column names and convert them to more readable formats. This includes several sub-tasks, such as replacing a space, dot, or hyphen between two words with an underscore; converting camelCase to snake_case; substituting foreign characters with their corresponding English characters; and splitting a long word at its internal capital letters, if any, into multiple short words connected with underscores.
  2. Find and remove duplicated rows across all or some specific columns.
  3. Remove empty rows and columns as well as constant columns, i.e. columns with the same value across all rows.
  4. Replace missing entries with NA.
  5. Standardize date formats and verify that the sequence of date events follows the correct chronological order across columns for all rows in the input data.
  6. Convert character columns into Date when a column predominantly contains values of type Date (by default, at least 60% of the values must be dates).
  7. Detect and remove rows with subject ids that do not comply with the expected format.
  8. Perform dictionary-based cleaning: replace keys in specific columns with their corresponding values stored in a data dictionary provided as a data frame, detect and replace misspelled values with their correct ones.
  9. Convert numbers written in words into numeric.
  10. Calculate the time span between columns or values of type Date.
  11. Convert numeric values into Date.

In addition, the package has two surrogate functions:
1. scan_data(): scans the input data to determine the percentage of missing, numeric, character, logical, and date values in every character column of the input data frame.
2. print_report(): prints the data cleaning report.

# IMPORTING THE TEST DATASET
test_data <- readRDS(
  system.file("extdata", "test_df.RDS", package = "cleanepi")
)
study_id  event_name country_code country_name date.of.admission dateOfBirth date_first_pcr_positive_test sex
PS001P2   day 0      2            Gambia       01/12/2020        06/01/1972  Dec 01, 2020                 1
PS002P2   day 0      2            Gambia       28/01/2021        02/20/1952  Jan 01, 2021                 1
PS004P2-1 day 0      2            Gambia       15/02/2021        06/15/1961  Feb 11, 2021                 -99
PS003P2   day 0      2            Gambia       11/02/2021        11/11/1947  Feb 01, 2021                 1
P0005P2   day 0      2            Gambia       17/02/2021        09/26/2000  Feb 16, 2021                 2
PS006P2   day 0      2            Gambia       17/02/2021        -99         May 02, 2021                 2
PB500P2   day 0      2            Gambia       28/02/2021        11/03/1989  Feb 19, 2021                 1
PS008P2   day 0      2            Gambia       22/02/2021        10/05/1976  Sep 20, 2021                 2
PS010P2   day 0      2            Gambia       02/03/2021        09/23/1991  Feb 26, 2021                 1
PS011P2   day 0      2            Gambia       05/03/2021        02/08/1991  Mar 03, 2021                 2
# SCAN THE DATA
scan_result <- scan_data(test_data)
Field_names                  missing numeric date character logical
study_id                           0     0.0  0.0         1       0
event_name                         0     0.0  0.0         1       0
country_name                       0     0.0  0.0         1       0
date.of.admission                  0     0.0  1.0         0       0
dateOfBirth                        0     0.1  0.9         0       0
date_first_pcr_positive_test       0     0.0  1.0         0       0

In {cleanepi}, every cleaning operation is encapsulated within a module, with detailed descriptions provided in the package design vignette. The parameters required for the main functions of each module are outlined in the sections below.

The clean_data() function, described above, can take the following arguments:

  1. data: a data.frame or linelist.
  2. standardize_column_names: a list of parameters to be used for standardizing the column names. These arguments are the inputs for the standardize_column_names() function.
  3. replace_missing_values: a list of parameters to be used when replacing the missing values by NA. Elements of this list are the inputs for the replace_missing_values() function.
  4. remove_duplicates: a list with the arguments that define the columns and other parameters to be considered when looking for duplicates. They are the input values for the remove_duplicates() function.
  5. remove_constants: a list with the parameters that define whether to remove constant data or not. The values are the input for the remove_constants() function.
  6. standardize_dates: a list of parameters that will be used to standardize the date values from the input data. They represent the input values for the standardize_dates() function.
  7. standardize_subject_ids: a list of parameters that are needed to check whether the subject IDs comply with the expected format. These arguments are the input values of the check_subject_ids() function.
  8. to_numeric: a list with the parameters needed to convert the specified columns into numeric. When provided, the parameters will be the input values for the convert_to_numeric() function.
  9. dictionary: a data frame that will be used to substitute the current values in the specified columns with those in the dictionary. It is the main argument for the clean_using_dictionary() function.
  10. check_date_sequence: a list of arguments to be used when determining whether the sequence of the target date events is respected across all rows of the input data. The elements of this list are the input for the check_date_sequence() function.

It is important to note that only cleaning operations that are explicitly defined will be performed.
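For instance, a call that defines only the column name standardization will leave every other aspect of the data untouched. The sketch below illustrates this with the test dataset shipped with the package:

```r
library("cleanepi")

# IMPORT THE TEST DATASET SHIPPED WITH {cleanepi}
test_data <- readRDS(
  system.file("extdata", "test_df.RDS", package = "cleanepi")
)

# ONLY COLUMN NAME STANDARDIZATION IS DEFINED HERE, SO NO OTHER
# CLEANING OPERATION WILL BE PERFORMED ON THE DATA
partially_cleaned <- clean_data(
  data = test_data,
  standardize_column_names = list(keep = NULL, rename = NULL)
)
```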

The code chunk below shows how to define a set of cleaning operations that we want to perform on the input data.

# PARAMETERS FOR REPLACING MISSING VALUES WITH NA
replace_missing_values <- list(target_columns = NULL, na_strings = "-99")

# PARAMETERS FOR COLUMN NAMES STANDARDIZATION
standardize_column_names <- list(keep = NULL, rename = NULL)

# PARAMETERS FOR DUPLICATES DETECTION AND REMOVAL
remove_duplicates <- list(target_columns = NULL)

# PARAMETERS FOR STANDARDIZING DATES
standardize_dates <- list(
  target_columns = NULL,
  error_tolerance = 0.4,
  format = NULL,
  timeframe = as.Date(c("1973-05-29", "2023-05-29")),
  orders = list(
    world_named_months = c("Ybd", "dby"),
    world_digit_months = c("dmy", "Ymd"),
    US_formats = c("Omdy", "YOmd")
  )
)

# PARAMETERS FOR STANDARDIZING SUBJECT IDs
standardize_subject_ids <- list(
  target_columns = "study_id",
  prefix = "PS",
  suffix = "P2",
  range = c(1, 100),
  nchar = 7
)

# CONVERT THE 'sex' COLUMN INTO NUMERIC
to_numeric <- list(target_columns = "sex", lang = "en")

# PARAMETERS FOR CONSTANT COLUMNS, EMPTY ROWS AND COLUMNS REMOVAL
remove_constants <- list(cutoff = 1)

# LOAD THE DATA DICTIONARY FOR DICTIONARY-BASED CLEANING
dictionary <- readRDS(
  system.file("extdata", "test_dictionary.RDS", package = "cleanepi")
)
# CLEAN THE INPUT DATA FRAME
cleaned_data <- clean_data(
  data = test_data,
  remove_constants = remove_constants,
  replace_missing_values = replace_missing_values,
  remove_duplicates = remove_duplicates,
  standardize_dates = standardize_dates,
  standardize_subject_ids = standardize_subject_ids,
  to_numeric = to_numeric,
  dictionary = dictionary
)
#> ℹ Cleaning column names
#> ℹ Replacing missing values with NA
#> ℹ Removing constant columns and empty rows
#> ℹ Removing duplicated rows
#> ℹ No duplicates were found.
#> ℹ Standardizing Date columns
#> ! Detected 8 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ Checking subject IDs format
#> 
#> ! Detected no missing, no duplicated, and 3 incorrect subject IDs.
#> ℹ Enter `print_report(data = dat, "incorrect_subject_id")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ You can use the `correct_subject_ids()` function to correct them.
#> ℹ Converting the following  column into numeric: sex
#> 
#> ℹ Performing dictionary-based cleaning

Using {cleanepi} functionalities with pipe operators

# IMPORT THE INPUT DATASET
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# IMPORT THE DATA DICTIONARY
test_dictionary <- readRDS(
  system.file("extdata", "test_dictionary.RDS", package = "cleanepi")
)

# SCAN THROUGH THE DATA
scan_res <- scan_data(data)

# PERFORM DATA CLEANING
cleaned_data <- data %>%
  standardize_column_names(keep = NULL, rename = c(DOB = "dateOfBirth")) %>%
  replace_missing_values(target_columns = NULL, na_strings = "-99") %>%
  remove_constants(cutoff = 1.0) %>%
  remove_duplicates(target_columns = NULL) %>%
  standardize_dates(
    target_columns = NULL,
    error_tolerance = 0.4,
    format = NULL,
    timeframe = as.Date(c("1973-05-29", "2023-05-29"))
  ) %>%
  check_subject_ids(
    target_columns = "study_id",
    prefix = "PS",
    suffix = "P2",
    range = c(1L, 100L),
    nchar = 7L
  ) %>%
  convert_to_numeric(target_columns = "sex", lang = "en") %>%
  clean_using_dictionary(dictionary = test_dictionary)
#> ℹ No duplicates were found.
#> ! Detected 8 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ! Detected no missing, no duplicated, and 3 incorrect subject IDs.
#> ℹ Enter `print_report(data = dat, "incorrect_subject_id")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ You can use the `correct_subject_ids()` function to correct them.

# ADD THE DATA SCANNING RESULT TO THE REPORT
cleaned_data <- add_to_report(
   x = cleaned_data,
   key = "scanning_result",
   value = scan_res
)

Printing the report

The pipeline above returns the cleaned dataset. The report generated from the data cleaning operations is a list object attached to the cleaned data, and it can be accessed using the attr() or attributes() functions. However, we strongly advise using the print_report() function to display either the entire report, as an HTML file in your default browser, or only the report from a specific data cleaning operation. The function takes the following parameters:

  1. data: the output object returned from clean_data() or from the main function of any data cleaning module.
  2. print: a logical that specifies whether to open the report in your browser in the form of an HTML file or not. Default is FALSE.
  3. output_file_name: a character used to specify the name of the report file, excluding any file extension. If no file name is supplied, one will be automatically generated with the format cleanepi_report_YYMMDD_HHMMSS.
  4. format: a character with the file format of the report. Currently only "html" is supported.
  5. what: a character with the name of the specific data cleaning report to be displayed. The table below contains the possible values of this argument.
keywords                   goal
incorrect_date_sequence    To display rows with the incorrect date sequences
colnames                   To display the column names before and after cleaning
converted_into_numeric     To display the names of the columns that have been converted into numeric
date_standardization       To display rows in the cleaned data with date values that are outside of the specified time frame, and rows with date values that comply with multiple formats
misspelled_values          To display the detected misspelled values
removed_duplicates         To display the duplicated rows that have been removed
found_duplicates           To display the duplicated rows
constant_data              To display the constant data, i.e. constant columns, empty rows and columns
missing_values_replaced_at To display the names of the columns where the missing value strings have been replaced with NA
incorrect_subject_id       To display the missing, duplicated and invalid subject IDs

# PRINT THE ENTIRE DATA CLEANING REPORT
print_report(
  data = cleaned_data,
  what = NULL,
  print = TRUE,
  report_title = "{cleanepi} data cleaning report",
  output_file_name = NULL,
  format = "html"
)

Specific data cleaning tasks

{cleanepi} provides users with the flexibility to call a specific function when they wish to perform that particular task individually. This approach gives users more control over the data cleaning process and lets them apply additional data cleaning functions as needed.

For example, some data cleaning operations, such as renaming columns, removing empty rows and columns, removing columns with the same values across all rows, and standardizing date columns, play a central role in standardizing the epidemiological data. These operations can be performed within the framework of the clean_data() function or alone by calling the relevant function.

This setup offers users both convenience and flexibility, as they can perform all the cleaning operations at once, or execute them individually according to their specific needs.

Remove constant data

Some datasets contain constant columns (columns with the same values across all rows) and/or empty rows and columns (rows or columns where all values are missing, i.e. NA). The remove_constants() function can be used to remove such “noise”. The function takes the following arguments:

  1. data: the input data frame or linelist,
  2. cutoff: a numeric value, between 0 and 1, to be used when removing empty rows and columns. When provided, only rows and columns where the percentage of missing data is greater than this cut-off will be removed. Rows and columns with 100% missing values are removed by default.
# IMPORT THE INPUT DATA
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# INTRODUCE AN EMPTY COLUMN
data$empty_column <- NA

# remove the constant columns, empty rows and columns
dat <- remove_constants(
  data = data,
  cutoff = 1
)
study_id  date.of.admission dateOfBirth date_first_pcr_positive_test sex
PS001P2   01/12/2020        06/01/1972  Dec 01, 2020                 1
PS002P2   28/01/2021        02/20/1952  Jan 01, 2021                 1
PS004P2-1 15/02/2021        06/15/1961  Feb 11, 2021                 -99
PS003P2   11/02/2021        11/11/1947  Feb 01, 2021                 1
P0005P2   17/02/2021        09/26/2000  Feb 16, 2021                 2
PS006P2   17/02/2021        -99         May 02, 2021                 2
PB500P2   28/02/2021        11/03/1989  Feb 19, 2021                 1
PS008P2   22/02/2021        10/05/1976  Sep 20, 2021                 2
PS010P2   02/03/2021        09/23/1991  Feb 26, 2021                 1
PS011P2   05/03/2021        02/08/1991  Mar 03, 2021                 2

The remove_constants() function returns a dataset where all constant columns, empty rows, and empty columns have been iteratively removed. When an iteration of constant-data removal leaves the dataset with new empty rows or columns, or new constant columns, the process is repeated until no constant data remains. Rows and columns deleted at any iteration are listed in the report object.

print_report(dat, "constant_data")
iteration empty_columns empty_rows constant_columns
1         empty_column  NA         event_name, country_code, country_name

Cleaning column names

The syntax used to name the columns of a dataset during its creation depends on many factors, such as the language and the naming convention. We provide, in {cleanepi}, the standardize_column_names() function to clean up column names and convert them to more intuitive formats. It performs many sub-tasks, including: replacing a space, dot, or hyphen between two words with an underscore; converting camelCase to snake_case; substituting foreign characters with their corresponding English characters; and splitting a long word at its internal capital letters, if any, into multiple short words connected with underscores. The function can take the following arguments:

  1. data: the input data frame or linelist
  2. keep: a vector of column names to maintain as they are. When dealing with a linelist, this can be set to linelist_tags to maintain the tagged column names. The default is NULL.
  3. rename: a vector of column names to be renamed in the form of new_name = "old_name". If not provided, all columns will undergo standardization.
# IMPORT AND PRINT THE INITIAL COLUMN NAMES
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
print(colnames(data))
#> [1] "study_id"                     "event_name"                  
#> [3] "country_code"                 "country_name"                
#> [5] "date.of.admission"            "dateOfBirth"                 
#> [7] "date_first_pcr_positive_test" "sex"

# KEEP 'date.of.admission' AS IS
cleaned_data <- standardize_column_names(
  data = data,
  keep = "date.of.admission"
)
print(colnames(cleaned_data))
#> [1] "study_id"                     "event_name"                  
#> [3] "country_code"                 "country_name"                
#> [5] "date.of.admission"            "date_of_birth"               
#> [7] "date_first_pcr_positive_test" "sex"

# KEEP 'date.of.admission' AS IS, BUT RENAME 'dateOfBirth' AND 'sex' TO
# 'DOB' AND 'gender' RESPECTIVELY
cleaned_data <- standardize_column_names(
  data = data,
  keep = "date.of.admission",
  rename = c(DOB = "dateOfBirth", gender = "sex")
)

# PRINT THE REPORT FOR THIS OPERATION
print_report(cleaned_data, "colnames")
before                       after
study_id                     study_id
event_name                   event_name
country_code                 country_code
country_name                 country_name
date.of.admission            date.of.admission
dateOfBirth                  DOB
date_first_pcr_positive_test date_first_pcr_positive_test
sex                          gender

By providing the function with these parameters, users can redefine the column names and obtain names that are easy to work with. This enables a more comprehensive naming system that is tailored to the user's needs.

Replacing missing entries with NA

It is common to have missing values in an input dataset. By default, R expects missing values to be represented by NA. However, this is not always the case, as some datasets contain specific character strings that denote missing values. In such a scenario, users can call the replace_missing_values() function to substitute these strings with NA, making the data suitable for downstream data science operations. The function takes the following arguments:

  1. data: the input data frame or linelist.
  2. target_columns: a vector of column names. If provided, the substitution of missing values will only be executed in those specified columns. When the input data is a linelist object, this parameter can be set to linelist_tags if you wish to replace missing values with NA on tagged columns only. The default value is NULL i.e. replace missing values across all columns.
  3. na_strings: a vector of character strings that represent the missing values in the columns of interest. By default, it utilizes cleanepi::common_na_strings. However, if the strings denoting missing values in the columns of interest are not included in this predefined vector, supply them as the value for this argument.
# VISUALIZE THE PREDEFINED VECTOR OF MISSING CHARACTERS
print(cleanepi::common_na_strings)
#>  [1] "missing"       "NA"            "N A"           "N/A"          
#>  [5] "#N/A"          "NA "           " NA"           "N /A"         
#>  [9] "N / A"         " N / A"        "N / A "        "na"           
#> [13] "n a"           "n/a"           "na "           " na"          
#> [17] "n /a"          "n / a"         " a / a"        "n / a "       
#> [21] "NULL"          "null"          ""              "\\?"          
#> [25] "\\*"           "\\."           "not available" "Not Available"
#> [29] "NOt available" "not avail"     "Not Avail"     "nan"          
#> [33] "NAN"           "not a number"  "Not A Number"
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# REPLACE ALL OCCURRENCES OF "-99" WITH NA IN THE "sex" COLUMN
cleaned_data <- replace_missing_values(
  data = data,
  target_columns = "sex",
  na_strings = "-99"
)

# REPLACE ALL OCCURRENCES OF "-99" WITH NA FROM ALL COLUMNS
cleaned_data <- replace_missing_values(
  data = data,
  target_columns = NULL,
  na_strings = "-99"
)

# PRINT THE REPORT FROM THIS OPERATION
print_report(cleaned_data, "missing_values_replaced_at")
#> [1] "dateOfBirth" "sex"
study_id  event_name country_code country_name date.of.admission dateOfBirth date_first_pcr_positive_test sex
PS001P2   day 0      2            Gambia       01/12/2020        06/01/1972  Dec 01, 2020                 1
PS002P2   day 0      2            Gambia       28/01/2021        02/20/1952  Jan 01, 2021                 1
PS004P2-1 day 0      2            Gambia       15/02/2021        06/15/1961  Feb 11, 2021                 NA
PS003P2   day 0      2            Gambia       11/02/2021        11/11/1947  Feb 01, 2021                 1
P0005P2   day 0      2            Gambia       17/02/2021        09/26/2000  Feb 16, 2021                 2
PS006P2   day 0      2            Gambia       17/02/2021        NA          May 02, 2021                 2
PB500P2   day 0      2            Gambia       28/02/2021        11/03/1989  Feb 19, 2021                 1
PS008P2   day 0      2            Gambia       22/02/2021        10/05/1976  Sep 20, 2021                 2
PS010P2   day 0      2            Gambia       02/03/2021        09/23/1991  Feb 26, 2021                 1
PS011P2   day 0      2            Gambia       05/03/2021        02/08/1991  Mar 03, 2021                 2

Standardizing Dates

The default date format in R is YYYY-MM-DD (the ISO 8601 format). However, it is very common to encounter date values written in other formats. There are also cases where a single column in a data frame contains a mixture of values of type Date, character, and others.

The standardize_dates() function provides a comprehensive set of options for converting date columns into a specified format and handling various scenarios, such as different date formats and mixed data types in a column.

Entries that cannot be parsed result in NA. An error threshold can be used to define the maximum proportion of resulting NA values (i.e. entries without an identified date) that can be tolerated. If this threshold is exceeded, the initial values are returned.
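The threshold rule can be illustrated in base R. This is only a sketch of the principle, not {cleanepi}'s internal code:

```r
# A CHARACTER COLUMN WITH ONE UNPARSEABLE ENTRY OUT OF FOUR
values <- c("2021-02-11", "2021-03-02", "not a date", "2021-05-17")
parsed <- as.Date(values, format = "%Y-%m-%d")

# SHARE OF ENTRIES WITHOUT AN IDENTIFIED DATE
prop_na <- mean(is.na(parsed))  # 0.25

# WITH THE DEFAULT TOLERANCE OF 0.4, THE CONVERSION IS KEPT;
# ABOVE THE THRESHOLD, THE INITIAL VALUES WOULD BE RETURNED
error_tolerance <- 0.4
result <- if (prop_na > error_tolerance) values else parsed
```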

The function expects the following arguments:

  1. data: A data frame or linelist (required).
  2. target_columns: A vector of the names of the columns to be converted (optional). When not provided, the function will attempt to detect date columns and perform the conversion if the conditions above are met.
  3. format: A format of the values in the specified columns (optional). If not provided, the function will attempt to infer the format.
  4. timeframe: The expected time frame within which the date values should fall. Values outside of this range will be set to NA (optional).
  5. error_tolerance: The maximum percentage of NA values (non-date values) that can be allowed in a converted column. The default is 40%, i.e. 0.4.
  6. orders: A character vector or list of codes for fine-grained parsing of dates. It is used for parsing of mixed dates. If a list is supplied, that list will be used for successive tries in parsing. Default is:
orders <- list(
  quarter_partial_dates = c("Y", "Ym", "Yq"),
  world_digit_months = c("ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm",
                         "dmY", "mdY", "mYd", "dYm"),
  world_named_months = c("dby", "dyb", "bdy", "byd", "ybd", "ydb", "dbY", "dYb",
                         "bdY", "bYd", "Ybd", "Ydb"),
  us_format = c("Omdy", "YOmd")
)

⚠️ The error_tolerance must be used with caution. When it is set and the percentage of non-date values (NA, i.e. values that could not be converted into Date) in a character column is greater than this threshold, the column is returned as is. The values outside of the specified timeframe can be accessed from the report object via its date_standardization element.

The value for the orders argument can be modified to suit the user’s needs. Other date formats can be specified too. For instance, if you want to prioritize American-style dates with numeric months, you can switch the second and third elements of the default orders as shown below:

# GIVE PRIORITY TO AMERICAN-STYLE DATES
us_ord <- orders[c(1L, 3L, 2L)]

# ADD A FORMAT WITH HOURS TO THE EXISTING orders
# THIS WILL ALLOW FOR THE CONVERSION OF VALUES SUCH AS "2014_04_05_23:15:43"
# WHEN THEY APPEAR IN THE TARGET COLUMNS.
orders$ymdhms <- c("Ymdhms", "Ymdhm")

This function provides users with the flexibility to standardize date columns in their dataset according to specified requirements, including format, timeframe, and error tolerance for conversion from character to date columns.

# STANDARDIZE VALUES IN THE 'date.of.admission' COLUMN
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
head(data$date_first_pcr_positive_test)
#> [1] "Dec 01, 2020" "Jan 01, 2021" "Feb 11, 2021" "Feb 01, 2021" "Feb 16, 2021"
#> [6] "May 02, 2021"

cleaned_data <- standardize_dates(
  data = data,
  target_columns = "date.of.admission",
  format = NULL,
  timeframe = as.Date(c("2021-01-01", "2021-12-01")),
  error_tolerance = 0.4,
  orders = list(
    world_named_months = c("Ybd", "dby"),
    world_digit_months = c("dmy", "Ymd"),
    US_formats = c("Omdy", "YOmd")
  )
)
#> ! Detected 4 values that comply with multiple formats and 1 value that is
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.

# PRINT THE REPORT FROM THIS OPERATION
print_report(cleaned_data, "date_standardization")
#> $multi_format_dates
#>               field idx world_named_months world_digit_months US_formats
#> 1 date.of.admission   1               <NA>         2020-12-01 2020-01-12
#> 2 date.of.admission   4               <NA>         2021-02-11 2021-11-02
#> 3 date.of.admission   9               <NA>         2021-03-02 2021-02-03
#> 4 date.of.admission  10               <NA>         2021-03-05 2021-05-03
#> 
#> $out_of_range_dates
#>   idx            column original_value
#> 1   1 date.of.admission     01/12/2020

This function returns the input dataset where the (specified) columns are converted into Date when the conditions are met. The report made from this operation contains the date values that comply with multiple formats (multi_format_dates) and those that fall outside of the specified timeframe (out_of_range_dates).

# STANDARDIZE VALUES IN ALL COLUMNS
cleaned_data <- standardize_dates(
  data = data,
  target_columns = NULL,
  format = NULL,
  timeframe = NULL,
  error_tolerance = 0.4,
  orders = list(
    world_named_months = c("Ybd", "dby"),
    world_digit_months = c("dmy", "Ymd"),
    US_formats = c("Omdy", "YOmd")
  )
)
#> ! Detected 8 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ! Found <numeric> values that could also be of type <Date> in column:
#>   dateOfBirth.
#> ℹ It is possible to convert them into <Date> using: `lubridate::as_date(x,
#>   origin = as.Date("1900-01-01"))`
#> • where "x" represents here the vector of values from these columns
#>   (`data$target_column`).

# PRINT THE REPORT
print_report(cleaned_data, "date_standardization")
#> $multi_format_dates
#>               field idx world_named_months world_digit_months US_formats
#> 1 date.of.admission   1               <NA>         2020-12-01 2020-01-12
#> 2 date.of.admission   4               <NA>         2021-02-11 2021-11-02
#> 3 date.of.admission   9               <NA>         2021-03-02 2021-02-03
#> 4 date.of.admission  10               <NA>         2021-03-05 2021-05-03
#> 5       dateOfBirth   1               <NA>         1972-01-06 1972-06-01
#> 6       dateOfBirth   7               <NA>         1989-03-11 1989-11-03
#> 7       dateOfBirth   8               <NA>         1976-05-10 1976-10-05
#> 8       dateOfBirth  10               <NA>         1991-08-02 1991-02-08
study_id  event_name country_code country_name date.of.admission dateOfBirth date_first_pcr_positive_test sex
PS001P2   day 0      2            Gambia       2020-12-01        1972-01-06  2020-12-01                   1
PS002P2   day 0      2            Gambia       2021-01-28        1952-02-20  2021-01-01                   1
PS004P2-1 day 0      2            Gambia       2021-02-15        1961-06-15  2021-02-11                   -99
PS003P2   day 0      2            Gambia       2021-02-11        1947-11-11  2021-02-01                   1
P0005P2   day 0      2            Gambia       2021-02-17        2000-09-26  2021-02-16                   2
PS006P2   day 0      2            Gambia       2021-02-17        NA          2021-05-02                   2
PB500P2   day 0      2            Gambia       2021-02-28        1989-03-11  2021-02-19                   1
PS008P2   day 0      2            Gambia       2021-02-22        1976-05-10  2021-09-20                   2
PS010P2   day 0      2            Gambia       2021-03-02        1991-09-23  2021-02-26                   1
PS011P2   day 0      2            Gambia       2021-03-05        1991-08-02  2021-03-03                   2

Standardizing subject IDs

Detecting incorrect, duplicated, and missing subject IDs

The check_subject_ids() function is designed to identify rows from the input dataset where the IDs don’t comply with the expected subject IDs format. It expects the following parameters:

  1. data: A data frame or linelist (required).
  2. target_columns: The name of the column containing the subject IDs in the dataset (required).
  3. nchar: The expected number of characters in the subject IDs (optional).
  4. prefix: A string. If subject IDs have a specific prefix, it is used as a value for this argument. This is optional and can be omitted if there is no prefix.
  5. suffix: A string. If subject IDs have a specific suffix, it is used as a value for this argument. It can be ignored otherwise.
  6. range: A vector of two elements. If there is an expected range of numbers within the subject IDs, define it using this parameter. It is optional and can be omitted if there is no specific range.

By providing these parameters, the function becomes a versatile tool for data cleaning, ensuring that the user is alerted to the presence of unexpected, missing, and duplicated subject IDs. When using the function, make sure to tailor the parameters to the specific requirements of your dataset and the expected characteristics of the subject IDs.
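As an illustration, the compliance rule implied by the parameters used below (prefix "PS", suffix "P2", numbers between 1 and 100, seven characters) can be approximated in base R. This is a hypothetical sketch, not the package's internal implementation:

```r
# HYPOTHETICAL BASE-R APPROXIMATION OF THE COMPLIANCE CHECK:
# AN ID COMPLIES WHEN IT IS 7 CHARACTERS LONG, STARTS WITH "PS",
# ENDS WITH "P2", AND ITS NUMERIC CORE FALLS WITHIN 1-100
ids <- c("PS001P2", "PS004P2-1", "P0005P2", "PB500P2")
complies <- nchar(ids) == 7L &
  startsWith(ids, "PS") &
  endsWith(ids, "P2") &
  suppressWarnings(
    as.integer(substr(ids, 3L, 5L)) >= 1L &
      as.integer(substr(ids, 3L, 5L)) <= 100L
  )
complies
#> [1]  TRUE FALSE FALSE FALSE
```

Only "PS001P2" satisfies all four conditions; the other three IDs are the ones flagged as invalid in the report shown below.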

data <-  readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# MAKE FIRST AND LAST SUBJECT IDS THE SAME
data$study_id[10] <- data$study_id[1]

# SET SUBJECT ID NUMBER 9 TO NA
data$study_id[9] <- NA

# DETECT INCORRECT SUBJECT IDs
cleaned_data <- check_subject_ids(
  data = data,
  target_columns = "study_id",
  prefix = "PS",
  suffix = "P2",
  range = c(1L, 100L),
  nchar = 7L
)
#> ! Detected 1 missing, 2 duplicated, and 2 incorrect subject IDs.
#> ℹ Enter `print_report(data = dat, "incorrect_subject_id")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ You can use the `correct_subject_ids()` function to correct them.

# PRINT THE REPORT FROM THIS OPERATION
print_report(cleaned_data, "incorrect_subject_id")
#> $idx_missing_ids
#> [1] "9"
#> 
#> $duplicated_ids
#> # A tibble: 2 × 3
#> # Groups:   study_id [1]
#>   row_id group_id study_id
#>    <int>    <int> <chr>   
#> 1      1        1 PS001P2 
#> 2     10        1 PS001P2 
#> 
#> $invalid_subject_ids
#>   idx       ids
#> 1   3 PS004P2-1
#> 2   5   P0005P2
#> 3   7   PB500P2

The check_subject_ids() function returns the input dataset and issues a warning when it finds incorrect IDs.

In addition to detecting undesirable subject IDs, the function also looks for missing and duplicated IDs. As a result, the report made from this operation can contain two extra elements: missing_ids (a vector of row indexes where the ID is missing) and duplicated_ids (a data frame of rows with duplicated IDs).

Correcting wrong subject IDs

After detecting incorrect subject IDs with check_subject_ids(), use correct_subject_ids() to replace the non-compliant IDs with the correct ones. The function requires a data frame with the following two columns:

  1. from: a column with the incorrect subject IDs,
  2. to: a column with the values to be used to substitute the incorrect IDs.
# IMPORT THE INPUT DATA
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# GENERATE THE CORRECTION TABLE
correction_table <- data.frame(
  from = c("P0005P2", "PB500P2", "PS004P2-1"),
  to = c("PB005P2", "PB050P2", "PS004P2"),
  stringsAsFactors = FALSE
)

# PERFORM THE CORRECTION
dat <- correct_subject_ids(
  data = data,
  target_columns = "study_id",
  correction_table = correction_table
)

Checking date sequence

The check_date_sequence() function verifies the order of date events within a dataset, ensuring that the values in the specified date columns follow the expected chronological order. Here are the arguments accepted by the function:

  1. data: A data frame or linelist (required).
  2. target_columns: A vector containing the names of the date columns of interest. These columns should be listed in the expected order of occurrence that reflects the chronological sequence of events. For example, target_columns = c("date_of_infection", "date_of_admission", "date_of_death").

By utilizing these arguments, the check_date_sequence() function facilitates the validation of date sequences within a dataset, ensuring data integrity and accuracy for further analysis. Rows with incorrect date sequences are retained in the output and stored in the report object for further examination.

# IMPORT THE DATA AND STANDARDIZE THE TARGET DATE COLUMNS
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi")) %>%
  standardize_dates(
    target_columns = c("date_first_pcr_positive_test", "date.of.admission")
  )
#> ! Detected 4 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.

# DETECT ROWS WITH INCORRECT DATE SEQUENCE
cleaned_data <- check_date_sequence(
  data = data,
  target_columns = c("date_first_pcr_positive_test", "date.of.admission")
)
#> ! Detected 2 incorrect date sequences at lines: "6, 8".
#> ℹ Enter `print_report(data = dat, "incorrect_date_sequence")` to access them,
#>   where "dat" is the object used to store the output from this operation.

# PRINT THE REPORT FROM THIS OPERATION
print_report(cleaned_data, "incorrect_date_sequence")
  date_first_pcr_positive_test date.of.admission row_id
6                   2021-05-02        2021-02-17      6
8                   2021-09-20        2021-02-22      8

The check_date_sequence() function returns the input dataset, augmented with an attribute named incorrect_date_sequence when rows with incorrect date sequences are found. This attribute highlights the discrepancies found in the date sequences, enabling users to take appropriate actions.

Converting character columns into numeric

In certain scenarios, the input data contains columns where numbers are written in letters. For instance, the ages of study participants can be written out in words. Similarly, a column can contain values written in both numbers and letters (an age column where some values are written as digits and others as words). The convert_to_numeric() function offers a framework to convert all numbers written in letters in a given column into numeric values. It calls the main function from the {numberize} package and takes the following arguments:

  1. data: A data frame or linelist (required).
  2. target_columns: A vector containing the names of the columns of interest. When dealing with a linelist, this can be set to linelist_tags if the tagged columns are the ones to be converted into numeric. When target_columns = NULL, the function uses the output from the scan_data() function to identify the columns where the proportion of numeric values is at least twice the percentage of character values. Those columns become the columns of interest, and the character values in them are converted into numeric.
    Note that any string in such columns that cannot be converted into numeric will be set to NA in the resulting data.
  3. lang: A character string with the language in which the letters are written. Currently one of "en", "fr", or "es" for English, French or Spanish respectively.
# CONVERT THE 'age' COLUMN IN THE TEST LINELIST DATA
dat <- readRDS(system.file("extdata", "messy_data.RDS", package = "cleanepi"))
head(dat$age, 10L)
#>  [1] "37"           "seventy-four" "17"           "3"            "37"          
#>  [6] NA             "6"            "26"           "eleven"       "44"

cleaned_data <- convert_to_numeric(
  data = dat,
  target_columns = "age",
  lang = "en"
)
head(cleaned_data$age, 10L)
#>  [1] 37 74 17  3 37 NA  6 26 11 44

# PRINT THE REPORT FROM THIS OPERATION
print_report(cleaned_data, "converted_into_numeric")
#> [1] "age"

Converting numeric values into date

Some columns in a data frame might contain numeric values that represent the number of days elapsed between two events. For instance, the recruitment day of individuals in a study can be stored as a numeric column where the values are the count of days between when they were recruited and when they were admitted to the hospital. The actual dates when the individuals were recruited can be retrieved using the convert_numeric_to_date() function. This function takes the following parameters:

  1. data: the input data frame or linelist.
  2. target_columns: a vector or a comma-separated list of column names to be converted from numeric to date. When the input data is a linelist object, this parameter can be set to linelist_tags if tagged variables are the target columns.
  3. ref_date: a reference date, supplied either as a Date value or as the name of a column of Date values from the input data.
  4. forward: a Boolean that indicates whether the counts started after the reference date (TRUE) or not (FALSE). The default is TRUE.
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi")) %>%
  standardize_dates(target_columns = "date.of.admission")
#> ! Detected 4 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.

# CREATE THE RECRUITMENT DATE COLUMNS
data$recruitment_date <- sample(20:50, nrow(data), replace = FALSE)

# RETRIEVE THE DATE INDIVIDUALS WERE RECRUITED FROM THEIR DATE OF ADMISSION
dat <- convert_numeric_to_date(
  data = data,
  target_columns = "recruitment_date",
  ref_date = "date.of.admission",
  forward = TRUE
)

# RETRIEVE THE DATE INDIVIDUALS WERE RECRUITED FROM 2019-10-13
dat <- convert_numeric_to_date(
  data = data,
  target_columns = "recruitment_date",
  ref_date = as.Date("2019-10-13"),
  forward = FALSE
)

The function returns the input data where the values in the target column(s) are converted into Date. This enables the use of {cleanepi}’s functions that operate on Date columns, as well as the powerful date-manipulation functions from {base} R or the {lubridate} package.
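Once converted, the target columns behave like any other Date vector. The short base R sketch below (illustrative only, using made-up dates rather than the test dataset) shows the kind of manipulation that becomes possible:

```r
# a made-up vector of recruitment dates, as returned by the conversion
recruitment <- as.Date(c("2021-02-15", "2021-03-02"))

# date arithmetic: shift every date by one week
recruitment + 7

# elapsed time between two dates
difftime(recruitment[2], recruitment[1], units = "days")

# extract parts of the dates (month names are locale-dependent)
format(recruitment, "%Y-%m")
weekdays(recruitment)
```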

Finding duplicated rows

The find_duplicates() function serves the purpose of identifying duplicated rows within a given dataset. It accepts the following parameters:

  1. data: The input data frame or linelist.
  2. target_columns: A vector containing either column names or column indices from which duplicated rows will be identified. If NULL is passed, duplicates will be detected across all columns of the dataset. Notably, if the input dataset is a linelist object, target_columns can be set to linelist_tags specifically to identify duplicates across the tagged variables only.

By leveraging the find_duplicates() function with appropriate parameters, users can efficiently pinpoint duplicated rows within their datasets, either across all columns or selectively across tagged variables in a linelist object.

# IMPORT A `linelist` DATA
data <- readRDS(
  system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)

# SHOW THE TAGGED VARIABLES
linelist::tags(data)
#> $date_onset
#> [1] "dt_onset"
#> 
#> $date_reporting
#> [1] "dt_report"
#> 
#> $gender
#> [1] "sex"
#> 
#> $outcome
#> [1] "outcome"

# FIND DUPLICATES ACROSS ALL COLUMNS EXCEPT THE SUBJECT ids COLUMN
all_columns <- names(data)
target_columns <- all_columns[all_columns != "id"]
dups <- find_duplicates(
  data = data,
  target_columns = target_columns
)
#> ℹ No duplicates were found.

# FIND DUPLICATES ACROSS TAGGED VARIABLES
dups <- find_duplicates(
  data = data,
  target_columns = "linelist_tags"
)
#> ! Found 57 duplicated rows in the dataset.
#> ℹ Use `print_report(dat, "found_duplicates")` to access them, where "dat" is
#>   the object used to store the output from this operation.

Upon execution, the find_duplicates() function identifies all duplicated rows, based either on all columns or on those specified, and stores them in the report. In addition to the columns of interest, the stored data frame of duplicates contains two extra columns:

  1. row_id: Contains indices of the duplicated rows from the original input dataset.
  2. group_id: Contains unique identifiers assigned to each duplicated group, which is defined as a set of rows sharing identical values in the designated columns of interest.

By including these extra columns, users gain insights into the specific rows identified as duplicates and their corresponding group identifiers, enabling efficient analysis and management of duplicated data within the dataset. When duplicates are found, the report from this operation will contain the following elements:

# PRINT THE REPORT FROM DUPLICATES FINDING
print_report(dups, "found_duplicates")
#> $duplicated_rows
#> # A tibble: 57 × 6
#> # Groups:   dt_onset, dt_report, sex, outcome [23]
#>    row_id group_id dt_onset   dt_report  sex   outcome
#>     <int>    <int> <date>     <date>     <fct> <fct>  
#>  1     26        1 2015-05-21 2015-06-03 M     Alive  
#>  2     33        1 2015-05-21 2015-06-03 M     Alive  
#>  3     55        2 2015-05-30 2015-06-06 M     Alive  
#>  4     62        2 2015-05-30 2015-06-06 M     Alive  
#>  5     23        3 2015-05-31 2015-06-02 M     Dead   
#>  6     24        3 2015-05-31 2015-06-02 M     Dead   
#>  7     99        4 2015-05-31 2015-06-09 M     Alive  
#>  8    105        4 2015-05-31 2015-06-09 M     Alive  
#>  9     27        5 2015-06-01 2015-06-03 M     Alive  
#> 10     31        5 2015-06-01 2015-06-03 M     Alive  
#> # ℹ 47 more rows
#> 
#> $duplicates_checked_from
#> [1] "dt_onset"  "dt_report" "sex"       "outcome"

Removing duplicates

To eliminate duplicated rows from a dataset, the remove_duplicates() function can be employed. This function internally utilizes the find_duplicates() function and expects the following parameters:

  1. data: A data frame or linelist from which duplicated rows will be removed.
  2. target_columns: A vector containing either column names or indices specifying the columns from which duplicated rows will be identified. If set to NULL, the function will remove duplicates across all columns. If the input dataset is a linelist object, setting this parameter to linelist_tags will identify and remove duplicates across the tagged variables only.
# REMOVE DUPLICATES ACROSS TAGGED COLUMNS ONLY
data <- readRDS(
  system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)

cleaned_data <- remove_duplicates(
  data = data,
  target_columns = "linelist_tags"
)
#> ! Found 57 duplicated rows in the dataset.
#> ℹ Use `print_report(dat, "found_duplicates")` to access them, where "dat" is
#>   the object used to store the output from this operation.

# PRINT THE REPORT FROM THIS OPERATION
print_report(cleaned_data, "removed_duplicates")
row_id dt_onset   dt_report  sex outcome
    33 2015-05-21 2015-06-03 M   Alive
    62 2015-05-30 2015-06-06 M   Alive
    24 2015-05-31 2015-06-02 M   Dead
   105 2015-05-31 2015-06-09 M   Alive
    31 2015-06-01 2015-06-03 M   Alive
    60 2015-06-01 2015-06-06 F   Alive
    73 2015-06-01 2015-06-07 F   Alive
    78 2015-06-01 2015-06-07 F   Alive
    82 2015-06-01 2015-06-07 F   Alive
    85 2015-06-01 2015-06-07 F   Alive
    51 2015-06-02 2015-06-06 F   Dead
    58 2015-06-02 2015-06-06 M   Alive
    61 2015-06-02 2015-06-06 M   Alive
   122 2015-06-02 2015-06-10 F   Alive
    63 2015-06-03 2015-06-06 F   Alive
    71 2015-06-04 2015-06-07 F   Alive
    79 2015-06-04 2015-06-07 F   Alive
    87 2015-06-04 2015-06-07 F   Alive
   108 2015-06-04 2015-06-09 F   Alive
   107 2015-06-06 2015-06-09 F   Alive
   104 2015-06-06 2015-06-09 M   Alive
   133 2015-06-08 2015-06-12 M   Alive
   135 2015-06-08 2015-06-12 M   Alive
   138 2015-06-10 2015-06-12 M   Alive
    49 NA         2015-06-06 M   Alive
    80 NA         2015-06-07 M   Alive
    95 NA         2015-06-08 M   Alive
   121 NA         2015-06-10 M   Alive
   161 NA         2015-06-16 F   Alive
   157 NA         2015-06-16 M   Alive
   158 NA         2015-06-16 M   Alive
   159 NA         2015-06-16 M   Alive
   160 NA         2015-06-16 M   Alive
   162 NA         2015-06-16 M   Alive

Upon execution, the remove_duplicates() function returns the input dataset with the duplicated rows removed (if any were found). The details about the duplicates removal operation are stored in the report object attached to the output. When duplicates are found, this report contains the rows that were removed, stored under the removed_duplicates element.

By examining these elements within the report, users gain insights into the specific duplicated rows, those that were removed, and the columns used to identify the duplicates, thus facilitating transparency and documentation of the duplicates removal process.
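For intuition, the keep-the-first-occurrence behaviour can be mimicked in plain base R with duplicated(); this is an illustrative sketch only and is not how {cleanepi} performs or reports the removal:

```r
# toy data frame with one duplicated (sex, outcome) combination
df <- data.frame(
  id = 1:3,
  sex = c("M", "M", "F"),
  outcome = c("Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)

# keep only the first occurrence of each (sex, outcome) combination
deduped <- df[!duplicated(df[, c("sex", "outcome")]), ]
deduped$id  # 1 and 3: row 2 was dropped as a duplicate
```

Unlike this sketch, remove_duplicates() also records the dropped rows in the report, preserving an audit trail.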

Dictionary-based data substitution

The clean_using_dictionary() function offers a convenient way to replace the options in a data frame or linelist with their corresponding values stored in a data dictionary. The function expects the following arguments:

  1. data: The input data frame or linelist that contains the options to be replaced.
  2. dictionary: The data dictionary in the form of a data frame that contains the complete labels for these options. The structure of this data dictionary file should adhere to the standards expected by the {matchmaker} package, as the clean_using_dictionary() function relies on functions from this package.
options values grp orders
      1   male sex      1
      2 female sex      2
# READING IN THE DATA
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# PERFORM THE DICTIONARY-BASED SUBSTITUTION
cleaned_data <- clean_using_dictionary(
  data = data,
  dictionary = dictionary
)
#> ! Cannot replace "-99" present in column sex but not defined in the dictionary.
#> ℹ You can either:
#> • correct the misspelled option from the input data, or
#> • add it to the dictionary using the `add_to_dictionary()` function.

# PRINT THE REPORT FROM THIS OPERATION
print_report(cleaned_data, "misspelled_values")
idx column value
  3    sex   -99

The add_to_dictionary() function is a useful tool for expanding the coverage of a data dictionary by defining options that are present in the input data but not originally included in the dictionary. This function enables users to dynamically update the dictionary to accommodate new values encountered in the dataset. In addition to the current data dictionary, the function takes the arguments defined below:

option, value, grp, order: the values for the options to be added to the data dictionary. The example below shows how this function is used.

By employing the add_to_dictionary() function, users can ensure that the data dictionary remains comprehensive and aligned with the evolving nature of the input dataset, thereby enhancing the accuracy and completeness of data interpretation and analysis. In the example below, we add -99 to our test data dictionary, test_dictionary.

# ADD THE EXTRA OPTION TO THE DICTIONARY
dictionary <- add_to_dictionary(
  dictionary = dictionary,
  option = "-99",
  value = "unknow",
  grp = "sex",
  order = NULL
)
options values grp orders
      1   male sex      1
      2 female sex      2
    -99 unknow sex      3
# PERFORM THE DICTIONARY-BASED SUBSTITUTION
cleaned_data <- clean_using_dictionary(
  data = data,
  dictionary = dictionary
)
study_id  event_name country_code country_name date.of.admission dateOfBirth date_first_pcr_positive_test sex
PS001P2   day 0      2            Gambia       01/12/2020        06/01/1972  Dec 01, 2020                 male
PS002P2   day 0      2            Gambia       28/01/2021        02/20/1952  Jan 01, 2021                 male
PS004P2-1 day 0      2            Gambia       15/02/2021        06/15/1961  Feb 11, 2021                 unknow
PS003P2   day 0      2            Gambia       11/02/2021        11/11/1947  Feb 01, 2021                 male
P0005P2   day 0      2            Gambia       17/02/2021        09/26/2000  Feb 16, 2021                 female
PS006P2   day 0      2            Gambia       17/02/2021        -99         May 02, 2021                 female
PB500P2   day 0      2            Gambia       28/02/2021        11/03/1989  Feb 19, 2021                 male
PS008P2   day 0      2            Gambia       22/02/2021        10/05/1976  Sep 20, 2021                 female
PS010P2   day 0      2            Gambia       02/03/2021        09/23/1991  Feb 26, 2021                 male
PS011P2   day 0      2            Gambia       05/03/2021        02/08/1991  Mar 03, 2021                 female

Correct misspelled values

The add_to_dictionary() function is not suitable in the presence of a large number of misspelled values. In such cases, users can rely on the correct_misspelled_values() function to fix misspelled values across multiple columns without needing to create or update a data dictionary. The function takes the following arguments:

  1. data: The input data frame or linelist
  2. target_columns: A vector of target columns across which to perform the spelling checks
  3. wordlist: A vector of characters with the words to match to the detected misspelled values
  4. max_distance: An integer that represents the maximum distance allowed for detecting a spelling mistake against the wordlist. The distance is the generalized Levenshtein edit distance. Default is 1.
  5. confirm: A logical used to determine whether to show the user a menu of spelling corrections. If TRUE and R is used interactively, the user will have the option to review the proposed spelling corrections.
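The max_distance threshold corresponds to the generalized Levenshtein edit distance, which can be inspected directly with base R's adist(); the snippet below is purely illustrative and is not part of {cleanepi}'s API:

```r
# edit distances between misspelled values and their intended spellings
adist("confermed", "confirmed")   # 1: one substitution (e -> i)
adist("susspected", "suspected")  # 1: one deletion (extra "s")

# with max_distance = 1 (the default), both values are close enough
# to a wordlist entry to be proposed as corrections
```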
df <- data.frame(
  case_type = c("confirmed", "confermed", "probable", "susspected"),
  outcome = c("died", "recoverd", "did", "recovered"),
  stringsAsFactors = FALSE
)
df

correct_misspelled_values(
  data = df,
  target_columns = c("case_type", "outcome"),
  wordlist = c("confirmed", "probable", "suspected", "died", "recovered"),
  confirm = FALSE
)

Calculating time span in different time scales (“years”, “months”, “weeks”, or “days”)

The timespan() function computes the time span between two elements of type Date. The resulting time span can be expressed in “years”, “months”, “weeks”, or “days”, depending on the user-specified unit. The function takes the following arguments:

  1. data: The input dataset (required).
  2. target_column: A string with the name of the target column (required). The values in this column are expected to be in the Ymd format, i.e. 2024-01-31. The time span will be calculated between these values and the end_date defined below.
  3. end_date: Either a character string naming another column of type Date from the input data, a vector of Date values, or a single Date value (required). This should also be in the ISO 8601 format (“2024-01-31”); its default value is today’s date, Sys.Date().
  4. span_unit: This parameter determines the unit in which the time span is expressed (required). It can be calculated in “years”, “months”, “weeks”, or “days”. By default, the time span is calculated in “years” if this parameter is not provided.
  5. span_column_name: A string for the name of the column added to the input data. The default is span.
  6. span_remainder_unit: A parameter used to determine the unit in which the remainder of the time span calculation will be returned. The possible units are: “days” or “weeks” or “months”. By default, the function returns decimal values i.e. span_remainder_unit = NULL.

With these arguments, the function offers flexibility in determining the time span in different units. It facilitates various analytics tasks where the time span computation is a necessary component, providing users with the ability to customize the output according to their specific requirements.

# IMPORT DATA, REPLACE MISSING VALUES WITH 'NA' & STANDARDIZE DATES
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi")) %>%
  replace_missing_values(
    target_columns = "dateOfBirth",
    na_strings = "-99"
  ) %>%
  standardize_dates(
    target_columns = "dateOfBirth",
    error_tolerance = 0.0,
    format = "%m/%d/%Y" # nolint: nonportable_path_linter.
  )
#> ℹ The target  column will be standardized using the format: "%m/%d/%Y".
#> ! Detected no values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.

# CALCULATE INDIVIDUAL AGE IN YEARS FROM THE 'dateOfBirth' COLUMN AND SEND THE
# REMAINDER IN MONTHS
cleaned_data <- timespan(
  data = data,
  target_column = "dateOfBirth",
  end_date = Sys.Date(),
  span_unit = "years",
  span_column_name = "age_in_years",
  span_remainder_unit = "months"
)

# CALCULATE THE TIME SPAN IN DAYS BETWEEN INDIVIDUALS' DATE OF ADMISSION AND
# THE DAY THEY TESTED POSITIVE
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
cleaned_data <- data %>%
  replace_missing_values(
    target_columns = "dateOfBirth",
    na_strings = "-99"
  ) %>%
  standardize_dates(
    target_columns = c("date_first_pcr_positive_test", "date.of.admission"),
    error_tolerance = 0.0,
    format = NULL
  ) %>%
  timespan(
    target_column = "date_first_pcr_positive_test",
    end_date = "date.of.admission",
    span_unit = "days",
    span_column_name = "elapsed_days",
    span_remainder_unit = NULL
  )
#> ! Detected 4 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.

The timespan() function augments the input dataset by adding one or two extra columns containing the time span information. These additional columns are as follows:

  1. Calculated time span in the specified scale: Contains the calculated time span in the specified unit (“years”, “months”, “weeks”, or “days”).

  2. Remainder from the time span calculation: Indicates the remaining number of “days” or “weeks” or “months” after calculating the time span, representing the fractional part of the time span calculation. This column is included if needed, and provides additional granularity in the time span representation.
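As a rough illustration of how a span and its remainder relate, here is a base R sketch of the idea (not {cleanepi}’s actual implementation):

```r
dob <- as.Date("1972-06-01")
end <- as.Date("2024-01-31")

# whole years elapsed: complete anniversaries of 'dob' before 'end'
anniversaries <- seq(dob, end, by = "year")
years <- length(anniversaries) - 1

# remainder in months: complete months since the last anniversary
months <- length(seq(anniversaries[years + 1], end, by = "month")) - 1

c(years = years, months = months)  # 51 years and 7 months
```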

study_id  event_name country_code country_name date.of.admission dateOfBirth date_first_pcr_positive_test sex elapsed_days
PS001P2   day 0      2            Gambia       2020-12-01        06/01/1972  2020-12-01                   1              0
PS002P2   day 0      2            Gambia       2021-01-28        02/20/1952  2021-01-01                   1             27
PS004P2-1 day 0      2            Gambia       2021-02-15        06/15/1961  2021-02-11                   -99            4
PS003P2   day 0      2            Gambia       2021-02-11        11/11/1947  2021-02-01                   1             10
P0005P2   day 0      2            Gambia       2021-02-17        09/26/2000  2021-02-16                   2              1
PS006P2   day 0      2            Gambia       2021-02-17        NA          2021-05-02                   2            -74
PB500P2   day 0      2            Gambia       2021-02-28        11/03/1989  2021-02-19                   1              9
PS008P2   day 0      2            Gambia       2021-02-22        10/05/1976  2021-09-20                   2           -210
PS010P2   day 0      2            Gambia       2021-03-02        09/23/1991  2021-02-26                   1              4
PS011P2   day 0      2            Gambia       2021-03-05        02/08/1991  2021-03-03                   2              2