parseRPDR


The Research Patient Data Registry (RPDR) is a centralized clinical data registry, or data warehouse, at Partners Hospitals. It is populated by data from several source systems, including the TSI hospital and IDX clinic/physician billing systems from BWH and MGH, as well as data from the Partners Clinical Data Repository (CDR), Epic and the Enterprise Master Patient Index (EMPI). In this tutorial we will go through how to use the parseRPDR R package to load and manipulate the text outputs generated by RPDR. The package is not compatible with the Access databases provided by the system.

Installation

You can install the released version of parseRPDR from CRAN with:

install.packages("parseRPDR")

parseRPDR package functionalities

The aim of the package is to provide a standardized framework to analyze outputs provided by RPDR. All functions of parseRPDR are parallelized to assist large data queries (please see the section on parallelization for details). Data is loaded into the R environment using the load_abc functions, where abc is the three-letter abbreviation of the given datasource. Data is loaded into data.table objects to provide fast and efficient manipulations even on large datasets. Besides importing the data, the load functions also modify the variable names in a standardized fashion to help later analyses:

The functions also do minimal data cleaning to help later analyses:

Besides providing an interface to import text outputs from RPDR into the R environment, parseRPDR also provides functions for common tasks. Similar to the load functions, the package contains another family of functions: convert_abc, where abc is the three-letter abbreviation of the given datasource, which perform common manipulations on a given datasource. A brief description of these can be found below:

Furthermore, parseRPDR provides the create_img_db function which creates a database from the headers of the DICOM images present in the provided folder. Be aware that the function requires python and pydicom to be installed! The function cycles through all folders present in the provided path and recursively goes through every subfolder, and extracts the DICOM header information from the files.

Besides these functions, there are others that are not connected to specific datasources and provide other commonly used functionalities. A brief description of these can be found below:

Parallelization in parseRPDR

All functionalities of parseRPDR are parallelized to assist the analysis of large datasets. The user does not need to know what is being done in the background: simply setting the nThread parameter in the function calls sets everything up. Please be aware that the optimal number of threads depends on the system running the application. By default nThread is set to parallel::detectCores()-1, which means that all except one of the threads on the given machine will be used. Be aware that parallelization also requires additional memory to run the functions. Setting nThread=1 results in sequential analysis, which might be beneficial for small datasets. All parallelizations are done using dynamic load balancing.
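As a minimal sketch of the default described above, the following base R snippet derives a thread count of "all cores except one". The helper name pick_threads is hypothetical and is not part of parseRPDR; capping a requested value at the number of detected cores is an assumption made for illustration only.

```r
# Hypothetical helper (not part of parseRPDR): choose a thread count in the
# spirit of the documented default, i.e. all detected cores except one.
pick_threads <- function(requested = NULL) {
  available <- parallel::detectCores()
  if (is.na(available)) available <- 1L      # detectCores() may return NA
  default <- max(1L, available - 1L)         # never go below one thread
  if (is.null(requested)) return(default)
  # Assumption for this sketch: cap the request at the detected core count
  max(1L, min(as.integer(requested), available))
}

n_default    <- pick_threads()    # mirrors parallel::detectCores() - 1
n_sequential <- pick_threads(1)   # nThread = 1 means sequential processing
```

The resulting value can then be passed to the nThread argument of the package functions.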

NOTE!!!

On macOS, data.table may return a warning message similar to: “data.table X.X.X using 1 threads (see ?getDTthreads)…” Disregard the warning message as the package does not use functionalities affected by this limitation of macOS.

Detailed functionalities of parseRPDR

The first step of most analyses is requesting data from RPDR.

Requesting data from RPDR

RPDR provides three main data sources:

parseRPDR supports the analysis of detailed patient information text files which can be requested separately, in conjunction with a radiological image or biological specimen data request.

There are two main ways to request data from RPDR:

If we have a known list of MRNs, first we need to format them according to the standards of RPDR. parseRPDR provides the pretty_mrn function to help with this. RPDR requires different standard lengths depending on the type of MRN. For example, MGH MRNs must be 7 digits long, while BWH MRNs are 8 digits long. Also, RPDR requires concatenating the source of the MRN to the beginning of the ID. pretty_mrn takes care of these rules automatically depending on the MRN source and converts a vector of MRNs to the format required by RPDR. The result can then be exported using functions such as write.csv() from base R or data.table::fwrite() from the data.table package. Detailed functionality can be found in the help documentation of pretty_mrn.

mrns <- sample(1e4:1e7, size = 10) #Simulate MRNs

#MGH format
pretty_mrn(v = mrns, prefix = "MGH")

#BWH format
pretty_mrn(v = mrns, prefix = "BWH")

#Multiple sources using space as a separator
pretty_mrn(v = mrns[1:3], prefix = c("MGH", "BWH", "EMPI"), sep = " ")

#Keeping the length of the IDs despite not adhering to the requirements
pretty_mrn(v = mrns, prefix = "EMPI", id_length = "asis")
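To make the formatting rules concrete, here is a rough base R sketch of zero-padding and prefixing, followed by an export with write.csv. It is not the implementation of pretty_mrn: the helper format_mrn_sketch, the ":" separator and the output column name are assumptions for illustration, and only the 7 and 8 digit lengths come from the text above.

```r
# Required digit counts quoted in the text (MGH: 7, BWH: 8)
mrn_widths <- c(MGH = 7L, BWH = 8L)

# Hypothetical helper, not the package's implementation
format_mrn_sketch <- function(v, prefix, sep = ":") {
  width <- mrn_widths[[prefix]]
  # Left-pad with zeros to the required number of digits
  ids <- formatC(as.integer(v), width = width, flag = "0", format = "d")
  # Concatenate the MRN source to the beginning of the ID
  paste(prefix, ids, sep = sep)
}

formatted <- format_mrn_sketch(c(12345, 9876543), prefix = "MGH")

# Export the formatted IDs with base R; the column name is an assumption
out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(MRN = formatted), out_file, row.names = FALSE)
```

In practice pretty_mrn should be preferred, as it handles the rules for all supported MRN sources.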

Once we have the IDs, then we can request data from RPDR and continue similarly as if we were to use the query tool of RPDR.

NOTE!!!

Be aware that RPDR uses Enterprise Master Patient Index IDs in the background. This means that the supplied MRNs are converted to EMPI IDs and these are used to fetch data for the patients. Therefore, the returned IDs might not match the requested IDs, if for example the patient has a new MGH ID. Therefore, it is advised to use the EMPI IDs to merge different data sources (provided by all load functions in the column: ID_MERGE) and manually check instances where the requested and returned IDs do not match to be sure that the right patient data has been retrieved.
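The advice above can be sketched with simulated data. Only the ID_MERGE column name is taken from the text; the other column names and all values are hypothetical stand-ins for loaded datasources.

```r
# MRNs that were sent to RPDR (simulated)
requested_mrns <- c("MGH:0000001", "MGH:0000002", "MGH:0000003")

# Simulated datasources; column names other than ID_MERGE are hypothetical
d_dem <- data.frame(ID_MERGE     = c("E1", "E2", "E3"),
                    mrn_returned = c("MGH:0000001", "MGH:0000009", "MGH:0000003"),
                    stringsAsFactors = FALSE)
d_lab <- data.frame(ID_MERGE  = c("E1", "E2", "E2"),
                    lab_value = c(1.2, 3.4, 5.6),
                    stringsAsFactors = FALSE)

# Merge the datasources on the EMPI-based identifier
merged <- merge(d_dem, d_lab, by = "ID_MERGE", all.x = TRUE)

# Flag patients whose returned MRN was not among the requested ones,
# so that they can be checked manually
to_check <- d_dem[!d_dem$mrn_returned %in% requested_mrns, ]
```

Here the patient with ID_MERGE "E2" would be flagged for manual review because the returned MRN differs from every requested MRN.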

Requesting radiological images from RPDR

As stated above, hospital provided IDs may change over time. While using EMPIs as IDs to merge different data sources solves this issue, in the case of radiological images mi2b2 previously only worked with MGH or BWH IDs, therefore a complete list of all MRNs the patients had at any time may be needed. Requesting images using Enterprise Master Patient Index solves this issue and is the preferred method now.

If a predefined set of MRNs is used to request data from RPDR (not the query tool), then it is advised to parse out all IDs present in the con and mrn datasources using the all_ids_mi2b2 function of parseRPDR, as specified under the load_con function in this document. This may be needed as the mi2b2 workbench may only work for requested MRNs. Therefore, if the MGH MRN changes for a given patient and we wish to access an image that was saved using the previous MRN, then the most recent MRN won’t grant us access to that image. In case we work with a predefined set of MRNs, after requesting data for all the datasources that we need, the user should parse out a complete list of MRNs and request the images from mi2b2 using this list.

Loading data using parseRPDR

parseRPDR provides individual functions to load each type of dataset into the R environment. For note type files (car, dis, end, hnp, opn, pat, prg, pul, rad and vis), there is a load_notes function that can be used for any type of report file. For all datasources, the load_notes and load_abc functions, where abc is the three-letter abbreviation of the given datasource, are very similar in nature and have the same arguments. In most cases the default values of the input arguments are satisfactory and only the location of the data needs to be specified. Nevertheless, the arguments provide full flexibility if needed.

load_con function

The con.txt dataset is a unique datasource as it has a Patient_ID_List column which contains all MRNs from all hospitals. As MRNs may change over time, this list contains possible alternative IDs for each hospital that were previously used by the patient. There are also additional hospital IDs present in this list. parseRPDR converts this list into specific columns, which have “_list” appended to their names. As there may be IDs which are only present in a minority of patients, the load function has the argument perc to specify what percentage of patients should have a given ID for it to be added as an additional column to the output of the load_con function. The default is 0.6, corresponding to 60%. Also, by default the MRN_Type and MRN columns are parsed so that information from these columns is also provided. This should only be done for one datasource, as all datasources contain the same information in these columns.
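The idea behind the perc threshold can be sketched in base R as follows. The layout of the ID list ("SOURCE: ID" entries separated by commas), the PMRN source name and the use of "greater than or equal to" for the threshold are assumptions made for illustration; this is not how load_con is implemented.

```r
# Simulated Patient_ID_List values; the exact layout is an assumption
id_lists <- c("MGH: 0012345, BWH: 00123456",
              "MGH: 0054321",
              "MGH: 0099999, BWH: 00999999, PMRN: 555")

# Split one list into a named character vector (names are the ID sources)
parse_id_list <- function(x) {
  entries <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  parts   <- strsplit(entries, ":\\s*")
  ids     <- vapply(parts, function(p) p[2], character(1))
  names(ids) <- vapply(parts, function(p) p[1], character(1))
  ids
}
parsed  <- lapply(id_lists, parse_id_list)
sources <- unique(unlist(lapply(parsed, names)))

# Proportion of patients that have an ID from each source
share <- vapply(sources, function(s) {
  mean(vapply(parsed, function(p) s %in% names(p), logical(1)))
}, numeric(1))

# Keep only sources present in at least perc of the patients
perc <- 0.6
keep <- sources[share >= perc]

# One column per kept source, NA where a patient has no such ID
id_cols <- as.data.frame(lapply(setNames(keep, keep), function(s) {
  vapply(parsed, function(p) if (s %in% names(p)) p[[s]] else NA_character_,
         character(1))
}), stringsAsFactors = FALSE)
```

With perc set to 0.6, the rarely occurring third source is dropped, while setting perc to a very small value would keep every source as its own column.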

load_notes function

There are several types of note files provided by RPDR: car, dis, end, hnp, opn, pat, prg, pul, rad and vis. The load_notes function provides a single interface to load these files. Only the type of note must be provided, and a standard output is returned for each type of note. The report text is returned if load_report is TRUE, which is the default. By default all formatting is removed from the text to save space and make later manipulations easier. However, setting format_orig to TRUE returns the text in its original format.

#Using defaults
d_hnp <- load_notes(file = "test_Hnp.txt", type = "hnp")
#Use sequential processing
d_hnp <- load_notes(file = "test_Hnp.txt", type = "hnp", nThread = 1)
#Use parallel processing and parse data in MRN_Type and MRN columns and keep all IDs
d_hnp <- load_notes(file = "test_Hnp.txt", type = "hnp", nThread = 20, mrn_type = TRUE, perc = 1, load_report = TRUE, format_orig = TRUE)

load_all_data function

parseRPDR provides a convenient function to load all RPDR data using a single line of code. The load_all_data function can be used for loading different datasources at once and/or for loading multiple files of the same datasource, which occurs if our query results in more than 25,000 patients. The function requires a folder path instead of a file path. Also, we can use the which_data argument to specify which datasources we wish to load into the list. All datasets are supported. In case there are multiple files for a given datasource, add a “_” and a number (i.e. con_1, con_2) so that the files of the same datasource are merged into a single output in the order of the provided numbers.

The Demographics table data definitions changed in 2022. Currently, the software supports the latest version. However, the load_all_data function can also process data prior to these changes by setting the old_dem argument to TRUE, in which case the load_dem_old function is used, which corresponds to the load_dem function prior to version 0.2.2.

The load_all_data function is parallelized for efficiency and supports two forms of parallelization: either across datasources (i.e. mrn, con and dem are loaded in parallel) or within a datasource (if multiple files are present per datasource, i.e. con_1, con_2 etc.). The many_sources argument specifies which one to use. If set to TRUE, parallelization is done across datasources; if FALSE, parallelization is done within each datasource. It should only be set to FALSE if there is more than one txt file per datasource. If the user does not wish to use parallel processing (i.e. for a small dataset), then setting nThread=1 results in sequential processing. However, be aware that all function calls within load_all_data run sequentially, that is, each loading sub-process uses 1 thread, so that the load functions running in parallel do not cause issues with each other.

NOTE!!!

Large datasets may crash R as they exceed the available memory. In this case consider loading specific data sources separately and filtering out cases to decrease the amount of memory needed.

RPDR returns large requests in multiple files (when the number of patients exceeds 25,000). These are arranged so that a given individual's data is present in only one batch. Therefore, for large queries it is advised to process the data batch by batch: do all your calculations on the datasources of a given batch (i.e. mrn, lab etc.), repeat this for each batch, and then concatenate the results from each batch of 25,000 patients to obtain the final database.

#Load all Con, Dem and Mrn datasets processing all files within given datasource in parallel
load_all_data(folder = folder_rpdr, which_data = c("con", "dem", "mrn"), nThread = 2, many_sources = FALSE)

#Load all supported file types parallelizing on the level of datasources
load_all_data(folder = folder_rpdr, nThread = 2, many_sources = TRUE, load_report = TRUE, format_orig = TRUE)
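The batch-wise strategy recommended for large queries can be sketched in base R with simulated data. The data frames below are stand-ins for the datasources of each batch, and the column lab_value and the per-batch summary are hypothetical; in a real analysis each batch would first be loaded with the load functions.

```r
# Simulated batches; each patient appears in exactly one batch
batches <- list(
  data.frame(ID_MERGE = c("E1", "E1", "E2"), lab_value = c(1, 3, 5),
             stringsAsFactors = FALSE),
  data.frame(ID_MERGE = c("E3", "E3"), lab_value = c(2, 4),
             stringsAsFactors = FALSE)
)

# Hypothetical per-batch calculation: highest lab value per patient
summarise_batch <- function(d) {
  aggregate(lab_value ~ ID_MERGE, data = d, FUN = max)
}

# Process one batch at a time, keeping only the small summary in memory
results <- lapply(batches, summarise_batch)

# Concatenate the per-batch results into the final database
final <- do.call(rbind, results)
```

Because no patient is split across batches, per-patient summaries computed within a batch remain valid after concatenation.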

Manipulating data using parseRPDR

parseRPDR provides a family of convert_abc functions, where abc is the three-letter abbreviation of the given datasource, which execute common manipulations on a given datasource. Besides these functions the package also provides functions to assist common tasks which are not dependent on a given datasource. The arguments are standard among the functions:

convert_enc - Parsing ICD codes and finding disease diagnoses

parseRPDR provides the convert_enc function to format ICD codes provided by RPDR in the encounter tables. It also identifies disease groups by searching through the registered diagnoses and providing an indicator column whether that encounter has any of the ICD codes associated with that given disease. For this we need to specify the following arguments:

The function returns the data.table provided in the argument d with the newly parsed ICD columns starting with ICD_, the original ICD columns if requested, and the new indicator columns if a list is supplied in the codes_to_find argument.

The function can also provide summary information by collapsing the data.table based on the column specified in collapse. For example, if the ID_MERGE column is given, then a data.table is returned with one row for each unique value of the column given in collapse, together with the indicator columns showing whether that ID had the given ICD codes at any time. Furthermore, aggr_type defines whether the earliest or latest time, taken from the column given in code_time, is returned if multiple occurrences are present. Also, if for example a complex ID is created prior to the function call based on the ID and the encounter timepoint, then the function can collapse the results based on this complex ID and return whether any of the ICD codes were registered during encounters on the same day.

#Parse encounter ICD columns and keep original ones as well
data_enc_parse <- convert_enc(d = data_enc, keep = TRUE, nThread = 2)

#Parse encounter ICD columns and discard original ones,
#and create indicator variables for the following diseases
diseases <- list(HT = c("I10"), Stroke = c("434.91", "I63.50"))
data_enc_disease <-  convert_enc(d = data_enc, keep = FALSE, codes_to_find = diseases, nThread = 2)

#Parse encounter ICD columns and discard original ones,
#and create indicator variables for the following diseases and summarize per patient
#whether there are any encounters where the given diseases were registered
diseases <- list(HT = c("I10"), Stroke = c("434.91", "I63.50"))
data_enc_disease <-  convert_enc(d = data_enc, keep = FALSE, codes_to_find = diseases, nThread = 2,
                                 collapse = "ID_MERGE", aggr_type = "earliest")
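To illustrate what indicator columns and collapsing per patient mean, here is a base R sketch on simulated encounters. It is only a conceptual illustration and not the implementation of convert_enc; all column names except ID_MERGE, as well as the names of the output columns, are assumptions.

```r
# Simulated encounters with already parsed ICD columns (names are assumptions)
enc <- data.frame(
  ID_MERGE = c("E1", "E1", "E2", "E3"),
  time_enc = as.Date(c("2020-01-10", "2019-05-01", "2021-03-03", "2018-07-07")),
  ICD_1    = c("I10", "I63.50", "434.91", "E11.9"),
  ICD_2    = c(NA, "I10", NA, NA),
  stringsAsFactors = FALSE
)
diseases <- list(HT = c("I10"), Stroke = c("434.91", "I63.50"))
icd_cols <- c("ICD_1", "ICD_2")

# Indicator per encounter: TRUE if any ICD column holds one of the codes
for (nm in names(diseases)) {
  hits <- sapply(enc[icd_cols], function(col) col %in% diseases[[nm]])
  enc[[nm]] <- rowSums(hits) > 0
}

# Earliest date on which a flag was TRUE, NA if it never was
earliest <- function(flag, time) {
  if (any(flag)) min(time[flag]) else as.Date(NA)
}

# Collapse to one row per patient
per_patient <- do.call(rbind, lapply(split(enc, enc$ID_MERGE), function(d) {
  data.frame(ID_MERGE    = d$ID_MERGE[1],
             HT          = any(d$HT),
             time_HT     = earliest(d$HT, d$time_enc),
             Stroke      = any(d$Stroke),
             time_Stroke = earliest(d$Stroke, d$time_enc),
             stringsAsFactors = FALSE)
}))
```

Replacing min with max in the earliest helper would correspond to requesting the latest occurrence instead.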

convert_dia - Searching for given diagnosis codes

Similar to convert_enc, the convert_dia function is used to search for given diagnosis groups within the diagnosis data.table. The difference is that the diagnoses do not need to be parsed as they are stored separately. Also, since not only ICD codes are present, the code type needs to be defined as well. For this, instead of a simple code, a complex code should be given in the form of code_type:code, e.g. ICD9:250.00. All other functionalities are similar to convert_enc.

#Search for Hypertension and Stroke ICD codes
diseases <- list(HT = c("ICD10:I10"), Stroke = c("ICD9:434.91", "ICD10:I63.50"))
data_dia_parse <- convert_dia(d = data_dia, codes_to_find = diseases, nThread = 2)

#Search for Hypertension and Stroke ICD codes and summarize per patient providing earliest time
diseases <- list(HT = c("ICD10:I10"), Stroke = c("ICD9:434.91", "ICD10:I63.50"))
data_dia_disease <-  convert_dia(d = data_dia, codes_to_find = diseases, nThread = 2,
                                 collapse = "ID_MERGE", aggr_type = "earliest")

convert_phy - Searching for given health history codes

Similar to convert_enc, the convert_phy function is used to search for given health history codes within the health history data.table. The difference is that the health history codes do not need to be parsed as they are stored separately.

#Search for Height and Weight codes and summarize per patient providing earliest time
anthropometrics <- list(Weight = c("LMR:3688", "EPIC:WGT"), Height = c("LMR:3771", "EPIC:HGT"))
data_phy_parse <- convert_phy(d = data_phy, codes_to_find = anthropometrics, nThread = 2,
                              collapse = "ID_MERGE", aggr_type = "earliest")

convert_prc - Searching for given procedures

Similar to convert_enc, the convert_prc function is used to search for given procedures within the procedures data.table. The difference is that the procedures do not need to be parsed as they are stored separately.

#Parse procedure columns and create indicator variables for the following procedures
#and summarize per patient, whether there are any procedures registered
procedures <- list(Anesthesia = c("CPT:00410", "CPT:00104"))
data_prc_procedures <- convert_prc(d = data_prc, codes_to_find = procedures, nThread = 2,
                                   collapse = "ID_MERGE", aggr_type = "earliest")

convert_rfv - Searching for given reason for visit codes

Similar to convert_enc, the convert_rfv function is used to search for given reason for visit groups within the reason for visit data.table. The difference is that the reasons for visit do not need to be parsed as they are stored separately.

#Parse reason for visit columns and create indicator variables for the following reasons
#and summarize per patient, whether there are any encounters where the given reasons were registered
reasons <- list(Pain = c("ERFV:160357", "ERFV:140012"), Visit = c("ERFV:501"))
data_rfv_disease <-  convert_rfv(d = data_rfv, keep = FALSE, codes_to_find = reasons, nThread = 2,
                                 collapse = "ID_MERGE")

convert_lab - Parsing lab results and identifying abnormal values

Laboratory results can be loaded using the load_lab function. However, RPDR only provides the raw results of the tests (e.g. <0.03), which makes them difficult to use in later analyses. parseRPDR provides the convert_lab function, which converts the results to numerical values. This is done by replacing <x or >x notations with the value x, as the only thing we know for certain is that the value is not larger or smaller than the boundary value. When a range is returned, the upper bound of the range is provided. Also, qualitative results are converted to standard outputs of POS: positive, NEG: negative and BORD: borderline. These converted values are returned in a new column called lab_result_pretty. The function also returns a column lab_result_abn_pretty, which uses the normal range values to determine whether the value is normal or abnormal. Please be aware that values can be represented in very different ways, and in some cases this will result in misclassification of values. Therefore a new column, lab_result_abn_flag_pretty, is added, which gives back ABNORMAL if there is any character in the Abnormal_Flag column in RPDR (lab_result_abn using load_lab). Borderline values are considered normal.

#Convert loaded lab results
data_lab_pretty <- convert_lab(d = data_lab)
data_lab_pretty[, c("lab_result", "lab_result_pretty", "lab_result_range", "lab_result_abn_pretty")]
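The conversion rules described above can be approximated with a few regular expressions in base R. The two helpers below are hypothetical and only cover the simple cases named in the text (boundary values, ranges, plain numbers and three qualitative labels); the actual convert_lab function handles many more representations.

```r
# Hypothetical helper: convert raw result strings to numbers
to_numeric_sketch <- function(x) {
  x   <- trimws(x)
  out <- rep(NA_real_, length(x))
  # "<0.03" or ">100": keep the boundary value itself
  bound <- grepl("^[<>]=?\\s*[0-9.]+$", x)
  out[bound] <- as.numeric(sub("^[<>]=?\\s*", "", x[bound]))
  # "1.0-2.5": keep the upper bound of the range
  rng <- grepl("^[0-9.]+\\s*-\\s*[0-9.]+$", x)
  out[rng] <- as.numeric(sub("^.*-\\s*", "", x[rng]))
  # Plain numbers are converted directly
  plain <- grepl("^-?[0-9.]+$", x)
  out[plain] <- as.numeric(x[plain])
  out
}

# Hypothetical helper: map qualitative results to POS, NEG or BORD
to_qualitative_sketch <- function(x) {
  lx  <- tolower(trimws(x))
  out <- rep(NA_character_, length(x))
  out[grepl("^pos", lx)]  <- "POS"
  out[grepl("^neg", lx)]  <- "NEG"
  out[grepl("^bord", lx)] <- "BORD"
  out
}

raw          <- c("<0.03", ">100", "1.0-2.5", "4.2", "Positive")
numeric_part <- to_numeric_sketch(raw)
quali_part   <- to_qualitative_sketch(raw)
```

Results that match none of the patterns stay NA in this sketch, which makes unhandled representations easy to spot during quality control.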

convert_med - Parsing medication results and finding given medications

When putting together a database, a common question is whether a patient has received a given medication or not. parseRPDR provides the convert_med function to search for the presence of given medications. Its use is similar to convert_enc but without converting the columns, as medication info columns are not standard. Therefore, we only have to provide the data.table loaded using load_med and a list of medications, which can contain multiple list entries. The function also provides summary statistics using the collapse argument, which checks whether a given medication is present for each ID in the column specified. If this column is a patient ID, then the summary is done per person. However, composite IDs can also be created and provided, for example to summarize within each encounter of a given patient. This functionality works similarly to convert_enc. The function is case insensitive, therefore differences in capitalization do not matter.

#Define medication group and add an indicator column whether the
#given medication group was administered
meds <- list(statin = c("Simvastatin", "Atorvastatin"),
             NSAID  = c("Acetaminophen", "Paracetamol"))

data_med_indic <- convert_med(d = data_med, codes_to_find = meds, nThread = 2)
data_med_indic[, c("statin", "NSAID")]

#Summarize per patient if they ever had the given medication groups registered
data_med_indic_any <- convert_med(d = data_med, codes_to_find = meds, nThread = 2,
                                  collapse = "ID_MERGE", aggr_type = "earliest")
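The case-insensitive search can be sketched in base R with grepl. The medication groups are reused from the example above, while the data, the med_name column and the choice of substring matching are assumptions for illustration and may differ from how convert_med matches medications.

```r
# Simulated medication records; med_name is a hypothetical column name
med <- data.frame(
  ID_MERGE = c("E1", "E1", "E2", "E3"),
  med_name = c("SIMVASTATIN 20 MG TABLET", "Acetaminophen 500mg",
               "atorvastatin 40 mg", "Lisinopril 10 mg"),
  stringsAsFactors = FALSE
)
meds <- list(statin = c("Simvastatin", "Atorvastatin"),
             NSAID  = c("Acetaminophen", "Paracetamol"))

# One indicator column per medication group, ignoring capitalization
for (nm in names(meds)) {
  pattern   <- paste(meds[[nm]], collapse = "|")
  med[[nm]] <- grepl(pattern, med$med_name, ignore.case = TRUE)
}

# Summarize per patient: was the group ever registered?
per_patient <- data.frame(ID_MERGE = sort(unique(med$ID_MERGE)),
                          stringsAsFactors = FALSE)
for (nm in names(meds)) {
  ever <- tapply(med[[nm]], med$ID_MERGE, any)
  per_patient[[nm]] <- as.vector(ever[per_patient$ID_MERGE])
}
```

Using a composite ID instead of ID_MERGE in the last step would give the same summary per encounter rather than per patient.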

convert_notes - Extracting information from note free text

There is a lot of information in free-text note reports. parseRPDR does not provide natural language processing or other text analysis techniques; however, the convert_notes function splits the reports into sections specified by the user. This function should be used for data imported using the load_notes or load_lno functions. Many reports have identical sections, such as findings or impressions. These potential sections, called anchors, may be provided, in which case the convert_notes function extracts the text following the specific section name, up to the next section. This way the report is cut into different parts, which are easier to analyze later. The function needs an array of anchor points, which is different for each type of note. For potential defaults please see the documentation of the function.

A few things to note. The order of the section elements does not matter, as the function figures out the next closest one, therefore any arbitrary order can be given. “report_end” should always be present among the anchors, as it is the last text located in all reports. The anchor points are case sensitive, therefore be careful when specifying them. Multiple versions of an anchor may also be given, in which case the user can merge the resulting columns later to cover all possible differences in spelling. The anchor points may also be standard values reported in the text, such as CAD-RADS on cardiac CTA or EF on cardiac echo.
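The anchor logic described above can be sketched in base R. The helper split_by_anchors and the example report are hypothetical; the sketch uses only the first occurrence of each anchor and is not the implementation of convert_notes.

```r
# Hypothetical helper: cut one report into sections defined by anchors
split_by_anchors <- function(text, anchors) {
  # Position of the first case-sensitive match of each anchor (-1 if absent)
  pos <- vapply(anchors,
                function(a) as.integer(regexpr(a, text, fixed = TRUE)),
                integer(1))
  out <- setNames(rep(NA_character_, length(anchors)), anchors)
  for (a in anchors[pos > 0]) {
    start <- pos[[a]] + nchar(a)
    # The section ends where the next closest anchor begins
    later <- pos[pos > pos[[a]]]
    end   <- if (length(later) > 0) min(later) - 1L else nchar(text)
    section  <- substr(text, start, end)
    out[[a]] <- trimws(sub("^\\s*:", "", section))  # drop a leading colon
  }
  out
}

report <- paste("HISTORY: chest pain. FINDINGS: no acute disease.",
                "IMPRESSION: normal study. report_end")

# The anchors can be given in any order
sections <- split_by_anchors(report,
                             c("IMPRESSION", "HISTORY", "FINDINGS", "report_end"))
```

Anchors that do not occur in a report simply yield NA, which is why alternative spellings can be supplied side by side and merged afterwards.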

#Create columns with specific parts of the radiological report defined by anchors
data_rad_parsed <- convert_notes(d = data_rad, code = "rad_rep_txt",
                                 anchors = c("Exam Code", "Ordering Provider", "HISTORY",
                                             "Associated Reports", "Report Below", "REASON",
                                             "REPORT", "TECHNIQUE", "COMPARISON", "FINDINGS",
                                             "IMPRESSION", "RECOMMENDATION", "SIGNATURES",
                                             "report_end"), nThread = 2)

export_notes - Saving individual notes as text files

The export_notes function provides a standardized framework to write out each individual report from loaded notes. The text of each note is saved as a text file into a given folder. The folder should already be present, otherwise the function does not work.

#Load notes maintaining the original formatting of the document and save each note as a separate text file named using the patient identifier and the note number
d <- load_notes("Car.txt", type = "car", nThread = 2, format_orig = TRUE)
export_notes(d, folder = "/Users/Test/Notes/", code = "car_rep_txt", name1 = "ID_MERGE", name2 = "car_rep_num")

find_exam - Finding exams within a timeframe of a timepoint

One of the most common tasks in creating a database from clinical data, is to find given examinations (lab results, radiological examinations, diagnoses etc.) within a given timeframe of an event or encounter. For this, parseRPDR provides the find_exam function, which tries to find the earliest, closest or all the examinations within a given timeframe of an event or encounter. The function runs the search in parallel on multiple threads which can be specified using nThread. If it is set to 1, then no parallel backends are created and the function is executed sequentially. A progress bar is also reported in the terminal. The function allows flexibility using its input arguments:

Here we present example cases of using the function to locate radiological examinations within a timeframe of given encounters.

#Filter encounters for first emergency visits at one of MGH's ED departments
data_enc_ED <- data_enc[enc_clinic == "MGH EMERGENCY (10020010608)"]
data_enc_ED <- data_enc_ED[!duplicated(data_enc_ED$ID_MERGE)]

#Find all radiological examinations within 3 days of the ED registration
rdt_ED <- find_exam(d_from = data_rdt, d_to = data_enc_ED,
                    d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                    d_from_time = "time_rdt_exam", d_to_time = "time_enc_admit",
                    time_diff_name = "time_diff_ED_rdt", before = TRUE, after = TRUE,
                    time = 3, time_unit = "days", multiple = "all",
                    nThread = 2)

#Find earliest radiological examinations within 3 days of the ED registration
rdt_ED <- find_exam(d_from = data_rdt, d_to = data_enc_ED,
                    d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                    d_from_time = "time_rdt_exam", d_to_time = "time_enc_admit",
                    time_diff_name = "time_diff_ED_rdt", before = TRUE, after = TRUE,
                    time = 3, time_unit = "days", multiple = "earliest",
                    nThread = 2)

#Find closest radiological examinations on or after 1 day of the ED registration
#and add primary diagnosis column from encounters
rdt_ED <- find_exam(d_from = data_rdt, d_to = data_enc_ED,
                    d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                    d_from_time = "time_rdt_exam", d_to_time = "time_enc_admit",
                    time_diff_name = "time_diff_ED_rdt", before = FALSE, after = TRUE,
                    time = 1, time_unit = "days", multiple = "earliest",
                    add_column = "enc_diag_princ", nThread = 2)

#Find closest radiological examinations on or after 1 day of the ED registration but
#also provide empty rows for patients with exam data but not within the timeframe
rdt_ED <- find_exam(d_from = data_rdt, d_to = data_enc_ED,
                    d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                    d_from_time = "time_rdt_exam", d_to_time = "time_enc_admit",
                    time_diff_name = "time_diff_ED_rdt", before = FALSE, after = TRUE,
                    time = 1, time_unit = "days", multiple = "earliest",
                    add_column = "enc_diag_princ", keep_data = TRUE,
                    nThread = 2)
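For readers who want to see the underlying idea, the following base R sketch finds the earliest exam within 3 days after an index date on simulated data. It is a simplified illustration and not how find_exam is implemented; all column names except ID_MERGE are hypothetical, and the time difference is assumed to be counted in days with positive values meaning after the index date.

```r
# Simulated exams and index encounters
exams <- data.frame(
  ID_MERGE  = c("E1", "E1", "E2"),
  time_exam = as.Date(c("2020-01-02", "2020-01-09", "2020-03-01")),
  stringsAsFactors = FALSE
)
index <- data.frame(
  ID_MERGE   = c("E1", "E2"),
  time_index = as.Date(c("2020-01-01", "2020-01-01")),
  stringsAsFactors = FALSE
)

# Pair every exam with the index encounter of the same patient
m <- merge(exams, index, by = "ID_MERGE")

# Time difference in days; positive values are after the index date
m$time_diff <- as.numeric(m$time_exam - m$time_index)

# Keep exams on or up to 3 days after the index date
in_window <- m[m$time_diff >= 0 & m$time_diff <= 3, ]

# Keep only the earliest exam per patient
in_window   <- in_window[order(in_window$ID_MERGE, in_window$time_diff), ]
earliest_ex <- in_window[!duplicated(in_window$ID_MERGE), ]
```

Patients without any exam inside the window, such as E2 here, drop out of the result, which corresponds to the default behavior that the keep_data argument is described as changing.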

The function only supports intervals that end at or begin at the given encounter. If, for example, the user wishes to find out whether a patient had any radiological examination more than 1 year prior to the encounter, then the search criteria should be set to a large interval prior to the encounter, and then with one line of code the user can filter out all exams which are less than 365 days from the encounter using the resulting time_diff_name column in the returned data.table.
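The filtering step mentioned above might look like the following on a simulated result. The column name is taken from the earlier examples, but the values are simulated, and treating the column as a plain number of days (with any sign convention) is an assumption, which is why the absolute value is used.

```r
# Simulated output of a wide search before the encounter
res <- data.frame(ID_MERGE         = c("E1", "E1", "E2"),
                  time_diff_ED_rdt = c(-400, -30, -500),
                  stringsAsFactors = FALSE)

# Drop exams that are less than 365 days from the encounter
res_old <- res[abs(res$time_diff_ED_rdt) >= 365, ]
```

If the returned column is a difftime object rather than a number, converting it with as.numeric and the appropriate units before filtering would be needed.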

A similar task might be to find out whether a patient had a diagnosis of a given disease within a timeframe of an index encounter or event. For this we can use the find_exam and convert_enc functions to locate all encounters within a timeframe and see whether given diseases occurred within this timeframe.

First we create a data.table containing the index encounters. Then we create a unique ID containing both the ID and the time variable, so that if a patient has multiple index encounters, they will not be combined but handled as separate encounters. Then we search for all encounters (this could also be diagnoses) within a given timeframe. Finally, we convert the ICD codes, search for the given diagnoses and collapse the results based on the created unique ID to obtain a data.table which shows whether a given disease was diagnosed within the timeframe of each index encounter.

#Filter encounters for emergency visits at one of MGH's ED departments
data_enc_ED <- data_enc[enc_clinic == "MGH EMERGENCY (10020010608)"]

#Create new column adding a time stamp to ID if individual had multiple encounters 
data_enc_ED$ID_MERGE_time <- paste0(data_enc_ED$ID_MERGE, "_", data_enc_ED$time_enc_admit)

#Find all encounters within 30 days after the registration to ED
data_enc_ED_30d <- find_exam(d_from = data_enc, d_to = data_enc_ED,
                             d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                             d_from_time = "time_enc_admit", d_to_time = "time_enc_admit",
                             time_diff_name = "time_diff_ED_enc", before = FALSE, after = TRUE,
                             time = 30, time_unit = "days", multiple = "all",
                             add_column = "ID_MERGE_time", keep_data = FALSE, nThread = 2)

#Combine encounters and search if any of them registered a given diagnosis for each ID
#and admission time unique ID and return the earliest date of occurrence
enc_cols <- colnames(data_enc)[19:30]
diseases <- list(HT = c("I10"), Stroke = c("434.91", "I63.50"))

data_enc_ED_30d_summ <- convert_enc(d = data_enc_ED_30d, code = enc_cols,
                                    keep = FALSE, codes_to_find = diseases,
                                    collapse = "ID_MERGE_time", nThread = 2)

#Merge original encounter data with 30 day summary data
data_enc_ED_ALL <- data.table::merge.data.table(x = data_enc_ED, y = data_enc_ED_30d_summ,
                                                by = "ID_MERGE_time", all.x = TRUE, all.y = FALSE)

create_img_db - Creating a DICOM header database

In many cases, additional information regarding the radiological examinations is needed that is not stored in the rdt file, for example the exact time when the image was taken, not just the date. For this we need to extract meta-data from the radiological images, which can be done using the create_img_db function. Please be aware that the function requires python and pydicom to be installed! The function creates a database of the DICOM headers present in a folder structure. Each series should be in its own folder, but the folders can be nested. Files which have folders next to them at the same level will not be parsed, that is, the folder structure needs to comply with the DICOM standard. The function cycles through all folders present in the provided path, recursively goes through every subfolder, and extracts the DICOM header information from the files using the dcmread function of the pydicom package. The extensions of the files can be provided using the ext argument, as DICOM files may have extensions other than .dcm. These are converted to lower case before matching. Furthermore, it is advised to add a . before the extensions, as the given character patterns may be present elsewhere in the file names. Also, using the all boolean argument, you can specify whether the function provides output for each file or only for the first file, which is beneficial if you are analyzing multi-slice series, as all instances have almost the same header information. Furthermore, using the keywords argument you can manually specify which DICOM keywords you wish to extract. These need to be valid keywords specified in the DICOM standard.

#Create a database with DICOM header information
all_dicom_headers <- create_img_db(path = "/Users/Test/Data/DICOM/")
#Create a database with DICOM header information with additional file extensions
all_dicom_headers <- create_img_db(path = "/Users/Test/Data/DICOM/", ext = c(".dcm", ".DICOM"))
#Create a database with DICOM header information for only IDs and accession numbers
all_dicom_headers <- create_img_db(path = "/Users/Test/Data/DICOM/", keywords = c("PatientID", "AccessionNumber"))
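The advice about extensions can be illustrated with a small base R sketch. The file names are made up, and lowering both the file names and the extensions before comparing is an assumption about what "converted to lower case before matching" means; this is not the matching code used by create_img_db.

```r
# Made-up file names; the last one contains "dcm" but is not a DICOM file
files <- c("a/IM001.dcm", "a/IM002.DCM", "b/scan.dicom", "b/notes_dcm.txt")
ext   <- c(".dcm", ".DICOM")

# Compare in lower case and only at the end of the file name
ext_lower <- tolower(ext)
is_dicom  <- vapply(tolower(files),
                    function(f) any(endsWith(f, ext_lower)),
                    logical(1), USE.NAMES = FALSE)

dicom_files <- files[is_dicom]
```

Including the leading dot keeps a file such as notes_dcm.txt from being treated as an image just because the letters dcm appear in its name.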

Words of caution

Be aware that RPDR is a clinical research database that utilizes information present in the hospital information systems, therefore there can be anomalies and missing values in the exported datasources. Always conduct quality control of the data. There can be missing MRNs, different patients with identical MRNs, inability to locate individuals in the system, unstandardized examination results, multiple coding of diseases etc. These need to be tested and taken care of to end up with a clean database. Also be aware that parseRPDR was created with the best intention and knowledge of the creator and developer, but it still may have bugs. Therefore, the software does not offer any warranty of any sort and the author cannot be held liable for any claim.

Conclusions

parseRPDR is an R package specifically built to assist large scale analyses on RPDR outputs. Besides providing functions to load and clean datasets, it also provides functions for most commonly used data manipulations. All of its functions support parallelization and therefore provide fast and efficient analyses of RPDR data. The package is updated regularly with new functionalities. For support contact: mkolossvary@mgh.harvard.edu