Descriptive

Introduction

Objectives

Prerequisite

Demo data: counts of drugs, adrs, case characteristics

Step 0: Load packages

library(vigicaen)
library(rlang)
library(dplyr)

Step 1: Load datasets and add drug and adr columns

This vignette uses the preloaded datasets (and a spurious suspdup table).

demo     <- demo_
adr      <- adr_
drug     <- drug_
link     <- link_
out      <- out_
followup <- followup_

srce     <- srce_

thg      <- thg_
mp       <- mp_
meddra   <- meddra_
smq_list <- smq_list_
smq_content <- smq_content_

suspdup <- 
  data.table::data.table(
    UMCReportId = 1,
    SuspectedduplicateReportId = NA
  )

And preloaded drug and adr dictionaries.

d_drecno <- ex_$d_drecno

a_llt <- ex_$a_llt

demo <-
  demo |>
  add_drug(
    d_code = d_drecno,
    drug_data = drug
  )
#> ℹ `.data` detected as `demo` table.

demo <-
  demo |>
  add_adr(
    a_code = a_llt,
    adr_data = adr
  )
#> ℹ `.data` detected as `demo` table.

As we aim to describe drug and adr counts, but also other variables (age, sex, type of reporter), they will be added too.

You can still refer to

# Age, sex

demo <-
  demo |>
  mutate(
    age = cut(as.integer(AgeGroup),
              breaks = c(0,4,5,6,7,8),
              include.lowest = TRUE, right = TRUE,
              labels = c("<18", "18-45","45-64", "65-74", "75+")),

    sex = case_when(Gender == "1" ~ 1,
                    Gender == "2" ~ 2,
                    Gender %in% c("-","0","9") ~ NA_real_,
                    TRUE ~ NA_real_)
  )

# Death + outcome availability

demo <- 
  demo |> 
  mutate(death = 
           ifelse(UMCReportId %in% out$UMCReportId,
                  UMCReportId %in% 
                    (out |> 
                    filter(Seriousness == "1") |> 
                    pull(UMCReportId)
                    ),
                  NA)
         )

# follow-up, seriousness

demo <-
  demo |>
  mutate(
    fup = if_else(UMCReportId %in% followup$UMCReportId, 1, 0),
    serious = 
      ifelse(
        UMCReportId %in% out$UMCReportId,
        UMCReportId %in% 
          (out |> 
          filter(Serious == "Y") |> 
          pull(UMCReportId)
          ),
        NA)
  )

# year

demo <- 
  demo |> 
  mutate(
    year = as.numeric(substr(FirstDateDatabase, start = 1, stop = 4))
    )

# type of reporter

demo <-
  demo |>
  left_join(
    srce |> transmute(UMCReportId, type_reporter = Type),
    by = "UMCReportId")

desc_facvar()

desc_facvar() generates a summary of categorical variables with 2 or more levels.

Its .data argument is a dataset to describe. Described variables should be passed to vf, as a character vector.

Multi-level variables

Let’s take the demo dataset as an example, with variable “age”.

desc_facvar(
  .data = demo,
  vf = "age"
)
#> # A tibble: 5 × 4
#>   var   level value           n_avail
#>   <chr> <chr> <chr>             <int>
#> 1 age   <18   " 1/499 (0%) "      499
#> 2 age   18-45 "43/499 (9%) "      499
#> 3 age   45-64 "173/499 (35%)"     499
#> 4 age   65-74 "174/499 (35%)"     499
#> 5 age   75+   "108/499 (22%)"     499

The output format is a data.frame, of class tibble.

The first column, var, contains the name of the variable of interest. The second column, level, contains the level of the variable.

In this example, the first line shows the number of patients whose age variable (var) is “<18”, i.e. patients under 18 years old.

The percentage appears in the value column, after the count of cases and the total number of reports for which the information is available.

This number of reports with available information is recalled in the n_avail column.
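To make the reading of the value column concrete, the sketch below recomputes its three ingredients (count, number of reports with available information, percentage) in base R. It is an illustration only, not the internal code of desc_facvar(); the vector age_example and its values are made up.

```r
# Hypothetical data: five reports, one of them with a missing age
age_example <- factor(
  c("<18", "18-45", "18-45", NA, "75+"),
  levels = c("<18", "18-45", "75+")
)

n_avail <- sum(!is.na(age_example))       # reports with available information
n       <- as.vector(table(age_example))  # count per level (NA is excluded)
pc      <- round(100 * n / n_avail)       # percentage among available reports

summary_example <- data.frame(
  level   = levels(age_example),
  value   = paste0(n, "/", n_avail, " (", pc, "%)"),
  n_avail = n_avail
)

summary_example
```

The denominator is the number of non-missing values, which is why n_avail can be smaller than the number of rows in the dataset.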

Binary variables

What happens when the variable has only two levels, for example 1 and 0, as is often the case for the drug and adr variables?

desc_facvar(
  .data = demo,
  vf = "nivolumab"
)
#> # A tibble: 2 × 4
#>   var       level value         n_avail
#>   <chr>     <chr> <chr>           <int>
#> 1 nivolumab 0     525/750 (70%)     750
#> 2 nivolumab 1     225/750 (30%)     750

The output format is unchanged, with a data.frame as output.

The reading is unchanged: we get the count of cases of the variable nivolumab, by its two levels. There are thus 225 patients exposed to nivolumab, out of 750 reports in total, which represents 30% of patients.

Conversely, 525 reports do not mention nivolumab.

In general, when presenting the results, the level 0 of binary variables provides little information and can be omitted.

Logical variables

Let’s continue with another example on the “seriousness” status.

desc_facvar(
  .data = demo,
  vf = "serious"
)
#> # A tibble: 2 × 4
#>   var     level value         n_avail
#>   <chr>   <chr> <chr>           <int>
#> 1 serious FALSE 181/747 (24%)     747
#> 2 serious TRUE  566/747 (76%)     747

The “serious” variable takes the values TRUE/FALSE, and not 1/0, but it is interpreted in the same way (it is only an artifact of construction).

Thus, 566 cases are considered serious, out of 747 where the information is available.

Exporting raw values

To feed plotting or other formatting functions, you can export the raw counts and percentages as extra columns with the export_raw_values argument.

desc_facvar(
  .data = demo,
  vf = "nivolumab",
  export_raw_values = TRUE
)
#> # A tibble: 2 × 6
#>   var       level value         n_avail     n    pc
#>   <chr>     <chr> <chr>           <int> <int> <dbl>
#> 1 nivolumab 0     525/750 (70%)     750   525    70
#> 2 nivolumab 1     225/750 (30%)     750   225    30

Grouping several levels of a variable

What if the available categories do not match our final needs?

In the example on age, there is only one patient under 18 years old, and few patients under 45 years old. We would like to group all this data into a single line for a summary.

The solution is to create the variable with the desired levels upstream, in a data management step.

demo <-
  demo |>
  mutate(
    age2 = cut(as.integer(AgeGroup),
              breaks = c(0, 6, 7, 8),
              include.lowest = TRUE, right = TRUE,
              labels = c("<64", "65-74", "75+"))
  )


desc_facvar(
  demo,
  vf = "age2"
)
#> # A tibble: 3 × 4
#>   var   level value         n_avail
#>   <chr> <chr> <chr>           <int>
#> 1 age2  <64   217/499 (43%)     499
#> 2 age2  65-74 174/499 (35%)     499
#> 3 age2  75+   108/499 (22%)     499

The same is true for columns like “year”.

When describing the “year” column, it is common to get an error message:

desc_facvar(
  .data = demo,
  vf = "year"
)
#> Error in `desc_facvar()`:
#> ! Too many levels detected in: year
#> ✖ Number of levels: 13 exceeded `ncat_max`(10)
#> ℹ Did you pass a continuous variable to `desc_facvar()`?
#> → Set `ncat_max` to suppress this error.

The error message “Too many levels detected in: year” is intentional: it prevents continuous variables from being passed to the vf argument by mistake.

The maximum number of categories that can be taken by a variable treated by desc_facvar is controlled by the ncat_max argument.

If a variable has more than ncat_max different levels, the function stops.

We can therefore solve this problem by adjusting the value of this parameter.

desc_facvar(
  .data = demo,
  vf = "year",
  ncat_max = 20
)
#> # A tibble: 13 × 4
#>    var   level value           n_avail
#>    <chr> <chr> <chr>             <int>
#>  1 year  2011  " 1/750 (0%) "      750
#>  2 year  2012  " 1/750 (0%) "      750
#>  3 year  2013  " 2/750 (0%) "      750
#>  4 year  2014  "10/750 (1%) "      750
#>  5 year  2015  " 8/750 (1%) "      750
#>  6 year  2016  "15/750 (2%) "      750
#>  7 year  2017  "116/750 (15%)"     750
#>  8 year  2018  "150/750 (20%)"     750
#>  9 year  2019  "116/750 (15%)"     750
#> 10 year  2020  "72/750 (10%)"      750
#> 11 year  2021  "99/750 (13%)"      750
#> 12 year  2022  "119/750 (16%)"     750
#> 13 year  2023  "41/750 (5%) "      750

This lets you review the main reporting years, but the result does not transfer well to the final table of a manuscript. Categorizing the reporting years may be more informative.
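One way to categorize years is cut(), as was done for age. The sketch below uses made-up years and arbitrary period boundaries chosen only for illustration; neither comes from the demo dataset.

```r
# Hypothetical reporting years
year_example <- c(2011, 2014, 2016, 2017, 2018, 2019, 2020, 2022, 2023)

# Arbitrary periods, for illustration only.
# With right = TRUE, each interval includes its upper bound.
year_cat <- cut(
  year_example,
  breaks = c(-Inf, 2016, 2019, Inf),
  right  = TRUE,
  labels = c("2016 or before", "2017-2019", "2020 or after")
)

table(year_cat)
```

A column built this way has only three levels, so it stays under the ncat_max limit and could be passed to vf like any other categorical variable.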

Explicit categorical variables

Levels of some variables are indicated by numbers.

desc_facvar(
  .data = demo,
  vf = "Region"
)
#> # A tibble: 6 × 4
#>   var    level value           n_avail
#>   <chr>  <chr> <chr>             <int>
#> 1 Region 1     " 1/750 (0%) "      750
#> 2 Region 2     "389/750 (52%)"     750
#> 3 Region 3     "17/750 (2%) "      750
#> 4 Region 4     "276/750 (37%)"     750
#> 5 Region 5     " 6/750 (1%) "      750
#> 6 Region 6     "61/750 (8%) "      750

We know that 389 cases come from Region “2”, without being able to say which geographical area this region belongs to.

The correspondence is given in external tables, found among the subsidiary tables of VigiBase. Here is the one for Region:

Code  Label
1     African Region
2     Region of the Americas
3     South-East Asia Region
4     European Region
5     Eastern Mediterranean Region
6     Western Pacific Region

There are several ways to bring this information directly into demo. The simplest is to use factors:

demo <-
  demo |> 
  mutate(
    Region = factor(Region, levels = c("1", "2", "3", "4", "5", "6"))
  )

levels(demo$Region) <-
  c("African Region",                                    
    "Region of the Americas",                            
    "South-East Asia Region",                            
    "European Region",                                   
    "Eastern Mediterranean Region",                      
    "Western Pacific Region"  
  )

Note that the transformation takes two steps: the first sets the order of the levels, the second assigns a label to each level. This sequence is necessary so that each label lands on the intended code, rather than relying on a default ordering of the levels.
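As an alternative sketch, base R's factor() accepts levels and labels in a single call, which keeps each code and its label paired by position. The vector region_codes below is made up for illustration.

```r
# Hypothetical region codes, stored as character
region_codes <- c("2", "4", "2", "6", "1")

region_labels <- c(
  "African Region",
  "Region of the Americas",
  "South-East Asia Region",
  "European Region",
  "Eastern Mediterranean Region",
  "Western Pacific Region"
)

# levels gives the order of the codes, labels renames them position by position
region_factor <- factor(
  region_codes,
  levels = as.character(1:6),
  labels = region_labels
)

levels(region_factor)
as.character(region_factor)
```

Levels that do not occur in the data (here codes 3 and 5) are still kept in levels(), so the ordering is stable across subsets.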

This transformation has the effect of modifying the result of desc_facvar()

desc_facvar(
  .data = demo,
  vf = "Region"
)
#> # A tibble: 6 × 4
#>   var    level                        value           n_avail
#>   <chr>  <chr>                        <chr>             <int>
#> 1 Region African Region               " 1/750 (0%) "      750
#> 2 Region Region of the Americas       "389/750 (52%)"     750
#> 3 Region South-East Asia Region       "17/750 (2%) "      750
#> 4 Region European Region              "276/750 (37%)"     750
#> 5 Region Eastern Mediterranean Region " 6/750 (1%) "      750
#> 6 Region Western Pacific Region       "61/750 (8%) "      750

The two other variables mainly affected by this phenomenon are Type and type_reporter. The transformation code is found in vignette("template_main.R")

Other arguments of desc_facvar()

Three other arguments control the output format of the results.

  1. format is a character string that must contain the placeholders n_, N_ and pc_.

This argument customizes the way the result is displayed. For example, to put the percentage in brackets instead of parentheses:

desc_facvar(
  .data = demo,
  vf = "nivolumab",
  format = "n_/N_ [pc_%]"
)
#> # A tibble: 2 × 4
#>   var       level value         n_avail
#>   <chr>     <chr> <chr>           <int>
#> 1 nivolumab 0     525/750 [70%]     750
#> 2 nivolumab 1     225/750 [30%]     750

You can also change all other elements of this argument.

  2. pad_width centers the results within a character string of the given width. If you have particularly large numbers, you can increase the value of this parameter so that your results remain well centered.

  3. digits controls the number of digits after the decimal point for the percentage. Note that rounded percentages are not guaranteed to sum to exactly 100%.

desc_facvar(
  .data = demo,
  vf = "nivolumab",
  digits = 1
)
#> # A tibble: 2 × 4
#>   var       level value           n_avail
#>   <chr>     <chr> <chr>             <int>
#> 1 nivolumab 0     525/750 (70.0%)     750
#> 2 nivolumab 1     225/750 (30.0%)     750
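The remark about percentages not summing to exactly 100% is a general effect of rounding, not something specific to desc_facvar(). A minimal base R illustration with made-up counts:

```r
# Hypothetical counts: three equally frequent levels
counts <- c(1, 1, 1)

# Each percentage is rounded on its own, to one decimal place
pc <- round(100 * counts / sum(counts), digits = 1)

pc       # each level is shown as 33.3
sum(pc)  # the rounded values add up to 99.9, not 100
```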

Drug data: drug screening

Adr data: adr screening and evolution of adverse events

screen_drug() lets you screen the most frequently reported drugs in a drug dataset, sorted by frequency.

screen_drug(drug, mp_data = mp, top_n = 5)
#> # A tibble: 5 × 4
#>   `Drug name`      DrecNo     N percentage
#>   <chr>             <int> <int>      <dbl>
#> 1 pembrolizumab  20116296   298      39.7 
#> 2 nivolumab     111841511   225      30   
#> 3 ipilimumab    133138448    86      11.5 
#> 4 atezolizumab  112765189    69       9.2 
#> 5 durvalumab    125456180    68       9.07

Most of the time, you will have filtered the drug data upstream, with some add_* function, to focus on a subset of cases (of a specific drug, adr, or any set of these).

For example, identify colitis cases and screen the drugs reported in these cases.

drug |> 
  add_adr(
    a_llt,
    adr_data = adr
  ) |> 
  filter(a_colitis == 1) |> 
  screen_drug(
    mp_data = mp, top_n = 5
  )
#> ℹ `.data` detected as `drug` table.
#> # A tibble: 5 × 4
#>   `Drug name`      DrecNo     N percentage
#>   <chr>             <int> <int>      <dbl>
#> 1 nivolumab     111841511    44       42.3
#> 2 pembrolizumab  20116296    40       38.5
#> 3 ipilimumab    133138448    20       19.2
#> 4 <NA>           73636724    14       13.5
#> 5 <NA>           34178924    13       12.5

Adr screening

screen_adr() lets you screen the most frequent reactions reported in an adr dataset, sorted by frequency.

screen_adr(adr_, meddra = meddra_)
#>                                               term     n percentage
#>                                             <char> <int>      <num>
#> 1:                                            <NA>   678 90.4000000
#> 2: Respiratory, thoracic and mediastinal disorders   110 14.6666667
#> 3:                      Gastrointestinal disorders   104 13.8666667
#> 4:                              Vascular disorders     9  1.2000000
#> 5:                         Immune system disorders     6  0.8000000
#> 6:                         Hepatobiliary disorders     5  0.6666667
#> 7:          Skin and subcutaneous tissue disorders     1  0.1333333

Other MedDRA term levels can be selected with the term_level argument.

Most of the time, you will have filtered the adr data upstream, with some add_* function, allowing to focus on a subset of cases (of a specific drug, adr, or any set of these).

Outcome

The adr table contains information on the evolution of adverse events.

The possible outcomes (column Outcome) are

The adr structure is as follows:

UMCReportId  Adr_Id  Outcome
1            a_1     1
1            a_2     2
2            a_3     3
2            a_4     1

A case, identified by its UMCReportId, may have several adverse events (Adr_Id) with different outcomes. Summarizing this information requires prioritization.

The logic is as follows: take the “worst evolution” possible for each event of each case, in order to count each event only once for each case.
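The prioritization rule can be sketched in base R on the small structure shown above. This is not the code of desc_outcome(): the severity ranking below is an assumption made up for the example (a higher rank is treated as worse), and every row of a case is assumed to belong to the same event of interest.

```r
# Toy data reproducing the adr structure shown above
adr_example <- data.frame(
  UMCReportId = c(1, 1, 2, 2),
  Adr_Id      = c("a_1", "a_2", "a_3", "a_4"),
  Outcome     = c(1, 2, 3, 1)
)

# ASSUMPTION: hypothetical ranking, higher = worse.
# The real ordering of outcome codes is not taken from the package.
severity_rank <- c("1" = 1, "2" = 2, "3" = 3)

adr_example$rank <- unname(severity_rank[as.character(adr_example$Outcome)])

# One row per case: keep the highest-ranked (worst) outcome
worst <- aggregate(rank ~ UMCReportId, data = adr_example, FUN = max)

worst
```

Each case now contributes a single row, so counting rows per outcome counts each case only once.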

To filter cases according to drug exposure, the drug data must first be joined to the adr table.

Step 1: Data management of adr with add_drug and add_adr

add_drug() and add_adr() can be used on adr data.


adr <-
  adr |>
  add_drug(
    d_code = d_drecno,
    drug_data = drug
  )
#> ℹ `.data` detected as `adr` table.

adr <-
  adr |>
  add_adr(
    a_code = a_llt,
    adr_data = adr
  )
#> ℹ `.data` detected as `adr` table.

This identifies the drugs and adverse events of interest in the adr table.

Drugs are identified at the case level in this table.

Step 2: desc_outcome() function

The desc_outcome function prioritizes data according to the rule:

Take the “worst evolution” possible for each event of each case, in order to count each event only once for each case.

adr |> 
  desc_outcome(
    drug_s = "nivolumab",
    adr_s = "a_colitis"
  )
#> # A tibble: 5 × 4
#>   drug_s    adr_s     n_cas out_label                 
#>   <chr>     <chr>     <int> <chr>                     
#> 1 nivolumab a_colitis    10 Unknown                   
#> 2 nivolumab a_colitis    25 Recovered/resolved        
#> 3 nivolumab a_colitis     6 Recovering/resolving      
#> 4 nivolumab a_colitis     1 Not recovered/not resolved
#> 5 nivolumab a_colitis     2 Fatal

In the case where adr was previously filtered to contain only data of a specific adverse drug reaction (for example, with tb_subset()), it is still preferable to recreate the drug column with add_drug (it will take the value 1 for all cases).