When working with survey data, several strategies for cleaning and preparing the data are worth building into your routines and workflow. This vignette uses the CEOdata package to present several examples.

It primarily uses the bundled offline example version of the default BOP_presencial dataset.

Incorporate Tables and Figures

Once you have retrieved the survey data, it is easy to integrate them into your regular workflow. For instance, to get the overall number of males and females surveyed:

library(dplyr)
library(tidyr)
library(ggplot2)

d |>
  count(SEXE)

## # A tibble: 2 × 2
##   SEXE        n
##   <fct>   <int>
## 1 Masculí   441
## 2 Femení    459

Or to trace the proportion of females surveyed over time, across barometers:

d |>
  group_by(BOP_NUM) |>
  summarize(propFemales = length(which(SEXE == "Femení")) / n()) |>
  ggplot(aes(x = BOP_NUM, y = propFemales, group = 1)) +
  geom_point() +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  expand_limits(y = c(0, 1))

Proportion of females in the different Barometers.
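
The proportion above divides the number of matching rows by all rows in each barometer, so respondents with a missing value for SEXE still count in the denominator. The following is a minimal sketch on made-up data (the `toy` object and its values are hypothetical, not CEO data) that compares this with an alternative based on `mean()`, which uses only respondents with a recorded value:

```r
library(dplyr)

# Hypothetical toy data, not taken from the CEO barometers
toy <- tibble(
  BOP_NUM = c("A", "A", "A", "A", "B", "B", "B", "B"),
  SEXE = factor(c("Femení", "Masculí", "Femení", NA,
                  "Masculí", "Masculí", "Femení", "Femení"))
)

prop <- toy |>
  group_by(BOP_NUM) |>
  summarize(
    # Same computation as in the text: NA rows stay in the denominator
    propFemales = length(which(SEXE == "Femení")) / n(),
    # Alternative: share among respondents with a recorded value
    propFemalesValid = mean(SEXE == "Femení", na.rm = TRUE)
  )

prop
```

The two columns only differ when SEXE contains missing values, as in barometer "A" of the toy data.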

Topics (Tags)

The metadata can also be explored through the topics covered by each study, which the CEO reports as tags called “Descriptors”.

tags <- meta |>
  separate_rows(Descriptors, sep = ";") |>
  mutate(tag = factor(stringr::str_trim(Descriptors))) |>
  select(REO, tag)

tags |>
  group_by(tag) |>
  count() |>
  filter(n > 5) |>
  ggplot(aes(x = n, y = reorder(tag, n))) +
    geom_point() +
    ylab("Topic")

Prevalence of topics covered.
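
The tag extraction relies on splitting a semicolon-separated field into one row per tag and trimming stray whitespace. A minimal sketch of that step on made-up metadata (the `toy_meta` object, its study identifiers and its descriptors are hypothetical) could look like this; it uses base `trimws()` in place of `stringr::str_trim()`, which does the same job here:

```r
library(dplyr)
library(tidyr)

# Hypothetical metadata with semicolon-separated descriptors
toy_meta <- tibble(
  REO = c("1001", "1002", "1003"),
  Descriptors = c("Politics; Elections", "Elections;Media ", "Politics")
)

toy_tags <- toy_meta |>
  # One row per study and descriptor
  separate_rows(Descriptors, sep = ";") |>
  # Remove leading and trailing whitespace left by the split
  mutate(tag = factor(trimws(Descriptors))) |>
  select(REO, tag)

# Number of studies per tag, most frequent first
toy_tags |>
  count(tag, sort = TRUE)
```

Without the trimming step, " Elections" and "Elections" would be treated as two different tags.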

Fieldwork

The metadata also make it possible to examine the periods in which fieldwork took place for quantitative studies since 2018. In addition, we can distinguish between studies that provide microdata and those that do not.

meta |>
  filter(`Dia inici treball de camp` > "2018-01-01") |>
  ggplot(aes(xmin = `Dia inici treball de camp`,
             xmax = `Dia final treball de camp`,
             y = reorder(REO, `Dia final treball de camp`),
             color = microdata_available)) +
  geom_linerange() +
  xlab("Date") + ylab("Surveys with fieldwork") +
  theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())

Fieldwork periods.
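
The same start and end dates can also be summarised numerically. As a minimal sketch on made-up dates (the `toy_field` object and its columns `start` and `end` are hypothetical stand-ins for the fieldwork columns), the following keeps studies that started after 1 January 2018 and computes the length of each fieldwork period in days:

```r
library(dplyr)

# Hypothetical fieldwork periods, not real CEO studies
toy_field <- tibble(
  REO = c("900", "901", "902"),
  start = as.Date(c("2017-11-02", "2018-03-05", "2019-06-10")),
  end = as.Date(c("2017-11-20", "2018-03-25", "2019-07-01"))
)

recent <- toy_field |>
  # Compare against an explicit Date rather than a character string
  filter(start > as.Date("2018-01-01")) |>
  # Count both the first and the last day of fieldwork
  mutate(days = as.integer(end - start) + 1L)

recent
```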

Arrange and store

Once a dataset has been loaded, it is important to clean it, arrange it according to one’s own preferences, and store the result in an R object.

The following example processes several variables of the survey, selects them, and keeps the result in an object that can later be saved in workspace (RData) format.

survey.data <- d |>
  mutate(Female = ifelse(SEXE == "Femení", 1, 0),
         Age = EDAT,
         # Pass NA correctly
         Income = ifelse(INGRESSOS_1_15 %in% c("No ho sap", "No contesta"), 
                         NA,
                         INGRESSOS_1_15),
         Date = DATA_FIN,
         # Reorganize factor labels
         `Place of birth` = factor(case_when(
            LLOC_NAIX == "Catalunya" ~ "Catalonia",
            LLOC_NAIX %in% c("No ho sap", "No contesta") ~ as.character(NA),
            TRUE ~ "Outside Catalonia")),
         # Convert into numerical (integer)
         `Interest in politics` = case_when(
            INTERES_POL == "Gens" ~ 0L,
            INTERES_POL == "Poc" ~ 1L,
            INTERES_POL == "Bastant" ~ 2L,
            INTERES_POL == "Molt" ~ 3L,
            TRUE ~ as.integer(NA)),
         # Convert into numeric (double) and properly address missing values
         `Satisfaction with democracy` = ifelse(
            SATIS_DEMOCRACIA %in% c("No ho sap", "No contesta"),
            NA,
            as.numeric(SATIS_DEMOCRACIA))) |>
  # Center income to the median
  mutate(Income = Income - median(Income, na.rm = TRUE)) |>
  # Pick only specific variables
  select(Date, Female, Age, Income,
         `Place of birth`, `Interest in politics`, 
         `Satisfaction with democracy`)
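
The recoding above depends on how R converts factors to numbers: both `as.numeric()` applied to a factor and `ifelse()` returning values from a factor yield the integer position of each level, not its label. A minimal sketch with a made-up factor (the `satis` object and the order of its levels are assumptions for illustration, not the actual coding of the CEO variables):

```r
# Hypothetical survey item; labels and level order are made up
satis <- factor(
  c("Poc", "Molt", "No ho sap", "Gens", "Bastant"),
  levels = c("Gens", "Poc", "Bastant", "Molt", "No ho sap")
)

# as.numeric() on a factor returns the position of each level
codes <- as.numeric(satis)

# Non-answers are set to NA, the rest keep their level position
clean <- ifelse(satis %in% c("No ho sap", "No contesta"),
                NA,
                as.numeric(satis))

# ifelse() also falls back to level positions when handed the factor itself
implicit <- ifelse(satis == "No ho sap", NA, satis)

codes
clean
implicit
```

Because the resulting numbers follow the order of the levels, it is worth inspecting `levels()` of each variable before treating the codes as a meaningful scale.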

Finally, the result can be stored in R’s native format for further analysis, without the need to download and arrange the data again:

save(survey.data, file = "my_cleaned_dataset.RData")
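
To pick the analysis up later, the stored object has to be read back. A minimal sketch of the round trip, using a small made-up data frame as a stand-in for the cleaned survey and a temporary directory instead of the working directory (both are assumptions for illustration):

```r
# Hypothetical stand-in for the cleaned survey data
survey.data <- data.frame(Female = c(1, 0, 1), Age = c(34, 51, 68))

path <- file.path(tempdir(), "my_cleaned_dataset.RData")
save(survey.data, file = path)

# load() restores the objects under their original names
# and returns those names as a character vector
rm(survey.data)
restored <- load(path)
restored

# Alternative: saveRDS() stores a single object without its name,
# so it can be assigned to any name when read back
rds_path <- file.path(tempdir(), "my_cleaned_dataset.rds")
saveRDS(survey.data, rds_path)
survey.copy <- readRDS(rds_path)
```

`save()` and `load()` are convenient for restoring a workspace as it was, while `saveRDS()` and `readRDS()` avoid silently overwriting an existing object with the same name.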

Descriptive summary

Several packages construct convenient tables with the descriptive summary of a dataset. For example, the vtable package produces a table with descriptive statistics.

library(vtable)
st(survey.data)

Summary Statistics
Variable                       N    Mean  Std. Dev.  Min  Pctl. 25  Pctl. 75  Max
Female                         900  0     0          0    0         0         0
Age                            900  53    18         18   39        66        97
Income                         700  0.21  2.9        -8   -2        2         6
Place of birth                 900
… Catalonia                    639  71%
… Outside Catalonia            261  29%
Interest in politics           598  1.3   0.95       0    1         2         3
Satisfaction with democracy    880  2.9   0.75       1    2         3         4

Alternatively, the compareGroups package makes it possible to flexibly produce tables that compare descriptive statistics across groups of individuals.

library(compareGroups)
createTable(compareGroups(Female ~ . -Date, data = survey.data))

## 
## --------Summary descriptives table by 'Female'---------
## 
## _________________________________________________ 
##                                  0      p.overall 
##                                N=900              
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Edat                        52.5 (18.1)     .     
## Income                      0.21 (2.89)     .     
## Place of birth:                             .     
##     Catalonia               639 (71.0%)           
##     Outside Catalonia       261 (29.0%)           
## Interest in politics        1.35 (0.95)     .     
## Satisfaction with democracy 2.87 (0.75)     .     
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Working with survey data using the CEOdata package

Xavier Fernández-i-Marín

27/03/2026 - Version 1.4.0
