---
title: "2 - Data Analysis"
author: "Marina Papadopoulou"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{2 - Data Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

`swaRverse` provides a pipeline to extract metrics of collective motion from grouping individuals trajectories. Metrics include either global (group-level) or pairwise (individual-level) characteristics of the group. After calculating the timeseries of these metrics, the package estimates their averages over each 'event' of collective motion. More details about how an event is defined is given below. Let's start with ..


## 2.1 Velocity estimations 

We start by adding headings and speeds to the trajectory data, and splitting the whole dataframe into a list of dataframes, one per set. For this, we need to specify whether the data correspond to geo data (lon-lat) or not. 
```{r message=FALSE, warning=FALSE}
library(swaRmverse)
#data_df <- trackdf::tracks
#raw$set <- c(rep('ctx1', nrow(raw)/2 ), rep('ctx2', nrow(raw)/2))
raw <- read.csv(system.file("extdata/video/01.csv", package = "trackdf"))
raw <- raw[!raw$ignore, ]

## Add fake context
raw$context <- c(rep("ctx1", nrow(raw) / 2), rep("ctx2", nrow(raw) / 2))

data_df <- set_data_format(raw_x = raw$x,
                          raw_y = raw$y,
                          raw_t = raw$frame,
                          raw_id = raw$id,
                          origin = "2020-02-1 12:00:21",
                          period = "0.04S",
                          tz = "America/New_York",
                          raw_context = raw$context
                          )

is_geo <- FALSE
data_dfs <- add_velocities(data_df,
                           geo = is_geo,
                           verbose = TRUE,
                           parallelize = FALSE
                           ) ## A list of dataframes
#head(data_dfs[[1]])
print(paste("Velocity information added for", length(data_dfs), "sets."))
```

If there is a high number of sets in the dataset, the parallelization of the function can be turned on (setting _parallelize_ argument to TRUE). This is not recommended for small to intermediate data sizes.

## 2.2 Group characteristics 

Based on the list of positional data and calculated velocities, we can now calculate the timeseries of group polarization, average speed, and shape. As a proxy for group shape we use the angle between the object-oriented bounding box that includes the position of all group members and the average heading of the group. Small angles close to 0 rads represent oblong groups, while large angles close to pi/2 rads wide groups. The _group_metrics_ function calculates the timeseries of each measurement across sets. To reduce noise, the function further calculates the smoothed timeseries of speed and polarization over a given time window (using a moving average).

```{r message=FALSE, warning=FALSE}

sampling_timestep <- 0.04
time_window <- 1 # seconds
smoothing_time_window <- time_window / sampling_timestep

g_metr <- group_metrics_per_set(data_list = data_dfs,
                               mov_av_time_window = smoothing_time_window,
                               step2time = sampling_timestep,
                               geo = is_geo,
                               parallelize = FALSE
                               )
summary(g_metr)

```

As before, one can parallelize the function if the data are from many different days/sets. A column of N and missing_ind are added to the dataframe, showing the group size of that time point and whether an individual has NA data.

## 2.3 Pairwise measurements  

From the timeseries of positions and velocities, we can calculate information concerning the nearest neighbor of each group member. Here we estimate the distance and the bearing angle (angle between the focal individual's heading and its neighbor) to the nearest neighbor of each individual. These, along with the id of the nearest neighbor, are added as columns to the positional timeseries dataframe:

```{r message=FALSE, warning=FALSE}

data_df <- pairwise_metrics(data_list = data_dfs,
                            geo = is_geo,
                            verbose = TRUE,
                            parallelize = FALSE,
                            add_coords = FALSE # could be set to TRUE if the relative positions of neighbors are needed 
                            )

#tail(data_df)
```

## 2.4 Metrics of collective motion

Based on the global and local measurements, we then calculate a series of metrics that aim to capture the dynamics of the collective motion of the group. These metrics are calculated over parts of the trajectories that the group is performing coordinated collective motion, when the group is moving (average speed is higher than a given threshold) and is somewhat polarized (polarization higher than a given threshold). These parts are defined as 'events'. The thresholds are asked by the user in run time if 'interactive_mode' is activated, after printing the quantiles of average speed and polarization across all data. Otherwise, the thresholds (pol_lim and speed_lim) should be given as inputs. If both limits are set to 0, a set will be taken as a complete event. The time between observation is needed as input to distinguish between continuous events. When the group and pairwise timeseries are calculated, one can calculate the metrics per event:

```{r message=FALSE, warning=FALSE}

### Interactive mode, if the limits of speed and polarization are unknown
# new_species_metrics <- col_motion_metrics(data_df,
#                                            global_metrics = g_metr,
#                                            step2time = sampling_timestep,
#                                            verbose = TRUE,
#                                            speed_lim = NA,
#                                            pol_lim = NA
#                                             
# )

new_species_metrics <- col_motion_metrics(data_df,
                                           global_metrics = g_metr,
                                           step2time = sampling_timestep,
                                           verbose = TRUE,
                                           speed_lim = 150,
                                           pol_lim = 0.3
)

# summary(new_species_metrics)

```

The number of events and their total duration given the input thresholds is also printed. If we are not interested in inspecting the timeseries of the measurements, on can calculate the metrics directly from the formatted dataset: 

```{r message=FALSE, warning=FALSE}

new_species_metrics <- col_motion_metrics_from_raw(data_df,
                                mov_av_time_window = smoothing_time_window,
                                step2time = sampling_timestep,
                                geo = is_geo,
                                verbose = TRUE,
                                speed_lim = 150,
                                pol_lim = 0.3,
                                parallelize_all = FALSE
                                )

# summary(new_species_metrics)

```

Since we are interested in comparing different datasets across species or contexts, a new species id column should be added:

```{r message=FALSE, warning=FALSE}

new_species_metrics$species <- "new_species_1"

head(new_species_metrics)

## Un-comment bellow to save the output in order to combine it with other datasets (replace 'path2file' with appropriate local path and name).
# write.csv(new_species_metrics, file = path2file.csv, row.names = FALSE) # OR R object
# save(new_species_metrics, file = path2file.rda) 

```

The duration, starting time and group size (N) of each event are also added to the result dataframe. We suggest filtering out events of very small duration and with less than 3 individuals (singletons and pairs). The calculated metrics are:

* _mean_mean_nnd_: the temporal average of the group's average nearest neighbor distance
* _mean_sd_nnd_: the temporal average of the group's standard deviation in nearest neighbor distance
* _sd_mean_nnd_: the temporal standard deviation of the group's average nearest neighbor distance     
* _mean_pol_: the average of the group's polarization during the event
* _sd_pol_: the standard deviation of the group's polarization during the event
* _cv_speed_: the CV (coefficient of variation) of the group's average speed during the event
* _mean_sd_front_: the average standard deviation of the individuals' frontness to their nearest neighbor during an event
* _mean_mean_bangl_: the temporal average of the group's average angle
* _mean_shape_: the average shape index of the group during an event (0= perfectly wide and 1= perfectly long relative to the heading direction)
* _sd_mean_shape_: the standard deviation of the group's average shape index during an event.