The SqueakR package is a centralized, experiment-centered interface for efficiently organizing and analyzing bioacoustics data exported from DeepSqueak. Among its functions, SqueakR can generate quick plots from vocalization data detected by DeepSqueak. On top of these visualizations, the package contains functions that combine all exported DeepSqueak data from multiple recordings into a single experimental object, allowing subsequent analysis to be conducted directly in R.
To use SqueakR, we first need to install the package. To install the CRAN version of the package, run the following function:
install.packages("SqueakR")
Next, load the package:
library(SqueakR)
To install the development version of the package from GitHub, run the code below. Ensure the devtools package is installed and loaded using library(devtools) before running the following:
library(devtools)
install_github("osimon81/SqueakR")
library(SqueakR)
Experiment Object
This package allows all data necessary for visualizations to be stored in a single object. To do so, the experiment object must first be created. To create an experiment, run the following code:
experiment <- create_experiment(experiment_name = "my_experiment")
#> Creating new experiment...
This code creates a new experiment object with the name my_experiment. When the object is saved, it will be saved using this name, along with a timestamp. At this point, there’s nothing stored in this experiment object except its name and a timestamp, but let’s inspect how experiment is structured:
str(experiment)
#> List of 6
#> $ name : chr "my_experiment"
#> $ last_saved : POSIXct[1:1], format: "2022-06-24 22:20:00"
#> $ groups : NULL
#> $ animals : NULL
#> $ experimenters : NULL
#> $ experimental_data: list()
We can see from the str() function that the experiment object has 6 main fields listed in it:
name: The name we just set for this experiment
last_saved: A timestamp for the last time this experiment was saved (in this case, this is the time the object was created)
groups: An empty variable which will show the experimental groups
animals: An empty variable which will show the distinct animals (IDs) tested
experimenters: An empty variable which will show the experimenters who collected data
experimental_data: An empty list which will store all of the raw and processed data for this experiment
Now that our experiment is created, we can start to add data to it.
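For readers newer to R, the structure printed by str() is an ordinary named list. The sketch below builds a stand-in with the same six fields by hand, purely to show how such fields can be read. It is not how create_experiment() is implemented, and toy_experiment is a hypothetical name.

```r
# Hand-built stand-in mirroring the six fields shown by str(experiment).
# Illustrative only: this is not SqueakR's create_experiment().
toy_experiment <- list(
  name = "my_experiment",
  last_saved = Sys.time(),
  groups = NULL,
  animals = NULL,
  experimenters = NULL,
  experimental_data = list()
)

# Fields are read with the $ operator.
toy_experiment$name
names(toy_experiment)

# No datasets have been added yet, so the data list is empty.
length(toy_experiment$experimental_data)
```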
When call data is exported from DeepSqueak, it includes all detected calls through the length of the recording. However, sometimes, we’re only interested in calls within a certain range of the entire recording. SqueakR has the add_timepoint_data() function to assist with this:
my_new_data <- add_timepoint_data(data_path = "../inst/extdata/Example_Mouse_Data.xlsx", t1 = 5, t2 = 25)
#> Adding call features Excel file to workspace...
#> Restricting data to range: 5 to 25 seconds...
The parameters for add_timepoint_data() are defined as follows:
data_path: The full path to the data file
t1: The timepoint (in seconds) at which calls will start being extracted into the my_new_data object
t2: The timepoint (in seconds) at which calls will stop being extracted into the object
In the context of the code above, we’ve just extracted all of the calls in the 5-25 second region. To confirm this, we can view the data we’ve extracted:
# The first few rows of the dataset
head(my_new_data)
# The last few rows of the dataset
tail(my_new_data)
If we inspect the Begin Time (s) column in the first table generated above, we’ll notice the first observation (row) represents a call that begins at ~5 seconds. Inspecting the End Time (s) column in the second table, the last call in the dataset ends at ~24 seconds, indicating that we’ve selected the 5-25 second region for calls.
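The windowing idea can be sketched in base R on a toy table. The helper subset_window() and the toy values are hypothetical, and the boundary rule used here (a call must begin at or after t1 and end at or before t2) is an assumption; add_timepoint_data() may treat calls that straddle a boundary differently.

```r
# Toy stand-in for a DeepSqueak export; column names match the raw data.
toy_calls <- data.frame(
  `Begin Time (s)` = c(1.2, 5.4, 12.0, 24.1, 30.3),
  `End Time (s)`   = c(1.3, 5.5, 12.1, 24.2, 30.4),
  check.names = FALSE
)

# Hypothetical helper: keep calls that lie entirely inside [t1, t2].
# Assumption: the real function may handle the boundaries differently.
subset_window <- function(calls, t1, t2) {
  keep <- calls[["Begin Time (s)"]] >= t1 & calls[["End Time (s)"]] <= t2
  calls[keep, , drop = FALSE]
}

windowed <- subset_window(toy_calls, t1 = 5, t2 = 25)
nrow(windowed)
range(windowed[["Begin Time (s)"]])
```

Checking the first Begin Time (s) and the last End Time (s) of the result mirrors the head() and tail() inspection described above.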
Once raw data is loaded into R, we can calculate some summary statistics on the region we’ve selected. To do this, run the following code:
my_scored_data <- score_timepoint_data(data_subset = my_new_data,
group = "Control",
id = "my_data.xlsx",
animal = "3330",
experimenter = "my_name")
#> Summarizing call features for datapoint...
str(my_scored_data)
#> List of 16
#> $ id : chr "my_data.xlsx"
#> $ animal : chr "3330"
#> $ group : chr "Control"
#> $ experimenter : chr "my_name"
#> $ calls_n : int 94
#> $ call_length :List of 3
#> ..$ mean : num 0.0437
#> ..$ standard_deviation: num 0.0228
#> ..$ range : num 0.116
#> $ delta_frequency :List of 3
#> ..$ mean : num 16.8
#> ..$ standard_deviation: num 5.6
#> ..$ range : num 30.5
#> $ high_frequency :List of 3
#> ..$ mean : num 65.6
#> ..$ standard_deviation: num 4.11
#> ..$ range : num 19.2
#> $ low_frequency :List of 3
#> ..$ mean : num 48.8
#> ..$ standard_deviation: num 4.54
#> ..$ range : num 26
#> $ peak_frequency :List of 3
#> ..$ mean : num 62.2
#> ..$ standard_deviation: num 5.55
#> ..$ range : num 22.6
#> $ power :List of 3
#> ..$ mean : num -71.9
#> ..$ standard_deviation: num 5.82
#> ..$ range : num 26.9
#> $ principal_frequency:List of 3
#> ..$ mean : num 61.9
#> ..$ standard_deviation: num 3.8
#> ..$ range : num 16.4
#> $ sinuosity :List of 3
#> ..$ mean : num 1.41
#> ..$ standard_deviation: num 0.245
#> ..$ range : num 1.14
#> $ slope :List of 3
#> ..$ mean : num 318
#> ..$ standard_deviation: num 200
#> ..$ range : num 937
#> $ tonality :List of 3
#> ..$ mean : num 0.457
#> ..$ standard_deviation: num 0.109
#> ..$ range : num 0.484
#> $ raw : tibble [94 × 17] (S3: tbl_df/tbl/data.frame)
#> ..$ ID : num [1:94] 12 13 14 15 16 17 18 19 20 21 ...
#> ..$ Label : chr [1:94] "18" "12" "26" "19" ...
#> ..$ Accepted : logi [1:94] TRUE TRUE TRUE TRUE TRUE TRUE ...
#> ..$ Score : num [1:94] 0.573 0.649 0.931 0.969 0.885 ...
#> ..$ Begin Time (s) : num [1:94] 5.36 6.35 6.46 6.59 6.73 ...
#> ..$ End Time (s) : num [1:94] 5.38 6.39 6.51 6.66 6.77 ...
#> ..$ Call Length (s) : num [1:94] 0.0152 0.0403 0.0488 0.0787 0.046 ...
#> ..$ Principal Frequency (kHz) : num [1:94] 60.4 66.6 68.7 66.5 66.6 ...
#> ..$ Low Freq (kHz) : num [1:94] 51.8 66.2 48.7 51.4 48.3 ...
#> ..$ High Freq (kHz) : num [1:94] 61.9 67.3 71.3 69.2 69.3 ...
#> ..$ Delta Freq (kHz) : num [1:94] 10.06 1.11 22.59 17.82 20.93 ...
#> ..$ Frequency Standard Deviation (kHz): num [1:94] 2.618 0.266 5.784 3.991 5.354 ...
#> ..$ Slope (kHz/s) : num [1:94] 534.76 -5.56 352.47 94.86 357.88 ...
#> ..$ Sinuosity : num [1:94] 1.25 1.15 1.66 1.19 1.2 ...
#> ..$ Mean Power (dB/Hz) : num [1:94] -80.3 -66.9 -65 -60.5 -62.9 ...
#> ..$ Tonality : num [1:94] 0.281 0.597 0.581 0.671 0.648 ...
#> ..$ Peak Freq (kHz) : num [1:94] 61.9 66.6 67.9 69.1 68.1 ...
Since there is a lot of data contained in this object, here is a summary of the structure.
The following variables are assigned to a single value:
animal: The animal (or testing group) ID, specified in the score_timepoint_data() function
id: The name of the original file corresponding to the dataset, which can be accessed using the unblinding functions discussed later
group: The experimental group, specified in the score_timepoint_data() function
experimenter: The experimenter who collected the data, specified in the score_timepoint_data() function
calls_n: The number of detected calls in the selected range (automatically calculated)
The rest of these variables are extracted from the my_new_data object and have three sub-variables stored under them (mean, standard_deviation, and range):
call_length
delta_frequency
high_frequency
low_frequency
peak_frequency
power
principal_frequency
sinuosity
slope
tonality
More information about the above variables can be found on the DeepSqueak wiki. Finally, the raw variable contains the entire extracted dataset (the my_new_data object), which can be referenced when plotting these data.
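As a rough illustration of the three sub-variables, the sketch below summarizes one feature in base R. The helper summarise_feature() is hypothetical, and treating range as maximum minus minimum is an assumption based on str() showing it as a single number; score_timepoint_data() itself may compute these values differently.

```r
# Hypothetical helper: the three summaries stored under each feature.
# Assumption: "range" is max minus min, since str() shows one number.
summarise_feature <- function(x) {
  list(
    mean = mean(x),
    standard_deviation = sd(x),
    range = diff(range(x))
  )
}

# First five call lengths (s), as displayed in the raw data above.
call_length <- c(0.0152, 0.0403, 0.0488, 0.0787, 0.0460)

toy_scored <- list(
  calls_n = length(call_length),
  call_length = summarise_feature(call_length)
)

toy_scored$calls_n
toy_scored$call_length$mean
toy_scored$call_length$range
```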
The data structure may appear complicated, but this will be useful later on to organize the data!
Now that our data is prepared, we can add it to the experiment object we created. To do so, we can use the add_to_experiment() function. The parameters for the function are the experiment object (which we’ve conveniently named experiment), and the data we want to add (which we just assigned to my_scored_data in the previous section).
experiment <- add_to_experiment(experiment = experiment, added_data = my_scored_data)
#> Adding summarized data to experiment object...
#> Updating experiment metadata...
If we want to remove a call dataset from the experiment, we can run the following function:
experiment <- remove_experiment_data(experiment, data_id = 1)
data_id corresponds to the index of data in the experiment object. For example, in the code above, the first dataset added to this experiment will be removed. This function also updates the experiment (i.e., it checks the remaining data for the groups and/or experimenters still present) and updates the experiment metadata if any groups were removed. Data can be indexed in the usual R style: to remove datasets 4 through 8 in an experiment, data_id can be set to 4:8.
Caution is advised when using this function, since removing data cannot be undone. However, the data can be restored by finding the original call data and adding it back to the experiment, or by creating a new experiment using the pipelines detailed below (if many datasets were accidentally removed, and it would take too long to add data manually).
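The index notation can be tried safely on a plain list before touching a real experiment. The sketch below uses base R negative indexing on hypothetical placeholder datasets; it illustrates how 4:8 expands, not how remove_experiment_data() is implemented.

```r
# Toy list of ten placeholder datasets (hypothetical names).
toy_data <- as.list(paste0("dataset_", 1:10))

# 4:8 expands to 4, 5, 6, 7, 8; negative indices drop those positions.
drop_ids <- 4:8
remaining <- toy_data[-drop_ids]

length(remaining)
unlist(remaining)
```

In this base R sketch the surviving elements shift down after removal (dataset_9 now sits at position 4), so it is worth re-checking indices before removing anything else.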
Now, our new data is added to our experiment object! As seen in the previous section, running str(experiment) can produce an unwieldy representation of our experiment, so this is discouraged when dealing with large experiments. As an alternative, we can run the following SqueakR function:
describe_experiment(experiment = experiment)
#> Experiment name: my_experiment
#> Last saved: 2022-06-24 22:20:00
#> Experimenter(s): my_name
#> Animal(s): 3330
#> Experimental group(s): Control
#> Total call datasets: 1
#> Data for Control: 1
This can be a great way to inspect the contents of our experiment, and it condenses the large list of parameters we set earlier. To clarify the last two lines of this output: “Total call datasets” shows the total number of call data sheets stored in the experiment (i.e. how many separately scored datasets were added to the experiment in total), and “Data for Control” indicates how many of these datasets are part of this particular experimental group. In this way, the function allows us to get a feel for how much of our data is part of any given experimental group, to ensure the data are balanced.
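The balance check amounts to counting datasets per group. A minimal base R sketch on toy data is shown here; the "Treatment" group and the extra animal IDs are hypothetical, and describe_experiment() may compute its counts differently.

```r
# Toy stand-in: one list per scored dataset, each with its group label.
toy_data <- list(
  list(group = "Control", animal = "3330"),
  list(group = "Control", animal = "3331"),
  list(group = "Treatment", animal = "3332")
)

# Pull the group label out of every dataset.
groups <- vapply(toy_data, function(d) d$group, character(1))

# Total number of datasets, then datasets per experimental group.
length(groups)
table(groups)
```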
The add_to_experiment() function runs another SqueakR function inside of it: update_experiment(). This function updates the “Experimenter(s)” and “Experimental group(s)” fields within our experiment object to reflect the data which is stored in it. In other words, when we add data that specifies a different experimental group or a different experimenter, the Experimenter(s) and Experimental group(s) metadata will auto-populate.
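One way to picture this auto-population is to rebuild the metadata fields from whatever datasets are currently stored. The helper refresh_metadata() below is hypothetical and works on a simplified toy list; it is a sketch of the idea, not the package's update_experiment().

```r
# Hypothetical helper: rebuild metadata fields from the stored datasets.
# Assumption: SqueakR's update_experiment() may work differently.
refresh_metadata <- function(experiment) {
  pull <- function(field) {
    unique(vapply(experiment$experimental_data,
                  function(d) d[[field]], character(1)))
  }
  experiment$groups <- pull("group")
  experiment$animals <- pull("animal")
  experiment$experimenters <- pull("experimenter")
  experiment
}

# Simplified toy experiment with two datasets (second one is hypothetical).
toy_experiment <- list(
  groups = NULL, animals = NULL, experimenters = NULL,
  experimental_data = list(
    list(group = "Control", animal = "3330", experimenter = "my_name"),
    list(group = "Treatment", animal = "3331", experimenter = "my_name")
  )
)

toy_experiment <- refresh_metadata(toy_experiment)
toy_experiment$groups
toy_experiment$experimenters
```

Because unique() is applied, an experimenter who contributed several datasets is listed once.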
For completeness, the function is run below (however, since the experiment was already updated through the add_to_experiment() function, it will not change any values). It looks through the data stored in the experiment object, and searches for new groups or new experimenters. These new groups or experimenters are added to the groups and experimenters variables stored in the experiment object.
Note: In practice, since update_experiment() is embedded in some SqueakR subfunctions, it will rarely need to be called directly (if at all). However, the function is available in the package, to allow experimenters to make sure they’ve updated their data.
experiment <- update_experiment(experiment = experiment)
#> Updating experiment metadata...
Finally, in order to save your experiment to a given location, you can run the following function. Simply change the value assigned to save_path to the full path of the directory you want to save the experiment to.
The file will be saved as an RData file, with the name “[experiment name] ([current date]).RData”.
Note: The values in square brackets will auto-populate, based on the name you set when creating the experiment and the current date. This saving convention using day-by-day timestamps ensures you never lose more than a day of progress if any critical data deletions occur.
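The naming convention can be sketched in base R as follows. The helper build_save_name() is hypothetical, the ISO date format is an assumption (SqueakR may format the date differently), and the round trip uses base R save() rather than the package's own saving function.

```r
# Hypothetical helper: "[experiment name] ([current date]).RData".
# Assumption: the date format SqueakR uses may differ from ISO dates.
build_save_name <- function(experiment_name, date = Sys.Date()) {
  paste0(experiment_name, " (", format(date, "%Y-%m-%d"), ").RData")
}

build_save_name("my_experiment", as.Date("2022-06-24"))

# Round trip through a temporary directory using base R save().
toy_experiment <- list(name = "my_experiment")
path <- file.path(tempdir(), build_save_name(toy_experiment$name))
save(toy_experiment, file = path)
file.exists(path)
```

Because the date is part of the file name, saving again on the same day overwrites that day's file, while earlier days remain available as fallbacks.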
When there are many datasets present in an experiment, it can be unwieldy to switch between these various functions to create an experiment and add data to it. SqueakR has interactive pipeline functions which enable data to be added either semi-automatically (with experimenter input of metadata for every file) or automatically (without experimenter input for metadata). The semi-automatic pipeline can be run using the following code:
Data can be added easily this way, since the user will be prompted for parameters like experimenter, group, etc. for each file. It can also be helpful to give the Excel files descriptive names (which may include information about experimenter name, experimental group, time range to subset, etc.) when using this function, to reduce the chance of mistakes while entering these data.
SqueakR also has a pipeline that can be used to automatically add data and metadata to an experiment without direct experimenter prompting. The pipeline references a Google Sheets document which contains the metadata for each dataset. The pipeline prompts the user to validate columns in the sheet (for example, to confirm which column holds the experimental groups), before adding data (which is stored in a specified local directory) to the new experiment. The advantage of the automatic pipeline is that all metadata can be entered in an external Google Sheets document, so if any mistakes are made in metadata entry, they can be easily corrected before running the pipeline.
On the other hand, if mistakes are made during metadata entry for the semi-automatic pipeline, the function will have to be stopped and re-run. Running the automatic pipeline is done the same way as the semi-automatic pipeline:
There are only a few requirements for the referenced Google Sheet (which SqueakR will use to grab metadata from):
Besides these requirements, the rows and columns in the sheet can follow whatever conventions the experimenter likes; files do not have to be in the same order as they appear in the folder containing the data. The pipeline will prompt the user for the link to the Sheets document, the actual sheet number which contains the metadata (usually 1, unless it is added to a separate document), and which column corresponds to each metadata field.
From this point, the pipeline will loop through every file automatically, assigning the appropriate metadata based on your Google Sheet, and export the created experiment object (if requested).
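The reason row order does not matter can be illustrated with a lookup by file name. In the sketch below, the sheet is replaced by a local data frame with hypothetical column names and values; the real pipeline reads a Google Sheet and may match files differently.

```r
# Toy stand-in for the metadata sheet (hypothetical column names),
# deliberately listed in a different order than the files on disk.
sheet <- data.frame(
  filename = c("b.xlsx", "c.xlsx", "a.xlsx"),
  group = c("Treatment", "Control", "Control"),
  experimenter = c("my_name", "my_name", "other_name"),
  stringsAsFactors = FALSE
)

# Files as they might appear in the data folder.
files <- c("a.xlsx", "b.xlsx", "c.xlsx")

# match() finds each file's row in the sheet, whatever the row order.
rows <- match(files, sheet$filename)
stopifnot(!anyNA(rows))  # every file needs a metadata row

metadata <- sheet[rows, ]
metadata$filename
metadata$group
```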
At the point of an experiment where we want to unblind ourselves to the anonymized datasets SqueakR has created, there are a few functions which can accomplish this:
unblind_all_ids(experiment)
#> [1] "my_data.xlsx"
Since we only have one dataset, this is the only set that displays in the list. If we had more, they would be arranged in the order that they appear in the experiment, allowing us to decode the anonymized datasets using the functions below. The first function allows us to find the corresponding dataset id for a filename:
unblind_data_id(experiment, "my_data.xlsx")
#> [1] 1
The next function allows us to do the opposite — to find the name of a particular anonymized dataset:
unblind_data_name(experiment, 1)
#> [1] "my_data.xlsx"
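Conceptually, unblinding is a two-way lookup between dataset positions and file names. The base R sketch below shows that lookup on a toy list; "other_data.xlsx" is a hypothetical second file, and this is not the implementation of the unblinding functions.

```r
# Toy stand-in: each dataset stores its original file name under "id".
toy_data <- list(
  list(id = "my_data.xlsx"),
  list(id = "other_data.xlsx")
)

# All file names, in the order the datasets were added.
ids <- vapply(toy_data, function(d) d$id, character(1))
ids

# File name -> dataset index.
match("other_data.xlsx", ids)

# Dataset index -> file name.
ids[1]
```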
The SqueakR package offers many tools for visualization of data, and these can be applied to the experiment object we created to inspect our data. This section of the document describes in detail what each of these visualizations looks like. In order to familiarize ourselves with the R syntax, and recap the structure of the experiment object, we can retrieve the raw data from the dataset we just added to our experiment by running the following:
experiment$experimental_data[1]$call_data$raw
$ Operator in SqueakR
The $ operator allows us to dive deeper into a list, and inspect values stored within that list. Using the code above, we access the raw data we put into the experiment by going from experiment -> experimental_data[1] -> call_data -> raw. Specifying the number in square brackets (i.e. experiment$experimental_data[1]) will locate the first dataset added, specifying experiment$experimental_data[2] will locate the second dataset, etc. Since we only added one dataset to the experiment, experiment$experimental_data[1] will lead us to that first set of data we added. From there, we navigate to call_data (which is a deliberately unremarkable variable name in order to blind the user to the data stored inside it), and finally the raw data.
Especially if you are new to R, this structure may appear complicated, but it allows our data to be much more organized and allows graphing to be more efficient. We’ll use this structure when locating the data used to graph our visualizations.
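The access pattern above can be reproduced on a toy list. The sketch assumes that every dataset is stored under the same name, call_data, as the access path suggests; the data frames and values are hypothetical.

```r
# Toy stand-in; assumes every entry is stored under the name "call_data".
toy_experiment <- list(
  experimental_data = list(
    call_data = list(raw = data.frame(x = 1:3)),
    call_data = list(raw = data.frame(x = 4:6))
  )
)

# Single brackets return a one-element list that keeps the name
# "call_data", which is why $call_data can follow [1].
first <- toy_experiment$experimental_data[1]$call_data$raw

# Double brackets return the element itself, so no name is needed.
same <- toy_experiment$experimental_data[[1]]$raw
identical(first, same)

# The second dataset is reached by changing only the index.
second <- toy_experiment$experimental_data[2]$call_data$raw
second$x
```

Both forms reach the same data frame in this sketch; the single-bracket form works only because each one-element sublist carries the name call_data.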
Note: For the following functions, the only required variable is the data_path variable, or the path towards our raw data. All other parameters (graph_title, graph_subtitle, etc.) are optional, since there are default titles and descriptions prepared in SqueakR.
We can plot an ethnogram to reveal the occurrence of a behavior (in our case, a call) over time, using the following function:
plotEthnogram(data_path = experiment$experimental_data[1]$call_data$raw)
We can also specify the graph title and subtitle, if we want to change them, by setting some optional parameters:
plotEthnogram(experiment$experimental_data[1]$call_data$raw,
graph_title = "My Ethnogram",
graph_subtitle = "This is the description I want instead!")
Tonality can be used as a proxy for the signal-to-noise ratio for a particular call. We can plot the same ethnogram, and split the detected calls according to tonality, using the following code:
plotEthnogramSplitByTonality(experiment$experimental_data[1]$call_data$raw,
graph_title = "My Tonality-Split Ethnogram")
We can also plot the call clusters (custom labels) in 3D to examine the density of calls as a function of principal frequency (kHz), call length (s), and mean power (dB/Hz) below:
plotClusters(experiment$experimental_data[1]$call_data$raw)
We can use the MASS package to provide 2-dimensional kernel density estimations on a given call dataset, and plot it as a surface against principal frequency and call length (two important metrics of calls):
plotSurface(experiment$experimental_data[1]$call_data$raw)
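To show what a 2-dimensional kernel density estimate looks like outside the package, the sketch below calls MASS::kde2d() on simulated values and draws a static base R surface. The simulated data, the grid size, and the use of persp() are illustrative choices; whether plotSurface() calls kde2d() with these settings is an assumption.

```r
library(MASS)  # a recommended package, shipped with standard R installs

# Simulated stand-ins for principal frequency (kHz) and call length (s).
set.seed(1)
principal_frequency <- rnorm(200, mean = 62, sd = 4)
call_length <- rexp(200, rate = 1 / 0.045)

# kde2d() evaluates a 2D kernel density estimate on an n-by-n grid.
dens <- kde2d(principal_frequency, call_length, n = 25)
dim(dens$z)

# Static base R perspective plot of the estimated density surface.
persp(dens$x, dens$y, dens$z,
      xlab = "Principal frequency (kHz)",
      ylab = "Call length (s)",
      zlab = "Density")
```

The returned list holds the grid coordinates in x and y and the density heights in z, which is the form that surface and contour plotting functions typically expect.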
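We can plot a similar, non-interactive form of the plot using the plotContours() function. The benefit of using a 2D contour plot is that its axes are flanked by histograms representing the distributions of call length and principal frequency. Entering the function name without parentheses prints its definition, as shown below; to draw the plot, call it with the raw data, i.e. plotContours(experiment$experimental_data[1]$call_data$raw).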
plotContours
#> function (data_path)
#> {
#> x <- data_path$`Principal Frequency (kHz)`
#> y <- data_path$`Call Length (s)`
#> s <- subplot(plot_ly(x = x, type = "histogram"), plotly_empty(type = "scatter",
#> mode = "markers"), plot_ly(x = x, y = y, type = "histogram2dcontour"),
#> plot_ly(y = y, type = "histogram"), nrows = 2, heights = c(0.2,
#> 0.8), widths = c(0.8, 0.2), margin = 0, shareX = TRUE,
#> shareY = TRUE, titleX = FALSE, titleY = FALSE)
#> fig <- layout(s, showlegend = FALSE, title = "2D Contour Plot of Principal Frequency (x) against Call Length (y)")
#> fig
#> }
#> <bytecode: 0x7fe5f26c07e8>
#> <environment: namespace:SqueakR>
In order to inspect the distributions of metadata (i.e. how much experimenters contributed to an experiment, or how many datasets an animal contributed to an experimental group), SqueakR has sunburst plotting functions which can be called, as shown below. These graphs are interactive — a group can be clicked to expand that subsection of the graph.
plotSunburstAnimals(experiment)
The same can be done for experimenter distributions:
plotSunburstExperimenters(experiment)
We can plot the frequency ranges of calls using the following function:
plotDensityStackedByFrequency(experiment$experimental_data[1]$call_data$raw)