ABSTRACT
Zooplankton play critical roles in marine ecosystems, yet their fine-scale behavior remains poorly understood because of the difficulty in studying individuals in situ. Here, we combine biologging with supervised machine learning (ML) to propose a pipeline for studying in situ behavior of larger zooplankton such as jellyfish. We deployed the ITAG, a biologging package with high-resolution motion sensors designed for soft-bodied invertebrates, on eight Chrysaora fuscescens in Monterey Bay, using the tether method for retrieval. By analyzing simultaneous video footage of the tagged jellyfish, we developed ML methods to: (1) identify periods of tag data corrupted by the tether method, which may have compromised prior research findings, and (2) classify jellyfish behaviors. Our tools yield characterizations of fine-scale jellyfish activity and orientation over long durations, and we conclude that it is essential to develop behavioral classifiers on in situ rather than laboratory data.
INTRODUCTION
As oceans are altered by rising temperatures, acidification and other consequences of anthropogenic activity, understanding the behavioral patterns and responses of marine animals is required for effective stewardship. Researchers have made great strides in investigating marine megafauna behavior related to long-distance migrations (Block et al., 2011; Rasmussen et al., 2007; Sequeira et al., 2018) and foraging strategies (Sims et al., 2008; Weise et al., 2010). However, the behavior of more numerous, higher total-biomass, lower trophic-level animals such as zooplankton is much less well understood. Early attempts to investigate in situ behavior of zooplankton such as jellyfish relied on scuba divers following animals with hand-held video cameras (Colin and Costello, 2002; Costello et al., 1998) and, later, on remotely operated vehicles (ROVs; Kaartvedt et al., 2015; Purcell, 2009; Rife and Rock, 2003). Acoustic methods have also been used to describe large-scale movement patterns of jellyfish (Båmstedt et al., 2003; Kaartvedt et al., 2007; Klevjer et al., 2009), although these methods can be resolution-limited.
A promising alternative is biologging, where electronic transmitters or loggers with environmental and motion sensors are affixed to organisms (Kooyman, 2004; Rutz and Hays, 2009). Biologging has enabled a diverse array of marine vertebrate studies (Block et al., 2011; Goldbogen et al., 2006; Johnson and Tyack, 2003), but several technological challenges have hindered its widespread use on gelatinous invertebrates such as jellyfish. Their sensitivity to drag imposes constraints on tag size, shape and buoyancy (Fossette et al., 2016; Mills, 1984; Mooney et al., 2015); these constraints, coupled with band-limited transmission capabilities, often restrict sensor payloads to low-resolution depth or location pingers (Honda et al., 2009; Moriarty et al., 2012; Seymour et al., 2004). As a result, very few studies have successfully deployed high-resolution motion sensors such as accelerometers on jellyfish in situ (Fossette et al., 2015), and these studies adopted the ‘tether method’ for retrieval (Fossette et al., 2016; Hays et al., 2008), where the tag is tethered to a surface float transmitting location. As tethering can restrict movement, it is unknown whether data collected in this manner are representative of natural behavior. Furthermore, without simultaneous observation of the tagged animal, interpretation of biologging data can be easily biased (Brown et al., 2013; Jeantet et al., 2018).
Recently, techniques from supervised machine learning (ML), which automatically fit or ‘learn’ patterns that optimally distinguish categories, have been successfully used to classify behaviors in various marine vertebrates (Brewster et al., 2018; Jeantet et al., 2018; Ladds et al., 2016). However, few studies develop their methods on ground-truthed in situ data owing to the difficulty of recording sustained observations of wild marine animals (Carroll et al., 2014). It is unknown whether classifiers developed on data from captive, controlled or laboratory conditions are equally effective on data from natural environments (Carroll et al., 2014), an example of a broader problem in ML known as domain adaptation (Pan and Yang, 2010; Zhang et al., 2013).
In this study, we demonstrate how to investigate fine-scale zooplankton behavior in situ by combining biologging with supervised ML methods. We study the movements of the scyphomedusa Chrysaora fuscescens in Monterey Bay, CA, USA, using the ITAG, a biologging tag equipped with high-resolution motion sensors and engineered specifically for soft-bodied invertebrates (Mooney et al., 2015). We use the tether method for retrieval and simultaneously record video footage of the tagged animals. We develop classifiers using the resulting data to: (1) detect when the tether method influences jellyfish behavior, and (2) distinguish swimming from drifting. We provide principled estimates of the classifier error characteristics, which allow us to remove tether-influenced data and estimate the fine-scale in situ orientation and swimming activity of C. fuscescens individuals for up to 10 h. We also investigate how classifiers trained on laboratory data perform on in situ data. By combining a highly specialized tag with supervised ML, our approach is the first complete pipeline for acquiring and interpreting high-resolution motion data from individual jellyfish or other zooplankton in situ.
MATERIALS AND METHODS
Code and Jupyter Notebooks demonstrating how to reproduce our classifiers are available at https://bitbucket.org/mbari/jellymove.
Laboratory deployments
Laboratory investigations of jellyfish tagging were conducted at the Monterey Bay Aquarium Research Institute (MBARI) in Moss Landing, CA in the spring of 2018. Four jellyfish (Chrysaora fuscescens Brandt 1835) with bell diameters ranging from 16 to 25 cm were collected in Monterey Bay from RV Paragon (California Department of Fish and Wildlife permit SC-13337) and kept in plastic bags filled with unfiltered seawater. Within 4 h of collection, animals were transferred into large holding tanks in a 5°C cold room in MBARI's seawater lab. Experiments were conducted in a 275,000 gallon (1.25 million liter) test tank with dimensions of 13 m (L)×10 m (W)×10 m (D). Animals were transported from the seawater lab in plastic bags and placed in the test tank to acclimate for at least 1 h prior to tagging trials. After acclimation, a neutrally buoyant biologging tag (ITAG v0.4; Mooney et al., 2015) was prepared for attachment. The ITAG (6.3 cm×2.9 cm×1.6 cm, air weight 30 g) is equipped with a triaxial accelerometer, gyroscope and magnetometer synchronously sampling at a rate of 100 Hz, and pressure, temperature and light sensors sampling at 1 Hz. The tag was attached to the animal's aboral surface using veterinary-grade tissue adhesive (3M Vetbond, Maplewood, MN, USA), following the ‘glue method’ (Fossette et al., 2016). Care was taken to center the attachment site on the bell apex between the four gonads, so that the tag axis conventions aligned with the jellyfish, and the animal's radial symmetry was not disrupted (Fig. 1H). The entire attachment procedure took no longer than 2 min.
To replicate the in situ recovery strategy, the tags were attached by 6 m of monofilament line (20 lb/∼9 kg test) to a suspended walkway about 1 m above the tank surface (the tether length was set to prevent the animal from getting tangled with metal bars on the walls of the test tank). Simultaneous lateral-view video footage of the tagged jellyfish was collected with a HERO5 Black GoPro (GoPro, Inc., San Mateo, CA, USA) mounted onto a BlueROV2 (Blue Robotics, Torrance, CA, USA). Footage was synchronized with the tag data by sharply tapping the tag five times in front of the GoPro prior to attachment.
Field deployments
We deployed ITAGs on eight C. fuscescens in Monterey Bay, CA in late spring of 2018, and collected in situ recordings with durations between 54 min and 10 h. The bell diameters of these animals were between 20 and 28 cm. Fig. 1A–G depict the main phases of the deployment protocol. Each animal was first spotted from aboard the RV Paragon, which then maneuvered next to the animal so it could be gently captured and brought aboard using a plastic bucket. Captured jellyfish were then transferred into individual 27 gallon (∼125 l) plastic holding tubs filled with seawater (Fig. 1A), and care was taken to not introduce air bubbles under their bells. To recover tags at the end of the deployment, we used the ‘tether method’ (Fossette et al., 2016): tags were tethered by 30 m of monofilament line to the bottom of a drogue, which was attached by dock line to a surface drifter. A fishing swivel was placed at the midpoint of the tether, as well as immediately below its attachment to the drogue, to prevent tether torsion from affecting the animal. The surface drifter consisted of PVC housing for a SPOT GPS tracking device (SPOT LLC, Milpitas, CA, USA), and a PVC pipe chamber containing batteries and ballasting material. The SPOT was configured to report its coordinates once every 15–20 min via email. The tethered tags were then affixed to jellyfish while in the holding tubs, following the aforementioned glue method also used for laboratory deployments (Fig. 1B,C). The drogued drifter, ROV and tagged jellyfish were then released, in that order (Fig. 1D–F, respectively). The drogue was centered at a depth of up to 9 m, depending on the deployment (see Table S1), so the jellyfish could swim freely down to a depth of 30–39 m. We then piloted the ROV to track and record video footage of the tagged animal, until losing sight of it because of water turbidity and/or turbulence (see Movie 1 for example footage). Once visual contact with the tagged animal was lost, the ROV was recovered, and the tagged animals tethered to the drogued drifters were left behind.
Tags were retrieved the morning after deployment. The boat was navigated to the most recent coordinates reported by the SPOT, and once the drifter and drogue were located, the drifter, drogue and tag (with or without the animal still attached) were recovered. Data from the tag were then brought back to shore for analysis.
Orientation estimation
We defined axes conventions appropriate for the typical jellyfish swimming position (see Fig. 1H). Since we have control over the tag attachment site, we can ensure that the tag x-axis (surge direction) is orthogonal to the jellyfish bell at the apex, so no further data processing is necessary to align the tag and jellyfish axes.
In order to compute orientation from the accelerometer and magnetometer signals, we first used a finite impulse response filter to smooth the accelerometer and magnetometer data (Sato et al., 2003). The selected filter cut-off frequency of 0.2 Hz (cf. Martín López et al., 2016) lies below the typical range of pulse frequencies reported for Chrysaora (0.25–0.50 Hz; Matanoski et al., 2001). Filtering the accelerometer data separates the signal due to gravity (or static acceleration) from high-frequency animal-generated forces (or dynamic acceleration, DA; Wilson et al., 2006), which we later processed and featurized for behavioral classification. The resulting static acceleration was then combined with the smoothed magnetometer data to calculate orientation (i.e. Euler angles of heading, pitch and roll) at every point in time, according to trigonometric relationships (Johnson and Tyack, 2003). Based on our axes conventions in Fig. 1H, heading refers to compass bearing from true north, positive pitch means the jellyfish bell apex is tilted upward with respect to the horizon, and positive roll means the jellyfish bell is rotating around its apex counterclockwise, when viewed facing the bell.
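To make these steps concrete, the following minimal Python sketch (numpy and scipy assumed; function names are ours, and the exact signs depend on the axes conventions in Fig. 1H) separates static from dynamic acceleration and computes the Euler angles:

    import numpy as np
    from scipy.signal import firwin, filtfilt

    FS = 100  # motion sensor sampling rate (Hz)

    def smooth(signal, cutoff=0.2, numtaps=501):
        """Zero-phase low-pass FIR filtering of each column (samples x 3)."""
        taps = firwin(numtaps, cutoff, fs=FS)
        return filtfilt(taps, 1.0, signal, axis=0)

    def dynamic_acceleration(acc):
        """High-frequency, animal-generated component (DA), featurized later."""
        return acc - smooth(acc)

    def euler_angles(acc, mag):
        """Heading, pitch and roll (rad) from the static (gravity) component of
        acceleration and the smoothed magnetometer data."""
        ax, ay, az = smooth(acc).T
        mx, my, mz = smooth(mag).T
        pitch = np.arctan2(-ax, np.hypot(ay, az))
        roll = np.arctan2(ay, az)
        # Tilt-compensate the magnetometer before taking the compass bearing
        xh = mx * np.cos(pitch) + (my * np.sin(roll) + mz * np.cos(roll)) * np.sin(pitch)
        yh = my * np.cos(roll) - mz * np.sin(roll)
        heading = np.arctan2(-yh, xh) % (2 * np.pi)
        return heading, pitch, roll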
Annotation of video data
Laboratory and in situ video footage was manually annotated for jellyfish behavior and tether influence. Each second of footage was labeled according to whether the jellyfish was swimming or drifting (not actively pulsing its bell), and whether the tether was slack (i.e. when the animal was uninfluenced by tether tension) or taut (i.e. when the animal was influenced by tether tension). For in situ deployments, if the tether could not be clearly seen or was out of view because of turbidity and/or viewing angle (e.g. facing the subumbrella), the state of the tether was annotated as unknown. Similarly, if the jellyfish behavior could not be distinguished because of lighting or turbidity, the behavior was annotated as unknown. Any segments of footage where either the tether state or the jellyfish behavior was unknown were excluded from the training data for the methods we describe below.
Jellyfish behavior classification
When using the tether method as a tag retrieval strategy, prolonged deviations between the trajectories of the jellyfish and its tethered drifter can result in the tether pulling on the tag. These forces leave measurable signatures in the motion sensor data, which are distinct from the signals generated by the jellyfish's natural behavior. Our goal was to develop supervised ML methods to: (1) detect and remove segments of data corrupted by tether influence (tether influence classification), and (2) distinguish swimming from drifting on the remaining data (activity classification). In the following sections, we describe how these methods were developed.
Data preprocessing
In situ data were first split into annotated and unannotated pools. Annotated data were processed and featurized as described below, then set aside for model training and evaluation. Unannotated data were similarly processed and featurized, and then set aside for classification by the trained models. Laboratory data were completely annotated, since we were able to capture video footage of the entire deployment.
We used the following procedure to assemble data samples for each of the four categories annotated as described above: tether-influenced, uninfluenced, swimming and drifting. Upon visual inspection, the DA of every deployment displayed a nearly constant periodicity, consistent with the near-continuous bell pulsing observed in both laboratory and in situ video footage. We therefore computed the discrete cosine transform of the DA and took the frequency with the maximum absolute coefficient as the representative pulse frequency (RPF) for each deployment.
For each category, we extracted all segments of motion sensor data whose corresponding video footage was annotated with that category. Each segment, which consisted of 10 channels of data (pressure sensor and triaxial accelerometer, gyroscope and magnetometer), was then split into consecutive, non-overlapping windows with a duration equal to the representative swimming cycle length (the reciprocal of the RPF). Segments shorter than this duration, as well as trailing windows shorter than this duration at the ends of segments, were discarded from classification and analysis. Each of these windows, which we refer to as periods, was then featurized. Note that the period duration differs between deployments, to account for the pulse frequency of each animal.
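A minimal sketch of this preprocessing (function names are ours; we assume the triaxial DA is summarized by its norm before the transform):

    import numpy as np
    from scipy.fft import dct

    FS = 100  # Hz

    def representative_pulse_frequency(da):
        """RPF: the frequency of the largest-magnitude DCT coefficient of the
        dynamic acceleration (da: samples x 3, summarized here by its norm)."""
        x = np.linalg.norm(da, axis=1)
        coeffs = dct(x - x.mean(), norm='ortho')
        k = np.argmax(np.abs(coeffs[1:])) + 1  # skip the DC coefficient
        return k * FS / (2 * len(x))           # DCT-II bin k -> k*fs/(2N) Hz

    def split_into_periods(segment, rpf):
        """Consecutive, non-overlapping windows of one representative swimming
        cycle; any shorter trailing window is discarded."""
        n = int(round(FS / rpf))
        return [segment[i * n:(i + 1) * n] for i in range(len(segment) // n)]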
Featurization
For each period, we generated a total of 45 candidate features from the accelerometer and gyroscope. During training, we used a feature selection method to select a subset with the greatest predictive power, as described below. We computed various features of partial dynamic body acceleration (PDBA), which is the sum of the absolute values of the y- and z-axes of DA. PDBA is a variant of overall dynamic body acceleration (ODBA; Wilson et al., 2006), which is used extensively as a proxy for energetic input (Halsey et al., 2009; Wilson et al., 2006; Gleiss et al., 2011). By computing both PDBA and the absolute value of the x-axis of DA (DAx), we can separate energy expenditure in the direction of jellyfish propulsion from movements in the orthogonal plane (i.e. the x-axis from the y–z plane in Fig. 1H). To account for variation in propulsive force between individual jellyfish, we divided the PDBA and DAx by their respective averages over the deployment for each jellyfish. Analogous to PDBA and DAx, we also considered the norm of the y- and z-axes of the gyroscope data (or partial angular velocity, PAV) and the absolute value of the x-axis (AVx).
Accelerometer-based features included the maxima, means and standard deviations of DAx, PDBA, the absolute value of the x-axis of jerk, and the norm of the y- and z-axes of jerk. Spectral features were the sparsities of DAx and PDBA spectra (the absolute value of the Fourier transform), as measured by the Gini index (Hurley and Rickard, 2009; Zonoobi et al., 2011) and the spectral energies of DAx and PDBA in 0.2–1.0 Hz (roughly the typical range of pulse frequencies) and 1–8 Hz. We also included the spectral energy of DAx over 8 Hz but excluded it for PDBA because the two were too highly correlated, thereby leading to numerically unstable covariance matrix inversions in our model. The remaining features were the number of peaks in the DAx and PDBA, as identified by a peak-detection method (Duarte, 2013), the correlation between the y- and z-axes of DA, and the average of the correlations between the x- and y-axes and x- and z-axes of DA. The gyroscope-based features were completely analogous to the accelerometer-based features, substituting AVx, PAV and angular acceleration for DAx, PDBA and jerk, respectively, except excluding spectral energy of AVx over 8 Hz. For behavioral analysis, we also computed normalized ODBA by dividing the ODBA signal by the average ODBA over the deployment, to accommodate differences in propulsive strength between individual jellyfish. However, it was not included as a classification feature owing to redundancy with PDBA and DAx.
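To make the featurization concrete, the following sketch computes a representative subset of the accelerometer-based features for one period (scipy's find_peaks stands in for the peak-detection method of Duarte, 2013; the gyroscope-based features follow analogously):

    import numpy as np
    from scipy.signal import find_peaks

    FS = 100  # Hz

    def gini_index(spectrum):
        """Sparsity of a spectrum (Hurley and Rickard, 2009): 0 for a flat
        spectrum, approaching 1 when energy concentrates in few coefficients."""
        c = np.sort(np.abs(spectrum))
        n = len(c)
        k = np.arange(1, n + 1)
        return 1 - 2 * np.sum((c / c.sum()) * (n - k + 0.5) / n)

    def band_energy(x, lo, hi):
        """Spectral energy of x within the band [lo, hi) Hz."""
        spec = np.abs(np.fft.rfft(x - x.mean()))
        freqs = np.fft.rfftfreq(len(x), d=1.0 / FS)
        return np.sum(spec[(freqs >= lo) & (freqs < hi)] ** 2)

    def accelerometer_features(da, pdba_mean, dax_mean):
        """A few of the candidate features for one period of dynamic
        acceleration (da: samples x 3), normalized by deployment means."""
        pdba = (np.abs(da[:, 1]) + np.abs(da[:, 2])) / pdba_mean
        dax = np.abs(da[:, 0]) / dax_mean
        return {
            'dax_max': dax.max(), 'dax_mean': dax.mean(), 'dax_std': dax.std(),
            'pdba_max': pdba.max(), 'pdba_mean': pdba.mean(), 'pdba_std': pdba.std(),
            'dax_gini': gini_index(np.fft.rfft(dax - dax.mean())),
            'pdba_gini': gini_index(np.fft.rfft(pdba - pdba.mean())),
            'dax_energy_low': band_energy(dax, 0.2, 1.0),
            'dax_energy_mid': band_energy(dax, 1.0, 8.0),
            'dax_n_peaks': len(find_peaks(dax)[0]),
            'corr_yz': np.corrcoef(da[:, 1], da[:, 2])[0, 1],
        }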
Training data
The video footage showed that the nature of tether influence was fundamentally different between in situ and laboratory deployments. In the test tank, the jellyfish simply turned slightly whenever it reached the end of the tether, whereas tether influence in situ took the form of sharp yanking or prolonged dragging on the jellyfish. Since our end goal was to detect tether influence in situ, and the nature of in situ tether influence was not replicated in laboratory footage, we only trained and evaluated the tether influence classifier on in situ data.
For training the tether-influence classifier, the annotated in situ data yielded 325 s of tether-influenced behavior and 2825 s of uninfluenced behavior (or 83 and 1245 periods, respectively) across all deployments. For training the activity classifier, the annotated laboratory data yielded 366 s (68 periods) and 9201 s (3069 periods) of uninfluenced drifting and uninfluenced swimming behavior, respectively, and the annotated in situ data yielded 79 s (17 periods) and 2740 s (1228 periods) of uninfluenced drifting and uninfluenced swimming behavior, respectively. Since only 17 periods of uninfluenced in situ drifting were observed, we trained the activity classifier on the combined in situ and laboratory data (85 and 4297 periods of drifting and swimming, respectively) to sufficiently capture drifting behavior. To assess the value of incorporating in situ data for training, we also trained the classifier solely on the laboratory data.
Classification methods
Quadratic discriminant analysis
We used quadratic discriminant analysis (QDA; Hastie et al., 2009) as the classifier for both tasks. QDA models the selected features of each category with a multivariate Gaussian distribution, fitting a separate mean and covariance matrix per category, and outputs the posterior probability that a period belongs to a given category. A period is assigned to the minority category (tether-influenced or drifting) when this posterior probability exceeds a decision threshold, chosen as described below.
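A minimal sketch using scikit-learn's QDA implementation on hypothetical toy data (in our pipeline, the feature matrix comes from the featurization described above):

    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    # Hypothetical stand-ins for featurized periods: an imbalanced two-feature set
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (500, 2)),   # majority (e.g. swimming)
                   rng.normal(2.0, 0.5, (25, 2))])   # minority (e.g. drifting)
    y = np.repeat([0, 1], [500, 25])

    qda = QuadraticDiscriminantAnalysis().fit(X, y)  # one Gaussian per category
    posterior = qda.predict_proba(X)[:, 1]           # P(minority | features)
    prediction = posterior > 0.5                     # threshold set as described below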
Feature selection
There is often a large number of candidate features one can consider for a classifier. Principled methods for choosing an optimal subset of these features can help produce classifiers that perform better (due to the removal of noisy, irrelevant or redundant features), are faster and cheaper to use (since fewer features need to be measured and processed) and are more interpretable (Dash and Liu, 1997; Guyon and Elisseeff, 2003; Liu and Motoda, 1998). Under the broader umbrella of model selection, feature selection encourages finding the simplest model that explains the data, a principle that is critical for performance generalization (Hastie et al., 2009; MacKay, 2003). We first computed the list of 45 candidate features described above from accelerometer and gyroscope data. As part of training, we used a popular greedy heuristic known as sequential forward selection (SFS; Whitney, 1971), which starts with an empty subset of features and iteratively adds the next feature whose inclusion to the existing subset improves some evaluation metric the most. Despite its simplicity, SFS has been shown to match or outperform more complex search methods by being less prone to overfitting (Reunanen, 2003).
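For illustration, scikit-learn's built-in forward selection (version 1.1 or later assumed) can stand in for our SFS loop, with average precision approximating the AUPRC metric introduced in the next subsection and synthetic data in place of our 45 candidate features:

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.feature_selection import SequentialFeatureSelector

    # Hypothetical stand-in for the candidate features over annotated periods
    X, y = make_classification(n_samples=600, n_features=45, weights=[0.94],
                               random_state=0)

    sfs = SequentialFeatureSelector(
        QuadraticDiscriminantAnalysis(),
        direction='forward',            # start empty; greedily add features
        n_features_to_select='auto',
        tol=0.02,                       # stop when the score gain falls below 0.02
        scoring='average_precision',
        cv=5,
    )
    selected = sfs.fit(X, y).get_support(indices=True)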
Metric for feature selection
In choosing an evaluation metric for SFS, we observed that our video annotations showed highly skewed category distributions for both classification tasks: tether-influenced periods and drifting were observed far less often than uninfluenced periods and swimming, respectively. In this case, the common metric of accuracy loses meaning, since the accuracy of a simple majority decision rule (i.e. always predict the majority categories: uninfluenced and swimming) is high even though (1) the features are not considered and (2) all instances of the minority category are misclassified. Regardless of category imbalance, the evaluation metric should reflect how well a classifier extracts discriminating information from the features, and should account for the balance of false positives and false negatives on the minority category.
We use the precision–recall (PR) curve, which traces the trade-off between precision and recall as the decision threshold varies, and take the area under this curve (AUPRC) as the metric for feature selection; the AUPRC summarizes performance across all possible thresholds (Boyd et al., 2013; Richardson and Domingos, 2006). The AUPRC ranges from 0 to 1, where an ideal classifier that suffers no trade-off has an AUPRC of 1, and a classifier no better than random guessing has an expected AUPRC equal to the proportion of the positive category in the dataset. During feature selection, we terminated SFS when the inclusion of the next feature failed to improve the AUPRC by at least 0.02.
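A minimal sketch of the metric (scikit-learn assumed), with a sanity check of the random-guessing baseline:

    import numpy as np
    from sklearn.metrics import auc, precision_recall_curve

    def auprc(y_true, scores):
        """Area under the PR curve by trapezoidal integration; scikit-learn's
        average_precision_score is a common alternative estimator."""
        precision, recall, _ = precision_recall_curve(y_true, scores)
        return auc(recall, precision)

    # Random scores on a category with 6% prevalence give an AUPRC near 0.06
    rng = np.random.default_rng(0)
    y = rng.random(5000) < 0.06
    print(auprc(y, rng.random(5000)))  # ~0.06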
Classifier evaluation
Unbiased evaluation of a classifier's performance on unseen data requires complete separation of the data used in the training and evaluation phases. The standard way to evaluate classifier performance is with k-fold cross-validation (CV; Hastie et al., 2009; Kohavi, 1995), in which the annotated dataset is split equally into k parts. For each part, a classifier is trained on the remaining k−1 parts (the training set) and evaluated on the excluded part (the test set) using some evaluation metric, and the average of the resulting k evaluation scores (the CV score) is used as an estimate of the method's evaluation score on unseen data. Since we want to take full advantage of our annotated dataset for training, a final classifier can then be trained on the complete dataset and deployed for future predictions (Cawley and Talbot, 2010; Varma and Simon, 2006).
Note that during evaluation, the training phase must include all aspects of model selection, including feature selection and choosing hyperparameter values. However, these aspects are sometimes incorrectly treated as external to the training process: performing hyperparameter and/or feature selection on the complete dataset prior to CV can result in dramatic inflations of the CV score (Ambroise and McLachlan, 2002; Cawley and Talbot, 2010; Smialowski et al., 2010; Varma and Simon, 2006). To remove this selection bias, for each of the k iterations of CV, we performed feature selection solely on the training set using an ‘inner’ CV (Ambroise and McLachlan, 2002; Varma and Simon, 2006). That is, the training set was itself split evenly into k parts, and for each iteration of SFS, each candidate feature was evaluated by: (1) adding it to the current feature subset, (2) training QDA with those features and evaluating it on the k pairs of inner training and test sets, and (3) averaging those k evaluation scores. The feature with the best inner CV score is then selected. After SFS has terminated, we use the finalized feature subset to train and evaluate QDA on the outer CV training and test set. For both the inner and outer CV, we take k=5.
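The following sketch illustrates the nested scheme on synthetic data (scikit-learn assumed; its forward selector again stands in for our SFS loop):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(n_samples=600, n_features=45, weights=[0.94],
                               random_state=0)  # hypothetical featurized periods

    outer_scores = []
    for train, test in StratifiedKFold(n_splits=5, shuffle=True,
                                       random_state=0).split(X, y):
        # Feature selection sees only the outer training set (inner 5-fold CV)
        sfs = SequentialFeatureSelector(
            QuadraticDiscriminantAnalysis(), direction='forward',
            n_features_to_select='auto', tol=0.02,
            scoring='average_precision', cv=5)
        cols = sfs.fit(X[train], y[train]).get_support()
        clf = QuadraticDiscriminantAnalysis().fit(X[train][:, cols], y[train])
        scores = clf.predict_proba(X[test][:, cols])[:, 1]
        outer_scores.append(average_precision_score(y[test], scores))

    print(f'AUPRC: {np.mean(outer_scores):.3f} ± '
          f'{np.std(outer_scores) / np.sqrt(len(outer_scores)):.3f} (mean ± s.e.)')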
After the outer CV is complete, we report the mean±s.e. of the k AUPRC values. We then use the average of the k PR curves to choose a decision threshold (Fawcett, 2006). For the purposes of demonstrating our methods, we prioritize precision and recall equally, and simply choose the threshold value out of {0.1, 0.2, …, 0.9} that yields precision and recall values closest to each other. We call this the equal error rate threshold (EER; Duda et al., 2000) and report the CV precision, recall and accuracy for this classifier. For future studies that prioritize either precision or recall over the other, the researcher can use the average PR curve to pick a threshold that achieves the desired trade-off (Manning and Schütze, 1999; Duda et al., 2000; Fawcett, 2006).
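A minimal sketch of the EER heuristic (here applied to a single labeled set rather than the averaged CV PR curve; names are ours):

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    def eer_threshold(y_true, posteriors, grid=np.arange(0.1, 1.0, 0.1)):
        """Choose the decision threshold in {0.1, ..., 0.9} whose precision
        and recall are closest to each other."""
        gaps = [abs(precision_score(y_true, posteriors > t, zero_division=0)
                    - recall_score(y_true, posteriors > t))
                for t in grid]
        return grid[int(np.argmin(gaps))]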
Activity classifier baselines
To see if our featurization and feature selection approach improved activity classification beyond simpler alternatives, we trained and evaluated two baseline classifiers. The first baseline, which we refer to as ODBA thresholding, simply classifies a period as swimming if the mean normalized ODBA is above some decision threshold. Since ODBA is often used as a proxy for energetic expenditure, intuition would suggest it should be sufficient for discriminating swimming from drifting in a noiseless scenario. The second baseline follows our method but only uses accelerometry features, excluding features from the gyroscope data.
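Because the first baseline's only input is the mean normalized ODBA of a period, its PR curve can be traced by treating negated ODBA as the score for the minority drifting category, as in this sketch with hypothetical values:

    import numpy as np
    from sklearn.metrics import auc, precision_recall_curve

    rng = np.random.default_rng(0)
    is_drifting = rng.random(1000) < 0.02             # hypothetical labels
    mean_odba = np.where(is_drifting,
                         rng.normal(0.5, 0.2, 1000),  # drifting: lower ODBA
                         rng.normal(1.2, 0.3, 1000))  # swimming: higher ODBA
    precision, recall, _ = precision_recall_curve(is_drifting, -mean_odba)
    print('ODBA-thresholding baseline AUPRC:', auc(recall, precision))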
In situ behavior prediction
After training and evaluating the classifiers, we used them to predict tether influence and activity on the unannotated in situ data. After removing any periods classified as tether-influenced, we then classified each remaining uninfluenced period as swimming or drifting. These classifications provide estimates of: (1) how often the tether method interferes with the natural movements of jellyfish, and (2) how much time jellyfish spend swimming versus drifting in situ over long durations.
Orientation change
To assess change in orientation during swimming, we computed the difference in heading, pitch and roll angles between the start and end of each period. We converted these differences into a non-negative total angle of rotation (Diebel, 2006), which we refer to as orientation change over a period. We also used the circular mean and circular s.d. to compute the means±s.d. of heading, pitch and roll angles over periods. To avoid ill-defined heading and roll values due to gimbal lock, we excluded periods where the absolute pitch angle exceeded 70 deg (this removed 0.7% of total laboratory and in situ periods).
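A minimal sketch of these computations (scipy assumed; we take an intrinsic z-y-x rotation order, one of the conventions catalogued by Diebel, 2006):

    import numpy as np
    from scipy.spatial.transform import Rotation
    from scipy.stats import circmean

    def orientation_change(euler_start, euler_end):
        """Total rotation angle (deg) between two (heading, pitch, roll) triples."""
        r0 = Rotation.from_euler('ZYX', euler_start, degrees=True)
        r1 = Rotation.from_euler('ZYX', euler_end, degrees=True)
        return np.degrees((r1 * r0.inv()).magnitude())

    print(orientation_change([10, 0, 0], [40, 0, 0]))  # 30 deg heading change -> 30.0

    # Circular statistics avoid wrap-around artifacts in angle averages:
    headings = np.radians([350, 10, 20])
    print(np.degrees(circmean(headings)))  # ~6.7 deg, not the arithmetic 126.7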
Statistical tests
We ran several statistical tests on the annotated data to investigate potential distinctions between laboratory and in situ behavior, and between tether-influenced and uninfluenced behavior. For the following four tests, we used the nonparametric Mann–Whitney U-test to avoid any distributional assumptions on the quantities of interest, and because we expected large sample sizes (each period constitutes only a few seconds of data, so annotated periods are numerous). We pooled tether-influenced and uninfluenced periods across the in situ deployments, and tested whether either group tends to exhibit (1) greater normalized ODBA or (2) greater orientation change than the other. We also tested these two hypotheses between laboratory and in situ data, by pooling together uninfluenced periods across the laboratory deployments and across the in situ deployments.
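One of the four tests might look as follows (scipy assumed; the arrays are hypothetical stand-ins sized to match the annotated period counts reported above):

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    odba_influenced = rng.lognormal(0.7, 0.4, 83)       # 83 influenced periods
    odba_uninfluenced = rng.lognormal(0.25, 0.4, 1245)  # 1245 uninfluenced periods
    U, p = mannwhitneyu(odba_influenced, odba_uninfluenced,
                        alternative='two-sided')
    print(f'U = {U:.0f}, two-sided P = {p:.2g}')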
RESULTS AND DISCUSSION
Laboratory and in situ deployments
Fig. 2A shows drifter trajectories and timestamps for the eight in situ deployments in Monterey Bay over three separate days (see Table S1 for laboratory and in situ deployment details). Video footage was successfully captured for seven of these deployments, and annotated for activity and tether influence as summarized in Table S2 (see Movie 1 for examples). Drifting behavior was observed in five deployments, and ranged from 0.5% to 4.3% of the time. Tether influence was also observed in five deployments (0.35–28.6%).
Jellyfish behavior classification
Tether-influence classification
The tether-influence classifier had an AUPRC of 0.860±0.032 (mean±s.e.), and using the EER threshold had a precision of 86.1±5.3%, recall of 73.5±3.8% and accuracy of 97.6±0.4%. In order of selection by SFS, the features were: (1) spectral energy of DAx over 8 Hz, (2) mean PDBA, (3) number of peaks in the PDBA, and (4) mean DAx. SFS found no additional features that appreciably improved performance.
We used the tether-influence classifier to classify each unannotated in situ period as influenced or uninfluenced. Table S3 shows the proportion of each deployment classified as influenced, which ranged from 3.3% to 35.1%. To understand how the tether influenced in situ behavior, we evaluated how normalized ODBA and orientation change differed between annotated uninfluenced and influenced periods (Fig. 2B,C). The maximum normalized ODBA of influenced periods (median 2.05) tended to be larger than that of uninfluenced periods (median 1.27; Mann–Whitney U-test, two-sided P<0.0001). Similarly, orientation change tended to be greater during influenced periods (median 30.8 deg) than during uninfluenced periods (median 5.6 deg; Mann–Whitney U-test, two-sided P<0.0001). That is, jellyfish exhibited greater ODBA and more severe orientation changes when influenced by the tether.
Activity classification
The activity classifier had an AUPRC of 0.747±0.047 (mean±s.e.) and, using the EER threshold, had a precision of 74.3±4.9%, recall of 76.4±2.7% and accuracy of 99.0±0.1%. Fig. 2D shows the average PR curve used to identify the EER threshold. The features, in order of selection by SFS, were: (1) number of peaks in the PAV, (2) sparsity of the PAV spectrum and (3) sparsity of the PDBA spectrum.
Note that since drifting occupied only 1.9% of the annotated periods, a simple majority prediction rule has an accuracy of 98.1%. The other metrics therefore give more insight into whether the classifier actually learns discriminative information about the categories, rather than simply which category is more common. In comparison to our method, the baseline of ODBA thresholding had an AUPRC of 0.585±0.056 and, with the EER threshold, a precision of 68.9±9.3%, recall of 49.9±3.5% and accuracy of 98.6±0.1%. Training our classifier without gyroscope features and only with accelerometry features gave an AUPRC of 0.685±0.024, precision of 64.1±3.8%, recall of 63.7±4.7% and accuracy of 98.6±0.1% using the EER threshold.
We used our method to classify each unannotated in situ period as swimming or drifting, which provided estimates of how much time each jellyfish spent on each activity. We first removed periods predicted to be tether-influenced, so that our estimates are restricted to data representative of natural behavior. The proportion of uninfluenced time each jellyfish was classified as drifting ranged between 0% and 5.6% (Table S3), with the exception of deployment S1-1 (19.1%), which also experienced frequent tether influence (both annotated and predicted). We can then combine the outputs of the influence classifier, activity classifier and orientation estimation (again, restricted to periods predicted as uninfluenced) to visualize fine-scale information about in situ behavior over several hours (Fig. 3).
Classifier trained on in situ versus laboratory data
When trained and evaluated only on laboratory data, the activity classifier had an AUPRC of 0.894±0.067 (mean±s.e.) and, using the EER threshold, precision of 87.4±6.1%, recall of 90.3±4.4% and accuracy of 99.4±0.2%. However, predictions made by this classifier on the annotated in situ data had an accuracy of 96.3%, precision of 0% and recall of 0%. We emphasize that this means none of the periods classified as drifting were truly drifting and none of the drifting periods were correctly classified. Similarly, ODBA thresholding had an optimistic AUPRC of 0.865±0.047 (mean±s.e.), precision of 78.2±6.3%, recall of 81.1±4.0% and accuracy of 99.1±0.1% when CV was performed only on laboratory data. However, predictions on annotated in situ data had an accuracy of 90.8%, precision of 0% and recall of 0%.
In situ versus laboratory behavior
The maximum normalized ODBA of uninfluenced laboratory periods (median 1.98) tended to be greater than that of uninfluenced in situ periods (median 1.27; Mann–Whitney U-test, two-sided P<0.0001). Orientation change also tended to be greater during uninfluenced laboratory periods (median 11.6 deg) than during uninfluenced in situ periods (median 5.6 deg; Mann–Whitney U-test, two-sided P<0.0001). Note that the test tank walls did not cause this turning behavior, since the tether was too short for the animal to reach them.
Conclusions
Our work provides a pipeline for interpreting fine-scale in situ behavior of a zooplankton species (C. fuscescens) over long durations. Our approach of combining biologging with supervised ML methods yields records of in situ activity and orientation of individual jellyfish for several hours (up to 10 h so far), and may include the first successful in situ deployments of magnetometers and gyroscopes on jellyfish. Our activity classifier's estimates of in situ swimming activity over unannotated durations (on average 96.4% of the time; Table S3) are consistent with the prevalence of swimming in our annotated footage (on average 98.7% of behavior not annotated as unknown; Table S2). These long periods of sustained swimming with limited bouts of drifting are consistent with activity budget estimates of other oblate jellyfish (Colin et al., 2003; Costello et al., 1998), whose rowing mode of propulsion has been shown to be energy efficient (Dabiri et al., 2010; Gemmell et al., 2018). In spite of tether influence, uninfluenced periods of data also revealed that tagged animals underwent stereotypical vertical excursions (Fig. 3A; Hays et al., 2012). Although future studies of fine-scale zooplankton behavior would be best conducted with tetherless tag retrieval methods, our approach provides a reasonably precise solution for detecting and removing tether influence, which may otherwise compromise findings on in situ energetics and orientation (Fig. 2B,C; Fossette et al., 2015; Hays et al., 2008).
Our findings also highlight the importance of collecting in situ biologging data, rather than captive laboratory data, for developing behavioral classifiers. An assumption fundamental to deploying ML methods is that the data seen during training and inference are drawn from the same underlying distribution (Pan and Yang, 2010; Sugiyama et al., 2007; Zhang et al., 2013). However, classifiers for interpreting accelerometry data have been overwhelmingly trained and validated on laboratory data (Carroll et al., 2014). In doing so, these studies implicitly assume that behavioral data generated in the laboratory are distributionally similar to in situ behavioral data. Our findings suggest that this assumption has limited applicability, even for organisms displaying behaviors as simple as swimming and drifting. First, basic descriptive statistics differed significantly between laboratory and in situ data: jellyfish pulses induced greater orientation changes and greater ODBA in the laboratory than in situ. Second, the activity classifier trained and validated solely on laboratory data had optimistic estimates of precision and recall, but performed poorly, with zero precision and recall, when evaluated on in situ data. We highlight this as a cautionary tale against naively deploying ML classifiers developed on laboratory data in the field. As biologging moves forward, methods involving technologies that capture the behavioral ground truth of in situ data, such as camera tags, are strongly encouraged.
Our work also underscores the limitations of ODBA in characterizing even simple in situ behaviors. ODBA thresholding yielded zero precision and recall in classifying in situ swimming and drifting, but performed reasonably well when trained and evaluated on laboratory activity. This suggests that the standard way of computing ODBA may not be robust to dynamic and unpredictable sources of noise in in situ data (Shepard et al., 2008). Beyond accelerometry, our results also show that leveraging information from other sensors (e.g. gyroscope) can improve in situ behavioral classification considerably. Looking forward, our methods open the door to investigating more complex questions about fine-scale zooplankton behavior, such as how these species orient themselves in a current, whether they exhibit rolling behavior or lateral preferences (Fig. 3), and whether their behavioral patterns distinguish them from passive drifters.
Acknowledgements
We are grateful to Jared Figurski, Joost Daniels, Brett Hobson, Dale Graves, Chris Wahl, William Gough, James Fahlbusch, Jeremy Goldbogen, Rob Sherlock, Steve Haddock, Chad Kecy, Tom O'Reilly and Ivan Masmitja for guidance and assistance in the field, and to David Cade for calibration and orientation estimation tools.
Footnotes
Author contributions
Conceptualization: C.F., T.A.M., K.A.S., K.K.; Methodology: C.F., K.K.; Software: C.F.; Validation: C.F.; Formal analysis: C.F.; Investigation: C.F., K.K.; Resources: C.F., T.A.M., S.C., D.M., K.A.S., K.K.; Data curation: C.F.; Writing - original draft: C.F., K.K.; Writing - review & editing: C.F., T.A.M., S.C., K.K.; Visualization: C.F., K.K.; Supervision: K.K.; Project administration: K.K.; Funding acquisition: T.A.M., K.A.S., K.K.
Funding
This work was supported by the David and Lucile Packard Foundation (to K.K.), the Woods Hole Oceanographic Institution (WHOI) Green Innovation Award (to T.A.M., K.K. and K.A.S.) and National Science Foundation (NSF) DBI collaborative awards (1455593 to T.A.M. and K.A.S.; 1455501 to K.K.). Deposited in PMC for immediate release.
Data availability
Code and Jupyter Notebooks to reproduce our classifiers are available at https://bitbucket.org/mbari/jellymove.
Competing interests
David Mann is owner of Loggerhead Instruments which manufactures the ITAG.