Introduction

Natural populations change in size and composition, propelling the dynamics of ecological communities, species interactions, and energy flow through the ecosystem (Odum and Barrett 1971). At the heart of these changes are individual animals being born, growing, behaving, and dying. Individual-based data provide the raw material to investigate the mechanics and dynamics of these natural populations, their ecological and behavioural interactions and their evolution (Coulson 2020), and are particularly necessary in longitudinal studies (Clutton-Brock and Sheldon 2010). A deep understanding of these patterns and processes in animal ecology therefore requires identifying and tracking individual animals over time and space (Coulson 2020).

The available invasive and non-invasive methods for sampling individual animals present trade-offs in the accuracy, content and quality of the data they provide. Invasive methods require capturing animals to mark them (e.g. with collars, tattoos, tags, freeze branding; Silvy et al. 2005) or to fit tracking devices (RFID, GPS, acoustic, satellite tags; e.g. Krause et al. 2013), but provide detailed information about the individuals (e.g. identity, location, behaviour, health). Actively capturing and marking animals, however, can be unfeasible, expensive or disruptive to natural behaviour or physiology (Walker et al. 2012). By contrast, non-invasive identification methods, such as photographic, acoustic and video recording (Karczmarski et al. 2022a, b), rely on systematic comparison of natural marks or behaviours (e.g. Karanth and Nichols 1998; Muller et al. 2018; Longden et al. 2020) to track individuals from a distance (e.g. Clapham et al. 2020; Ferreira et al. 2020). Although efficient in providing individual identities, non-invasive methods generally provide less information on other biological variables (but see Toms et al. 2020), which has motivated the simultaneous use of other multimedia sampling platforms, such as video (e.g. Raoult et al. 2018; Francisco et al. 2020; Landeo-Yauri et al. 2020) and audio recordings (Cheng et al. 2012; Erbe et al. 2020). Novel technologies for identifying and tracking individuals using such multimedia data are becoming increasingly precise in the lab or in captivity (e.g. Mersch et al. 2013; Dell et al. 2014; Pérez-Escudero et al. 2014; Alarcón-Nieto et al. 2018; Graving et al. 2019; Marks et al. 2021), but doing so in situ remains more challenging (e.g. Ferreira et al. 2020; Guo et al. 2020). In the field, where animals are not spatially constrained, recording data from multiple sampling platforms simultaneously, or syncing large volumes of data a posteriori to link them with individual identifications, can be troublesome.

In wild mammal research, cetacean studies exemplify the continuous development of non-invasive individual identification methods based on multimedia data. Photo-identification has been the go-to technique to recognize individual whales and dolphins over the last five decades (e.g. Würsig and Würsig 1977; Katona and Whitehead 1981; Hammond et al. 1990). Since whales and dolphins can range over large areas and spend long periods underwater, photo-identification has been increasingly coupled with other multimedia sampling to detect the presence of individuals and/or describe their behavioural patterns. For instance, while cameras and acoustic sampling provide invaluable underwater perspectives, the growing market of unmanned aerial vehicles (drones) has popularized the recording of behaviour, movement and health of cetaceans from an overhead view (e.g. Torres et al. 2018; Gray et al. 2019; Hartman et al. 2020). With few exceptions, however, these sampling techniques do not provide individual identities (but see, e.g., identification from overhead images: Payne et al. 1983; Durban et al. 2015; or from acoustic signals: Janik and Sayigh 2013). Combining traditional photo-identification sampling with hydrophones, underwater and drone cameras can resolve this limitation, but it inevitably creates another one: individual behavioural tracking from multiple platforms generates a large, multidimensional dataset that rapidly becomes unfeasible to handle manually. These technological advances have therefore produced a need for corresponding advances in computational tools to organize and process multiple data streams (e.g. Schneider et al. 2019).

Here, we introduce a free and open computational tool for aligning, linking and syncing photo-identification data with other multimedia data of free-ranging vertebrates. The R package MAMMals (Managing Animal MultiMedia: Align, Link, Sync) contains functions to synchronize different multimedia data streams a posteriori and so facilitate their post-processing to measure relevant biological and behavioural data. Using MAMMals, one can (i) extract, organize and line up the metadata of photographs, videos, audios, drone flight logs and any other timestamped text data; (ii) select, trim and export clips or stills of the footage or audio recordings containing individual photo-identification; and (iii) wrangle, convert and plot data from cameras, drones, hydrophones, microphones and other timestamped data sources. In what follows, we describe the workflow for pre-processing individual photo-identification data and linking them to other multimedia data (Fig. 1). Next, we illustrate the utility of these tools by applying them to process and analyse empirical data on the foraging behaviour of coastal bottlenose dolphins. We conclude by discussing the caveats of our approach and how future work can alleviate them.

Fig. 1
figure 1

The MAMMals workflow to align, link and sync multimedia and timestamped text data. a The inputs are files commonly produced in individual identification and behavioural sampling methods, such as images (.jpg, .tiff, .png), videos (.mov, .mp4), audios (.wav, .mp3) and/or text files (.csv, .txt, .srt). After aligning, linking and syncing the inputs, the outputs can be text files with metadata and/or synced image, audio and/or video files. The minimum requirements for the MAMMals workflow are the photo-identification data (i.e. the image files associated with individual identification text data) and at least one more multimedia data source, such as videos, audios or text files. b The first step is to extract the metadata of all multimedia files (and flight logs, if available, or from captions in .srt files of commercially available drones). One can also export the metadata for posterior processing, such as attributing individual IDs to each photo processed by the getPhotoMetadata function, or assign pre-processed individual identification data directly within the getPhotoMetadata function. c The second step is to align the metadata of photographs (or timestamped field notes) with that of the other media to automatically select the video or audio files containing individual photo-identification data. d The third step is to link the selected media by clipping the videos and audios around the information of interest (e.g. photo-identified individuals) to facilitate the post-processing of videos (getVideoClip), audios (getAudioClip) or stills from the video (getVideoFrame). If sampling includes drone videos, the selected media can be linked to information from the flight, such as latitude, longitude and altitude. e The final step is to sync media and/or text by subsetting only the time intersection between data coming from different sampling platforms. The synced multimedia and text data can be exported as a single merged file or multiple separate files

Workflow overview: coupling photo-identification with other multimedia data

The MAMMals R package targets the challenge of coupling large volumes of observational and multimedia data to traditional techniques of identifying individuals, thereby extending the possibilities for studies that use focal-animal and focal-group sampling methods (Altmann 1974). The minimum requirements are image files with assigned individual identification and at least one other multimedia data source. The workflow follows four steps (Fig. 1): (i) extracting the metadata of photographs and any other multimedia files available; (ii) aligning the metadata of these files to select the useful multimedia containing photo-identified individuals; (iii) linking these selected files by clipping the multimedia containing photo-identified individuals; and (iv) syncing media and text data around their time intersection. We detail each step of the MAMMals workflow in the next sections, and provide instructions and examples of the input and output files in an online tutorial (https://mammals-rpackage.netlify.app/index.html). The MAMMals R package can be installed from the online repository (instructions at https://bitbucket.org/maucantor/mammals/). It depends on the R environment (R Core Team 2021) and key R packages such as lubridate (Grolemund and Wickham 2011) to manage date-time formats (for the full list of dependencies, see the package repository), as well as on the external software ExifTool (https://exiftool.org) to extract the metadata of media files and FFmpeg (https://ffmpeg.org) to clip video and audio files.
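As a minimal installation sketch (assuming the package can be installed from Bitbucket with the remotes package; the instructions in the repository take precedence over this example), the environment could be set up as follows:

```r
# Install MAMMals from the Bitbucket repository (assumed route; see the
# repository instructions at https://bitbucket.org/maucantor/mammals/)
install.packages("remotes")
remotes::install_bitbucket("maucantor/mammals")

library(mammals)    # package name assumed to match the repository
library(lubridate)  # key dependency for date-time handling

# ExifTool (https://exiftool.org) and FFmpeg (https://ffmpeg.org) must be
# installed on the system and available on the PATH, since MAMMals calls
# them to extract media metadata and to clip video and audio files.
```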

To align, link and sync multiple data sources, the MAMMals workflow relies on timestamped files: essentially, the recording times of the multiple sampling devices are extracted from the metadata of the media files and lined up. Therefore, the most important recommendations for field sampling are to synchronize the clocks of all collection platforms and to keep the original metadata of the media files unaltered. For accurate results, we recommend that the clocks of cameras, drones, audio recorders and auxiliary equipment (such as cell phones used to pilot the drone or tablet apps used to record observation data) be adjusted to the maximum precision possible (either from GPS satellite time or manually), and always double-checked and fine-tuned before each sampling occasion to account for clock drift. For example, one can photograph and film a reference clock prior to sampling, or use audio or visual signals during sampling (e.g. a camera flash, as in our case study detailed below), to offset time differences across images and videos.
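For instance, if a reference flash reveals that the camera clock runs a few seconds ahead of the drone clock, the offset can be removed from the photograph timestamps before alignment. A minimal sketch with lubridate, assuming illustrative column names and an illustrative 3-s offset:

```r
library(lubridate)

# Hypothetical photograph metadata (column names are illustrative)
photo_meta <- data.frame(
  file     = c("6Q1A8164.JPG", "6Q1A8165.JPG"),
  datetime = c("2019-07-15 09:31:02", "2019-07-15 09:31:05")
)

# Suppose the camera clock was found to be 3 s ahead of the drone clock,
# e.g. by comparing a reference flash visible in both data streams
camera_offset <- seconds(-3)

# Parse the timestamps and apply the offset (time zone is illustrative)
photo_meta$datetime <- ymd_hms(photo_meta$datetime,
                               tz = "America/Sao_Paulo") + camera_offset
```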

When photographing animals for individual identification using natural marks, we recommend following the protocols for collecting, processing and organizing such data, which have been extensively detailed elsewhere (e.g. Speed et al. 2007; Urián et al. 2015). We highlight that using DSLR cameras equipped with GPS and digital compass can be useful when teasing apart photo-identified individuals in the field, especially when tracking them with overhead videos. For instance, when tracking multiple individuals or groups distributed in space, one can assign the photographs taken to each group recorded in the overhead footage by interpreting the GPS coordinates and shooting angle extracted from the photograph metadata. After the photographic data sampling, we recommend first processing the photo-identification data and organizing it in a plain text data frame, in which the first column contains the photograph file name and extension (e.g. ‘6Q1A8164.JPG’), and the second contains the individual (alphanumeric) identification code (e.g. ‘ID1248’).
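A minimal sketch of this two-column table (file names and identification codes are illustrative):

```r
# Photo-identification table: photo file name and individual ID code
photo_ids <- data.frame(
  photo = c("6Q1A8164.JPG", "6Q1A8190.JPG", "6Q1A8231.JPG"),
  id    = c("ID1248", "ID1248", "ID0077")
)

# Saving as plain text avoids the date/precision issues of spreadsheets
write.csv(photo_ids, "photo_ids.csv", row.names = FALSE)
```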

When recording audio, we recommend using recorders that produce timestamped files. Otherwise, one can manually check the end time of recordings after sampling and rename the files accordingly with date and time. When recording videos from small drones (e.g. DJI Phantom, DJI Mavic Pro, DJI Inspire, Splash Drone) while simultaneously collecting photo-identification or audio recordings, we recommend keeping a constant flight height and pointing the camera straight down (i.e. drone and camera pitch = −90°, roll = 0°) to ensure the centre of the frame matches the coordinates recorded by the drone GPS and to reduce distortion in any measures taken from the drone footage. If measuring animals from the drone footage using photogrammetry, there are additional requirements. Besides camera tilt, the aircraft altitude data are the main source of error for precise and unbiased photogrammetric measurements. Off-the-shelf drones record the altitude relative to the aircraft's take-off position (the "home point"). Hence, if the aircraft takes off from the deck of a ship or from higher ground, the zero of the aircraft's barometer does not match the sea level. To mitigate this, an object of known length can be used to calibrate a scale (details in Burnett et al. 2019). Another solution is to couple a LiDAR sensor to the drone (e.g. Dawson et al. 2017) to precisely measure the distance from the aircraft to the sea surface. Correcting lens distortion and calibrating the camera also reduce errors in measurement estimates (see Dawson et al. 2017).

Step 1: extracting metadata of multimedia files

After conducting photo-identification as per standard protocols, the first step in the MAMMals workflow is to extract the metadata of all multimedia files (Fig. 1b) and organize them into a text database, such as an R data frame. We suggest allocating each media type to a separate subfolder within the root folder of the project, then using the following functions to read and organize the metadata into a data frame in which the number of rows equals the number of files and each column corresponds to the available metadata. To extract the metadata of the photographs, point the function getPhotoMetadata to the subfolder with the image files; it handles many common image extensions (e.g. .jpg, .tiff, .png) and accesses the available metadata of each photograph (at least the date and time, but also the camera GPS coordinates and shooting angle, if available). The getPhotoMetadata function also assigns the individual ID to the full metadata of the photographs, by matching the file names with those in the simple two-column data frame containing the photo file name and the individual identification code. While we recommend having the individual identification ready prior to linking and syncing with the other multimedia files, we highlight that, alternatively, one can also perform the photo-identification afterwards. In this case, the getPhotoMetadata function can be used to export the metadata of photographs to common text files (e.g. .csv or .txt), and individual IDs can then be assigned to the database using any text editor or spreadsheet software (e.g. Microsoft Excel, Apple Numbers). Bear in mind, however, that issues with date and time formats and precision are common when using spreadsheet software; thus, we suggest using plain text editors to avoid losing precision when aligning, linking or syncing the photo-identification to the multimedia data.
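A sketch of this first step is shown below; the argument names are assumptions for illustration only, and the online tutorial documents the actual interface:

```r
library(mammals)

# Extract photograph metadata and attach individual IDs from the
# two-column photo-identification table (argument names are assumed)
photo_meta <- getPhotoMetadata(path = "photos/", ids = photo_ids)

# Alternatively, export the metadata and assign the IDs afterwards in a
# plain text editor
write.csv(photo_meta, "photo_metadata.csv", row.names = FALSE)
```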

For the audio subfolder, use getAudioMetadata to extract the metadata of audio files (at least duration, initial and final time). If the audio files do not contain date and time in their metadata, the initial and final times of the recordings can be extracted from file names automatically generated with date-time stamps, as exported by commonly used autonomous recorders (e.g. Whytock and Christie 2017; Hill et al. 2019). To extract the metadata of the videos (at least duration, initial and final time), access the video subfolder with the getVideoMetadata function. If videos were recorded with drones, additional metadata may be available (e.g. altitude, GPS coordinates) and will be extracted and organized into the text database as well. Most commercially available drones save detailed logs of every flight. Information on aircraft sensors, motors, battery, remote controller and media are logged on-board and on remote applications, often using proprietary file structures. Hobbyists (e.g. DatCon, TXTlogtoCSVtool), companies (e.g. https://airdata.com) and forensics researchers (e.g. Clark et al. 2017) have developed tools to decode flight logs into readable .csv files. Alternatively, the MAMMals R package can extract the basic flight log data recorded by DJI drones. These drones can produce timestamped subtitles (1 Hz data) logging the aircraft latitude, longitude and height (calculated from the aircraft barometer), the home point latitude and longitude, and camera settings. However, the subtitles do not contain auxiliary information on the aircraft and camera roll, pitch and angle, and the accuracy of latitude and longitude is limited to 10 m. Conveniently, though, subtitles are natively exported from DJI drones as text files (.srt) alongside the video files, and the MAMMals readSRT function can read all .srt files in a folder and return an R data frame with the formatted metadata of the DJI drone flight logs.
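The corresponding calls for the audio, video and drone flight-log metadata might look as follows (again, argument names are illustrative assumptions):

```r
# Extract metadata from each media subfolder (argument names assumed)
audio_meta <- getAudioMetadata(path = "audios/")
video_meta <- getVideoMetadata(path = "videos/")

# Read all DJI .srt subtitle files (1 Hz flight logs) exported alongside
# the drone videos
flight_logs <- readSRT(path = "videos/")
```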

Step 2: aligning multimedia files

After extracting the metadata of the multimedia files, large volumes of multimedia data can be aligned with the MAMMals functions that subset the media files containing photo-identification data (Fig. 1c). Use the selectVideos or selectAudios functions to get the video and audio files of interest, respectively, by aligning their metadata with the metadata of the photographs of individuals (previously generated by the functions getVideoMetadata, getAudioMetadata and getPhotoMetadata). The select functions calculate, for every photograph taken during the sampling event, the corresponding time within the video or audio files, and return an R data frame with the matching records. One can then export an R data frame containing only the photo-identified individuals, or other events of interest, to a .csv or .txt file. We highlight that while these functions are based on photograph metadata, they also work with other text data in which events are correctly timestamped (Fig. 1c), such as behavioural events recorded in field notes or GPS positions from loggers fitted to the animals.
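A sketch of this alignment step, under the same assumptions about argument names:

```r
# Keep only the video and audio files that contain photo-identified
# individuals by aligning their metadata with the photograph metadata
videos_with_ids <- selectVideos(photos = photo_meta, videos = video_meta)
audios_with_ids <- selectAudios(photos = photo_meta, audios = audio_meta)

# Export the aligned events for record keeping or later processing
write.csv(videos_with_ids, "videos_with_ids.csv", row.names = FALSE)
```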

Step 3: linking photographs with multimedia data

After aligning the metadata of the media files, the photo-identification data can be linked with the video or audio files by trimming these media files (Fig. 1d) based on the information generated by the selectVideos and selectAudios functions. If the aim is to get a still from the video for every photo-identified individual, the getVideoFrame function can export a frame of the video at the moment each photo was taken. If the aim is to perform further video or audio analyses, one can export short clips around the time of each photo-identification for both video (getVideoClip) and audio files (getAudioClip). If sampling with drones, one can automatically link data from the flight logs to every event exported by the selectVideos or selectAudios functions. The linkFlightToMetadata function returns an R data frame in which the number of rows equals the number of photo-identification photographs and the columns contain all available metadata. The linkMetadataToFlight function merges the media data with the flight data, returning an R data frame with all the flight logs, or a list with one data frame per flight log.
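A sketch of the linking step, continuing the objects created above (argument names, and the 6-min clip length borrowed from our case study, are illustrative assumptions):

```r
# Export a still frame at the moment each identification photo was taken,
# and a short clip around it (argument names and clip length assumed)
getVideoFrame(selection = videos_with_ids, path = "videos/")
getVideoClip(selection = videos_with_ids, path = "videos/", duration = 360)
getAudioClip(selection = audios_with_ids, path = "audios/", duration = 360)

# Link drone flight-log information (latitude, longitude, height) to each
# photo-identification event, and vice versa
linked_events  <- linkFlightToMetadata(selection = videos_with_ids,
                                       flights   = flight_logs)
linked_flights <- linkMetadataToFlight(selection = videos_with_ids,
                                       flights   = flight_logs)
```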

Step 4: syncing multimedia data

Finally, the multiple media data sources can be synchronized based on the intersection of their recording time (Fig. 1e). Using the function syncMedia, video and audio files that were sampled concurrently and selected by the selectVideos and selectAudios functions can be trimmed to match the time intersection, and merged into a single file or exported as separate media files. Other auxiliary text data (e.g. GPS trackers, heart rate loggers, flight logs) recorded simultaneously in the field can be synchronized based on the intersection of their sampling time and merged into a single text database using the function syncData, as long as the input clocks are precisely synced.
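A sketch of the syncing step (argument names are assumptions; the GPS logger file is a hypothetical example of auxiliary timestamped text data):

```r
# Trim concurrently recorded video and audio to their time intersection
# and merge or export them (argument names assumed)
syncMedia(videos = videos_with_ids, audios = audios_with_ids)

# Merge auxiliary timestamped text data over the same time intersection
gps_tracks  <- read.csv("gps_logger.csv")   # hypothetical logger file
synced_text <- syncData(list(flight_logs, gps_tracks))
```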

Auxiliary functions for post-processing multimedia data

The MAMMals R package was designed to streamline the pre-processing of photo-identification and multimedia data; its workflow therefore does not include the post-processing of the biological data of interest. After linking the photographs with the useful parts of the videos and audios, manual or semiautomatic extraction of the target data is required. This may include video playback to quantify behavioural states and events (e.g. Torres et al. 2018) or morphometric and health variables (e.g. Christiansen et al. 2020); automatic detection of species (e.g. Gray et al. 2019); or the photographic comparison needed to identify individual animals (e.g. Urián et al. 2015). To efficiently measure and extract such biological data from photo, video and audio data, we point the reader to the growing number of computational tools available elsewhere (e.g. Abràmoff et al. 2004; Friard and Gamba 2016; Beery et al. 2020; Schneider et al. 2018; Torres and Bierlich 2020; Bird and Bierlich 2020). We exemplify one case of post-processing behavioural data in the next section, but here we highlight that the MAMMals R package also contains functions and utilities to assist with the post-processing of the linked multimedia or auxiliary data. For instance, one can use MAMMals to wrangle and convert information from the drone flight log data, such as gimbal and camera angles, GPS coordinates, and digital compass and barometer readings. We conceptually divide these functions into data tools and visualization (Table 1), identified by the prefixes do and view, respectively. For instance, doCorrectAngle converts drone yaw values reported in the −180° to 180° range into the 0° to 360° range, and viewFlightPath plots a 2D drone flight path with photos as points, using data from the linkMetadataToFlight function.
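For example, assuming an illustrative column name in the output of linkMetadataToFlight, these auxiliary functions could be used as follows:

```r
# Convert drone yaw from the -180..180 range to the 0..360 range
# (the column name 'yaw' is an assumption for illustration)
linked_flights$yaw_360 <- doCorrectAngle(linked_flights$yaw)

# Plot the 2D flight path with photo-identification events as points
viewFlightPath(linked_flights)
```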

Table 1 Auxiliary functions provided in the MAMMals R package to assist data wrangling, conversion and visualization

An illustrative case study

To illustrate the utility of the MAMMals R package, we used individual and behavioural data collected from a coastal bottlenose dolphin population in Laguna, southern Brazil, where some individual dolphins forage near the coast with net-casting fishers (Simões-Lopes et al. 1998). To explore the dolphins' foraging behaviour, we combined standard photo-identification with overhead video recorded by a commercially available drone (DJI Mavic Pro) with a built-in high-resolution camera mounted on a gimbal. We hovered the drone over the study area above 60 m to minimize potential disturbance to the dolphins (Fettermann et al. 2019) and followed all safety flight guidelines (Fiori et al. 2017; Raoult et al. 2020). The drone camera covered an area of ca. 7500 m², including the coast where the fishers wait for dolphins and ca. 60 m of the lagoon canal. Simultaneously, two photographers photographed the dolphins' dorsal fins for posterior individual identification based on nicks, notches, scars and skin lesions, following photo-identification protocols (Hammond et al. 1990). One photographer positioned ashore used a DSLR Canon 60D camera equipped with a 100–400 mm lens to photograph all dolphins in the video footage area, while the second photographer stood on a 1.5 m platform 3 m behind the fishers and used a DSLR Canon 7D MkII with built-in GPS and digital compass and a 70–300 mm lens to identify the individual dolphins that approached the fishers to interact. This photographer was always captured in the drone footage and used a flash (Yongnuo) pointing up, so the timing of the photographs could be verified in the video to double-check that the clocks of the camera and drone were properly synced.

To illustrate two types of behavioural data that can be measured from the merged video and photo-identification dataset, we tracked (i) the foraging behaviour of individual dolphins, in terms of distance and heading angle relative to the coast over time (Fig. 2a); and (ii) the foraging behaviour of a group of dolphins, in terms of spatial cohesion and diving synchrony (Fig. 2b). In both cases, we used the MAMMals R package to automatically select examples of drone videos containing photo-identified dolphins from a total of 56.6 h of footage and 3614 photographs of 21 identified individual dolphins. First, we used the functions getPhotoMetadata and getVideoMetadata to extract and organize the metadata of photographs and videos, extracted the drone flight logs, and used some of the auxiliary functions to correct the angles of the drone footage (doConvertAngle, doCorrectCameraYaw) and to filter out flights that were too low (doFilterDroneHeight) or in which the camera was not pointing straight down (doFilterGimbalPitch).
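A sketch of this pre-processing pipeline (function names follow the text; argument names and filtering thresholds are assumptions based on the 60-m flight height and the requirement that the camera points straight down):

```r
# Extract and clean the drone flight logs before aligning with photos
flight_logs <- readSRT(path = "videos/")
flight_logs <- doConvertAngle(flight_logs)
flight_logs <- doCorrectCameraYaw(flight_logs)

# Discard flights below 60 m or with the camera not pointing straight down
flight_logs <- doFilterDroneHeight(flight_logs, min_height = 60)
flight_logs <- doFilterGimbalPitch(flight_logs, pitch = -90)
```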

Fig. 2
figure 2

Examples of individual- and group-level behaviour of photo-identified mammals extracted from overhead videos. a Tracking the foraging behaviour of an individual coastal dolphin, in terms of distance and angle to the shore. The MAMMals package was used to automatically select and clip a video containing a solitary photo-identified dolphin (inset photo-identification). The video was then post-processed: the dolphin's distances (yellow lines in the picture; y-axis in the plot) and angles (cyan lines in the picture, with the middle point centred on the dolphin; arrows in the plot, whose colours indicate temporal sequence) relative to shore were measured each time it surfaced to breathe. Distances measured in pixels were converted to meters based on a 1-m scale placed behind the photographer; angles measured in degrees relative to the shore were converted to radians, considering True North as a reference. b Group cohesion and dive synchrony of photo-identified bottlenose dolphins, in terms of relative distance among members and timing of surfacing. The MAMMals R package was used to select the photographs of the dolphins' dorsal fins for posterior identification of the 5 group members. The group of 5 dolphins was then tracked over time with a custom computer vision model trained to detect dolphins in drone videos. Cohesion was estimated as the average Euclidean distance among the centroids of all dolphins detected (i.e. the green rectangles with detection scores) every 0.2 s, converted to meters using a known 1-m scale captured in the video (not shown here). Synchrony was estimated as the time difference between detections. Pictures 1–4 illustrate a diving sequence of a subgroup of 5 dolphins, in which 1 individual is detected first, followed by three that surfaced simultaneously, and then by the fifth individual after a 2-s lag. Box plots present the distribution of mean distances and breath intervals (y-axes) across different numbers of simultaneous detections (circles) of dolphins at the surface (x-axes) during a ~20-min drone video

To describe (i) the individual-level foraging, we then used the function selectVideos to identify drone videos taken when there were 1 or 2 dolphins at the interaction site, and the function getVideoClip to crop 6-min video clips around the photographs taken. Next, we manually processed these clips with the open-source software imageJ (Abràmoff et al. 2004): each time the photo-identified dolphin surfaced to breathe, we used the 'straight line' tool to measure the distance of the dolphin from shore, and the 'angle' tool to measure the angle between the dolphin's heading and the shore. In videos with more than one dolphin at the site, we distinguished photo-identified dolphins recorded in the video at the same time but in different places using the shooting angle between the dolphin and the photo-identification camera, available in the photograph metadata thanks to the camera's built-in compass. Finally, we converted the distances measured in pixels to meters based on a 1-m scale captured in the drone video, and converted the angles measured in degrees relative to the shore to radians, considering True North as a reference. In Fig. 2a, we present an example of these data on the distances and angles of a photo-identified individual dolphin foraging close to shore.
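The unit conversions at the end of this post-processing are simple; a sketch with hypothetical imageJ measurements and a hypothetical pixel scale:

```r
# Illustrative measurements exported from imageJ (pixels and degrees)
distance_px <- c(152, 190, 221)
angle_deg   <- c(35, 80, 120)

# Hypothetical scale: a 1-m reference object spans 38 pixels in the footage
px_per_m   <- 38
distance_m <- distance_px / px_per_m

# Convert heading angles from degrees to radians
angle_rad <- angle_deg * pi / 180
```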

To describe (ii) the group-level foraging, we used the functions selectVideos and getVideoClip to select the photographs of all dolphins foraging in groups and to trim the complete 20-min drone video into a shorter clip around the time the photos were taken. We first photo-identified the individuals manually, and then measured group cohesion and dive synchrony, in terms of the relative distance among group members and the timing of surfacing. To do that, we used a convolutional neural network object-detection classifier (He et al. 2016) to automatically detect and count dolphins in the drone footage. We re-trained a TensorFlow pre-trained classifier with the Faster-RCNN model architecture (Ren et al. 2015) using 838 drone video frames in which dolphins were manually labelled with LabelImg (Tzutalin 2015), and 200 other such images for testing the model. We then applied this supervised computer-vision model to detect and count the number of dolphins every 0.2 s of the drone video, i.e. every 5 frames of a 25 FPS video (for a similar approach, see Guo et al. 2020). We highlight that although we used machine learning to post-process the video clips, this procedure could also be done manually. For instance, one can extract short .avi clips with a frame rate of 1 fps using the getVideoClip function, and then import the clip into imageJ to measure the inter-individual distances and surfacing times. To estimate group cohesion, we considered greater cohesion when individuals were closer together: we measured cohesion as the average Euclidean distance, in pixels, between the centroids of all dolphins detected in each frame, and converted these distances into meters based on a known 1-m scale recorded in the drone video. To estimate diving synchrony, we considered greater synchrony when breath intervals were shorter: we measured synchrony as the time lag between detections, and considered the group to be diving synchronously when more than one dolphin was detected in the same video frame. In Fig. 2b, we present these data on group cohesion and dive synchrony as the distribution of mean distances and breath intervals among different numbers of dolphins at the surface.
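A minimal sketch of these two metrics, assuming a data frame of detections with a frame time (s) and centroid coordinates (pixels); all values and column names are illustrative:

```r
# Hypothetical detections: one row per dolphin centroid per analysed frame
detections <- data.frame(
  time_s = c(10.0, 10.0, 10.0, 10.2, 12.4),
  x_px   = c(310, 355, 402, 330, 500),
  y_px   = c(220, 240, 212, 500, 260)
)
px_per_m <- 38  # hypothetical 1-m scale captured in the video

# Cohesion: mean pairwise Euclidean distance among centroids, per frame,
# converted to meters (frames with a single detection return NaN)
cohesion <- tapply(seq_len(nrow(detections)), detections$time_s, function(i) {
  mean(dist(detections[i, c("x_px", "y_px")])) / px_per_m
})

# Synchrony: time lag (s) between successive frames with detections
synchrony <- diff(as.numeric(names(cohesion)))
```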

Caveats

The tools presented herein assist in organizing simultaneous sampling methods, but caveats exist. First, the level of detail of the outputs (be they the merged databases or the cropped and synced media) may depend on the accessibility of the study system. We have illustrated how the MAMMals tools work when recording and tracking coastal dolphins, but these tools could be used to process multimedia of mammals individually identifiable from photographs taken from the ground or sea level (e.g. sperm and humpback whale caudal fins, or blue whale pigmentation; Hammond et al. 1990) and from overhead (e.g. the heads of right whales, or other identifiable body parts of marine and terrestrial mammals; Landeo-Yauri et al. 2020; Maeda et al. 2021). However, in our example, we had the advantage of keeping the photographer in the overhead video frame at all times, both to record the position of the GPS-equipped camera as a reference point and to double-check the synchrony between the video and photograph data streams. This setup is rather unusual for studies of free-ranging mammals and requires the sampling design to be adapted to the reality of other study systems. For example, boat-based focal follows of cetaceans could aim to keep the boat close to the group most of the time so that the photographer remains in the overhead video frame, or overhead behavioural sampling of terrestrial mammals could focus on a relatively small open area.

The second limitation of our tools is that the precision of the link between the photo-identification and the other multimedia can depend on group size and group cohesion. In our example, we tracked solitary animals and small groups that can be easily photo-identified, but mismatches in individual identification can occur when collecting data from multiple individuals at the same time, such as in large and tight groups. Our drone videos can contain multiple individuals, so an individual photographed at a given time could potentially be linked to several individuals that appear in the drone video at that time. We resolved this by keeping the photographer in the overhead video frame and relying on the angle of the camera's built-in digital compass to tease apart individuals in the overhead footage. However, these decisions become increasingly difficult as the group size and the rate of pictures taken increase, and/or as the groups become tighter and closer to the photographer. In such situations, our tools could still help define the timestamps of sampling events to extract group-level (but not individual) data or to identify subgroups of animals.

Closing remarks

Our tools to streamline the use of multimedia data with traditional individual identification methods are a step toward integrating multiplatform behavioural sampling of free-ranging mammals. We acknowledge there is room for improvement and, to encourage further collective development of these tools, we provide all the code of the MAMMals package in an open repository (https://bitbucket.org/maucantor/mammals/). We hope to inspire further collective work in the scientific community to generalize the process of linking multiple sampling platforms and so refine the collection and processing of data on individual animals. More importantly, we hope these computational tools improve the raw material needed to generate new insights into the population dynamics, ecological interactions and behaviour of free-ranging animals.