1 Introduction

We are motivated by the problem of tracking and identifying group-housed mice in videos of a cluttered home-cage environment, as in Fig. 1. This animal identification problem goes beyond tracking to assigning unique identities to each animal, as a precursor to automatically annotating their individual behaviour (e.g. feeding, grooming) in the video, and analysing their interactions. It is also distinct from object classification (recognition) [1]: the mice have no distinguishing visual features and cannot be merely treated as different objects. Their visual similarity, together with the close confines of the enriched cage environment, makes standard visual tracking approaches very difficult, especially when the mice are huddled or interacting together. We found experimentally that standard trackers alone break down, producing short tracklets with spurious and missing detections, and thus do not provide a persistent identity. However, we can leverage additional information provided by a unique Radio-Frequency Identification (RFID) implant in each mouse, which gives a coarse estimate of its location (on a \(3 \times 6\) grid). This setting is not unique to animal tracking and applies to more general objects. Similar situations arise, e.g. when identifying specific team-mates in robotic soccer [2] (where the object identification is provided by the weak self-localisation of each robot), or identifying vehicles observed by traffic cameras at a busy junction (as required, for example, for law-enforcement), making use of a weak location signal provided by cell-phone data.

Our solution starts from training and running a mouse detector on each frame, and grouping the detections together with a tracker [3] to produce a set of tracklets, as shown in Fig. 3c. We then formulate an assignment problem to identify each tracklet as belonging to a mouse (based on RFID data) or a dummy object (to capture spurious tracklets). This problem is solved using Integer Linear Programming (ILP), see Fig. 3f. Tracking is used to reduce the complexity of the problem relative to a frame-wise approach and implicitly enforce temporal continuity for the solution.

Fig. 1

Sample frames from our dataset. To improve visualisation, the frames are processed with CLAHE [4] and brightened: our methods operate on the raw frames. The mice are visually indistinguishable, their identity being inferred through the RFID pickup, shown as Red/Green/Blue crosses projected into image-space. This is enough to distinguish the mice when they are well separated (left): however, as they move around, they are occluded by cage elements (Green in centre) or by cage mates (Green by Blue in right) and we have to reason about temporal continuity and occlusion dynamics

The problem setup is not standard Multi-Object Tracking (MOT), since we go beyond tracking by identifying individual tracklets using additional RFID information which provides persistent unique identifiers for each animal. Our main contribution is thus to formulate this object identification as an assignment problem, and solve it using ILP. A key part of the model is a principled probabilistic treatment for modelling detections from the coarse localization information, which is used to provide assignment weights to the ILP.

We emphasise that we are solving a real-world problem which is of considerable importance to the biological community (as per the International Mouse Phenotype Consortium (IMPC) [5]), and which needs to scale up to thousands of hours of data in an efficient manner—this necessarily affects some of our design choices. More generally, with home-cage monitoring systems becoming more readily available to scientists [6], there are serious efforts to maximise their use. Consequently, there is a real need for analysis methods like ours, particularly where multiple animals are co-housed.

In this paper, we first frame our problem in the light of recent efforts (Sect. 2), indicating how our situation is quite unique in its formulation. We discuss our approach from a theoretical perspective in Sect. 3, postponing the implementation mechanics to Sect. 4. Finally, we showcase the utility of our solution through rigorous experiments (Sect. 5) on our dataset which allows us to analyse its merits in some depth. A more detailed description of the dataset, details for replicating our experimental setups (including parameter fine-tuning) as well as deeper theoretical insights are relegated to the Supplementary Material. The curated dataset, together with code and some video clips of our framework in action, is available at https://github.com/michael-camilleri/TIDe (see Sect. 6 and Appendix D for details).

2 Related work

Below we discuss related work w.r.t. MOT, re-identification (re-ID) and animal tracking. We defer discussion of how our ILP formulation relates to other work until after the method is presented in Sect. 3.2 (see Sect. 3.4).

Multi-object tracking

There have been many recent advances in MOT [7], fuelled in part by the rise of specialised neural architectures [8,9,10]. Here the main challenge is keeping track of an undefined number of visually distinct objects (often people, as in, e.g. [11, 12]), over a short period of time (as they enter and exit a particular scene) [13]. However, our problem is not MOT because we need to identify (rather than just track) a fixed set of individuals using persistent, externally assigned identities, rather than relative ones (as in, e.g. [14]): i.e. our problem is not indifferent to label-switching. We care about the absolute identity assigned to each tracklet—even if we had input from a perfect tracker, there would still need to be a way to assign the anonymous tracklets to identities, as we do below.

There is another important difference between the current state-of-the-art in MOT (see e.g. [9, 14,15,16,17,18,19]) and our use-case. As the mice are essentially indistinguishable, we cannot leverage appearance information to distinguish them but must rely on external cues—the RFID-based quantised position. On top of this, there are also the challenges of operating in a constrained cage environment (which exacerbates the level of occlusion), and the need to consistently and efficiently track mice over an extended period of time (on the order of hours).

(Re-)Identification

Re-ID [20] addresses the transitory notion of identity by using appearance cues to join together instances of the same object across ‘viewpoints’. Obviously, this does not work when objects are visually indistinguishable, but there is also another key difference. The standard re-ID setup deals with associating together instances of the same object, often (but not necessarily) across multiple non-overlapping cameras/viewpoints [21]. In our specific setup, however, we wish to relate objects to an external identity (e.g. RFID-based position); our work thus sits on top of any algorithm that builds tracklets (such as that of Fleuret et al. [20]).

Animal Tracking

Although generally applicable, this work was conceived through a collaboration with the Mary Lyon Centre at MRC Harwell [22], and hence seeks to solve a very practical problem: analysis of group-housed mice in their enriched home-cage environment over extended periods of 3-day recordings. This setup is thus significantly more challenging than traditional animal observation models, which often side-step identification by involving single animals in a purpose-built arena [23,24,25,26] rather than our enriched home-cage. Recent work on multiple-animal tracking [16, 27,28,29] often uses visual features for identification which we cannot leverage (our subjects are visually identical): e.g. the CalMS21 [28] dataset uses recordings of pairs of visually distinct mice, while Marshall et al. [29] attach identifying visual markers to their subjects in the PAIR-R24M dataset. Most work also requires access to richer sources of colour/depth information [30, 31] rather than our single-channel Infra-Red (IR) feed, or employs top-mounted cameras [16, 31,32,33,34] which give a much less cluttered view of the animals. Indeed, systems such as idtracker.ai [35] or DeepLabCut [36, 37] do not work for our setup, since the mice often hide each other and keypoints on the animals (which are an integral part of the methods) are not consistently visible (besides requiring more onerous annotations). Finally we reiterate that our goal is to identify tracklets through fusion with external (RFID) information, which none of the existing frameworks support—for example the multi-animal version of DeepLabCut [37] only supports supervised identity learning for visually distinct individuals.

3 Methodology

This section describes our proposed approach to identification of a fixed set of animals in video data. We explain the methodology through the running example of tracking group-housed mice. Specifically, consider a scenario in which we have continuous video recordings of mice housed in groups of three as shown in Fig. 1. The data, captured using a setup similar to that shown in Fig. 2a, consists of single-channel IR side-view video and coarse RFID position pickup from an antenna grid below the cage as in Fig. 2b. Otherwise, the mice have no visual markings for identification. Implementation details are deferred to Sect. 5.1.

Fig. 2

The Actual Analytics Home-Cage Analysis system. a The rig used to capture mouse recordings (Illustration reproduced with permission from [22]). b The (numbered) points on the checkerboard pattern used for calibration (measurements are in mm): the numbers correspond also to the RFID antenna receivers on the baseplate

Overview

Our system takes in candidate detections of the fixed set of animals in the form of Bounding Boxes (BBs) and assigns an identity to each of them using a two-stage pipeline, as shown in Fig. 3. First, we use a Tracker [3] to join the detections across frames into tracklets, based on optimising the Intersection-over-Union (IoU) between BBs in adjacent frames. This stage injects temporal continuity into the object identification, filters out spurious detections, and, as will be seen later, reduces the complexity of the identification problem. Then, an Identifier, using a novel ILP formulation, combines the tracklets with the coarse position information to identify which tracklet(s) belong to each of the known animals, based on a probabilistic weight model of object locations.

Fig. 3

Our proposed architecture. Given per-frame detections of animals (a), a tracker (b) joins these into tracklets, shown (c) in terms of their temporal span. After discarding spurious detections based on a lifetime threshold, the identifier (e) combines these with coarse localisation traces over time (d), to produce identified Bounding Boxes (BBs) for each object throughout the video, shown (f) as coloured tracklets/BBs

3.1 Tracking

We use the tracker to inject temporal continuity into our problem, encoding the prior that animals can only move a limited distance between successive frames. Subsequently, we formulate our object identification problem as assigning tracklets to animals over a recording interval indexed by \(t\in \left\{ 1,..., T\right\} \).

The tracker also serves the purpose of simplifying the optimisation problem. While BBs and positions are registered at the video frame rate, it is not necessary to run our optimisation problem (below) at this granularity. Rather, we group together the sequence of frames for which the same subset of tracklets are active: a new interval begins when a new tracklet is spawned, or when an existing one disappears. The observation is thus broken up into intervals, indexed sequentially by \(t\), shown as \(\tau _{\{ 0, \dots , 9 \}}\) at the top of Fig. 3c.
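As a concrete illustration, the interval construction amounts to a simple sweep over tracklet start/end times. The sketch below is our own minimal example (not the released code), assuming each tracklet is given as an inclusive (start, end) frame span.

```python
# Minimal sketch (not the authors' released code): build the intervals tau_t over which
# the set of active tracklets stays constant. Boundaries occur wherever a tracklet is
# spawned or disappears.
def build_intervals(tracklets, num_frames):
    """tracklets: list of inclusive (start_frame, end_frame) spans.
    Returns a list of (first_frame, last_frame, active_tracklet_ids)."""
    boundaries = {0, num_frames}
    for start, end in tracklets:
        boundaries.update((start, end + 1))
    boundaries = sorted(b for b in boundaries if 0 <= b <= num_frames)

    intervals = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        # By construction of the boundaries, a tracklet is active over the whole interval.
        active = [i for i, (s, e) in enumerate(tracklets) if s <= lo and e >= hi - 1]
        intervals.append((lo, hi - 1, active))
    return intervals

# Example: two tracklets overlapping in the middle of a 10-frame clip.
print(build_intervals([(0, 5), (3, 9)], num_frames=10))
# -> [(0, 2, [0]), (3, 5, [0, 1]), (6, 9, [1])]
```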

3.2 Object identification from tracklets

The tracker produces I tracklets: to these we add T special ‘hidden’ tracklets for which no BBs exist, and which serve to model occlusion at any time-step. The goal of the Identifier is then to assign object identities to the set of tracklets, \(\textbf{S}\), of size \(I+T\). The subset of tracklets active at time \(t\) is represented by \(\textbf{S}^{\{t\}}\). We assume there are J animals \(o_1, \ldots , o_J\) we wish to track/identify (and for which we have access to coarse location through time); an extra dummy object \(o_{J+1}\) is a special outlier/background model which captures spurious tracklets in \(\textbf{S}\).

We also define a utility \(w_{i j}\), which is a measure of the quality of the match between tracklet \(s_i\) and animal \(o_j\): we assume that tracklets cannot switch identities and are assigned in their entirety to one animal (this is reasonable if we are liberal in breaking tracklets). The weight is typically a function of the BB features and animal position (see Sect. 3.3). Given the above, our ILP finds the assignment matrix A, with elements \(a_{i j} \in \left\{ 0, 1\right\} \) (where \(a_{i j} = 1 \Leftrightarrow \) ‘\(s_i\) is assigned to \(o_j\)’), that maximises:

$$\begin{aligned}&\max _{A} \sum _{i=1}^{I+T} \sum _{j=1}^{J+1} w_{i j} a_{i j} , \end{aligned}$$
(1)
$$\begin{aligned}&\text {subject to the constraints:} \nonumber \\&\sum _{j=1}^{J+1} a_{i j} = 1 \quad \forall \ i \in I , \end{aligned}$$
(2)
$$\begin{aligned}&\sum _{s \in \textbf{S}^{\{ t \}}} a_{s j} = 1 \quad \forall \ t \in T,\ j \in J , \end{aligned}$$
(3)
$$\begin{aligned}&a_{i (J+1)} = 0 \quad \forall \ I+1 \le i \le I+T . \end{aligned}$$
(4)

Constraint (2) ensures that a generated tracklet is assigned to exactly one animal, which could be the outlier model. Constraint (3) enforces that each animal (excluding the outlier model) is assigned exactly one tracklet (which could be the ‘hidden’ tracklet) at any point in time, and finally, constraint (4) prevents assigning the hidden tracklets to the outlier model. Unfortunately, these constraints mean that the resulting linear program is not Totally Unimodular [38], and hence does not automatically yield integral optima. Consequently, solving the ILP is in general NP-Hard, but we have found that, due to our temporal abstraction, a branch and cut approach efficiently finds the optimal solution at the size of our data without the need for any approximations.
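To make the formulation concrete, the sketch below shows how Eqs. (1)–(4) could be set up with the Python MIP package that we use in Sect. 4.4; the data layout (weight matrix, interval membership) and variable names are our own assumptions rather than the released implementation.

```python
# Minimal sketch of the assignment ILP of Eqs. (1)-(4) using the Python MIP package [50].
# Assumed inputs (0-indexed): W is an (I+T) x (J+1) weight matrix, with rows 0..I-1 the
# real tracklets and rows I..I+T-1 the 'hidden' tracklets; active[t] lists the tracklets
# in S^{t} (including the hidden tracklet I+t); column J is the outlier model o_{J+1}.
from mip import Model, xsum, maximize, BINARY

def identify(W, active, I, T, J):
    m = Model()
    a = [[m.add_var(var_type=BINARY) for _ in range(J + 1)] for _ in range(I + T)]

    # Eq. (1): maximise the total assignment utility.
    m.objective = maximize(xsum(W[i][j] * a[i][j]
                                for i in range(I + T) for j in range(J + 1)))
    # Eq. (2): each real tracklet is assigned to exactly one object (possibly the outlier).
    for i in range(I):
        m += xsum(a[i][j] for j in range(J + 1)) == 1
    # Eq. (3): in every interval, each animal is covered by exactly one active tracklet.
    for t in range(T):
        for j in range(J):
            m += xsum(a[i][j] for i in active[t]) == 1
    # Eq. (4): hidden tracklets may not be assigned to the outlier model.
    for i in range(I, I + T):
        m += a[i][J] == 0

    m.optimize()
    return [[int(a[i][j].x) for j in range(J + 1)] for i in range(I + T)]
```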

Our ILP formulation of the object identification process is related to the general family of covering problems [39] as we show in our Supplementary Material (see Sect. B.1). In this respect, our problem is a generalisation of Exact-Set-Cover [40] in that we require every animal to be ‘covered’ by one tracklet at each point in time: i.e. we use the stronger equality constraint rather than the inequality in the general ILP (see Eq. B.2). Unlike the Exact-Set-Cover formulation, however, (a) we have multiple objects (animals) to be covered, (b) some objects need not be covered at all (outlier model), and (c) all detected tracklets must be used (although a tracklet can cover the ‘extra’ outlier model). Note that with respect to (a), this is not the same as Set Multi-Cover [41] in which the same object must be covered more than once: i.e. \(b_f\) in Eq. (B.2) is always 1 for us.

3.3 Assignment weighting model

The weight matrix with elements \(w_{ij}\) is the essential component for incorporating the quality of the match between tracklets and locations. It is easier to define the utility of assignments on a per-frame basis, aggregating these over the lifetime of the tracklet. The per-frame weight boils down to the level of agreement between the features of the BB and the position of the object at that frame, together with the presence of other occluders.

We can use any generative model for this purpose, with the caveat that we can only condition the observations (BBs) on fixed information which does not itself depend on the result of the ILP assignment. Specifically, we can use all the RFID information and even the presence of the tunnel, but not the locations of other BBs. The reason for this is that if we condition on the other detections, the weight will depend on the other assignments (e.g. whether a BB is assigned to a physical object or not), which invalidates the ILP formulation.

We define our weight model as the probability that an animal picked up by a particular RFID antenna j, or the outlier distribution \(J+1\), could have generated the BB i. For the sake of our ILP formulation, we also allow animal j to generate a hidden BB when it is not visible.

Fig. 4

a Model for assigning \(BB_i\) to \(o_j\) per-frame in F (frame index omitted to reduce clutter) when \(o_j\) is an animal. Variable \(p_j\) refers to the position of animal j, and \(c_j\) captures contextual information. The visibility is represented by \(v_j\), and the BB (actual or occluded) by \(BB_i\). b Model for assigning \(BB_i\) under the outlier model. c Neighbourhood definition for representing \(c_j\)

For objects of interest (\(j \le J\)), we employ the graphical model in Fig. 4a. Variable \(p_{j}\) represents the position of object j: \(c_j\) captures contextual information, which impacts the visibility \(v_j\) of the BB and is a function of the RFID pickups of the other animals in the frame. The visibility, \(v_j\), can take on one of three values: (a) clear (fully visible), (b) truncated (partial occlusion), and (c) hidden (object is fully occluded and hence no BB appears for it). The BB parameters are then conditioned on both the object position (for capturing the relative arrangement within the observation area) and the visibility as:

$$\begin{aligned} P\left( {BB_i|p_j, v_j}\right) \sim {\left\{ \begin{array}{ll} 1 &{} \text {if } v_j = \text {Hidden}\ \& \ i \ge I+1 \\ 0 &{} \text {if } v_j = \text {Hidden}\ \& \ i < I+1 \\ {\mathcal {N}}\left( \mu _{pv}, \Sigma _{pv}\right) &{} \text {otherwise} \end{array}\right. } , \end{aligned}$$
(5)

where the parameters of the Gaussian are in general a function of object position and visibility. Note that in the top line of Eq. (5) we allow only a hidden object to generate a ‘hidden’ tracklet. Conversely, under the outlier model, Fig. 4b, the BB is modelled as a single broad distribution, capturing the probability of having spurious detections.

Bringing it all together, the per-frame weight is

$$\begin{aligned} w_{i, j}^{[f]} = \log P\left( {BB_i^{[f]} | p_j^{[f]},c_j^{[f]}}\right) , \end{aligned}$$
(6)

with:

$$\begin{aligned} P\left( {BB_i | p_j, c_j}\right) = {\left\{ \begin{array}{ll} \sum _{v_j} P\left( {BB_i | p_j, v_j}\right) P\left( {v_j | p_j, c_j}\right) &{} \text { if } j < J+1 \\ P\left( {BB_i | o_{J+1}}\right) &{} \text { otherwise } \end{array}\right. } , \end{aligned}$$
(7)

where we have dropped the frame index for clarity. The complete tracklet-object weight \(w_{i, j}\) is then the sum of the log-probabilities over the tracklet’s visible lifetime.
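The per-frame computation of Eqs. (5)–(7) amounts to marginalising the BB likelihood over the visibility states. The sketch below is a minimal illustration under our own assumptions: `p_visibility` stands in for the calibrated visibility model of Sect. 4.3 and `gauss_params` for the fitted Gaussian parameters; neither name comes from the released code.

```python
# Minimal sketch (assumed helper names, not the authors' code) of the per-frame weight in
# Eqs. (5)-(7): log P(BB_i | p_j, c_j), marginalised over the visibility v_j.
import numpy as np
from scipy.stats import multivariate_normal

VISIBLE = ("clear", "truncated")

def frame_weight(bb, p_j, c_j, p_visibility, gauss_params):
    """bb: 4-vector [x, y, w, h], or None for a 'hidden' tracklet.
    p_visibility(p, c) -> dict over {clear, truncated, hidden} (e.g. the Random Forest);
    gauss_params(p, v) -> (mu, Sigma) for the Gaussian of Eq. (5)."""
    p_v = p_visibility(p_j, c_j)
    if bb is None:                         # hidden tracklet: only a hidden object can
        return np.log(p_v["hidden"])       # generate it (top line of Eq. (5))
    lik = 0.0
    for v in VISIBLE:                      # the hidden state contributes 0 for a real BB
        mu, cov = gauss_params(p_j, v)
        lik += p_v[v] * multivariate_normal.pdf(bb, mean=mu, cov=cov)
    return np.log(lik)

# The tracklet-object weight w_ij is then the sum of frame_weight over the tracklet's lifetime.
```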

3.4 Relation to other methods

A number of works (see e.g. [20, 42,43,44]) have used ILP to address the tracking problem; however, we reiterate that our task goes beyond tracking to data association with the external information. Indeed, in all of the above cited approaches, the emphasis is on joining the successive detections into contiguous tracklets, while our focus is on the subsequent step of identifying the tracklets by way of the RFID position.

Solving identification problems similar to ours is typically achieved through hypothesis filtering, e.g. Joint Probabilistic Data Association (JPDA) [42]. While we could have posed our problem in this framework (which would be a contribution in its own right, with the inclusion of the position-affinity model in Sect. 3.3), the cost of exact inference for such a technique grows exponentially with time. While there are approximation algorithms (e.g. [45]), our ILP formulation: (a) does not explicitly model temporal continuity (delegating this to the tracker), (b) imposes the restriction that the (per-frame) weight of an animal-tracklet pair is a function only of the pair (i.e. independent of the other assignments), (c) estimates a Maximum-A-Posteriori (MAP) tracklet weight rather than maintaining a global distribution over all possible tracklets, but then (d) optimises for a globally consistent solution rather than running online frame-by-frame filtering. The combination of (a), (b) and (c) greatly reduces the complexity of the optimisation problem compared to JPDA, allowing us to optimise an offline global assignment (d) without the need for simplifications/approximations.

4 Implementation details

4.1 Detector

While the detector is not the core focus of our contribution, it is an important first component in our identification pipeline.

We employ the widely used FCOS object detector [46] with a ResNet-50 [47] backbone, although our method is agnostic to the type of detector as long as it outputs BBs. We replace the classifier layer of the pre-trained network with three categories (i.e. mouse, tunnel and background) and fine-tune the model on our data (see Supplementary Material Sect. C.1). In the final configuration, we accept all detections with confidence above 0.4, up to a maximum of 5 per image. We further discard BBs which fall inside the hopper area by thresholding on a maximum ratio of overlap (with hopper) to BB area of 0.4. This setup achieves a recall (at IoU = 0.5) of 0.90 on the held-out test-set (average precision is 0.87, CoCo Mean Average Precision (mAP) = 0.60). Given our focus on the rest of the identification pipeline, we leave further optimisation over architectures to future work.
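The confidence and hopper-overlap filtering can be expressed as a short post-processing step; the following is a sketch under our own assumptions ([x1, y1, x2, y2] boxes, a rectangular hopper region), not the exact released code.

```python
# Minimal sketch (assumptions: corner-format boxes and a rectangular hopper region) of the
# detection filtering: keep at most 5 boxes with confidence >= 0.4, and discard boxes whose
# overlap with the hopper exceeds 0.4 of the box area.
import numpy as np

def filter_detections(boxes, scores, hopper, conf_thr=0.4, max_det=5, hopper_thr=0.4):
    """boxes: (N, 4) [x1, y1, x2, y2]; scores: (N,); hopper: [x1, y1, x2, y2]."""
    keep = []
    for i in np.argsort(scores)[::-1]:                     # highest confidence first
        if scores[i] < conf_thr or len(keep) >= max_det:
            break
        x1, y1, x2, y2 = boxes[i]
        area = max(x2 - x1, 0) * max(y2 - y1, 0)
        ix1, iy1 = max(x1, hopper[0]), max(y1, hopper[1])  # intersection with the hopper
        ix2, iy2 = min(x2, hopper[2]), min(y2, hopper[3])
        inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
        if area > 0 and inter / area <= hopper_thr:
            keep.append(i)
    return boxes[keep], scores[keep]
```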

4.2 Tracker

We use the SORT tracker of Bewley et al. [3], which maintains a ‘latent’ state of the tracklet (the centroid (x, y) of the BB, its area and the aspect ratio, together with the velocities for all but the aspect ratio) and assigns new detections in successive frames based on IoU. At each frame, the state of each active tracklet is updated according to the Kalman predictive model, and new detections are matched to existing tracklets by the Hungarian algorithm [3] on the IoU (subject to a cutoff).

Our plug’n’play approach would allow us to use other advanced trackers (e.g. [10, 43, 48, 49]), but we chose this tracker for its simplicity. Since most of the above methods leverage visual appearance information, and all mice are visually similar, it is likely that any marginal improvements would be outweighed by a bigger impact on scalability. In addition, previous research (e.g. [13]) has shown that simple methods often outperform more complex ones, but we leave experimentation with trackers as possible future research.

We do, however, make some changes to the architecture, in light of the fact that we do not need online tracking, and hence can optimise for retrospective filtering. The first is that tracklets emit BBs from the start of their lifetime, rather than only after being active for a number of frames: this is because we can always filter out spurious detections at the end. Secondly, when filtering out short tracklets, we do this based on a minimum number of contiguous (rather than total) frames—again, we can do this because we have access to the entire lifetime of each tracklet. We also opt to emit the original detection as the BB rather than a smoothed trajectory, since we believe that the current model assuming a fixed aspect ratio might be too restrictive for our use-case—indeed, we plan to experiment with different models for the Kalman state in the future. Finally, we take the architectural decision to kill tracklets as soon as they fail to capture a detection in a frame: our motivation is that we prefer to have broken tracklets, which we can join later using the identification, rather than increasing the probability of a tracklet switching identity.

To optimise the two main parameters of the tracker—the IoU threshold for assigning new detections to existing tracklets, and the minimum lifetime of a tracklet to be considered viable—we used the combined training/validation set, and ran our ILP object identification end-to-end over a grid of parameters. The optimal results were obtained using a length cutoff of 2 frames and IoU threshold of 0.8 (see Supplementary Material Sect. C.2 for details).

4.3 Weight model

We represent BBs by the 4-vector containing the centroid [x, y] and size [width, height] of the axis-aligned BB. A hidden BB is denoted by the zero-vector and is handled explicitly in logic. The visibility, \(v_j\), can be one of clear, truncated or hidden as already discussed, while the position, \(p_j\), is an integer indexing which of the 18 antennas picked up the mouse. For \(c_j\), we consider the RFID pickups of the other two mice, but we must represent them in an (identity) permutation-invariant way: i.e. the representation must be independent of the identity of the animal we are predicting for (all else being equal). To this end, we define a nine-point neighbourhood around the antenna \(p_j\) as in Fig. 4c, where \(p_j\) would fall in cell 4. Note that where the neighbourhood extends beyond the range of the antenna baseplate, these cells are simply ignored. \(c_j\) is then simply the count of animals picked up in each cell (0, 1 or 2).
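A possible implementation of this context encoding, assuming row-major antenna indices 0–17 on the \(3\times 6\) baseplate (the actual numbering scheme is an assumption here), is sketched below.

```python
# Minimal sketch (assuming row-major antenna indices 0..17) of the permutation-invariant
# context c_j: counts of the cage-mates' pickups in the nine-cell neighbourhood of Fig. 4c,
# centred on the focal animal's antenna (cell 4).
import numpy as np

COLS = 6  # the baseplate is 3 rows x 6 columns

def context(p_focal, p_others):
    """p_focal: antenna index of the focal animal; p_others: antenna indices of cage-mates.
    Returns a length-9 count vector (cells falling off the baseplate simply stay at 0)."""
    counts = np.zeros(9, dtype=int)
    r0, c0 = divmod(p_focal, COLS)
    for p in p_others:
        r, c = divmod(p, COLS)
        dr, dc = r - r0, c - c0
        if -1 <= dr <= 1 and -1 <= dc <= 1:          # inside the 3 x 3 neighbourhood
            counts[(dr + 1) * 3 + (dc + 1)] += 1
    return counts

# Example: focal mouse on antenna 7; cage-mates on antennas 7 (same cell) and 13 (one row below).
print(context(7, [7, 13]))   # -> [0 0 0 0 1 0 0 1 0]
```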

Under the outlier model (Fig. 4b), the BB centroids, [x, y], are drawn from a very wide Gaussian over the size of the image: the size, [w, h], is Gaussian with its mean and covariance fit on the training/validation annotations. For the non-outlier case (Fig. 4a), we model the distribution \(p(v_j|p_j,c_j)\) using a Random Forest (RF): this proved to be the best model in terms of validation log-likelihood (this metric is preferable for a calibrated distribution). The mean \(\mu _{pv}\) of the multivariate Gaussian for the BB parameters at position p with visibility v in Eq. (5) is defined as follows. The centroid components [x, y] are the same irrespective of the visibility, and governed by a homography mapping between the antenna positions on the ground-plane and the annotated BB centroids in the image. The width and height, [w, h], are estimated independently for each row in the antenna arrangement (see Fig. 2b) and visibility (clear/truncated). This models occlusion and perspective projection but allows us to pool samples across pickups for statistical strength. The covariance matrix \(\Sigma _{pv}\) is estimated solely on a per-row basis and is independent of the visibility. Further details on fine-tuning the parameters appear in our Supplementary Material (Sect. C.3).
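As an illustration of this parameterisation, the sketch below constructs \(\mu_{pv}\) from a fitted homography and a per-row size table; the helper names and the row-major antenna indexing are our assumptions, not part of the released implementation.

```python
# Minimal sketch (assumed names and indexing) of the Gaussian mean mu_pv in Eq. (5): the
# centroid comes from a ground-plane-to-image homography of the antenna position, while the
# width/height are pooled per baseplate row and visibility state.
import numpy as np

def gaussian_mean(p_j, v_j, antenna_xy, H, size_stats):
    """antenna_xy: (18, 2) ground-plane antenna coordinates; H: 3x3 homography to image space;
    size_stats[(row, v)]: mean [w, h] fit on the tuning annotations. Returns [x, y, w, h]."""
    gx, gy = antenna_xy[p_j]
    u = H @ np.array([gx, gy, 1.0])
    x, y = u[:2] / u[2]                  # projected centroid in image coordinates
    row = p_j // 6                       # baseplate row (assuming row-major antenna indices)
    w, h = size_stats[(row, v_j)]
    return np.array([x, y, w, h])
```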

4.4 Solver

Unfortunately, our linear program is not Totally Unimodular [38] due to the constraints in Eqs. (2)–(4), and hence, we have to explicitly enforce integrality. Consequently, solving the ILP is in general NP-Hard, but we have found that due to our temporal abstraction, using a branch and cut approach efficiently finds the optimal solution on the size of our data. We model our problem through the Python MIP package [50], using the COIN-OR Branch and Cut [51] paired with the COIN-OR Linear Programming [52] solvers. Running on a conventional desktop (Intel Xeon E3-1245 @3.5GHz with 32 GB RAM), the solver required on the order of a minute on average for each 30-minute segment.

5 Experiments

5.1 Dataset

Source

We apply and evaluate our method on a mouse dataset [22] provided by the Mary Lyon Centre at MRC Harwell, Oxfordshire (MLC at MRC Harwell). All the mice in the study are one-year-old males of the C57BL/6Ntac strain. The mice are housed in groups of three per cage and are continuously recorded for a period of 3 to 4 days at a time, with a 12-hour lights-on, 12-hour lights-off cycle. We have access to 15 distinct cages. The recordings are split into 30-minute periods, which form our unit of processing (segments). Since we are interested in their crepuscular rhythms, we selected the five segments straddling either side of the lights-on/off transitions at 07:00 and 19:00 every day. This gives us about 30 segments per cage.

The data—IR side-view video and RFID position pickup—is captured using one of four Home-Cage Analyser (HCA) rigs from Actual Analytics\(^{\textrm{TM}}\): a prototypical setup is shown in Fig. 2a. Video is recorded at 25 frames per second (FPS), while the antenna baseplate (\(3\times 6\) cells, see Fig. 2b) is scanned at a rate of about 2Hz and upsampled to the frame rate. The latter, however, is noisy due to mouse huddling and climbing. The video also suffers from non-uniform lighting, occasional glare/blurring and clutter, particularly due to a movable tunnel and bedding. Most crucially, however, being single strain (and consequently of the same colour), and with no external markings, the mice are visually indistinguishable.

Pre-processing

A calibration routine was used to map all detections into the same frame of reference using a similarity transform. In order to maintain axis-aligned BBs, we transform the four corners of the BB but then define the transformed BB to be the one which intersects the midpoints of each transformed edge (we can do this because the rotation component is very minimal, \(<2^\circ \)). A small number of segments which exhibited anomalous readings (e.g. spurious/missing RFID traces) that could not be resolved were discarded altogether (see Supplementary Material Sect. A.1).
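For clarity, the re-alignment of the transformed box can be sketched as follows (our own illustration, assuming a homogeneous \(3\times 3\) similarity transform and corner-format boxes).

```python
# Minimal sketch (our own illustration) of the BB re-alignment: transform the four corners,
# then take the axis-aligned box through the midpoints of the transformed edges.
import numpy as np

def transform_bb(bb, M):
    """bb: [x1, y1, x2, y2]; M: 3x3 similarity transform (homogeneous). Returns [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = bb
    corners = np.array([[x1, y1, 1.], [x2, y1, 1.], [x2, y2, 1.], [x1, y2, 1.]])
    tc = (M @ corners.T).T
    tc = tc[:, :2] / tc[:, 2:]                                          # back to Cartesian
    mids = np.array([(tc[i] + tc[(i + 1) % 4]) / 2 for i in range(4)])  # edge midpoints
    # With a near-zero rotation, the extreme midpoint coordinates give the axis-aligned box.
    return [mids[:, 0].min(), mids[:, 1].min(), mids[:, 0].max(), mids[:, 1].max()]
```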

Ground-truthing

Using the VIA tool [53], we annotate video frames at four-second intervals for three-minute ‘snippets’ at the beginning, middle and end of each segment, as well as on the minute throughout the segment. This gives a good spread of samples with enough temporal continuity to evaluate our models. The annotations consist of an axis-aligned BB for each visible mouse, labelling the identity as Red/Green/Blue and the level of occlusion as clear vs. truncated: hidden mice are by definition not annotated. A Difficult flag is set when it is hard to make out the mouse even for a human observer. Where the identity of any mouse cannot be ascertained with certainty, this is noted in the schema and the frame discarded in the final evaluation. Refer to Supplementary Material Sect. A.2 for further details.

Data splits

Our 15 cages were stratified into training (7), validation (3) and testing (5). This guarantees unbiased generalisation estimates, and shows that our method can work on entirely novel data (new cages/identities) without the need for additional annotations. We annotated a random subset of 7 segments (3.5 Hrs) for training, 6 (3 Hrs) for validation and 10 (5 Hrs) for testing (2 per cage). This yielded a total of 740 (training), 558 (validation) and 766 (testing) frames in which the mice could be unambiguously identified. We used all frames within the training/validation sets for optimising the weight model, but the rest of the architecture is trained and evaluated on three-minute snippets only, using 498, 398 and 699 frames, respectively. For testing, we remove snippets in which 50% or more of the frames show immobile mice or tentative annotations/huddles due to severe occlusion.

5.2 Comparison methods

We compare the performance of our architecture to two simpler baseline models and off-the-shelf trackers. For overall statistics, we also report the maximum achievable performance given the detections.

The first of the baselines, Static (C), is a per-frame assignment (Hungarian algorithm) based on the Euclidean distance between the centroids of the BBs and the projected RFID tags. The second model, Static (P), uses the probabilistic weights of Sect. 3.3, but assigns identities on a per-frame basis.
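As a reference, the Static (C) baseline amounts to a single Hungarian matching per frame; a minimal sketch (our own, assuming corner-format boxes and tag positions already projected into image space) is given below.

```python
# Minimal sketch (our own illustration) of the Static (C) baseline: per-frame Hungarian
# matching between detection centroids and the projected RFID tag positions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def static_c(det_boxes, rfid_xy):
    """det_boxes: (N, 4) [x1, y1, x2, y2]; rfid_xy: (J, 2) projected tag positions.
    Returns a dict mapping animal index -> detection index."""
    centroids = np.stack([(det_boxes[:, 0] + det_boxes[:, 2]) / 2,
                          (det_boxes[:, 1] + det_boxes[:, 3]) / 2], axis=1)
    cost = np.linalg.norm(centroids[:, None, :] - rfid_xy[None, :, :], axis=-1)
    det_idx, animal_idx = linear_sum_assignment(cost)      # minimise total distance
    return {int(j): int(i) for i, j in zip(det_idx, animal_idx)}
```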

Our problem is quite unique for the reasons enumerated in Sect. 2, and hence, off-the-shelf solutions do not typically apply. Most systems are either designed for singly housed mice [54, 55] or leverage different fur-colours/external markings [27, 56] to identify the mice. We ran some experiments with idtracker.ai [35], but it yielded no usable tracks—the poor lighting of the cage, coupled with the side-view camera, interfered with the background-subtraction method employed. Bourached et al. [57] showed that the widely used DeepLabCut [36] framework does not work on data from the Harwell lab, possibly because of the side-view causing extreme occlusion of key body parts. The newer version of DeepLabCut [37] supports multi-animal tracking, but for identification it requires supervision with annotated samples (full keypoints) for each new cage. Such a level of supervision would be highly onerous and impractical. In contrast, our method can be applied to new cages without retraining, as long as the visual configuration is similar (and indeed, we test it on cages it has not been trained on). As a proof of concept, we explored annotating body parts for one of the cages, but found it impossible to identify reproducible body parts due to the frequent occlusions. For this reason, we cannot report any results for any of the above methods.

It should be emphasised that there is a misalignment of goals when it comes to comparing our method with state-of-the-art MOT trackers—no matter the quality of the tracker, there is still the need for an identification stage, and hence, the comparison degenerates to optimising over tracker architectures (which is outside the scope of our research). For example, while DeepLabCut does support multi-animal scenarios, it requires a consistent identification across videos (possibly using visual markings) to support identifying individuals; otherwise, it generates simple tracklets.

However, by way of exploration, we can envision having a human-in-the-loop that seeds the track (if the tracker supports this) and provides the identifying information. To simulate this, we took each three-minute snippet, extracted the ground-truth annotation at the middle frame, and initialised a SiamMask [8] tracker per-mouse, which we ran forwards and backwards to either end of the snippet. We used this architecture as it is designed to be a general purpose tracker with no need for pre-training on new data, and hence was readily usable with the annotations we already had.

5.3 Evaluation

The standard MOT metrics (Multi-Object Tracking Precision (MOTP) and Multi-Object Tracking Accuracy (MOTA)) [58] do not sufficiently capture the needs of our problem, in that (a) we have a fixed set of objects we need to track (making the problem better defined), but (b) we also care about the absolute identity (i.e. permutations are not equivalent). While there is a similarity with object detection [59, 60] (using identity for object class), we know a priori that we cannot have more than one detection per ‘identity’ in each frame, and thus we modify existing metrics to suit our constraints.

Overall performance

For each (annotated) frame f and animal j, we define a ground-truth BB, \(GT_j^{[f]}\)—this is null when the animal is hidden. The identifier itself outputs a single hypothesis \(\smash {BB_j^{[f]}}\) for animal j, which can also be null. Subsequently, the ‘Overall Accuracy’, \(A_{O}\), is the ratio of ground-truths that are correctly identified: i.e. where (a) the object is visible and the IoU between it and the identified BB is above a threshold, or (b) when the object is hidden and the identifier outputs null. Mathematically, this is:

$$\begin{aligned} A_{O} = \frac{1}{FJ}\sum _{f=1}^{F}\sum _{j=1}^{J} \Delta \left( BB_{j}^{[f]}, GT_{j}^{[f]}\right) , \end{aligned}$$
(8)

where

$$\begin{aligned} \Delta \left( B, G\right) \equiv {\left\{ \begin{array}{ll} 1 &{} \text {if } B = \emptyset , G = \emptyset , \\ 1 &{} \text {if } IoU\left( B, G\right) > \phi , \\ 0 &{} \text {otherwise} . \end{array}\right. } \end{aligned}$$
(9)

In the above, we assume that the IoU is 0 if either BB is null: we use an IoU threshold, \(\phi \), of 0.5 for normal objects, and 0.3 for annotations marked as Difficult.

A separate score, the ‘Overall IoU’, captures the average overlap between correct assignments, and is defined only for objects which are visible:

$$\begin{aligned} \text {IoU}_O = \frac{1}{\sum _{f=1}^{F}J^{[f]}}\sum _{f=1}^{F} \sum _{j=1}^{J} IoU\left( BB_j^{[f]}, GT_j^{[f]}\right) , \end{aligned}$$
(10)

where \(J^{[f]}\) represents the number of ground-truth visible mice in frame f: again, we assume that the IoU is 0 if any BB is null.

We also report some ‘binary’ metrics. The false negative rate (FNR) is given by the average number of times a visible object is not assigned a BB by the identifier:

$$\begin{aligned} \text {FNR}_{O} = \frac{1}{\sum _F J^{[f]}} \sum _{f=1}^{F} \sum _{j=1}^{J}\left( BB_j^{[f]} = \emptyset \wedge GT_j^{[f]} \ne \emptyset \right) , \end{aligned}$$
(11)

where \(\wedge \) is the logical and. Note that this metric does not consider whether the predicted BB is correct: this is handled by the ‘Uncovered Rate’, which augments the FNR\(_O\) by measuring the average number of times a visible object is assigned a BB that does not ‘cover’ it:

$$\begin{aligned} \text {U}_{O}&= \frac{1}{\sum _F J^{[f]}} \sum \limits _{f=1}^{F} \sum \limits _{j=1}^{J}\nonumber \\&\quad \left( BB_j^{[f]} \ne \emptyset \wedge GT_j^{[f]} \ne \emptyset \wedge IoU \left( BB_j^{[f]}, GT_j^{[f]}\right) < \phi \right) . \end{aligned}$$
(12)

Finally, the false positive rate (FPR) counts the (average) number of times a hidden object is wrongly assigned a BB:

$$\begin{aligned} \text {FPR}_{O} = \frac{1}{\sum _{f=1}^{F} \overline{J^{[f]}}} \sum _{f=1}^{F}\sum _{j=1}^{J} \left( BB_j^{[f]} \ne \emptyset \wedge GT_j^{[f]} = \emptyset \right) , \end{aligned}$$
(13)

where \(\overline{J^{[f]}}\) represents the number of animals hidden at frame f.
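A compact sketch of how these overall metrics could be computed from per-frame predictions and ground-truths is given below (our own illustration, not the released evaluation code); predictions and ground-truths are assumed to be per-animal boxes, with None marking a hidden/unassigned animal.

```python
# Minimal sketch (our own illustration) of the overall metrics of Eqs. (8)-(13).
# frames: list of (pred, gt, difficult), where pred/gt map animal id -> [x1, y1, x2, y2] or None,
# and difficult is the set of animal ids flagged as Difficult in that frame.
def iou(b, g):
    if b is None or g is None:
        return 0.0
    ix1, iy1 = max(b[0], g[0]), max(b[1], g[1])
    ix2, iy2 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = (b[2] - b[0]) * (b[3] - b[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union if union > 0 else 0.0

def overall_metrics(frames, animals, phi=0.5, phi_diff=0.3):
    correct = fn = unc = fp = visible = hidden = 0
    for pred, gt, difficult in frames:
        for j in animals:
            b, g, thr = pred.get(j), gt.get(j), (phi_diff if j in difficult else phi)
            if g is None:                                      # animal hidden in ground-truth
                hidden += 1
                fp += (b is not None)                          # Eq. (13)
                correct += (b is None)                         # Eq. (9), first case
            else:                                              # animal visible
                visible += 1
                fn += (b is None)                              # Eq. (11)
                unc += (b is not None and iou(b, g) < thr)     # Eq. (12)
                correct += (iou(b, g) > thr)                   # Eq. (9), second case
    return {"A_O": correct / (visible + hidden), "FNR_O": fn / visible,
            "U_O": unc / visible, "FPR_O": fp / hidden}
```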

Performance conditioned on detections

The above metrics conflate the localisation of the BB (i.e. performance of the detector) with the identification error, which is what we actually seek to optimise. We thus define additional metrics conditioned on an oracle assignment of detections to ground-truth. Under the oracle, each candidate detection i is assigned to the ‘closest’ (by IoU) ground-truth annotation j using the Hungarian algorithm with a threshold of 0.5: this is relaxed to 0.3 for annotations marked as Difficult. The identity of the detection \(O_{i}^{[f]}\) is then that of the assigned ground-truth, or null if not assigned. Given the identity \(ID_{i}^{[f]}\) assigned by our identifier to BB i, we define the Accuracy given Detections as the fraction of detections which are given the correct (same as oracle) assignment by the identifier (including null when required):

$$\begin{aligned} A_{GD} = \frac{1}{\sum _{f=1}^F I^{[f]}}\sum _{f=1}^{F} \sum _{i=1}^{I^{[f]}} \delta \left( ID_{i}^{[f]}, O_{i}^{[f]}\right) , \end{aligned}$$
(14)

where \(I^{[f]}\) is the number of detections in Frame f, and \(\delta \) is the Kronecker delta (i.e. \(\delta (a,b) = 1\) iff \(a = b\) and 0 otherwise).
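The oracle assignment itself is a standard maximum-IoU bipartite matching; a minimal sketch (our own, using SciPy's Hungarian solver) is shown below.

```python
# Minimal sketch (our own illustration) of the oracle used for the given-detections metrics:
# detections are matched to ground-truths by maximum IoU (Hungarian algorithm), subject to
# the per-annotation threshold (0.5, or 0.3 if marked Difficult).
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(b, g):
    ix1, iy1 = max(b[0], g[0]), max(b[1], g[1])
    ix2, iy2 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = (b[2] - b[0]) * (b[3] - b[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union if union > 0 else 0.0

def oracle_identities(det_boxes, gt_boxes, gt_ids, thresholds):
    """det_boxes: (N, 4); gt_boxes: (M, 4); gt_ids: length-M identities;
    thresholds: length-M IoU cut-offs. Returns a length-N list (None = unassigned)."""
    ious = np.array([[box_iou(d, g) for g in gt_boxes] for d in det_boxes])
    rows, cols = linear_sum_assignment(-ious)          # maximise total IoU
    oracle = [None] * len(det_boxes)
    for r, c in zip(rows, cols):
        if ious[r, c] > thresholds[c]:
            oracle[r] = gt_ids[c]
    return oracle
```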

We also report the rate of Mis-Identifications (i.e. a wrong identity is assigned): these are BBs which are given an identity by the oracle and a different identity by the Identifier.

$$\begin{aligned}&\text {MisID}_{GD} \nonumber \\&\quad = \frac{\sum _{f=1}^{F}\sum _{i=1}^{I^{[f]}} \left( ID_{i}^{[f]} \ne \emptyset \wedge O_{i}^{[f]} \ne \emptyset \wedge ID_{i}^{[f]} \notin O_{i}^{[f]}\right) }{\sum _{f=1}^{F}\sum _{i=1}^{I^{[f]}} \left( O_{i}^{[f]} \ne \emptyset \right) } \end{aligned}$$
(15)

For the purpose of this metric, null assignments are not counted as erroneous, and we normalise the count by the number of non-null oracle assignments.

Always relative to the oracle, we also quote the FPR and FNR. The FPR is now defined as the fraction of BBs with a null oracle assignment that are assigned an identity:

$$\begin{aligned} \text {FPR}_{GD} = \frac{\sum _{f=1}^{F}\sum _{i=1}^{I^{[f]}} \left( ID_{i}^{[f]} \ne \emptyset \wedge O_{i}^{[f]} = \emptyset \right) }{\sum _{f=1}^{F}\sum _{i=1}^{I^{[f]}} \left( O_{i}^{[f]} = \emptyset \right) } , \end{aligned}$$
(16)

and conversely, the fraction of BBs with a non-null oracle assignment that are not identified defines the FNR:

$$\begin{aligned} \text {FNR}_{GD} = \frac{\sum _{f=1}^{F}\sum _{i=1}^{I^{[f]}} \left( ID_{i}^{[f]} = \emptyset \wedge O_{i}^{[f]} \ne \emptyset \right) }{\sum _{f=1}^{F}\sum _{i=1}^{I^{[f]}} \left( O_{i}^{[f]} \ne \emptyset \right) } . \end{aligned}$$
(17)

In fact, if we consider raw counts (rather than ratios), the sum of Mis-Identifications, FPs and FNs constitutes all the errors the system makes.

Evaluation data-size

Table 1 summarises the size of each of the normalisers (within each dataset) that are used to compute the above metrics: the symbols used are the same as those used in the normalisers of Eqs. (8)–(17).

Table 1 Number of samples for evaluating the models according to the Overall and Given Detections metrics in the Tuning (Train + Validation) and Test-sets, respectively

5.4 Results

We comment on the quantitative performance of our approach. We also make available a sample of video clips showing our framework in action at https://github.com/michael-camilleri/TIDe: we comment on these in the Supplementary Material (see Sect. D).

Performance on the test-set

Table 2 shows the results of our model and comparative architectures evaluated on the held-out test-set (the number of samples in each case appears in Table 1), where our scheme clearly outperforms all other methods on all metrics apart from FNR\(_{O}\). Table 3, on the other hand, shows the raw counts of the same metrics as Table 2. Looking first at the left side of the table, the overall accuracy is quite high at 77%. To get a feel for what this means, consider that we are limited by the recall of the detector. In fact, the oracle, which represents an upper bound (given the detector), scores 94%. The FPR seems high, but in reality, the number of hidden samples (see Table 3) is only 85 (4%).

Table 2 Comparative results on the test-set for the Overall Accuracy, and Accuracy given Detections

The FNR\(_O\) is the only overall metric in which our method suffers, mostly because it prefers not to assign a BB rather than assign the wrong one: this is evidenced instead by the lower Uncovered rate. Note that in moving from the baseline through to our final method, the performance strictly improves each time for all other metrics. Indeed, a big jump is achieved simply by the use of the weight model. On the other hand, the performance of the SiamMask [8] architecture is comparable to the Static (C) baseline. It should be noted that such a method would require a level of ground-truth annotation per video to jump-start the process (i.e. a human-in-the-loop), but could then have been a viable baseline given that it requires no fine-tuning, while our detector needed to be trained using annotations. However, its performance is inadequate, mostly because (a) it is not tailored to tracking mice and (b) it cannot reason about occlusion, leading to high levels of false positives.

The contrast between methods is starker when it comes to the analysis given detections (SiamMask cannot be compared here because it does not utilise detections). This is because in the overall metrics the performance of the detector acts as an equaliser between alternatives, while \(A_{GD}\) teases out the capability of the identification system from the quality of the detector. The FPR\(_{GD}\) is lower here, even though, from the point of view of detections, there are more of them that should be null (1563 or 45%): this illustrates the ability of the identification system to reject spurious detections. Note that the number of background BBs (impacting the FPR\(_{GD}\)) is higher. We also investigated outlier models which were fit explicitly to the outliers in our data, but validation-set evaluation yielded a drop in performance.

Table 3 Counts of performance metrics on the test-set (counterpart to Table 1)

Example of successful identifications

Figure 5 shows a visualisation of the identification using our method (top) and the baseline/ablation models (below) for three sample frames. Panel (a) shows an occlusion scenario: the Blue mouse is entirely hidden by the Red one. This throws off the Static (C) method, to the extent that it classifies all mice incorrectly and picks up the tunnel as the Green mouse. The observation model in the Static (P) ablation is able to reject this spurious tunnel detection, but is confused by the other two detections, with the Red mouse marked Blue and the Green mouse identified as Red. Our method is able to use the temporal context to reason about this occlusion and correctly identifies Red/Green and rejects Blue as Hidden. A similar phenomenon is manifested in panel (c), where the temporal context present in our method (OURS) helps to correctly make out Red/Blue while the baseline/ablation get them mixed up: Static (P), in particular, uses the wrong BB altogether (which covers half a mouse).

Fig. 5

Examples of successful identifications for OUR method (cropped to appropriate regions and brightened with CLAHE for clarity). Mouse pickups are marked by crosses of the appropriate colour. In each panel, a–c, the ground-truth (GT) and identifications according to our model (OURS) appear in the top row, while the bottom part shows the identifications due to the centroid-distance (S (C)) and static probabilistic (S (P)) models

Moving on to panel (b), this illustrates a particularly challenging scenario due to the three mice being in very close proximity under the hopper. In this case, annotation was only possible after repeated observations and following through of the RFID traces: indeed, while Green and Red are somewhat visible, Blue is almost fully occluded and is annotated as Difficult. Most significantly, however, the RFID pickups are shifted, although in the correct relative arrangement. The baseline/ablation methods completely miss the Green mouse due to this, and while Blue is assigned a correct BB, Red is confused as Green, with the Red BB covering a spurious detection. Despite this, our method correctly identifies all three mice, even if the BBs for Red and Blue do not perfectly cover the extent of the respective mice.

Sample failure cases

Despite outperforming the competition, our model is not perfect. We show some such failure cases in Fig. 6. Focusing first on panel (a), the switched identification between the Red and Blue mice happens due to a lag in the RFID pickup. In this case, Blue has climbed on the tunnel, but this is too high to be picked up by the baseplate, and hence its position remains at its original location, occupying the same antenna space as Red. When this happens, symmetry is broken solely by the temporal tracking, which appears to have failed in this case.

Fig. 6

Visualisation of failure cases illustrating a switched identification, b false negative and c false positive cases. The markup follows the arrangement in Fig. 5

Panel (b) shows a challenging huddling scenario, with the mice in close proximity: indeed, the Green mouse is barely visible, as it is hidden by the Red mouse and the hopper. This produces a false negative: our method is unable to pick up the Green mouse, classifying it as Hidden, although it correctly identifies Red and Blue. A similar scenario appears in panel (c), although this time, the method also causes a false positive. In (c), the Red mouse is completely hidden behind the Blue/Green mice. Our method picks up Green correctly, but assigns Red’s BB to Blue, effectively incurring two errors: a false positive (Blue should be occluded) and a false negative (Red is not picked up).

The last example also illustrates the interdependent nature of the errors, a side-effect of the constraint that there is at most one mouse of each ‘colour’. This justifies our use of specific metrics to mitigate this overlap: note how our definitions of Uncovered (Eq. 12) and Misidentification (Eq. 15) explicitly discount such instances, as they would have already been captured by false negatives/positives.

6 Conclusions

We have proposed an ILP formulation for identifying visually similar animals in a crowded cage using weak localisation information. Using our novel probabilistic model of object detections, coupled with the tracker-based combinatorial identification, we are able to correctly identify the mice 77% of the time, even under very challenging conditions. Our approach is also applicable to other scenarios which require fusing external sources of identification with video data. We are currently looking towards extending the weight model to take into account the orientation of the animals, as this would allow us to reason about the relative position of the BB with respect to the RFID pickup, as well as modelling the tunnel which can occlude the mice.

Supplementary Material. We make available the curated version of the dataset and code at https://github.com/michael-camilleri/TIDe, with instructions on how to use it. We provide two versions of the data: a detections-only subset to train and evaluate the detection component, as well as an identification subset. In both cases, we provide ground-truth annotations. For detections, we provide the pre-extracted frames. For identification, in addition to the raw video and position pickups, we also release the BBs as generated by our trained FCOS detector, allowing researchers to evaluate the methods without the need to re-train a detector. Further details about the repository are provided in Appendix D.