1 Introduction

Pose estimation is a mandatory prerequisite in multiple disciplines, where each case demands the selection of a suitable tracking method. GNSS (e.g., GPS), for example, is suitable for determining the position of vehicles that have a direct line of sight to multiple satellites, but this method cannot be used underwater due to the attenuation of the signals. Instead, acoustic methods or optical markers can be used here (Kinsey et al. 2006). Alternatively, the absolute pose above the water surface can be determined by GNSS and the subsequent relative movements underwater can be measured with an inertial measurement unit (IMU) or the speed relative to the ground with Doppler velocity logs (DVLs). However, these methods can lead to a continuous drift of the pose. At the same time, the placement of optical markers or other instrumentation at underwater field sites is laborious and can also suffer from challenging environmental conditions, such as storms, turbid water, currents and tides, growth of algae, or biofouling. For developing and improving robust visual underwater localization techniques (Zhang et al. 2022) in challenging scenarios (Köser and Frese 2020) such as the deep sea (no sunlight) or turbid waters (coast, harbors), it is difficult to obtain ground truth poses that would allow for the evaluation of underwater computer vision approaches.

To robustify and improve such approaches, they should be validated under various test conditions (different water types, environmental scenarios, visual structures, and illuminations). As already argued in Nakath et al. (2022), underwater validation scenarios severely suffer from the lack of exactly known conditions, in geometric as well as radiometric terms. As an important step to overcome the geometric shortcomings, we suggest using a water tank (in our case \(2.2 \text { m} \times 1 \text { m}\) with a depth of 0.8 m), in which a miniaturized underwater scene is set up that can be seen as a 1:10 model of the real world. This scene is intended to be used for developing and assessing visual mapping approaches. To this end, dye and scattering agents are added to the water in the tank to mimic visual impairments known to occur in underwater scenarios (see, e.g., Song et al. (2022) for a comprehensive discussion). However, the resultant extremely limited visibility makes it difficult to determine ground truth camera poses underwater using optical markers (Herrmann et al. 2022). On top of that, adding optical markers would also create unrealistic structures, potentially biasing an evaluation. Our approach is therefore to implement an external reference system that can determine ground truth camera poses independently of the water conditions. An optical tracking system is used for this, which determines absolute poses above the water surface and applies a rigid transformation and sensor fusion to obtain the camera pose underwater in real time.

An HTC Vive Pro is chosen as the tracking system. This consumer virtual reality (VR) system (Yu 2011) is significantly cheaper than comparable optical motion capture systems, such as VICON or OptiTrack. Another advantage of the HTC Vive is the easy integration of the tracking algorithms into the robot operating system (ROS) (Quigley et al. 2009) using the OpenVR interface. The ground truth controller poses are thus available in ROS, so that the rigid transformation to the camera pose can be determined for each controller with a Hand–Eye calibration approach. To reduce the tracking and transformation errors, two controllers are used on the upper end of the camera stick. This results in two pose estimates for the camera, which are merged using an Unscented Kalman Filter (UKF) (Julier and Uhlmann 1997; Wan and Van Der Merwe 2000). Finally, a custom-made underwater camera-light system, which can additionally be equipped with IMUs, is mounted at the end of the stick to record underwater datasets with ground truth poses; see Fig. 1.

Fig. 1
figure 1

Application in a deep sea AUV scenario: from left to right: the whole system with the external reference system, underwater camera, and artificial lights; an overview of the scene, without external light sources; and finally a view through the underwater camera

In this work, the tracking quality of the developed system is quantitatively evaluated. For this purpose, the tracking and the Hand–Eye calibration are analyzed first. Subsequently, the precision of the real-time camera pose estimation is validated: an optical target is first measured as a fixed point in space, and the camera pose from the external tracking system is then compared with a camera pose determined optically under good visibility conditions.

2 Related Work

2.1 HTC Vive Analysis

Using the HTC Vive as a cost-effective tracking system in the field of robotics and computer vision is widespread; see, e.g., Wang et al. (2020); Ayyalasomayajula et al. (2020); Borges et al. (2018). In the latter paper, the tracking accuracy of first-generation HTC Vive devices is analyzed using the open-source library Libdeepdive (Symington 2018) in combination with a subsequent optimization step. Trackers in different orientations are used together with two lighthouses. Restricted to a 2D plane, a mean standard deviation of 1.18 mm and 0.45\(^{\circ }\) is achieved for the tracker poses across the data sets. An analysis of the accuracy of the HTC Vive headset tracking is performed in Niehorster et al. (2017). However, since the headset is too big and bulky for our experiments, there are no comparable values for our target application. Finally, Bauer et al. (2021) provide an in-depth pose estimation accuracy evaluation of the second generation of the Vive. However, these results are not directly comparable to our approach, as we are interested in the fused estimate of a pose with a significant offset.

2.2 Hand–Eye Calibration

Hand–Eye calibration describes a technique for calculating the 3D position and orientation of a camera relative to a robot manipulator. In this work, it is used to determine the rigid transformation between the controllers and the underwater camera on the camera stick. Various approaches can be found in the literature (see, e.g., Enebuse et al. (2021)). The classic approach is from Tsai and Lenz (1989), in which the pose of the hand is known from the robot or an external tracking system. A calibration target is used, which is placed firmly in the room. Then, a data set of hand poses and of transformations from the camera to the calibration target is recorded. From this, a linear system of equations can be set up and the rigid transformation between hand and camera can be determined.
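For illustration, the sketch below shows this classic approach with OpenCV's hand–eye routine (a minimal sketch, not the implementation used in this work; it assumes cv2.calibrateHandEye from OpenCV ≥ 4.1, and the pose lists are placeholders that would come from the tracking system and the marker detection, respectively):

```python
# Minimal sketch of the classic Tsai-Lenz Hand-Eye calibration via OpenCV.
# Inputs: rotation/translation lists of hand ("gripper") poses in the world
# frame and of target poses in the camera frame, gathered simultaneously.
import cv2
import numpy as np

def hand_eye_tsai(R_hand2world, t_hand2world, R_target2cam, t_target2cam):
    """Return the rigid 4x4 camera-to-hand transform estimated by Tsai-Lenz."""
    R_cam2hand, t_cam2hand = cv2.calibrateHandEye(
        R_hand2world, t_hand2world,      # hand (controller) poses from tracking
        R_target2cam, t_target2cam,      # target poses from image processing
        method=cv2.CALIB_HAND_EYE_TSAI)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R_cam2hand, t_cam2hand.ravel()
    return T
```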

An alternative approach does not rely on calibration targets and instead determines the optical data with a structure-from-motion approach. This technique is particularly interesting for areas of application in which no calibration target can be set up. In Schmidt et al. (2005), an example application in the medical field is demonstrated, in which a non-sterile calibration pattern cannot be used. However, the accuracy of this approach is, on average, lower than that of the classical approach, which we hence stick to.

2.3 Underwater Ground Truth Camera Poses

Simulated or heavily controlled environments can often help to investigate parts of a complex problem under known conditions. The idea of hardware-in-the-loop systems is to include real hardware in a simulation system to narrow the simulation-to-reality gap (Bacic 2005).

Robot arms are often used as an external reference system for determining camera poses, as their high repeatability provides a good ground truth reference (Ali et al. 2020). Hence, they are frequently employed in hardware-in-the-loop test facilities for, e.g., spacecraft (Park et al. 2021; Krüger and Theil 2010).

However, a robot arm for the planned application would require a large radius of movement, which increases the acquisition costs significantly. At the same time, the integration into an underwater application results in a high maintenance effort, so we do not pursue this approach.

A GPS-INS fusion-based system is used in Bleier et al. (2019) to estimate the 6DOF pose of a ship which scans the ground with a laser system. However, as we work in an indoor lab environment, we cannot rely on satellites. In Bernal et al. (2017), a VICON tracking system is used in conjunction with immersed trackers and off-the-shelf cameras to capture the motion of a space suit underwater. This approach cannot be employed by us, as we want to operate in extreme visibility conditions, which prohibit easy detection underwater. Two interesting solutions for cross-surface pose determination are presented in Nocerino and Menna (2020): (i) a rod is equipped with a stereo-camera system that captures a submerged as well as an in-air image; (ii) alternatively, a rod can be equipped with markers above and below the surface and then be observed by a single camera. Still, we cannot use such approaches, as we also want to model different light conditions, including deep sea environments with artificial lighting, which demand absolutely no light above the surface.

Comparable work using an external reference system for underwater camera applications is presented in Song et al. (2021). Four optical tracking systems (VICON) are used there to detect poses from infrared sensors mounted at the upper end of a rod. This is a much more expensive setup compared to a consumer VR system. The authors provide no analysis of the accuracy of the motion capture system, but it is claimed to achieve an accuracy down to 0.5 mm in a \(4 \text { m} \times 4 \text { m}\) volume. At the bottom of the rod, an underwater camera with integrated IMU is located. The absolute trajectory deviations of VINS-Fusion (Qin et al. 2019) and ORB-SLAM2 (Mur-Artal and Tardós 2017) are compared to the ground truth camera poses over several data sets. In the case of VINS-Fusion, the deviation is on average 67.63 mm with the IMU and 111.33 mm without its inclusion. ORB-SLAM2 achieves an average deviation of 88.1 mm.

3 System Design

3.1 Hardware Overview

In Fig. 2, the connection between the hardware components is shown as a block diagram. This includes the devices of the HTC Vive, the camera, the calibration target consisting of ArUco markers, and the water tank. The respective poses and the transformations used in 3D space are shown. The poses of the controllers are determined by the tracking system. Using the Hand–Eye calibration result, a rigid transformation from each controller to the camera pose is concatenated. The optimal camera pose is calculated by the UKF by fusing the two noisy measurements derived from the controller poses. Finally, the transformation from the camera to the calibration target can be determined using ArUco marker detection. By placing the calibration target in a corner of the water tank, its pose can be defined in world coordinates by concatenating the inverse transformations. For more information about the different coordinate systems, we refer to Janssen (2009); a minimal sketch of such a transformation chain is given below.
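As a minimal sketch (assuming 4×4 homogeneous matrices and the convention that \(\textbf{TF}_{A2B}\) maps coordinates from frame B into frame A), the chain of Fig. 2 can be written as follows; all variable names are illustrative placeholders:

```python
# Chaining the rigid transforms of Fig. 2 with numpy homogeneous matrices.
import numpy as np

def concat(*Ts):
    """Chain transforms left to right, e.g. T_world2target = T_world2ctrl @ T_ctrl2cam @ T_cam2target."""
    out = np.eye(4)
    for T in Ts:
        out = out @ T
    return out

# T_world2controller : from the VR tracking system
# T_controller2cam   : from the Hand-Eye calibration
# T_cam2target       : from the ArUco marker detection
# T_world2target = concat(T_world2controller, T_controller2cam, T_cam2target)
# The target pose in world coordinates then follows from this chain (inverse
# transforms are obtained with np.linalg.inv where needed).
```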

Fig. 2
figure 2

Overview of system transformations

3.2 Vive Setup

The so-called lighthouses of the Vive system emit light sweeps, which are detected by the photo diodes of the controllers. The horizontal and vertical angle to the lighthouse can be calculated from the measured time of arrival, and thus, the absolute pose can be determined in real time. In addition, IMU measurements are merged into the pose calculation to improve the precision of relative movements. The arrangement of the lighthouses is selected in such a way that the center of the measurement environment with the water tank is optimally located in the tracking area. This also ensures that the user of the system does not stand between the lighthouses' fields of view and the controllers during the measurement. The lighthouses are mounted at a height of 2.3 m above the floor, with about 4.2 m between them. In this setup, we did not observe any problems stemming from reflections from the water surface. The surroundings of the measurement setup and the tracking area of the lighthouses, colored as cones in magenta for the first lighthouse and cyan for the second lighthouse, are shown in Fig. 3.

Fig. 3
figure 3

a System overview around the tank. b The system in a simulation with the tracking area of the lighthouses

3.3 Camera Stick

The camera stick establishes the rigid mechanical connection between the camera (underwater) and the tracking system. The camera’s pose therefore relates to the controllers’ poses by rigid transformations. In Fig. 4, the camera stick is shown with the camera located on the right side, so that it is about 0.9 m below the controllers. Thus, this part can be kept submerged, while the upper elements of the stick with the controllers remain above the water surface. The use of two controllers should make it possible to compensate their individual tracking errors against each other and to achieve an optimized result. Empirical experiments were used to determine the relative positioning in which the controllers exhibit the smallest tracking error. This is achieved by rotating them 180\(^{\circ }\) to each other, orienting them upright, and keeping a distance of about 0.52 m between the origins of the rigid bodies. As shown in Fig. 4, the real environment is visualized in real time in the ROS visualization (RViz).

Fig. 4
figure 4

a Camera stick in the measurement environment. b Camera stick in the real-time visualization using RViz

3.4 Underwater Camera

The characteristics of the Basler Machine Vision camera used are listed in Table 1, while Fig. 5 shows the custom-built underwater camera housing in detail. It comprises the camera in an adjustable mount to enable setting its offset with respect to the dome port, an Arduino to process the data of the IMUs, and a USB hub to combine all data into one cable, which leaves the pressure housing.

We then employ the following calibration scheme to determine the extrinsics and intrinsics of the camera: Initially, we estimate the camera's distortion parameters in air based on a fisheye model with a reprojection error of 0.22 px. Subsequently, we use the approaches presented in She et al. (2019, 2022) to obtain and correct the camera’s displacement with respect to the dome center. This step ensures that refraction effects due to the traversal of the light rays through interfaces between media with different optical densities are avoided. The Hand–Eye calibration (see Sect. 3.7) can now be carried out with the centered camera in air. For the actual underwater application, we again used a fisheye model and reached a reprojection error of 0.55 px on a calibration set taken in a clear underwater setting. The last step and the dome centering are carried out to eliminate all refraction-based effects, such that the underwater sets can be used for pure radiometric problems.
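The in-air intrinsic calibration step can be sketched as follows (a hedged example assuming OpenCV's fisheye module and a checkerboard; pattern size, square size, file paths, and flags are illustrative and not the parameters actually used):

```python
# Sketch: in-air fisheye intrinsic calibration with OpenCV (cv2.fisheye).
import cv2
import glob
import numpy as np

pattern = (9, 6)                                   # inner corners (assumed)
square = 0.02                                      # 20 mm squares (assumed)
objp = np.zeros((1, pattern[0] * pattern[1], 3), np.float64)
objp[0, :, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, img_pts, size = [], [], None
for fname in glob.glob("calib_air/*.png"):
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    ok, corners = cv2.findChessboardCorners(gray, pattern)
    if ok:
        obj_pts.append(objp)
        img_pts.append(corners.reshape(1, -1, 2).astype(np.float64))
        size = gray.shape[::-1]

K, D = np.zeros((3, 3)), np.zeros((4, 1))
rms, K, D, _, _ = cv2.fisheye.calibrate(
    obj_pts, img_pts, size, K, D,
    flags=cv2.fisheye.CALIB_RECOMPUTE_EXTRINSIC | cv2.fisheye.CALIB_FIX_SKEW)
print(f"RMS reprojection error: {rms:.2f} px")     # 0.22 px is reported in air
```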

Table 1 Properties of the Basler camera (Basler 2022a) and lens (Basler 2022b) used
Fig. 5
figure 5

a The camera shown from the front. b The camera shown from the side

3.5 Calibration Target

A calibration target is required for the system calibration and verification. It serves as a fixed point in 3D space and is composed of three ArUco markers, which can be detected by the camera. The pose detection during image processing is improved by calculating a joint board pose from the transformations of the individual markers. Initially, the relative poses of the three markers have been optimized in a least-squares scheme using the Ceres solver (Agarwal et al. 2022). The three markers are assembled in the form of the inside view of a 3D angle, which is shown in Fig. 6.
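A minimal detection sketch is given below (assuming OpenCV's contrib ArUco module with the pre-4.7 API; the dictionary and marker size are illustrative, not the actual target specification):

```python
# Sketch: ArUco marker detection and per-marker pose estimation with OpenCV.
import cv2
import numpy as np

DICTIONARY = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

def detect_target(gray, K, dist, marker_len=0.10):
    """Return (id, rvec, tvec) per detected marker, or None if nothing is found."""
    corners, ids, _ = cv2.aruco.detectMarkers(gray, DICTIONARY)
    if ids is None:
        return None
    # A joint board pose, as used in the text, would additionally require the
    # Ceres-optimized relative marker poses.
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(corners, marker_len, K, dist)
    return list(zip(ids.ravel(), rvecs, tvecs))
```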

Fig. 6
figure 6

a The optical target used for the Hand–Eye calibration. b Detection of the calibration target

3.6 Motion Capture

The HTC Vive Pro system is a room-scale system with six degrees of freedom, defined by the rotation around three axes and the position of objects in 3D space. The measurement principle is outside-in tracking, in which the optical part, i.e., the lighthouses, is placed at fixed spatial locations. Each lighthouse covers a pyramidal tracking volume with an opening angle of 150\(^{\circ }\) in the horizontal and 110\(^{\circ }\) in the vertical direction.

For tracking, the lighthouses emit an infrared light flash over the entire tracking area, which is alternately followed by a horizontal and a vertical infrared light plane to scan the room. This sequence is repeated periodically. The horizontal and vertical planes of light are generated by a rotor, which rotates at 50–55 rotations per second, depending on the channel used. Since the objects in the tracking area require additional information from the lighthouses, an optical transmission method is used to send the Omnidirectional Optical Transmitter (OOTX) data continuously over an infrared channel from each lighthouse. These data are used to uniquely identify the lighthouse and transmit its calibration data to the receivers with the global light flash.

The headset and the controllers of the HTC Vive have infrared sensors that measure the time between the arrival of the global light flash and the next sweeping plane. The angle between the normal vector of the lighthouse and an infrared sensor can be calculated from the time differences for horizontal and vertical measurements, since the rotational speed of the rotor in the lighthouse is known (Borges et al. 2018). Here, the lighthouses can be viewed as cameras, which determine 2D positions in the image with the infrared sensors. This results in the horizontal and vertical direction in which a tracking object is located, starting from the origin of the lighthouse coordinate system.
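The timing-to-angle relation can be sketched in a few lines (rotor frequency and time difference are placeholder values):

```python
# Sketch: converting a sweep time difference into an angle, given the known
# rotor frequency of the lighthouse.
import math

def sweep_angle(dt_seconds, rotor_hz=52.5):
    """Angle (rad) swept by the rotor between the sync flash and the sweep hit."""
    return 2.0 * math.pi * rotor_hz * dt_seconds
```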

Since the 3D positions of the infrared sensors on the headset and the controllers are known from the construction, 2D–3D correspondences result from their positions and the results of the lighthouse measurement (Niehorster et al. 2017). By solving them, the absolute pose determination for headset and controllers can be accomplished. To improve the detection of relative movements and to increase the frequency of pose calculation, the relative movement of the tracking devices is determined by a built-in IMU. The inertial data include linear acceleration and angular velocity, which are integrated into the absolute pose. The deviations caused by the drift in the relative measurements are compensated by the lighthouse measurements.
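A hedged sketch of this 2D–3D solution is shown below: each lighthouse is treated as a pinhole camera with unit focal length, the sweep angles are converted to normalized image coordinates, and a standard PnP solver recovers the device pose (all inputs are placeholders; this is not the Vive driver's actual solver):

```python
# Sketch: pose from 2D-3D correspondences between lighthouse sweep angles and
# the known infrared sensor positions on the tracked device.
import cv2
import numpy as np

def device_pose(sensor_xyz, azimuth, elevation):
    """sensor_xyz: (N,3) sensor positions in the device frame;
    azimuth/elevation: (N,) sweep angles in rad measured by those sensors."""
    img_pts = np.column_stack([np.tan(azimuth), np.tan(elevation)]).astype(np.float64)
    obj_pts = np.asarray(sensor_xyz, dtype=np.float64)
    K = np.eye(3)                       # normalized coordinates, unit focal length
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, None)
    return (rvec, tvec) if ok else None
```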

3.7 Hand–Eye Calibration

The term Hand–Eye calibration is derived from the field of robotics where the camera is called the eye and the joint with its gripper is called the hand. Using a controller as a hand, its pose in the world coordinates is given by the VR tracking system. The calibration deals with the calculation of the relative transformation \(\textbf{TF}_{WM2Cam}\) from a controller to the camera pose (Feuerstein 2009). For the calculation, the transformation \(\textbf{TF}_{World2WM_i}\) from the world origin to the controller in 3D space and the relative transformation \(\textbf{TF}_{Cam_i2Target}\) between the camera and the calibration target are needed at the same time (Tsai and Lenz 1989). The transformation from the camera to the calibration target results from the image processing of the captured images with the known intrinsic camera parameters. The connections between the transformations for the Hand–Eye calibration are shown in Fig. 7.

Fig. 7
figure 7

Overview of the requirements for Hand–Eye calibration

With a known movement, a system of equations can be set up, and with enough samples, an accurate calculation of the calibration can be achieved. The movement must have sufficient degrees of freedom, so that the three-dimensional relationship between the camera and the controllers can be fully determined. To do this, it is necessary to rotate the camera around at least two axes while keeping the focus on a calibration target which is placed in the world system and will not be moved during calibration (Tsai and Lenz 1989). The set also contains the transformations \(\textbf{TF}_{WM_i2WM_{i+1}}\) between two consecutive controller poses and \(\textbf{TF}_{Cam_i2Cam_{i+1}}\) between two consecutive camera poses. The result of the Hand–Eye calibration describes the relative transformation \(\textbf{TF}_{WM2Cam}\) between the origin of the controllers and the lens of the camera. This is calculated across the entire data set and is identical for each data pair due to the rigid transformation between the controllers and camera. Figure 8 explains the principle of calibration.
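Written out, each sample pair yields the classic \(AX = XB\) relation (the exact sense of the transforms depends on the convention chosen; here \(X = \textbf{TF}_{WM2Cam}\)):

$$\begin{aligned} \textbf{TF}_{WM_i2WM_{i+1}} \cdot \textbf{TF}_{WM2Cam} = \textbf{TF}_{WM2Cam} \cdot \textbf{TF}_{Cam_i2Cam_{i+1}}. \end{aligned}$$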

Fig. 8
figure 8

Overview of the transformations for two consecutive data sets of the Hand–Eye calibration

Since two controllers are used on the camera stick, an individual transformation to the camera pose is required for each controller. These two transformations are determined using the same set of images. Each sample is therefore composed of the two controller poses in the world system and the transformation from the camera to the calibration target. To minimize the influence of tracking and image processing deviations on the calibration result, the set size is defined as 50 samples. In addition, the calibration is performed 20 times, so that a set of 20 transformations is available for each required transformation. The set size and number of sets were determined empirically. Subsequently, the result is optimized by linearizing the non-linear pose-optimization problem and solving it in a least-squares fashion with the Ceres solver (Agarwal et al. 2022). Since each of these transformations corresponds to the result of an entire Hand–Eye calibration, the assumption is made that the calibration process has already compensated the outliers. Consequently, all transformations for the Ceres optimization are defined with the same uncertainty.
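For intuition only, a much simpler fusion than the Ceres-based optimization would be to average the 20 equally weighted results directly, e.g., with SciPy's rotation mean (this is an illustrative alternative, not the method used here):

```python
# Sketch: naive averaging of several Hand-Eye results with equal uncertainty.
import numpy as np
from scipy.spatial.transform import Rotation as R

def average_transforms(Ts):
    """Ts: list of 4x4 rigid transforms; returns their (chordal) mean."""
    t_mean = np.mean([T[:3, 3] for T in Ts], axis=0)
    R_mean = R.from_matrix([T[:3, :3] for T in Ts]).mean().as_matrix()
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R_mean, t_mean
    return T
```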

3.8 Unscented Kalman Filter

The Kalman Filter (KF) describes a mathematical model for integrating and fusing successive, noisy measurements of a system in a linear environment with one or multiple sensors. The KF is real-time capable and therefore suitable for tracking. The goal is to merge the measurements from the two controllers, so that the overall tracking error is minimized. The first-order Markov assumption states that all prior information is aggregated in the current state. Hence, only the previously calculated state, a motion model, and current input measurement, as well as their uncertainties are considered for the estimation of the next state. For a detailed description of the KF, refer to Pei et al. (2017).

In this work, the Unscented Kalman Filter (UKF) is used (Julier and Uhlmann 1997; Wan and Van Der Merwe 2000). The UKF has the advantage that it can be used in non-linear systems. In addition, the calculation is a second-order approximation, so that the use of Jacobian matrices is not necessary to set up the target covariance. The UKF is based on statistical techniques and takes deterministic samples of a system. A minimal set of samples around the mean, called sigma points, is used. Using Newton’s laws of motion, the unscented transform is applied to the set of sigma points by the motion model based on the observed velocity, angular velocity, and linear acceleration. Then, the new mean and its covariance are calculated by comparing the prediction with the next measurement. A description of the UKF calculations can be found in Kraft (2003).

A 15-dimensional state vector \(\mathbf {x_{k}}\) is used for the UKF, which according to Eq. 1 contains the position p, rotation q, velocity v, angular velocity av, and linear acceleration a

$$\begin{aligned} \mathbf {x_{k}} = [p_x, p_y, p_z, q_x, q_y, q_z, v_x, v_y, v_z, av_x, av_y, av_z, a_x, a_y, a_z]^T. \end{aligned}$$
(1)

The rotation is represented by a quaternion and only the imaginary components are considered. To learn more about representing rotation in 3D space, we refer to Diebel (2006). The real part can be calculated from the imaginary components after the UKF iteration. The state vector \(\mathbf {x_{k}}\) describes the camera pose and its motion in the world coordinate system.
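A heavily simplified sketch of such a filter is given below, using the filterpy library (assumed available). The quaternion components are treated as plain state variables and the motion model is a crude constant-acceleration approximation; the proper quaternion handling of Kraft (2003) used in this work is omitted, and all noise values and the update rate are illustrative:

```python
# Sketch: UKF over the 15-dim state of Eq. (1) with filterpy.
import numpy as np
from filterpy.kalman import MerweScaledSigmaPoints, UnscentedKalmanFilter

DT = 1.0 / 90.0                               # assumed tracking update rate

def fx(x, dt):
    """Constant-acceleration / constant-angular-velocity motion model."""
    p, q, v, av, a = x[0:3], x[3:6], x[6:9], x[9:12], x[12:15]
    x_new = x.copy()
    x_new[0:3] = p + v * dt + 0.5 * a * dt**2
    x_new[3:6] = q + 0.5 * av * dt            # crude quaternion-rate step
    x_new[6:9] = v + a * dt
    return x_new

def hx(x):
    """Measurement: position plus imaginary quaternion parts of a camera pose."""
    return x[0:6]

points = MerweScaledSigmaPoints(n=15, alpha=1e-3, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=15, dim_z=6, dt=DT, fx=fx, hx=hx, points=points)
ukf.R = np.diag([1e-3] * 3 + [1e-4] * 3)      # measurement noise (illustrative)
ukf.Q = np.eye(15) * 1e-4                     # process noise (illustrative)

# Per cycle: predict once, then update with each controller-derived camera pose.
# ukf.predict(); ukf.update(z_from_controller_1); ukf.update(z_from_controller_2)
```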

To use the UKF camera pose as a later ground truth reference for the captured images, the poses must be synchronized with the image time stamps. The ROS time is used as a common time base, which is time-synchronous for all components in the system. However, the camera used does not allow the time stamp to be set directly at image recording; consequently, it is set within the pylon software on the PC. The image time stamps are defined at the middle of the exposure time. Then, the offset of the UKF camera pose to the recorded image is determined. To do this, rotating movements are carried out over the calibration target, as well as forward and backward movements in the direction of the calibration target. The evaluation of both experiments shows that the error due to the time difference is smaller than the error due to tracking and image processing. Thus, sufficient time synchronization between poses and image time stamps can be assumed.

4 Evaluation

In the evaluation, the various system components are analyzed independently, and the precision of the overall system is determined. Wherever possible, the respective experiment is reproduced with synthetic poses. This procedure offers the advantage that the ground truth is always known for the synthetic poses. A test scenario is used for the simulation, in which the synthetically generated camera stick is moved along the edges of a square with an edge length of about 1 m. In doing so, rotations are performed around all coordinate axes. Zero-mean, normally distributed noise is added to the synthetically generated poses. Its standard deviation is based on the uncertainties observed in the real system. Of course, least-squares optimization procedures will be able to find optimal solutions on such systematic noise patterns, which is a trivial insight. However, synthetic ablation studies are the best we can do to justify each layer of complexity in the absence of ground truth data. On top of that, the noise is applied on the input side of the evaluated modules, while the optimization happens on the output side.
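The perturbation of the synthetic poses can be sketched as follows (a minimal helper assuming SciPy; the standard deviations echo the values used in the following analyses, about 1 mm and 0.1\(^{\circ }\)):

```python
# Sketch: adding zero-mean normal noise to a synthetic 4x4 pose.
import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(0)

def perturb_pose(T, sigma_t=0.001, sigma_deg=0.1):
    """Perturb translation (m) and rotation (deg) per axis with Gaussian noise."""
    T_noisy = T.copy()
    T_noisy[:3, 3] += rng.normal(0.0, sigma_t, 3)
    dR = R.from_euler("xyz", rng.normal(0.0, sigma_deg, 3), degrees=True)
    T_noisy[:3, :3] = dR.as_matrix() @ T[:3, :3]
    return T_noisy
```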

The real system is analyzed using the camera stick under dry conditions. It is the goal to validate the system independently of the sources of interference that arise in an underwater application. These can include, for example, blurred images and poor lighting conditions. To represent the later application realistically, movements of the camera stick are carried out in the empty tank.

The evaluation procedure is based on the transformation representation in Fig. 2. These transformations are each characterized by uncertainties, which are to be analyzed one after the other. For the sake of simplicity and only for the statistical analysis, independence between rotation and translation parts of the investigated transformations is assumed. First, the tracking of the controllers is examined, and then, the Hand–Eye calibration is analyzed. Finally, the precision of the UKF and the precision of the overall system are determined.

4.1 Analysis of VR Tracking System

This section defines the basis for an approximate determination of the precision of the tracking of the controllers. The fundamental problem is that the precision cannot be directly determined from the tracking data, since the ground truth poses of the controllers are unknown. The transformation \(\textbf{TF}_{WM}\) between the two controllers offers an approach for the analysis. This transformation can be regarded as constant due to the rigid construction of the stick. During tracking, the deviation of this transformation from the expected mean transformation can be determined. This describes the combined tracking error of both controllers.

The mean values of the transformations between the controllers \(\overline{TF}_{WM}\) over the datasets, their maximum deviation \(\epsilon _{max}\), and the standard deviation \(\sigma\) are analyzed. Table 2 shows five datasets, each lasting 110–140 s. This shows that the rigid transformation between the two controllers \(\textbf{TF}_{WM}\) is stable up to a few millimeters and fractions of a degree. The average standard deviation \(\overline{\sigma }\) of the transformation over the data sets is 1.12 mm and 0.12\(^{\circ }\). Based on these values, the simulation of the controller poses is parameterized. For dataset 1 in Table 2, Fig. 9 shows the course of the absolute distance and rotation between the controllers over time. For the simulation, the experiment is repeated and deviations of the transformation \(\overline{TF}_{WM}\) are observed in the same order of magnitude. An average maximum deviation \(\overline{\epsilon }_{\max }\) of 4.99 mm and 0.5\(^{\circ }\) as well as a mean standard deviation \(\overline{\sigma }\) of 1.29 mm and 0.13\(^{\circ }\) result over several simulated data sets.

Table 2 Tracking analysis for the real system based on the transformation between the two controllers
Fig. 9
figure 9

Deviation between the controller poses for distance and rotation over time for dataset 1 in Table 2. The gray area represents the 1-\(\sigma\)-standard deviation

4.2 Analysis of Hand–Eye Calibration

To maintain the high precision of the tracking, an exact transformation from the controllers to the camera must be determined during the Hand–Eye calibration. Therefore, Table 3 first shows the effect of the optimization using the simulation. In the simulation, zero-mean, normally distributed noise with standard deviations based on the preceding analysis (1 mm and 0.1\(^{\circ }\)) is added to each coordinate axis of the controller poses. To account for the imperfection of the pose estimation process from images of visual markers, random noise with the same parameters (zero mean, standard deviations of 1 mm and 0.1\(^{\circ }\)) is added to the camera poses computed from images of markers. For each data set, the average rigid transformation between the camera poses \({\overline{\epsilon }}_{TF}\) of the 20 calibrations is compared with the result of the optimization \(\epsilon _{TF_{op}}\). In addition, the average deviation of the individual camera poses from the ground truth pose \({\overline{\epsilon }}_{C}\) is compared with that of the optimized camera poses \(\epsilon _{C1_{op}}\). It is shown that the accuracy of the Hand–Eye calibration result is improved by a factor of approximately 4 on average by the optimization.

Table 3 Hand–Eye calibration analysis based on simulated data

Since the ground truth poses are unknown in the real system, this form of analysis cannot be repeated there. Instead, the camera poses are analyzed after the Hand–Eye calibration. For this purpose, the transformation between the two propagated camera poses \(\textbf{TF}_{\text {Cams}}\) is determined during the tracking. This can then be used to determine the mean deviation \(\overline{TF}_{\text {Cams}}\), as well as the maximum error \(\epsilon _{\max }\) and the standard deviation \(\sigma\) of the transformation. The errors shown in Table 4 therefore contain both the calibration error and the error originating from the tracking system. There is an average deviation between the camera poses \(\overline{TF}_{\text {Cams}}\) of 6.57 mm and 0.29\(^{\circ }\). For dataset 1 in Table 4, Fig. 10 shows the course of the absolute distance and rotation between the single camera poses over time. In the simulation, the experiment was repeated with several data sets; here, the average deviation between the camera poses \(\overline{TF}_{\text {Cams}}\) is 3.53 mm and 0.21\(^{\circ }\). In the real system, the errors are larger due to real effects such as systematic drift and outliers in the tracking. The position of the camera is particularly affected by small errors in the rotation of the controllers. This can easily be explained by the leverage effect stemming from the physical design of the camera stick (see Fig. 4).

Table 4 Camera poses’ analysis for the real system based on the transformation between the Cameras
Fig. 10
figure 10

Deviation between the single camera poses for distance and rotation over time for dataset 1 in Table 4. The gray area represents the 1-\(\sigma\)-standard deviation

4.3 Analysis of UKF-Filtering

To determine the precision of the overall system, the calibration target is first measured in the world system according to Fig. 2 via a chain of transformations. For this purpose, a large set of poses is recorded, and the optimal calibration target pose is then determined using a pose optimization based on the Ceres library (Agarwal et al. 2022). This makes it possible to compare an optically determined camera pose based on the visual markers with the camera pose computed by the proposed tracking system. In addition to the errors from tracking and Hand–Eye calibration, errors from image processing and the time offset between the tracking poses and the image time stamps are also considered. Due to the slow movement speed during the later experiments, a small deviation in synchronization between image time stamp and camera pose time stamp will not lead to a large spatial deviation. As an example, the average movement speed during the experiment is 56.8 mm/s and the average frame rate of the camera is 7.7 Hz. A large synchronization error of 10% of the time interval between two images is assumed for the calculation. This would lead to a systematic position error of 0.74 mm between the true camera pose during image acquisition and the tracked camera pose from the external reference system. However, such a large time offset would have been noticed during the calibration due to the resulting systematic errors. The remaining time error can therefore only be a few milliseconds, and the resulting position error is estimated at a maximum of 0.2 mm. Consequently, time synchronicity is not a major problem for the planned application.
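For reference, the assumed worst-case synchronization error follows directly from the reported movement speed and frame rate:

$$\begin{aligned} \epsilon _{sync} \approx 0.1 \cdot \frac{56.8\ \text {mm/s}}{7.7\ \text {Hz}} \approx 0.74\ \text {mm}. \end{aligned}$$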

Table 5 compares the mean deviations from the optically determined camera pose to the individual camera poses \(\overline{\epsilon }_{C}\) and to the UKF camera pose \(\overline{\epsilon }_{C_{UKF}}\). It is shown that the camera pose determined by the suggested system with the UKF is more precise than the individual camera poses using only one controller. Averaged across all datasets, this results in a mean deviation of 2.51 mm and 0.26\(^{\circ }\) between the UKF-filtered camera pose and the optimized camera pose based on visual markers. For dataset 2 in Table 5, Fig. 11 shows the course of the absolute deviation in distance and rotation between the tracked camera pose and the optically determined camera pose over time. The experiment was repeated in the simulation using a perfect Hand–Eye calibration. The comparison of the UKF camera pose to the ground truth camera pose results in an average deviation of 2.44 mm and 0.09\(^{\circ }\) over several data sets.

Table 5 Camera poses’ analysis for the real system compared to the optical camera pose
Fig. 11
figure 11

Deviation between the optically determined camera pose and the tracked camera pose after the UKF for distance and rotation over time for dataset 2 in Table 5. The gray area represents the 1-\(\sigma\)-standard deviation

5 Discussion

Fig. 12
figure 12

a Trajectory of the underwater experiment and three example positions of captured images. b The corresponding captured images during the experiment

One difficulty in evaluating the system was that the ground-truth poses of controllers and camera are not known in the real world. The simulation uses synthetic poses to show that the developed algorithm performs well. For this purpose, the system is realistically simulated in terms of tracking and error propagation. Using this parameterization, it can be shown that the camera pose has an average deviation of 2.44 mm and 0.09\(^{\circ }\) from ground truth in the simulation.

In the real system, a previously measured optical ground truth reference is used. This system behaves very similar to the simulation in terms of tracking and error propagation, and the result for the camera pose is also in the same order of magnitude with an average deviation of 2.51 mm and 0.26\(^{\circ }\). This proves that the approach developed for the external tracking system and the algorithms used are performing well. Although this error will be scaled in the envisioned application, the tracking should still vastly outperform any pose estimation data captured in the deep sea.

Furthermore, it is shown that the transformation to the lower end of the camera stick only increases the error by a factor of 2–3 due to the leverage effect. This is achieved by optimally fusing the poses resulting from the two individual controllers using the UKF. When replacing the commercially available drivers of the VR tracking system with open-source drivers [Libsurvive (Libsurvive 2022)], the errors increase significantly. These increased errors could possibly be caused by imperfect parameters of the open-source algorithm. However, the closed-source driver has limitations in terms of access to the raw data, the uncertainties, the algorithm, and the precise timing. Therefore, future work will deal with developing an optimizer for the open-source drivers to operate the external reference system with a precision of the same order of magnitude.

Fig. 13
figure 13

Captured underwater dataset. The RAW images are still geometrically distorted but presented in sRGB space for better visibility

6 Conclusion

In this paper, we presented an external tracking system for underwater camera poses. With the HTC Vive, we used an inexpensive consumer VR tracking system. Given the tracking above the water surface and the result of the Hand–Eye calibration, a fused underwater camera pose can be estimated in real time. Each system component is individually analyzed and validated. Using the overall system, an average deviation of the camera pose from tracking to an optically determined camera pose of approximately 3 mm and 0.3\(^{\circ }\) is achieved. With the external reference system, ground truth poses are available for the underwater camera which enables comprehensive underwater experiments.

Future work will explore the possibility of an open-source tracking driver. In addition, two IMUs will be mounted next to the camera. One will be fused into the camera pose estimation, while the other will provide independent measurements, which enables the evaluation of visual-inertial underwater methods.

6.1 Future Application: Record Underwater Validation Datasets

We recorded underwater datasets to prove the applicability of the system to the designated task (see Fig. 12). To this end, we installed three 50 W Wasler daylight (5400 K) lamps with Walimex diffusors above the tank to mimic sunlight with heavy atmospheric scattering. In addition, we added two Ulanzi L2 lite (5500 K) lamps as co-moving light sources (see Fig. 1). We added dye to the water to induce a seawater-like attenuation effect and, in addition, added Maaloxan as a scattering agent.

We now have a sensor-in-the-loop system, which operates in a real scattering medium. Of course, a gap to the real world always remains, e.g., induced by missing swell, algae bloom, marine snow, and the like. However, we trade this for a tracking error so low that it cannot be achieved in the wild.

In Fig. 13, we show (a) in-air imagery and then the same scene under water: with (b) homogeneous (sun) illumination, (c) heterogeneous illumination from co-moving lights, and (d) a mix of sun and artificial illumination. In each scenario, we captured roughly 2000 images using traditional lawn-mower patterns and a free 3D scanning scheme to capture the impact of depth variations on monocular underwater computer vision algorithms. Each of the images comes with a synchronized ground truth 3D pose, which, e.g., was used in Grimaldi et al. (2023) to empirically investigate the difficulties of underwater visual monocular SLAM. Finally, we fixed all parameters of the camera except for the exposure time and recorded RAW imagery. The linear nature of these data enables the development and testing of physically based color correction algorithms (see, e.g., Bryson et al. 2016; Akkaynak and Treibitz 2019; Nakath et al. 2021).

Finally, our marker only features one ArUco marker per dimension. In the future, this should be extended to more markers per dimension to yield an overdetermined system of equations for pose estimation.