Article

Normalized Metadata Generation for Human Retrieval Using Multiple Video Surveillance Cameras

1 Department of Image, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Korea
2 ADAS Camera Team, LG Electronics, 322 Gyeongmyeong-daero, Seo-gu, Incheon 22744, Korea
3 Software Development Team, Convergence R&D Center, LG Innotek, Gyeonggi-do 15588, Korea
* Author to whom correspondence should be addressed.
Sensors 2016, 16(7), 963; https://doi.org/10.3390/s16070963
Submission received: 20 March 2016 / Revised: 16 June 2016 / Accepted: 21 June 2016 / Published: 24 June 2016
(This article belongs to the Special Issue Advances in Multi-Sensor Information Fusion: Theory and Applications)

Abstract

Since it is impossible for surveillance personnel to keep monitoring videos from a multiple camera-based surveillance system, an efficient technique is needed to help recognize important situations by retrieving the metadata of an object-of-interest. In a multiple camera-based surveillance system, an object detected in a camera has a different shape in another camera, which is a critical issue of wide-range, real-time surveillance systems. In order to address the problem, this paper presents an object retrieval method by extracting the normalized metadata of an object-of-interest from multiple, heterogeneous cameras. The proposed metadata generation algorithm consists of three steps: (i) generation of a three-dimensional (3D) human model; (ii) human object-based automatic scene calibration; and (iii) metadata generation. More specifically, an appropriately-generated 3D human model provides the foot-to-head direction information that is used as the input of the automatic calibration of each camera. The normalized object information is used to retrieve an object-of-interest in a wide-range, multiple-camera surveillance system in the form of metadata. Experimental results show that the 3D human model matches the ground truth, and automatic calibration-based normalization of metadata enables a successful retrieval and tracking of a human object in the multiple-camera video surveillance system.


1. Introduction

Multiple camera-based video surveillance systems are producing a huge amount of data every day. In order to retrieve meaningful information from the large data set, normalized metadata should be extracted to identify and track an object-of-interest acquired by multiple, heterogeneous cameras.
Hampapur et al. proposed a real-time video search system using video parsing, metadata descriptors and the corresponding query mechanism [1]. Yuk et al. proposed an object-based video indexing and retrieval system based on object features' similarity using motion segmentation [2]. Hu et al. proposed a video retrieval method for semantic-based surveillance by tracking clusters under a hierarchical framework [3]. Hu's retrieval method works with various queries, such as keyword-based, multiple object and sketch-based queries. Le et al. combined recognized video contents with visual words for surveillance video indexing and retrieval [4]. Ma et al. presented a multiple-trajectory indexing and retrieval system using multilinear algebraic structures in a reduced-dimensional space [5]. Choe et al. proposed a robust retrieval and fast searching method based on a spatio-temporal graph, sub-graph indexing and Hadoop implementation [6]. Thornton et al. extended an existing indexing algorithm in crowded scenes using face-level information [7]. Ge et al. detected and tracked multiple pedestrians using sociological models to generate the trajectory data for video feature indexing [8]. Yun et al. presented a visual surveillance briefing system based on event features, such as object appearances and motion patterns [9]. Geronimo et al. proposed an unsupervised video retrieval system by detecting pedestrian features in various scenes based on human action and appearance [10]. Lai et al. retrieved a desired object using the trajectory and appearance in the input video [11]. The common challenge of existing video indexing and retrieval methods is to summarize infrequent events from a large dataset generated using multiple, heterogeneous cameras. Furthermore, the lack of normalized object information during the search prevents the same objects acquired from different views from being accurately identified.
In order to solve the common problems of existing video retrieval methods, this paper presents a normalized metadata generation method from a very wide-range surveillance system to retrieve an object-of-interest. For automatic scene calibration, a three-dimensional (3D) human model is first generated using multiple ellipsoids. Foot-to-head information from the 3D model is used to estimate the internal and external parameters of the camera. Normalized metadata of the object are generated using the camera parameters of multiple cameras. As a result, the proposed method needs neither a special calibration pattern nor a priori depth measurement. The stored metadata can be retrieved using a query, such as size, color, aspect ratio, moving speed and direction.
This paper is organized as follows. Section 2 describes the 3D human model using multiple ellipsoids. A human model-based automatic calibration algorithm and the corresponding metadata retrieval method are respectively presented in Section 3 and Section 4. Section 5 summarizes the experimental results, and Section 6 concludes the paper.

2. Modeling Human Body Using Three Ellipsoids

A multiple camera-based surveillance system must be able to retrieve the same object in different scenes using an appropriate query. However, non-normalized object information results in retrieval errors. In order to normalize the object information, we estimate the camera parameters using automatic scene calibration and then compute the projection matrix from the estimated parameters. The object detected in the two-dimensional (2D) image is then projected into the 3D world coordinate using the projection matrix to obtain normalized information. Existing camera calibration methods commonly use a special calibration pattern [12], which extracts feature points from a planar pattern board and then estimates the camera parameters using a closed-form solution. However, the special calibration pattern-based algorithm has a limitation because the manual calibration of multiple cameras at the same time is impractical and inaccurate. In order to solve this problem, we present a multiple ellipsoid-based 3D human model using the perspective property of 2D images, and the block diagram of the proposed method is shown in Figure 1.
Let $\mathbf{X}_f = [X_f\ Y_f\ 1]^T$ be the foot position on the ground plane and $\mathbf{x}_f = [x_f\ y_f\ 1]^T$ the corresponding foot position in the image plane, both in homogeneous coordinates. Given $\mathbf{x}_f$, $\mathbf{X}_f$ can be computed using the homography as:

$$\mathbf{X}_f = H^{-1}\mathbf{x}_f \qquad (1)$$

where $H = [\mathbf{p}_1\ \mathbf{p}_2\ \mathbf{p}_3]^T$ is the $3 \times 3$ homography matrix, and $\mathbf{p}_i$, for $i = 1, 2, 3$, are the first three columns of the $3 \times 4$ projection matrix $P$ that is computed by estimating the camera parameters. We then generate the human model with height $h$ on the foot position using three ellipsoids, namely the head $Q_h$, torso $Q_t$ and leg $Q_l$, in the 3D world coordinate. The $4 \times 4$ matrix of each ellipsoid is defined as [13]:
$$Q_k = \begin{bmatrix} \dfrac{1}{R_X^2} & 0 & 0 & -\dfrac{X_c}{R_X^2} \\ 0 & \dfrac{1}{R_Y^2} & 0 & -\dfrac{Y_c}{R_Y^2} \\ 0 & 0 & \dfrac{1}{R_Z^2} & -\dfrac{Z_c}{R_Z^2} \\ -\dfrac{X_c}{R_X^2} & -\dfrac{Y_c}{R_Y^2} & -\dfrac{Z_c}{R_Z^2} & \dfrac{X_c^2}{R_X^2} + \dfrac{Y_c^2}{R_Y^2} + \dfrac{Z_c^2}{R_Z^2} - 1 \end{bmatrix} \qquad (2)$$

where $Q_k$, $k \in \{h, t, l\}$, represents the ellipsoid matrix of the head, torso or leg. $R_X$, $R_Y$ and $R_Z$ respectively represent the radii of the ellipsoid along the $X$, $Y$ and $Z$ axes, and $[X_c\ Y_c\ Z_c]^T$ its center. To fit the model to real humans, we set the average heights of children, juveniles and adults to 100 cm, 140 cm and 180 cm, respectively. The ratio of the head, torso and leg is set to 2:4:4.
Each ellipsoid is back-projected to match a real object in the 2D image. The back-projected $3 \times 3$ ellipse, denoted as $C_k$, under the projection matrix $P$ is defined as:

$$C_k^{-1} = P\, Q_k^{-1} P^T \qquad (3)$$

where $C_k$ represents the ellipse matrix satisfying $\mathbf{u}^T C_k \mathbf{u} = 0$ for image points $\mathbf{u}$ on the ellipse. Figure 2 shows the result of the back-projected multiple ellipsoids at different positions. In each dotted box, the three different ellipsoids have the same height.
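To make the construction of Equations (2) and (3) concrete, the following Python/NumPy sketch builds the three stacked quadrics of a human model and back-projects each one to an image ellipse. The function names, the lateral radii of the ellipsoids and the availability of a $3 \times 4$ projection matrix `P` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ellipsoid_quadric(center, radii):
    """4x4 quadric matrix of an axis-aligned ellipsoid, following Equation (2)."""
    Xc, Yc, Zc = center
    Rx, Ry, Rz = radii
    Q = np.diag([1.0 / Rx**2, 1.0 / Ry**2, 1.0 / Rz**2, 0.0])
    Q[:3, 3] = Q[3, :3] = [-Xc / Rx**2, -Yc / Ry**2, -Zc / Rz**2]
    Q[3, 3] = Xc**2 / Rx**2 + Yc**2 / Ry**2 + Zc**2 / Rz**2 - 1.0
    return Q

def human_model_quadrics(foot_xy, height):
    """Head, torso and leg ellipsoids stacked on the foot position (height ratio 2:4:4).
    The lateral radii below are illustrative choices, not values from the paper."""
    X, Y = foot_xy
    h_head, h_torso, h_leg = 0.2 * height, 0.4 * height, 0.4 * height
    parts = {
        'leg':   ((X, Y, h_leg / 2),                    (0.15 * height, 0.15 * height, h_leg / 2)),
        'torso': ((X, Y, h_leg + h_torso / 2),          (0.18 * height, 0.12 * height, h_torso / 2)),
        'head':  ((X, Y, h_leg + h_torso + h_head / 2), (h_head / 2, h_head / 2, h_head / 2)),
    }
    return {k: ellipsoid_quadric(c, r) for k, (c, r) in parts.items()}

def back_project(Q, P):
    """Equation (3): project a 3D quadric to a 2D conic, C_k^{-1} = P Q_k^{-1} P^T."""
    C_inv = P @ np.linalg.inv(Q) @ P.T
    return np.linalg.inv(C_inv)
```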
The multiple ellipsoid-based human model is generated according to the position and height of an object observed from multiple cameras. The first step of generating the human model is to perform shape matching in the image. To match the shape, the proposed algorithm detects a moving object region by modeling the background using the Gaussian mixture model (GMM) [14] and then normalizes the detected shape. Since the apparent shape differs with the location and size of the object, the normalized shape is represented by a set of boundary points. More specifically, each boundary point is generated where a radial line from the center of gravity meets the outermost boundary of the object. If the angle between adjacent radial lines is $\theta$, the number of boundary points is $N = 360/\theta$. The shapes of an object and the corresponding human model are respectively defined as:

$$B = \{j_1, j_2, \ldots, j_N\}, \quad \text{and} \quad M_i = \{o_1^i, o_2^i, \ldots, o_N^i\} \qquad (4)$$

where $B$ represents the shape of the object, $i \in \{\text{child}, \text{juvenile}, \text{adult}\}$, $M_i$ the shape of the $i$-th human model and $N$ the number of boundary points. In this work, we experimentally used $\theta = 5°$, which results in $N = 72$. The matching error between $B$ and $M_i$ is defined as:

$$e_i = \sum_{l=1}^{N} \left( j_l - o_l^i \right)^2 \qquad (5)$$

As a result, we select, among the three human models (child, juvenile and adult), the ellipsoid-based model with the minimum matching error $e_i$. If the minimum matching error is greater than a threshold $T_e$, the object is classified as nonhuman. If the threshold $T_e$ is too large, nonhuman objects are classified as human; on the other hand, a very small $T_e$ makes human detection fail. For that reason, we chose $T_e = 8$, which gave the best human detection performance in our experiments. The shape matching results of the ellipsoid-based human model appropriately fit real objects, as shown in Figure 3, where moving pedestrians are detected and fitted by the ellipsoid-based human model. The ellipsoid-based fitting fails when a moving object is erroneously detected; however, the remaining correct fitting results compensate for the occasional failure.
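As a rough illustration of the radial shape descriptor and the matching error of Equation (5), the sketch below samples $N = 360/\theta$ radial distances from the center of gravity and compares an object descriptor with a model descriptor. It assumes that a binary foreground mask is already available from the GMM detector and, for simplicity, builds the descriptor from all foreground pixels rather than a cleaned boundary; the function names are illustrative.

```python
import numpy as np

def radial_shape_descriptor(mask, theta_deg=5):
    """Sample the silhouette with N = 360/theta radial boundary distances,
    measured from the center of gravity (a simplified version of the paper's descriptor)."""
    ys, xs = np.nonzero(mask)                      # foreground pixels of the detected blob
    xc, yc = xs.mean(), ys.mean()                  # center of gravity
    angles = np.arctan2(ys - yc, xs - xc)
    radii = np.hypot(xs - xc, ys - yc)
    n_bins = 360 // theta_deg
    bins = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    descriptor = np.zeros(n_bins)
    for b in range(n_bins):                        # outermost point along each radial line
        sel = radii[bins == b]
        descriptor[b] = sel.max() if sel.size else 0.0
    return descriptor

def matching_error(B, M):
    """Equation (5): sum of squared differences between object and model descriptors."""
    return float(np.sum((np.asarray(B) - np.asarray(M)) ** 2))

# Model selection: evaluate matching_error against the child/juvenile/adult descriptors,
# keep the minimum, and reject the object as non-human if that minimum exceeds T_e.
```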

3. Human Model-Based Automatic Scene Calibration

Cameras with different internal and external parameters produce different sizes and velocities in the 2D image plane for the same object in the 3D space. In order to identify the same object in a multiple camera-based surveillance system, detection and tracking should be performed in the 3D world coordinate, which is not affected by camera parameters. Normalized physical information of an object can be extracted in two steps: (i) automatic scene calibration to estimate the projection matrix of a camera [15,16,17]; and (ii) projection of the object into the world coordinate using the projection matrix. The proposed automatic calibration algorithm assumes that the foot-to-head line of a human object is orthogonal to the $XY$ plane and parallel to the $Z$-axis in the world coordinate.
The proposed human model-based automatic scene calibration consists of three steps: (i) extraction of foot and head candidate data to compute foot-to-head homology; (ii) homology estimation using foot-to-head inlier data; and (iii) camera calibration by estimating vanishing points and lines using the foot-to-head homology.

3.1. Foot-To-Head Homology

In Euclidean geometry, two parallel lines never meet. In projective geometry, on the other hand, two parallel lines meet at a point called the vanishing point. A line connecting two vanishing points is called the vanishing line, as shown in Figure 4.
Existing single image-based methods for estimating vanishing points and lines often fail if there are no line components in the background image [18,19]. In order to overcome this limitation of background-based methods, foreground object-based vanishing point detection methods were recently proposed [15,16,17]. Since a typical surveillance camera is installed above the ground and looks down at the scene, the foot-to-head lines of a standing person at various positions on the ground, which is equivalent to the $XY$ plane in the world coordinate, converge to a single point below the ground plane, as shown in Figure 5, where each position of the person is represented by a line segment with the foot point at the bottom and the head point at the top. The extended foot-to-head lines meet at the vertical vanishing point $V_0$ below the ground level. The line connecting the head points of Positions 1 and 2 meets the line connecting the foot points of the same positions at $p_1$. Likewise, $p_2$ is determined by Positions 1 and 3. Based on this observation, three non-collinear positions of the person determine the horizontal vanishing line $V_L$ and the vertical vanishing point $V_0$.
The vanishing line and point are used to estimate the camera projection matrix. More specifically, let $\bar{\mathbf{X}} = [X\ Y\ Z\ 1]^T$ be a point in the homogeneous world coordinate; its projective transformation becomes $\bar{\mathbf{x}} = P\bar{\mathbf{X}}$, where $P$ is the projection matrix. Given $\bar{\mathbf{x}} = [\bar{x}\ \bar{y}\ z]^T$, the corresponding point in the image plane is determined as $x = \bar{x}/z$ and $y = \bar{y}/z$. Since we assume that the $XY$ plane is the ground plane, the foot position in the world coordinate is $\mathbf{X}_f = [X\ Y\ 0]^T$, and the projected foot position is $\bar{\mathbf{x}}_f = H_f \bar{\mathbf{X}}_f$, where $\bar{\mathbf{X}}_f = [X\ Y\ 1]^T$. In the same manner, with the $XY$ plane moved up to the head plane, we have $\bar{\mathbf{x}}_h = H_h \bar{\mathbf{X}}_h$, where both $H_f$ and $H_h$ are $3 \times 3$ matrices. Since a head position is projected onto the corresponding foot position, such that $\bar{\mathbf{X}}_f = \bar{\mathbf{X}}_h$,

$$\bar{\mathbf{x}}_h = H_{hf}\,\bar{\mathbf{x}}_f, \quad \text{and} \quad \bar{\mathbf{x}}_f = H_{fh}\,\bar{\mathbf{x}}_h \qquad (6)$$

where both $H_{hf} = H_h H_f^{-1}$ and $H_{fh} = H_f H_h^{-1}$ are $3 \times 3$ matrices and $H_{hf} = H_{fh}^{-1}$. Given the coordinate of a foot position in the ground plane, the corresponding head position in the image plane can be determined using $H_{hf}$. $H = H_{hf}$ is defined as the foot-to-head homology, and it can be determined by computing the projection matrix $P$ using the vanishing point, the vanishing line and the object height $Z$.
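A minimal sketch of how the homology of Equation (6) can be applied once the two plane homographies are available; how $H_f$ and $H_h$ are obtained from $P$ is left out here, and the function names are illustrative.

```python
import numpy as np

def foot_to_head_homology(H_f, H_h):
    """Given the ground-plane and head-plane homographies H_f and H_h (both 3x3),
    return H_hf = H_h * H_f^{-1}, which maps an imaged foot position to the head position."""
    return H_h @ np.linalg.inv(H_f)

def map_foot_to_head(H_hf, foot_xy):
    """Apply the homology to a foot pixel (x_f, y_f) in homogeneous form."""
    p = H_hf @ np.array([foot_xy[0], foot_xy[1], 1.0])
    return p[:2] / p[2]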

3.2. Automatic Scene Calibration

The automatic scene calibration process consists of three steps: (i) extraction of foot and head inlier data; (ii) estimation of the foot-to-head homology using the extracted inlier data; and (iii) detection of the vanishing line and points. For the first step of the scene calibration, a human object is detected using the Gaussian mixture model. The detected object region goes through a morphological operation for noise-free labeling [20]. The inlier candidates of the foot and head of the labeled object are selected under two conditions: (i) the foot-to-head line should be inside a finite region with respect to the $y$-axis; and (ii) the foot-to-head line should be the major axis of the ellipse that approximates the human object.
In order to obtain the angle, major axis and minor axis of the labeled human object, ellipse fitting is performed. More specifically, the object shape is defined by the external boundary as:
$$S = \left[ \mathbf{s}_1\ \mathbf{s}_2\ \cdots\ \mathbf{s}_N \right]^T \qquad (7)$$
where $\mathbf{s}_i = [x_i\ y_i]^T$, for $i = 1, \ldots, N$, represents the $i$-th boundary point and $N$ the total number of boundary points. Using the second-order central moments [21], the angle of shape $S$ is computed as:

$$\theta = \frac{1}{2}\arctan\!\left(\frac{2\mu_{1,1}}{\mu_{2,0} - \mu_{0,2}}\right) \qquad (8)$$
where:
$$\mu_{p,q} = \sum_{i=1}^{N} (x_i - x_c)^p (y_i - y_c)^q \qquad (9)$$
and:
$$x_c = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \text{and} \quad y_c = \frac{1}{N}\sum_{i=1}^{N} y_i \qquad (10)$$
In order to compute the major and minor axes of the ellipsoid, we first define the minimum and maximum inertial moments respectively as:
$$I_{\min} = \sum_{i=1}^{N} \left[ (x_i - x_c)\cos\theta - (y_i - y_c)\sin\theta \right]^2, \qquad I_{\max} = \sum_{i=1}^{N} \left[ (x_i - x_c)\sin\theta - (y_i - y_c)\cos\theta \right]^2 \qquad (11)$$

The major and minor axes are determined using $I_{\min}$ and $I_{\max}$ as:

$$A_l = \left(\frac{4}{\pi}\right)^{1/4} \left[\frac{I_{\max}^3}{I_{\min}}\right]^{1/8}, \quad \text{and} \quad A_s = \left(\frac{4}{\pi}\right)^{1/4} \left[\frac{I_{\min}^3}{I_{\max}}\right]^{1/8} \qquad (12)$$
The aspect ratio of the object is defined as $r = A_l / A_s$, and a candidate foot and head vector is defined as $\mathbf{c} = [x_f\ y_f\ x_h\ y_h]^T$. $\mathbf{c}$ is computed using $\theta$ as:

$$x_f = (y_{\max} - y_c)\frac{\cos\theta}{\sin\theta} + x_c, \quad y_f = y_{\max}, \qquad x_h = (y_{\min} - y_c)\frac{\cos\theta}{\sin\theta} + x_c, \quad y_h = y_{\min} \qquad (13)$$

where $y_{\max}$ and $y_{\min}$ respectively represent the maximum and minimum of $y_i$, for $i = 1, \ldots, N$.
The set of inlier candidates $C = [\mathbf{c}_1\ \mathbf{c}_2\ \cdots\ \mathbf{c}_L]^T$ is generated from the $\mathbf{c}_i$'s that satisfy four conditions: (i) $r_1 < r < r_2$; (ii) $\theta_1 < \theta < \theta_2$; (iii) there exists $\mathbf{s}_i$ whose distance from $(x_f, y_f)$ is smaller than $d_1$, and $\mathbf{s}_j$ whose distance from $(x_h, y_h)$ is smaller than $d_1$; and (iv) there are no pairs of $\mathbf{c}_i$'s whose distance is smaller than $d_2$. In the first condition, $r_1 = 2$ and $r_2 = 5$ are used, and in the second condition, $\theta_1 = 80°$ and $\theta_2 = 100°$ gave the experimentally best results. In the third and fourth conditions, $d_1 = 3$ and $d_2 = 10$ are respectively used.
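The moment-based fitting of Equations (8)–(13) can be sketched as follows. The sign conventions follow the reconstructed equations above, the helper names are illustrative, and the slope term assumes a near-vertical blob (so that $\sin\theta \neq 0$).

```python
import numpy as np

def fit_blob_axes(points):
    """Moment-based blob orientation and axis lengths (Equations (8)-(12)).
    `points` is an (N, 2) array of boundary pixels [x_i, y_i]."""
    x, y = points[:, 0].astype(float), points[:, 1].astype(float)
    xc, yc = x.mean(), y.mean()
    mu = lambda p, q: np.sum((x - xc) ** p * (y - yc) ** q)         # central moments
    theta = 0.5 * np.arctan2(2.0 * mu(1, 1), mu(2, 0) - mu(0, 2))   # two-argument form of Eq. (8)
    i_min = np.sum(((x - xc) * np.cos(theta) - (y - yc) * np.sin(theta)) ** 2)
    i_max = np.sum(((x - xc) * np.sin(theta) - (y - yc) * np.cos(theta)) ** 2)
    a_l = (4.0 / np.pi) ** 0.25 * (i_max ** 3 / i_min) ** 0.125     # major axis, Eq. (12)
    a_s = (4.0 / np.pi) ** 0.25 * (i_min ** 3 / i_max) ** 0.125     # minor axis, Eq. (12)
    return theta, a_l, a_s, (xc, yc)

def foot_head_candidate(points, theta, xc, yc):
    """Foot/head points where the major-axis line crosses y_max and y_min (Equation (13))."""
    y = points[:, 1].astype(float)
    y_max, y_min = y.max(), y.min()
    slope = np.cos(theta) / np.sin(theta)          # valid for near-vertical blobs
    x_f = (y_max - yc) * slope + xc
    x_h = (y_min - yc) * slope + xc
    return np.array([x_f, y_max, x_h, y_min])
```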
Since the inlier candidate set $C$ still contains outliers, a direct computation of the foot-to-head homology $H$ results in a significant error. To solve this problem, we remove outliers in $C$ using the robust random sample consensus (RANSAC) algorithm [22]. $H$ can be determined using four inlier data since its degree of freedom is eight. Let $\mathbf{a} = [h_{11}\ h_{12}\ h_{13}\ h_{21}\ h_{22}\ h_{23}\ h_{31}\ h_{32}]^T$ be the vector of the first, row-ordered eight components of $H$; then, $\mathbf{a}$ can be determined by solving:

$$\begin{bmatrix} x_f & y_f & 1 & 0 & 0 & 0 & -x_f x_h & -y_f x_h \\ 0 & 0 & 0 & x_f & y_f & 1 & -x_f y_h & -y_f y_h \end{bmatrix} \mathbf{a} = \begin{bmatrix} x_h \\ y_h \end{bmatrix} \qquad (14)$$
Since Equation (14) generates two linear equations per candidate vector, four candidate vectors determine $H$. In order to check how many inlier data support the estimated $H$, the head position of each candidate vector is predicted from the corresponding foot position using $H$. The predicted head position is compared to the real head position, and the candidate vector is considered to support $H$ if the error is sufficiently small. This process repeats a given number of times, and the candidate vectors that support the best $H$ become inliers. The inliers are then stacked into Equation (14); since many inliers generally produce more than eight equations, the vector $\mathbf{a}$, which is equivalent to the matrix $H$, is finally determined using the pseudo-inverse. Although outliers can be generated by occlusion, grouping and non-human objects, correct inlier data can still be obtained as the process repeats and candidate data accumulate.
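A sketch of the RANSAC-based homology estimation around Equation (14); the iteration count, inlier tolerance and function names are illustrative choices rather than the authors' settings.

```python
import numpy as np

def homology_rows(c):
    """Two rows of Equation (14) for one candidate c = [x_f, y_f, x_h, y_h]."""
    xf, yf, xh, yh = c
    A = np.array([[xf, yf, 1, 0, 0, 0, -xf * xh, -yf * xh],
                  [0, 0, 0, xf, yf, 1, -xf * yh, -yf * yh]], dtype=float)
    return A, np.array([xh, yh], dtype=float)

def apply_homology(H, p):
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def estimate_homology(candidates, n_iter=500, tol=3.0, seed=0):
    """RANSAC-style estimation of the foot-to-head homology from an (L, 4) array of
    noisy [x_f, y_f, x_h, y_h] candidate vectors."""
    rng = np.random.default_rng(seed)
    candidates = np.asarray(candidates, dtype=float)

    def solve(subset):
        A = np.vstack([homology_rows(c)[0] for c in subset])
        b = np.concatenate([homology_rows(c)[1] for c in subset])
        a, *_ = np.linalg.lstsq(A, b, rcond=None)   # pseudo-inverse solution
        return np.append(a, 1.0).reshape(3, 3)      # h_33 is fixed to 1

    best_inliers = candidates[:4]
    for _ in range(n_iter):
        sample = candidates[rng.choice(len(candidates), 4, replace=False)]
        H = solve(sample)
        # a candidate supports H if its predicted head position is close to the real one
        errors = np.array([np.linalg.norm(apply_homology(H, c[:2]) - c[2:]) for c in candidates])
        inliers = candidates[errors < tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return solve(best_inliers)                      # refine with all inliers
```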
Given the estimated foot-to-head homology $H$, two arbitrarily chosen foot positions generate the two corresponding head positions. The two lines connecting each foot-head pair meet at the vanishing point. More specifically, a line in the image plane can be represented in homogeneous form by a vector $\mathbf{l} = [a\ b\ c]^T$, which satisfies the linear equation:

$$ax + by + c = 0 \qquad (15)$$

where the line coefficients $\{a, b, c\}$ are determined from two points $\mathbf{p} = [p_x\ p_y]^T$ and $\mathbf{q} = [q_x\ q_y]^T$ as:

$$a = p_y - q_y, \qquad b = -(p_x - q_x), \qquad c = -(p_y - q_y)\,q_x + (p_x - q_x)\,q_y \qquad (16)$$

If two lines $\mathbf{l}_1$ and $\mathbf{l}_2$ meet at the vanishing point $V_0$, the following relationship is satisfied:

$$V_0 = \mathbf{l}_1 \times \mathbf{l}_2 \qquad (17)$$
In order to determine the vanishing line, three candidate vectors $\{\mathbf{c}_1, \mathbf{c}_2, \mathbf{c}_3\}$ are needed. The line connecting the two foot points and the line connecting the two head points of $\mathbf{c}_1$ and $\mathbf{c}_2$ meet at a point, say $\mathbf{r} = [r_x\ r_y]^T$. Likewise, another point $\mathbf{s} = [s_x\ s_y]^T$ is determined using $\mathbf{c}_2$ and $\mathbf{c}_3$. The line connecting the two points $\mathbf{r}$ and $\mathbf{s}$ is the vanishing line $V_L$. Given $V_0$ and $V_L$, the camera parameters can be estimated, as shown in Figure 6.
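The vanishing-point computation of Equations (15)–(17) amounts to two cross products in homogeneous coordinates, for example (a sketch; the helper names are illustrative):

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points (coefficients [a, b, c] of Equations (15)-(16))."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def line_intersection(l1, l2):
    """Intersection of two homogeneous lines (Equation (17)), returned in pixel coordinates."""
    v = np.cross(l1, l2)
    return v[:2] / v[2]

# Example: two foot/head candidates c1, c2 = [x_f, y_f, x_h, y_h] give the vertical
# vanishing point as the intersection of their foot-to-head lines:
# V0 = line_intersection(line_through(c1[:2], c1[2:]), line_through(c2[:2], c2[2:]))
```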

3.3. Camera Parameter Estimation

Internal parameters include the focal length $f$, the principal point $[c_x\ c_y]^T$ and the aspect ratio $a$. Assuming that the principal point is equal to the image center, $a = 1$, and there is no skew, the simplified internal camera parameter matrix is given as:

$$K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (18)$$
External parameters include the panning angle $\alpha$, the tilting angle $\theta$, the rolling angle $\rho$, the camera height $h_c$ with respect to the $Z$-axis and the translations in the $X$ and $Y$ directions. Assuming that $\alpha = 0$ and the $X$ and $Y$ translations are zero, the camera projection matrix is obtained by multiplying the internal and external parameter matrices as:

$$P = K \begin{bmatrix} \cos\rho & -\sin\rho & 0 \\ \sin\rho & \cos\rho & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -h_c \end{bmatrix} \qquad (19)$$
The vertical vanishing point with respect to the $Z$-axis, $\mathbf{v}_0 = [v_x\ v_y\ 1]^T$, provides the following constraint together with a point $[x\ y\ 1]^T$ on the horizontal vanishing line:

$$\mathbf{v}_0^T\, \omega \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = 0 \qquad (20)$$

where $\omega = K^{-T}K^{-1}$ represents the image of the absolute conic (IAC). Substitution of Equation (18) into Equation (20) yields [23]:

$$v_x x + \frac{v_y y}{a^2} + f^2 = 0 \qquad (21)$$
which demonstrates that the horizontal vanishing line can be determined by the vertical vanishing point and the focal length, and that the rotation parameters can be computed from $v_x$, $v_y$ and $f$ as [8]:

$$\rho = \arctan\!\left(\frac{a\,v_x}{v_y}\right), \quad \text{and} \quad \theta = \arctan\!\left(\frac{\sqrt{a^2 v_x^2 + v_y^2}}{a f}\right) \qquad (22)$$

where $a = 1$.
The proposed algorithm can compute $f$, $\rho$ and $\theta$ by estimating the vanishing line and point using Equations (21) and (22). The camera height $h_c$ can then be computed using the real height of an object in the world coordinate, $h_w$, the vanishing line $V_L$ and the vanishing point $V_0$:

$$\frac{h_w}{h_c} = 1 - \frac{d(\mathbf{p}_h, V_L)\, d(\mathbf{p}_f, V_0)}{d(\mathbf{p}_f, V_L)\, d(\mathbf{p}_h, V_0)} \qquad (23)$$

where $\mathbf{p}_f$ and $\mathbf{p}_h$ respectively represent the foot and head positions of the $i$-th object and $d(a, b)$ the distance between $a$ and $b$. In the experiment, $h_w = 180$ cm is used as the reference height.
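Putting Equations (21)–(23) together, the parameter recovery can be sketched as follows. The sketch assumes image coordinates are already shifted so that the principal point is the origin, $a = 1$, distances to the vanishing line are point-to-line distances, and the function names and default reference height (1800 mm, i.e., 180 cm) are illustrative.

```python
import numpy as np

def focal_from_vanishing(vx, vy, x, y, a=1.0):
    """Equation (21): f from the vertical vanishing point (vx, vy) and a point (x, y) on the
    horizontal vanishing line; vx*x + vy*y/a^2 must be negative for a valid geometry."""
    return np.sqrt(-(vx * x + vy * y / a**2))

def rotation_from_vanishing(vx, vy, f, a=1.0):
    """Equation (22): roll and tilt angles from the vertical vanishing point."""
    rho = np.arctan2(a * vx, vy)
    theta = np.arctan2(np.sqrt(a**2 * vx**2 + vy**2), a * f)
    return rho, theta

def camera_height(p_f, p_h, V0, VL, h_w=1800.0):
    """Equation (23): camera height from the foot/head positions, the vertical vanishing
    point V0 = (x, y) and the vanishing line VL = [a, b, c]."""
    def d_point_line(p, l):
        return abs(l[0] * p[0] + l[1] * p[1] + l[2]) / np.hypot(l[0], l[1])

    def d_point_point(p, q):
        return np.hypot(p[0] - q[0], p[1] - q[1])

    ratio = 1.0 - (d_point_line(p_h, VL) * d_point_point(p_f, V0)) / (
        d_point_line(p_f, VL) * d_point_point(p_h, V0))
    return h_w / ratio
```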

4. Indexing of Object Characteristics

After object-based multiple camera calibration, the metadata of an object should be extracted given a query for the normalized object indexing. In this work, queries of an object consist of a representative color in the HSV color space, horizontal and vertical sizes in meters, moving speed in meters per second, the aspect ratio and moving trajectory.

4.1. Extraction of Representative Color

The color temperature of an object may change when a different camera is used. In order to minimize the color variation problem, the proposed work performs color constancy as a pre-processing step to compensate for the white balance of the extracted representative color.

4.1.1. Color Constancy

If we assume that an object is illuminated by a single light source, the estimated color of the light source is given as:

$$\mathbf{e} = \begin{bmatrix} R_e \\ G_e \\ B_e \end{bmatrix} = \int_{\omega} e(\lambda)\, s(\lambda)\, \mathbf{c}(\lambda)\, d\lambda \qquad (24)$$

where $e(\lambda)$ represents the light source, $s(\lambda)$ the reflection ratio of the surface, $\mathbf{c}(\lambda) = [R(\lambda)\ G(\lambda)\ B(\lambda)]^T$ the camera sensitivity function and $\omega$ the visible wavelength spectrum covering the red, green and blue channels.
The proposed color compensation method is based on the shades of gray method [24,25]. The input image is down-sampled to reduce the computational complexity, and simple low-pass filtering is performed to reduce the noise effect. The modified Minkowski norm-based estimate of the light source, which takes local correlation into account, is given as:

$$\left(\frac{\int \left(f_{\sigma}(\mathbf{x})\right)^p d\mathbf{x}}{\int d\mathbf{x}}\right)^{1/p} = k\,\mathbf{e} \qquad (25)$$

where $f(\mathbf{x})$ represents the image defined on $\mathbf{x} = [x\ y]^T$, $f_{\sigma} = f * G_{\sigma}$ the image filtered by the Gaussian filter $G_{\sigma}$, $k$ a multiplicative constant and $p$ the parameter of the Minkowski norm. A small $p$ distributes the weights uniformly across the measured values, whereas a large $p$ emphasizes large values. An appropriate choice of $p$ prevents the estimated light source from being biased toward a specific color channel. In the experiment, $p = 6$ gave the best color compensation across multiple cameras. As a result, the scaling parameters $\{\omega_R, \omega_G, \omega_B\}$ can be determined from the estimated color of the light source, and the corrected color is given as:

$$f_c^{\text{corr}} = \frac{f_c}{\omega_c \sqrt{3}}, \quad \text{for } c \in \{R, G, B\} \qquad (26)$$
Figure 7 shows the results of color correction using three different cameras. Color correction can also minimize the inter-frame color distortion, since it estimates the normalized light source.
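A minimal sketch of the shades-of-gray correction of Equations (25) and (26), assuming SciPy is available for the Gaussian smoothing; the down-sampling step is omitted, and `sigma` is an illustrative smoothing strength rather than a value from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def shades_of_gray_correction(img, p=6, sigma=1.0):
    """Estimate the light source with the smoothed Minkowski norm (Equation (25)) and
    apply the per-channel correction of Equation (26). `img` is an HxWx3 RGB float
    image with values in [0, 1]."""
    smoothed = np.stack([gaussian_filter(img[..., c], sigma) for c in range(3)], axis=-1)
    e = np.power(np.mean(smoothed ** p, axis=(0, 1)), 1.0 / p)   # per-channel estimate
    e = e / np.linalg.norm(e)                                    # unit-norm illuminant color
    corrected = img / (e * np.sqrt(3.0))                         # Equation (26)
    return np.clip(corrected, 0.0, 1.0), e
```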

4.1.2. Representative Color Extraction

The proposed color extraction method uses the K-means clustering algorithm. An input RGB image is transformed to the HSV color space to minimize the inter-channel correlation as:
$$H = \arctan\!\left(\frac{\sqrt{3}\,(G - B)}{(R - G) + (R - B)}\right), \quad S = 1 - \frac{\min(R, G, B)}{V}, \quad V = \frac{R + G + B}{3} \qquad (27)$$

Let $\mathbf{j}_n = [H_n\ S_n\ V_n]^T$ be the HSV color vector of the $n$-th pixel, for $n = 1, \ldots, N$, where $N$ is the total number of pixels in the image. Initial $K$ pixels are arbitrarily chosen to form the set of mean vectors $\{\mathbf{g}_1, \ldots, \mathbf{g}_K\}$, where $\mathbf{g}_i$, for $i = 1, \ldots, K$, represents a selected HSV color vector. Every color vector $\mathbf{j}_n$ is assigned to the cluster $J_i$ of its closest mean vector $\mathbf{g}_i$:

$$J_i = \left\{ \mathbf{j}_n \;\middle|\; d(\mathbf{j}_n, \mathbf{g}_i) \le d(\mathbf{j}_n, \mathbf{g}_b),\ \text{for all } b = 1, \ldots, K \right\} \qquad (28)$$

Each mean vector $\mathbf{g}_i$ is updated by the mean of the $\mathbf{j}_n$'s in the cluster $J_i$, and the entire process repeats until there are no more changes in $\mathbf{g}_i$. Figure 8 shows the results of K-means clustering in the RGB and HSV color spaces with $K = 3$.
The fundamental problem of the K-means clustering algorithm is its dependency on the initial set of clusters, as shown in Figure 9. Since a single run of K-means clustering cannot guarantee extracting the representative colors, candidate colors are generated at each frame while tracking an object, and only the top 25% of the sorted candidate colors are finally selected. As a result, the representative colors of the object are correctly extracted even when a few frames produce erroneous candidates. Figure 10 shows objects with the extracted representative colors.
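A compact sketch of the HSV conversion of Equation (27) and the clustering step of Equation (28); the two-argument arctangent is used for numerical robustness, and $K$, the iteration count and the random seed are illustrative.

```python
import numpy as np

def rgb_to_hsv_paper(rgb):
    """HSV conversion of Equation (27); rgb is an (N, 3) array with values in [0, 1]."""
    R, G, B = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    V = (R + G + B) / 3.0
    S = 1.0 - np.min(rgb, axis=1) / np.maximum(V, 1e-6)
    H = np.arctan2(np.sqrt(3.0) * (G - B), (R - G) + (R - B))
    return np.stack([H, S, V], axis=1)

def kmeans(pixels, k=3, n_iter=20, seed=0):
    """Plain K-means in HSV space (Equation (28)); returns the k mean color vectors."""
    rng = np.random.default_rng(seed)
    means = pixels[rng.choice(len(pixels), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(pixels[:, None] - means[None], axis=2), axis=1)
        means = np.stack([pixels[labels == j].mean(axis=0) if np.any(labels == j)
                          else means[j] for j in range(k)])
    return means
```

In the proposed system, this clustering would be repeated for every frame in which the object is tracked, and only the top 25% of the sorted candidate colors would be kept as the representative colors.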

4.2. Non-Color Metadata: Size, Speed, Aspect Ratio and Trajectory

When multiple cameras are used in a video surveillance system, object size and speed are differently measured by different cameras. In order to extract the normalized metadata of an object, physical object information should be extracted in the world coordinate using accurately-estimated camera parameters.

4.2.1. Normalized Object Size and Speed

We can compute the physical object height in meters, given the projection matrix $P$ and the foot and head coordinates in the image plane. In order to extract the physical information of an object in the world coordinate, the foot position on the ground plane, $\tilde{\mathbf{X}}_f = H^{-1}\tilde{\mathbf{x}}_f$, is first computed using Equation (1). On the other hand, the $y$ coordinate of the top of the object in the image plane is computed as:

$$y = \frac{P_{2,1}X + P_{2,2}Y + P_{2,3}H_o + P_{2,4}}{P_{3,1}X + P_{3,2}Y + P_{3,3}H_o + P_{3,4}} \qquad (29)$$

where $P$ represents the projection matrix and $H_o$ the object height. Using Equation (29), $H_o$ can be computed from $y$ as:

$$H_o = \frac{(P_{2,1} - P_{3,1}\,y)X + (P_{2,2} - P_{3,2}\,y)Y + P_{2,4} - P_{3,4}\,y}{P_{3,3}\,y - P_{2,3}} \qquad (30)$$
The width of an object, $W_o$, is computed as:

$$W_o = \left\| \mathbf{X}_o - \mathbf{X}_o' \right\| \cdot W_i \qquad (31)$$

where $\mathbf{X}_o$ represents the foot position in the world coordinate, $\mathbf{X}_o'$ the world position corresponding to the one-pixel-shifted foot position in the image plane and $W_i$ the object width in pixels in the image plane. Figure 11 shows the results of the normalized object size estimation. As shown in the figure, the estimated object height does not change while the object moves around.
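The height and width computations of Equations (29)–(31) reduce to a few lines once $P$ and the ground-plane homography are known. In the sketch below, zero-based indexing replaces the one-based $P_{i,j}$ of the text, and the function names are illustrative.

```python
import numpy as np

def object_height(P, X, Y, y_head):
    """Solve Equation (29) for the object height H_o (Equation (30)), given the world
    foot position (X, Y, 0) and the image row y_head of the top of the object."""
    num = ((P[1, 0] - P[2, 0] * y_head) * X + (P[1, 1] - P[2, 1] * y_head) * Y
           + P[1, 3] - P[2, 3] * y_head)
    den = P[2, 2] * y_head - P[1, 2]
    return num / den

def object_width(H_inv, foot_px, width_px):
    """Metric width (Equation (31)): world distance spanned by a one-pixel shift of the
    imaged foot position, multiplied by the object width in pixels."""
    def to_world(p):
        w = H_inv @ np.array([p[0], p[1], 1.0])
        return w[:2] / w[2]
    X0 = to_world(foot_px)
    X1 = to_world((foot_px[0] + 1.0, foot_px[1]))
    return np.linalg.norm(X1 - X0) * width_px
```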
The object speed $S_o$ can be computed as:

$$S_o = \sqrt{\left(X_o^t - X_o^{t'}\right)^2 + \left(Y_o^t - Y_o^{t'}\right)^2} \qquad (32)$$

where $(X_o^t, Y_o^t)$ represents the object position in the world coordinate at the $t$-th frame and $(X_o^{t'}, Y_o^{t'})$ the object position one second earlier. However, the direct estimation of $S_o$ from the detected foot position is not robust because of object detection errors. To solve this problem, the Kalman filter is used to compensate for the speed estimation error. Figure 12 shows the result of the object speed estimation with and without the Kalman filter.
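The paper only states that a Kalman filter compensates for the speed estimation error; the sketch below assumes a simple constant-velocity model over the world-plane foot position, which is one common choice. All noise parameters and names are illustrative.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter for smoothing the world-plane foot
    position; the smoothed velocity gives the object speed of Equation (32)."""
    def __init__(self, dt, q=1e-2, r=1.0):
        self.x = np.zeros(4)                       # state: [X, Y, Vx, Vy]
        self.P = np.eye(4) * 1e3                   # large initial uncertainty
        self.F = np.eye(4)
        self.F[0, 2] = dt                          # dt: frame interval in seconds
        self.F[1, 3] = dt
        self.H = np.zeros((2, 4))
        self.H[0, 0] = self.H[1, 1] = 1.0          # only the position is observed
        self.Q = np.eye(4) * q                     # process noise (tuning parameter)
        self.R = np.eye(2) * r                     # measurement noise (tuning parameter)

    def update(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # correct with the measured foot position z = [X, Y]
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return np.hypot(self.x[2], self.x[3])      # smoothed speed in world units per second
```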

4.2.2. Aspect Ratio and Trajectory

The aspect ratio of an object is simply computed as:
$$R_o = H_i / W_i \qquad (33)$$
where H i and W i respectively represent the object height and width in the image plane.
Instead of saving the entire trajectory of an object, the proposed system extracts object information using four positions in the trajectory. The object trajectory is defined as:
$$T_o = \left[ x_o^1, y_o^1, x_o^2, y_o^2, x_o^3, y_o^3, x_o^4, y_o^4 \right]^T \qquad (34)$$

where $[x_o^1\ y_o^1]^T$ is the starting position, $[x_o^2\ y_o^2]^T$ the 1/3 position, $[x_o^3\ y_o^3]^T$ the 2/3 position and $[x_o^4\ y_o^4]^T$ the ending position.

4.3. Unified Model of Metadata

Five types of metadata described in Section 4.1 and Section 4.2 should be unified into a single data model to be saved in the database. Since object data are extracted at each frame, median values of size, aspect ratio and speed data are saved at the frame right before the object disappears. Three representative colors are also extracted using the K-means clustering algorithm with the previously-selected set of colors.
The object metadata model, including object features, serial number and frame information, is shown in Table 1. As shown in the table, duration, moving distance and area size are used to sort various objects. For future extensions, minimum and maximum values of the object features are also saved in the metadata.
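For concreteness, one way to hold a record of the unified metadata model of Table 1 is sketched below; the field names and types are illustrative, and the actual system would store such records in a database.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectMetadata:
    """One record of the unified metadata model of Table 1 (field names are illustrative)."""
    object_id: int                                  # ID: object number
    file_name: str                                  # occurrence video file name
    start_frame: int
    end_frame: int
    duration: int
    trajectory: List[Tuple[float, float]]           # start, 1/3, 2/3 and end positions
    moving_distance: float
    height_mm: Tuple[float, float, float]           # (min, median, max)
    width_mm: Tuple[float, float, float]            # (min, median, max)
    speed_mps: Tuple[float, float, float]           # (min, median, max)
    aspect_ratio: Tuple[float, float, float]        # (min, median, max)
    colors_hsv: List[Tuple[float, float, float]]    # three representative HSV colors
    area_size: Tuple[float, float, float]           # (min, median, max)
```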

5. Experimental Results

This section summarizes the experimental results of the proposed object-based automatic scene calibration and metadata generation algorithms. To evaluate the performance of the scene calibration algorithm, Table 2 summarizes the camera parameters estimated in seven different scenes in which the same object appears. The experiment extracts the normalized physical information of a human object with a height of 175 cm in the various scenes. As shown in Table 2, the camera parameters were estimated and corrected for each scene. Object A appears 67 times in total, and the object height is estimated at every appearance.
Figure 13 shows that the average object height is 182.7 cm with a standard deviation 9.5 cm. Since the real height is 175 cm, the estimation error is 7.5 cm, because the reference height h w was set to 180 cm. This result reveals that the proposed calibration algorithm is suitable to estimate the relative height rather than the absolute value.
Figure 14 shows the experimental results of searching for an object using the color query, including red, green, blue, yellow, orange, purple, pink, brown, white, gray and black. Table 3 summarizes the classification performance using the object color. The rightmost column shows the total number of objects, with the ratio of correctly classified objects in parentheses. The experiment can correctly classify 96.7% of the objects on average.
Figure 15 shows eight test videos with estimated camera parameters. Figure 16 shows the camera calibration results of eight test videos on the virtual ground plane and ellipsoids of a height of 180 cm.
Figure 17 shows the experimental results of the object search using the size query, including children (small), juveniles (medium) and adults (large). Figure 17a shows that the proposed algorithm successfully searched for children smaller than 110 cm, and Figure 17b,c shows similar results for a juvenile and an adult, respectively. Table 4 summarizes the classification performance using the object size. The rightmost column shows the total number of objects, with the ratio of correctly classified objects in parentheses. The experiment can correctly classify 95.4% of the objects on average.
Figure 18 shows the experimental results of the object search using the aspect ratio. The horizontal query is used to find vehicles; the normal query is used to find motorcycles and groups of people; and the vertical query is used to find a single human object. Table 5 summarizes the classification performance using the aspect ratio. The rightmost column shows the total number of objects, with the ratio of correctly classified objects in parentheses. The experiment can correctly classify 96.9% of the objects on average.
Figure 19 shows the experimental results of the object search using the speed queries, including slow, normal and fast. Table 6 summarizes the search results using the object speed with the classification performances. As shown in Table 6, more than 95% of the objects are correctly classified.
Table 3, Table 4, Table 5 and Table 6 show the accuracy and reliability of the proposed algorithm. More specifically, the color-based searching result shows relatively high accuracy with various searching options. For that reason, the object color can be the most important feature for object identification.
Figure 20 shows the experimental results of the object search using user-defined boundaries to detect a moving direction.
Figure 21 shows the results of the proposed algorithm for person re-identification in the wild (PRW) dataset [26]. As shown in the figure, the objects’ colors and trajectories are correctly classified.
Figure 22 shows the processing time of the proposed algorithm. To measure the processing time, a personal computer is used with a 3.6-GHz quad-core CPU and 8 GBytes of memory. As shown in Figure 22, it takes 20–45 ms to process a frame, and the average processing speed is 39 frames per second (FPS).

6. Conclusions

This paper presented a multiple camera-based wide-range surveillance system that can efficiently retrieve objects-of-interest by extracting normalized metadata of an object acquired by multiple, heterogeneous cameras. In order to retrieve a desired video clip from a huge amount of recorded video data, the proposed system allows a user to query various features, including the size, color, aspect ratio, moving speed and direction. The first step of the algorithm is the auto-calibration to extract normalized physical data. The proposed auto-calibration algorithm can estimate both the internal and external parameters of a camera without using a special pattern or depth information. Image data acquired by the appropriately-calibrated camera provide normalized object information. In the metadata generation step, a color constancy algorithm is first applied to the input image as preprocessing. After a set of representative colors is extracted using K-means clustering, the physical size and speed of an object-of-interest are estimated in the world coordinate using the camera parameters. The metadata of the object are then generated using the size ratio and motion trajectories. As a result, an object-of-interest can efficiently be retrieved using a query that combines physical information from big video data recorded by multiple, heterogeneous cameras. Experimental results demonstrated that the proposed system successfully extracts the metadata of the object-of-interest using the three-dimensional (3D) human modeling and auto-calibration steps. The proposed method can be applied to a posteriori video analysis and retrieval systems, such as a vision-based central control system and a surveillance system.

Supplementary Materials

The following are available online at https://www.mdpi.com/1424-8220/16/7/963/s1.

Acknowledgments

This work was supported by the NIPA (NIPA-2014-CAU) under the ITRC support program supervised by the Ministry of Science, ICT and Future Planning, under the Software Grand Challenge Project (14-824-09-003), and by the Technology Innovation Program (Development of Smart Video/Audio Surveillance SoC & Core Component for On-site Decision Security System) under Grant 10047788.

Author Contributions

Jaehoon Jung performed the experiments. Inhye Yoon and Seungwon Lee initiated the research and designed the experiments. Joonki Paik wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hampapur, A.; Brown, L.; Feris, R.; Senior, A.; Shu, C.F.; Tian, Y.; Zhai, Y.; Lu, M. Searching surveillance video. In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, AVSS 2007, London, UK, 5–7 September 2007; pp. 75–80.
  2. Yuk, J.S.; Wong, K.Y.K.; Chung, R.H.; Chow, K.; Chin, F.Y.; Tsang, K.S. Object-based surveillance video retrieval system with real-time indexing methodology. In Image Analysis and Recognition; Springer: Berlin, Germany, 2007; pp. 626–637. [Google Scholar]
  3. Hu, W.; Xie, D.; Fu, Z.; Zeng, W.; Maybank, S. Semantic-based surveillance video retrieval. IEEE Trans. Image Process. 2007, 16, 1168–1181. [Google Scholar] [CrossRef] [PubMed]
  4. Le, T.L.; Boucher, A.; Thonnat, M.; Bremond, F. A framework for surveillance video indexing and retrieval. In Proceedings of the International Workshop on Content-Based Multimedia Indexing, CBMI 2008, London, UK, 18–20 June 2008; pp. 338–345.
  5. Ma, X.; Bashir, F.; Khokhar, A.A.; Schonfeld, D. Event analysis based on multiple interactive motion trajectories. IEEE Trans. Circuits Syst. Video Technol. 2009, 19, 397–406. [Google Scholar]
  6. Choe, T.E.; Lee, M.W.; Guo, F.; Taylor, G.; Yu, L.; Haering, N. Semantic video event search for surveillance video. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 1963–1970.
  7. Thornton, J.; Baran-Gale, J.; Butler, D.; Chan, M.; Zwahlen, H. Person attribute search for large-area video surveillance. In Proceedings of the 2011 IEEE International Conference on Technologies for Homeland Security (HST), Waltham, MA, USA, 15–17 November 2011; pp. 55–61.
  8. Ge, W.; Collins, R.T.; Ruback, R.B. Vision-based analysis of small groups in pedestrian crowds. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1003–1016. [Google Scholar] [PubMed]
  9. Yun, S.; Yun, K.; Kim, S.W.; Yoo, Y.; Jeong, J. Visual surveillance briefing system: Event-based video retrieval and summarization. In Proceedings of the 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Seoul, Korea, 26–29 August 2014; pp. 204–209.
  10. Gerónimo, D.; Kjellstrom, H. Unsupervised Surveillance Video Retrieval based on Human Action and Appearance. In Proceedings of the 2014 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 24–28 August 2014; pp. 4630–4635.
  11. Lai, Y.H.; Yang, C.K. Video object retrieval by trajectory and appearance. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 1026–1037. [Google Scholar]
  12. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
  13. Zhao, T.; Nevatia, R.; Wu, B. Segmentation and tracking of multiple humans in crowded environments. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1198–1211. [Google Scholar] [CrossRef] [PubMed]
  14. Stauffer, C.; Grimson, W.E.L. Adaptive background mixture models for real-time tracking. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Fort Collins, CO, USA, 23–25 June 1999; Volume 2.
  15. Lv, F.; Zhao, T.; Nevatia, R. Camera calibration from video of a walking human. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1513–1518. [Google Scholar] [PubMed]
  16. Krahnstoever, N.; Mendonca, P.R. Bayesian autocalibration for surveillance. In Proceedings of the Tenth IEEE International Conference on Computer Vision, Beijing, China, 17–21 October 2005; Volume 2, pp. 1858–1865.
  17. Liu, J.; Collins, R.T.; Liu, Y. Surveillance camera autocalibration based on pedestrian height distributions. In Proceedings of the British Machine Vision Conference, Scotland, UK, 29 August–2 September 2011; p. 144.
  18. Cipolla, R.; Drummond, T.; Robertson, D.P. Camera Calibration from Vanishing Points in Image of Architectural Scenes. In Proceedings of the British Machine Vision Conference (BMVC), Citeseer, Nottingham, UK, 13–16 September 1999; Volume 99, pp. 382–391.
  19. Liebowitz, D.; Zisserman, A. Combining scene and auto-calibration constraints. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 1, pp. 293–300.
  20. Zivkovic, Z.; van der Heijden, F. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognit. Lett. 2006, 27, 773–780. [Google Scholar] [CrossRef]
  21. Bradski, G.R. Computer vision face tracking for use in a perceptual user interface. In Proceedings of the Workshop Applications of Computer Vision, Kerkyra, Greece, 19–21 October 1998; pp. 214–219.
  22. Fischler, M.A.; Bolles, R.C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  23. Liebowitz, D.; Criminisi, A.; Zisserman, A. Creating architectural models from images. Comput. Graph. Forum 1999, 18, 39–50. [Google Scholar] [CrossRef]
  24. Finlayson, G.D.; Trezzi, E. Shades of gray and colour constancy. In Proceedings of the 12th Color and Imaging Conference, Scottsdale, AZ, USA, 9–12 November 2004; pp. 37–41.
  25. Van de Weijer, J.; Gevers, T.; Gijsenij, A. Edge-based color constancy. IEEE Trans. Image Process. 2007, 16, 2207–2214. [Google Scholar] [CrossRef] [PubMed]
  26. Person Re-Identification in the Wild Dataset. Available online: http://robustsystems.coe.neu.edu/sites/robustsystems.coe.neu.edu/files/systems/projectpages/reiddataset.html (accessed on 17 June 2016).
Figure 1. Block diagram of the proposed human retrieval method.
Figure 2. Human models on the projected multiple ellipses with different sizes and locations.
Figure 3. Matching results of the human models: (a) an example of the fitting failure in the second human from the right; (b–d) the corrected fitting results.
Figure 4. Vanishing lines and vanishing points.
Figure 5. Estimation of vanishing lines and vanishing points.
Figure 6. Foot-to-head homology estimation: (a) inlier data; (b) ground truth of the homology; and (c) the estimated homology.
Figure 7. Results of color correction: (a) input images captured by three different cameras; and (b) color-corrected images using the shades of gray method.
Figure 8. K-means clustering results in the (a) RGB and (b) HSV color spaces.
Figure 9. Results of K-means clustering to extract representative colors of the same object using different sets of initial clusters: (a) input image; (b) different results of K-means clustering; and (c) the sorted colors of (b).
Figure 10. Selection of the representative colors from the candidate colors computed by the K-means clustering algorithm: (a) input image with two people; (b) the result of color selection; (c) an input image with a vehicle; (d) the result of color selection.
Figure 11. Size estimation results of the same object that is (a) far from the camera; (b) close to the camera.
Figure 12. Results of object speed estimation: (a) without using the Kalman filter and (b) using the Kalman filter.
Figure 13. Variation of the object height in each frame.
Figure 14. Results of the object search using representative colors. (a) Red; (b) green; (c) blue; (d) yellow; (e) orange; (f) purple; (g) white; (h) black.
Figure 15. Test video files with estimated camera parameters: (a,b) two images of the first scene captured by two different camera parameters; (c,d) two images of the second scene captured by two different camera parameters; (e,f) two images of the third scene captured by two different camera parameters; (g,h) two images of the fourth scene captured by two different camera parameters.
Figure 16. Result of camera calibration on the virtual three-dimensional grid on: (a,b) two images of the first scene captured by two different camera parameters; (c,d) two images of the second scene captured by two different camera parameters; (e,f) two images of the third scene captured by two different camera parameters; (g,h) two images of the fourth scene captured by two different camera parameters.
Figure 17. Results of the object search using the object size. (a) Small size; (b) medium size; (c) large size.
Figure 18. Results of the object search using the object ratio. (a) Horizontal; (b) normal; (c) vertical.
Figure 19. Results of the object search using the object speed. (a) Slow; (b) normal; (c) fast.
Figure 20. Results of the object search using the moving direction. (a) Line setting; (b) the results of the search.
Figure 21. Results of the proposed algorithm using a public dataset [26]: (a–d) four frames in the test video with re-identified people.
Figure 22. Processing time of the proposed algorithm.
Table 1. Object metadata model.

| Name | Fields | Description |
|---|---|---|
| ID | – | Object number |
| File name | – | Occurrence video file name |
| Frame | Start frame, End frame, Duration | Start frame, end frame and duration of the frame |
| Trajectory | First position, Second position, Third position, Last position, Moving distance | First position, 1/3 position, 2/3 position, last position and moving distance |
| Height (mm) | Min height, Median height, Max height | Minimum, median and maximum height of the object |
| Width (mm) | Min width, Median width, Max width | Minimum, median and maximum width of the object |
| Speed (m/s) | Min speed, Median speed, Max speed | Minimum, median and maximum speed of the object |
| Aspect ratio | Min aspect ratio, Median aspect ratio, Max aspect ratio | Minimum, median and maximum aspect ratio of the object |
| Color | First color, Second color, Third color | First, second and third HSV color values |
| Area size | Min area, Median area, Max area | Minimum, median and maximum size of the area |
Table 2. Performance evaluation of scene auto-calibration.

| Input Scene | Estimated and Corrected Camera Parameters | Number of Appearances of Object A |
|---|---|---|
| Scene_1 | f = 613, θ = 111, ρ = 182, h_c = 2660 mm | 25 |
| Scene_2 | f = 632, θ = 118, ρ = 180, h_c = 6450 mm | 9 |
| Scene_3 | f = 643, θ = 104, ρ = 180, h_c = 3096 mm | 2 |
| Scene_4 | f = 667, θ = 117, ρ = 173, h_c = 10,331 mm | 3 |
| Scene_5 | f = 644, θ = 107, ρ = 183, h_c = 2399 mm | 15 |
| Scene_6 | f = 688, θ = 108, ρ = 179, h_c = 2672 mm | 10 |
| Scene_7 | f = 532, θ = 109, ρ = 180, h_c = 3035 mm | 3 |
Table 3. Result of the classification based on the color.

| | Red | Green | Blue | Yellow | Orange | Purple | Pink | White | Gray | Black | Total Objects |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Red | 112 | 0 | 0 | 0 | 2 | 0 | 4 | 0 | 0 | 0 | 118 (95%) |
| Green | 0 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 (86%) |
| Blue | 0 | 1 | 96 | 0 | 0 | 0 | 0 | 0 | 4 | 3 | 104 (92%) |
| Yellow | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 8 (88%) |
| Orange | 2 | 0 | 0 | 3 | 88 | 0 | 0 | 1 | 0 | 0 | 94 (94%) |
| Purple | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 (100%) |
| Pink | 1 | 0 | 0 | 0 | 1 | 0 | 12 | 0 | 0 | 0 | 14 (86%) |
| White | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 79 | 5 | 0 | 84 (94%) |
| Gray | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 93 | 2 | 96 (97%) |
| Black | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 23 | 1237 | 1264 (98%) |
Table 4. Result of the classification based on the object size.

| | Small | Medium | Large | Total Objects |
|---|---|---|---|---|
| Small | 35 | 11 | 3 | 49 (71%) |
| Medium | 6 | 185 | 21 | 212 (87%) |
| Large | 0 | 17 | 993 | 1010 (98%) |
Table 5. Result of the classification based on the aspect ratio.

| | Horizontal | Normal | Vertical | Total Objects |
|---|---|---|---|---|
| Horizontal | 38 | 3 | 5 | 46 (83%) |
| Normal | 1 | 54 | 7 | 62 (87%) |
| Vertical | 2 | 21 | 1140 | 1163 (98%) |
Table 6. Result of the classification of the speed-based search.

| | Slow | Normal | Fast | Total Objects |
|---|---|---|---|---|
| Slow | 96 | 37 | 0 | 133 (72%) |
| Normal | 2 | 976 | 5 | 983 (99%) |
| Fast | 0 | 9 | 146 | 155 (94%) |
