1 Introduction

A trademark is a product of the commodity economy and is one of the most significant industrial property rights. The U.S. Patent Office has received about 2 million trademark applications per year since 2013. In China, the number of trademark registration applications has ranked first in the world for 15 years, with about 2.87 million applications received in 2015. By the end of 2015, the number of valid registered trademarks in China had reached 10.74 million, and the cumulative number of registration applications was 19.13 million. Trademark applicants usually wait about 18 months for approval, and most of that time is spent retrieving similar trademarks, a process closely tied to the economic interests of companies. In view of these situations, it is vital and essential to improve the performance of retrieving similar trademarks.

At present, trademark retrieval technology has developed from image retrieval technology, mainly content-based image retrieval (CBIR), which searches by three kinds of features: color, texture, and shape. For color features, [5, 7,8,9,10] computed color histograms in different color spaces to describe the color of trademarks. For texture features, Tursun and Kalkan [5] used LBP features; Goyal and Walia [4] used improved LBP variants such as LBP-HF, ULBP, CS-LBP, HLBP, and LDP as texture features for trademark image retrieval; Qi [11] proposed retrieving trademark images with Tamura texture features. Shape features can be divided into boundary features and regional features. Boundary features are mainly obtained by extracting the edges of the image and describing them with Fourier descriptors, wavelet transforms, or the discrete cosine transform. In general, the low-frequency components reflect the overall shape of the object, while the high-frequency components reflect its details. Moment features include Hu moments, Zernike moments, pseudo-Zernike moments, Legendre moments, etc.; they are invariant to image transformations such as scaling and rotation. Tursun and Kalkan [5] also used moment features to retrieve trademark images.

Currently, the mainstream approach is to fuse local and global features for trademark image retrieval. The global features are mainly the moments mentioned above, and the local features include entropy histograms, boundary distance histograms, boundary point curvature, SIFT features, and BoW features. Tursun and Kalkan [5] recommended BoW and triangular SIFT features for retrieving similar trademark images. For fusion features, Wang and Hong [12] proposed fusing Zernike moments and SIFT features; Wei et al. [3] fused Zernike moments with edge point curvature and boundary histograms; Goyal and Walia [4] combined Zernike moments with improved LBP and LDP and achieved good retrieval accuracy.

However, trademark images differ substantially from ordinary images. The color of a trademark is changeable and artificial, so color features cannot be regarded as intrinsic features of a trademark. In addition, trademark images rarely contain natural content such as skin or faces and often lack rich texture, so extracting texture features alone is not suitable for this problem. Shape features are the most effective for retrieving similar trademarks, but if a trademark contains complex geometric patterns, neither boundary features nor regional (global) features yield good search results. The most difficult problem in retrieving similar trademarks is how to define "similarity": it is challenging to validate the similarity between two trademark images, since there is a large gap between computer vision and human vision.

To address this problem, we seek new features that are closer to human vision than traditional features. Deep learning has recently become well known worldwide; it mimics a simplified human nervous system and achieves the best results in a variety of fields. The convolutional neural network (CNN) is a deep learning model that mainly deals with images and ranks among the best approaches in image processing, pattern recognition, and related tasks. For these reasons, we choose to extract trademark image features with a CNN. In this research, we extract uniform LBP features from the feature maps of the convolution layers as retrieval features, which is another kind of "fusion", and we use the Euclidean distance to measure the similarity between trademarks.

2 Proposed Method

2.1 Convolutional Neural Network (CNN)

The convolutional neural network (CNN) is a multi-layer neural network specially designed to process two-dimensional data. A CNN reduces the number of trainable parameters by exploiting the spatial correlation in the data, which improves the efficiency of training with the back-propagation (BP) algorithm.

In a CNN, local receptive fields in the two-dimensional space allow the network to extract primary visual features from the input image, such as edges, endpoints, and corners. Subsequent layers combine these primary features into higher-level features, so salient features of the observed data are obtained at each layer. Pooling operations reduce the dimensionality of the convolutional features while filtering out noise, thereby enhancing the features used for recognition. The pooled units exhibit translational invariance: even if the image undergoes a certain degree of displacement, stretching, or rotation, the extracted features remain largely unchanged.

Each stage of a convolutional neural network consists of a convolution layer and a pooling layer. This structure gives the network a high degree of robustness to object distortion. The convolution process convolves the input image with a trained filter \( f_{x} \) and then adds a bias \( b_{x} \) to obtain the convolution layer. The formula is defined below [1]:

$$ X_{n}^{l} = \sum\nolimits_{i \in M_{n}} X_{i}^{l-1} * K_{in}^{l} + b_{n}^{l} $$
(1)

Here \( M_{n} \) denotes the set of input feature maps selected for the nth output map, \( X_{n}^{l} \) denotes the nth feature map of the lth layer, \( K_{in}^{l} \) denotes the convolution kernel of the lth layer connecting the ith input feature map to the nth output feature map, \( b_{n}^{l} \) denotes the bias of the nth feature map of the lth layer, and "*" denotes the convolution operation.
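As a minimal sketch of Eq. (1), assuming NumPy and SciPy with toy shapes (the helper name is ours, and cross-correlation is used, as in typical CNN implementations):

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(X_prev, K, b):
    """Minimal forward pass of Eq. (1).

    X_prev: list of 2-D input feature maps X_i^{l-1}
    K:      K[i][n] is the 2-D kernel linking input map i to output map n
    b:      b[n] is the scalar bias b_n^l of output map n
    """
    maps = []
    for n in range(len(b)):
        # sum the responses over the selected input maps M_n (here: all of them)
        acc = sum(correlate2d(X_prev[i], K[i][n], mode='valid')
                  for i in range(len(X_prev)))
        maps.append(acc + b[n])
    return maps
```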

The pooling layer, also called the down-sampling layer, produces a sub-sampled version of the feature maps. Pooling does not change the number of feature maps but reduces their size: each \( n \times n \) neighborhood of pixels is replaced by its maximum or mean value, yielding a feature map \( S_{x+1} \) reduced by roughly a factor of n.

$$ X_{n}^{l} = \beta_{n}^{l} down(X_{n}^{l - 1} ) + b_{n}^{l} $$
(2)

In this formula, down(·) denotes the down-sampling (pooling) function, and \( \beta_{n}^{l} \) denotes the multiplicative bias of the nth feature map of the lth layer. We adopt max pooling as the down-sampling function, which takes the largest element in each pooled region:

$$ P_{n} = \mathop{\max}\limits_{i \in R_{n}} c_{i} $$
(3)

To make the network converge faster, we use the rectified linear unit (ReLU) as the activation function to obtain the responses. The ReLU function is defined as follows:

$$ f(x) = \max (0, x) $$
(4)
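A minimal NumPy sketch of the max-pooling step (Eq. (3)) and the ReLU activation (Eq. (4)); the function names are ours:

```python
import numpy as np

def max_pool(X, n=2):
    """Max pooling, Eq. (3): keep the largest element of each n x n region."""
    h, w = X.shape[0] // n * n, X.shape[1] // n * n
    blocks = X[:h, :w].reshape(h // n, n, w // n, n)
    return blocks.max(axis=(1, 3))   # feature map shrinks by a factor of n

def relu(X):
    """ReLU activation, Eq. (4): f(x) = max(0, x)."""
    return np.maximum(0, X)
```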

2.2 Local Binary Pattern (LBP)

The local binary pattern (LBP) is an effective texture description operator proposed by Ojala et al. [2]. It extracts local texture information from an image and is invariant to illumination, and LBP features obtain good results in the majority of applications. The gray values of the eight neighboring pixels are compared against a base value to obtain a set of binary numbers that represent the LBP operator.

For any pixel \( f(x_{c}, y_{c}) \) in a local area of an image, a center pixel and its 8 neighborhood pixels \( g_{0}, g_{1}, \ldots, g_{7} \) are chosen. The local texture is denoted as T = t(\( g_{0}, g_{1}, \ldots, g_{7} \)). The gray value of the center pixel is taken as the threshold of the window, and the other pixels are binarized against it; that is, the gray value of the central pixel is the base value of the comparison. The formula is:

$$ T \approx t\left( s(g_{0} - g_{c}), \ldots, s(g_{7} - g_{c}) \right), \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} $$
(5)

An 8-bit binary number is then read off in clockwise order, giving a binary pattern for each pixel. The following formula converts the sign function outputs into a decimal number that describes the local image texture. The LBP code value of the spatial structure is expressed as:

$$ LBP(x_{c}, y_{c}) = \sum\nolimits_{i = 0}^{P - 1} s(g_{i} - g_{c}) \, 2^{i} $$
(6)
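A minimal sketch of Eq. (6) for a single pixel, assuming a NumPy grayscale image (the neighbor ordering is one illustrative clockwise choice):

```python
import numpy as np

def lbp_code(img, x, y):
    """Basic LBP code of pixel (x, y), Eq. (6): 8 neighbors at radius 1."""
    gc = img[x, y]
    # clockwise 8-neighborhood offsets (illustrative ordering)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for i, (dx, dy) in enumerate(offsets):
        if img[x + dx, y + dy] >= gc:   # s(g_i - g_c)
            code |= 1 << i              # weight 2^i
    return code
```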

As the neighborhood size increases, the number of LBP patterns grows exponentially, and the extra binary patterns make it harder to extract texture features. To solve this problem, Ojala et al. [2] adopted "uniform patterns" to improve LBP features. A pattern is denoted \( LBP_{P,R}^{u2} \) and called uniform when its circular binary representation changes from 0 to 1 or from 1 to 0 at most twice [6]. All other patterns are assigned to a single additional class. When uniform patterns are used, the \( LBP_{P,R}^{u2} \) operator encodes:

$$ LBP(x_{c}, y_{c}) = \begin{cases} \sum\nolimits_{i = 0}^{P - 1} s(g_{i} - g_{c}) \, 2^{i}, & \text{if } U(LBP) \le 2 \\ P + 1, & \text{otherwise} \end{cases} $$
(7)

Uniform patterns account for more than 90% of all patterns in the (8,1) neighborhood and more than 70% in the (16,2) neighborhood. Experimental results show that uniform patterns can effectively describe the texture features of most images while greatly reducing the number of features.

We therefore choose uniform LBP, which reduces the feature dimension: the number of pattern classes drops from the original \( 2^{P} \) to \( P(P-1) + 3 \) (59 classes for P = 8). Uniform LBP improves the classification ability and enhances robustness, while reducing the impact of high-frequency noise and the vector dimension.
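A small sketch of the uniform (u2) labeling in Eq. (7); counting over all 256 codes confirms the 58 uniform patterns plus one catch-all class:

```python
def transitions(pattern, P=8):
    """U(LBP): number of 0/1 transitions in the circular binary pattern."""
    bits = [(pattern >> i) & 1 for i in range(P)]
    return sum(bits[i] != bits[(i + 1) % P] for i in range(P))

def uniform_label(pattern, P=8):
    """Eq. (7): uniform patterns keep their own code, the rest share P + 1."""
    return pattern if transitions(pattern, P) <= 2 else P + 1

# sanity check: sum(transitions(p) <= 2 for p in range(256)) == 58,
# so with the catch-all class there are 59 pattern classes for P = 8
```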

2.3 Trademark Retrieval

We then extract LBP features based on the intermediate results of the CNN; the resulting texture features are more applicable because they combine the advantages of the CNN and uniform LBP.

Principal component analysis (PCA) is one of the most commonly used dimensionality reduction methods. PCA transforms a multivariable problem in a high-dimensional space into a small number of new variables in a low-dimensional space. This approach reduces the dimensionality of multivariable data while simplifying the statistics of the system variables.
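A minimal sketch with scikit-learn's PCA on toy data; the array sizes below are illustrative, with the paper's real dimensions noted in the comments:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in: in our setting X would hold one cascaded LBP vector per
# trademark image (256 maps x 59 bins = 15104 dims, reduced to 6224).
X = np.random.rand(500, 2048)        # (images, raw feature dim), illustrative
pca = PCA(n_components=128)          # illustrative target dimension
X_reduced = pca.fit_transform(X)     # one compressed row per image
```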

In our paper, we put forward a new method for extracting uniform LBP features based on a CNN. First, we use the pre-trained CNN model imagenet-vgg-f [1] to extract features from its fully connected layers, as shown in Fig. 1. We feed a trademark image into the network; after the convolutions and max pooling, features are extracted through the fully connected layers, and we take the features of the penultimate layer (marked by the red circle in Fig. 1). Although the retrieval results of this method are closer to human vision than those of traditional methods, it does not handle similar trademark images with different background colors; moreover, it is dominated by the overall graphic structure and ignores local texture similarity. We therefore propose an improved method to optimize it.
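A minimal sketch of this first method in PyTorch. Since imagenet-vgg-f is a MatConvNet model, torchvision's VGG-16 is used here as an assumed stand-in; its 4096-dimensional penultimate fully connected output plays the same role:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# VGG-16 as a stand-in for imagenet-vgg-f (not available in torchvision)
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
# keep the classifier up to the penultimate fully connected layer
fc_extractor = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def fc_features(path):
    """Return the 4096-dim penultimate fully connected feature."""
    x = prep(Image.open(path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        x = vgg.features(x)             # convolutions + max pooling
        x = vgg.avgpool(x).flatten(1)   # flatten for the classifier
        return fc_extractor(x)          # shape (1, 4096)
```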

Fig. 1. Extracting features from the fully connected layer

The proposed method extracts uniform LBP features from the feature maps of every convolution layer, as shown in Fig. 2. Following the same procedure as the first method, we feed a trademark image into the network, but now we extract the feature maps of each convolution layer separately. The LBP algorithm operates on images whose pixel values lie in 0 to 255, while the values of the feature maps are not in this range, so we first normalize the feature map values into 0 to 255. We then extract uniform LBP features from the feature maps of the five convolution layers, with 8 neighboring pixels and a radius of 1, and cascade the LBP features of each convolution layer, obtaining 5 LBP feature vectors from the five convolution layers. After that we reduce the dimension of the cascaded LBP features by principal component analysis (PCA). Through experiments, we select the convolution layer whose LBP features perform best. The concrete flow of this process is shown in Fig. 3.
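A minimal sketch of the per-map step, assuming NumPy and scikit-image, whose `local_binary_pattern` with `method='nri_uniform'` yields the 59-bin uniform LBP for P = 8, R = 1:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def feature_map_lbp(fmap):
    """59-bin uniform LBP histogram of one convolution feature map."""
    span = fmap.max() - fmap.min()
    # normalize the map into 0..255, the range the LBP operator expects
    img = (np.zeros_like(fmap, dtype=np.uint8) if span == 0 else
           ((fmap - fmap.min()) / span * 255).astype(np.uint8))
    codes = local_binary_pattern(img, P=8, R=1, method='nri_uniform')
    hist, _ = np.histogram(codes, bins=59, range=(0, 59))
    return hist

def convolution_descriptor(feature_maps):
    """Cascade the histograms of all maps of one convolution layer
    (e.g. 256 maps x 59 bins for the 4th convolution), before PCA."""
    return np.concatenate([feature_map_lbp(m) for m in feature_maps])
```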

Fig. 2. Extracting uniform LBP features from feature maps

Fig. 3. The concrete steps of extracting uniform LBP features with the proposed method

3 Experimental Results and Analyses

3.1 Experimental Dataset

In our experiments, we collected 7,139 trademark images, including some images transformed by manual editing, cropping, rotation, etc. In this way we created 317 groups of similar trademarks, each containing at least two similar trademarks, and we use these 317 groups as our test set. Figure 4 shows part of the similar-trademark dataset. To verify the effectiveness of our method, we also ran experiments on the METU dataset [5], which contains about 930,328 trademark images. The test set of the METU dataset contains 32 groups of 10 similar trademark images each, part of which is shown in Fig. 5.

Fig. 4. Some samples of similar trademark groups

Fig. 5. Part of the test sets of the METU dataset

Since the size of the input image must be fixed at 224 \( \times \) 224 while the collected trademark images have different sizes, directly normalizing the image size would distort some of the geometry. We therefore first pad the image into a square and then resize it to 224 \( \times \) 224.
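A minimal sketch of this preprocessing with Pillow; the white fill color is an assumption for trademark backgrounds:

```python
from PIL import Image

def pad_to_square_and_resize(path, size=224):
    """Pad a trademark image to a square canvas, then resize to 224 x 224."""
    img = Image.open(path).convert('RGB')
    side = max(img.size)
    canvas = Image.new('RGB', (side, side), (255, 255, 255))  # assumed white
    # center the original image so its geometry is not distorted
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas.resize((size, size), Image.BILINEAR)
```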

3.2 Feature Extraction

In our experiment, we use the pre-trained CNN model imagenet-vgg-f [1] to extract the image features. First, we feed all the trademark images to the imagenet-vgg-f model to extract their features; then we feed the query image to the model to extract its features; finally, we measure the similarity distance between the features of the query image and those of every trademark image. Table 1 shows the structure and concrete parameters of the imagenet-vgg-f model.

Table 1. The structure of imagenet-vgg-f [1]

A CNN consists of convolution layers, fully connected layers, and a classification layer. We have tried two ways of extracting features with the CNN: (1) extracting features from the fully connected layer, as shown in Fig. 1; (2) extracting uniform LBP features from the feature maps of each convolution layer, as shown in Fig. 2.

With the first method (Fig. 1) we obtain a 4096-dimensional vector from the fully connected layer as the feature of the input trademark image; with our proposed method (Fig. 2) we extract uniform LBP features from the five convolution layers and then reduce the feature dimension by PCA. In a CNN, the first few layers extract primary visual features (inflection points, end points, corners) and the subsequent layers extract more advanced features (texture, global shape), so we consider extracting LBP features from the subsequent layers. Extensive experiments show that the features from the fourth convolution layer enhance the robustness of LBP and are closest to human vision, so we choose the fourth convolution layer for LBP extraction. Since the fourth convolution layer has 256 feature maps and we extract LBP features from every feature map, we obtain a 256 \( \times \) 59 dimensional vector per image, which PCA then reduces to 6224 dimensions.

3.3 Similarity Measurement

Image similarity is in essence computed as the similarity between the corresponding image features: each image is represented by a feature vector, and the similarity between two images is computed as the distance between the two vectors. In this experiment, we use the Euclidean distance to measure it.

Let a and b be two n-dimensional vectors, where \( a_i \) and \( b_i \) are their values in the ith dimension. The Euclidean distance can be expressed as:

$$ d_{E}(a, b) = \left( \sum\nolimits_{i = 1}^{n} |a_{i} - b_{i}|^{2} \right)^{1/2} $$
(8)
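A minimal NumPy sketch of ranking the database by Eq. (8); the names are illustrative:

```python
import numpy as np

def rank_by_euclidean(query_vec, db_vecs, top_k=10):
    """Return the indices of the top_k nearest database features, Eq. (8)."""
    dists = np.linalg.norm(db_vecs - query_vec, axis=1)  # one distance per image
    order = np.argsort(dists)                            # nearest first
    return order[:top_k], dists[order[:top_k]]
```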

3.4 Experimental Results

In this section, we first test the methods on our own dataset. We extract features from the fully connected layer, obtaining a 4096-dimensional vector for every trademark image. Figure 6(a), (b) and (c) show the retrieval results of this method; the query image is the first image of each result. Compared with the ground truth in Fig. 4, Fig. 6(a) indicates that features extracted by the CNN yield results close to human vision. However, Fig. 6(b) and (c) show that this method cannot cope with changes in background color, and that it focuses on the overall shape while ignoring local texture similarity.

Fig. 6. Retrieval results using features from the CNN fully connected layer

To solve the problems above, we examined the examples that were retrieved poorly and found that the pairs that CNN features handle poorly can be retrieved excellently by LBP features, because these trademark pairs share most of their texture. The dimension of the LBP feature is much smaller than that of the CNN feature extracted from the fully connected layer, so directly fusing the two features is not feasible. In view of this, we try our new method: extracting LBP features from the feature maps of each convolution layer. Figure 7(a), (b) and (c) show the retrieval results of the proposed method; the query image is again the first image of each result, using the same test images as Fig. 6(a), (b) and (c).

Fig. 7. Retrieval results using uniform LBP features from the CNN convolution layers

Comparing the two sets of experiments in Figs. 6 and 7, we find that Fig. 7(a) keeps the same good retrieval accuracy as Fig. 6(a), while Fig. 7(b) and (c) obtain favorable results. This shows that the proposed method overcomes the drawbacks of CNN features from the fully connected layer, namely sensitivity to background color and little attention to local texture.

It is worth noting that the situations in Fig. 7(b) and (c), changing the background color or cutting out part of the original trademark, are the most common forms of trademark copyright infringement. The proposed method therefore not only retrieves similar trademark images in a way consistent with human vision, but can also reliably find trademarks suspected of plagiarism.

3.5 Evaluation

To validate the effectiveness and accuracy of the proposed method quantitatively, we use Recall (R), Precision (P) and F-Measure, defined as:

$$ {\text{Recall}} = \frac{\text{The number of associated images in the output}}{\text{The number of associated images in the database}} \times 100\% $$
(9)
$$ {\text{Precision}} = \frac{\text{The number of associated images in the output}}{\text{The number of images retrieved for all outputs}} \times 100\% $$
(10)
$$ {\text{F}} - {\text{Measure}} = 2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}} $$
(11)
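A small sketch of Eqs. (9)-(11) for one query, assuming the retrieved and ground-truth image ids are known:

```python
def retrieval_scores(retrieved, relevant):
    """Recall, Precision and F-Measure (Eqs. (9)-(11)) for one query."""
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant)        # Eq. (9)
    precision = hits / len(retrieved)    # Eq. (10)
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return recall, precision, f
```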

We test our proposed method on our trademark database. Because no similar-trademark group contains more than 10 images, we set the total number of retrieved images to 10. Table 2 compares our proposed method with other common methods; every value in Table 2 is an average.

Table 2. Recall, Precision and F-Measure of compared methods

From Table 2 we can see that our proposed method performs much better than the traditional methods listed: its recall, precision, and F-measure are all higher than those of any other method.

Using the experimental results on the METU dataset reported by Tursun [5], we obtain the precision-recall (P-R) curve of our proposed method on the METU dataset, shown in Fig. 8. As can be seen from Fig. 8, our method obtains better retrieval results than any other method when the recall is less than 0.5; when the recall is greater than 0.5, our method performs about the same as the best method (SIFT). In summary, our approach is shown to be effective for large-scale trademark image retrieval.

Fig. 8. Retrieval results on the METU dataset

4 Conclusion

In this paper, a new method is proposed for retrieving similar trademark images by extracting uniform LBP features based on a CNN. Comparing recall, precision, F-measure, and the precision-recall curve on the METU dataset with those of traditional methods, our method obtains better retrieval results. In practice, the proposed method could on the one hand help trademark applicants avoid infringement, and on the other hand help trademark examiners quickly and effectively check whether an application for registration is malicious cybersquatting or deliberate imitation of a registered trademark. Our method not only achieves very good retrieval results, but also shows how to use a mature CNN to solve our own problem in the absence of adequate training samples.