Open Access (CC BY 4.0 license). Published by De Gruyter Open Access, March 28, 2019

Information retrieval algorithm of industrial cluster based on vector space

  • Rongsheng Li and Nasruddin Hassan
From the journal Open Physics

Abstract

Current information retrieval research on industrial clusters suffers from low precision, low recall ratio, obvious delay and high energy consumption. This paper therefore proposes an information retrieval algorithm for industrial clusters based on vector space. First, the industrial cluster information database is purified: illegal labels in the database network are optimized, the web pages of the database are segmented, and the keyword score of the industrial cluster information corresponding to each web page is calculated, which yields a set of well-divided database pages. Second, the RFD algorithm is used to extract page data features from the purified database. The extracted results are then substituted into the retrieval stage, where vectors composed of retrieval units describe both the information of the various industrial clusters and each query. The retrieval matching results are obtained by calculating the correlation between the industrial cluster information and the query, which completes the information retrieval. Experimental results show that the algorithm achieves high precision and recall ratio, short retrieval time and low energy consumption.

1 Introduction

Industrial clusters are collections of geographically proximate, interrelated enterprises and related corporate bodies that are linked by commonality and complementarity in a specific field [1]. The main components of industrial clusters include enterprises, governments, university research institutes, financial institutions, industry associations and intermediary institutions.

At present, the relevant platforms provide a variety of information retrieval services for the spatial database of industrial clusters: fuzzy queries (approximate queries on enterprise names); classified queries (specific queries on production types in the enterprise database); peripheral queries (joint queries on geographic coordinates and industry types in the enterprise tables); and site search (joint queries on multiple fields in the enterprise, product and industry information tables) [2, 3]. Analysis of these retrieval methods shows that the retrieval module of such a platform only supports searching some of the data resources and some of the fields in the platform. The retrieval efficiency of the system is low, it cannot meet users’ retrieval needs correctly, and it lacks an efficient, intelligent information retrieval algorithm.

The humanization of retrieval is reflected in the standardization of Web site design and the clarity of navigation. Information construction and retrieval have been hot issues in recent years. Information construction is proposed to better organize and present information on the web, and an important goal is to make information understandable. The organization and expression of information are essential for achieving good performance in information retrieval and acquisition [4]. Information retrieval, as an indispensable means of inquiry in people’s daily life and work, plays an increasingly important role in today’s society. The following are some widely used information retrieval methods and algorithms.

Zhang Xiaomin et al. put forward a keyword retrieval method based on temporal semantics. Temporal information was introduced to construct a temporal data graph, and a temporal correlation scoring mechanism was designed. Temporal semantic constraints were introduced into the temporal graph search, and a keyword-based temporal retrieval algorithm was designed. Experimental results showed that the retrieval time was short, but the precision was low [5]. Jiang Yu et al. proposed an information query algorithm based on Top-k. The algorithm extracted the static Top-k information of the inverted index and then dynamically calculated the initial threshold for specific query terms. On this basis, combining the MaxScore and WAND algorithms, a fast-start Top-k query processing algorithm was proposed. Experimental results showed that the algorithm had low computational complexity but a low recall ratio [6]. Zhao Yanni et al. proposed a tree matching algorithm based on effective path weight. While maintaining the effective nodes and tree structure of the XML document tree, the algorithm treats the information of the root node as the most important, and the importance of node information gradually weakens as the tree depth increases. The path weight was calculated automatically according to the path hierarchy and assigned to the corresponding path. The matching degree of the tree was calculated according to the effective information of the tree nodes and the effective paths of the tree structure. Experiments on large-scale XML document queries showed that the algorithm had a high query rate, but the delay of the query process was obvious [7]. Ma Youzhong et al. proposed a similarity join query algorithm for high-dimensional data based on the chi-square distribution. To address the curse of dimensionality and the high computational cost of similarity join queries on high-dimensional data, the data was mapped to a low-dimensional space based on the p-stable distribution. The properties of the chi-square distribution proved that if the distance in the low-dimensional space was greater than kε, the probability that the distance in the original space was greater than ε had a lower bound, so candidates could be filtered effectively in the low-dimensional space at a lower computational cost. Experiments on real data sets showed that the algorithm had a good recall rate but high query energy consumption [8].

Aiming at the problems existing in the current research results, an information retrieval algorithm for industrial clusters based on vector space is proposed. The detailed process is as follows:

The improved VIPS algorithm is used to purify the information database of industrial clusters, so as to improve the precision and recall ratio of information retrieval, and reduce the retrieval delay and the energy consumption.

RFD algorithm is used to extract the page data features of purified industrial cluster information database, which lays a foundation for information retrieval of industrial clusters.

The retrieval space is defined and the correlation between information and queries is calculated. Information retrieval of industrial clusters is then realized on the principle that the higher the correlation between a piece of information and the query words, the more relevant that information is.

The proposed algorithm is verified.

The full text is summarized and the next research plan is proposed.

2 Material and methods

2.1 Purification of industrial cluster information database

To improve the precision and recall ratio of information retrieval in industrial clusters and to reduce the retrieval delay and energy consumption, the information database of industrial clusters needs to be processed first, and a noise removal module is indispensable [9]. In this paper, the improved VIPS algorithm is used to remove noise from the information blocks. Based on extensive statistics and analysis, noisy semantic blocks are identified using the amount of text and the number of links, the relative position of the page blocks and the content attributes of the page blocks. The origin of the database web page window is defined as the top left corner of the web page. X and Y are the abscissa and ordinate of the center point of a web page block in the window, W is the width of the web page, and H is the height of the web page. The spatial position of a web page block is described by its relative spatial position K, which is expressed as:

(1) $K = \begin{cases} u, & Y/H \le R_1 \\ d, & Y/H \ge R_2 \\ l, & X/W \le R_3 \\ r, & X/W \ge R_4 \\ \text{mid}, & \text{else} \end{cases}$

Where u, d, l, r, and mid represent the top, bottom, left, right and middle positions of the database pages.
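
As a rough illustration of equation (1), the sketch below classifies a page block from the coordinates of its center. The paper does not give values for the thresholds R1 to R4, so the defaults here are placeholders. The direction of each comparison (a small Y/H means the top of the page, because the origin is the top left corner) is likewise an assumption, not the authors' stated rule. The function name `relative_position` is hypothetical.

```python
def relative_position(x, y, width, height, r1=0.2, r2=0.8, r3=0.2, r4=0.8):
    """Return the relative position K of a page block.

    (x, y) is the block center, measured from the top left corner of the page.
    r1..r4 stand in for the thresholds R1..R4; their values are assumed.
    """
    if y / height <= r1:
        return "u"    # top of the page
    if y / height >= r2:
        return "d"    # bottom of the page
    if x / width <= r3:
        return "l"    # left edge
    if x / width >= r4:
        return "r"    # right edge
    return "mid"      # everything else is treated as the middle


# A block centered near the top of a 1000 x 1000 page
print(relative_position(500, 50, 1000, 1000))
```

In a purification pass, K would be only one of several signals, alongside the text and link counts and the content attributes mentioned above, used to flag noisy blocks.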

According to the definition of equation (1), VIPS algorithm is used to partition the web pages, and the rules are used to purify the web pages. The optimized sorting algorithm is as follows:

Input: database web page set P and the keyword set Q of industry cluster related information.

Output: a good set $S_P$ of database pages.

The detailed process is as follows:

Optimize the illegal labels in the database network.

Use VIPS algorithm to segment web pages.

Calculate the scores of key information related to industrial clusters in a web page:

(2) S j = c j b × f i × t i j × r b 2 × f i 2 l j Q K

Where $S_j$ is the score of web page j for the keywords related to industrial cluster information, $c_j$ is the number of entries containing those keywords, $t_{ij}$ is the occurrence frequency of keyword i in web page j, $f_i$ is the inverted frequency of keyword i over the web pages, b is the field parameter, and $l_j$ is the length of page j.

According to equation (2), a good set of database pages $S_P$ is obtained:

(3) S P = S P b

Where S represents the set of original database pages. The result of equation (3) is the purified industrial cluster information database.

2.2 Feature extraction of industry cluster information

According to the purification of industrial cluster information database in Section 2.1, RFD algorithm is used to extract the page data features of the purified industrial cluster information database.

Usually, if a feature item becomes a representative feature of a category, most samples of that category have this feature; if a feature item becomes a discriminant feature of a category, then most samples of other categories do not have this feature. In feature extraction, the representative and discriminant features should be selected as vector representations of a class [10, 11].

Suppose that $p(x'|c')$ approximates the ratio of the number of information items in training-set category $c'$ that contain feature item $x'$ to the total number of information items in category $c'$, and that $p(x'|\bar{c'})$ approximates the ratio of the number of information items that do not belong to category $c'$ but contain feature item $x'$ to the total number of information items that do not belong to category $c'$. The similarity measure RFD($x'$, $c'$) of a feature can then be expressed as:

(4) $RFD(x', c') = S_P \left( \frac{A}{M} - \frac{B}{N - M} \right)^2 = S_P \, \frac{(A \times D - B \times C)^2}{M^2 (N - M)^2}$

Where A is the number of training information items that belong to category $c'$ and contain feature item $x'$; B is the number that do not belong to category $c'$ but contain $x'$; C is the number that belong to category $c'$ but do not contain $x'$; D is the number that neither belong to category $c'$ nor contain $x'$; M is the number of information items belonging to category $c'$; and N is the total number of training items.
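
The second form of equation (4) follows directly from these counts. As a short worked step, assuming (as the definitions imply) that the four counts partition the training set, so that M = A + C and N = A + B + C + D:

```latex
\begin{aligned}
N - M &= (A + B + C + D) - (A + C) = B + D, \\
\frac{A}{M} - \frac{B}{N - M}
  &= \frac{A\,(B + D) - B\,(A + C)}{M\,(N - M)}
   = \frac{A \times D - B \times C}{M\,(N - M)} .
\end{aligned}
```

Squaring this difference gives the numerator $(A \times D - B \times C)^2$ that equations (5) and (6) retain.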

Since N and N − M (and hence M) in equation (4) are constants for a given category, equation (4) can be simplified to:

(5) $RFD(x', c') = (A \times D - B \times C)^2$

In order to reduce the error of feature extraction, the equation (5) is improved.

(6) $RFD = \begin{cases} (A \times D - B \times C)^2, & A \times D - B \times C > 0 \\ 0, & A \times D - B \times C \le 0 \end{cases}$

Based on the above considerations, the feature items calculated by equation (6) have stronger classification and discrimination ability, mainly because the features of industrial cluster information data that have no classification ability are removed.
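
A minimal sketch of equations (5) and (6), assuming each training item is represented as a set of feature items with one category label. The helper names `contingency_counts` and `rfd` and the toy data are illustrative only and do not come from the paper.

```python
def contingency_counts(items, labels, feature, category):
    """Count A, B, C, D for one (feature, category) pair."""
    a = b = c = d = 0
    for features, label in zip(items, labels):
        has_feature = feature in features
        in_category = label == category
        if in_category and has_feature:
            a += 1        # in category, contains feature
        elif has_feature:
            b += 1        # not in category, contains feature
        elif in_category:
            c += 1        # in category, lacks feature
        else:
            d += 1        # not in category, lacks feature
    return a, b, c, d


def rfd(a, b, c, d):
    """Improved RFD: squared difference when positive, otherwise zero."""
    diff = a * d - b * c
    return diff ** 2 if diff > 0 else 0


# Toy training set (assumed data)
items = [{"school", "teacher"}, {"school"}, {"cinema"}, {"cinema", "school"}]
labels = ["education", "education", "entertainment", "entertainment"]

counts = contingency_counts(items, labels, "school", "education")
print(counts, rfd(*counts))
```

A feature that is negatively associated with the category (A × D − B × C ≤ 0) receives a score of zero and is therefore dropped from the candidate features for that category.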

The main idea of the improved RFD-based feature extraction is that, for a feature to become a representative feature of a certain category, it must be both representative and discriminant [12]. The absolute value of the difference between the representativeness measure and the discriminability measure of a feature item is used to measure the correlation between features and categories; this measure is called ARFD. The conditional probability $p(x'|c')$ is the representativeness measure of feature item $x'$, while $p(x'|\bar{c'})$ is its discriminability measure. The improved feature extraction is calculated using equation (7):

(7) $ARFD(x', c') = \left| p(x'|c') - p(x'|\bar{c'}) \right| S_P$

Equation (7) can be approximated as:

(8) $ARFD(x', c') = \left| \frac{A}{M} - \frac{B}{N - M} \right| S_P$

Where the greater the value of ARFD($x'$, $c'$), the more strongly feature item $x'$ is related to category $c'$.
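
The count-based approximation in equation (8) can be sketched as follows. This is a hedged illustration: the S P term that appears in equations (7) and (8) is left out because the paper does not say how it enters numerically, and the function names `arfd` and `rank_features` and the example counts are assumptions.

```python
def arfd(a, b, c, d):
    """|A/M - B/(N - M)| with M = A + C and N = A + B + C + D."""
    m = a + c              # items in the category
    others = b + d         # items outside the category, i.e. N - M
    if m == 0 or others == 0:
        raise ValueError("both the category and its complement must be non-empty")
    return abs(a / m - b / others)


def rank_features(counts_by_feature):
    """Sort feature names by descending ARFD score."""
    return sorted(counts_by_feature,
                  key=lambda name: arfd(*counts_by_feature[name]),
                  reverse=True)


# (A, B, C, D) per feature for one category (assumed numbers)
counts_by_feature = {
    "school": (2, 1, 0, 1),
    "cinema": (0, 2, 2, 0),
    "ticket": (1, 1, 1, 1),
}
print(rank_features(counts_by_feature))
```

Because of the absolute value, a feature that is strongly absent from the category (such as "cinema" in this toy data) also scores high, which differs from the zero score that equation (6) assigns in that case.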

2.3 Information retrieval algorithm based on vector space for industrial clusters

To improve the precision and recall ratio of information retrieval in industrial clusters, retrieval is carried out on the basis of the extracted information features. The vector space model is relatively easy to understand and is widely used in the field of information retrieval [13, 14]. Its basic idea is that information and queries are both made up of words, so each piece of information and each query can be described by a vector composed of retrieval units. When searching, the correlation between each piece of information and the query is calculated, and the higher the correlation with a specific query, the more relevant the information is considered to be.

The common way to describe information and retrieval vectors is that the retrieval space is composed of all retrieval units contained in information and retrieval, and the information and retrieval are represented as vectors in this space.

It is assumed that the information retrieval space of industrial clusters is $\Omega = \langle t'_1, t'_2, \ldots, t'_{n'} \rangle$, where $t'_{i'}$ ($i' = 1, 2, \ldots, n'$) are the distinct retrieval units contained in the information and queries, and $n'$ is the size of the whole retrieval space Ω, that is, the total number of distinct retrieval units contained in the information and queries.

In retrieval space Ω, every piece of information can be represented by a vector $d' = \langle \omega_{d'1}, \omega_{d'2}, \ldots, \omega_{d'n'} \rangle$, where the components $\omega_{d'i'}$ ($i' = 1, 2, \ldots, n'$) describe the meaning of the information: when retrieval unit $t'_{i'}$ appears in the information, $\omega_{d'i'}$ is 1; conversely, when $t'_{i'}$ does not appear in the information, $\omega_{d'i'}$ is 0. Usually, most of the components are zero because the size of the retrieval space is much larger than the length of each industry information file.

Similarly, in retrieval space Ω every query can be represented by a vector $q = \langle \omega_{q1}, \omega_{q2}, \ldots, \omega_{qn'} \rangle$, where the components $\omega_{qi'}$ ($i' = 1, 2, \ldots, n'$) describe the meaning of the query: when retrieval unit $t'_{i'}$ appears in the query, $\omega_{qi'}$ is 1; conversely, when $t'_{i'}$ does not appear in the query, $\omega_{qi'}$ is 0. In general, because queries are shorter than industrial cluster information files, even more of the components of a query vector are zero.
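
The binary representation described above can be sketched in a few lines. The example assumes that retrieval units are lowercase, whitespace-separated words, which the paper does not specify. The function names and sample texts are hypothetical.

```python
def build_retrieval_space(texts):
    """Return the sorted list of distinct retrieval units over all texts."""
    return sorted({unit for text in texts for unit in text.lower().split()})


def to_binary_vector(text, space):
    """1 where the retrieval unit occurs in the text, 0 elsewhere."""
    units = set(text.lower().split())
    return [1 if unit in units else 0 for unit in space]


documents = ["vocational school training", "cinema ticket sales"]
query = "school training"

# The space covers units from both the information files and the query
space = build_retrieval_space(documents + [query])
doc_vectors = [to_binary_vector(doc, space) for doc in documents]
query_vector = to_binary_vector(query, space)

print(space)
print(doc_vectors, query_vector)
```

Even in this tiny example the query vector has fewer non-zero entries than either information vector, and the sparsity grows quickly as the retrieval space gets larger.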

According to the above analysis, not every retrieval unit is equally important in information retrieval of industrial clusters (for example, keywords should be more important than non-keywords). So, how to embody such information in vectors needs to be solved urgently [15, 16, 17, 18, 19, 20, 21]. One of the feasible schemes is to adjust the weight of vectors manually, which enlarges the weight of retrieval units that users care about. However, manual intervention is difficult to achieve because of the huge workload. Therefore, another method is more commonly used in information retrieval: the weights based on the statistical frequency of the information file set, also known as TF-IDF weights.

TF-IDF weights consist of two parts: the frequency with which the retrieval unit appears in the information file (TF), and the inverse document frequency (IDF). The TF-IDF weight of a given retrieval unit is usually the product of TF and IDF.

For convenience, the following definitions are made: $TF_{i'j'}$ represents the frequency with which the retrieval unit $t'_{i'}$ appears in the industrial cluster information database, and $DF_{j'}$ represents the number of information items in the entire industrial cluster information database that contain the retrieval unit $t'_{i'}$.

With these definitions, the inverse document frequency can be defined as:

(9) $IDF_{j'} = \log \frac{d}{DF_{j'}}$

Where $IDF_{j'}$ represents the inverse document frequency.

For a given information file, the vector describing the information file is composed of ń elements, which correspond to ń retrieval units in the information file set. The weights of each element are determined by the frequency of the corresponding retrieval unit appearing in the industrial cluster information database and the frequency of the retrieval unit appearing in the entire industrial cluster information database, as shown in Eq. (10):

(10) $\omega_{i'j'} = TF_{i'j'} \times IDF_{j'}$

Using $\omega_{i'j'}$ as the weight of each element in the vector, the information and query vectors are further adjusted. When the value of $\omega_{i'j'}$ lies in the range [0.25, 0.30], the vectors are adjusted best and describe the information and the query more accurately.
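
As an illustration of equations (9) and (10), the sketch below computes TF-IDF weights for a small set of information files. Several choices are assumptions, not details from the paper: TF is taken as the raw count of a unit in one file, d is read as the total number of files, and the logarithm is natural. The function name `tf_idf_vectors` is hypothetical.

```python
import math


def tf_idf_vectors(files):
    """files: list of token lists. Returns (vocab, idf, weight vectors)."""
    n_files = len(files)
    vocab = sorted({unit for tokens in files for unit in tokens})
    # DF: number of files that contain the unit at least once
    df = {unit: sum(1 for tokens in files if unit in tokens) for unit in vocab}
    # IDF = log(d / DF), with d assumed to be the number of files
    idf = {unit: math.log(n_files / df[unit]) for unit in vocab}
    # Weight = TF x IDF, with TF as the raw count in the file
    vectors = [[tokens.count(unit) * idf[unit] for unit in vocab]
               for tokens in files]
    return vocab, idf, vectors


files = [["school", "teacher", "school"], ["school", "cinema"]]
vocab, idf, vectors = tf_idf_vectors(files)
print(vocab)
print(vectors)
```

With this reading, a retrieval unit that occurs in every file (here "school") gets weight zero however often it appears, while units that distinguish one file from the others keep a positive weight.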

For vector space retrieval model, it not only needs to define vectors to represent information and retrieval, but also needs to choose an appropriate method to calculate the relevance of information and query to determine whether information and query are related. The cosine of vector angle is used as the basis for judging the relevance of industrial cluster information.

According to the above, the similarity between information $d'$ and query q is defined in retrieval space Ω. The retrieval matching process can be expressed as follows:

(11) $SC(d', q) = \frac{\sum_{i'=1}^{n'} \omega_{d'i'} \times \omega_{qi'}}{\left( \sum_{i'=1}^{n'} \omega_{d'i'}^2 \sum_{i'=1}^{n'} \omega_{qi'}^2 \right)^{1/2}} \, ARFD$

Where SC($d'$, q), calculated by equation (11), is the result of information retrieval based on vector space.
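
The matching step can be sketched as a cosine similarity followed by a ranking. This is a sketch under stated assumptions: the ARFD term at the end of equation (11) is modeled as an optional scalar multiplier `feature_weight` (default 1.0), since the paper does not state how it is combined, and the function names are hypothetical.

```python
import math


def cosine_similarity(d_vec, q_vec, feature_weight=1.0):
    """Cosine of the angle between two weight vectors, times an optional factor."""
    dot = sum(wd * wq for wd, wq in zip(d_vec, q_vec))
    norm = math.sqrt(sum(w * w for w in d_vec) * sum(w * w for w in q_vec))
    if norm == 0:
        return 0.0          # an all-zero vector matches nothing
    return feature_weight * dot / norm


def rank(doc_vectors, q_vec):
    """Indices of the information vectors, most similar to the query first."""
    return sorted(range(len(doc_vectors)),
                  key=lambda i: cosine_similarity(doc_vectors[i], q_vec),
                  reverse=True)


doc_vectors = [[0, 1], [1, 0], [1, 1]]
query_vector = [1, 0]
print(rank(doc_vectors, query_vector))
```

Normalizing by the vector lengths means that a long information file is not favored simply because it contains more retrieval units.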

3 Results

To verify the validity of the vector space based information retrieval algorithm for industrial clusters, an experiment is conducted. Information on two industrial clusters in one province, education and entertainment, is selected as the source of experimental data. Experimental environment: Intel Pentium Dual E2140 @ 1.60 GHz; operating system: Microsoft Windows XP; hard disk: 160 GB; memory: 1 GB; development tools: Eclipse 3.2. The experimental indicators are retrieval precision, retrieval recall ratio, retrieval delay and network energy consumption of retrieval. The results are as follows:

Figures 1 and 2 show that the information retrieval algorithm based on vector space for industrial clusters has a higher precision and recall ratio, and is more robust, than the methods from current research. The RFD algorithm is used to extract the page data features of the purified industrial cluster information database and to determine preliminarily the characteristics of the industrial cluster information data, which provides support for information retrieval. Based on the feature extraction, the retrieval space is set, and the cosine of the vector angle is used to judge the correlation between information and queries in industrial clusters. The retrieval results obtained in this way effectively improve the precision and recall ratio of information retrieval.
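
The paper does not spell out how precision and recall ratio are computed. The sketch below uses the standard set-based definitions, which is an assumption about the experimental setup. The function name `precision_recall` and the sample identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Standard definitions over sets of item identifiers.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four items returned, three items truly relevant, two in common
print(precision_recall([1, 2, 3, 4], [2, 4, 5]))
```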

Figure 1 Comparison of precision of different information retrieval methods

Figure 2 Comparison of recall ratio with different information retrieval methods

As can be seen from Figures 3 and 4, compared with current research, the information retrieval algorithm based on vector space has a clear advantage in terms of retrieval delay and energy consumption. Before searching for industrial clusters, the algorithm uses the improved VIPS algorithm to purify the information database of industrial clusters. The noisy semantic blocks are identified and removed using the amount of text and the number of links, the relative position of the web blocks and the content attributes of the web blocks, which greatly reduces both the retrieval delay and the retrieval energy consumption.

Figure 3 Comparison of retrieval time for different information retrieval methods

Figure 4 Comparison of energy consumption in different information retrieval methods

4 Discussion

This section discusses the effect of adjusting the weight $\omega_{i'j'}$ of each element in the information data vectors of industrial clusters on the information and retrieval vectors. The value of $\omega_{i'j'}$ is set to [0.19, 0.24], [0.25, 0.30] and [0.31, 0.36] in turn, and its effect on the information and retrieval vectors is observed. The larger the adjustment coefficient, the more accurately the vector describes the information and the content of the query. The simulation results are as follows:

In Figure 5, when the value of $\omega_{i'j'}$ is in [0.19, 0.24], the vector adjustment coefficients of information and retrieval fluctuate greatly, which indicates that the accuracy with which the vectors describe the information and query content is affected to varying degrees, and in turn the effect of information retrieval in industrial clusters is affected. When the value of $\omega_{i'j'}$ is in [0.25, 0.30], the vector adjustment coefficient of information and retrieval is the largest, which means that the vector can describe information and query content to the greatest extent.

Figure 5 Influence of different values of $\omega_{i'j'}$ on vector adjustment coefficients of information and retrieval

5 Conclusions

Industrial clusters are a topic of wide social interest and play a positive role in social development. Information retrieval of industrial clusters helps in understanding their development status, which is of great significance to regulation and progress in this field. Therefore, an information retrieval algorithm for industrial clusters based on vector space is proposed. Retrieval is completed by purifying the industrial cluster information database, extracting the information features of industrial clusters and matching the similarity of industrial cluster information. Experimental results show that the proposed algorithm achieves a higher precision and recall ratio, a shorter retrieval delay and lower energy consumption than the methods it is compared with.

Acknowledgement

Major project of applied research of philosophy and social science in Henan higher schools “research on development strategy of higher career education in the context of higher education globalization” (2016-yyzd-21);

Research project of decision-making of Henan provincial government “study on occupational education promoting industrial upgrading mechanism of Henan province” (2016B090).

References

[1] Wu Z., Gao K., Wang Z., Wei C., Wali F., Zan G. et al., Direct information retrieval after 3D reconstruction in grating-based X-ray phase-contrast computed tomography, J. Synchr. Radiat., 2018, 25(Pt 4), 1222-1228. doi:10.1107/S1600577518008019

[2] Li G.L., Chen J.L., Liu B., Yin Y., Zhang H.B., Cross-media retrieval of online product based on tag-rank and CCA, Sci. Technol. Eng., 2016, 16(14), 222-227.

[3] Lu N., Gao Q.M., An algorithm for retrieving internet tourism resources based on mixed feature threshold, B Sci. Technolo., 2017, 33(8), 162-165.

[4] Hu J.H., Qin Z.C., Shi L., Lu Z., Zhou B., Research on spatial information cloud service platform and application, J. China Acad. Electron. Inf. Tech., 2016, 11(1), 51-58.

[5] Zhang X.M., Qi W., Zhang J., Gui X.Q., T-STAR: Keywords-based temporal information retrieval method over relational databases, Appl. Res. Comput., 2017, 34(10), 3051-3056.

[6] Jiang Y., Song X.S., Yang Y.X., Jiang K., Rapid start top-k query based on threshold, J. Chinese Inf. Process, 2017, 31(5), 163-170.

[7] Zhao Y.N., Guo H.L., XML tree matching algorithm based on effective path weight, Comput. Eng. Des., 2016, 37(4), 949-953.

[8] Ma Y.Z., Jia S.J., Zhang Y.X., Chi-square distribution based similarity join query algorithm on high-dimensional data, J. Comp. Appl., 2016, 36(7), 1993-1997.

[9] Li Y.X., The simulation research on the optimization management of mass library information retrieval, Comput. Simulat., 2017, 34(5), 389-392.

[10] Ren Y., Design and implementation of a fault searching system combined with semantic web, Comput. Meas. Control, 2017, 25(5), 35-37.

[11] Wei Y.P., Banawan K., Ulukus S., Cache-aided private information retrieval with partially known uncoded prefetching: fundamental limits, IEEE J. Sel. Area Comm., 2017, (99), 1-1. doi:10.1109/JSAC.2018.2844940

[12] Huang X.X., He Y., Application of cloud computing technology in library group resource retrieval, Automat. Instrum., 2017, (2), 139-142.

[13] Rocha V., Kon F., Cobe R., Wassermann R., A hybrid cloud-P2P architecture for multimedia information retrieval on VoD services, Comput., 2016, 98(1-2), 73-92. doi:10.1007/s00607-014-0428-3

[14] Jiang Y., Zhang J., Zhu L.X., Ontology based knowledge graph model of genealogical record and retrieval system, Electron. Des. Eng., 2017, 25(12), 161-165.

[15] Metwally O.N., Sinha S.R., Sa2026 a novel voice-activated web application for rapid knowledge generation and information retrieval through semantic parsing of verbal communication, Gastroenterol., 2016, 150(4), S433-S433. doi:10.1016/S0016-5085(16)31503-7

[16] Zandebasiri M., Soosani J., Pourhashemi M., Evaluating existing strategies in environmental crisis of Zagros Forests of Iran, Appl. Ecol. Env. Res., 2017, 15(3), 621-632. doi:10.15666/aeer/1503_621632

[17] Jezewska-Frackowiak J., Seroczynska K., Banaszczyk J., Wozniak D., Skowron M., Ozog A. et al., Detection of endospore producing bacillus species from commercial probiotics and their preliminary microbiological characterization, J. Environ. Biol., 2017, 38(6), 1435-1440. doi:10.22438/jeb/38/6/MRN-478

[18] Delgado J., Peña J.M., Monotonicity preserving representations of curves and surfaces, Appl. Math. Nonlin. Sci., 2016, 1(2), 517-528. doi:10.21042/AMNS.2016.2.00041

[19] Li D., Wang L., Peng W., Ge S., Li N., Furuta Y., Chemical structure of hemicellulosic polymers isolated from bamboo biocomposite during mold pressing, Polym. Compos., 2017, 38(9), 2009-2015. doi:10.1002/pc.23772

[20] Brown T., Du S., Eruslu H., Sayas F.J., Analysis of models for viscoelastic wave propagation, Appl. Math. Nonlin. Sci., 2018, 3, 55-96. doi:10.21042/AMNS.2018.1.00006

[21] Gao W., Zhu L., Guo Y., Wang K., Ontology learning algorithm for similarity measuring and ontology mapping using linear programming, J Intell. Fuzzy Syst., 2017, 33(5), 3153-3163. doi:10.3233/JIFS-169367

Received: 2018-10-28
Accepted: 2019-01-28
Published Online: 2019-03-28

© 2019 Rongsheng Li and Nasruddin Hassan, published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 1.6.2024 from https://www.degruyter.com/document/doi/10.1515/phys-2019-0007/html