Abstract
This paper endeavors to bring together two largely disparate areas of research. On one hand, text mining methods treat each document as an independent instance, despite the fact that in many text domains documents are linked and their topics are correlated. For example, web pages on related topics are often connected by hyperlinks, and scientific papers from related fields are typically linked by citations. On the other hand, Social Network Analysis (SNA) typically treats edges between nodes according to "flat" attributes in binary form alone. This paper proposes a simple approach that addresses both of these issues in data mining scenarios involving corpora of linked documents. According to this approach, after assigning weights to the edges between documents, based on the content of the documents associated with each edge, we apply standard SNA and network theory tools to the network. The method is tested on the Enron email corpus and successfully discovers the central people in the organization and the relevant communications between them. Furthermore, our findings suggest that, due to the non-conservative nature of information, conservative centrality measures (such as PageRank) are less adequate here than non-conservative centrality measures (such as eigenvector centrality).
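The pipeline outlined in the abstract (content-based edge weights, followed by standard centrality measures) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the toy messages, the `TOPIC_TERMS` vocabulary, and the keyword-fraction weighting in `content_weight` are hypothetical and are not the weighting scheme used in the paper. The sketch contrasts a non-conservative measure (eigenvector centrality, where a node passes its full score along every out-edge) with a conservative one (PageRank, where a node splits its score among its out-edges).

```python
from collections import defaultdict

# Hypothetical toy corpus: (sender, recipient, message text).
EMAILS = [
    ("alice", "bob", "trading contract energy price"),
    ("bob", "alice", "energy contract approved"),
    ("alice", "carol", "lunch on friday"),
    ("carol", "alice", "energy trading forecast"),
    ("bob", "carol", "contract price update"),
    ("dave", "alice", "parking permit reminder"),
]

# Hypothetical topic vocabulary used to score content relevance.
TOPIC_TERMS = {"energy", "trading", "contract", "price"}


def content_weight(text):
    """Fraction of tokens in `text` that belong to TOPIC_TERMS (an assumed scheme)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in TOPIC_TERMS for t in tokens) / len(tokens)


def build_weighted_graph(emails):
    """Sum content weights per directed edge; drop edges whose weight is zero."""
    totals = defaultdict(float)
    nodes = set()
    for sender, recipient, text in emails:
        nodes.update((sender, recipient))
        totals[(sender, recipient)] += content_weight(text)
    weights = {edge: w for edge, w in totals.items() if w > 0.0}
    return sorted(nodes), weights


def eigenvector_centrality(nodes, weights, iterations=200):
    """Non-conservative: each node sends its full score along every out-edge.

    Power iteration on A + I (same eigenvectors as A) to avoid oscillation.
    """
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        new = dict(score)  # the "+ I" shift
        for (src, dst), w in weights.items():
            new[dst] += w * score[src]
        total = sum(new.values())
        score = {v: s / total for v, s in new.items()}
    return score


def pagerank(nodes, weights, damping=0.85, iterations=200):
    """Conservative: each node splits its score among out-edges by weight."""
    out_total = defaultdict(float)
    for (src, _dst), w in weights.items():
        out_total[src] += w
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Nodes with no weighted out-edges spread their score uniformly.
        dangling = sum(score[v] for v in nodes if out_total[v] == 0.0)
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for (src, dst), w in weights.items():
            new[dst] += damping * score[src] * w / out_total[src]
        score = new
    return score


if __name__ == "__main__":
    nodes, weights = build_weighted_graph(EMAILS)
    for name, ranking in (("eigenvector", eigenvector_centrality(nodes, weights)),
                          ("pagerank", pagerank(nodes, weights))):
        print(name, sorted(ranking.items(), key=lambda kv: -kv[1]))
```

In this toy setup, messages with no topical content contribute no edge weight, so off-topic communication does not affect either ranking beyond PageRank's uniform teleportation term. In practice a library such as NetworkX could replace the hand-written iterations.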
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Berchenko, Y., Daliot, O., Brueller, N.N. (2011). Intra-Firm Information Flow: A Content-Structure Perspective. In: Gama, J., Bradley, E., Hollmén, J. (eds) Advances in Intelligent Data Analysis X. IDA 2011. Lecture Notes in Computer Science, vol 7014. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24800-9_6
DOI: https://doi.org/10.1007/978-3-642-24800-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24799-6
Online ISBN: 978-3-642-24800-9
eBook Packages: Computer Science, Computer Science (R0)