Abstract
Using JavaScript and dynamic DOM manipulation on the client side is becoming a widespread approach for achieving rich interactivity and responsiveness in modern Web applications. At the same time, such techniques, collectively known as Ajax, shatter the concept of webpages with unique URLs, on which traditional Web crawlers are based. This article describes a novel technique for crawling Ajax-based applications through automatic dynamic analysis of user-interface-state changes in Web browsers. Our algorithm scans the DOM tree, spots candidate elements that are capable of changing the state, fires events on those candidate elements, and incrementally infers a state machine that models the various navigational paths and states within an Ajax application. This inferred model can be used, for instance, in program comprehension, in analysis and testing of dynamic Web states, or for generating a static version of the application. In this article, we discuss our sequential and concurrent Ajax crawling algorithms. We present our open-source tool called Crawljax, which implements the concepts and algorithms discussed in this article. Additionally, we report several empirical studies in which we apply our approach to a number of open-source and industrial Web applications and elaborate on the obtained results.
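The crawling loop described above (find candidate elements, fire events, record new states and transitions) can be illustrated with a minimal Python sketch. It is not Crawljax's API and does not drive a real browser: `TOY_APP`, `candidate_elements`, `fire_event`, and `crawl` are hypothetical names, the "application" is a hand-written lookup table, and two DOM snapshots are assumed to be the same state only when their strings are identical.

```python
from collections import deque

# Hypothetical toy application: each DOM snapshot (a string) maps the ids of
# its clickable elements to the DOM snapshot that results from clicking them.
HOME = "<div><a id='news'>News</a><a id='about'>About</a></div>"
NEWS = "<div><a id='home'>Home</a><p>Headlines</p></div>"
ABOUT = "<div><a id='home'>Home</a><p>About us</p></div>"
TOY_APP = {
    HOME: {"news": NEWS, "about": ABOUT},
    NEWS: {"home": HOME},
    ABOUT: {"home": HOME},
}


def candidate_elements(dom):
    """Return ids of elements that may change the state (here: a table lookup)."""
    return sorted(TOY_APP[dom])


def fire_event(dom, element_id):
    """Simulate a click on element_id and return the resulting DOM snapshot."""
    return TOY_APP[dom][element_id]


def crawl(initial_dom):
    """Explore breadth-first and incrementally build a state machine."""
    states = {initial_dom: 0}   # DOM snapshot -> state id
    transitions = []            # (source id, element id, target id)
    queue = deque([initial_dom])
    while queue:
        dom = queue.popleft()
        for element_id in candidate_elements(dom):
            new_dom = fire_event(dom, element_id)
            if new_dom not in states:
                # Assumption: exact string equality decides state identity.
                states[new_dom] = len(states)
                queue.append(new_dom)
            transitions.append((states[dom], element_id, states[new_dom]))
    return states, transitions


states, transitions = crawl(HOME)
print(len(states), "states,", len(transitions), "transitions")
```

Starting from `HOME`, the sketch discovers three states and four click transitions. A crawler working against a live browser would additionally need a way to return to an earlier state before firing the next event, which the lookup table sidesteps here.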
Index Terms
- Crawling Ajax-Based Web Applications through Dynamic Analysis of User Interface State Changes