Purely vision-based segmentation of web pages for assistive technology

https://doi.org/10.1016/j.cviu.2016.02.007

Highlights

  • We use a novel vision-based method to analyze the layout of a web page.

  • Our method produces a hierarchical segmentation of the page reflecting its structure.

  • Vision-based methods are not sensitive to implementation language or complexity.

  • The visual presentation of a page provides rich information about semantic structure.

  • This structure can help create modified presentations for users with assistive needs.

Abstract

We propose a system for analyzing the structure of a web page based on purely visual information, rather than on implementation details. This is advantageous because regardless of the complexity of the underlying implementation, the web page is designed to be easily interpreted visually. Our method produces a hierarchical segmentation reflecting the visual structure of the rendered page. This rich information about the presentation of the web page can be used by other systems which produce alternate presentations more suitable for users with visual or cognitive disabilities.

Introduction

We begin with a general overview of our proposed research into vision-based segmentation of web pages, clarifying the anticipated benefit for users with assistive needs. We then focus more specifically on the computer vision research, highlighting the main elements of what we are proposing and their intended contribution.

Today there is an increasing number of users on the Internet with specific assistive needs. Visually complex webpages with dense, information-rich structures can be difficult for these users to navigate. In this paper, we present research in support of providing systems that facilitate access to web content for users with visual or cognitive impairments. At the core of our solution are computer vision algorithms, applied to produce an effective segmentation of webpage content, which can then be employed to deliver alternative, useful depictions of those webpages to users with assistive needs.

Our approach aims to provide semantically rich representations of web page content structure by treating web pages as images to be interpreted using computer vision techniques. In developing this framework, we reflected upon its potential value for a wide range of users with challenges requiring assistive technology. Our initial motivation was to support improved audio screen readers for users who are visually impaired [1]. But our system as designed could also support selective presentation of full content, reducing extraneous elements and emphasizing central elements instead. This may be of particular use for users such as the elderly.

We first present the proposed algorithms for segmentation, clarifying the novelty of the computer vision techniques and presenting a validation of the methodology as sound and effective in capturing webpage content. We then examine a host of user communities who may be well served by a system depicting web content that is guided by our algorithms. We also outline some directions for future research, both in extending the technical solution that is offered and also with respect to conducting user studies to demonstrate usability. In all, we emphasize the value of providing a solution that is not tied to the implementation and underlying code of the webpages, discussing how approaching the challenge from a computer vision standpoint offers important contributions for assistive technology.

The objective of our vision-based method is to determine the hierarchical structure of a web page layout using visual cues, without reference to the implementation of the web page. Our intention is for this system to serve as a back-end system, supporting front-end systems that reformat the web page for presentation to the user. Many such front-end systems, such as screen readers, exist today. Existing back-end systems for depicting web pages may use visual cues, but extract them from visual attributes defined in the code. As code-based analysis is brittle, we want to instead leverage the image of the rendered page. We believe that this approach has three principal advantages:

  1. It does not depend on the quality or implementation language of the underlying code (provided that the browser’s rendering engine can handle it).

  2. It allows for semantically significant divisions within images, Flash objects, and other entities that are treated monolithically in HTML or CSS code.

  3. Perhaps most importantly, it analyzes the web page’s structure using as evidence the page designer’s view of the page (the appearance of the rendered web page).

Essentially, the advantage of an image-based analysis is that it depends not on the details of how the visual structure of the page is produced, but rather on what the visual appearance is. It uses exactly the information seen by users who do not require assistive technology to make the same type of inference about the structure of the page contents. In this paper we present a robust, extensible Bayesian framework, grounded in a formal model of web page appearance, for performing image-based segmentation of a web page, together with a comparison between the results of such an analysis and more traditional code-based techniques. As we shall see, assistive technology systems that rely on source code-based segmentation algorithms face challenges when there are, for example, images or Flash objects in the page. These algorithms would only be able to treat these objects atomically, and would be unable to detect their internal structure. As a result, users who require distracting content to be suppressed would not be able to select only parts of these objects for display.

Related work

Although relatively few researchers have attempted to use vision-based segmentation of web pages to support screen reader technology, there has been considerable work on using vision-based page segmentation in information retrieval and optical character recognition systems. This section examines some prominent or otherwise interesting techniques used in these and other fields (which could constitute the foundation of a back-end system designed to support effective depiction of web pages for

Our proposed vision-based method

Our system takes as input an image of a rendered web page and produces a hierarchical segmentation of the image. The original image is identical to the output of the browser’s rendering engine, as intended by the page’s designer. The image is segmented by first detecting edges in the image, then searching for the segmentation which is best supported by the edge structure. The system is best viewed at three levels, as follows. At the high level, it takes as input the image of a rendered page
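To make the pipeline described above concrete (edges first, then a segmentation supported by those edges, returned as a hierarchy), the following is a minimal sketch and not the authors' algorithm. It assumes a grayscale image stored as a list of rows, measures edge strength as the mean intensity change across a candidate cut line, and replaces the search for the best-supported segmentation with a greedy threshold rule. The names `Segment`, `cut_strength`, `best_cut`, and `segment`, and the threshold of 50, are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale pixels, indexed as image[y][x]


@dataclass
class Segment:
    """A node in the segmentation tree: a box plus its child segments."""
    top: int
    left: int
    bottom: int  # exclusive
    right: int   # exclusive
    children: List["Segment"] = field(default_factory=list)


def cut_strength(img: Image, seg: Segment, axis: str, pos: int) -> float:
    """Mean absolute intensity change across a candidate cut line."""
    if axis == "h":  # horizontal line between rows pos-1 and pos
        diffs = [abs(img[pos][x] - img[pos - 1][x])
                 for x in range(seg.left, seg.right)]
    else:            # vertical line between columns pos-1 and pos
        diffs = [abs(img[y][pos] - img[y][pos - 1])
                 for y in range(seg.top, seg.bottom)]
    return sum(diffs) / len(diffs)


def best_cut(img: Image, seg: Segment) -> Optional[Tuple[float, str, int]]:
    """Return (strength, axis, position) of the strongest interior cut."""
    candidates = [("h", p) for p in range(seg.top + 1, seg.bottom)]
    candidates += [("v", p) for p in range(seg.left + 1, seg.right)]
    best = None
    for axis, pos in candidates:
        strength = cut_strength(img, seg, axis, pos)
        if best is None or strength > best[0]:
            best = (strength, axis, pos)
    return best


def segment(img: Image, seg: Optional[Segment] = None,
            threshold: float = 50.0) -> Segment:
    """Greedily split a region along its strongest edge line, recursively."""
    if seg is None:
        seg = Segment(0, 0, len(img), len(img[0]))
    best = best_cut(img, seg)
    if best is None or best[0] < threshold:
        return seg  # no sufficiently strong edge: this region is a leaf
    _, axis, pos = best
    if axis == "h":
        parts = [Segment(seg.top, seg.left, pos, seg.right),
                 Segment(pos, seg.left, seg.bottom, seg.right)]
    else:
        parts = [Segment(seg.top, seg.left, seg.bottom, pos),
                 Segment(seg.top, pos, seg.bottom, seg.right)]
    seg.children = [segment(img, part, threshold) for part in parts]
    return seg


# Synthetic "page": a dark header above two side-by-side columns.
page = [[0] * 8 for _ in range(2)] + [[100] * 4 + [200] * 4 for _ in range(4)]
root = segment(page)
```

On this synthetic page the root is split into a header and a body, and the body is then split into two columns. A greedy rule like this one is far simpler than a search over whole segmentations, and is shown only to illustrate the shape of the input and the hierarchical output.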

Implementation and experimental results

We present both qualitative and quantitative results of our algorithm. Qualitative results are shown first, followed by a quantitative comparison between segmentations produced by our algorithm and segmentations produced by taking the bounding boxes of the nodes of the DOM tree of the page. We were inspired to design our algorithm to address a variety of challenges faced by source code-based solutions; Appendix B discusses several of these challenges, with practical examples.
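This excerpt does not say which measure is used for the quantitative comparison. As an illustration only, the sketch below uses intersection-over-union (IoU), a common way to compare two sets of boxes, and is an assumption rather than the authors' metric. The box layout `(left, top, right, bottom)` and the function names `iou` and `mean_best_iou` are likewise hypothetical.

```python
from typing import List, Tuple

# (left, top, right, bottom), with right and bottom exclusive
Box = Tuple[int, int, int, int]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def mean_best_iou(predicted: List[Box], reference: List[Box]) -> float:
    """Average, over reference boxes, of the best IoU with any predicted box."""
    if not reference:
        return 0.0
    best_scores = [max((iou(ref, pred) for pred in predicted), default=0.0)
                   for ref in reference]
    return sum(best_scores) / len(reference)
```

Here `reference` could hold boxes from either source, for example the DOM node bounding boxes, with `predicted` holding the segments from the image-based method. Swapping the two arguments gives a complementary score.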

Our test dataset

Discussion

Although our primary focus is on the application of our method to web pages, there are other, similar domains to which it could be applied, perhaps in a slightly modified form. Our method is designed for artificial, designed images (as opposed to the more common use of computer vision for natural images) which convey information about their semantic structure through their visual organization. Other cases of this include images of a desktop windowing system, infographics, and academic papers

Applications

In this section, we first discuss sample users with assistive needs in the context of web use. We then describe examples of assistive systems that could produce alternative depictions of web pages based on a segmentation of the page, and show mock-ups of these interface front-ends. Each of these very different interfaces depends on a high-quality segmentation to provide information about the page structure; segmentation algorithms such as ours are versatile back-end components that can be

Future work

There are two primary directions for future work. The first is to extend our existing model, to make it broader or deeper. This is research focused on computer vision. The second is to explore in greater detail the user community requiring assistive technology based on our proposed model. We outline both of these directions for future research below.

Conclusion

In this paper, we have developed a computer-vision model for determining the segmentation of webpages, which can then be leveraged to offer improved depictions of these pages for users with a variety of assistive needs. Our proposed system is a back-end for use in assistive technology systems. This system supplies the front-end with rich, semantically significant information. We have also explained how our system can be readily extended to provide higher-level information such as segment

Acknowledgments

Thanks to NSERC (Natural Sciences and Engineering Research Council of Canada) for financial support. We also wish to acknowledge the contributions of Shari Trewin from IBM TJ Watson during initial brainstorming of ideas on desiderata for reducing clutter in webpages, as HCI assistive technology; and John A. Doucette, for feedback on an earlier version of the paper. We are grateful as well to the anonymous reviewers for their very helpful comments.

References (62)

  • F. Hassan et al., Web document segmentation for better extraction of information: A review, Int. J. Comput. Appl. (2015)
  • D. Cai et al., VIPS: A Vision-Based Page Segmentation Algorithm, Technical Report, MSR-TR-2003-79 (2003)
  • G. Petasis, A. Theodorakos, P. Fragkou, V. Karkaletsis, Segmenting HTML pages using visual and semantic information,...
  • R. Song et al., Learning important models for web page blocks based on layout and content analysis, SIGKDD Explor. Newsl. (2004)
  • F. Cesarini et al., Structured document segmentation and representation by the modified x-y tree, Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR ’99 (1999)
  • J. Chen et al., Detecting web content function using generalized hidden Markov model, Proceedings of the 5th International Conference on Machine Learning and Applications, ICMLA ’06 (2006)
  • M. Dixon et al., Prefab: Implementing advanced behaviors using pixel-based reverse engineering of interface structure, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2010)
  • B. Krüpl-Sypien et al., A versatile model for web page representation, information extraction and content re-packaging, Proceedings of the 11th ACM Symposium on Document Engineering, DocEng ’11 (2011)
  • TUWIEN Database and Artificial Intelligence Group, TUWIEN Project ABBA: Web Accessibility, URL:...
  • P. Panteleris et al., Vision-based SLAM and moving objects tracking for the perceptual support of a smart walker platform, Proceedings of the 2014 Workshop on Assistive Computer Vision and Robotics (2014)
  • S. Cloix et al., Descending stairs detection with low-power sensors, Proceedings of the 2014 Workshop on Assistive Computer Vision and Robotics (2014)
  • P. Viswanathan et al., An intelligent powered wheelchair for users with dementia: Case studies with noah (navigation and obstacle avoidance help), Proceedings of the AAAI Fall Symposium: Artificial Intelligence for Gerontechnology (2012)
  • Y. Zhong et al., Regionspeak: Quick comprehensive spatial descriptions of complex images for blind users, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2015)
  • M. Talbot et al., Trajectory capture in frontal plane geometry for visually impaired, Proceedings of 2006 International Conference on Auditory Displays (2006)
  • C. Lewis, Issues in web presentation for cognitive accessibility
  • T.A. Hart et al., Evaluating websites for older adults: adherence to ‘senior-friendly’ guidelines and end-user performance, Behav. Inf. Technol. (2008)
  • M. Watanabe et al., Improving accessibility through the visual structure of web contents, Proceedings of the 4th International Conference on Universal Access in Human-Computer Interaction (UAHCI ’07) (2007)
  • C. Asakawa et al., Annotation-based transcoding for nonvisual web access, Proceedings of the ASSETS 2000 (2000)
  • H. Takagi et al., Site-wide annotation: Reconstructing existing pages to be accessible, Proceedings of the ASSETS 2002 (2002)
  • Y. Yesilada et al., Screen readers cannot see (ontology based semantic annotation for visually impaired web travellers)
  • J. Mahmud et al., Csurf: a context-driven non-visual webbrowser, Proceedings of the 16th International Conference on World Wide Web, WWW 2007 (2007)

    Michael Cormier is a PhD student at the Cheriton School of Computer Science, University of Waterloo. He completed a Master’s thesis in computer science at the University of Waterloo in 2013. His current research interests are computer vision and its applications to assistive technology. He also has an interest in vision with unconventional image formation models. Michael is currently supported by a Canadian government NSERC PGS-D scholarship and by institutional scholarships from the University of Waterloo. He also serves as a Graduate Ambassador for the Cheriton School.

    Karyn Moffatt is an assistant professor in the School of Information Studies at McGill University. Her broad research area is Human Computer Interaction (HCI), with a specific focus on the ways in which technology can be employed to meet the needs of older adults and people with disabilities. Prior to joining McGill University, Karyn was a post-doctoral fellow at the University of Toronto supported by awards from NSERC and CIHR’s Health Care, Technology, and Place strategic initiative. She received her doctorate in computer science from the University of British Columbia in 2010.

    Robin Cohen is a Professor at the David R. Cheriton School of Computer Science at the University of Waterloo, in Waterloo, Ontario, Canada. Her research interests are in the subfields of user modeling and multiagent systems, within artificial intelligence. One focus of her current work is on providing streamlined presentation of content to users in online settings such as social networks. She has been a faculty member at the University of Waterloo for over 30 years and is a former Associate Dean of Research in the Faculty of Mathematics. She is also a Senior Member of the AAAI.

    Richard Mann is an Associate Professor in the Cheriton School of Computer Science at the University of Waterloo. His interests are in the areas of Artificial Intelligence, Perception and Learning, Computer Vision and Computer Audio.
