Automatic Extraction of Complex Web Data

Abstract

This paper presents WTM, a new wrapper induction algorithm that generates rules describing the general layout template of a web page. WTM is designed mainly for use in weblog crawling and indexing systems. Most weblogs are maintained by content management systems, so all of their pages share a similar layout structure. In addition, weblogs provide RSS feeds that describe their latest entries, and the same entries also appear in HTML on the weblog homepage. WTM builds on these two observations: it uses the RSS feed data to automatically label the corresponding HTML file (the weblog homepage) and induces general template rules from the labeled page. The rules can then be used to extract data from other pages that share the same layout template. WTM was tested on a selection of weblogs, with satisfactory results.
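
The label-then-induce idea described in the abstract can be illustrated with a small sketch. This is not the WTM algorithm itself, whose details the abstract does not give; it is a simplified stand-in that assumes a "rule" is just the tag path (tag names plus class attributes) shared by the HTML nodes whose text matches RSS entry titles. All names here (`PathTextParser`, `induce_rule`, `extract`) and the sample pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collects (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, e.g. ["html", "body", "div.post"]
        self.nodes = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def induce_rule(feed_values, html):
    """Label nodes whose text equals a feed value; return the most common path."""
    wanted = {" ".join(v.split()) for v in feed_values}
    paths = Counter(path for path, text in text_nodes(html) if text in wanted)
    return paths.most_common(1)[0][0] if paths else None


def extract(rule, html):
    """Apply an induced path rule to another page with the same template."""
    return [text for path, text in text_nodes(html) if path == rule]


# Hypothetical homepage and the entry titles taken from its RSS feed.
homepage = """
<html><body>
<div class="nav"><a>Home</a></div>
<div class="post"><h2 class="title">First entry</h2><p>Body one</p></div>
<div class="post"><h2 class="title">Second entry</h2><p>Body two</p></div>
</body></html>
"""
feed_titles = ["First entry", "Second entry"]

# Hypothetical archive page that shares the homepage layout but is not in the feed.
archive = """
<html><body>
<div class="nav"><a>Home</a></div>
<div class="post"><h2 class="title">Older entry</h2><p>Body three</p></div>
</body></html>
"""

title_rule = induce_rule(feed_titles, homepage)
print(title_rule)                    # html/body/div.post/h2.title
print(extract(title_rule, archive))  # ['Older entry']
```

Exact string matching and a single most-frequent path keep the sketch short; real weblog pages would likely need fuzzier matching between feed text and page text, and rules that tolerate small variations in nesting.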

Recommended Citation

Zhang, Ming; Zhou, Ying; and Patrick, Jon, "Automatic Extraction of Complex Web Data" (2006). PACIS 2006 Proceedings. 66.
https://aisel.aisnet.org/pacis2006/66

PACIS 2006 Proceedings
