Webstemmer


Download: webstemmer-dist-0.7.1.tar.gz (Python 2.4 or newer is required)


What is it?

Webstemmer is a web crawler and HTML layout analyzer that automatically extracts the main text of a news site without banners, ads, or navigation links mixed in (here is a sample output).

Generally, extracting text from web sites (especially news sites) ends up with lots of unnecessary material: ads and banners. You could craft regular expression patterns to pick up only the desired parts, but constructing such patterns is often tricky and time-consuming. Furthermore, some patterns need to be aware of the surrounding context, and some news sites even use several different layouts.

Webstemmer analyzes the layout of each page on a web site and figures out where the main text is located. The analysis is fully automatic and requires little human intervention; you only need to give the URL of the top page. For more details, see the How It Works? page.

The algorithm works for most well-known news sites. The following table shows the average number of successfully extracted pages out of all the pages obtained from each site per day. For most sites, about 90% of the pages were extracted correctly:

News Site                        Avg. Extracted / Obtained
New York Times                     488.8 /  552.2   (88%)
Newsday                            373.7 /  454.7   (82%)
Washington Post                    342.6 /  367.3   (93%)
Boston Globe                       332.9 /  354.9   (93%)
ABC News                           299.7 /  344.4   (87%)
BBC                                283.3 /  337.4   (84%)
Los Angeles Times                  263.2 /  345.5   (76%)
Reuters                            188.2 /  206.9   (91%)
CBS News                           171.8 /  190.1   (90%)
Seattle Times                      164.4 /  185.4   (89%)
NY Daily News                      144.3 /  147.4   (98%)
International Herald Tribune       125.5 /  126.5   (99%)
Channel News Asia                  119.5 /  126.2   (94%)
CNN                                 65.3 /   73.9   (89%)
Voice of America                    58.3 /   62.6   (94%)
Independent                         58.1 /   58.5   (99%)
Financial Times                     55.7 /   56.6   (98%)
USA Today                           44.5 /   46.7   (96%)
NY1                                 35.7 /   37.1   (95%)
1010 Wins                           14.3 /   16.1   (88%)
Total                             3829.1 / 4349.2   (88%)

How to Use

Text extraction with Webstemmer has the following steps:

  1. Obtain a number of "seed" pages from a particular site.
  2. Learn the layout patterns from the obtained pages.
  3. Later on, obtain updated pages from the same site.
  4. Extract texts from the newly obtained pages using the learned patterns.

Steps 1 and 2 are required only the first time. Once you have learned the layout patterns, you can keep using them to extract texts from newly obtained pages of the same website by repeating steps 3 and 4, until the site's layout changes drastically.

The Webstemmer package includes the following programs:

  * textcrawler.py (web crawler)
  * analyze.py (layout analyzer)
  * extract.py (text extractor)
  * urldbutils.py (URLDB utility)
  * html2txt.py (simpler text extractor)

In previous versions (<= 0.3), all these programs (the web crawler, layout analyzer, and text extractor) were combined into one command. In Webstemmer 0.5 and later they are separate programs.

Step 1. Obtain seed pages

To learn layout patterns, you first need to run a web crawler to obtain the seed pages. The crawler recursively follows the links on each page until it reaches a certain depth (the default is 1 -- i.e. the crawler follows each link from the top page only once) and stores the pages in a .zip file.

(Crawl CNN from its top page.)

$ ./textcrawler.py -o cnn http://www.cnn.com/
Writing: 'cnn.200511210103.zip'
Making connection: 'www.cnn.com'...
...

The resulting .zip file contains the obtained HTML files. Each file name in the archive is prefixed with the timestamp at which the crawl was performed. You can use the .zip file as a seed for learning layout patterns (step 2) or for extracting texts from new pages (step 4).

(View the list of obtained pages.)

$ zipinfo cnn.200511210103.zip
Archive:  cnn.200511210103.zip   699786 bytes   75 files
-rwx---     2.0 fat    59740 b- defN 21-Nov-05 01:03 200511210103/www.cnn.com/
-rw----     2.0 fat    32060 b- defN 21-Nov-05 01:03 200511210103/www.cnn.com/privacy.html
-rw----     2.0 fat    41039 b- defN 21-Nov-05 01:03 200511210103/www.cnn.com/interactive_legal.html
-rw----     2.0 fat    33760 b- defN 21-Nov-05 01:03 200511210103/www.cnn.com/INDEX/about.us/
...
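
If you want to inspect the crawled pages programmatically rather than with zipinfo, a minimal Python sketch using the standard zipfile module could look like this (the archive name is taken from the example above):

import zipfile

archive = zipfile.ZipFile('cnn.200511210103.zip')      # archive name from the example above
for name in archive.namelist():
    data = archive.read(name)                          # raw HTML bytes, exactly as crawled
    print('%s: %d bytes' % (name, len(data)))
archive.close()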

Step 2. Learn the layout patterns

Then you can learn the layout patterns from the obtained pages with analyze.py. The program takes one or more zip files as input and writes the learned layout patterns to standard output.

(Learn the layout patterns from the obtained pages and save them as cnn.pat.)

$ ./analyze.py cnn.200511210103.zip > cnn.pat
Opening: 'cnn.200511210103.zip'...
Added: 1: 200511210103/www.cnn.com/
Added: 2: 200511210103/www.cnn.com/privacy.html
Added: 3: 200511210103/www.cnn.com/interactive_legal.html
Added: 4: 200511210103/www.cnn.com/INDEX/about.us/
...
Fixating....................................................

Learning layout patterns takes O(n^2) time: if learning from 100 pages takes a couple of minutes, learning from about 1,000 pages takes a couple of hours.

The obtained layout patterns are represented in plain-text format. For more details, see Anatomy of pattern files.

Step 3. Obtain new pages

Some time later, suppose you obtained a set of new pages from the same website.

(Crawl again from CNN top page.)

$ ./textcrawler.py -o cnn http://www.cnn.com/
Writing: 'cnn.200603010455.zip'
Making connection: 'www.cnn.com'...
...

(View the obtained HTML pages.)

$ zipinfo cnn.200603010455.zip
Archive:  cnn.200603010455.zip   850656 bytes   85 files
-rwx---     2.0 fat    66507 b- defN  1-Mar-06 04:55 200603010455/www.cnn.com/
-rw----     2.0 fat    33759 b- defN  1-Mar-06 04:55 200603010455/www.cnn.com/privacy.html
-rw----     2.0 fat    42738 b- defN  1-Mar-06 04:55 200603010455/www.cnn.com/interactive_legal.html
-rw----     2.0 fat       85 b- defN  1-Mar-06 04:55 200603010455/www.cnn.com/INDEX/about.us/
...

Step 4. Extract texts from the newly obtained pages

Now you can extract the main texts from the newly obtained pages by using the learned pattern file cnn.pat.

$ ./extract.py cnn.pat cnn.200603010455.zip > cnn.txt
Opening: 'cnn.200603010455.zip'...

Extracted texts are saved as cnn.txt.

$ cat cnn.txt
!UNMATCHED: 200603010455/www.cnn.com/                                             (unmatched page)

!UNMATCHED: 200603010455/www.cnn.com/privacy.html                                 (unmatched page)

!UNMATCHED: 200603010455/www.cnn.com/interactive_legal.html                       (unmatched page)
...

!MATCHED: 200603010455/www.cnn.com/2006/HEALTH/02/09/billy.interview/index.html   (matched page)
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html      (layout pattern name)
SUB-0: CNN.com - Too busy to cook? Not so fast - Feb 9, 2006                      (supplementary section)
TITLE: Too busy to cook? Not so fast                                              (article title)
SUB-10: Leading chef shares his secrets for speedy, healthy cooking               (supplementary section)
SUB-17: Corporate Governance                                                      (supplementary section)
SUB-17: Lifestyle (House and Home)
SUB-17: New You Resolution
SUB-17: Billy Strynkowski
MAIN-20: (CNN) -- A busy life can put the squeeze on healthy eating. But that     (main text)
         doesn't have to be the case, according to Billy Strynkowski, executive
         chef of Cooking Light magazine. He says cooking healthy, tasty meals
         at home can be done in 20 minutes or less.
MAIN-20: CNN's Jason White interviewed Chef Billy to learn his secrets for
         healthy cooking on the run.
...
SUB-25: Health care difficulties in the Big Easy                                  (supplementary section)

!MATCHED: 200603010455/www.cnn.com/2006/EDUCATION/02/28/teaching.evolution.ap/index.html  (another matched page)
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html      (layout pattern name)
SUB-0: CNN.com - Evolution debate continues - Feb 28, 2006                        (supplementary section)
TITLE: Evolution debate continues                                                 (article title)
SUB-17: Schools                                                                   (supplementary section)
SUB-17: Education
MAIN-20: SALT LAKE CITY (AP) -- House lawmakers scuttled a bill that would have   (main text)
         required public school students to be told that evolution is not
         empirically proven -- the latest setback for critics of evolution.
...

Articles are delimited by an empty line. Each article begins with a header line of the form "!MATCHED: pageID" or "!UNMATCHED: pageID", which indicates whether the page's layout was identified. pageID is the name of the page in the zip archive.

When a page layout is identified, the header is followed by a "PATTERN:" line that shows the name of the layout pattern that matched the page, and then by one or more text lines. Each text line begins with "TITLE:", "MAIN-n:", or "SUB-n:", marking the article title section, a main text section, or another supplementary section, respectively. Each paragraph of a text section appears on a separate line.

Each text line begins with a capitalized header and is separated by exactly one newline character. (In the example above, extra newlines were inserted for readability.) You can therefore easily pick out the desired parts with simple text processing tools such as perl or grep. A line beginning with "SUB-n:" is a supplementary section: it was identified as neither the article title nor main text, but is still considered meaningful. The section ID n differs depending on the layout pattern.
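
For instance, a minimal Python sketch that prints the title and main text of every matched article in cnn.txt could look like this (only the field names come from the format described above; everything else is illustrative):

import re

# Articles are delimited by an empty line, so split the output into records.
records = open('cnn.txt').read().split('\n\n')
for rec in records:
    lines = rec.strip().split('\n')
    if not lines or not lines[0].startswith('!MATCHED'):
        continue                                       # skip unmatched pages
    for line in lines:
        if line.startswith('TITLE:'):
            print(line.split(':', 1)[1].strip())       # article title
        elif re.match(r'MAIN-\d+:', line):
            print(line.split(':', 1)[1].strip())       # one paragraph of main text
    print('')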

Installation

Download the tar.gz file. You need Python 2.4 or newer to run this software. No special configuration or installation is required; just run ./analyze.py or ./extract.py from the command line.


textcrawler.py (web crawler)

textcrawler.py is a simple web crawler that recursively crawls a given site and collects text (HTML) files. It is suitable for crawling a medium-scale website (up to 10,000 pages).

textcrawler.py stores the obtained pages in a single .zip file. It supports Mozilla-style cookie files, persistent HTTP connections, and gzip compression. To reduce traffic, it gives the user strict control over its crawling behavior, such as the recursion depth and the URL patterns it may (or may not) fetch. Furthermore, it supports a persistent URL database (URLDB) that keeps track of the URLs visited so far and avoids crawling the same URL repeatedly. Since most news sites use a unique URL for each distinct article, this greatly reduces network traffic.

textcrawler.py uses HTTP persistent connections and gzip compression whenever possible. It also tries to obey the site's robots.txt file. HTTP connections are made only to the host given to the program at startup, i.e. it does not crawl across different hosts; all links that refer to other hosts are ignored.

Most news sites use a unique URL for each article, so normally you do not have to retrieve the same URL twice. textcrawler.py supports the -U option for specifying a URLDB filename. When a URLDB is specified, textcrawler.py records the MD5 hash and last-visited time of each URL in this persistent file. Currently, Berkeley DBM (bsddb) is used for this purpose. This greatly saves crawling time and disk space, since the crawler does not store a page whose URL is already contained in the URLDB. A URLDB file grows as the number of URLs it contains increases; use the urldbutils.py command to reorganize an inflated URLDB file.

textcrawler.py follows a set of regular expression patterns that define which URLs may (or may not) be crawled. A regexp pattern can be given with the -a (accept) or -j (reject) option on the command line. Whether a URL may be crawled is determined by checking the patterns sequentially in the order they were specified. By default, the crawler accepts all URLs that contain the start URL as a substring.
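
The order-sensitive check can be pictured roughly like this; the rule list and the first-match-wins reading are only an illustration, not the crawler's actual code:

import re

# Patterns in the order given on the command line, e.g.
#   -a'^http://www\.boston\.com/news/'  -j'\.pdf$'
# Each entry is (compiled pattern, allowed?).  Purely illustrative.
rules = [(re.compile(r'^http://www\.boston\.com/news/'), True),
         (re.compile(r'\.pdf$'), False)]

def is_allowed(url, default=False):
    # The first matching pattern decides; otherwise fall back to the default rule.
    for pattern, allowed in rules:
        if pattern.search(url):
            return allowed
    return default

print(is_allowed('http://www.boston.com/news/globe/story.html'))   # True
print(is_allowed('http://www.example.org/other.html'))             # False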

Syntax

$ textcrawler.py -o output_filename [options] start_url ...

You need to specify an output filename. A timestamp (YYYYMMDDHHMM) and the extension .zip are automatically appended to this name.

Examples:
(Start from http://www.asahi.com/ with a maximum recursion depth of 2,
 and store the files into asahi.*.zip. Assume euc-jp as the default charset.)
$ textcrawler.py -o asahi -m2 -c euc-jp http://www.asahi.com/

(Start from http://www.boston.com/news/globe/, but pages in
 the upper directory "http://www.boston.com/news/" are also allowed.
 Use the URLDB file boston.urldb.)
$ textcrawler.py -o boston -U boston.urldb -a'^http://www\.boston\.com/news/' http://www.boston.com/news/globe/

Options

-o output_filename
This option is mandatory. It specifies the prefix of the zip filename where all the crawled pages are stored. The actual filename will be of the form "filename.timestamp.zip", where the timestamp (or a string specified by the -b option) is appended after the specified filename.

-m maximum_depth_of_recursive_crawling
Specifies the maximum crawling depth. The default is 1. Increasing this number increases the number of crawled pages exponentially. (On most news sites, depth=1 covers about 100 pages, whereas depth=2 covers about 1000 pages.)

-k cookie_filename
Specifies a cookie file (which should be in Mozilla's cookie.txt format) to use while crawling. Some news sites require cookies to identify users. When a cookie file is specified, textcrawler.py automatically sends the cookies when necessary. textcrawler.py does not store any cookies it obtains during crawling.

-c default_character_set
Usually textcrawler.py tries to follow the HTML charset declared (by <meta> tag) in a page header. If there is no charset declaration, the default value (such as "euc-jp" or "utf-8") is used. textcrawler.py does not detect the character set automatically.

-a accept_url_pattern
Specifies a regular expression pattern that defines which URLs may be fetched. When combined with the -j option, the patterns are checked in the specified order.

-j reject_url_pattern
Specifies a regular expression pattern that defines prohibited URLs. When combined with the -a option, the patterns are checked in the specified order. By default, all URLs ending with jpg, jpeg, gif, png, tiff, swf, mov, wmv, wma, ram, rm, rpm, gz, zip, or class are rejected.

-U URLDB_filename
When specified, textcrawler.py records each visited URL in a persistent URL database (URLDB) and does not crawl that URL again. A URLDB file contains the md5 hashes of URLs (as keys) and their last-visited times (as values); a rough sketch of the idea appears after this option list. When the crawler finds a new URL, it checks the database and filters out URLs that have already been visited. However, when the crawler has not yet reached its maximum recursion depth, intermediate pages are still crawled to obtain links. When you crawl the same site again and again, and you can assume that an interesting page always has a unique URL, this greatly reduces the crawling time.

-b timestamp_string
Overrides the timestamp string. The string specified here is appended to the filename given with the -o option. It is also prepended to the name of each page in the zip file, like "200510291951/www.example.com/...". By default, the string is determined automatically from the current time when the program starts, in the form YYYYMMDDHHMM.

-i index_html_filename
When a URL ends with the "/" character, the filename specified here is appended to it. The default is an empty string (nothing is appended). Note that on some sites the URLs "http://host/dir/" and "http://host/dir/index.html" are distinguished (especially when the site uses Apache's mod_dir module).

-D delay_secs
When this option is specified, the crawler waits the given number of seconds each time it crawls a new page. The default value is 0 (no waiting).

-T timeout_secs
Specifies the network timeout in seconds. The default value is 300 (5 minutes).

-L linkinfo_filename
By default, textcrawler.py stores all the anchor texts (the text surrounded by an <a> tag) in the zip file under the name "linkinfo". analyze.py later uses this information to locate page titles. This option changes the linkinfo filename. When this option is set to an empty string, the crawler does not store any anchor text.

-d
Raises the debug level and displays extra messages.
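
To illustrate the URLDB idea mentioned under the -U option above: the database maps the md5 hash of each URL to its last-visited time, and a URL whose hash is already present is skipped. Here is a rough sketch with a plain in-memory dict (the real database is a persistent Berkeley DBM file, and the exact key/value format is textcrawler.py's own):

import hashlib, time

urldb = {}                                  # stand-in for the persistent Berkeley DBM file

def already_visited(url):
    key = hashlib.md5(url.encode('utf-8')).hexdigest()
    if key in urldb:
        return True                         # seen before: skip this URL
    urldb[key] = time.time()                # remember when this URL was last seen
    return False

print(already_visited('http://www.example.com/story.html'))   # False the first time
print(already_visited('http://www.example.com/story.html'))   # True afterwards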


analyze.py (layout analyzer)

analyze.py performs layout clustering on the HTML files that textcrawler.py has obtained and writes the learned pattern file to standard output. It may take several hours or more, depending on the number of pages to analyze. For example, my machine (a 2GHz Xeon) took 30 minutes to learn the layouts of 300+ pages. (For some reason, Psyco, a Python optimizer, does not accelerate this kind of program; it simply consumed a huge amount of memory without making the program run faster.)

Each layout pattern that analyze.py outputs carries a "score", which indicates how likely pages with that layout are to be articles. The score is calculated from the number of alphanumeric characters in each section of a page, so you can usually remove non-article pages simply by using the -S option to filter out low-scoring layouts. (You can even tune the patterns manually; see Anatomy of pattern files.)
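
As a very rough illustration of the idea only (not the exact formula analyze.py uses), a section's contribution to the score can be thought of as its count of alphanumeric characters:

def section_score(text):
    # Count the alphanumeric characters in one extracted section (illustration only).
    return sum(1 for c in text if c.isalnum())

print(section_score('Too busy to cook? Not so fast'))          # a title-sized section
print(section_score('(CNN) -- A busy life can put the squeeze on healthy eating.'))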

Syntax

$ analyze.py [options] input_file ... > layout_pattern_file

Normally it takes a zip file that textcrawler.py has generated. Multiple input files are accepted; this is useful when using pages obtained from a single site on multiple days.

Examples:
(Learn the layout from cnn.200511171335.zip and cnn.200511210103.zip
 and save the patterns to cnn.pat)
$ analyze.py cnn.200511171335.zip cnn.200511210103.zip > cnn.pat

Instead of a .zip file, it can also take a list of filenames of HTML files obtained by other HTTP clients such as wget:

$ find 200511171335/ -type f | ./analyze.py - linkinfo > cnn.pat
In this case, the directory hierarchy must be the same as the one used by textcrawler.py, i.e. each filename should be of the form timestamp/URL.

Options

Some options are very technical; you may need to understand the algorithm to change them and get the desired effect.

-c default_character_set
Specifies the default character set that is used when there is no charset declaration (<meta> tag) in an HTML file. A different character set is not automatically detected.

-a accept_url_pattern
Specifies a regular expression pattern that defines which URLs are analyzed, in the same manner as textcrawler.py. When combined with the -j option, the patterns are checked in the specified order.

-j reject_url_pattern
Specifies a regular expression pattern that defines prohibited URLs. When combined with the -a option, the patterns are checked in the specified order. By default, analyze.py uses all the pages contained in a given zip file.

-t clustering_threshold
Specifies the layout clustering threshold as a fraction from 0.0 to 1.0. Two pages are put into the same cluster (i.e. treated as having the same layout) if their similarity is equal to or greater than the threshold. A higher threshold makes the distinction between pages stricter, but it may lower the number of matching pages and make each cluster smaller. The default value is 0.97. On some news sites, setting this value to 0.99 or 0.95 may improve layout detection. (A rough sketch of what this threshold controls appears after this option list.)

-T title_detection_threshold
Specifies the threshold of title detection. A layout section whose similarity to the reference anchor text is equal to or greater than this value is used as a candidate for the page title.

-S page_score_threshold
Specifies the threshold of page usefulness. Each layout pattern comes with a "score" that indicates how likely pages with that layout are to be articles, based on the total number of characters in distinct sections. A layout pattern whose score is lower than this threshold is automatically filtered out. The default value is 100; generally, a layout whose score is lower than this is not an article page, and on many news sites most article pages have a layout whose score is above 1000. Setting this value to -1 preserves all the layouts obtained.

-L linkinfo_filename
By default, analyze.py uses the linkinfo file that textcrawler.py creates and stores in the zip file. This file contains the anchor texts (the text surrounded by an <a> tag) referring to each page and is used by the analyzer to locate page titles. This option changes the linkinfo filename searched for in the zip file. When this option is set to an empty string, the analyzer tries to find anchor texts by itself without a linkinfo file, which may slow it down.

-m max_samples
Specifies the number of sample pages used to decide whether a page belongs to a certain cluster. The default is zero, which means all pages are considered. Since the number of possible comparisons grows quadratically, setting this to a low number can make the process significantly faster. In one case involving about 1,000 articles, setting this parameter to 5 cut the computation time roughly in half.

-d
Raises the debug level and displays extra messages.
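
To give a feel for what the -t clustering threshold above controls, here is a generic greedy clustering sketch with a placeholder similarity function; the real measure compares HTML layouts and is not shown here:

def similarity(a, b):
    # Placeholder only: Webstemmer compares page layouts, not raw character sets.
    sa, sb = set(a), set(b)
    return len(sa & sb) / float(len(sa | sb) or 1)

def cluster(pages, threshold=0.97):
    clusters = []
    for page in pages:
        for c in clusters:
            if similarity(page, c[0]) >= threshold:   # close enough to the cluster seed
                c.append(page)
                break
        else:
            clusters.append([page])                   # start a new layout cluster
    return clusters

print(len(cluster(['aaab', 'aab', 'zzzz'], threshold=0.5)))   # 2 clusters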


extract.py (text extractor)

extract.py reads a layout pattern file and tries to extract the texts from a set of HTML pages. It takes a zip file (or, alternatively, a directory name) as input and writes the extracted text to standard output.

Syntax

$ extract.py [options] pattern_filename input_filename ... > output_text
Example:
(Extract the texts from asahi.200510220801.zip using pattern file asahi.pat,
   and store them in shift_jis encoding into file asahi.200510220801.txt)
$ extract.py -C shift_jis asahi.pat asahi.200510220801.zip > asahi.200510220801.txt

Options

-C output_text_encoding
Specifies the encoding of output texts (page titles and main texts). The default value is utf-8.

-c default_character_set
Specifies the default character set that is used when there is no charset declaration (<meta> tag) in an HTML file. A different character set is not automatically detected.

-a accept_url_pattern
Specifies a regular expression pattern that defines which URLs are used, in the same manner as textcrawler.py. When combined with the -j option, the patterns are checked in the specified order.

-j reject_url_pattern
Specifies a regular expression pattern that defines prohibited URLs. When combined with the -a option, the patterns are checked in the specified order. By default, extract.py uses all the pages contained in a given zip file.

-t layout_similarity_threshold
Specifies the minimum similarity score used when comparing each page with the layout patterns. The default value is 0.8. extract.py tries to identify the layout of a page by finding the most similar layout pattern; if the highest similarity is still less than this threshold, the page is rejected and "!UNMATCHED" is printed. Usually you do not need to change this value.

-S
Strict mode. It requires each page to have all the layout blocks of a layout pattern. This extracts only strictly conforming pages, but it may lower the number of pages that are successfully extracted.

-T diffscore_threshold
If the diffscore of a layout block is equal to or greater than this value, extract.py recognizes it as a "variable block". The default value is 0.5.

-M mainscore_threshold
If the mainscore of a layout block is equal to or greater than this value, extract.py recognizes it as a "main block". The default value is 50.

-d
Raises the debug level and displays extra messages.


urldbutils.py (URLDB utility)

urldbutils.py removes stale URLs from a URLDB file to shrink it. When textcrawler.py uses a URLDB, it keeps adding the md5 hash of each newly found URL to the database file, which makes the file grow gradually. It also records the time each URL was last seen. If a URL has not been seen for a certain time, it can safely be removed from the database.
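
Conceptually, reorganization just copies the entries whose last-seen time is recent enough into a fresh database. Here is a rough sketch of the idea using plain dicts (the real files are Berkeley DBM databases, and this is not the actual implementation):

import time

def prune(old_db, max_age_days):
    # Keep only the entries (md5 hash -> last-seen time) seen within max_age_days.
    cutoff = time.time() - max_age_days * 86400
    return dict((key, seen) for key, seen in old_db.items() if seen >= cutoff)

old = {'d41d8cd98f00b204e9800998ecf8427e': time.time() - 20 * 86400,   # stale entry
       '0cc175b9c0f1b6a831c399e269772661': time.time()}                # fresh entry
print(len(prune(old, 10)))   # 1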

Syntax

$ urldbutils.py {-D | -R} [options] filename [old_filename]

You need to choose either display mode (-D) or reorganize mode (-R). Display mode is mainly for debugging. When you rebuild a DBM file, two filenames (new and old) must be specified. For safety, it does not run when the new file already exists.

Example:
(Remove URLs which haven't been seen for 10 or more days, and
    rebuild a new URLDB file myurldb.new.)
$ urldbutils.py -R -t 10 myurldb.new myurldb
$ mv -i myurldb.new myurldb
mv: overwrite `myurldb'? y

Options

-D
Displays the content of the URLDB file (md5 hash + last seen time).

-R
Removes the URLs in the URLDB file that have not been seen for a certain time, and rebuilds a new URLDB file. You need to specify two filenames (new, old) and the -t option (threshold).

-t days
Specifies, in days, how long a URL may go unseen before it is removed.

-v
Verbose mode. Displays the entries that are removed in -R mode.


html2txt.py (simpler text extractor)

html2txt.py is a much simpler text extractor (an HTML tag stripper) that does not use any predefined pattern. It simply removes all HTML tags from the input files, along with any JavaScript or stylesheet content surrounded by <script>...</script> or <style>...</style> tags.
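
For comparison, a rough functional equivalent can be written in a few lines with Python 3's standard html.parser module (an illustration only, not html2txt.py itself):

from html.parser import HTMLParser

class TextOnly(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0                      # nesting depth of <script>/<style>
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)        # keep only text outside script/style

p = TextOnly()
p.feed(open('index.html').read())
print(''.join(p.parts))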

Syntax

$ html2txt.py [options] input_filename ... > output_text
Example:
$ html2txt.py index.html > index.txt

Options

-C output_text_encoding
Specifies the encoding of output texts (page titles and main texts). The default value is utf-8.

-c default_character_set
Specifies the default character set that is used when there is no charset declaration (<meta> tag) in an HTML file. A different character set is not automatically detected.

Bugs


Changes


Terms and Conditions

(This is so-called MIT/X License.)

Copyright (c) 2005-2009 Yusuke Shinyama <yusuke at cs dot nyu dot edu>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Last Modified: Mon Jun 15 19:41:35 JST 2009

Yusuke Shinyama