Webstemmer - How It Works



Webstemmer is based on the following assumption about web pages: a page layout can be identified by finding invariant features that are preserved across different pages. Webstemmer therefore tries to remove the unchanged parts (which are mostly banners and navigation links) shared by pages with the same layout. The process consists of three steps:

1. Cluster pages based on their layout structure.

2. Align pages which have the same layout.

3. Remove the common parts.

There are two techniques used extensively in this algorithm. One is edit distance (Levenshtein distance), which is used to calculate the similarity between two distinct strings. The other is clustering, which groups similar objects based on their distance under some metric. Both techniques have been studied since the 1960s.
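To make the first technique concrete, here is a minimal sketch (not Webstemmer's actual code) of edit distance computed by dynamic programming:

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b."""
    # prev[j] holds the distance between a[:i-1] and b[:j];
    # we keep only one previous row to save memory.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]
```

The same function works on any sequences, which is why it applies both to strings and, later, to sequences of layout blocks.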

Analyzing Layouts - analyze.py

Layout analysis extracts a "pattern" from HTML pages by clustering pages whose structures (layouts) are similar to each other. To perform clustering, one has to compare multiple HTML pages and compute their similarity from the features of each page. Pages that are similar to each other are grouped together, and their common features are extracted as the pattern of the group (cluster). Here is the basic algorithm used in the analyze.py program.

Step 1. Parsing the page

First, analyze.py decomposes each HTML page into a sequence of "layout blocks" (Fig. 1). Although HTML elements normally form a tree structure, we take only HTML block elements such as <p>, <div>, <h1> and the <title> element. We thus interpret an HTML page as a sequence of layout blocks, which serve as the features for computing the similarity of different pages. Here is an example of an HTML page and its layout blocks:
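A simplified sketch of this decomposition, using only the standard library (the tag set here is an assumed subset; the real analyzer uses a longer list and records full tag paths with attributes, as seen in the pattern file below):

```python
from html.parser import HTMLParser

# Block-level tags treated as layout blocks (assumed subset).
BLOCK_TAGS = {'p', 'div', 'h1', 'h2', 'h3', 'li', 'td', 'title'}

class BlockExtractor(HTMLParser):
    """Flatten an HTML page into a sequence of (tag, text) layout blocks."""
    def __init__(self):
        super().__init__()
        self.blocks = []   # finished (tag, text) pairs, in document order
        self.stack = []    # currently open block tags
        self.buf = []      # text accumulated for the innermost open block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack and self.stack[-1] == tag:
            self._flush()
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and data.strip():
            self.buf.append(data.strip())

    def _flush(self):
        if self.buf and self.stack:
            self.blocks.append((self.stack[-1], ' '.join(self.buf)))
        self.buf = []

def layout_blocks(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Feeding a page through `layout_blocks` yields the flat sequence that all the later similarity computations operate on.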

Fig. 1. Extracting a sequence of layout blocks from HTML.

Step 2. Computing similarity

The similarity between two pages is determined by the edit distance between their layout block sequences. This is similar to comparing two strings (as diff does). First, obtain the maximum common sequence (MCS) of layout blocks, then compute the ratio of common layout blocks (Fig. 2) between two sequences A and B, as in:

Sim(A, B) = 2 × [ Σe in MCS(A,B) Weight(e) ] / [ Σe in A Weight(e) + Σe in B Weight(e) ]
(The factor of 2 makes two identical sequences score 1.0.) This way, we can tolerate a redundant element that is accidentally inserted in the middle of a sequence. We use the number of alphabetic characters contained in each layout block e as its weight.
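A sketch of this similarity, treating each layout block as a plain string and counting the MCS weight once for each sequence so that identical pages score 1.0 (an assumption consistent with the cluster threshold near 1.0 in the pattern file below):

```python
def weight(block):
    # Assumed weight: the number of alphabetic characters in the block.
    return sum(c.isalpha() for c in block)

def mcs_weight(a, b):
    """Total weight of the maximum common sequence (MCS) of two block lists,
    via the classic LCS dynamic program, maximizing weight instead of length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + weight(x)
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def sim(a, b):
    """Weighted ratio of common blocks; 1.0 for identical sequences."""
    total = sum(weight(e) for e in a) + sum(weight(e) for e in b)
    return 2.0 * mcs_weight(a, b) / total if total else 0.0
```

Because the MCS skips over non-matching elements, an extra ad block inserted into one page only lowers the score slightly instead of breaking the alignment.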

Fig. 2. Comparing layout block sequences.

Step 3. Clustering and generating a layout pattern

To perform clustering, one needs to compute the similarity scores for all possible combinations of N pages, namely N × N pairs. The algorithm used in the analyze.py program tries to reduce the number of combinations, so the actual number of comparisons may be less than N × N. Pages whose similarity exceeds a certain threshold are grouped into the same cluster; the clustering is performed in a hierarchical manner. This threshold is given by the '-t' option of the analyze.py program. After obtaining the clusters, we extract the common layout blocks contained in all the pages of a group (cluster) (Fig. 3). These layout blocks form a "layout pattern", which is the output of this program.
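A greatly simplified stand-in for this step (analyze.py's actual clustering is hierarchical; here each page greedily joins the first cluster whose representative is similar enough):

```python
def cluster_pages(pages, sim, threshold=0.97):
    """Group pages whose layout similarity exceeds `threshold`.

    `sim` is any pairwise similarity function (e.g. the weighted
    MCS ratio of layout block sequences); each page joins the first
    cluster whose first member is similar enough, else starts a new one.
    """
    clusters = []
    for page in pages:
        for cl in clusters:
            if sim(cl[0], page) >= threshold:
                cl.append(page)
                break
        else:
            clusters.append([page])
    return clusters
```

Using a cluster representative avoids comparing every page against every other page, which is one way the number of comparisons can fall below N × N.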

Fig. 3. Generating a layout pattern.

Step 4. Removing banners and navigation links

The next step is to find banners and/or navigation links in each layout pattern. Each layout block is given a value called "DiffScore", which indicates how much the texts in that block vary from page to page:

DiffScore(E) = [ Σs1,s2 in E ( Weight(s1) + Weight(s2) − 2 × |MCS(s1, s2)| ) ] / [ Σs1,s2 in E ( Weight(s1) + Weight(s2) ) ]
where si ranges over the strings contained in a particular layout block E across all the pages that have the same layout. A layout block whose DiffScore is less than a certain threshold is considered static, and is therefore removed at a later stage (Fig. 4), because most ads or banners do not change within the same layout. The DiffScore threshold is given by the '-T' option of the extract.py program.
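A sketch of this score over all pairs of a block's texts, approximating Weight(s) as len(s) and |MCS| as the longest-common-subsequence length (both simplifying assumptions):

```python
from itertools import combinations

def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def diffscore(texts):
    """DiffScore over the texts one layout block holds across pages.

    0.0 means the text never changes (static: a banner or nav bar);
    values near 1.0 mean it differs on every page (likely content).
    """
    num = den = 0
    for s1, s2 in combinations(texts, 2):
        num += len(s1) + len(s2) - 2 * lcs_len(s1, s2)
        den += len(s1) + len(s2)
    return num / den if den else 0.0
```

Note how this matches the pattern file below: navigation blocks score 0.0 while the article-body block scores about 0.99.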

Fig. 4. Finding the common texts between different pages.

Step 5. Discovering the title and main text

After filtering out static texts, we try to discover the layout block that holds the page title. The title block is determined by comparing the text contained in each block with one of the anchor texts (a string surrounded by a link tag <a> ... </a>) that refer to that page. On most news sites, the anchor text of a link to an article is known to represent the title of that article. So we compute the similarity between the text of each layout block and the anchor texts, and choose the block with the highest similarity (Arrow 1 in Fig. 5). This, too, is based on the edit distance of strings. The title block is determined in the clustering process (i.e. analyze.py), whereas the range of main texts can be adjusted dynamically during text extraction.

There is an alternative way to find a title block. When no anchor text is available for a page, we take the layout block that is most similar to the main text and that appears before the main text block as the title (Arrow 2 in Fig. 5). This was the only way of finding a title in previous versions of Webstemmer, but it is now a fallback method, because its title-detection accuracy is not as good as the first method's. The threshold on the similarity between title and main text is given by the '-T' option of analyze.py.
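The primary (anchor-text) method can be sketched as follows; Webstemmer itself uses edit distance, while difflib's `SequenceMatcher.ratio()` is used here as a stand-in string similarity:

```python
from difflib import SequenceMatcher

def find_title_block(blocks, anchor_text):
    """Pick the layout block whose text is most similar to the anchor
    text that linked to this page.

    `blocks` is a list of block texts; returns (index, similarity).
    """
    scored = [(SequenceMatcher(None, text, anchor_text).ratio(), i)
              for i, text in enumerate(blocks)]
    best_ratio, best_i = max(scored)
    return best_i, best_ratio
```

In analyze.py this choice is made once per cluster, so every page matching that layout pattern inherits the same title block.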

Fig. 5. Discovering the title and main text from the page.

In order to find the main text of each page, we compute a "MainScore" for each layout block, which indicates how much significant text the block contains:

MainScore(E) = DiffScore(E) × [ Σs in E Weight(s) ] / |E|
A layout block that is not a title and whose MainScore exceeds a certain threshold is considered a main text block. Since main texts are often split into several layout blocks, we take every layout block whose MainScore exceeds the threshold. This threshold is given by the '-M' option of the extract.py program.
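The selection rule above can be sketched as follows, again approximating Weight(s) as len(s); the threshold value is hypothetical:

```python
def mainscore(diff, texts):
    """MainScore(E) = DiffScore(E) x (total text weight / number of texts)."""
    if not texts:
        return 0.0
    return diff * sum(len(s) for s in texts) / len(texts)

def pick_main_blocks(blocks, threshold=1.0, title_index=None):
    """Select every non-title block whose MainScore exceeds the threshold
    (main texts are often split across several blocks).

    `blocks` is a list of (diffscore, texts-across-pages) pairs.
    """
    return [i for i, (diff, texts) in enumerate(blocks)
            if i != title_index and mainscore(diff, texts) > threshold]
```

A static nav bar gets MainScore 0 (its DiffScore is 0) no matter how long its text is, which is why multiplying by DiffScore matters.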

The sequence of layout blocks in each cluster is stored in a layout pattern file. Each layout block in a cluster has a DiffScore and a MainScore. Each cluster can then be given a significance score, indicating how important that cluster is. This is based on the number of significant characters (excluding anchor texts) in the main text and the number of pages in the cluster:

LayoutPatternScore(P) = log(|P|) × ΣE in P MainScore(E)

The details of the pattern file format are described in Anatomy of Pattern Files.

Extracting Texts - extract.py

The extract.py program takes a layout pattern file generated by analyze.py and extracts the main text and its title from HTML pages. First it decomposes each HTML page into layout blocks in the same way as the analyzer. Then it tries to find the most similar sequence of layout blocks in the pattern file (Fig. 6). If the similarity exceeds a certain threshold (given by the '-t' option), the page is considered to have the same layout as that pattern.

Fig. 6. Searching for a layout pattern that matches the given page.

You can add an extra restriction by using "strict mode" (the -S option), which prevents incomplete matching. In strict mode, the program rejects a layout pattern if any of its layout blocks is missing from the page, no matter how many other layout blocks overlap. This lets you identify page layouts with improved accuracy. However, for some newspapers (such as several U.S. newspapers that use slightly different layouts each day), you might get lower recall, i.e. the number of matching pages might decrease.

After determining the page layout pattern, the program extracts texts from the layout blocks selected by the user-specified thresholds, and outputs each of them as TITLE:, MAIN: or SUB: according to its DiffScore and MainScore values.
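The matching step, including the strict-mode restriction, can be sketched like this (a hypothetical helper, not extract.py's actual code; `sim` is any block-sequence similarity such as the weighted MCS ratio):

```python
def match_pattern(page_blocks, patterns, sim, threshold=0.8, strict=False):
    """Return the pattern most similar to the page, or None.

    In strict mode (-S), a pattern is rejected unless every one of its
    blocks appears somewhere in the page.
    """
    best, best_sim = None, threshold
    for pat in patterns:
        if strict and not all(b in page_blocks for b in pat):
            continue
        s = sim(pat, page_blocks)
        if s >= best_sim:
            best, best_sim = pat, s
    return best
```

A page that matches no pattern above the threshold is simply skipped, which is why the choice of '-t' trades precision against recall.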

Anatomy of Pattern Files

analyze.py outputs patterns as a text file with one layout pattern per line. Each pattern contains several values formatted by Python's repr() function. Empty lines and lines beginning with '#' are regarded as comments. Each pattern is preceded by a few comment lines showing the score of the pattern (indicating how likely the pages are articles) and the IDs of the original pages that belonged to that pattern during learning.

The following sample pattern was obtained from cnn.com. (For the reader's convenience, the pattern is split into multiple lines here; this is not allowed in an actual pattern file.)

### version=0.6.0                (Webstemmer Version)
### fname='cnn.200511210103.zip' (Source file used for learning)
### cluster_threshold=0.970000   (Clustering threshold of page similarity)
### title_threshold=0.600000     (Threshold for detecting a title)
### pages=74                     (Total number of pages used for learning)

# 3759.96885103 <200511210103/www.cnn.com/2005/EDUCATION/11/18/principal.shaming.ap/index.html> (pattern 1)
#       (Pages which belong to this cluster)
#       200511210103/www.cnn.com/2005/EDUCATION/11/18/principal.shaming.ap/index.html
#       200511210103/www.cnn.com/2005/WORLD/meast/11/20/iran.nuclear.ap/index.html
#       200511210103/www.cnn.com/2005/EDUCATION/11/18/student.progress.ap/index.html
#       200511210103/www.cnn.com/2005/TRAVEL/DESTINATIONS/11/18/homer.exhibit.ap/index.html
#       200511210103/www.cnn.com/2005/US/11/20/lost.in.morgue.ap/index.html
(3759.9688510285455,                                            (Overall score of the cluster)
 '200511210103/www.cnn.com/2005/EDUCATION/11/18/principal.shaming.ap/index.html', (Cluster ID)
 9,                                                             (Index of the title layout block in the sequence)
 [                                                              (List of layout blocks)
  # (diffscore, mainscore, block feature)
  (0.54130434782608694, 24.899999999999999, 'title'), 
  (0.0, 0.0, 'table:align=right/tr:valign=middle/td:class=cnnceilb'),
  (0.0, 0.0, 'table:id=cnnceil/tr/td:class=cnnceilw'),
  (0.052631578947368418, 0.0, 'ul:id=nav/li/div'),
  (0.0, 0.0, 'ul:id=nav/li:class=money/div'),
  (0.0,0.0, 'ul:id=nav/li:class=sports/div'),
  (0.047314578005115092, 0.0, 'ul:id=nav/li/div'),
  (0.0, 0.0, 'ul:id=nav/li:class=autos/div'),
  (0.0, 0.0, 'td:id=cnnnavbar:rowspan=2/div:class=cnnnavbot/div'),
  (0.76451612903225807, 23.699999999999999, 'tr:valign=top/td:id=cnnarticlecontent/h1'),
  (0.085365853658536592, 1.3999999999999999, 'div:class=cnninteractiveelementscontainer/div:class=cnnieheader/h3'),
  (0.73124999999999996, 23.399999999999999, 'div:class=cnniebox/div:class=cnnieboxcontent/div:class=cnnemailalertoptionrow'),
  (0.0, 0.0, 'div:class=cnniebox/div:class=cnnieboxcontent/div:class=cnnalertsbuttonrow'),
  (0.0, 0.0, 'div:class=cnninteractiveelementscontainer/div:class=cnniebox/div:class=cnnieboxcontent cnnalertsfooterrow'),
  (0.99176016830294533, 2262.8000000000002, 'tr:valign=top/td:id=cnnarticlecontent/p'),
  (0.0, 0.0, 'td:id=cnnarticlecontent/div:class=cnnstorycontrib/p'),
  (0.0, 0.0, 'tr:valign=top/td/div:class=cnnstorytools'),
  (0.0, 0.0, 'table:id=cnnstorytoolstimebox/tr/td'),
  (0.37226277372262773, 0.0, 'tr:valign=top/td/div:class=cnnbinnav'),
  (0.7265625, 0.0, 'table:class=cnnstoryrelatedstopstory/tr/td'),
  (0.66804979253112029, 0.0, 'td/div:class=cnnstorybinsublk/div'),
  (0.0, 0.0, 'tr:valign=top/td/div:class=cnnbinnav'),
  (0.0, 0.0, 'table:class=cnnstoryrelatedstopstory/tr/td'),
  (0.0, 0.0, 'td/div:class=cnnstorybinsublk/div'),
  (0.0, 0.0, 'tr/td/div:class=cnn4pxlpad'),
  (0.0, 0.0, 'table/tr/td'),
  (0.0, 0.0, 'table/tr/td:align=right:class=cnn7pxrpad'),
  (0.0, 0.0, 'table:id=cnnfoot/tr:valign=top/td'),
  (0.0, 0.0, 'table/tr/td:valign=top'),
  (0.0, 0.0, 'table/tr:class=cnnnopad/td')])
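Since each pattern is a Python repr on a single line, a pattern file can be read back with a small sketch like this (a hypothetical helper, not extract.py's actual loader; `ast.literal_eval` is used here as a safe replacement for evaluating repr output):

```python
from ast import literal_eval

def load_patterns(path):
    """Read a layout pattern file: one repr'd tuple per line,
    skipping blank lines and '#' comments."""
    patterns = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            patterns.append(literal_eval(line))
    return patterns
```

Each loaded tuple then carries the cluster score, cluster ID, title block index, and the list of (DiffScore, MainScore, block feature) triples described above.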


Well, for most web pages the text could probably be extracted in a much easier way, e.g. by picking up lines with a certain number of words or punctuation marks... so the attempt explained on this page is basically overkill. However, it just looks cool.

Last Modified: Thu Sep 6 12:58:13 JST 2007

Yusuke Shinyama