Logweeder

Homepage

Logweeder is a log analyzer. It scans UN*X logs (syslog), categorizes common events, and discovers uncommon events automatically.

In most UN*X systems, syslog records one event per line. Logweeder scans syslog output (e.g. /var/log/messages) and identifies "uncommon" events. To detect uncommon entries, the program first learns "common" log patterns from existing log files. It does this by grouping similar log entries and generalizing each group into a regular expression pattern. You can then use the generated patterns to classify log entries of a known type; an entry that is not categorized into any known type is possibly an uncommon one.

Download: logweeder-0.2.tar.gz (gzipped tar, 6kbytes)


How to Use

Logweeder consists of two programs: learn.py, which learns patterns from existing log files, and match.py, which classifies log entries using those patterns.

First, run learn.py against your log files to learn regexp patterns. The generated patterns are written to the standard output.

$ ./learn.py /var/log/messages > mypattern
Clustering.++..+.......

The generated file mypattern should look like this:

('type-0', 8, '^mango\\ rpc\\.mountd\\:\\ authenticated\\ [a-zA-Z_]*\\ request\\ from\\ [a-zA-Z_]*\\.xx\\.xxx\\.xxx\\:[0-9]*\\ for\\ .*\\/[a-zA-Z_]*\\ \\(.*\\/[a-zA-Z_]*\\)', ['authenticated', 'for', 'xxx', 'request', 'xx', 'mountd', 'rpc', 'mango', 'from'])
# mango rpc.mountd: authenticated unmount request from kiwi.xx.xxx.xxx:686 for /data (/data)
# mango rpc.mountd: authenticated unmount request from kiwi.xx.xxx.xxx:689 for /home (/home)
# mango rpc.mountd: authenticated mount request from banana.xx.xxx.xxx:613 for /home (/home)
# mango rpc.mountd: authenticated mount request from kiwi.xx.xxx.xxx:697 for /home (/home)
# mango rpc.mountd: authenticated mount request from kiwi.xx.xxx.xxx:708 for /data (/data)
# mango rpc.mountd: authenticated mount request from grape.xx.xxx.xxx:1023 for /usr/local (/usr/local)
# mango rpc.mountd: authenticated unmount request from banana.xx.xxx.xxx:880 for /home (/home)
# mango rpc.mountd: authenticated mount request from grape.xx.xxx.xxx:1023 for /usr/local (/usr/local)

('type-1', 3, '^mango\\ kernel\\:\\ Packet\\ log\\:\\ input\\ REJECT\\ eth1\\ PROTO\\=[0-9]*\\ [0-9]*\\.[0-9]*\\.[0-9]*\\.[0-9]*\\:[0-9]*\\ 192\\.168\\.0\\.61\\:[0-9]*\\ L\\=[0-9]*\\ S\\=0x00\\ I\\=[0-9]*\\ F\\=0x4000\\ T\\=.*\\ \\(\\#[0-9]*\\)', ['kernel', '0x4000', 'eth1', 'log', 'PROTO', '0x00', 'input', 'F', 'I', 'L', 'S', 'T', 'Packet', 'mango', 'REJECT'])
# mango kernel: Packet log: input REJECT eth1 PROTO=6 128.122.80.107:61887 192.168.0.61:113 L=48 S=0x00 I=4036 F=0x4000 T=63 SYN (#10)
# mango kernel: Packet log: input REJECT eth1 PROTO=17 128.105.143.14:41385 192.168.0.61:9618 L=64 S=0x00 I=0 F=0x4000 T=53 (#11)
# mango kernel: Packet log: input REJECT eth1 PROTO=6 133.15.94.103:33271 192.168.0.61:113 L=60 S=0x00 I=44595 F=0x4000 T=48 SYN (#10)

('type-2', 1, '^mango\\ rpc\\.mountd\\:\\ export\\ request\\ from\\ 192\\.168\\.0\\.70', ['from', 'request', 'mountd', 'rpc', 'mango', 'export'])
# mango rpc.mountd: export request from 192.168.0.70

A line that starts with a parenthesis is a pattern line. Each pattern has a name (like 'type-0'), a frequency, a regular expression, and a list of keywords. The following lines starting with '#' are the actual log entries that were used to generate the pattern.
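Each pattern line happens to be a Python tuple literal, so a pattern file can be read back programmatically. A minimal sketch, assuming the format shown above (the actual loader inside match.py may differ):

```python
import ast
import re

def load_patterns(path):
    """Read pattern lines (Python tuple literals), skipping '#' comment lines.

    Returns a list of (name, frequency, compiled_regexp) tuples.
    """
    patterns = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            # Each pattern line is (name, frequency, regexp, keywords).
            name, freq, regex, keywords = ast.literal_eval(line)
            patterns.append((name, freq, re.compile(regex)))
    return patterns
```

Note that the backslashes are doubled in the file because the regexp is stored as a Python string literal; ast.literal_eval unescapes them before re.compile sees the pattern.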

Now you can classify each log entry with the obtained patterns:

$ ./match.py mypattern /var/log/messages
type-0: Jan 31 06:17:27 mango rpc.mountd: authenticated unmount request from kiwi.xx.xxx.xxx:686 for /data (/data)
type-1: Jan 31 06:35:34 mango kernel: Packet log: input REJECT eth1 PROTO=6 128.122.80.107:61887 192.168.0.61:113 L=48 S=0x00 I=4036 F=0x4000 T=63 SYN (#10)
type-0: Feb  4 06:18:42 mango rpc.mountd: authenticated unmount request from kiwi.xx.xxx.xxx:689 for /home (/home)
type-0: Feb  4 06:20:02 mango rpc.mountd: authenticated mount request from banana.xx.xxx.xxx:613 for /home (/home)
unknown: Feb  5 06:20:51 mango rpc.mountd: export request from 192.168.0.70
type-0: Feb  5 06:21:01 mango rpc.mountd: authenticated mount request from kiwi.xx.xxx.xxx:697 for /home (/home)
...

The type of each entry is displayed at the beginning of the line. An entry that does not match any known pattern is labeled 'unknown'. You can pass this output to postprocessing programs such as grep or awk.
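For example, a few lines of Python are enough to pull out only the unknown entries from match.py's output (a sketch; `grep '^unknown:'` does the same job):

```python
import sys

def unknown_entries(lines):
    """Yield the log text of entries that match.py labeled 'unknown'."""
    for line in lines:
        # match.py output looks like "type-0: <log entry>".
        label, _, entry = line.partition(': ')
        if label == 'unknown':
            yield entry.rstrip('\n')

if __name__ == '__main__':
    for entry in unknown_entries(sys.stdin):
        print(entry)
```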

Reference Manual

learn.py

Synopsis:

$ learn.py [options] [logfile1 logfile2 ...]

learn.py takes zero or more log files as arguments. When no filename is specified, it reads logs from the standard input.

Options:

-c charskip
The number of characters to trim from the beginning of each log entry before analysis. The default value is 16, since the first 16 characters of a typical syslog entry are a timestamp and should be ignored.

-n num_samples
The number of log entries to be used for deriving each pattern. The default value is 10.

NOTICE: It takes O(n^2) time to perform learning, where n is the number of log entries.

-p pattern
A previously constructed pattern file.

-t similarity_threshold
Specifies the similarity threshold for grouping logs. When generating regexp patterns, the program cuts off entries whose similarity is less than this value. The value ranges from 0.0 to 1.0: at 0.0, two completely different log entries can be put into the same group; at 1.0, two log entries must be exactly the same to be grouped. If different-looking entries end up in the same group, try increasing this value; conversely, if similar-looking entries are split into too many groups, try decreasing it. The default value is 0.7, which works well for most log files.

-v
Verbose.

-q
Quiet.
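To get a feel for the -t threshold, here is a rough illustration of similarity between two log entries. This is a sketch using difflib's normalized matching ratio over token sequences; the actual metric inside learn.py is based on edit distance and may give different numbers.

```python
import re
from difflib import SequenceMatcher

def tokenize(entry):
    """Split a log entry into runs of contiguous alphanumeric characters."""
    return re.findall(r'\w+', entry)

def similarity(a, b):
    """Normalized similarity (0.0 to 1.0) between two log entries."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()
```

With this metric, two mountd entries that differ only in port number and directory score about 0.8, comfortably above the default threshold of 0.7, while a mountd entry against a kernel packet-log entry scores far below it.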

match.py

Synopsis:

$ match.py [options] pattern_file [logfile1 logfile2 ...]

match.py takes exactly one pattern file and zero or more log files as arguments. When no log file is specified, it reads logs from the standard input.

Options:

-c charskip
The number of characters to trim from the beginning of each log entry before matching. This must be the same value used with learn.py. The default value is 16.

-t freq_threshold
Specifies the frequency threshold for patterns. Patterns whose frequency is less than this value are ignored. Setting this to a large number (say, 10) causes all minor log entries to be classified as "unknown".
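The effect of this option amounts to filtering the pattern list by its frequency field (a sketch over the tuple format shown in the pattern file):

```python
def filter_patterns(patterns, freq_threshold):
    """Keep only patterns seen at least freq_threshold times.

    Each pattern is a tuple whose second field is its frequency,
    as in the pattern file: (name, frequency, regexp, keywords).
    """
    return [p for p in patterns if p[1] >= freq_threshold]
```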

How It Works

(under construction)

  1. Tokenize each log entry into runs of contiguous alphanumeric characters.
  2. Calculate the edit distance between all possible pairs of log entries (as sequences of tokens).
  3. Link any two entries whose similarity is above a certain threshold, and take each set of connected entries as a group.
  4. For each group, find the most specific common pattern that matches every entry in the group.
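The steps above can be sketched as follows. This is a simplified illustration, not the actual implementation: it uses difflib's matching ratio in place of token edit distance, a greedy approximation of connected components, and a token-by-token generalization.

```python
import re
from difflib import SequenceMatcher

def tokenize(entry):
    # Step 1: runs of contiguous alphanumeric characters.
    return re.findall(r'\w+', entry)

def similarity(a, b):
    # Step 2: normalized similarity of two token sequences.
    return SequenceMatcher(None, a, b).ratio()

def group_entries(entries, threshold=0.7):
    # Step 3: link similar entries; connected entries form a group
    # (greedy: each entry joins the first group it is similar to).
    tokenized = [tokenize(e) for e in entries]
    groups = []  # each group is a list of indices into entries
    for i, toks in enumerate(tokenized):
        for g in groups:
            if any(similarity(toks, tokenized[j]) >= threshold for j in g):
                g.append(i)
                break
        else:
            groups.append([i])
    return [[entries[i] for i in g] for g in groups]

def generalize(group):
    # Step 4: keep tokens shared by all entries; wildcard the rest.
    # (Truncates to the shortest entry for simplicity.)
    token_lists = [tokenize(e) for e in group]
    n = min(len(t) for t in token_lists)
    parts = []
    for i in range(n):
        column = {t[i] for t in token_lists}
        parts.append(re.escape(column.pop()) if len(column) == 1 else r'\w+')
    return r'\W+'.join(parts)
```

Running this over a handful of mountd entries groups the similar ones together and yields a pattern in the spirit of the 'type-0' regexp shown earlier.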

Terms and Conditions

(This is the so-called MIT/X license.)

Copyright (c) 2007 Yusuke Shinyama <yusuke at cs dot nyu dot edu>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Yusuke Shinyama