regex-less parsing of messages
A very interesting and useful discussion took place last week on the LogAnalysis mailing list.
Anton Chuvakin started the thread by asking: other than parsing individual messages (which could potentially come in thousands of different formats), what other methods can be used to analyze logs?
Some of the suggestions from this discussion are listed here.
Clustering
Anton listed this as an option, using tools such as slct. Another effort I am aware of that uses this approach is Securimine for Snort (SFS) from Securimine.
Securimine was founded by Ophir Rachman, who also founded Entercept Security Technologies (later acquired by McAfee).
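To make the clustering idea concrete, here is a minimal sketch loosely inspired by the frequent-word approach behind slct. It is not slct's actual implementation; the function name, the `min_support` parameter, and the sample messages are all made up for illustration. Words that show up often at the same position are kept, and everything else becomes a wildcard.

```python
from collections import Counter, defaultdict

def cluster_by_frequent_words(lines, min_support=2):
    """Group lines by a template built from their frequent (position, word) pairs."""
    split_lines = [line.split() for line in lines]
    # Pass 1: count how often each word appears at each position.
    counts = Counter(
        (pos, word) for words in split_lines for pos, word in enumerate(words)
    )
    # Pass 2: keep frequent words, replace rare ones with a wildcard.
    clusters = defaultdict(list)
    for line, words in zip(lines, split_lines):
        template = " ".join(
            word if counts[(pos, word)] >= min_support else "*"
            for pos, word in enumerate(words)
        )
        clusters[template].append(line)
    return dict(clusters)

# Hypothetical sample messages, for illustration only.
sample = [
    "sshd: Failed password for root from 10.0.0.1",
    "sshd: Failed password for admin from 10.0.0.2",
    "sshd: Failed password for guest from 10.0.0.3",
    "kernel: eth0 link up",
]
clusters = cluster_by_frequent_words(sample)
for template, members in clusters.items():
    print(len(members), template)
```

The three sshd messages collapse into a single template with wildcards for the username and the address, while the lone kernel message ends up in a cluster of its own. No message format had to be known in advance.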
Brute-force Parsing
This method basically tries to guess some of the data structures inside a log message, such as IP addresses, hostnames, usernames, passwords, actions, and so on.
Correctly guessing what data is in a message without first knowing the message format is a tough problem. It relies on the parser knowing the exact structure of some of the data.
However, it can still be used to assist in parsing unknown messages. You can also apply some simple logic to classify the messages. For example, if you see keywords such as from or to along with IP addresses, the message may be a firewall message.
Obviously this is not a foolproof approach, but given the alternative (not doing anything with the message at all!), it is a viable solution.
(One may ask: is it better to do nothing so that users won’t be misled, or is it better to attempt a guess and possibly give wrong information? What do you think?)
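The guessing described above can be sketched in a few lines. This is only an illustration under simple assumptions: tokens are separated by whitespace, IP addresses are detected with Python's standard `ipaddress` module rather than a regular expression, and the function names, labels, and sample messages are hypothetical.

```python
import ipaddress

def find_ip_addresses(message):
    """Return every whitespace-separated token that parses as an IP address."""
    found = []
    for token in message.split():
        candidate = token.strip("[](),;\"'")
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not an IP address, keep looking
        found.append(candidate)
    return found

def guess_message_type(message):
    """Very rough guess: IP addresses plus 'from'/'to' suggests a firewall message."""
    ips = find_ip_addresses(message)
    words = {word.lower() for word in message.split()}
    if ips and words & {"from", "to"}:
        return "possible firewall message"
    return "unknown"

# Hypothetical messages, for illustration only.
print(guess_message_type("Deny tcp from 10.1.1.5 to 192.168.0.9"))
print(guess_message_type("disk /dev/sda1 is 91% full"))
```

A token such as 10.1.1.5:80 (address plus port) would be missed by this sketch, which shows how quickly the guesswork runs into the "exact structure" problem mentioned above.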
Bayes/Markov/Expert Systems/Neural Nets/Genetic Algorithms
Several types of statistical analysis were mentioned here.
- Expert system - a collection of empirical data and decision algorithms compiled by developers
- Hidden Markov models - since they are used in natural language and speech processing they might be applicable to log entries (they are after all some type of “natural speech”).
- Neural nets - Once built, the neural net would be trained by experienced teachers (log analysis gurus).
- Genetic algorithms - The trick would be to 1. define the right requirements (for example, determine the least number of message types without discarding significant data) and 2. define the genetic codes for the solution organisms. Maybe GAs are a bit far-fetched but I wouldn’t exclude them.
- Bayes - Bayesian classifiers have been extremely popular and successful in spam filtering. The success of Bayesian classifiers in spam filtering is partly due to the simplicity of classifying emails into ham and spam. In the log world, it is much tougher to tell good from bad. Also, lots of not-bad messages may together indicate something bad. So it is tough to say how one can apply this type of technology to log analysis.
Obviously I am no mathematician, nor do I claim to understand the nitty-gritty details of statistical analysis, so I can’t comment much on the technical merit of these methods. But I would love to hear from anyone who has more knowledge.
Indexing
One of the newer methods of analyzing logs is indexing them and providing Google-like search capabilities across all logs. This is something LogLogic and Splunk are doing.
The basic idea is that instead of parsing the messages by understanding every single format, you use full-text indexing to break the messages into tokens, then allow users to search the logs with boolean search expressions.
This method is great when it comes to troubleshooting and forensic analysis. If complemented with the understanding of the log formats, it can be as powerful as other methods.
I wrote an article, Searching for Root Cause, a while back on the benefits of using Google-like indexed search on logs.
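As a rough illustration of the indexing idea (not how LogLogic or Splunk actually implement it), the sketch below builds a tiny inverted index that maps each token to the messages containing it, then answers AND and OR queries with set operations. The tokenizer, the function names, and the sample messages are assumptions made for this example.

```python
import re
from collections import defaultdict

def tokenize(message):
    """Lowercase the message and split it into word-like tokens."""
    tokens = {tok.strip(".") for tok in re.findall(r"[a-z0-9_.]+", message.lower())}
    return tokens - {""}

def build_index(messages):
    """Map each token to the set of message numbers that contain it."""
    index = defaultdict(set)
    for msg_id, message in enumerate(messages):
        for token in tokenize(message):
            index[token].add(msg_id)
    return index

def search_and(index, *terms):
    """Messages that contain every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Messages that contain at least one term."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))

# Hypothetical messages, for illustration only.
logs = [
    "sshd: Failed password for root from 10.0.0.1",
    "sshd: Accepted password for alice from 10.0.0.2",
    "kernel: eth0 link down",
]
index = build_index(logs)
print(search_and(index, "password", "failed"))
print(search_or(index, "failed", "down"))
```

Nothing here knows what an sshd or kernel message looks like, and that is the point: the search works on any format, at the cost of not knowing which token is a username and which is an address.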
Tokenizing
This is the method most log analyzers use today. It generally requires writing regular expressions or similar patterns to parse the individual pieces of information out of the log messages.
Rainer Gerhards has a great summary in his paper On the Nature of Syslog Data.
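For comparison with the other approaches, here is a small sketch of what this kind of parsing typically looks like: one regular expression per known message format, with named groups for the fields. The pattern targets a common OpenSSH failed-login message; the field names and the sample line are chosen for illustration.

```python
import re

# One pattern per known message format; this one covers a failed SSH login.
FAILED_LOGIN = re.compile(
    r"Failed password for (?P<user>\S+) from (?P<src_ip>\S+) port (?P<port>\d+)"
)

def parse_failed_login(message):
    """Return the extracted fields as a dict, or None if the format does not match."""
    match = FAILED_LOGIN.search(message)
    return match.groupdict() if match else None

# Hypothetical sample line, for illustration only.
fields = parse_failed_login(
    "sshd[4242]: Failed password for root from 10.0.0.1 port 51234 ssh2"
)
print(fields)
```

The extraction is precise, but a slightly different message (say, "Failed password for invalid user bob from ...") no longer matches and needs its own pattern. Multiply that by thousands of formats and you have the problem that started the thread.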
Various standards
- IBM’s Common Base Event XML format - This is a VERY complicated XML-based format that tries to cover everything. I see two huge problems with this type of format. First, it hugely expands the storage requirement, given that raw log storage is required. Second, it could make parsing that much slower given the size of a single log (multiple KBs instead of hundreds of bytes). It has been morphed into the OASIS standard WSDM Management Using Web Services v1.0 (WSDM-MUWS).
- WELF
- W3C
- IDMEF - Intrusion Detection Message Exchange Format
- IDIOM - Intrusion Detection Interaction and Operations Messages (Cisco message format)
Eight steps for integrating security into application development
As a security professional and a developer, I have always been very frustrated by the carelessness of some developers when it comes to following simple security practices. The most common problem I see is unchecked user input being passed to system calls or database queries.
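To show what passing unchecked input to a database query means in practice, here is a small sketch using Python's built-in `sqlite3` module. The table, the function names, and the input are made up for this example; the point is the difference between pasting input into the SQL text and passing it as a bound parameter.

```python
import sqlite3

# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [("alice", "admin"), ("bob", "user")]
)

def find_user_unsafe(conn, name):
    # Vulnerable: the user input becomes part of the SQL statement itself.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, name):
    # Safer: the input is passed as a bound parameter and is never parsed as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "nobody' OR '1'='1"
print(find_user_unsafe(conn, payload))  # the injected OR clause matches every row
print(find_user_safe(conn, payload))    # no user has that literal name
```

The same idea applies to system calls: pass arguments as a list instead of building a shell command string out of user input.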
Ruby Qurashi’s article, Eight steps for integrating security into application development, is a good summary of the process one should follow to ensure security is built into applications from the start.
1. Initial review
2. Definition phase: Threat modeling
3. Design phase: Design review
4. Development phase: Code review
5. Deployment phase: Risk assessment
6. Risk mitigation
7. Benchmark
8. Maintenance phase: Maintain
The threat modeling step is, I believe, one of the most critical steps in this whole process, mainly because many application developers are not familiar with the various attacks that could be launched against their software. This step would serve as great training for these developers.
If this step is performed correctly, the following steps will be much easier for everyone.
Good summary, worth reading.
Gallery 2.0.2 Security Fix Release
Gallery 2.0.1 and 2.0 have a minor security flaw. Here’s the announcement from the Gallery web site:
Gallery 2.0.2 is now available for download. This release adds no new features. It fixes a minor XSS exploit, a potential information leak and a file disclosure bug in the zipcart module that could allow remote visitors to view sensitive files on your webserver. These security flaws were discovered during an internal security audit of the Gallery 2 code, and there are no known exploits of them in the wild. However we strongly recommend that you upgrade to version 2.0.2 as soon as possible. If you’re unable to upgrade right away we recommend that you disable the zipcart module until time permits you to upgrade.
I came back today and saw a TON of accesses from various IPs. It is especially bad since there now seems to be an automated process that checks for this exploit. I ran the following to get the offending IPs:
tail -20000 access_log | grep '\.\.\.\.\.\.\/1\.0' | cut -f1 -d' ' | sort | uniq
The offending IPs seem to be:
- 12.44.172.92
- 12.44.181.220
- 63.160.77.236
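For anyone who prefers a script over a shell pipeline, the same filtering can be sketched in Python. This is only a rough equivalent: the marker string comes from the grep pattern (six literal dots followed by /1.0), the function name is made up, and the sample lines are abbreviated, with the 192.0.2.10 entry being a hypothetical harmless visitor.

```python
def offending_ips(log_lines, marker="....../1.0"):
    """Return the sorted, unique client IPs of lines containing the marker."""
    return sorted(
        {line.split(" ", 1)[0] for line in log_lines if marker in line}
    )

# Abbreviated sample lines; the last one is a hypothetical harmless request.
sample_lines = [
    '12.44.172.92 - - [04/Dec/2005:15:24:56 -0800] "GET /album/ HTTP/1.0" 302 276 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"',
    '63.160.77.236 - - [04/Dec/2005:15:24:28 -0800] "GET /gallery/main.php HTTP/1.0" 200 14830 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"',
    '12.44.172.92 - - [04/Dec/2005:15:25:01 -0800] "GET /gallery/ HTTP/1.0" 200 512 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"',
    '192.0.2.10 - - [04/Dec/2005:15:26:00 -0800] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(offending_ips(sample_lines))
```

To run it against a real log, pass an open file object in place of the sample list.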
The process seems to have crawled the web for URLs that link to gallery pictures and used those URLs to get to the gallery sites. It looks for both /album and /gallery URLs.
The logs are similar to
12.44.172.92 - - [04/Dec/2005:15:24:56 -0800] "GET /album/sa/ecuador/sa1.html HTTP/1.0" 302 276 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )" "-"
or
63.160.77.236 - - [04/Dec/2005:15:24:28 -0800] "GET /gallery/main.php?g2_view=core.ShowItem&g2_itemId=12&g2_GALLERYSID=21831e46358ea023c3289f30b9f7ffb5 HTTP/1.0" 200 14830 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )" "-"
If you use those URLs, you would get something like

Notice the “System Information” section? It shows a ton of stuff about your setup.
After the upgrade, that whole section will be gone, giving only the “Error Detail” section.

