QCon New York June 15-19, 2020 | Fast Log Analysis by Automatically Parsing Heterogeneous Log

This presentation is now available to view on InfoQ.com

What You’ll Learn

Hear why parsing logs is extremely challenging, and how approaches that originate in machine learning can be used to automate the parsing of heterogeneous logs.
Learn interesting approaches to log parsing, backed by a reference implementation used in a commercial product.
Understand the challenges of parsing logs automatically.

Abstract

Most log analysis tools provide platforms for indexing, monitoring, and visualizing logs. Although these tools let users perform ad-hoc queries and define alerting rules relatively easily, they do not provide automated log parsing support. In particular, most of these systems use regular expressions (RegEx) to parse log messages. They assume that users know how to work with RegEx and make them manually parse or define the fields of interest. By definition, these tools support only supervised parsing, as human input is essential. However, human involvement clearly does not scale for heterogeneous and continuously evolving log message formats in systems such as IoT deployments and custom applications -- it is impossible to manually review the sheer number of logs generated in an hour, let alone days and weeks. On top of that, writing RegEx-based parsing rules is a long, frustrating, and error-prone process, as RegEx rules may conflict with each other. In this talk, we present a solution inspired by unsupervised machine learning techniques for automatically generating RegEx rules from a set of logs with no (or minimal) human involvement. Human involvement is limited to providing a set of training logs. In addition, we present a demo illustrating how to integrate our solution with the popular Elasticsearch-Logstash-Kibana (ELK) stack to analyze logs collected from real-world applications.
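
To make the point about conflicting RegEx rules concrete, here is a minimal sketch in Python. The two rules, their names, and the sample log line are hypothetical and are not taken from the talk or from any product; they only illustrate how a broad hand-written rule and a specific one can both claim the same message.

```python
import re

# Hypothetical hand-written parsing rules (illustrative only).
RULES = [
    ("generic_event", re.compile(r"^(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)$")),
    ("login_failure", re.compile(r"^(?P<ts>\S+) ERROR login failed for user (?P<user>\w+)$")),
]


def matching_rules(line):
    """Return the names of every rule whose pattern matches the line."""
    return [name for name, pattern in RULES if pattern.match(line)]


line = "2020-06-15T10:00:00Z ERROR login failed for user alice"
# More than one name in the result means the rules conflict for this line.
print(matching_rules(line))
```

With only two rules the overlap is easy to spot; with hundreds of manually maintained rules, deciding which one should win for each message is part of what makes the manual approach error-prone.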

Question:

Who is the main audience the talk is targeting?

Answer:

The talk mainly targets people who design or architect log analytics solutions and who focus on making the troubleshooting of operational problems faster by analyzing logs. When a computer operates, it generates logs to communicate with humans -- logs act like tweets that report system status. If something fails, somebody has to understand the logs and take the necessary steps to correct it. This talk is about how to parse those logs into a form that takes analytics one level up.

Question:

What's the motivation for the talk?

Answer:

When we first started building the log analysis product for commercial purposes, we hit bottlenecks pretty quickly. You have a log, but unless you parse it, you cannot build any useful tools or analytics with it -- what you can do is quite limited. Since every log is really different (I mean there is no consistent form of logging), parsing becomes very hard to automate.

To solve this problem, we said: “OK, if this is automated, it doesn't need to be 100% perfect. It should start parsing logs with no (or minimal) human input about them, and at least help people get started with log analysis. Over time, if more input is provided, the automated process will act like a human expert.” So you throw any logs at it, and the system comes up with regular-expression-based patterns. Logs are usually unstructured and contain a lot of text, but once you run our method, it generates patterns that parse the logs into structured forms, which you can then use to make sense of the logs.
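
As a rough illustration of what "parsing a log into a structured form" means, the sketch below applies one regular-expression pattern with named groups to a raw log line and gets back a dictionary of fields. The pattern, the field names, and the sample line are assumptions made for this example; they are not output from the speakers' tool.

```python
import re

# Hypothetical pattern of the kind an automated parser might emit.
PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"Connection from (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"closed after (?P<seconds>\d+) seconds$"
)


def parse(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None


record = parse(
    "2020-06-15 10:00:00 web01 Connection from 10.0.0.7 closed after 42 seconds"
)
print(record)
```

Once each line is a record with named fields, it can be indexed, filtered, and aggregated like any other structured data, which is the step that raw text does not allow.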

In our talk, we will discuss our approaches to solving this problem. For example, we will cover one particular log that is very scary (almost one page long). Using our tool and the approach we took, we will show that, given any log, you have a way to parse it.

Question:

How does it do that? Does it apply machine learning techniques to be able to identify the components of the log? What does it actually do?

Answer:

Yes, it's a good question. What we found is that if you apply pure machine learning, the issue is run time. Machine learning is a time-consuming process, and that is still limiting. What we did is blend machine learning concepts with an understanding of how system logs are produced. What I mean is, although the logs are all in different formats, they are generated by a computer. A computer is dumb -- it’s just some programs writing information in the form of logs. Given that, a log is not a storybook where every line is different. Since logs are generated by programs, there are only a few logging points, which usually generate logs with fixed formats (maybe 10 to 100 formats or something like that). If you start from this kind of deep system knowledge, then it is just a matter of solving the problem systematically.
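
The observation that a few logging points produce a small number of fixed formats can be sketched with a toy example. This is not the speakers' algorithm: it uses a deliberately simple heuristic (any token containing a digit is treated as a variable field) to group lines by their constant tokens and then builds one regular expression per group. The function names and sample logs are hypothetical.

```python
import re
from collections import defaultdict

WILDCARD = "<*>"


def signature(line):
    """Replace tokens that contain a digit with a wildcard; keep the rest."""
    return tuple(
        WILDCARD if any(ch.isdigit() for ch in tok) else tok
        for tok in line.split()
    )


def discover_patterns(lines):
    """Group lines by signature and build one regex per group."""
    groups = defaultdict(list)
    for line in lines:
        groups[signature(line)].append(line)
    patterns = {}
    for sig in groups:
        # Variable fields become capture groups; constant tokens are escaped.
        parts = [r"(\S+)" if tok == WILDCARD else re.escape(tok) for tok in sig]
        patterns[sig] = re.compile(r"^" + r"\s+".join(parts) + r"$")
    return groups, patterns


logs = [
    "Connection from 10.0.0.7 closed after 42 seconds",
    "Connection from 10.0.0.9 closed after 7 seconds",
    "Disk sda1 usage at 91 percent",
    "Connection from 192.168.1.5 closed after 130 seconds",
]
groups, patterns = discover_patterns(logs)
for sig, members in groups.items():
    print(len(members), " ".join(sig))
```

Four lines collapse into two formats here. Real logs need a better notion of "variable field" than "contains a digit", but the grouping step shows why the number of patterns stays far smaller than the number of log lines.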

Question:

You now have a tool that automates the discovery of log parsing patterns. Is this talk about a specific tool or about the techniques you used to build that tool?

Answer:

The talk will be very generic. In building this tool, we developed a methodology for addressing the problem. We’ll mostly focus on the methodology and use the tool to demonstrate the reference implementation and various design trade-offs. BTW, we assume no prior knowledge; you don't need any to attend this talk.

Speaker: Biplob Debnath

Researcher @NEC

Dr. Biplob Debnath is a researcher at NEC Labs, where his work over the last seven years has spanned building an end-to-end face-recognition-based video analytics system, a log analytics system, a non-volatile memory (i.e., flash, PCM) based caching system, and a data deduplication system. His technical work has received 1300+ Google Scholar citations. Currently, his work focuses on applying machine learning and AI techniques to solve real-world problems. His work on video and log analytics ships in NEC's commercial products. His Ph.D. research on flash-based key-value stores ships in Bing ObjectStore, his research on data deduplication ships in Windows Server 2012, and his research on caching ships in IBM's Storage Array. Biplob received a Ph.D. and an M.S. from the University of Minnesota.

Find Biplob Debnath at

@bkdebnath

https://www.linkedin.com/in/bkdebnath/

Speaker: Willard Dennis

Senior Systems Administrator @NEC

Will Dennis is currently employed as a Senior Systems Administrator at NEC Laboratories America, and has over 25 years of experience in managing, installing, and troubleshooting enterprise computing systems, networks, and software. A lifelong learner, Will enjoys keeping current with both tech and culture in the field of Information Technology. Will can be found online on Twitter as @willarddennis, and through LinkedIn at https://www.linkedin.com/in/willdennis/