Machine Learning Techniques for Security Information and Event Management
Jordan Caraballo-Vega
Abstract
The deployment and maintenance of a High Performance Computing facility such as the NASA Center for Climate Simulation (NCCS) requires services that can monitor and report live hardware and software operational statistics. With more than 4,000 computing nodes and more than 90,000 processor cores, it is crucial for the NCCS to adopt techniques that improve and speed up how we analyze failures, so that we can fix them and prevent future downtime.
One important technique long used to supervise information and events is to automatically store timestamped records of relevant procedures in log files. This technique helps organizations, businesses, or networks proactively mitigate risks. Although this information is very useful, as the volume of fast-moving data grows it becomes nearly impossible for humans to detect the causes of errors or incoming threats.
Figure 1. We receive ~115 to 120 million log messages daily from ~3,000 servers.
Therefore, our aims are to:
- Enhance and improve our ability to view, analyze, and monitor log files.
- Upgrade our existing ELK+Graylog infrastructure.
- Prove that machine learning (ML) techniques are useful for log analysis.
- Implement and develop ML jobs to automate the detection of common and security events.
- Produce a recipe for future production upgrade procedures.
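
To make the idea of automated event detection concrete, the following is a minimal sketch of a statistical baseline, not the ML jobs described in this work. It assumes log messages have already been aggregated into per-interval counts and flags intervals whose volume deviates strongly from the median, using a modified z-score based on the median absolute deviation. The function name `flag_anomalous_counts`, the threshold of 3.5, and the sample counts are all hypothetical, chosen only for illustration.

```python
from statistics import median


def flag_anomalous_counts(counts, threshold=3.5):
    """Return indices of intervals whose log-message count is an outlier.

    Uses the modified z-score, 0.6745 * (x - median) / MAD, where MAD is
    the median absolute deviation. This is an illustrative baseline only.
    """
    if not counts:
        return []
    center = median(counts)
    deviations = [abs(c - center) for c in counts]
    mad = median(deviations)
    if mad == 0:
        # Counts are (almost) all identical; no meaningful scale to compare against.
        return []
    return [
        i for i, dev in enumerate(deviations)
        if 0.6745 * dev / mad > threshold
    ]


# Hypothetical per-minute message counts with one burst in the last interval.
sample_counts = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
print(flag_anomalous_counts(sample_counts))  # -> [9]
```

A median-based score is used here rather than a mean and standard deviation because a single large burst inflates the standard deviation and can hide itself in a short window.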



