In mid-2017, we were working with one of the world’s largest healthcare companies to put a new data application into production. The customer had grown through acquisition and in order to maintain compliance with the FDA, they needed to aggregate data in real-time from dozens of different divisions of the company. The consumers of this application did not care how we built the data pipeline. However, they were sure to tell us that if it broke, and the end-users did not get their data, they would be out of compliance and could be subject to massive fines.
The data pipeline was built primarily on the Cloudera platform using Spark Streaming, Kudu, and Impala, however there were components that relied upon automation built in Bash and Python as well. Having supported data products in the past, we knew that data applications need robust support beyond strong up-front architecture and development. Specifically, we began thinking about how we could ensure that errors do not go unnoticed. If any part of our pipeline has issues, we need the ability to act proactively. Cloudera Manager was setup to issue alerts for the platform, Workload XM looked great for Cloudera workload optimization, but how can we be alerted if any part of the application pipeline fails – inside or outside the Cloudera platform?
There are many log aggregation and alerting tools on the market. Out of the box, Elasticsearch provides the same log searching functionality as Solr, and Kibana provides a very good web user interface. In our case though, managing an Elasticsearch cluster is work, and our customers are already using Cloudera Search and an additional cluster was not worth the extra overhead. In addition, security and alerting is provided only with the paid version of Elasticsearch.
Other solutions were prohibitively expensive. For example, one of the options we evaluated was priced based on the volume of data ingested and aggregated. Having been down this road in the past, we knew that you end up spending a lot of time adjusting log verbosity and making sure only important things are logged. This is low value work and storage is cheap.
Given that the application had PHI and HIPAA data, we also wanted a solution that included role-based access control (RBAC). The Sentry integration with Solr Cloud provides out of the box role based access control, so our customer is able to use their existing Sentry roles to protect their log data against unauthorized access using their existing processes.
Pulse (https://github.com/phdata/pulse) is an application log aggregation, search, and alert framework that runs on Cloudera CDH. Pulse builds on Hadoop/Spark technologies to add log aggregation to your existing infrastructure and integrates closely with Cloudera through a Cloudera CSD and Parcel.
When running an application, it’s important to be able to:
- Search your logs to find issues quickly. Searching logs in a centralized location cuts debugging time drastically.
- Create flexible alerts against your logs as they arrive, in real time. Alerts let you be proactive about your application health, and learn of issues before they are reported downstream.
- Secure your logs against unauthorized access. Hadoop clusters are multi-tenant environments, often with hundreds of users. Log data should be secured like the rest of your data.
- Keep logs for a configurable amount of time. Depending on application requirements logs may need to be kept for a few days or a few years.
Pulse stores logs in Apache Solr, which gives full text search over all log data. Apache Sentry, which handles role based access control works with Apache Solr, so its easy to control access to logs. Pulse itself adds functionality like log lifecycle management so logs are kept only as long as needed. It includes log appenders for multiple languages that make it easy to index and search logs in a central location. Pulse also has an alert engine, so you can create alerts that will notify your team when things go wrong.
Pulse runs on your existing infrastructure and is not a cloud based service. Pulse is packaged into a Cloudera Custom Service Descriptor (CSD). It is installed via Cloudera Manager. In addition to making the install of Pulse painless on your Cloudera Hadoop cluster, Cloudera Manager acts as a supervising process for the Pulse application, providing access to process logging and monitoring.
Pulse consists of four components:
- Log Appenders: a log4j appender is pre-packaged with Pulse, other appenders for Python and Bash are also available.
- Log Collector: an http server that listens for log messages from the appenders and puts them into the Apache Solr (Cloud) full text search engine. The Log Collector can be scaled across your cluster to handle large volumes of logs.
- Alert Engine: service runs against near-real-time log data indexed in to Solr Cloud. Alerts run on an interval and can alert via email or http hook.
- Collection Roller: handles application log lifecycle and plumbing. Users configure how often to create a new index for logs and how long to keep logs.
Logs stored in Solr can be visualized using existing tools, including Hue Search, Arcadia, or Banana. Each log record stored in Pulse contains the original log message timestamp, making it easy to create time-series visualizations of your log data.
The Log Collector is an HTTP REST service that listens for log events, batches them, and writes them to Solr Cloud. Because the Log Collector is just a REST API, it’s easy to configure applications in any language to use it. It’s on the Pulse roadmap to flip the log-collector around and read log events from a Kafka topic instead of just listening for messages.
Log appenders are the application specific code and configuration used to write log messages to the Log Collector. We’ve built clients that write to the Log Collector for Java (using log4j), Python, and Bash.
For Java, you can attach a custom log4j appender using only changes to your log4j.properties. To use the Python appender, you will need to import a Python library that extends the Python logging framework. The Bash appender is a simple script you can source that makes logging from Bash applications to the Pulse framework easy.
The Collection Roller will define applications in Pulse and handle log lifecycle. The image below describes how the Collection Roller works with collection aliases. The two collection aliases, one for read and one for write,
The write alias (suffixed with ‘_latest’ internally in Pulse) will always point at the most recently created collection. It’s used by the Log Collector so the log collector doesn’t have to know about specific collections.
The read alias (suffixed with ‘_all’ internally in Pulse) will always point at all log collections for a single application, It is used by visualization or search tools wanting to access all log data.
Every day a new collection is created, while the oldest collection is deleted. The image below describes this process.
Alerts Engine and Visualization
Users interact with Pulse through the Alert Engine and visualization tools.
Because Pulse uses Solr Cloud behind the scenes, any visualization tool that works with Solr Cloud can be used, including Arcadia Data, Hue Search, or Banana.
Here is a screenshot of a dashboard using Arcadia Data:
The Pulse Alert Engine is a daemon that continually monitors logs and will alert users when something goes wrong. The alert engine has rules that can use the full power of the Solr/Lucene query language. For example, this query will alert if an ERROR level message was seen in the last 10 minutes:
`timestamp:[NOW-10MINUTES TO NOW] AND level: ERROR`
This query will alert if a metric goes out of the range of 50-100
timestamp:[NOW-10MINUTES TO NOW] AND metric:[101 TO *] OR metric: [0 TO 49]
Pulse lets you keep control of your logs and while taking advantage of your existing infrastructure and tools. Pulse is Apache 2.0 Licensed. It’s available for use and contributions at https://github.com/phdata/pulse.