February 22, 2017

Archiving Navigator Audit Data with StreamSets and Kafka

By Frank Rischner

Introduction

Many of phData’s customers are heavy users of Cloudera Navigator. Cloudera Navigator provides metadata to users and also lets them audit all actions performed on data in the cluster. One customer generates an average of 4 GB of audit data per day, which is stored by default in a MySQL database with a 90-day retention period. This caused two issues: the data was expensive to store on enterprise SAN, and the customer wanted to increase the retention period to three years. Fortunately, we happen to have a very cheap way of storing data: Hadoop!

Cloudera Navigator Audit Server is part of the Cloudera Enterprise license. Every action and event performed on the cluster gets logged to the Navigator audit database. In figure 1, we see Navigator audit events filtered by username.

Figure 1: Navigator Audits

Figure 2: Navigator Data Lineage

Configuring Cloudera Navigator

Cloudera Navigator can publish audit events to Apache Kafka or to syslog. We chose Kafka, as this followed the customer’s principle of publishing everything to Kafka. The configuration of Cloudera Navigator is fairly simple and only requires a few lines. The following snippet was pasted into the Navigator Audit Server Logging Advanced Configuration Snippet for the Cloudera Management Service in Cloudera Manager (CM):

log4j.logger.kafkaAuditStream=TRACE,KAFKA
log4j.appender.KAFKA=org.apache.kafka.log4jappender.KafkaLog4jAppender
log4j.additivity.com.cloudera.navigator.kafkaAuditStream=false
log4j.appender.KAFKA.layout=org.apache.log4j.PatternLayout
log4j.appender.KAFKA.layout.ConversionPattern=%m%n
log4j.appender.KAFKA.SyncSend=false
log4j.appender.KAFKA.BrokerList=broker_host:broker_port
log4j.appender.KAFKA.Topic=NavigatorAuditEvents

Please note that you will need to change the values of log4j.appender.KAFKA.BrokerList and log4j.appender.KAFKA.Topic to match your Kafka brokers and topic. After a restart of the Navigator Audit Server, messages will be published to Kafka.
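To confirm that events are arriving, you can tail the topic with the Kafka console consumer (kafka-console-consumer.sh in the Apache distribution). The broker address below is a placeholder for your own broker list, and depending on your Kafka version you may need the older --zookeeper option instead of --bootstrap-server:

kafka-console-consumer --bootstrap-server broker_host:broker_port --topic NavigatorAuditEvents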

Building a StreamSets Pipeline

There are a couple of ways to get the data from Kafka to HDFS. One is a Spark Streaming job, and the other is StreamSets Data Collector (SDC). We chose SDC because of its easy-to-use web interface.

In SDC, a pipeline consists of at least two components: an origin and a destination. Additionally, one or more processors can be used for transformations in between. In our use case, we need a Kafka origin, which we configured to consume messages from the Kafka topic.

Figure 3: StreamSets Pipeline
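As a rough sketch, the Kafka origin only needs a handful of settings. The exact property names vary slightly between SDC versions, the consumer group name is just an example, and treating the messages as JSON is an assumption based on how Navigator serializes audit events:

Broker URI:     broker_host:broker_port
Topic:          NavigatorAuditEvents
Consumer Group: navigator-audit-archiver
Data Format:    JSON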

Our Kafka origin feeds directly into a Hadoop FS destination, which stores the data as Avro files with Snappy compression. Using the Avro file format with Snappy compression shrank the data from 4 GB per day in MySQL to 700 MB of Snappy-compressed Avro data in HDFS. To avoid small files, the max file size has been set to 1 GB and the idle timeout to one hour. The destination is configured to create a folder per day into which the data is written. Each of those folders maps to a partition in the Hive table (figure 4).

Figure 4: SDC output configuration
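The folder-per-day layout described above comes from the destination’s directory template. A minimal sketch of the relevant settings, assuming SDC’s time EL functions and a dt= directory naming that lines up with the Hive partitions defined later (property names and EL syntax may differ slightly by SDC version):

Directory Template:     /data/navigator_audit/landing/dt=${YYYY()}-${MM()}-${DD()}
Max File Size (MB):     1024
Idle Timeout:           ${1 * HOURS}
Data Format:            Avro
Avro Compression Codec: Snappy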

 

Figure 5: Avro configuration

To use Avro, we needed an Avro schema, which we put in the SDC destination configuration (see figure 5):

{
   "type": "record",
   "name": "NavigatorAuditData",
   "namespace":"org.apache.avro.ipc",
   "fields": [
     {"name": "type", "type": "string"},
     {"name": "allowed", "type": "string"},
     {"name": "time", "type": "string"},
     {"name": "service", "type": "string"},
     {"name": "ip", "type": "string"},
     {"name": "user", "type": "string"},
     {"name": "op", "type": ["null", "string"]},
     {"name": "opText", "type": "string", "default" : "null"},
     {"name": "impersonator", "type": "string", "default" : "null"},
     {"name": "db", "type": "string", "default" : "null"},
     {"name": "table", "type": "string", "default" : "null"},
     {"name": "family", "type": "string", "default" : "null"},
     {"name": "privilege", "type": "string", "default" : "null"},
     {"name": "objType", "type": "string", "default" : "null"},
     {"name": "src", "type": "string", "default" : "null"},
     {"name": "path", "type": "string", "default" : "null"},
     {"name": "status", "type": "string", "default" : "null"},
     {"name": "perms", "type": "string", "default" : "null"}
  ]
} 

The table is created as follows, with one partition per day. Please note that newly created partitions are not added to the Hive metastore automatically; for this case, we decided to use Hive’s MSCK REPAIR TABLE command to discover new partitions (see the example after the DDL).

create external table navigator_audit (
type string
,allowed string
,time string
,service string
,ip string
,`user` string
,op string
,opText string
,impersonator string
,db string
,`table` string
,family string
,privilege string
,objType string
,src string
,path string
,status string
,perms string )
partitioned by (dt string)
stored as avro
location '/data/navigator_audit/landing';
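Once new daily folders land in HDFS, registering and querying them looks like the following; the date literal is only an illustration and assumes the dt=YYYY-MM-DD folder naming shown earlier:

msck repair table navigator_audit;

select `user`, count(*) as events
from navigator_audit
where dt = '2017-02-21'
group by `user`;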

Conclusion

This customer is now able to reduce the retention period for the MySQL-stored data to 30 days (or whatever makes sense for your use case) in CM, which takes a heavy load off the MySQL database. The audit data stored in HDFS can be retained for years without using up expensive resources, and it remains queryable via Hive and Impala.
