Archiving Navigator Audit Data with StreamSets and Kafka

Navigator Audit Data with StreamSets and Kafka

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Introduction

Many of phData’s customers are heavy users of Cloudera Navigator. Cloudera Navigator provides metadata information to the user who can also audit all actions performed on data in the cluster. Per day, one customer generates an average of 4GB Audit Data, which is stored by default in the mysql database with a 90 day retention period. This caused two issues: it was expensive to store on enterprise SAN and the customer wanted to increase retention the retention period to 3 years. Fortunately, we happen to have a very cheap way of storing data: Hadoop!

Cloudera Navigator Audit server is part of the Cloudera Enterprise license. Every Action/Event performed on the cluster gets logged to the Navigator Audit database. In figure 1, we see the Navigator Audit Events, based on a filter by username.

Figure1: Navigator Audits

Figure2: Navigator Data Lineage

Configuring Cloudera Navigator

Cloudera Navigator can publish audit events to Apache Kafka or a Syslog. We chose Kafka as this followed the customer’s principles to publish everything to Kafka. The configuration of Cloudera Navigator is fairly simple and only requires a few lines of configuration. The following snippet was pasted in the Navigator Audit Server Logging Advanced Configuration Snippet for Cloudera Management Services in Cloudera Manager (CM):

log4j.logger.kafkaAuditStream=TRACE,KAFKA

log4j.appender.KAFKA=org.apache.kafka.log4jappender.KafkaLog4jAppender

log4j.additivity.com.cloudera.navigator.kafkaAuditStream=false

log4j.appender.KAFKA.layout=org.apache.log4j.PatternLayout

log4j.appender.KAFKA.layout.ConversionPattern=%m%n

log4j.appender.KAFKA.SyncSend=false

log4j.appender.KAFKA.BrokerList=broker_host:broker_port

log4j.appender.KAFKA.Topic=NavigatorAuditEvents

Please note that you will need to change the values for log4j.appender.KAFKA.BrokerList and log4j.appender.KAFKA.Topic with your kafka brokers and topic. After a restart of the Navigator Audit Server, messages will be published to Kafka.

Building a StreamSets Pipeline

There are a couple ways to get the data from Kafka to HDFS. One is with a spark streaming job, and the other is to use StreamSets Datacollector (SDC). We chose the SDC option because of its easy to use web interface.

In SDC, a pipeline consists of at least two components: an origin and a destination. Additionally, there is the possibility to use one or multiple processors for transformations in-between. In our use case, we need a Kafka-origin which we configured to consume messages from the kaka-topic.  

Figure3: StreamSets Pipeline

Our Kafka Origin feeds directly into an Hadoop Filesystem destination which stores the data as Avro files with snappy compression. Using the avro file format with snappy compression shrunk the file size from 4gb per day in MySQL to 700mb of snappy compressed avro data in HDFS. In order to avoid small files, the max file size has been set 1GB and the idle timeout to one hour. The origin is configured to create a folder per day where the data is being written to. Those folders will the map to a partition in the hive table. (figure 4)

Figure 4: SDC output configuration

 

Figure5: Avro configuration

For using Avro, we needed an Avro Schema which we put in the SDC destination configuration (see figure 5):

{

   “type”: “record”,

   “name”: “NavigatorAuditData”,

   “namespace”:”org.apache.avro.ipc”,

   “fields”: [

     {“name”: “type”, “type”: “string”},

     {“name”: “allowed”, “type”: “string”},

     {“name”: “time”, “type”: “string”},

     {“name”: “service”, “type”: “string”},

     {“name”: “ip”, “type”: “string”},

     {“name”: “user”, “type”: “string”},

     {“name”: “op”, “type”: [“null”, “string”]},

     {“name”: “opText”, “type”: “string”, “default” : “null”},

     {“name”: “impersonator”, “type”: “string”, “default” : “null”},

     {“name”: “db”, “type”: “string”, “default” : “null”},

     {“name”: “table”, “type”: “string”, “default” : “null”},

     {“name”: “family”, “type”: “string”, “default” : “null”},

     {“name”: “privilege”, “type”: “string”, “default” : “null”},

     {“name”: “objType”, “type”: “string”, “default” : “null”},

     {“name”: “src”, “type”: “string”, “default” : “null”},

     {“name”: “path”, “type”: “string”, “default” : “null”},

     {“name”: “status”, “type”: “string”, “default” : “null”},

     {“name”: “perms”, “type”: “string”, “default” : “null”}

  ]

} 

The table is created as followed with one partition per day. Please note that newly added partitions do not get added automatically. For this case, we decided to use hive’s msck repair table command to discover new partitions.

create external table navigator_audit (

type string

,allowed string

,time string

,service string

,ip string

,user string

,op string

,opText string

,impersonator string

,db string

,table string

,family string

,privilege string

,objType string

,src string

,path string

,status string

,perms string )

partitioned by (dt string)

stored as avro

location ‘/data/navigator_audit/landing’;

Conclusion

This customer is now able to reduce the retention period for the MySQL stored data to 30 days (or whatever makes sense in your use-case) in CM. This will take off some heavy load of your mysql database. Now in HDFS stored Audit, data can be retained for years without using up important resources and still be queryable via Hive/Impala.

More To Explore

Data Science Salon Seattle Conference

Event Recap: Data Science Salon Seattle

Photo by Anna Anisin. Last week, I had the pleasure of attending the Data Science Salon conference in Seattle, Washington. It was my first time

Want to learn more about phData?

Image of desk