Kafka

Archiving Navigator Audit Data with StreamSets and Kafka

Andy Stadtler helped with this post. Many of phData’s customers are heavy users of Cloudera Navigator, which provides metadata about the data in the cluster and lets users audit all actions performed on it. One customer generates an average of 4 GB of audit data per day, which is stored by default in the MySQL […]

Visualizing NetFlow Data with Apache Kudu, Apache Impala (incubating), StreamSets Data Collector, and D3.js

NetFlow is a data format that reflects the IP statistics of all network interfaces interacting with a network router or switch. NetFlow records can be generated and collected in near real time for the purposes of cybersecurity, network quality of service, and capacity planning. For network and cybersecurity analysts interested in these data, being able to […]

Troubleshooting Spark and Kafka escalations for our managed services customers

My favorite technical task is handling a troubleshooting escalation from one of our Managed Services customers. This week I was lucky enough to work on a problem worth sharing. A Spark job was producing messages to Kafka. Halfway through, the job froze and no more messages were produced to Kafka. After […]

StreamSets – Hadoop Ingestion Made Simple

StreamSets recently announced and open sourced their first product, Data Collector. I had been given access to a preview version of the product and was quite impressed. Given that the product is now public and generally available, I thought I would go through a super-simple demo. In my consulting role at phData, I’ve worked with many customers […]

Apache Kafka Performance Numbers

A search for “Apache Kafka performance” returns dozens of articles, but few are useful for estimating real-world performance. Specifically, I’ve found few results that run on hardware common to modern data centers, replicate the data with the common factor of 3, and use many parallel producers and consumers. These results are meant to be used […]
