Spark

Visualizing NetFlow Data with Apache Kudu, Apache Impala (incubating), StreamSets Data Collector, and D3.js

NetFlow is a data format that reflects the IP statistics of all network interfaces interacting with a network router or switch. NetFlow records can be generated and collected in near real-time for the purposes of cybersecurity, network quality of service, and capacity planning. For network and cybersecurity analysts interested in these data, being able to […]


Configuring Oozie for Spark SQL on a Secure Hadoop Cluster

A secure Hadoop cluster requires actions in Oozie to be authenticated. However, due to the way that Oozie workflows execute actions, Kerberos credentials are not available to actions launched by Oozie. Oozie runs actions on the Hadoop cluster. Specifically, for legacy reasons, each action is started inside a single-task, map-only MapReduce job. Spark does […]


try-with-resources in Scala

At phData, many of our customers are deploying Apache Spark. As such, those customers look to phData not only for Apache Spark, Hadoop, and Kafka expertise, but also for Scala expertise. The following is an explanation of a construct we use in one of our Scala and Spark courses. This post will show a generic function that is […]
