SQL

Visualizing NetFlow Data with Apache Kudu, Apache Impala (incubating), StreamSets Data Collector, and D3.js

NetFlow is a data format that reflects the IP statistics of all network interfaces interacting with a network router or switch. NetFlow records can be generated and collected in near real-time for the purposes of cybersecurity, network quality of service, and capacity planning. For network and cybersecurity analysts interested in these data, being able to […]


Parquet vs Text Compression

Parquet is a columnar data format. Columnar data formats, which store data grouped by columns, can achieve compression ratios of up to 95% when tuned specifically for a given dataset. Even with zero tuning, they still provide excellent compression. For example, below I use Faker to generate 1M rows: from faker import Factory fake = […]


Hive corruption due to newlines and carriage returns

phData has customers across the spectrum of use cases. One of our customers stores vast volumes of XML. One of our engineers was recently asked: "Hive sometimes corrupts my data and other times it does not. What is going on?" The answer is quite interesting, so I thought I would share. Specifically, the query they […]


4 Strategies for Updating Hive Tables

Apache Hive and complementary technologies such as Cloudera Impala provide scalable SQL on Apache Hadoop. Unlike legacy database systems, Hive and Impala have traditionally not provided any update functionality. However, many use cases require periodically updating rows, as with slowly changing dimension tables. SQL on Hadoop technologies typically utilize one of two storage engines, Apache HBase […]


The Paradox of Agile Data Management

At phData, many of us come from a software development background and have witnessed the success of Agile Methodologies. Agile started in software development, where it quickly gained popularity, but it has now also made inroads into other realms. The concept of the “Agile Admin”, or as it’s better known, DevOps, takes many of its core […]


The Truth about SQL on Hadoop (part 3)

This is a multi-part blog post meant to be an exhaustive introduction to SQL-on-Hadoop. The first part in this series covered Storage Engines and Online Transaction Processing (OLTP). The next post covered Online Analytical Processing (OLAP), while this post covers engine retrofits for Hadoop and choosing among the alternatives. Retrofits When breaking this topic […]


The Truth about SQL on Hadoop (part 2)

This is a multi-part blog post meant to be an exhaustive introduction to SQL-on-Hadoop. The first part in this series covered Storage Engines and Online Transaction Processing (OLTP). This post will cover Online Analytical Processing (OLAP) while the third in the series will cover engine retrofits for Hadoop and choosing among the alternatives. Data processing and […]


The Truth about SQL on Hadoop (part 1)

This is a multi-part blog post meant to be an exhaustive introduction to SQL-on-Hadoop. The first part in this series will cover Storage Engines and Online Transaction Processing (OLTP). The next post will cover Online Analytical Processing (OLAP) while the third in the series will cover engine retrofits for Hadoop and choosing among the alternatives. […]
