Apache Kudu Integration Testing in Scala/SBT Applications


Introduction to Kudu Integration Testing: Beginning with the 1.9.0 release, Apache Kudu published new testing utilities that include Java libraries for starting and stopping a pre-compiled Kudu cluster. These utilities enable JVM developers to easily test against a locally running Kudu cluster without any knowledge of Kudu internal components or its different processes. Cloudera published […]

How to Use the Kudu Quickstart on Windows


This blog post was written by Donald Sawyer and Frank Rischner. Introduction to Apache Kudu: Apache Kudu is a distributed, highly available, columnar storage manager with the ability to quickly process data workloads that include inserts, updates, upserts, and deletes. Kudu integrates very well with Spark, Impala, and the Hadoop ecosystem. At phData, we use […]

CDP Data Warehouse Experience: The Hadoop Paradigm Shift


Cloudera Data Platform (CDP) represents a major step forward toward combining the value-added distributions of Hadoop from both Cloudera (CDH) and Hortonworks (HDP) into a unified, cloud-ready Data and Analytics platform. CDP maps out a new direction to manage and expand large data workloads into single-cloud, multi-cloud, or cloud-data-center hybrids — wherever you need it, […]

Introducing the Cloudera Data Platform: Unlock Adaptive Scaling


As the technology landscape changes, it’s important for businesses to take advantage of the increased efficiencies this competitive landscape offers. The most recent advancement companies are racing to implement is leveraging the benefits of the cloud and elastic compute platforms that allow for on-demand availability for critical business processes. In practice, this means that applications […]

Implementing Metadata as Part of Data Management


Data centralization without careful metadata implementation is like stocking a warehouse without sorting and labeling all the boxes. Yes, you may have everything you need in there, but your end users will be wandering around lost. For example, how would someone looking for manufacturing material know, without access to metadata, that they needed to use […]

Archway: Self-Service Data Engineering on Cloudera CDH


Data engineering in a production environment is complex. Engineers and data scientists need to be onboarded onto a platform where they can share data and resources, and the process is often longer and more difficult than many people initially realize. It can be an adventure just getting the right approvals: Is the data allowed to […]

How to Tame Apache Impala Users with Admission Control


Introduction: A common problem encountered with Apache Impala is resource management. Everyone wants to use as many resources (i.e. memory) as they can to try to increase speed and/or hide query inefficiency. However, it’s not fair to others and it can be detrimental to queries supporting important business processes. What we see at a lot […]

How to Query a Kudu Table Using Impala JDBC in Cloudera Data Science Workbench


Kudu is an excellent storage choice for many data science use cases that involve streaming, predictive modeling, and time series analysis. However, in industries like healthcare and finance where data security compliance is a hard requirement, some people worry about storing sensitive data (e.g., PHI, PII, PCI) on Kudu without fine-grained authorization. Kudu […]

Enabling Big Data Analytics with Arcadia Data


As distributed data platforms like Hadoop and the cloud grow in adoption, business intelligence (BI) and visual analytics increasingly need a more distributed approach. Traditional BI Tools No Longer Scale to the Increased Business Needs: At phData, we continue to run into traditional BI tools failing to adapt to the increasing data […]

Hadoop Meets Blockchain: Trust Your (Big) Data


At a simple level, blockchains solve a trust problem. Increasingly, companies are relying on third parties to help drive brand recognition and gain consumer trust. This includes trusting third-party data. For these companies to succeed, it is vital that the data they receive is trustworthy and accurate. Each organization involved needs to trust that […]