February 25, 2020

CDP Data Warehouse Experience: The Hadoop Paradigm Shift

By Peter Doyle

Cloudera Data Platform (CDP) represents a major step toward combining the value-added Hadoop distributions from both Cloudera (CDH) and Hortonworks (HDP) into a unified, cloud-ready data and analytics platform. CDP maps out a new direction for managing and expanding large data workloads across single-cloud, multi-cloud, or cloud/data-center hybrid deployments — wherever you need it, whenever you need it.

CDP

Combining over 30 open-source projects, CDP is incredibly comprehensive; however, it can be overwhelming to figure out how all the pieces fit together. That complexity can become a barrier to entry for individual teams, each with a slightly different focus on how they use the platform.

CDP addresses this challenge by providing persona-based entry points into the stack that target specific areas for Data Engineering (DE), Business Intelligence (BI), and Machine Learning (ML). These are just entry points; CDP is composed of modular services that can be approached either as a whole or selectively, depending on the need.

CDP consists of the following component services:

  • Management Console — web-based portal into CDP Environments 
  • Workload Manager — telemetry and visibility into workloads down into specific databases, statement types, or users
  • Data Catalog — unified management, security, and governance visibility across multiple CDP environments
  • Replication Manager — data migration and propagation 
  • Data Hub — managed workload runtime services 
  • Data Warehouse — auto-scalable data marts and data warehousing 
  • Machine Learning — workbench for data science and data engineering 
  • Cloudera Runtime — the core open-source distribution within CDP, along with the bundled CDH facilities such as Cloudera Manager (CM), adjusted to run on top of managed cloud runtimes; it ties together Data Hub, Data Warehouse, Replication Manager, and Data Catalog
  • Data Center — CM and CDH versions of CDP designed to run on traditional data center clusters, with the ability to scale dynamically by leveraging container-based solutions

Simplified capacity planning with Data Warehouse Experience

Quite often, capacity planning for single-platform workloads is done by determining the largest possible load and then adding a little headroom; however, this worst-case scenario usually far exceeds the capacity requirements of typical day-to-day operations. Another problem in shared multi-tenant environments is that “quiet” loads can be adversely affected by “noisy” loads that drain shared resources until they complete, which often makes scheduling workloads to meet business requirements, like SLAs, a major challenge.

To address these issues, CDP allows you to offload excessively burdensome workloads to an on-demand CDP cloud environment — providing the additional capacity and isolation that these “noisy” workloads demand. This allows core platform storage and workload capacity to grow at a predictable, linear rate over time; meanwhile, exceptional workloads can be offloaded and costed out individually, so you pay only for the specific workloads that exceed normal capacity.
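
To make that trade-off concrete, here is a quick back-of-the-envelope sketch in Python. All of the numbers (node counts, hourly rate, burst duration) are hypothetical, chosen only to illustrate the arithmetic of renting burst capacity on demand rather than provisioning for the worst case year-round:

    # Hypothetical numbers -- illustrative only, not real CDP pricing.
    NODE_HOUR_COST = 0.50       # dollars per node-hour (assumed)
    HOURS_PER_MONTH = 730

    baseline_nodes = 20         # capacity for typical day-to-day load
    peak_nodes = 100            # capacity for the worst-case month-end run
    burst_hours = 12            # how long the month-end workload actually runs

    # Option A: provision the cluster for the worst case, all month long.
    always_on_peak = peak_nodes * HOURS_PER_MONTH * NODE_HOUR_COST

    # Option B: run the baseline all month, and rent the extra capacity
    # on demand only while the burst workload executes.
    baseline_plus_burst = (
        baseline_nodes * HOURS_PER_MONTH * NODE_HOUR_COST
        + (peak_nodes - baseline_nodes) * burst_hours * NODE_HOUR_COST
    )

    print(f"Provision for peak:   ${always_on_peak:,.2f}/month")    # $36,500.00
    print(f"Baseline + on-demand: ${baseline_plus_burst:,.2f}/month")  # $7,780.00

Even with these made-up figures, the gap shows why paying for exceptional capacity only while it is in use is so attractive.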

CDP Data Warehouse Experience (DWX) is a facility within CDP that combines compute and storage resources into a managed, on-demand cluster that can be spun up in isolation from other workloads. For example, suppose you have some excessively complex reporting that needs to run on a month-end cycle. Rather than trying to squeeze it in alongside other workloads — risking resource contention and unpredictable runtimes, or worse — you can leverage DWX.

With DWX, you can simply choose a Hive or Impala SQL engine and set it to auto-scale virtual warehouse instances, with an upper limit on resources and even cost thresholds. A Hive Virtual Warehouse can be configured to start out with a specific number of Coordinators, Executors, and Executor Groups, each group working on a single query in isolation, and these can be sized appropriately based on the size of the warehouse itself.
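
Because a Hive Virtual Warehouse exposes a standard HiveServer2 endpoint, existing SQL tooling can connect to it unchanged. The sketch below uses the open-source PyHive client; the hostname, credentials, authentication mode, and table are placeholders, and the real connection settings (host, port, transport, TLS) come from your own DWX environment:

    # A minimal sketch using the open-source PyHive client
    # (pip install "pyhive[hive]"). All connection details below are
    # placeholders; copy the actual endpoint from your Virtual Warehouse.
    from pyhive import hive

    conn = hive.connect(
        host="hs2-month-end-vw.example.cloudera.site",  # hypothetical endpoint
        port=443,
        username="report_user",
        password="********",
        auth="LDAP",
    )

    cursor = conn.cursor()
    # The month-end reporting workload itself is ordinary HiveQL.
    cursor.execute(
        "SELECT region, SUM(amount) AS total "
        "FROM sales WHERE billing_month = '2020-02' "
        "GROUP BY region"
    )
    for region, total in cursor.fetchall():
        print(region, total)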

Hive Virtual Warehouse

Hive Virtual Warehouse auto-scaling manages resources based on query load, allowing compute resources to expand when either headroom (the number of available coordinators) is exhausted or wait time (how long queries sit in the queue) exceeds its threshold.

When no queries are sent to the executor groups within the auto-suspend timeout, the cluster scales down and nodes are released. The query endpoint remains available; however, the cost of keeping an entire cluster running contracts back to the minimal configuration.
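
The precise scaling policy is internal to DWX, but the behavior described above can be modeled in a few lines. The following sketch is an illustrative simplification rather than Cloudera's implementation; the thresholds, defaults, and field names are all assumptions:

    # Illustrative model of the scale-up / scale-down behavior described
    # above. This is NOT Cloudera's implementation; the thresholds and
    # field names are assumed for the sake of the example.
    from dataclasses import dataclass

    @dataclass
    class WarehouseState:
        executor_groups: int           # executor groups currently running
        available_coordinators: int    # headroom
        max_queue_wait_seconds: float  # longest time a query has waited
        idle_seconds: float            # time since the last query arrived

    def next_group_count(s: WarehouseState,
                         min_groups: int = 1,
                         max_groups: int = 10,
                         wait_threshold: float = 60.0,
                         auto_suspend_timeout: float = 300.0) -> int:
        # Scale up when headroom is exhausted or queries wait too long.
        if s.available_coordinators == 0 or s.max_queue_wait_seconds > wait_threshold:
            return min(s.executor_groups + 1, max_groups)
        # Scale back to the minimal configuration once the warehouse has
        # been idle past the auto-suspend timeout; the query endpoint
        # itself stays up.
        if s.idle_seconds > auto_suspend_timeout:
            return min_groups
        return s.executor_groups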

Digging deeper into CDP

Providing a robust, cost-effective data workload platform can be a daunting job — especially without the kind of “platform-on-demand” capabilities offered by CDP. Fortunately, helping businesses make the most of their cloud and data-center investments — enabling them to adapt to shifting capacity demands and meet their SLAs — is exactly what CDP was designed to do.

This blog post is part of a multi-part series dedicated to CDP. For further reading, check out our earlier blog post on unlocking adaptive scaling using CDP. Stay tuned for more on CDP features and roadmap news in subsequent posts. To learn how phData can help you with your data needs, reach out to our team at sales@phdata.io.
