CDP Data Warehouse Experience: The Hadoop Paradigm Shift

Cloudera Data Platform (CDP) represents a major step forward toward combining the value-added distributions of Hadoop from both Cloudera (CDH) and Hortonworks (HDP) into a unified, cloud-ready Data and Analytics platform. CDP maps out a new direction to manage and expand large data workloads into single-cloud, multi-cloud, or cloud-data-center hybrids — wherever you need it, whenever you need it.

Combining over 30 open source projects, CDP is incredibly comprehensive; however, it can be overwhelming to figure out how all the pieces fit together. The complexity can become a barrier to entry for individual teams, each with a slightly different focus on platform use.

CDP addresses this challenge by providing persona-based entry points into the stack that target specific areas for Data Engineering (DE), Business Intelligence (BI) and Machine Learning (ML). These are just entry points; CDP is composed of modular services that can be approached either as a whole or selectively, depending on the need.

CDP consists of the following component services:

Management Console — web-based portal into CDP Environments
Workload Manager — telemetry and visibility into workloads down into specific databases, statement types, or users
Data Catalog — unified management, security, and governance visibility across multiple CDP environments
Replication Manager — data migration and propagation
Data Hub — managed workload runtime services
Data Warehouse — auto-scalable data marts and data warehousing
Machine Learning — workbench for data science and data engineering
Cloudera Runtime — the core open-source distribution within CDP, along with bundled CDH facilities such as Cloudera Manager (CM), adjusted to run on top of managed cloud runtime(s); it ties together Data Hub, Data Warehouse, Replication Manager, and Data Catalog
Data Center — CM and CDH versions of CDP designed to run on traditional data center clusters, and to offer the ability to dynamically scale by leveraging container-based solutions

Simplified capacity planning with Data Warehouse Experience

Capacity planning for single-platform workloads is often done by determining the largest possible load and then adding a little headroom; however, these worst-case scenarios usually far exceed what typical day-to-day operations require. Another problem in shared multi-tenant environments is that “quiet” loads can be adversely affected by “noisy” loads that drain shared resources until they complete, which often makes scheduling workloads to meet business requirements, such as SLAs, a major challenge.

To address these issues, CDP allows you to offload excessively burdensome workloads to an on-demand CDP cloud environment, which provides the additional capacity and isolation that these “noisy” workloads demand. Core platform storage and workload capacity can then grow at a predictable, linear rate over time, while the exceptional workloads that exceed normal capacity are offloaded and costed out separately.
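To make the trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. Every number and name in it (the 30-day load profile, the 20% headroom, the baseline of 12 nodes, the helper functions) is a made-up assumption for illustration, not a CDP figure or API. It also ignores the fact that on-demand capacity is usually priced differently from fixed capacity.

```python
def fixed_cluster_size(loads, headroom_pct=20):
    """Size a fixed cluster for the worst-case load plus some headroom."""
    return max(loads) * (100 + headroom_pct) / 100


def burst_demand(loads, baseline):
    """Return the load above the baseline for each period (0 when it fits)."""
    return [max(0, load - baseline) for load in loads]


# Hypothetical 30-day month in arbitrary "node" units:
# 29 ordinary days, then one heavy month-end reporting day.
daily_load = [10] * 29 + [40]

# Option 1: provision for the peak all month long.
fixed_nodes = fixed_cluster_size(daily_load)
fixed_node_days = fixed_nodes * len(daily_load)

# Option 2: keep a modest baseline and offload only the excess.
baseline_nodes = 12
offloaded = burst_demand(daily_load, baseline_nodes)
hybrid_node_days = baseline_nodes * len(daily_load) + sum(offloaded)

print(f"fixed cluster:    {fixed_node_days:.0f} node-days")
print(f"baseline + burst: {hybrid_node_days} node-days")
```

With these illustrative numbers, sizing for the peak consumes 1,440 node-days over the month, while a small baseline plus a one-day burst consumes 388. The real gap depends entirely on your own load profile and pricing.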

CDP Data Warehouse Experience (DWX) is a facility within CDP that combines compute and storage resources into a managed, on-demand cluster that can be spun up and isolated from other workloads. For example, suppose you want to offload some excessively complex reporting that runs on a month-end cycle. Rather than trying to squeeze it in alongside other workloads, risking resource contention and unpredictable runtimes or worse, you can leverage DWX.

With DWX, you can simply choose a Hive or Impala SQL Engine and set it to auto-scale virtual warehouse instances with an upper limit on resources, and even set cost thresholds. A Hive Virtual Warehouse can be configured to start out with a specific number of Coordinators, Executors, and Executor Groups to work on a single isolated query, and those can be sized appropriately based on the size of the warehouse itself.

Hive Virtual Warehouse auto-scaling manages resources based on query load, allowing compute resources to expand when either headroom (the number of available coordinators) runs short or wait time (how long queries sit in the queue) exceeds its threshold.
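The scale-up decision described above can be sketched as a simple predicate. This is a toy model, not Cloudera's implementation: the `ScalePolicy` class, its field names, and its default values are all hypothetical, and the sketch assumes both triggers are evaluated together and capped by an upper resource limit.

```python
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    # All names and defaults are illustrative assumptions, not CDP settings.
    desired_free_coordinators: int = 1  # headroom trigger
    max_wait_seconds: float = 60.0      # wait-time trigger
    max_executor_groups: int = 5        # upper limit on resources


def should_scale_up(policy, free_coordinators, longest_wait_seconds,
                    executor_groups):
    """Return True when the warehouse should add an executor group."""
    # Never grow past the configured ceiling.
    if executor_groups >= policy.max_executor_groups:
        return False
    # Headroom trigger: too few coordinators are free for new queries.
    headroom_short = free_coordinators < policy.desired_free_coordinators
    # Wait-time trigger: a query has been queued for too long.
    waited_too_long = longest_wait_seconds > policy.max_wait_seconds
    return headroom_short or waited_too_long


policy = ScalePolicy()
print(should_scale_up(policy, free_coordinators=0,
                      longest_wait_seconds=5, executor_groups=1))
print(should_scale_up(policy, free_coordinators=2,
                      longest_wait_seconds=5, executor_groups=1))
```

The first call reports that a scale-up is needed because no coordinators are free; the second does not, since there is spare headroom and nothing has waited long.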

When no queries are sent to the executor group for the length of the auto-suspend timeout, the cluster scales down and nodes are released. The query endpoint remains available; however, the cost of keeping an entire cluster running contracts back to that of the minimal configuration.
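As a rough illustration of the auto-suspend behavior, the sketch below models the number of executor groups that stay running as a function of idle time. The function names, the 300-second timeout, and the minimal size of zero groups are assumptions chosen for the example, not values taken from CDP.

```python
def is_idle_past_timeout(last_query_time, now, auto_suspend_timeout):
    """True when no query has arrived within the auto-suspend timeout."""
    return (now - last_query_time) >= auto_suspend_timeout


def running_executor_groups(last_query_time, now, auto_suspend_timeout,
                            current_groups, minimal_groups=0):
    """Return how many executor groups stay up at time `now` (seconds)."""
    if is_idle_past_timeout(last_query_time, now, auto_suspend_timeout):
        # Scale down: release nodes, keep only the minimal configuration.
        return minimal_groups
    return current_groups


# Hypothetical timeline: last query at t=0, timeout of 300 seconds.
for t in (60, 299, 300, 900):
    groups = running_executor_groups(0, t, 300, current_groups=3)
    print(f"t={t:>3}s -> {groups} executor group(s) running")
```

In this model the warehouse keeps its three groups until the timeout elapses, then drops to the minimal configuration; a new query arriving at the endpoint would reset the idle clock.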

Digging deeper into CDP

Providing a robust, cost-effective data workload platform can be a daunting job, especially without the kind of “platform-on-demand” capabilities offered by CDP. Fortunately, helping businesses make the most of their cloud and data-center investments, so they can adapt to shifting capacity demands and meet their SLAs, is exactly what CDP was designed to do.

This blog post is part of a multi-part series dedicated to CDP. For further reading, check out our earlier blog post on unlocking adaptive scaling using CDP. Stay tuned for more on CDP features and roadmap news in subsequent posts. To learn how phData can help you with your data needs, reach out to our team at sales@phdata.io.
