November 13, 2019

Archway: Self-Service Data Engineering on Cloudera CDH

By Tony Foerster

Data engineering in a production environment is complex. Engineers and data scientists need to be onboarded onto a platform where they can share data and resources, and the process is often longer and more difficult than many people initially realize.

It can be an adventure just getting the right approvals: Is the data allowed to live on this platform? Are there additional security considerations? Are there enough resources available on the platform for the application? And that’s only the beginning. The necessary resources need to be provisioned. The operations team needs to ensure that quotas and resource queues are configured correctly, that databases are created, that role-based access control is set up properly, and so on.

This is not only time-consuming but also risky. With such a complex, manual process, it’s easy to make a mistake along the way. Mistakes mean project slowdowns, misconfigurations, and extra costs, which can lead in turn to late starts, security vulnerabilities, or failure to get applications off the ground at all.

That’s where Archway comes in: it was built to automate these processes with ease of use, accuracy, flexibility, and security in mind.

An easier way to get value from your data

Archway is an open source project, released under the Apache License, Version 2.0, that automates the creation, approval, and governance of user workspaces in a Cloudera CDH environment. It makes self-service data engineering a reality, with an intuitive user interface (UI) that lets users create workspaces and set up role-based access control in a matter of minutes. That saves you time, reduces onboarding risks, and gives you faster access to your data.

With Archway, users choose the workspace type that fits their needs from a list and enter a name, description, summary, and data compliance information. After reviewing the workspace request, compliance and operations teams can approve the workspace with the click of a button. The workspace resources are then automatically created in the background, and access control is configured immediately.

Archway will automatically provision:

  • Hive databases
  • HDFS directories, including user home directories
  • Kafka topics
  • YARN resource queues
  • Sentry roles with their associated Active Directory groups for each of the above resources
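To make the list above concrete, here is a minimal Python sketch of how one workspace name could be expanded into a set of resources to provision. It is not Archway's code or API: the `ProvisioningPlan` class, the `plan_workspace` function, and every naming convention (database prefix, HDFS path, queue name, topic name, role and group names) are hypothetical and chosen only for illustration. For simplicity it creates one resource of each kind, so one role per kind is the same as one role per resource.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProvisioningPlan:
    """Resources to create for one approved workspace (illustrative only)."""
    hive_databases: List[str] = field(default_factory=list)
    hdfs_directories: List[str] = field(default_factory=list)
    kafka_topics: List[str] = field(default_factory=list)
    yarn_queues: List[str] = field(default_factory=list)
    # Sentry role name -> Active Directory group backing that role
    sentry_roles: Dict[str, str] = field(default_factory=dict)


def plan_workspace(name: str, with_kafka_topic: bool = False) -> ProvisioningPlan:
    """Derive a provisioning plan from a workspace name.

    All naming conventions here are assumptions, not Archway's real ones.
    """
    slug = "_".join(name.lower().split())
    plan = ProvisioningPlan(
        hive_databases=[f"ws_{slug}"],
        hdfs_directories=[f"/data/workspaces/{slug}"],
        yarn_queues=[f"root.workspaces.{slug}"],
    )
    if with_kafka_topic:
        plan.kafka_topics.append(f"{slug}.events")

    # Pair each kind of provisioned resource with a role and an AD group.
    kinds = {
        "hive": plan.hive_databases,
        "hdfs": plan.hdfs_directories,
        "kafka": plan.kafka_topics,
        "yarn": plan.yarn_queues,
    }
    for kind, resources in kinds.items():
        if resources:  # skip kinds that were not requested
            role = f"role_{slug}_{kind}"
            plan.sentry_roles[role] = f"AD_{slug.upper()}_{kind.upper()}"
    return plan


plan = plan_workspace("Fraud Model", with_kafka_topic=True)
print(plan.hive_databases)
print(sorted(plan.sentry_roles))
```

The point of the sketch is that a single approved request fans out into several coordinated resources, which is the part that is error-prone when done by hand.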

Archway User Interface


Because every analytics or machine learning application is different, Archway provides templates for workspaces to fit almost any need, including:

  • User workspaces — Workspaces for single users, consisting of a database and an optional Kafka topic for their own exploration and play.
  • Simple workspaces — Workspaces intended for small applications or team collaboration.
  • Structured workspaces — Workspaces consisting of multiple databases for different stages of the application lifecycle. For example, different databases can be used for landing data, transforming it, and exposing data to end users and analysts.


It’s easy to grant additional users access to the workspace by searching for their user ID and setting the desired access permissions. Removing user access is just as simple.

Archway Permissions

Likewise, members of data compliance or operations teams can easily view and manage workspaces in the cluster.

Archway Risk/Compliance

Governance, including important information about data security, is handled as part of the Archway workspace approval process. When creating a workspace, users indicate the type of data that will exist in that workspace. Important information, including whether a workspace will include PII, PHI, or PCI, is collected and stored alongside the workspace to be used in future decision making and auditing.

Archway Governance
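As a rough illustration of storing compliance information alongside a workspace, the Python sketch below models the PII, PHI, and PCI indicators as a small record and produces an audit entry from it. The `ComplianceInfo` and `WorkspaceRecord` classes, the `audit_entry` function, and the field names are all hypothetical; Archway's actual data model may look quite different.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ComplianceInfo:
    """Data-type indicators collected when a workspace is requested."""
    pii: bool = False  # personally identifiable information
    phi: bool = False  # protected health information
    pci: bool = False  # payment card data

    def sensitive_categories(self) -> List[str]:
        # Return the flagged categories in field-declaration order.
        return [name.upper() for name, flagged in asdict(self).items() if flagged]


@dataclass(frozen=True)
class WorkspaceRecord:
    """A workspace together with its compliance answers (hypothetical shape)."""
    name: str
    requester: str
    compliance: ComplianceInfo


def audit_entry(record: WorkspaceRecord) -> str:
    """Serialize the fields an auditor might want to review later."""
    return json.dumps(
        {
            "workspace": record.name,
            "requester": record.requester,
            "sensitive": record.compliance.sensitive_categories(),
        },
        sort_keys=True,
    )


record = WorkspaceRecord(
    name="claims_analytics",
    requester="jdoe",
    compliance=ComplianceInfo(pii=True, phi=True),
)
print(audit_entry(record))
```

Keeping the indicators on the workspace record itself means that approvers and auditors can see what kind of data a workspace was declared to hold without consulting a separate system.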

An Archway to self-service data engineering

Archway enables self-service data analysis and engineering by automating the creation and governance of workspaces in Cloudera CDH environments. It’s flexible and secure. It’s open source. And it helps ensure that processes like resource creation and access control never get in the way of a successful data project.

For more details, watch the demo, or check out the GitHub repo.

This blog post was written by Tony Foerster and Brock Noland.
