The options for building an actionable data platform are near endless, especially when you factor in all the solutions for storing data, approaches to transforming and modeling data, technologies ready to consume data, and people needing to work with data.
For most companies, building a run-of-the-mill data platform will not be enough. They will want a best-in-class data platform that consistently delivers value. But what does that even look like?
The easy, and probably obvious answer here is that there’s no one-size-fits-all data platform.
What is best for one company will not necessarily be best for all. The good news is that there are a couple of objectives that, if achieved, create immense business value for any organization.
Modern, best-in-class data platforms typically:
- Provide use case scalability
- Enable data governance
- Support self-serviceability
- Instill user confidence in the data and the platform
In this post, we’ll take a closer look at how your organization can build a successful data platform around these characteristics.
Use Case Scalability From Data Warehousing in the Cloud
It shouldn’t be surprising that modern data warehousing solutions don’t look the same as they did twenty years ago. What might surprise you, though, is that for most enterprises, a modern data warehouse doesn’t look the same as it did even five years ago.
Did you know? Only five years ago, a “modern” solution for most enterprises meant wheeling a rack or two of specialized hardware into the data center in exchange for a few million dollars on an annual contract.
Cloud-based data warehousing solutions like Snowflake, AWS Redshift, Azure Synapse, and Google BigQuery have changed the game quite a bit with a pay-per-use data warehouse, but how exactly does that convert to use case scalability for a data platform?
On-Premise vs. Cloud-Based Data Warehousing Solutions
If we consider on-premise data warehousing solutions, investment is all up-front. You pay for the data warehousing solution, but don’t get to see any return on investment while the hardware is set up, configured, and operationalized. You might be months into the build-out with millions of dollars invested and just then be starting to implement a solution for the first use case. Initially, this leaves a business with a severely underutilized piece of hardware, making such a move a high-risk leap of faith.
At some point in the warehouse’s lifetime, there will be enough use cases to eat up the available compute or storage of the hardware. When this occurs, the organization must either order more hardware (another large hit to the budget) or identify which existing use cases can be scaled back, and to what extent. Purchasing more hardware at this stage is less of a leap of faith, but it will once again leave the organization with an under-utilized data platform while new use cases are prioritized and solutions built for them.
Contrast that on-premise cost model with a cloud-based data warehouse that uses a pay-per-use cost model, where there is an opportunity to prove the value of a use case using an iterative approach. An initial iteration can be to implement a use case solution with very light requirements to help gauge cost estimates and to understand how valuable that solution might be. Future iterations can expand on the solution, modifying the complexity of data transformation or how data flows through it, and even remove it to focus on another use case. At no point is there a need to consider purchasing and installing additional hardware, as new warehouses or clusters can be created on-demand.
Using a cloud-based data warehouse allows costs to scale according to the number of use cases and their complexity. This permits the business to prioritize and shows the true flexibility and scalability of a best-in-class data platform.
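The contrast between the two cost curves can be sketched with a toy model. All numbers below are hypothetical, chosen only to illustrate why a light pilot use case is cheap under pay-per-use but carries the full hardware price tag on-premise:

```python
# Toy cost model (all figures hypothetical): up-front hardware spend
# vs. pay-per-use cloud compute for the same pilot workload.

def on_prem_cost(months_elapsed: int, hardware_cost: float = 2_000_000.0) -> float:
    """The entire investment lands on day one, regardless of utilization."""
    return hardware_cost if months_elapsed > 0 else 0.0

def cloud_cost(monthly_compute_hours: list, rate_per_hour: float = 3.0) -> float:
    """Spend tracks actual consumption, so a light pilot stays light on cost."""
    return sum(hours * rate_per_hour for hours in monthly_compute_hours)

# Six months of a single lightweight pilot use case, ~200 compute-hours/month:
pilot_usage = [200.0] * 6
print(on_prem_cost(6))          # 2000000.0
print(cloud_cost(pilot_usage))  # 3600.0
```

The exact rates don’t matter; the point is that cloud spend grows with use cases and their complexity, while the on-premise model forces the full outlay before the first use case ships.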
Data Governance from Data Ownership and Stewardship
Companies are collecting more data from more sources than ever, at growth rates once seen only at the likes of Google and Facebook. With this growth come new problems: securing the data, ensuring regulatory compliance, and managing the data in general. These are the problems data governance exists to solve.
Unfortunately, data governance is not achieved by using a specific tool or set of tools. Yes, tooling exists that supports many aspects of data governance, but it only enhances existing data governance practices. Data governance is very much a ‘people and process’ oriented discipline, intended to make data secure, usable, available, and of high quality.
The key to initiating and enabling data governance lies in garnering support from the C-suite and IT leadership. But truly growing and maturing a data governance practice requires actual ownership and stewardship of data assets through specific roles: data owners and data stewards.
What is a Data Owner?
A data owner is an appointed role that manages data at an entity and attribute level, typically focused on how the entity data is collected, its quality, and who is authorized to access it. Together, multiple data owners cover the full set of entities within the organization’s data profile.
What is a Data Steward?
A data steward is an appointed role that ensures data standards and policies are enforced and applied in day-to-day business. Where the data owner is accountable for the data entities, the data steward is the one working with the data entities daily, ensuring data definition details are clear and correct.
Appointing and enabling data owners and data stewards to produce and maintain high-quality datasets is the first step to establishing solid data governance practices. Giving the business trust and confidence in the data on the platform shows the value of data governance on a best-in-class data platform.
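As a concrete (and entirely hypothetical) illustration, ownership and stewardship assignments can be recorded as simple metadata alongside each data entity, so it is always clear who is accountable for what. The entity names and assignees below are invented for the example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataEntity:
    name: str
    owner: str    # accountable for collection, quality, and authorization
    steward: str  # enforces standards and definitions in day-to-day use

# Hypothetical entities and role assignments:
registry = {
    e.name: e
    for e in [
        DataEntity("customer", owner="vp_sales", steward="crm_analyst"),
        DataEntity("employee", owner="hr_director", steward="hris_admin"),
    ]
}

def owner_of(entity_name: str) -> str:
    """Look up who is accountable for a given entity."""
    return registry[entity_name].owner

print(owner_of("customer"))  # vp_sales
```

Even a lightweight registry like this makes the ‘people and process’ side of governance actionable: every dataset request or quality issue has a named owner and steward to route to.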
Self-Serviceability from ITSM, Information Architecture and Automation
An impactful data platform exists to accelerate innovation, and it does this by liberating data to make it accessible to the users who need it for strategic business purposes. But it is not safe to assume that all users should be able to access all data on the platform, because different datasets will have different sensitivity levels depending on the nature of the data. For example, sales and HR typically have sensitive datasets that a business wants to restrict access to. Further, customer and health data is often protected by compliance requirements, restricting how the data may be used and who is allowed to access it.
For an end-user hoping to access a dataset on the data platform, a means must exist to request access to that dataset, along with a workflow by which the request can be scrutinized by the necessary parties before being fulfilled. This is a self-serviceability use case, and one that is central to liberating data on the platform.
Another primary self-serviceability situation to consider is the ability to request the creation of objects or structures within the data warehouse. As new use cases for the platform come up, new ‘workspaces’ can give a data engineering team a place to build the data transformations and datasets that support those use cases. Beyond these examples, there are many situations where self-serviceability can increase the autonomy of the data platform. But what does it take to build self-serviceability into a data platform?
What Does ITSM Stand For and What Does It Mean?
An IT Service Management (ITSM) solution is ideal for taking, managing, and fulfilling self-service requests. ITSM tools provide a ticketing system and a workflow management system to support change and incident management for an IT organization. The workflow support can be used to request approval from data owners or managers and to trigger automation that builds out what is being requested.
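The request–review–fulfill flow an ITSM tool orchestrates can be sketched in a few lines. This is a minimal illustration, not a real ITSM API; the class, statuses, and names are assumptions made for the example:

```python
# Hypothetical sketch of a self-service access request lifecycle.
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"

class AccessRequest:
    def __init__(self, requester: str, dataset: str, data_owner: str):
        self.requester = requester
        self.dataset = dataset
        self.data_owner = data_owner
        self.status = Status.PENDING

    def review(self, reviewer: str, approve: bool) -> None:
        # Only the dataset's data owner may scrutinize the request.
        if reviewer != self.data_owner:
            raise PermissionError("only the data owner may review this request")
        self.status = Status.APPROVED if approve else Status.DENIED

def fulfill(req: AccessRequest) -> str:
    # In a real platform this step would trigger automation, e.g. a GRANT
    # statement or an IAM call, rather than return a string.
    if req.status is not Status.APPROVED:
        raise ValueError("cannot fulfill an unapproved request")
    return f"granted {req.requester} read access to {req.dataset}"

req = AccessRequest("analyst_1", "sales.orders", data_owner="vp_sales")
req.review("vp_sales", approve=True)
print(fulfill(req))  # granted analyst_1 read access to sales.orders
```

In practice the ITSM tool supplies the ticketing, approval routing, and audit trail, and the fulfillment step hands off to platform automation.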
Speaking of automating the creation of objects and structures, this works best when there is a common understanding of how data is organized on the platform and how data moves about the platform. Information architecture defines both, among other things, and it is an important piece of documentation that provides direction for much of the data platform. A best-in-class information architecture documents, at a minimum, a standard set of details for each data repository on the platform.
Additionally, the information architecture would detail how data moves between the different data repositories within the data platform, answering the questions ‘What are the allowed sources and destinations of data?’ and ‘What tooling is used to perform a movement?’
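Those two questions lend themselves to a machine-readable encoding that automation can check requests against. The layer names and tool labels below are invented for the example, not a prescribed architecture:

```python
# Hypothetical information-architecture movement rules: which
# repository-to-repository flows are allowed, and which tool performs each.
ALLOWED_FLOWS = {
    ("source_systems", "raw"): "ingestion tool",
    ("raw", "staging"): "transformation tool",
    ("staging", "curated"): "transformation tool",
}

def movement_allowed(src: str, dst: str) -> bool:
    """Answers: what are the allowed sources and destinations of data?"""
    return (src, dst) in ALLOWED_FLOWS

def tool_for(src: str, dst: str) -> str:
    """Answers: what tooling is used to perform a movement?"""
    return ALLOWED_FLOWS[(src, dst)]

print(movement_allowed("raw", "staging"))  # True
print(movement_allowed("raw", "curated"))  # False
```

Keeping rules like these in code rather than only in a document means self-service automation can validate a requested data flow before building anything.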
With ITSM for taking and managing requests, information architecture for defining the data standards that make up the data platform, and automation for performing repeated work according to the information architecture, the data platform can operate almost autonomously. This autonomy gives the business the ability to accelerate innovation and advance strategic outcomes.
Platform Confidence from Software Engineering Best Practices
Like most IT systems, a data platform is a collection of data repositories, file systems, web applications, and many other components in a near-constant state of change. These changes are necessary, especially when you factor in how:
- Databases and datasets are created
- Data is ingested and transformed
- Tools are configured
- Data pipelines are managed
- Infrastructure is deployed
- Automation is created

The list goes on.
With all of this change, there is a high risk that something will go wrong and negatively impact the platform. To build confidence in a platform is to mitigate as much of that risk as possible.
Applying change is something that web application developers have become very good at due to the typically stateless nature of web applications. This has allowed for some large collaborative code bases that can be deployed hundreds or thousands of times a day without end users being impacted. Confidence in the development practices leads to confidence in a system that is ever-changing.
With so much data now flowing into data platforms, building confidence in an ever-changing data platform means embracing these development best practices in ways that work for stateful data systems:
- Source Code Management (SCM) tools
- Continuous Integration (CI)
- Continuous Delivery (CD)
- Multiple deployment environments
- Testing and data quality
- Infrastructure as Code (IaC)
- Database change management
- Rollback strategy
- Monitoring and alerting
The underlying themes here are automation, testing, and monitoring. Automate the building and testing of artifacts. Create deployment pipelines that deploy artifacts, test the deployment, and promote the artifacts to the next environment. Use database change management tools to help create environment-agnostic DDL/DML scripts, and automate the deployment and promotion of those changes. Build data quality checks into data pipelines and alert on anomalies. If something fails a test, automate the rollback procedure.
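A data quality check of the kind a pipeline could run before promoting a dataset might look like the bare-bones gate below. The checks and column names are illustrative; a non-empty failure list is what would trigger alerting or an automated rollback:

```python
# Minimal data quality gate (checks and thresholds are illustrative).

def quality_gate(rows: list, required_columns: tuple) -> list:
    """Return a list of failure messages; an empty list means the data passes."""
    failures = []
    if not rows:
        failures.append("dataset is empty")
    for i, row in enumerate(rows):
        for col in required_columns:
            if row.get(col) is None:
                failures.append(f"row {i}: null value in required column '{col}'")
    return failures

good = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 14.50}]
bad = [{"id": 3, "amount": None}]

print(quality_gate(good, ("id", "amount")))  # []
print(quality_gate(bad, ("id", "amount")))   # ["row 0: null value in required column 'amount'"]
```

Real platforms would typically reach for a dedicated data quality framework here, but the shape is the same: run checks in the pipeline, and let a failing result halt promotion.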
Using software engineering best practices leads to confidence in an IT organization’s ability to change a system while mitigating its risk. This in turn converts into confidence in the platform for the business.
How to Get There
Modernizing an existing data platform is often a significant part of an organization’s digital transformation journey, and it cannot be achieved all at once. Outside of procuring a cloud-based data warehouse, the previously mentioned characteristics of a best-in-class data platform all require some form of maturity, which should be built with an iterative approach, focusing on adding business value in each iteration.
Looking for More Information on Building a Best-in-Class Data Platform?
If you need more guidance on constructing a value-driving data platform, be sure to download our guide: How to Build an Actionable Data Strategy Framework. This 8-step guide is filled with practical information, helpful examples, and expert advice that will bring you a step closer to building a best-in-class data platform.