August 26, 2024

Building AI with Data-Centric Test Development

By Andrew Evans

Getting AI into production is neither straightforward nor easy. Success depends on carrying the hard-won lessons of the DevOps and MLOps eras into today's AI application development. To do this, we must recognize that Generative AI provides an accelerated entry point to development: it bootstraps, but never removes, the data- and test-centric needs of ML and DevOps.

Much of the focus on AI development challenges falls on the difficulty of evaluations. Algorithmic, AI-driven, and human-centered evaluations all have hurdles that must be addressed. However, we would argue that the evaluation data, not the evaluations themselves, should take the front seat in development, working hand in hand with the evaluations performed on those datasets.

This practice, which you might call Test-Data-Driven AI Development (catchy, we know), is the spiritual successor to Test-Driven Development, refactored for the data-centric era we are in now.

To unpack TDDAID (seriously, we’re open to a better name), let’s recap test-driven development at a high level, discuss why data-centric approaches are essential in AI development, and then lay out a few practical steps where it all applies.

What is Test-Driven Development?

First, a primer on test-driven development (TDD) in software. TDD is often strived for but difficult to achieve. It goes beyond asserting that program tests are essential to hardening any software product; it establishes that tests come first. 

In this practice, the lifecycle of a software product begins in the eyes of the end users. Their needs are then captured as detailed user stories that nail down each facet of functionality. Next, tests are written so that any program that passes them is, by definition, successful in the eyes of the users. Only then does development begin. Passing the tests becomes the benchmark, not test coverage.
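
To make the ordering concrete, here's a minimal, hypothetical sketch in Python with pytest. The `discount` module and its rule (10% off orders over $100) are invented for illustration; the point is that the test file exists, and fails, before any implementation is written.

```python
# test_discount.py: written before the implementation exists.
# Hypothetical acceptance criterion: orders over $100 get 10% off.
from discount import apply_discount  # this module doesn't exist yet

def test_small_order_pays_full_price():
    assert apply_discount(50.00) == 50.00

def test_large_order_gets_ten_percent_off():
    assert apply_discount(200.00) == 180.00
```

Development then proceeds until the suite goes green; the tests, not the code, define success.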

But Wait, TDD Rarely Worked for Software Teams, so Why Is It Right for AI?

Following TDD to the letter is undoubtedly challenging; few teams follow it dogmatically. TDD requires very detailed and clear requirements, and any later changes to business needs must be reflected in the tests before development resumes.

For exploratory and greenfield projects, it may be impossible to capture the requirements and necessary tests before work commences. The spirit of TDD, as we see it, is that you start with the goals, write them down as tests, and work toward them; that spirit is what needs to be carried forward into AI.

Why Data-Centric AI?

The present AI boom, powered by Machine Learning, is as much a story about the incredible datasets curated and made available for training as it is about the cutting-edge deep learning architectures that digested them. Before AlexNet came the ImageNet database for Computer Vision. Before GPTs, Common Crawl and similar datasets were developed for Natural Language Processing. Each corresponding leap in AI technology is a story of the data that powers it.

The Generative AI era pushes this concept further with foundation models that arrive pre-trained on vast, generic datasets spanning broad contexts. What foundation models never have is the high-value, high-veracity data needed to achieve groundbreaking results on specific use cases. Here, data-centric AI is the process by which mission-critical datasets are built and injected into foundation AI systems.

Data-Centric AI for Test-Driven Development

At the center of test-driven AI pipelines is the bootstrapping of evaluation datasets that accurately and thoroughly sample the application’s desirable and undesirable behaviors. Initial proof-of-concept deployments can use foundation models to begin broadly inpainting patterns, which are later detailed with real-world user feedback. We’ll go through the high-level, AI-powered stages in this development cycle.

1. Prototype Evaluation Data

Most business applications begin with the establishment of user stories, from which detailed acceptance criteria are created. At this point, example application data will be sparse or non-existent. Previously, this completely blocked the development of an AI application. With foundation models, many input/output (IO) patterns can be created directly from the user stories and acceptance criteria, then reviewed by the business.

This is the first critical step in the journey. Note that we haven’t written a single line of code for the actual application; we’re focusing on roughly filling in the data space with a broad model. While the focus will be on describing correct behavior, successful evaluations must also be able to identify incorrect behavior. Foundation models can generate these negative examples, too.
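
As a sketch of what this can look like, the snippet below uses the OpenAI chat API as one possible foundation model to bootstrap an evaluation set from a user story. The user story, prompt, model name, and JSON schema are all illustrative assumptions, not a prescription.

```python
# Sketch: bootstrap evaluation examples from a user story before any
# application code exists. Assumes OPENAI_API_KEY is set.
import json
from openai import OpenAI

client = OpenAI()

USER_STORY = (
    "As a support agent, I want the assistant to summarize a customer "
    "ticket in two sentences so I can triage it quickly."
)

prompt = (
    f"Given this user story:\n\n{USER_STORY}\n\n"
    "Generate 5 example ticket inputs, each with one desirable output and "
    "one undesirable output. Respond with only a JSON list of objects "
    'with keys "input", "good_output", and "bad_output".'
)

response = client.chat.completions.create(
    model="gpt-4o",  # any capable foundation model works here
    messages=[{"role": "user", "content": prompt}],
)

# In practice, validate this output and review it with the business.
examples = json.loads(response.choices[0].message.content)
with open("prototype_eval_set.json", "w") as f:
    json.dump(examples, f, indent=2)
```

Each generated example then earns its place in the evaluation set only after the business has reviewed and corrected it.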

2. Evaluation Logic

Once example data is established, developers must write the actual evaluation logic. These evaluators will not be perfect, or even very good, initially; they only need to be directionally correct in separating good behavior from bad. Many AI and non-AI options are worth considering. Sometimes regex will suffice, or traditional NLP metrics like ROUGE or BLEU.
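
For instance, here is a minimal sketch of two such directionally correct evaluators against the hypothetical eval set above: a regex-based format check and a ROUGE-L overlap check using the rouge-score package. The two-sentence rule and the 0.3 threshold are illustrative assumptions.

```python
# Sketch: simple, directionally correct evaluators.
# Requires: pip install rouge-score
import re
from rouge_score import rouge_scorer

def passes_format_check(output: str) -> bool:
    # Rule-based check: the summary must be at most two sentences.
    sentences = [s for s in re.split(r"[.!?]+\s*", output.strip()) if s]
    return len(sentences) <= 2

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def passes_rouge_check(output: str, reference: str,
                       threshold: float = 0.3) -> bool:
    # Lexical overlap with the reference answer: crude, but directional.
    return _scorer.score(reference, output)["rougeL"].fmeasure >= threshold
```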

Rigid rules may miss subtleties or fail to discriminate at all, and here again, general-purpose foundation models can serve for the time being. This could mean using an LLM as a judge or directly computing similarities between given and correct responses with BERT-style embedding models.
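
A semantic variant might look like the sketch below, which uses a sentence-transformers embedding model to compare a response against the reference; the model name and 0.7 threshold are assumptions, and an LLM-as-judge prompt could stand in for it just as well.

```python
# Sketch: BERT-style semantic similarity as an evaluator.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def passes_similarity_check(output: str, reference: str,
                            threshold: float = 0.7) -> bool:
    # Embed both texts and compare with cosine similarity.
    embeddings = _model.encode([output, reference])
    return float(util.cos_sim(embeddings[0], embeddings[1])) >= threshold
```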

None of these approaches will be perfect, but we’re aiming for directionally correct. Circling back to data-centric AI, we also have to acknowledge that the ideal dataset against which to evaluate our program does not exist yet, so we don’t want to overfit our evaluations to a prototype evaluation dataset.

3. Model Development

This is when actual development starts: AI developers iterate and bring to bear the technologies that can reasonably pass the evaluations. Each application input generates an output that is scored against the corresponding evaluation data.

Many data-centric techniques boost application performance, from RAG with unstructured and structured context to fine-tuning systems and models. All of these techniques should be assessed against the evaluation suite wherever possible.
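
A minimal harness for this loop might look like the sketch below, reusing the hypothetical eval set and evaluators from steps 1 and 2; `generate_answer` is a placeholder for whatever RAG or fine-tuned pipeline you're testing.

```python
# Sketch: score a candidate application against the evaluation set.
# Uses the evaluators sketched in step 2.
import json

def generate_answer(user_input: str) -> str:
    raise NotImplementedError  # plug in your RAG / fine-tuned pipeline

def run_eval_suite(path: str = "prototype_eval_set.json") -> float:
    with open(path) as f:
        cases = json.load(f)
    scored, passed = 0, 0
    for case in cases:
        reference = case.get("good_output")
        if not reference:  # feedback-only cases may lack a reference
            continue
        output = generate_answer(case["input"])
        scored += 1
        if passes_format_check(output) and passes_similarity_check(
            output, reference
        ):
            passed += 1
    return passed / scored  # the pass rate, not coverage, is the benchmark
```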

In the heat of development, it is easy to forget to retain logs of good and bad behavior, but these, too, should be folded into the evaluation suite as work proceeds. If you’ve followed steps 1-2, you already have a repository of test cases, and it’s easy to add more.
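
Folding a logged interaction back in can be as simple as this sketch, which appends to the same hypothetical JSON file used above:

```python
# Sketch: add a logged good or bad interaction to the evaluation suite.
import json

def add_case(user_input: str, output: str, is_good: bool,
             path: str = "prototype_eval_set.json") -> None:
    with open(path) as f:
        cases = json.load(f)
    cases.append({
        "input": user_input,
        "good_output": output if is_good else None,
        "bad_output": None if is_good else output,
    })
    with open(path, "w") as f:
        json.dump(cases, f, indent=2)
```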

4. Deployment and Feedback

Every AI application should have recorded and easily recallable feedback. Positive and negative behaviors can be marked through direct feedback or surfaced with topic modeling (another AI task) to understand performance. 

All of this information is critical: it becomes the first real dataset that can be used to evaluate future development. Where work initially started with synthetic data, every user interaction now becomes a future benchmark.
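
One way to make that concrete is a record like the sketch below, captured for every interaction at serving time; the field names are illustrative, not a schema recommendation.

```python
# Sketch: a minimal, recallable record of one user interaction.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class InteractionRecord:
    user_input: str
    model_output: str
    model_version: str
    feedback: Optional[str] = None  # e.g. "thumbs_up" / "thumbs_down"
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```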

Models, orchestrations, and platforms are evolving rapidly in AI, and it is essential to use the best techniques available for each application. Negative performance drift (even with new, high-performing models) is also quite common, as popular published metrics never guarantee business application success. Models may broadly improve while performance on your users’ interactions falls.

By retaining past interactions as an evaluation dataset, every update can be checked against newly requested features while preserving long-standing behavior.
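
In practice, that check can be a simple release gate like the sketch below, comparing pass rates from the step-3 harness on the retained dataset; the 1% tolerance is an illustrative choice, not a recommendation.

```python
# Sketch: gate a model update on the retained evaluation dataset.
def gate_release(candidate_pass_rate: float, production_pass_rate: float,
                 tolerance: float = 0.01) -> None:
    # A new version must not regress on retained behavior, no matter how
    # well it scores on public benchmarks.
    if candidate_pass_rate < production_pass_rate - tolerance:
        raise RuntimeError(
            f"Regression: {candidate_pass_rate:.1%} < "
            f"{production_pass_rate:.1%} on the retained eval set"
        )
```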

Closing Thoughts

Most organizations already have many applications in flight and are rapidly testing the value of AI across their business. While this framework can feel cumbersome for rapid prototyping, it is geared toward the smoothest, fastest path to real production AI.

Each part of this process can be solved by a variety of old and new tools, but the overall system has generally not been solved and requires its own development lifecycle. Recalling the initial evaluation datasets we proposed, the platform just needs to be directionally correct. Happy developing.

If you need further guidance or have specific questions, don’t hesitate to contact our team of AI experts. We’ve got fantastic workshops on generative and data-centric AI. We’re here to help you achieve the smoothest transition from development to production and ensure your AI initiatives deliver maximum value.
