April 5, 2024

How to Automate Document Processing with Snowflake’s Document AI

By Garrett Springer

With an endless stream of documents living on the internet and inside organizations, the hardest challenge isn’t finding the information; it’s taking the time to read, analyze, and extract it.

With Document AI from the Snowflake Data Cloud, organizations can use LLMs to automate the conversion of unstructured documents into organized tables.

In this blog, we’ll cover what the Document AI tool is, what use cases it solves, and how to integrate it with document processing pipelines.

What is Document AI from Snowflake?

Document AI is a new Snowflake tool that ingests documents (e.g., PDFs or scanned handwritten docs) and answers natural-language questions about them. By harnessing recent advances in LLMs, users can extract key information buried within large documents without writing code or needing ML knowledge.

Simply upload your documents, ask a question, and get the answer!

Why is Document AI Important?

Businesses of all shapes and sizes carry massive amounts of documents, ranging from company handbooks to meeting notes, presentations, and financial records. Most of these are unstructured electronic documents (PDFs, Word documents, etc.).

Thanks to reduced costs of storing documents electronically, we now have access to a seemingly endless amount of information. Unfortunately, we’re still left with the problem of making this data organized and useful for BI tools and other applications due to the unstructured nature of the documents. That’s where Document AI comes in!

Snowflake Solution

In the past, companies would hire employees whose job was to scan documents, enter their data into an organized table or database, and correct any errors. Document scanning technology and expensive custom software built to process an organization’s unique forms have helped, but these solutions still aren’t ideal.

Snowflake’s Document AI offers a robust document processor that extracts information in response to plain-English questions. Simply upload your documents to Document AI, provide the set of questions you want answered, and the tool does the rest!

Since it’s a Snowflake tool, Snowflake users can even automate the process of loading the extracted info to tables within their Snowflake account.

Snowflake Document AI answering questions from Uber’s 125-page quarterly earnings report.

Once the documents are uploaded and the queries are specified, Snowflake can copy the results into a structured table for further analysis. Data pipelines can be set up in Snowflake using stages, streams, and tasks to automate the continuous process of uploading documents, extracting information, and loading the results into destination tables.
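To make the stage, stream, and task pattern concrete, here is a minimal pure-Python sketch of the idea. It is not Snowflake code: the Stage and Stream classes, the extract function, and the file names are all hypothetical stand-ins used only to show how a stream lets a recurring task process just the files that arrived since its last run.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Stage:
    """Stand-in for a stage: simply a list of uploaded file names."""
    files: list[str] = field(default_factory=list)

    def upload(self, name: str) -> None:
        self.files.append(name)


@dataclass
class Stream:
    """Stand-in for a stream: remembers how much of the stage was consumed."""
    stage: Stage
    offset: int = 0

    def consume(self) -> list[str]:
        # Return only files added since the previous call, then advance.
        new_files = self.stage.files[self.offset:]
        self.offset = len(self.stage.files)
        return new_files


def extract(file_name: str) -> dict:
    """Hypothetical placeholder for the Document AI extraction step."""
    return {"file": file_name, "answer": None}


def run_task(stream: Stream, table: list[dict]) -> int:
    """Stand-in for a scheduled task: process new files and append rows."""
    new_files = stream.consume()
    table.extend(extract(name) for name in new_files)
    return len(new_files)


stage = Stage()
stream = Stream(stage)
table: list[dict] = []

stage.upload("q1_report.pdf")
stage.upload("q2_report.pdf")
first_run = run_task(stream, table)   # picks up both files

stage.upload("q3_report.pdf")
second_run = run_task(stream, table)  # picks up only the new file
```

The point of the design is that the task never rescans the whole stage: each run handles only what changed, so documents are not processed twice.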

What Use Cases Does Document AI Solve?

Before diving into the technology that powers Document AI, let’s go over some of the ways that extracting information from your documents can be useful.

Enhances BI Tools

Business Intelligence tools are one of the most popular ways to get more actionable insights out of your data. However, these tools, such as Tableau and Power BI, require the data to be organized into tables and can’t render information from images and unstructured data.

Integrating Document AI transforms unstructured forms into organized and formatted tables, enabling visualizations, charts, and document summaries.
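As a rough illustration of what "organized and formatted tables" means here, the sketch below turns a few per-document answers into rows with one column per question, then into CSV text a BI tool could load. The input shape, the column names, and the values are invented for illustration and are not Document AI’s actual output format.

```python
import csv
import io


def to_rows(results, columns):
    """Build one row per document, with one column per question."""
    rows = []
    for result in results:
        row = {"file": result["file"]}
        for column in columns:
            # A question the document did not answer becomes None.
            row[column] = result["answers"].get(column)
        rows.append(row)
    return rows


def to_csv(rows, columns):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["file", *columns], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical questions and made-up placeholder answers.
columns = ["total_revenue", "net_income"]
results = [
    {"file": "report_a.pdf", "answers": {"total_revenue": "100", "net_income": "10"}},
    {"file": "report_b.pdf", "answers": {"total_revenue": "250"}},
]

rows = to_rows(results, columns)
csv_text = to_csv(rows, columns)
print(csv_text)
```

Every document ends up with the same columns, even when an answer is missing, which is what lets charting and aggregation tools treat the output as an ordinary table.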

Ask the Doc

Another example is an Ask-the-doc application. With a question-answering (QA) system, users can upload an arbitrary document and ask questions about it in natural language. Imagine a company just released its yearly earnings report, and it contains over 200 pages of dense, dry language and charts. With a QA system, you can simply upload the document and ask your questions without having to dig through the entire document in search of an answer.

Image taken from Snowflake’s blog.

Key Information Extraction

The last example we’ll cover is the one that we’ll be tackling over the remainder of this blog: automatically extracting key information from dense documents. For this example, we’re going to step into the shoes of a financial services company that needs to stay up to date on the companies it has invested in.

Financial analysts spend a lot of time reading earnings reports, legal documents, and other dense financial reports that describe how a company is performing. Much of that time could be freed up by automating the process of ingesting these documents and extracting key bits of information using natural language querying. Before we dive into the demo, the next section covers a couple of the key technologies that enable Document AI.

What’s Going on Under the Hood?

Fortunately, Snowflake Document AI was built with ease of use in mind, which means that it doesn’t require machine learning expertise or a steep learning curve to get started! If you want to skip the technical background and dive straight into the demo, skip to the next section; otherwise, we’ll cover some of the technical details going on under the hood.

Technical Overview

Document AI relies on two key techniques: Optical Character Recognition (OCR) and Natural Language Processing (NLP) with a first-party Large Language Model (LLM). OCR converts a document containing printed or handwritten text into text that the LLM can consume. The LLM then receives queries from the user and answers them using the text in the document.

Optical Character Recognition

Optical Character Recognition is a technology used to convert scanned documents, PDFs, images, and handwritten texts into editable and searchable text. By analyzing the shapes and patterns of characters within the scanned images, OCR software identifies and extracts the text, preserving its layout and formatting. 

This enables organizations to digitize paper-based information, streamline document processing workflows, and make textual content easily accessible and searchable for Natural Language Processing.

Large Language Models

Snowflake Document AI is powered by a first-party Large Language Model (LLM). LLMs have been the main subject of attention in the tech world since the release of LLM-powered tools such as OpenAI’s ChatGPT and GitHub Copilot.

LLMs are advanced AI systems trained on vast amounts of text data to understand and generate human-like language. When applied to document question-answering tasks, LLMs can comprehend complex queries and provide accurate and contextually relevant answers by analyzing the content of documents. Leveraging their deep understanding of language, LLMs facilitate efficient information retrieval and knowledge extraction, enabling users to obtain precise insights from documents with minimal manual effort.

Document AI provides a trained, general-purpose model out of the box that is designed to perform well on a broad range of documents and question-answering tasks. However, some domain-specific documents may require additional fine-tuning to achieve optimal performance.

Just like people, LLMs can learn by example. This is where training jobs come into play!

By showing the model what the answer is supposed to be, you can train it to perform better on future documents. If you notice that some of the answers are incorrect, you can manually enter the correct answers in the Snowflake UI. Snowflake documentation recommends that you annotate at least 20 documents before starting a training job.
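The annotation step happens in the Snowflake UI, but the bookkeeping behind it can be pictured with a small sketch. The helper below is hypothetical and not part of Snowflake: it takes pairs of (model answer, corrected answer), reports how often the model was already right, and checks the count against the 20-document recommendation mentioned above. The sample answers are made up.

```python
MIN_ANNOTATED_DOCS = 20  # recommendation quoted in the text above


def review(annotations):
    """Summarize a list of (model_answer, corrected_answer) pairs."""
    total = len(annotations)
    correct = sum(1 for model, truth in annotations if model == truth)
    accuracy = correct / total if total else 0.0
    return {
        "annotated": total,
        "accuracy": accuracy,
        "ready_to_train": total >= MIN_ANNOTATED_DOCS,
    }


# Made-up example: 18 answers the model got right, 2 that were corrected.
annotations = [("A", "A")] * 18 + [("B", "C")] * 2
summary = review(annotations)
print(summary)

# With only a handful of annotated documents, the check fails.
small_summary = review([("A", "A")] * 5)
```

A low accuracy on the annotated set is the signal that a training job is likely worth running; a high one suggests the out-of-the-box model may already be good enough.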

Once again, the great benefit of Document AI is that it is a no-code tool! This means that you don’t have to understand how to code, let alone how these complex models work, to take advantage of their utility!

Document AI Processing Demo

In this demo, we’re setting up a pipeline that loads quarterly earnings reports from a stage, processes them using Document AI, and then loads the data into a table. With this pipeline, financial analysts will have access to the key information without sifting through entire documents.

At the time of writing, Document AI is still in private preview. Companies that want to follow along can reach out to Snowflake to request access.

In the demo, we will load PDFs into an internal Snowflake stage, use Document AI to extract key information, and finally load it into a table. This table data can then be used in visualization tools like Tableau or Power BI, which connect directly to Snowflake.

You can read more about connecting BI tools to Snowflake in our other blogs.

In the video below, we create a data processing pipeline that loads documents from a stage, processes them with Document AI, and then loads the data into a table. Document AI is used to analyze and extract important information from a set of quarterly earnings reports which are used to report details about a company’s financial status. 

By uploading these reports to Document AI, the process of extracting the information of interest is automated, making the data available for consumption by downstream applications.

Closing Thoughts

Setting up Document AI to extract more insight from your documents is powerful, but it’s only the first step in getting the most out of your data.

Still curious?

Reach out to phData and see how our consultants can help your business grow with our machine learning and Snowflake consulting services.
