Building an AI-Ready Data Warehouse: The Cleanup Journey
BLUF (Bottom Line Up Front): To create an AI-ready data warehouse that supports self-service analytics, we've implemented a systematic cleanup process that removed over 3,000 unused data assets. This approach helps us deliver faster insights to ministry leaders and reduces the risk of AI hallucinations by eliminating outdated information.
The Data Loop Problem
Last week I mentioned something that has been stuck in my mind for days: the data loop. The long back-and-forth from leaders wanting to be data-driven, to analysts, to data engineers, back to analysts, and finally back to leaders, days or sometimes weeks after a simple question was asked.
This is simply not acceptable anymore. AI has arrived and everyone around us expects this process to be faster and better, and they are right.
We want our church to reach more people and we believe data is a tool to help us get there, so data insights must be ready when leaders ask questions.
For that reason we are committed to creating a data warehouse that is AI-ready and suitable for self-service analytics. But getting there is not easy. Here are a few elements of the strategy that will take us there.
Step 1: Clean-up
We have been doing data work at our church for over 10 years, which means many, many data assets have been created along the way. A lot of these assets have become obsolete, yet they are still sitting in our data warehouse.
We are on a mission to eliminate unused data assets and consolidate the ones that can be combined.
The Materialized Views Challenge
The clean-up task was particularly challenging. We use BigQuery to build our data warehouse and decided to implement our golden assets (the final assets consumed by our clients, such as our app, Tableau, etc.) as materialized views instead of tables. This approach lets us restructure the backend without any of the destinations noticing the change.
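To make that concrete, a golden asset in this setup is simply a materialized view sitting on top of the modeled tables. The snippet below is a minimal sketch with made-up names, not one of our real assets:
-- Sketch with hypothetical names: a "golden" materialized view that clients
-- (app, Tableau, etc.) query directly while the underlying tables can change.
CREATE MATERIALIZED VIEW `project_id.marts.mv_weekly_attendance` AS
SELECT
  campus,
  service_date,
  COUNT(*) AS attendance_count
FROM
  `project_id.staging.attendance_events`
GROUP BY
  campus,
  service_date;
If the underlying staging tables ever need to be split or renamed, the view can be recreated with the same name and the clients never notice.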
This was a great decision that gave the engineering team enough flexibility to make bold moves, but on BigQuery we had no way to use the logs to find out which view is being consumed. If the view vw_my_view references table_a and table_b in a simple query such as:
select * from table_a, table_b
Google Cloud Platform (GCP) will log that table_a and table_b are being used, but not vw_my_view. And this is a problem, because both tables might still be important for other useful views, while vw_my_view itself has become obsolete because it has not been used for months.
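You can see the gap for yourself by inspecting the referenced_tables field that BigQuery exposes in its INFORMATION_SCHEMA.JOBS view. This is only a sketch, not part of our pipeline, and the region qualifier and time window are illustrative:
-- Sketch: list the tables BigQuery recorded as referenced by recent jobs.
-- Per the limitation described above, the entries point at underlying tables
-- such as table_a and table_b rather than at the view the client selected from.
SELECT
  creation_time,
  user_email,
  ref.project_id,
  ref.dataset_id,
  ref.table_id
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS,
  UNNEST(referenced_tables) AS ref
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);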
Our Technical Solution
Still, GCP does provide the query text, and with the help of AI that was enough:
INSERT INTO
`project_id.dataset_id.views_usage`
WITH
model_results AS (
SELECT
ml_generate_text_llm_result,
queried_at,
user_email
FROM
ML.GENERATE_TEXT( MODEL `dataset_id.gemini`,
(
SELECT
CONCAT( 'For the given query, generate the following metadata for each referenced table or view and return ONLY a valid JSON array with objects containing these exact fields: project_id, dataset_id, table_id. Query: ', query) AS prompt,
queried_at,
user_email
FROM
`dataset_id.jobs` ),
STRUCT( 0.2 AS temperature,
TRUE AS FLATTEN_JSON_OUTPUT ) ) )
SELECT
JSON_EXTRACT_SCALAR(table_json, "$.project_id") AS project_id,
JSON_EXTRACT_SCALAR(table_json, "$.dataset_id") AS dataset_id,
JSON_EXTRACT_SCALAR(table_json, "$.table_id") AS table_id,
queried_at,
user_email
FROM
model_results,
UNNEST(JSON_EXTRACT_ARRAY(REGEXP_REPLACE(ml_generate_text_llm_result, r'```json\n|\n```', ''))) AS table_json
The query above takes the full query text found in the logs and combines it with pre-made instructions I give to Gemini, specifying what I want extracted from the query text and in what format. Gemini then takes the entire prompt with the query text (unstructured data) and returns all the table/view IDs referenced in the query. Finally, the last step of the query makes sure that each table/view identified by Gemini gets its own row in the results. Voilà. That's how we found a way to catch materialized views that are not being used by any client.
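If the post-processing part looks dense, here is the same idea isolated with a hypothetical, hard-coded Gemini response: strip the markdown fence, parse the JSON array, and unnest it into one row per table:
-- Standalone illustration with a hard-coded, hypothetical model response:
-- strip the markdown fence, parse the JSON array, unnest into one row per table.
SELECT
  JSON_EXTRACT_SCALAR(table_json, "$.project_id") AS project_id,
  JSON_EXTRACT_SCALAR(table_json, "$.dataset_id") AS dataset_id,
  JSON_EXTRACT_SCALAR(table_json, "$.table_id") AS table_id
FROM
  UNNEST(JSON_EXTRACT_ARRAY(REGEXP_REPLACE(
    '```json\n[{"project_id":"p","dataset_id":"d","table_id":"table_a"},{"project_id":"p","dataset_id":"d","table_id":"table_b"}]\n```',
    r'```json\n|\n```',
    ''))) AS table_json;
Running this returns two rows, one for table_a and one for table_b, which is exactly the shape we want to land in views_usage.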
Results and Risk Management
Thanks to this view catcher, we now have a scheduled job that searches daily for unused views and triggers an alert for the data engineers to start decommissioning them.
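The daily check itself can stay simple: compare the materialized views that exist against the views that showed up in views_usage recently, and flag the ones that did not. The sketch below captures the idea; the 90-day window is an illustrative choice and it assumes queried_at is a TIMESTAMP.
-- Sketch: flag materialized views with no recorded usage in the last 90 days.
-- Dataset names reuse the placeholders from the query above; the window is illustrative.
SELECT
  t.table_schema AS dataset_id,
  t.table_name AS table_id
FROM
  `project_id.dataset_id.INFORMATION_SCHEMA.TABLES` AS t
LEFT JOIN (
  SELECT DISTINCT dataset_id, table_id
  FROM `project_id.dataset_id.views_usage`
  WHERE queried_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
) AS u
  ON u.dataset_id = t.table_schema AND u.table_id = t.table_name
WHERE
  t.table_type = 'MATERIALIZED VIEW'
  AND u.table_id IS NULL;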
Over the past few weeks, we were able to delete over 3,000 assets.
Think about that number and how it could lead to outdated answers for leaders and contribute to AI hallucination. These are over 3,000 tables and views that haven't been used in a while, many carrying old and obsolete metadata such as descriptions, or even metrics that no one uses anymore. If I handed all these assets to an AI, it would not be able to discern what's good or bad, so if a pastor used an AI solution to ask about a data point we have in store, the answer might come from one of those 3,000 stale assets that no one uses anymore. That's why this is a reason to celebrate, and why, if you are also pursuing an AI-ready data warehouse, you need to find ways to track the lifecycle of your warehouse's assets as well.
We definitely broke some things in the process, so don't be surprised if you break a few things on your end as well. To limit the damage, every time we're about to delete an object from our data warehouse, we move it to what we call a "dumpster dataset" within BigQuery. This dumpster has an expiration period of 30 days. Yup, every table or view moved to this dataset auto-deletes in 30 days, but that's enough time for the "scream test" to notify us (if moving an asset breaks something important, someone or something will scream, hence "scream test").
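For anyone who wants to copy the pattern, the dumpster is just a dataset with a default table expiration, plus a copy-then-drop instead of a straight DROP. A minimal sketch with hypothetical names (views would be recreated from their saved definitions rather than copied):
-- Sketch with hypothetical names: a dumpster dataset where everything
-- expires automatically after 30 days.
CREATE SCHEMA IF NOT EXISTS `project_id.dumpster`
OPTIONS (default_table_expiration_days = 30);
-- Instead of dropping an asset outright, park a copy in the dumpster first.
CREATE TABLE `project_id.dumpster.table_old`
COPY `project_id.dataset_id.table_old`;
DROP TABLE `project_id.dataset_id.table_old`;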
Next Steps
This is definitely a lot of work, but it is taking us much closer to an AI-ready data warehouse where only relevant data is stored, so that our users, whether pastors or data analysts, can ask questions about our data without all the noise from unused or outdated assets.
This is just step one of our strategy for building an AI-ready, self-service-ready data warehouse. Our next steps are expanding test coverage and following a "document-first" approach when building data assets. Let's unpack these other steps in the coming weeks.