How can I incorporate sophisticated logic into my Dataform process?

Question

Summary

I am looking to enhance the functionality of my Dataform pipeline by introducing a layer of modularity (via JavaScript functions) that can identify when there is a disruptive change in the schema of my raw data source. This system would then automatically adjust all the .SQLX scripts within my project, eliminating the need for manual intervention.

--

Context

To provide some context, I have a data collection script that streams JSON-encoded raw data into a Pub/Sub topic at a high rate. A PubSub-to-BigQuery subscriber is then used to load this raw data into an hourly partitioned table in BigQuery. Subsequently, this table acts as the foundation for over 20 user-facing data tables. Essentially, Dataform reads from this raw table periodically and carries out the necessary transformations to update the various tables, each defined by its respective .SQLX script containing business logic.

--

Issue

Occasionally, changes are made to the schema of the raw table (such as modifying field names or types), which requires extensive manual adjustments on the Dataform side to ensure that the query logic aligns with the new schema and prevents any crashes. My goal is to streamline this process and reduce the amount of manual work needed whenever such modifications occur.

--

Potential Resolutions

Currently, Terraform rebuilds the raw table automatically when the schema is altered, but assigns it a different name with a version number (e.g., table_1-0-0 --> table_2-0-0). I propose incorporating JavaScript functions that instruct Dataform on how to handle the data based on the version/name of the raw table. This should be implemented in such a way that if new fields are added or existing ones are modified in the raw table, I do not have to manually update all 20 .SQLX files and reconfigure how these fields are referenced in the queries, etc.

It may sound repetitive, but is this type of automation achievable? Thank you in advance.

javascript sql google-bigquery etl dataform

Answer 1

Although it's a bit belated, I wanted to share an update on the solution I implemented to achieve my desired outcome.

I divided my code into two separate modules:

constants.js: This file stored global variables containing the full paths to various raw tables in BigQuery, each corresponding to a different schema version.
modules.js: Here, I created a function that performs a UNION ALL operation on all past and present schema versions. I then called this function within the pre_operations block of my main user-facing table during each incremental run.

The contents of modules.js:

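// Function for selecting data from current and previous schemas.
// Assumption: constants.js lives in the same directory as this file.
const constants = require("./constants");

function set_latest_schema() {
    // The fragment ends with a comma, presumably so the calling query
    // can continue the WITH clause with its own CTEs.
    return `
      WITH UNIONED AS (
        SELECT * FROM ${constants.raw_4} WHERE DATE(ingestion_time) >= "2023-01-03"
        UNION ALL
        SELECT * FROM ${constants.raw_3} WHERE DATE(server_timestamp) >= "2023-01-01"
        UNION ALL
        SELECT * FROM ${constants.raw_2} WHERE DATE(server_timestamp) >= "2023-01-01"
        UNION ALL
        SELECT * FROM ${constants.raw_1_1} WHERE DATE(server_timestamp) >= "2023-01-01"
        UNION ALL
        SELECT * FROM ${constants.raw_1} WHERE DATE(server_timestamp) >= "2023-01-01"
      ),
    `;
}

// Export the function so other files can call modules.set_latest_schema().
module.exports = { set_latest_schema };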

The contents of constants.js:

// List of source raw tables (newest to oldest)
const raw_4 = '`project_id.dataset_id.table_id_5`';
const raw_3 = '`project_id.dataset_id.table_id_4`';
const raw_2 = '`project_id.dataset_id.table_id_3`';
const raw_1_1 = '`project_id.dataset_id.table_id_2`';
const raw_1 = '`project_id.dataset_id.table_id_1`';

// Export the constants so they can be referenced as constants.raw_4, etc.
module.exports = { raw_4, raw_3, raw_2, raw_1_1, raw_1 };

When the schema changes, a new raw table is created, a new constant is added that points to it, and a line is added to the UNION ALL statement. A Pub/Sub message then triggers a Cloud Function that deletes the existing user-facing tables, and my orchestration layer in Cloud Workflows runs frequent executions to rebuild them. The approach is not perfect: manual changes to the query logic are still necessary to account for the new schema.
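
One way to shrink that per-version edit is to generate the UNION ALL from a list, so that a new schema version only means adding one entry. The following is a minimal sketch, not the code I actually run: the `sources` array and the `buildUnionCte` name are hypothetical, and the table paths, column names, and dates simply mirror the placeholders above.

```javascript
// Hypothetical description of each raw table version (newest to oldest).
// Paths, timestamp columns and start dates are placeholders, not real values.
const sources = [
  { table: "`project_id.dataset_id.table_id_5`", tsColumn: "ingestion_time", from: "2023-01-03" },
  { table: "`project_id.dataset_id.table_id_4`", tsColumn: "server_timestamp", from: "2023-01-01" },
];

// Builds a "WITH UNIONED AS (...)," fragment with one SELECT per entry,
// joined by UNION ALL. The trailing comma is kept so the caller can
// append further CTEs.
function buildUnionCte(list) {
  const selects = list.map(
    (s) => `SELECT * FROM ${s.table} WHERE DATE(${s.tsColumn}) >= "${s.from}"`
  );
  return `WITH UNIONED AS (\n  ${selects.join("\n  UNION ALL\n  ")}\n),`;
}

console.log(buildUnionCte(sources));
```

Note that UNION ALL matches columns by position, so `SELECT *` only works while every version has the same number of columns with compatible types. Otherwise each entry would need an explicit column list.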

Additionally, I've realized that schema changes caused by new fields can be simplified by using a repeated record structure: custom key-value pairs representing new fields can be added to the raw table without needing to rebuild it. This streamlines the process for changes that would otherwise be breaking.
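
Reading such a key-value pair back in a query comes down to unnesting the repeated record. Below is a minimal sketch of a helper that generates that expression. It assumes a hypothetical repeated column named `custom_fields` with `key` and `value` subfields; those names and the `customField` function are illustrative only.

```javascript
// Returns a BigQuery scalar subquery that pulls one value out of a
// repeated key-value record. The column name `custom_fields` and its
// `key` / `value` subfields are assumptions for illustration.
function customField(key, column = "custom_fields") {
  // Escape single quotes so the key is safe inside a SQL string literal.
  const safeKey = key.replace(/'/g, "\\'");
  return `(SELECT value FROM UNNEST(${column}) WHERE key = '${safeKey}' LIMIT 1)`;
}

console.log(`SELECT ${customField("device_model")} AS device_model FROM UNIONED`);
```

With a helper like this, a newly added field only needs a new `customField("...")` call in the queries that use it, rather than a rebuilt raw table.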
