Summary
I am looking to enhance the functionality of my Dataform pipeline by introducing a layer of modularity (via JavaScript functions) that can identify when there is a disruptive change in the schema of my raw data source. This system would then automatically adjust all the .SQLX scripts within my project, eliminating the need for manual intervention.
--
Context
To provide some context, I have a data collection script that streams JSON-encoded raw data into a Pub/Sub topic at a high rate. A PubSub-to-BigQuery subscriber is then used to load this raw data into an hourly partitioned table in BigQuery. Subsequently, this table acts as the foundation for over 20 user-facing data tables. Essentially, Dataform reads from this raw table periodically and carries out the necessary transformations to update the various tables, each defined by its respective .SQLX script containing business logic.
--
Issue
Occasionally, changes are made to the schema of the raw table (such as modifying field names or types), which requires extensive manual adjustments on the Dataform side to ensure that the query logic aligns with the new schema and prevents any crashes. My goal is to streamline this process and reduce the amount of manual work needed whenever such modifications occur.
--
Potential Resolutions
Currently, Terraform rebuilds the raw table automatically when the schema is altered, but assigns it a different name with a version number (e.g., table_1-0-0 --> table_2-0-0). I propose incorporating JavaScript functions that instruct Dataform on how to handle the data based on the version/name of the raw table. This should be implemented in such a way that if new fields are added or existing ones are modified in the raw table, I do not have to manually update all 20 .SQLX files and reconfigure how these fields are referenced in the queries, etc.
It may sound repetitive, but is this type of automation achievable? Thank you in advance