Optimizing JSON data by recursively removing empty values, converting strings to booleans, and trimming whitespace using jq - best practices?

As part of data preprocessing, I need to remove all empty values from an input JSON document. This includes empty arrays [], empty objects {}, and empty or whitespace-only strings such as "", " ", and "\t". I also need to recursively trim leading and trailing whitespace from all strings, including object keys. I came up with a solution using jq 1.6 and a customized walk() function. It works, but I would like to improve the query's performance, particularly its CPU usage. I currently run it through executeScript on a cluster of 10 nodes, each with 4 CPUs and 16GB of RAM, and CPU rather than memory appears to be the bottleneck.

jq 'walk(
  if type == "string" then
    (sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; "") | if . == "true" then . |= true else . end | if . == "false" then . |= false else . end)
  elif type == "object" then
    with_entries(select(.value | IN("", null, [], {}) | not) | .key |= sub("^[[:space:]]+"; "") | .key |= sub("[[:space:]]+$"; "") | select(.key | IN("") | not))
  elif type == "array" then
    map(select(. | IN("", null, [], {}) | not))
  else . end)'

This is my current approach. I also convert "true" to boolean true and "false" to boolean false. Are there any apparent optimizations that can be made to the query?

While I did consider implementing the entire process in JavaScript or Groovy, I found that jq handles recursive processing of nested JSON objects elegantly, saving me from reinventing the wheel. However, I am open to exploring JavaScript or Groovy implementations if significant enhancements cannot be achieved within the jq query.

Answer №1

Without seeing the customized version of walk that you are using, it is difficult to give specific advice. That said, here is a definition that should be more efficient than the one in builtins.jq; it recurses through an inner function w instead of calling walk(f) again at every level:

def walk(f):
  def w:
    if type == "object"
    then . as $in
    | reduce keys_unsorted[] as $key
        ( {}; . + { ($key):  ($in[$key] | w) } ) | f
    elif type == "array" then map( w ) | f
    else f
    end;
  w;
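
To make the traversal order concrete, here is a minimal Rust sketch of the same bottom-up idea: children are rewritten first, and only then is the function applied to the parent. That order is what lets a container that becomes empty during cleaning be removed by its parent. The Json enum and the walk, is_empty, and prune functions are hypothetical names invented for this illustration, using only the standard library; this is not how jq implements walk internally.

```rust
// Hypothetical mini JSON model for illustration (standard library only).
#[derive(Debug, PartialEq)]
enum Json {
    Null,
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

// Bottom-up walk: rewrite the children first, then apply `f` to the node itself.
fn walk(v: Json, f: &dyn Fn(Json) -> Json) -> Json {
    let rebuilt = match v {
        Json::Arr(items) => Json::Arr(items.into_iter().map(|x| walk(x, f)).collect()),
        Json::Obj(fields) => Json::Obj(
            fields.into_iter().map(|(k, x)| (k, walk(x, f))).collect(),
        ),
        leaf => leaf,
    };
    f(rebuilt)
}

// True for null, "", [] and {}.
fn is_empty(v: &Json) -> bool {
    match v {
        Json::Null => true,
        Json::Str(s) => s.is_empty(),
        Json::Arr(a) => a.is_empty(),
        Json::Obj(o) => o.is_empty(),
    }
}

// One cleaning step: trim strings, drop empty direct children of containers.
fn prune(v: Json) -> Json {
    match v {
        Json::Str(s) => Json::Str(s.trim().to_owned()),
        Json::Arr(a) => Json::Arr(a.into_iter().filter(|x| !is_empty(x)).collect()),
        Json::Obj(o) => Json::Obj(o.into_iter().filter(|(_, x)| !is_empty(x)).collect()),
        other => other,
    }
}

fn main() {
    // Represents {"a": [" ", []], "b": "x ", "c": null}
    let doc = Json::Obj(vec![
        ("a".to_owned(), Json::Arr(vec![Json::Str(" ".to_owned()), Json::Arr(vec![])])),
        ("b".to_owned(), Json::Str("x ".to_owned())),
        ("c".to_owned(), Json::Null),
    ]);
    // prints: Obj([("b", Str("x"))])
    println!("{:?}", walk(doc, &prune));
}
```

Because the array under "a" is cleaned before the enclosing object is examined, it is already empty by the time the object's entries are filtered, so the whole entry disappears.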

Answer №2

After much consideration, I decided to write a small Rust binary that performs the task:

use std::io::Read;
use serde_json::{Value, Map};

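/// Recursively cleans a value: trims strings and object keys, converts
/// "true"/"false" (case-insensitively) to booleans, and returns None for
/// null, empty strings, and arrays or objects that are empty after cleaning.
fn clean_value(val: &Value) -> Option<Value> {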
    match val {
        Value::Null => None,
        Value::String(s) => {
            let trimmed = s.trim().to_owned();
            match trimmed.to_lowercase().as_str() {
                "true" => Some(Value::Bool(true)),
                "false" => Some(Value::Bool(false)),
                _ => if trimmed.is_empty() { None } else { Some(Value::String(trimmed)) },
            }
        },
        Value::Array(arr) => {
            let cleaned: Vec<Value> = arr.iter()
                .filter_map(clean_value)
                .collect();
            if cleaned.is_empty() { None } else { Some(Value::Array(cleaned)) }
        },
        Value::Object(map) => {
            let cleaned: Map<String, Value> = map.iter()
                .filter_map(|(k, v)| clean_value(v).map(|v| (k.trim().to_owned(), v)))
                .collect();
            if cleaned.is_empty() { None } else { Some(Value::Object(cleaned)) }
        },
        _ => Some(val.clone()),
    }
}

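/// Parses `json`, cleans it, and serializes the result.
/// Yields an empty string if nothing survives cleaning.
fn clean_json(json: &str) -> Result<String, serde_json::Error> {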
    let value: Value = serde_json::from_str(json)?;
    let cleaned = clean_value(&value);
    match cleaned {
        Some(v) => Ok(serde_json::to_string(&v)?),
        None => Ok(String::new()),
    }
}

fn main() {
    let mut buffer = String::new();
    std::io::stdin().read_to_string(&mut buffer).unwrap();
    match clean_json(&buffer) {
        Ok(json) => println!("{}", json),
        Err(e) => eprintln!("Error cleaning json: {}", e),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        let input = r#"
        {
            "  key  ": "  true  ",
            "  empty array  ": [],
            "  empty object  ": {},
            "  empty string  ": "",
            "  null  ": null,
            "  nested  ": {
                "  key  ": "  false  ",
                "  empty array  ": [],
                "  empty object  ": {},
                "  empty string  ": "",
                "  null  ": null
            }
        }
        "#;
        let expected = r#"{"key":true,"nested":{"key":false}}"#;
        let cleaned = clean_json(input).unwrap();
        assert_eq!(cleaned, expected);
    }
}

The execution speed is notably improved:

time ./clean-json < ~/test_small.json
real    0m0.022s
user    0m0.021s
sys     0m0.001s

compared with an optimized version of the jq query:

real    0m0.365s
user    0m0.336s
sys     0m0.031s
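
One behavioral difference is worth keeping in mind when comparing the two: the jq filter in the question only converts strings that equal "true" or "false" exactly after trimming, whereas clean_value above lowercases the trimmed string first, so "TRUE" and "False" are converted as well. The sketch below isolates that string branch using only the standard library; the Scalar enum and the normalize function are hypothetical names introduced for this illustration and are not part of the binary above.

```rust
// Hypothetical result type for this illustration (standard library only).
#[derive(Debug, PartialEq)]
enum Scalar {
    Bool(bool),
    Text(String),
}

// Mirrors the string branch of clean_value: trim, then match case-insensitively.
fn normalize(raw: &str) -> Option<Scalar> {
    let trimmed = raw.trim();
    match trimmed.to_lowercase().as_str() {
        "true" => Some(Scalar::Bool(true)),
        "false" => Some(Scalar::Bool(false)),
        "" => None, // empty or whitespace-only strings are dropped
        _ => Some(Scalar::Text(trimmed.to_owned())),
    }
}

fn main() {
    for raw in ["  TRUE ", "False", " \t", " yes "] {
        println!("{:?} -> {:?}", raw, normalize(raw));
    }
}
```

If exact parity with the jq filter matters, matching on the trimmed string directly instead of its lowercased copy would restore case-sensitive behavior.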

Answer №3

Here are two variants of walk that may be useful if you only need to transform individual values and keys.

scalar_walk should be the faster of the two. atomic_walk relies on with_entries, which converts each object to an array of key/value entries and back again, so it is likely to be slower on JSON objects.

# Apply f to object keys and to scalar values only;
# f is never applied to the objects or arrays themselves
def atomic_walk(f):
  def w:
    if type == "object"
    then with_entries( .key |= f | .value |= w)
    elif type == "array" then map( w )
    else f
    end;
  w;
# Apply f to scalar values only (keys are left untouched)
def scalar_walk(f):
  def w:
    if type == "object"
    then map_values(w)
    elif type == "array" then map( w )
    else f
    end;
  w;
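
As a rough illustration of what a leaf-only walk does and does not do, here is a minimal Rust sketch of the scalar_walk idea. The Json enum and the trim_leaf function are hypothetical names made up for this example, and only the standard library is used. Since the function is never applied to arrays or objects, a walk of this shape can trim and convert values but will not remove empty containers on its own; the question's requirement to drop [] and {} would presumably need a separate step.

```rust
// Hypothetical mini JSON model for illustration (standard library only).
#[derive(Debug, PartialEq)]
enum Json {
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

// Leaf-only walk: `f` sees scalars, never the arrays or objects themselves.
fn scalar_walk(v: Json, f: &dyn Fn(Json) -> Json) -> Json {
    match v {
        Json::Arr(items) => {
            Json::Arr(items.into_iter().map(|x| scalar_walk(x, f)).collect())
        }
        Json::Obj(fields) => Json::Obj(
            fields.into_iter().map(|(k, x)| (k, scalar_walk(x, f))).collect(),
        ),
        leaf => f(leaf),
    }
}

// Example leaf transform: trim surrounding whitespace from strings.
fn trim_leaf(v: Json) -> Json {
    match v {
        Json::Str(s) => Json::Str(s.trim().to_owned()),
        other => other,
    }
}

fn main() {
    // Represents {"a": [" x ", []]}
    let doc = Json::Obj(vec![(
        "a".to_owned(),
        Json::Arr(vec![Json::Str(" x ".to_owned()), Json::Arr(vec![])]),
    )]);
    // The string is trimmed, but the empty array survives.
    // prints: Obj([("a", Arr([Str("x"), Arr([])]))])
    println!("{:?}", scalar_walk(doc, &trim_leaf));
}
```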
