During data preprocessing, I need to eliminate all empty values from an input JSON document: empty arrays ([]), empty objects ({}), and every form of empty string such as "", " ", and "\t". I also need to recursively trim whitespace from all strings, including object keys. To achieve this, I devised a solution using jq 1.6 along with a customized walk() function. While this setup works fine, I am looking for ways to improve its performance, CPU utilization in particular. I currently execute it through executeScript on a cluster of 10 nodes, each with 4 CPUs and 16 GB of RAM, and CPU usage, not memory, is the bottleneck.
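For illustration, here is a made-up record and the output I expect for it:

{"  name ": " Alice ", "active": " true", "ids": [" 1 ", " "], "tags": [], "meta": {}, "note": "\t"}

becomes

{"name": "Alice", "active": true, "ids": ["1"]}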
jq 'walk(
  if type == "string" then
    # trim, then turn the literals "true"/"false" into real booleans
    sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; "")
    | if . == "true" then true elif . == "false" then false else . end
  elif type == "object" then
    # drop entries with empty values, trim keys, drop entries whose key became empty
    with_entries(
      select(.value | IN("", null, [], {}) | not)
      | .key |= (sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; ""))
      | select(.key != "")
    )
  elif type == "array" then
    map(select(IN("", null, [], {}) | not))  # drop empty elements
  else . end
)'
This is my current approach. I also convert "true" to boolean true and "false" to boolean false. Are there any apparent optimizations that can be made to the query?
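One tweak I have been considering (not benchmarked yet; I am assuming a gsub with the alternation anchored at both ends behaves exactly like the two chained subs) is to halve the regex invocations per string with a single trim helper:

jq 'def trim: gsub("^[[:space:]]+|[[:space:]]+$"; "");
walk(
  if type == "string" then
    trim | if . == "true" then true elif . == "false" then false else . end
  elif type == "object" then
    with_entries(
      select(.value | IN("", null, [], {}) | not)
      | .key |= trim
      | select(.key != "")
    )
  elif type == "array" then
    map(select(IN("", null, [], {}) | not))
  else . end
)'

Whether one alternation pass is actually cheaper than two anchored subs likely depends on the regex engine, so it would need measuring.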
While I did consider implementing the entire process in JavaScript or Groovy, I found that jq handles recursive processing of nested JSON objects elegantly, saving me from reinventing the wheel. However, I am open to exploring JavaScript or Groovy implementations if significant enhancements cannot be achieved within the jq query.
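One further jq-side variant I have sketched (again unmeasured, and intended to be semantically identical) inlines the recursion instead of going through walk(), so every node is type-dispatched once rather than once by walk() and once more by the filter it applies:

jq 'def trim: gsub("^[[:space:]]+|[[:space:]]+$"; "");
def clean:
  if type == "string" then
    trim | if . == "true" then true elif . == "false" then false else . end
  elif type == "object" then
    with_entries(
      .value |= clean                       # recurse first, bottom-up like walk
      | select(.value | IN("", null, [], {}) | not)
      | .key |= trim
      | select(.key != "")
    )
  elif type == "array" then
    map(clean | select(IN("", null, [], {}) | not))
  else . end;
clean'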