During data preprocessing, I need to eliminate all empty values from an input JSON document: empty arrays ([]), empty objects ({}), and every form of empty string such as "", " ", and "\t". I also need to recursively trim whitespace from all strings, including object keys. To achieve this, I devised a solution using jq 1.6 along with a customized walk() function. While this setup works fine, I am looking for ways to improve its performance, CPU utilization in particular. I currently execute it through executeScript on a cluster of 10 nodes, each with 4 CPUs and 16 GB of RAM, and CPU usage, not memory, is the bottleneck.
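For illustration, here is a made-up record and the output I expect for it:

{"  name ": " Alice ", "active": " true", "ids": [" 1 ", " "], "tags": [], "meta": {}, "note": "\t"}

becomes

{"name": "Alice", "active": true, "ids": ["1"]}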
jq 'walk(
  if type == "string" then
    # trim, then turn the literals "true"/"false" into real booleans
    sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; "")
    | if . == "true" then true elif . == "false" then false else . end
  elif type == "object" then
    # drop entries with empty values, trim keys, drop entries whose key became empty
    with_entries(
      select(.value | IN("", null, [], {}) | not)
      | .key |= (sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; ""))
      | select(.key != "")
    )
  elif type == "array" then
    map(select(IN("", null, [], {}) | not))  # drop empty elements
  else . end
)'
This is my current approach. I also convert "true" to boolean true and "false" to boolean false. Are there any apparent optimizations that can be made to the query?
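One tweak I have been considering (not benchmarked yet; I am assuming a gsub with the alternation anchored at both ends behaves exactly like the two chained subs) is to halve the regex invocations per string with a single trim helper:

jq 'def trim: gsub("^[[:space:]]+|[[:space:]]+$"; "");
walk(
  if type == "string" then
    trim | if . == "true" then true elif . == "false" then false else . end
  elif type == "object" then
    with_entries(
      select(.value | IN("", null, [], {}) | not)
      | .key |= trim
      | select(.key != "")
    )
  elif type == "array" then
    map(select(IN("", null, [], {}) | not))
  else . end
)'

Whether one alternation pass is actually cheaper than two anchored subs likely depends on the regex engine, so it would need measuring.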
While I did consider implementing the entire process in JavaScript or Groovy, I found that jq handles recursive processing of nested JSON objects elegantly, saving me from reinventing the wheel. However, I am open to exploring JavaScript or Groovy implementations if significant enhancements cannot be achieved within the jq query.
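One further jq-side variant I have sketched (again unmeasured, and intended to be semantically identical) inlines the recursion instead of going through walk(), so every node is type-dispatched once rather than once by walk() and once more by the filter it applies:

jq 'def trim: gsub("^[[:space:]]+|[[:space:]]+$"; "");
def clean:
  if type == "string" then
    trim | if . == "true" then true elif . == "false" then false else . end
  elif type == "object" then
    with_entries(
      .value |= clean                       # recurse first, bottom-up like walk
      | select(.value | IN("", null, [], {}) | not)
      | .key |= trim
      | select(.key != "")
    )
  elif type == "array" then
    map(clean | select(IN("", null, [], {}) | not))
  else . end;
clean'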