Tips for storing a JSON file with GridFS

Question

I have a large dataset. Using Mongoose schemas, each data element has the following structure:

    {
      field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
      field2: "GAA…..GAATG"
    }

Reference: Reading a FASTA file
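For context, records shaped like the one above could be produced from FASTA text roughly as follows. This is only a minimal sketch: `parseFasta` and the sample reads are hypothetical names made up for illustration, and it assumes that every header line starts with `>` and that the lines after it hold the sequence.

```javascript
// Sketch: turn FASTA text into { field1, field2 } records.
// parseFasta is a hypothetical helper, not part of Mongoose or MongoDB.
function parseFasta(text) {
  const records = [];
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue; // skip blank lines
    if (line.startsWith('>')) {
      // A header line starts a new record.
      current = { field1: line, field2: '' };
      records.push(current);
    } else if (current) {
      // Sequence lines are appended to the current record.
      current.field2 += line;
    }
  }
  return records;
}

// Made-up sample input with two reads.
const sample = '>read-1\nGAAT\nTGCA\n>read-2\nCCGG\n';
const records = parseFasta(sample);
console.log(records);
```

Each resulting object matches the `field1`/`field2` shape used throughout the question.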

The individual elements are simple and small, but there are a great many of them, with a combined size of over 200 MB.

The problem is that I cannot store the dataset in MongoDB as a single document because of its size (> 200 MB).

I have come across GridFS as a potential solution, but:

All the resources I found focus on uploading images and videos;
None of them explains how to keep the functionality of a Mongoose schema;
The existing examples do not let the user define the path for saving the data, which is common with Mongoose.

In a basic setting: how would I save a large JSON file using GridFS or a similar approach, the same way I save small JSON files? What are the advantages and disadvantages of this method compared to the alternatives? Is my proposed approach valid? Specifically, using a hierarchy of JSON files and calling populate later has worked well for me.

As a demonstration of saving a JSON file with Mongoose:

Model.create([
  {
    field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
    field2: "GAA…..GAATG"
  },
  {
    field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
    field2: "GAA…..GAATG"
  }
]);

In the example above, only a two-element JSON file was saved. For larger files, I must split them into smaller chunks (such as 1% of the file each) and structure them as described earlier; at least, that is my current solution.
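The splitting step might look something like the sketch below. It is not the asker's actual code: `toBatches`, `fileId`, and `batchSize` are hypothetical names, and the idea of tagging every element with a shared file id is an assumption about how the pieces could be kept together.

```javascript
// Sketch: split records into fixed-size batches tagged with a shared file id.
// toBatches, fileId and batchSize are hypothetical names for illustration.
function toBatches(records, fileId, batchSize) {
  const batches = [];
  for (let start = 0; start < records.length; start += batchSize) {
    const slice = records.slice(start, start + batchSize);
    // Each element carries fileId so the pieces can be found again later.
    batches.push(slice.map((record) => ({ ...record, fileId })));
  }
  return batches;
}

// Five made-up records, split into batches of two.
const demo = Array.from({ length: 5 }, (_, i) => ({
  field1: `>read-${i}`,
  field2: 'GAATG',
}));
const batches = toBatches(demo, 'file-1', 2);
console.log(batches.map((b) => b.length)); // [ 2, 2, 1 ]
```

With a Mongoose model, each batch could then be passed to `Model.insertMany(batch)`, so that no single write carries the whole file.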

My concern is that I might be reinventing the wheel. Although I can save the pieces independently, they need to stay linked because they belong to the same file, much like the segments of an image belong together.

This is my current solution, devised using my own methodology! Although it does not incorporate GridFS, suggestions involving GridFS are still welcomed. It relies solely on JSON files, breaking down the document into smaller pieces arranged in a hierarchical tree fashion.

https://i.stack.imgur.com/QYJXt.png

I solved the problem using the approach in this diagram. Still, out of curiosity and for educational purposes, I would like to know whether something similar can be achieved with GridFS.

Discussion

First, I tried keeping them as subdocuments, which failed. Then I tried storing only their ids, but the ids alone amounted to 35% of the entire chunk and exceeded 16 MB, so that failed too. Finally, I created a placeholder document that stores only the ids, which worked!

javascript json mongodb mongoose

Answer №1

Storing this data in MongoDB using GridFS is highly unlikely to be beneficial.

Storing binary data in a database is generally not recommended. However, for small data, the advantages of being able to query it might outweigh the drawbacks such as server load and slow processing.

If you intend to store JSON document data in GridFS, treat it like any other binary data. Keep in mind that the stored data is opaque to the database: you can query the file metadata, but not the JSON content itself.

Handling Large Data Queries

If querying data is crucial for your needs, assess the data format first. If the data structure resembles the example provided where simple string matching suffices for queries, consider these options:

Scenario 1: Big Data with Minimal Points

If you have few sets of data but each set contains large amounts of information, consider storing the bulk data elsewhere and referencing it instead. For instance, save the actual data in an external file on Amazon S3 and store the link in your MongoDB entry.

{
  field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
  field2link: "https://my-bucket.s3.us-west-2.amazonaws.com/puppy.png"
}

Scenario 2: Numerous Small Data Points

If individual datasets are relatively small (under 16 MB) but there are many of them, opt to save the data directly in MongoDB without utilizing GridFS.

Data Storage Approaches

Given your circumstances involving sizable data, leveraging GridFS could prove inefficient.

One benchmarking analysis suggests that retrieval time grows substantially with file size. In a comparable setup, fetching a document from the database could take up to 80 seconds.

Possible Enhancements

The default chunk size in GridFS is 255 KiB. To speed up access to large files, consider raising it toward the largest value a chunk can have, which is just under the 16 MB document size limit. Set the chunk size when initializing the GridFS bucket.

new GridFSBucket(db, {chunkSizeBytes: 16000000})
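To get a feel for what the chunk size changes, the number of chunk documents a file needs is the file size divided by the chunk size, rounded up. The sketch below is plain arithmetic, not a driver call; `chunkCount` is a hypothetical helper, and the 200 MiB figure is an assumption based on the size mentioned in the question.

```javascript
// Number of GridFS chunks needed for a file: ceil(fileSize / chunkSize).
// chunkCount is a hypothetical helper for illustration only.
function chunkCount(fileSizeBytes, chunkSizeBytes) {
  return Math.ceil(fileSizeBytes / chunkSizeBytes);
}

const fileSize = 200 * 1024 * 1024; // 200 MiB, roughly the dataset in the question
const defaultChunk = 255 * 1024;    // the 255 KiB default
const largeChunk = 16000000;        // the value passed to GridFSBucket above

console.log(chunkCount(fileSize, defaultChunk)); // 804
console.log(chunkCount(fileSize, largeChunk));   // 14
```

Fewer chunks means fewer documents to fetch per file, at the cost of each fetch being larger.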

Alternatively, for improved efficiency, merely store filenames within Mongo entries and retrieve corresponding files directly from the filesystem.
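The filename approach can be sketched without a database at all. In the example below, `saveToFile` and `loadFromFile` are hypothetical helpers, the temporary directory stands in for wherever the files would really live, and the returned `entry` object is the small document one might store in MongoDB in place of the data.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Write the bulk data to disk and return the small document that
// would be stored in MongoDB instead of the data itself.
function saveToFile(dir, filename, records) {
  const filePath = path.join(dir, filename);
  fs.writeFileSync(filePath, JSON.stringify(records));
  return { filename, path: filePath, count: records.length };
}

// Resolve a stored reference back to the data by reading the file.
function loadFromFile(entry) {
  return JSON.parse(fs.readFileSync(entry.path, 'utf8'));
}

// Demo with a temporary directory and one made-up record.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'fasta-'));
const entry = saveToFile(dir, 'fasta-data.json', [
  { field1: '>read-1', field2: 'GAATG' },
]);
const loaded = loadFromFile(entry);
console.log(entry.filename, loaded.length);
```

The trade-off is that the database no longer guarantees the file exists, so the application has to keep the two in sync.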

Additional Considerations

Another potential downside of storing binary data in Mongo has been highlighted by this source: "If the binary data is extensive, loading it into memory may impede access to frequently used text documents or exceed available RAM capacity, affecting overall database performance."

Illustrative Instance

The following example of saving a file in GridFS is adapted from the Mongo GridFS tutorial.

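const fs = require('fs');
const mongodb = require('mongodb');

const uri = 'mongodb://localhost:27017/test';

// Since driver 3.x, connect() yields a MongoClient rather than a Db.
mongodb.MongoClient.connect(uri).then((client) => {
  // client.db() returns the database named in the URI ("test").
  const bucket = new mongodb.GridFSBucket(client.db());

  // Stream the JSON file from disk into GridFS.
  fs.createReadStream('./fasta-data.json')
    .pipe(bucket.openUploadStream('fasta-data.json'))
    .on('error', (error) => { console.error(error); client.close(); })
    .on('finish', () => { console.log('done!'); client.close(); });
}).catch(console.error);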

Answer №2

After exploring different options, I have discovered a more efficient way to address this issue compared to the method described in the original question. Utilizing Virtuals has proven to be incredibly effective!

Initially, I was concerned that using forEach to add an extra field to every element of the Fasta file would be slow. However, my worries were unfounded: the process turned out to be quite fast!

My solution involves modifying the structure of each Fasta element as shown below:

{
  Parentid: { type: mongoose.Schema.Types.ObjectId, ref: "Fasta" }, // new field holding the id of the parent
  field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
  field2: "GAA…..GAATG"
}

Subsequently, I implement the following code snippet:

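FastaSchema.virtual("healthy", {
  ref: "FastaElement",
  localField: "_id",
  foreignField: "Parentid", // must match the field that stores the parent id on each element
  justOne: false,
});

// Virtuals are left out of JSON output by default; enable them so that
// res.json() includes the populated "healthy" array.
FastaSchema.set("toJSON", { virtuals: true });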

Finally, I use the populate function:

  Fasta.find({ _id: "5e93b9b504e75e5310a43f46" }) // Mongoose casts the string to an ObjectId
    .populate("healthy")
    .exec()
    .then((result) => res.json(result))
    .catch((error) => res.status(500).json({ error: error.message }));

This approach effectively avoids complications related to subdocument overload. Populating the Virtual proves to be swift and does not lead to any overload issues. While I haven't formally tested it yet, I am curious to compare its performance with conventional populate methods. One clear advantage is the elimination of the need for storing ids in hidden documents.
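To make the idea concrete, here is an in-memory illustration of the join that populating such a virtual amounts to: every element whose parent-id field equals the parent's `_id` is attached under the virtual's name. This is a sketch of the semantics only, not how Mongoose implements it; `populateVirtual` and the sample documents are hypothetical.

```javascript
// In-memory illustration of the join a populated virtual performs.
// populateVirtual is a hypothetical helper, not Mongoose's implementation.
function populateVirtual(parents, children, { name, localField, foreignField }) {
  return parents.map((parent) => ({
    ...parent,
    // Attach every child whose foreignField equals the parent's localField.
    [name]: children.filter(
      (child) => String(child[foreignField]) === String(parent[localField])
    ),
  }));
}

// Made-up parent and element documents.
const fastas = [{ _id: 'a1' }, { _id: 'b2' }];
const elements = [
  { Parentid: 'a1', field1: '>read-1', field2: 'GAATG' },
  { Parentid: 'a1', field1: '>read-2', field2: 'CCGGA' },
  { Parentid: 'b2', field1: '>read-3', field2: 'TTACG' },
];

const populated = populateVirtual(fastas, elements, {
  name: 'healthy',
  localField: '_id',
  foreignField: 'Parentid',
});
console.log(populated[0].healthy.length); // 2
```

Because the link lives on each element rather than in an array on the parent, the parent document stays small no matter how many elements point to it.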

I am astonished by the elegance of this straightforward solution, which emerged while responding to another inquiry on this platform!

Kudos to mongoose for enabling such seamless functionality!
