I have a large dataset. Using a mongoose schema, each data element looks like this:
{
  field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
  field2: "GAA…..GAATG"
}
Reference: Reading an FASTA file
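To make the element structure concrete, here is a minimal sketch of how FASTA text could be turned into such elements. The helper name parseFasta is hypothetical and not part of mongoose; the sketch only assumes the usual FASTA layout, where a header line starts with ">" and the sequence follows on one or more lines.

```javascript
// Hypothetical helper (not a mongoose API): convert FASTA text into
// { field1, field2 } elements shaped like the ones in this question.
function parseFasta(text) {
  const elements = [];
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue; // skip blank lines
    if (line.startsWith(">")) {
      // A header line starts a new element.
      current = { field1: line, field2: "" };
      elements.push(current);
    } else if (current !== null) {
      // Sequence lines are appended to the element being built.
      current.field2 += line;
    }
  }
  return elements;
}

const fastaSample = ">read1\nGAAT\nTGCA\n>read2\nCCGG\n";
const parsedSample = parseFasta(fastaSample);
console.log(parsedSample.length); // 2
console.log(parsedSample[0].field2); // GAATTGCA
```

For a 200MB file it would probably be better to read the input as a stream rather than as one string, but the per-element shape stays the same.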
The individual elements are simple and small, but there are a great many of them, and together they exceed 200MB.
The problem is that I cannot store the data in MongoDB as a single document because of its size (> 200MB), which is far above the 16MB per-document limit.
I have come across GridFS as a potential solution, but:
1. All the resources I found focus on uploading images and videos;
2. None of them explain how to keep the functionality of a mongoose schema;
3. The existing examples do not save the data to a user-defined path, as is usual with mongoose.
In short: how can I save a large JSON file using GridFS or a similar approach, as easily as I work with small JSON files? What are the advantages and disadvantages of this method compared to the alternatives? Is my own approach valid? Specifically, I used a hierarchy of JSON documents and later the populate function, and it works!
As an example of saving a small JSON file with mongoose:
Model.create([
  {
    field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
    field2: "GAA…..GAATG"
  },
  {
    field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
    field2: "GAA…..GAATG"
  }
]);
In the above example, only a two-element JSON file was saved. For larger files, I have to divide them into smaller chunks (for example, 1% of the file each) and structure them as described earlier. At least, that is my current solution.
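The chunking step could look roughly like the following sketch. The helper toBatches and the batch size are illustrative assumptions, not part of mongoose; the commented-out usage assumes a mongoose model called Model, as in the example above.

```javascript
// Hypothetical helper: split an array of elements into batches of at most
// batchSize items each, preserving order.
function toBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

console.log(toBatches([1, 2, 3, 4, 5], 2)); // [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]

// Possible usage with mongoose (assumed, not run here):
// for (const batch of toBatches(allElements, 1000)) {
//   await Model.insertMany(batch);
// }
```

Because every element becomes its own small document, no single document comes anywhere near the size limit; the batches only bound how much is sent per write.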
My concern is that I might be reinventing the wheel. I can save the chunks independently, but they need to be linked to one another because they belong to the same file, much like the segments of an image belong together.
This is my current solution, which I worked out myself. It does not use GridFS, but suggestions involving GridFS are still welcome. It relies solely on JSON documents, breaking the file down into smaller pieces arranged as a hierarchical tree.
https://i.stack.imgur.com/QYJXt.png
The issue has been resolved using the approach in this diagram. Still, out of curiosity and for learning purposes, I would like to know whether something similar can be achieved with GridFS.
Discussion
First, I tried to keep the elements as subdocuments, which failed. Next, I tried keeping only their ids, but the ids alone amounted to 35% of the entire chunk and still exceeded 16MB, so that failed too. Finally, I created a placeholder document that stores only the ids, and that worked!
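The placeholder idea can be sketched in plain JavaScript as follows. This is only an illustration of the tree of id lists, not the exact schema behind the diagram: the names buildIdTree, collectElementIds and maxIdsPerNode are hypothetical, and the nodes are plain objects, whereas in mongoose each node would presumably be a document whose children array holds ObjectId references that populate can resolve.

```javascript
// Hypothetical sketch: group element ids into placeholder nodes so that no
// node holds more than maxIdsPerNode ids. Depth-0 nodes hold element ids;
// higher nodes hold the ids of the nodes one level below them.
function buildIdTree(elementIds, maxIdsPerNode) {
  if (!Number.isInteger(maxIdsPerNode) || maxIdsPerNode < 2) {
    throw new RangeError("maxIdsPerNode must be an integer >= 2");
  }
  const nodes = [];
  let nextNodeId = 0;
  let level = elementIds;
  let depth = 0;
  // Wrap the current level into parent nodes until a single root remains.
  do {
    const parents = [];
    const count = Math.max(level.length, 1); // always create at least one node
    for (let i = 0; i < count; i += maxIdsPerNode) {
      const node = {
        _id: "node" + nextNodeId++,
        depth,
        children: level.slice(i, i + maxIdsPerNode),
      };
      nodes.push(node);
      parents.push(node._id);
    }
    level = parents;
    depth += 1;
  } while (level.length > 1);
  return { rootId: level[0], nodes };
}

// Walk the tree from the root and return the element ids in original order,
// which is roughly what repeated populate calls would do against the database.
function collectElementIds(tree) {
  const byId = new Map(tree.nodes.map((n) => [n._id, n]));
  const walk = (node) =>
    node.depth === 0
      ? node.children
      : node.children.flatMap((id) => walk(byId.get(id)));
  return walk(byId.get(tree.rootId));
}

const idTree = buildIdTree(["a", "b", "c", "d", "e"], 2);
console.log(idTree.nodes.length); // 6
console.log(collectElementIds(idTree)); // [ 'a', 'b', 'c', 'd', 'e' ]
```

Capping the number of ids per node is what keeps each placeholder document under the size limit, at the cost of one extra lookup per tree level when reading the file back.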