The files I receive will each contain at least a million rows, up to a maximum of 1.5 billion. The data arrives normalized, and I am looking for a way to denormalize it so that everything for a given user is stored in a single document. The format varies: it could be CSV, fixed-width text, TSV, or something else.
Currently, I have imported some collections from sample CSV files.
Below is a small representation of my data, with some fields omitted:
In my beneficiaries.csv file, the same user can appear multiple times:
beneficiaries.csv contains over 6 million records
record # 1
{"userid":"a9dk4kJkj",
"gender":"male",
"dob":20080514,
"start_date":20000101,
"end_date":20080227}
record # 2
{"userid":"a9dk4kJkj",
"gender":"male",
"dob":20080514,
"start_date":20080201,
"end_date":00000000}
Same user, different start and end dates.
claims.csv contains over 200 million records
{"userid":"a9dk4kJkj",
"date":20080514,
"code":"d4rd3",
"blah":"data"}
lab.csv contains over 10 million records
{"userid":"a9dk4kJkj",
"date":20080514,
"lab":"mri",
"blah":"data"}
Based on my current knowledge, I have three options:
Sort the files, read a batch of records from the data files into C++ Member objects, stop at a cutoff point, insert that batch of members into MongoDB, and then continue from where the previous batch ended. This has been tested and works, but sorting such massive files can tie the system up for hours.
Load the data into SQL, read the rows into C++ Member objects one at a time, and then bulk load the data into MongoDB. This has also been tested and works, but I would prefer to avoid it if possible.
Load the documents into separate collections in MongoDB and run a map-reduce with no query parameters to write the merged result into a single collection. I already have the documents loaded into their own collection for each file (as shown above). However, I am new to MongoDB, I am on a tight deadline, and the concept of map-reduce is hard for me to grasp and execute. I have read the documentation and attempted to use the solution provided in this Stack Overflow answer: MongoDB: Combine data from multiple collections into one..how? A sketch of what I understand this approach to look like is included after the desired output below.
The output member collection should resemble the following:
{"userid":"aaa4444",
"gender":"female",
"dob":19901225,
"beneficiaries":[{"start_date":20000101,
"end_date":20080227},
{"start_date":20008101,
"end_date":00000000}],
"claims":[{"date":20080514,
"code":"d4rd3",
"blah":"data"},
{"date":20080514,
"code":"d4rd3",
"blah":"data"}],
"labs":[{"date":20080514,
"lab":"mri",
"blah":"data"}]}
Would loading the data into SQL, reading it into C++, and inserting it into MongoDB outperform map-reduce? If so, I will opt for that method.