The files I receive will each contain at least a million rows, up to a maximum of 1.5 billion. The data arrives normalized, and I am looking for a way to denormalize it so that everything for a given user is stored in a single document. The format varies: it could be CSV, fixed-width text, TSV, or something else.
Currently, I have imported some collections from sample CSV files.
Below is a small representation of my data, with some fields omitted:
In my beneficiaries.csv file, the same user can appear multiple times:
beneficiaries.csv contains over 6 million records
record # 1
{"userid":"a9dk4kJkj",
"gender":"male",
"dob":20080514,
"start_date":20000101,
"end_date":20080227}
record # 2
{"userid":"a9dk4kJkj",
"gender":"male",
"dob":20080514,
"start_date":20080201,
"end_date":00000000}
Same user, different start and end dates.
claims.csv contains over 200 million records
{"userid":"a9dk4kJkj",
"date":20080514,
"code":"d4rd3",
"blah":"data"}
lab.csv contains over 10 million records
{"userid":"a9dk4kJkj",
"date":20080514,
"lab":"mri",
"blah":"data"}
Based on my current knowledge, I have three options:
Sort the files, read a batch of records from the data files into C++ Member objects, stop at a cutoff point, insert that batch of members into MongoDB, and then continue from where the previous batch ended. This has been tested and works, but sorting such massive files can tie the system up for hours.
Load the data into SQL, read the rows into C++ Member objects one at a time, and then bulk load the data into MongoDB. This has also been tested and works, but I would prefer to avoid it if possible.
Load the documents into separate collections in MongoDB and run a map-reduce with no query parameters to write the merged result into a single collection. I already have the documents loaded into their own collection for each file (as shown above). However, I am new to MongoDB, I am on a tight deadline, and the concept of map-reduce is hard for me to grasp and execute. I have read the documentation and attempted to use the solution provided in this Stack Overflow answer: MongoDB: Combine data from multiple collections into one..how? A sketch of what I understand this approach to look like is included after the desired output below.
The output member collection should resemble the following:
{"userid":"aaa4444",
"gender":"female",
"dob":19901225,
"beneficiaries":[{"start_date":20000101,
"end_date":20080227},
{"start_date":20008101,
"end_date":00000000}],
"claims":[{"date":20080514,
"code":"d4rd3",
"blah":"data"},
{"date":20080514,
"code":"d4rd3",
"blah":"data"}],
"labs":[{"date":20080514,
"lab":"mri",
"blah":"data"}]}
Would loading the data into SQL, reading it into C++, and inserting it into MongoDB outperform map-reduce? If so, I will opt for that method.