Using map-reduce to embed documents from other collections into very large MongoDB collections

When I receive these files, each will contain at least a million rows and at most 1.5 billion. The data is normalized when I receive it, and I am looking for a way to store everything for a user in one document. The format may vary: CSV, fixed-width text, TSV, or something else.

Currently, I have imported some collections from sample csv files.

Below is a small representation of my data with missing fields:

In my beneficiaries.csv file, the data is repeated:

beneficiaries.csv contains over 6 million records

record # 1
{"userid":"a9dk4kJkj",
 "gender":"male",
 "dob":20080514,
 "start_date":20000101,
 "end_date":20080227}

record # 2
{"userid":"a9dk4kJkj",
 "gender":"male",
 "dob":20080514,
 "start_date":20080201,
 "end_date":00000000}

Same user, different start and end dates.

claims.csv contains over 200 million records

{"userid":"a9dk4kJkj",
     "date":20080514,
     "code":"d4rd3",
     "blah":"data"}

lab.csv contains over 10 million records

{"userid":"a9dk4kJkj",
     "date":20080514,
     "lab":"mri",
     "blah":"data"}

Based on my current knowledge, I have three options:

  1. Sort the files, read a batch of rows from the data files into C++ Member objects, stop at a certain point, insert the members into MongoDB, and then continue from where the previous batch ended. This method has been tested and works, but sorting such massive files can overload the system for hours.

  2. Load the data into SQL, read it into C++ Member objects one by one, and then bulk load the data into MongoDB. This method has been tested and works, but I would prefer to avoid it if possible.

  3. Load the documents into separate collections in MongoDB and perform a map-reduce function without parameters to write to a single collection. I have the documents loaded in their own collections for each file (as shown above). However, I am new to MongoDB and have a tight deadline. The concept of map-reduce is challenging for me to grasp and execute. I have read the documentation and attempted to use the solution provided in this Stack Overflow answer: MongoDB: Combine data from multiple collections into one..how?

The output member collection should resemble the following:

{"userid":"aaa4444",
 "gender":"female",
 "dob":19901225,
 "beneficiaries":[{"start_date":20000101,
                  "end_date":20080227},
                  {"start_date":20080201,
                  "end_date":00000000}],
"claims":[{"date":20080514,
         "code":"d4rd3",
         "blah":"data"},
        {"date":20080514,
         "code":"d4rd3",
         "blah":"data"}],
"labs":[{"date":20080514,
         "lab":"mri",
         "blah":"data"}]}

Would loading the data into SQL, reading into C++, and inserting into MongoDB outperform map-reduce? If so, I will opt for that method.

Answer №1

Personally, I believe that your data is well-suited for map-reduce, so I would choose option 3: load the documents into MongoDB in 3 separate collections (beneficiaries, claims, labs) and run map-reduce on the userid key for each collection. Afterwards, merge the results from the 3 collections into a single collection using find and insert, again keyed on userid.

For instance, if you load beneficiaries.csv into the beneficiaries collection, here is sample code for map-reduce on beneficiaries:

// Emit each row in the same shape that reduce returns, so that MongoDB
// can safely re-reduce partial results for the same userid.
mapBeneficiaries = function() {
    var values = {
        beneficiaries: [{start_date: this.start_date, end_date: this.end_date}],
        gender: this.gender,
        dob: this.dob
    };
    emit(this.userid, values);
};

// Concatenate the coverage periods; gender and dob repeat on every row.
reduce = function(k, values) {
  var list = { beneficiaries: [], gender : '', dob: ''};
  values.forEach(function(v) {
    list.beneficiaries = list.beneficiaries.concat(v.beneficiaries);
    list.gender = v.gender;
    list.dob = v.dob;
  });
  return list;
};

db.beneficiaries.mapReduce(mapBeneficiaries, reduce, {"out": {"reduce": "mr_beneficiaries"}});

The resulting data in mr_beneficiaries will look similar to this:

{
    "_id" : "a9dk4kJkj",
    "value" : {
        "beneficiaries" : [ 
            {
                "start_date" : 20080201,
                "end_date" : 0
            }, 
            {
                "start_date" : 20080201,
                "end_date" : 0
            }
        ],
        "gender" : "male",
        "dob" : 20080514
    }
}
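
MongoDB may call reduce more than once for the same key and feed earlier results back in, so it helps when the emitted values and the reduce output share one shape. Below is a minimal sketch of that idea for the claims collection. The names mapClaims, reduceClaims and runMapReduce are my own, the emit function and the harness are stand-ins so the snippet can run outside MongoDB, and the second and third sample claims are made up for illustration.

```javascript
// Stand-in for MongoDB's emit(); inside mapReduce the server provides it.
var groups = {};
function emit(key, value) {
    (groups[key] = groups[key] || []).push(value);
}

// Map: wrap each claim in a one-element array so the emitted value
// already has the shape that reduce returns.
var mapClaims = function() {
    emit(this.userid, {
        claims: [{ date: this.date, code: this.code, blah: this.blah }]
    });
};

// Reduce: concatenate the arrays. This works on map output and on
// earlier reduce output alike.
var reduceClaims = function(key, values) {
    var result = { claims: [] };
    values.forEach(function(v) {
        result.claims = result.claims.concat(v.claims);
    });
    return result;
};

// Hypothetical in-memory harness (not a MongoDB API): maps every document,
// then reduces each key in two stages to mimic a re-reduce of partial results.
function runMapReduce(docs, map, reduce) {
    groups = {};
    docs.forEach(function(doc) { map.call(doc); });
    var out = {};
    Object.keys(groups).forEach(function(key) {
        var values = groups[key];
        var partial = reduce(key, values.slice(0, 2));
        out[key] = reduce(key, [partial].concat(values.slice(2)));
    });
    return out;
}

// Sample rows: the first is from the question, the other two are invented.
var sampleClaims = [
    { userid: "a9dk4kJkj", date: 20080514, code: "d4rd3", blah: "data" },
    { userid: "a9dk4kJkj", date: 20080601, code: "x1yz2", blah: "data" },
    { userid: "a9dk4kJkj", date: 20080702, code: "q7rs8", blah: "data" }
];
var claimsByUser = runMapReduce(sampleClaims, mapClaims, reduceClaims);
console.log(JSON.stringify(claimsByUser, null, 2));
```

In the mongo shell you would drop the emit stand-in and the harness and call db.claims.mapReduce(mapClaims, reduceClaims, {"out": {"reduce": "mr_claims"}}) instead.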

Repeat the same process to obtain mr_claims and mr_lab, then merge them into the singledocuments collection:

db.mr_beneficiaries.find().forEach(function(doc) {
    var id = doc._id;
    var claims = db.mr_claims.findOne({"_id":id});
    var labs = db.mr_lab.findOne({"_id":id});
    // A user may have no claims or no labs, in which case findOne returns null.
    db.singledocuments.insert({"userid":id,
                         "gender":doc.value.gender,
                         "dob":doc.value.dob,
                         "beneficiaries":doc.value.beneficiaries,
                         "claims":claims ? claims.value.claims : [],
                         "labs":labs ? labs.value.labs : []});
});
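
A possible variation, since the "reduce" output mode is already in use: point all three map-reduce jobs at the same output collection with one shared reduce function, so that MongoDB merges the per-user results and no findOne per user is needed. The sketch below only simulates that behaviour in memory. The names reduceMember, runInto and members are hypothetical, emit is a stand-in, and the order in which MongoDB passes the existing and the new value to reduce is an assumption here.

```javascript
// Stand-in for MongoDB's emit(); inside mapReduce the server provides it.
var pending = {};
function emit(key, value) {
    (pending[key] = pending[key] || []).push(value);
}

// Every map emits the full member shape, leaving the fields it does not own empty.
var mapBeneficiaries = function() {
    emit(this.userid, {
        gender: this.gender, dob: this.dob,
        beneficiaries: [{ start_date: this.start_date, end_date: this.end_date }],
        claims: [], labs: []
    });
};
var mapClaims = function() {
    emit(this.userid, {
        gender: "", dob: "", beneficiaries: [],
        claims: [{ date: this.date, code: this.code, blah: this.blah }],
        labs: []
    });
};
var mapLabs = function() {
    emit(this.userid, {
        gender: "", dob: "", beneficiaries: [], claims: [],
        labs: [{ date: this.date, lab: this.lab, blah: this.blah }]
    });
};

// One reduce shared by all three jobs: concatenate arrays, keep non-empty scalars.
var reduceMember = function(key, values) {
    var result = { gender: "", dob: "", beneficiaries: [], claims: [], labs: [] };
    values.forEach(function(v) {
        if (v.gender) { result.gender = v.gender; }
        if (v.dob) { result.dob = v.dob; }
        result.beneficiaries = result.beneficiaries.concat(v.beneficiaries);
        result.claims = result.claims.concat(v.claims);
        result.labs = result.labs.concat(v.labs);
    });
    return result;
};

// Hypothetical simulation of {"out": {"reduce": ...}}: a key that already
// exists in the target is reduced together with the new result.
function runInto(target, docs, map, reduce) {
    pending = {};
    docs.forEach(function(doc) { map.call(doc); });
    Object.keys(pending).forEach(function(key) {
        var values = pending[key];
        // Like MongoDB, skip reduce when a key has a single value.
        var fresh = values.length === 1 ? values[0] : reduce(key, values);
        target[key] = Object.prototype.hasOwnProperty.call(target, key)
            ? reduce(key, [target[key], fresh])
            : fresh;
    });
}

// Sample rows taken from the question (end_date 00000000 written as 0).
var members = {};
runInto(members, [
    { userid: "a9dk4kJkj", gender: "male", dob: 20080514, start_date: 20000101, end_date: 20080227 },
    { userid: "a9dk4kJkj", gender: "male", dob: 20080514, start_date: 20080201, end_date: 0 }
], mapBeneficiaries, reduceMember);
runInto(members, [
    { userid: "a9dk4kJkj", date: 20080514, code: "d4rd3", blah: "data" }
], mapClaims, reduceMember);
runInto(members, [
    { userid: "a9dk4kJkj", date: 20080514, lab: "mri", blah: "data" }
], mapLabs, reduceMember);
console.log(JSON.stringify(members, null, 2));
```

In the mongo shell this would correspond to three calls such as db.claims.mapReduce(mapClaims, reduceMember, {"out": {"reduce": "members"}}), one per source collection, with the merged fields ending up under each document's value key.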
