Unexpected output from the MongoDB mapReduce function

I have 100 documents stored in MongoDB, and I need to identify and group possible duplicate records based on different conditions, such as first name & last name, email, and mobile phone.

To achieve this, I am utilizing mapReduce to create key-value pairs for these 100 documents, essentially creating groupings within the data.

However, issues arise when a 101st duplicate record is introduced to the database. The output of the mapReduce operation for the other documents that are duplicates with the 101st record becomes corrupted.

For instance:

My current focus is on detecting duplicates based on first name & last name.

With the initial set of 100 documents, the result looks like this:

{
    _id: {
        firstName: "foo",
        lastName: "bar"
    },
    value: {
        count: 20,
        duplicate: [{
            id: ObjectId("/*an object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-01T00:00:00.000Z")
        },{
            id: ObjectId("/*another object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-02T00:00:00.000Z")
        },...]
    },

}

While this is the desired outcome, the scenario changes once more than 100 possible duplicates are present in the DB.

Consider the introduction of the 101st document:

{
    firstName: "foo",
    lastName: "bar",
    email: "[email protected]",
    mobile: "019894793"
}

The subsequent results for 101 and 102 documents look like this:

{
    _id: {
        firstName: "foo",
        lastName: "bar"
    },
    value: {
        count: 21,
        duplicate: [{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        },{
            id: ObjectId("/*another object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-02T00:00:00.000Z")
        }]
    },

}
{
    _id: {
        firstName: "foo",
        lastName: "bar"
    },
    value: {
        count: 22,
        duplicate: [{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        },{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        }]
    },

}

It appears that exceeding 100 possible duplicates causes this issue. A related Stack Overflow question describes something similar but has no working solution: MapReduce results seem limited to 100?

I am seeking suggestions or ideas on how to address this challenge.

Edit:

Initially, my source code looked like this:

var map = function () {
    var value = {
        count: 1,
        userId: this._id
    };
    emit({lastName: this.lastName, firstName: this.firstName}, value);
};

var reduce = function (key, values) {
    var reducedObj = {
        count: 0,
        userIds: []
    };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        reducedObj.userIds.push(value.userId);
    });
    return reducedObj;
};

Now, I have updated it as follows:

var map = function () {
    var value = {
        count: 1,
        users: [this]
    };
    emit({lastName: this.lastName, firstName: this.firstName}, value);
};

var reduce = function (key, values) {
    var reducedObj = {
        count: 0,
        users: []
    };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        reducedObj.users = reducedObj.users.concat(values.users); // or using the forEach method

        // value.users.forEach(function (user) {
        //     reducedObj.users.push(user);
        // });

    });
    return reducedObj;
};

I am unsure why the updated code is failing given that I am still pushing a value (userId) to reducedObj.userIds.

Could there be an issue with the value emitted in the map function?

Answer №1

Addressing the Challenge


This is a common mapReduce pitfall, and clear explanations of it are hard to find, so it deserves a proper answer.

The key point that is often overlooked or misunderstood in the documentation lies here:

  • MongoDB might execute the reduce function multiple times for the same key. The previous output from the reduce function for that key then becomes one of the input values for the next invocation of the reduce function for that key.

Further down the documentation, it adds:

  • The return object's type must be identical to the value type emitted by the map function.

This means that once the number of values emitted for the same key grows beyond a certain threshold, the reduce stage does not process them all in a single pass. Instead, the reduce function is invoked several times, and the outputs of earlier reductions are fed back in as inputs to later ones.

MapReduce is designed to handle extensive datasets by incrementally 'reducing' data until arriving at a consolidated result per key. This underscores the significance of ensuring consistency between the structures of output produced by both emit and reduce.
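To make the re-reduce behaviour concrete, here is a minimal sketch that runs in plain JavaScript. The reduce function is the one from the question; `simulateReReduce`, the batch size of 2 and the string ids are hypothetical stand-ins for what the server does internally, not MongoDB's actual implementation.

```javascript
// Reduce function from the question: it returns { count, userIds },
// although map emits { count, userId }, so the two shapes differ.
function reduceMismatched(key, values) {
    var reducedObj = { count: 0, userIds: [] };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        reducedObj.userIds.push(value.userId);
    });
    return reducedObj;
}

// Hypothetical stand-in for the server: reduce a first batch, then feed that
// partial result back in together with the remaining emitted values.
function simulateReReduce(reduceFn, key, emitted, batchSize) {
    var partial = reduceFn(key, emitted.slice(0, batchSize));
    return reduceFn(key, [partial].concat(emitted.slice(batchSize)));
}

var groupKey = { lastName: "bar", firstName: "foo" };
var emitted = ["a", "b", "c"].map(function (id) {
    return { count: 1, userId: id };
});

var result = simulateReReduce(reduceMismatched, groupKey, emitted, 2);
console.log(result.count, result.userIds);
```

On the second pass the partial result has no `userId` property, so `undefined` is pushed in its place. The count stays correct while the ids collected in the first pass are lost, which is the same pattern as the output shown in the question.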

Resolving the Issue


To rectify this issue, adjustments need to be made in both how data is emitted in the map function and processed in the reduce function:

...JavaScript code snippet provided...

This involves maintaining uniformity in the format used to emit data and initializing the reduce function with a compatible structure at the outset. By aligning these two components, you ensure smooth data flow throughout the process.
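The following is a sketch of what such a consistent pair could look like, not necessarily the exact code of the original answer. It assumes that only the `_id` of each duplicate is collected; the `emit` stub, the sample documents and the two-step reduce at the end are hypothetical and exist only so the example can run outside the mongo shell.

```javascript
// Stand-in for the shell's emit(); it only collects the values for this demo.
var emittedValues = [];
function emit(key, value) {
    emittedValues.push(value);
}

var map = function () {
    // Emit an array even for a single document, so the emitted value has
    // exactly the same shape as the object returned by reduce.
    emit(
        { lastName: this.lastName, firstName: this.firstName },
        { count: 1, userIds: [this._id] }
    );
};

var reduce = function (key, values) {
    var reducedObj = { count: 0, userIds: [] };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        // Works for freshly emitted values and for earlier reduce results.
        reducedObj.userIds = reducedObj.userIds.concat(value.userIds);
    });
    return reducedObj;
};

// Hypothetical sample documents, mapped and then reduced in two passes.
var docs = [
    { _id: "a", firstName: "foo", lastName: "bar" },
    { _id: "b", firstName: "foo", lastName: "bar" },
    { _id: "c", firstName: "foo", lastName: "bar" }
];
docs.forEach(function (doc) { map.call(doc); });

var nameKey = { lastName: "bar", firstName: "foo" };
var firstPass = reduce(nameKey, emittedValues.slice(0, 2));
var finalResult = reduce(nameKey, [firstPass].concat(emittedValues.slice(2)));
console.log(finalResult.count, finalResult.userIds);
```

In the mongo shell the `emit` stub and the manual passes are not needed; the two functions would be passed to something like `db.users.mapReduce(map, reduce, { out: { inline: 1 } })`, where `users` is an assumed collection name.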

Alternative Approach


Considering the anticipated output, utilizing the aggregation framework could offer a more efficient solution compared to mapReduce. The aggregation framework simplifies the task and delivers faster results, making it a preferable choice in this scenario:

...Another JavaScript code snippet showcased here...

Whether employing mapReduce or aggregate, it is essential to bear in mind the limitation imposed by the 16MB document size. Storing 'duplicate' items within an array can potentially hit this restriction.

Moreover, unlike mapReduce, aggregation allows for excluding non-duplicate entries directly from the outcomes. MapReduce lacks this capability and would require post-processing steps to achieve similar filtering results.
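As an illustration of the aggregation approach, a pipeline along these lines would group on the name fields and keep only groups with more than one member. This is a sketch under assumptions: the collection name `users` and the output field names `count` and `userIds` are chosen here for the example and do not come from the question.

```javascript
var pipeline = [
    // Group documents that share the same first and last name.
    { $group: {
        _id: { lastName: "$lastName", firstName: "$firstName" },
        count: { $sum: 1 },
        userIds: { $push: "$_id" }
    } },
    // Drop groups with a single member, since those are not duplicates.
    { $match: { count: { $gt: 1 } } }
];

// In the mongo shell, with the assumed collection name:
// db.users.aggregate(pipeline)
```

The `userIds` array is still subject to the 16MB document size limit mentioned above, so very large groups may need a different output shape.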

The core documentation itself emphasizes:

NOTE
While the Aggregation Pipeline offers enhanced performance and coherence for most operations, map-reduce operations provide unique flexibility not currently accessible through the aggregation pipeline.

Hence, the decision on whether to utilize mapReduce or aggregation hinges on the specific requirements and nuances of the problem at hand.
