I have 100 documents stored in MongoDB, and I am facing the challenge of identifying and grouping possible duplicate records based on different conditions such as first name & last name, email, and mobile phone.
To achieve this, I am using mapReduce to create key-value pairs for these 100 documents, essentially creating groupings within the data.
However, issues arise when a 101st duplicate record is introduced to the database: the mapReduce output for the documents that are duplicates of the 101st record becomes corrupted.
For instance:
My current focus is on detecting duplicates based on first name & last name.
With the initial set of 100 documents, the result looks like this:
{
    _id: {
        firstName: "foo",
        lastName: "bar",
    },
    value: {
        count: 20,
        duplicate: [{
            id: ObjectId("/*an object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-01T00:00:00.000Z")
        },{
            id: ObjectId("/*another object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-02T00:00:00.000Z")
        },...]
    },
}
While this is the desired outcome, the scenario changes once more than 100 possible duplicates are present in the DB.
Consider the introduction of the 101st document:
{
    firstName: "foo",
    lastName: "bar",
    email: "foo@bar.com",
    mobile: "019894793"
}
After adding the 101st and 102nd documents, the results look like this:
{
    _id: {
        firstName: "foo",
        lastName: "bar",
    },
    value: {
        count: 21,
        duplicate: [{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        },{
            id: ObjectId("/*another object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-02T00:00:00.000Z")
        }]
    },
}
{
    _id: {
        firstName: "foo",
        lastName: "bar",
    },
    value: {
        count: 22,
        duplicate: [{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        },{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        }]
    },
}
It appears that exceeding 100 possible duplicates causes this issue, similar to what was discussed in a related Stack Overflow topic that has no working solution: MapReduce results seem limited to 100?
I am seeking suggestions or ideas on how to address this challenge.
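One possible direction, sketched here under assumptions: if the goal is simply "group by name and list the members of each group", the aggregation framework can do this in a single pass, with no re-reduce semantics to worry about. The collection name users and the fullName/DOB field names are assumptions based on the documents shown above.

```javascript
// Aggregation-pipeline alternative to the mapReduce approach.
// Assumed: a collection named "users" with firstName, lastName, fullName, DOB fields.
var pipeline = [
  { $group: {
      _id: { firstName: "$firstName", lastName: "$lastName" },
      count: { $sum: 1 },
      duplicate: { $push: { id: "$_id", fullName: "$fullName", DOB: "$DOB" } }
  } },
  // Keep only groups that actually contain more than one document:
  { $match: { count: { $gt: 1 } } }
];
// In the mongo shell: db.users.aggregate(pipeline)
```

$push collects one entry per grouped document, so the duplicate array is built directly by the server rather than by a user-supplied reduce function.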
Edit:
Initially, my source code looked like this:
var map = function () {
    var value = {
        count: 1,
        userId: this._id
    };
    emit({lastName: this.lastName, firstName: this.firstName}, value);
};

var reduce = function (key, values) {
    var reducedObj = {
        count: 0,
        userIds: []
    };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        reducedObj.userIds.push(value.userId);
    });
    return reducedObj;
};
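A note on what seems to be going wrong here: MongoDB may call reduce more than once for the same key, in batches of roughly 100 values, feeding the output of one reduce call back in as an input value to the next. That means reduce's return value must have the same shape as what map emits. The original map emits { count, userId }, but reduce returns { count, userIds }, so on a second pass value.userId is undefined. This can be simulated in plain JavaScript with the reduce function above:

```javascript
// Plain-JS simulation of MongoDB calling reduce twice for the same key.
var reduce = function (key, values) {
  var reducedObj = { count: 0, userIds: [] };
  values.forEach(function (value) {
    reducedObj.count += value.count;
    reducedObj.userIds.push(value.userId); // undefined for re-reduced values
  });
  return reducedObj;
};

// First pass over two freshly mapped values works fine:
var firstPass = reduce("foo bar", [
  { count: 1, userId: "id1" },
  { count: 1, userId: "id2" }
]);
// Second pass mixes the previous partial result with a new mapped value;
// firstPass has userIds but no userId, producing the corruption seen above:
var secondPass = reduce("foo bar", [firstPass, { count: 1, userId: "id3" }]);
console.log(secondPass.userIds); // [ undefined, 'id3' ]
```

This also explains the 100-document threshold: below it, reduce only ever runs once per key, so the shape mismatch never surfaces.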
Now, I have updated it as follows:
var map = function () {
    var value = {
        count: 1,
        users: [this]
    };
    emit({lastName: this.lastName, firstName: this.firstName}, value);
};

var reduce = function (key, values) {
    var reducedObj = {
        count: 0,
        users: []
    };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        reducedObj.users = reducedObj.users.concat(value.users); // or using the forEach method:
        // value.users.forEach(function (user) {
        //     reducedObj.users.push(user);
        // });
    });
    return reducedObj;
};
I am unsure why the code was failing, given that I was still pushing a value (userId) to reducedObj.userIds.
Could there be an issue with the value emitted in the map function?
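For what it's worth, once map and reduce produce the same shape, the reduce step becomes safe to re-run on its own output, which is what MongoDB requires of a reduce function. A quick plain-JavaScript check of that property, using the updated reduce (note value.users, not values.users):

```javascript
// Idempotence check: reducing in one pass must equal reducing in two passes,
// because MongoDB may feed a previous reduce result back into reduce.
var reduce = function (key, values) {
  var reducedObj = { count: 0, users: [] };
  values.forEach(function (value) {
    reducedObj.count += value.count;
    reducedObj.users = reducedObj.users.concat(value.users); // value, not values
  });
  return reducedObj;
};

// Three mapped values, shaped exactly as map emits them:
var a = { count: 1, users: [{ fullName: "foo bar" }] };
var b = { count: 1, users: [{ fullName: "foo bar" }] };
var c = { count: 1, users: [{ fullName: "foo bar" }] };

var allAtOnce = reduce("foo bar", [a, b, c]);
var twoPasses = reduce("foo bar", [reduce("foo bar", [a, b]), c]);
// Both yield count: 3 and a users array of length 3.
```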