
MapReduce vs. Aggregation Framework of MongoDB?

Aggregation Framework of MongoDB

The Aggregation Framework in MongoDB is a powerful tool for processing, analysing, and transforming data. It follows a pipeline principle: documents from a collection pass through a series of stages, with the output of one stage serving as the input to the next. Because it runs as compiled C++ code and does not incur the BSON-to-JavaScript conversion overhead that MapReduce does, the framework is typically more efficient than MapReduce for the majority of aggregation jobs.

$project and $unwind are two essential and commonly utilised stages in this pipeline.

The $project Stage: Reshaping and Selecting Documents

The $project stage reshapes every document in the data stream by adding, removing, or changing fields. It enables you to choose which fields to return, restricting the quantity of data that is sent through the pipeline. This can greatly enhance performance by lowering processing demands and network overhead.

Below is a summary of its features:

  • Field Selection:
    • By setting their values to 1 or true, you can indicate which fields to include.
    • By setting a field’s value to 0 or false, you can exclude it.
    • The _id field is included by default unless specifically excluded by specifying _id: 0.
  • Field Renaming: To rename a field, set a new field’s value to the old field’s path (prefixed with $).
  • Creating New Fields with Expressions: Complex transformations utilising a variety of expressions are possible with $project, including:
    • String functions: $concat, $substr, $toLower, $toUpper.
    • Arithmetic functions: $add, $subtract, $multiply.
    • Date functions: $year, $month, $dayOfMonth, and other date functions.
    • Logical functions: For if-then-else and boolean logic, $and, $or, $eq, $gt, $cond.
    • Set operators: $setEquals, $setIntersection, $setDifference, $setUnion, $setIsSubset, $anyElementTrue, $allElementsTrue.
    • Miscellaneous functions: For temporary variables and arrays, use $let and $map.
  • Promoting Nested Fields: It is possible to elevate nested fields to the document’s top level.
  • Accumulators in $project (MongoDB 3.2+): $project stages can use a subset of accumulator operators, such as $sum, $avg, $max, and $min ($push and $addToSet remain available only in $group). In $group these accumulators aggregate values across several documents; in $project they operate on values within a single document, such as the elements of an array.
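
To illustrate several of these features together, here is a minimal sketch of a $project stage; the employees collection and its first_name, last_name, salary, and hire_date fields are assumptions for the example:

db.employees.aggregate([
  { $project: {
    _id: 0,                                                    // exclude the default _id field
    fullName: { $concat: ["$first_name", " ", "$last_name"] }, // new field built with a string expression
    annualSalary: { $multiply: ["$salary", 12] },              // arithmetic expression (assumes monthly salary)
    hireYear: { $year: "$hire_date" }                          // date expression (assumes hire_date is a Date)
  }}
])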

The $unwind Stage: Deconstructing Arrays

The $unwind stage deconstructs an array field from the input documents, generating a new document for every element of the array. Each new document is a replica of the original input document, except that the array field is replaced by a single value: one of the original array’s elements.

  • Purpose and Impact:
    • Data flattening: It efficiently “flattens” array data, allowing individual array items to be accessed for additional processing, including grouping or matching. This is particularly helpful when conducting pseudo-joins or in-depth analysis on the contents of the array.
    • Document Explosion: If the arrays are big or have many elements, using $unwind can greatly increase the number of documents in the pipeline. This can be memory-intensive and may result in “out of memory” problems, so allowDiskUse: true is frequently advised for such tasks.
    • Missing Arrays: By default, a document is removed from the pipeline if it lacks the designated array field. Starting with MongoDB 3.2, the preserveNullAndEmptyArrays option of $unwind can be used to keep such documents even if the array is empty or missing.
For illustration, consider a hypothetical collection inventory with a sizes array field, holding records such as these:
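
{ "_id" : 1, "item" : "ABC", "sizes" : [ "S", "M", "L" ] }
{ "_id" : 2, "item" : "EFG", "sizes" : [ "M", "L" ] }
{ "_id" : 3, "item" : "IJK" }

Unwinding the sizes array yields one output document per array element; the document with _id: 3 is dropped because it lacks the field:

db.inventory.aggregate([
  { $unwind: "$sizes" } // deconstruct the sizes array
])
// { "_id" : 1, "item" : "ABC", "sizes" : "S" }
// { "_id" : 1, "item" : "ABC", "sizes" : "M" }
// { "_id" : 1, "item" : "ABC", "sizes" : "L" }
// { "_id" : 2, "item" : "EFG", "sizes" : "M" }
// { "_id" : 2, "item" : "EFG", "sizes" : "L" }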

If using MongoDB 3.2+, the document with _id: 3 can additionally be included by passing the preserveNullAndEmptyArrays option:
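
db.inventory.aggregate([
  { $unwind: { path: "$sizes", preserveNullAndEmptyArrays: true } }
])
// ...the same five documents as above, plus:
// { "_id" : 3, "item" : "IJK" }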

Combined Usage and Performance Considerations

In an aggregation pipeline, these stages are frequently combined to carry out intricate data analysis:

  1. Filtering Early ($match): Placing $match stages as early in the pipeline as feasible is usually regarded as best practice. This helps to:
    • Leverage indexes: By utilising indexes, $match (and $sort) can expedite the initial filtering step.
    • Reduce data volume: You can minimise the quantity and size of documents that need to be processed by later stages by eliminating unnecessary documents early on, particularly before computationally demanding steps like $unwind or $group.
  2. Deconstructing Arrays ($unwind): $unwind can be used to separate arrays into individual documents for further examination following an initial $match.
  3. Reshaping Output ($project): Lastly, to keep the result set concise, $project can be used to select, rename, and transform the fields of the unwound documents into the format intended for the final output. Superfluous fields can be removed.

Example of a full pipeline combining these stages: Suppose each company document in your collection contains a funding_rounds array, where each element is a subdocument describing a funding round, including an investments field (itself an array of investment details). The goal is to find the companies and funding rounds in which “Greylock” participated, and then project the relevant details.

db.companies.aggregate([
  // Stage 1: Initial match to filter companies where Greylock participated in at least one round
  { $match: {"funding_rounds.investments.financial_org.permalink": "greylock"} },
  // Stage 2: Unwind the funding_rounds array to create a separate document for each funding round 
  { $unwind: "$funding_rounds" },
  // Stage 3: Second match to filter for only the funding rounds that Greylock actually participated in 
  { $match: {"funding_rounds.investments.financial_org.permalink": "greylock"} },
  // Stage 4: Project and reshape the output to show relevant details 
  { $project: {
    _id: 0, // Exclude the default _id field 
    companyName: "$name", // Rename 'name' to 'companyName'
    funderPermalink: "$funding_rounds.investments.financial_org.permalink", // Project funder permalink
    amountRaised: "$funding_rounds.raised_amount", // Project amount raised
    fundedYear: "$funding_rounds.funded_year" // Project funded year
  }}
]).pretty()

After reducing the input with $match, this pipeline expands each matching company’s funding_rounds array with $unwind for granular filtering, then reshapes the output with $project to display only the relevant information. This methodical approach enables robust and adaptable data manipulation and analysis within MongoDB.

This demonstrates the straightforward, declarative nature of the Aggregation Framework.

MapReduce

MapReduce is a multi-phase data aggregation methodology used to condense massive amounts of data. It consists of two main stages: a map function that processes each input document and emits one or more key-value pairs, and a reduce function that aggregates the values of keys with multiple entries into a single output. The results can optionally be post-processed with a finalize function.

MapReduce’s custom JavaScript functions (map, reduce, finalize) are its crucial advantage. They give you the flexibility to construct complicated, arbitrary logic that the Aggregation Framework’s native operators cannot express.

This flexibility comes at a performance cost. MapReduce is slower and unsuitable for real-time data analysis, since it relies on JavaScript (which was effectively single-threaded before the V8 engine upgrade in MongoDB 2.4) and requires converting BSON documents to JavaScript objects. Inline MapReduce output is also restricted to 16 MB.

An illustration of a MapReduce operation to determine the number of active posts per user is provided here:

db.posts.mapReduce(
  function() { emit(this.user_id, 1); }, // map function
  function(key, values) { return Array.sum(values); }, // reduce function
  {
    query: {status:"active"}, // optional filter
    out:"post_total" // output collection
  }
);

The post_total collection can be queried to view the outcome.
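
Each output document stores the emitted key as _id and the reduced total as value, so the per-user counts can be inspected with an ordinary query:

db.post_total.find().sort({ value: -1 }) // highest post counts first
// { "_id" : <user_id>, "value" : <number of active posts> }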

When to Use MapReduce

MapReduce remains necessary in situations where the Aggregation Framework’s capabilities are insufficient, even though the Aggregation Framework is typically used for the majority of aggregation operations because of its performance and simpler syntax. MapReduce should be considered when:

  • Complex Aggregations Not Yet Supported by the Aggregation Pipeline: MapReduce’s JavaScript functions offer the required extensibility if your aggregation logic is too complex or includes transformations that cannot be described using the operators and expressions of the current aggregation pipeline. This includes situations that call for highly customised data processing or arbitrary programming logic.
  • Custom Functions and Arbitrary JavaScript Processing: MapReduce lets you specify your own JavaScript functions for the map, reduce, and finalise stages. When the aggregate calls for reasoning that is more complex than what the built-in C++ operators can handle, this is quite helpful.
  • Incremental Aggregation on Continuously Growing Datasets: New results can be merged or re-reduced into an existing output collection using MapReduce’s output options, such as merge and reduce. This makes it easier to build reports that are updated regularly with fresh data, without reprocessing the full dataset each time, as the sketch below illustrates.
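
A minimal sketch of this incremental pattern (the created_at field and the lastRun variable are assumptions for the example, not part of the original setup):

var lastRun = ISODate("2024-01-01T00:00:00Z"); // assumed timestamp of the previous run
db.posts.mapReduce(
  function() { emit(this.user_id, 1); },               // map: one pair per newly added active post
  function(key, values) { return Array.sum(values); }, // reduce: sum the counts per user
  {
    query: { status: "active", created_at: { $gt: lastRun } }, // process only documents added since the last run
    out: { reduce: "post_total" } // re-reduce the fresh counts into the existing totals
  }
);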

In conclusion, the Aggregation Framework is the best option for the majority of tasks due to its simplicity and speed. MapReduce is reserved for complex, highly tailored aggregation tasks whose logic requires the full functionality of JavaScript.
