What are the Principles of Schema Design in MongoDB?

Schema Design in MongoDB

Although MongoDB is a schema-less document database (more precisely, one with a flexible schema), it still requires deliberate schema design for high-performance, scalable applications. Because a document’s structure does not have to be predefined, iterative data model development and experimentation are easier.

MongoDB schema design follows the application’s requirements: objects that will be used together are combined into one document, and objects that will not are kept separate. Understanding the usage patterns for queries, updates, and data processing drives this approach. The goal is to keep data that is used together in one document so that it can be retrieved with few queries.

Best Practices for Schema Design in MongoDB

  • Understand Your Data Access Patterns: Recognise how your application will query and change the data prior to creating your schema. Create your schema with the most frequent and important operations in mind.
  • Optimize for Read Performance: Because joins and aggregations can be performance-intensive, organise your documents to reduce their usage.
  • Embedding vs. Referencing: Determining when to embed documents and when to reference them is important. Although embedding can result in larger documents, it can also speed up reads. Referencing produces more normalised data, but more queries may be needed to retrieve related data.
  • Avoid Unbounded Arrays: Performance problems may arise if your documents contain arrays that can grow without bound.
  • Use Appropriate Data Types: To conserve space and enhance efficiency, select the data types that are best suited for your fields.
  • Indexing: Make sure your schema allows for effective indexing, especially for fields that are queried frequently.
  • Consider Sharding: If your dataset is big, think about sharding while designing your schema. Select a shard key that will produce an even distribution of data.
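
As a small illustration of the data-type point above, the following sketch uses plain Python lists rather than a live MongoDB deployment. The field name `price` and the sample values are hypothetical. It shows one common pitfall: numbers stored as strings are ordered character by character, not by value.

```python
# Hypothetical product documents; "price" is an illustrative field name.
prices_as_strings = [{"price": "9"}, {"price": "10"}, {"price": "100"}]
prices_as_numbers = [{"price": 9}, {"price": 10}, {"price": 100}]

# Strings compare character by character, so "10" sorts before "9".
sorted_strings = [d["price"] for d in sorted(prices_as_strings, key=lambda d: d["price"])]

# Numbers compare by value, which is what sorts and range queries expect.
sorted_numbers = [d["price"] for d in sorted(prices_as_numbers, key=lambda d: d["price"])]

print(sorted_strings)  # ['10', '100', '9']
print(sorted_numbers)  # [9, 10, 100]
```

Storing the value with a numeric type avoids this class of problem and usually takes less space than the equivalent string.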

Embedding vs Referencing in MongoDB

The central decision in MongoDB schema design is choosing between two methods for expressing data relationships: referencing (normalisation) and embedding (denormalisation).

Embedding (Denormalisation)

Storing related data inside a single document is known as embedding. The result is a denormalised data model in which a single document contains all of the relevant information for an object.
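
A minimal sketch of what an embedded document can look like, using a Python dict as a stand-in for a collection. The collection name `patrons`, the field names, and the sample values are all hypothetical, and `find_patron` only imitates a single lookup by `_id`.

```python
# Hypothetical "patrons" collection, modelled as a dict keyed by _id.
patrons = {
    "joe": {
        "_id": "joe",
        "name": "Joe Bookreader",
        # The address is embedded: it lives inside the patron document.
        "address": {
            "street": "123 Fake Street",
            "city": "Faketon",
            "state": "MA",
        },
    }
}


def find_patron(patron_id):
    """Stand-in for a single-document lookup by _id."""
    return patrons.get(patron_id)


# One lookup returns the patron together with the embedded address.
patron = find_patron("joe")
print(patron["name"], "lives in", patron["address"]["city"])
```

Because the address travels with the patron, no second lookup is needed to display both.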

When to Embed

  1. “Contains” Relationships: Embed where there is an inherent “belongs to” relationship, such as a customer’s address.
  2. One-to-One Relationships: When two entities are often considered together and one is conceptually contained within the other. For example, all of a patron’s information can be retrieved with a single query when the address is embedded directly in the patron document.
  3. One-to-Few Relationships: When the “many” side of a one-to-many relationship is small and unlikely to increase substantially, for example when an author has only a “few” postings.
  4. Frequently Accessed Together: When data is almost always retrieved alongside its parent document.
  5. Atomic Updates: Because a single write operation can insert or update all of the data for an entity within a single document, embedding makes it easier to perform atomic write operations on related fields.
  6. Performance for Reads: Because all relevant data is retrieved in a single database operation, embedding typically results in greater read performance by lowering I/O activity and network overhead.
  7. Simpler Object Mapping: Complex object mappers are not always necessary because the document model frequently maps to objects in programming languages more naturally.
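
The atomic-update point can be sketched as follows. The update document below uses the `$set` and `$push` operators in the shape MongoDB accepts, but `apply_update` is a hypothetical in-memory stand-in that understands only those two operators; it is not MongoDB's implementation, and the `author` document is invented for illustration.

```python
import copy


def apply_update(doc, update):
    """Return a new document with a small subset of update operators applied.

    Hypothetical stand-in: handles only $set (with dotted paths) and $push.
    """
    result = copy.deepcopy(doc)
    for path, value in update.get("$set", {}).items():
        *parents, leaf = path.split(".")
        target = result
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    for field, value in update.get("$push", {}).items():
        result.setdefault(field, []).append(value)
    return result


author = {
    "_id": 1,
    "name": "A. Writer",
    "address": {"city": "Oldtown"},
    "posts": [{"title": "First post"}],
}

# One update document changes an embedded field and an embedded array together.
update = {
    "$set": {"address.city": "Newtown"},
    "$push": {"posts": {"title": "Second post"}},
}

updated = apply_update(author, update)
print(updated["address"]["city"], len(updated["posts"]))  # Newtown 2
```

Since both changes target the same document, a real deployment would apply them in a single write, which is the property the embedded model relies on.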

Trade-offs/Drawbacks of Embedding

  • Document Size Limit: The maximum BSON document size is 16 MB, and a document cannot exceed this limit. Large documents can also hurt performance well before they reach it. GridFS should be used for very large binary data.
  • Document Growth and Fragmentation: If a document grows over time (for example, by pushing elements to an array), MongoDB may need to relocate it on disk, which is inefficient and may cause fragmentation. This concern applies to the MMAPv1 storage engine, which was removed in MongoDB 4.2.
  • Data Duplication: When data is embedded, it may be repeated, necessitating the updating of several documents if the information changes. Updates may become more complicated as a result.
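
To get a feel for how a growing embedded array eats into the size limit, the sketch below estimates a document's size. It is only a rough proxy: it measures the UTF-8 JSON encoding, not the real BSON size, and the `post` document and its `comments` array are hypothetical.

```python
import json

MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024  # the 16 MB document limit


def approx_size_bytes(doc):
    """Rough size proxy: length of the UTF-8 JSON encoding (not real BSON size)."""
    return len(json.dumps(doc).encode("utf-8"))


def headroom_fraction(doc):
    """Fraction of the limit still unused, according to the rough proxy."""
    return 1 - approx_size_bytes(doc) / MAX_BSON_DOCUMENT_BYTES


post = {"_id": 1, "title": "Hello", "comments": []}
size_before = approx_size_bytes(post)

# Simulate an embedded array that keeps growing.
for i in range(10_000):
    post["comments"].append({"user": f"user{i}", "text": "x" * 200})

size_after = approx_size_bytes(post)
print(size_before, size_after, round(headroom_fraction(post), 3))
```

A check like this can flag documents that are trending toward the limit, which is usually a sign that the array belongs in its own collection.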

Referencing (Normalisation)

Referencing stores relationships between documents by including links, usually the _id value of the referenced document. Applications then resolve these references with further queries.
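
A minimal sketch of manual references, with Python dicts standing in for two collections. The collection names, the `publisher_id` field, and the sample data are hypothetical. Each dict access marked as a query would be a separate round trip in a real deployment.

```python
# Hypothetical collections modelled as dicts keyed by _id.
publishers = {
    "ex-press": {"_id": "ex-press", "name": "Example Press"},
}
books = {
    1: {"_id": 1, "title": "Book One", "publisher_id": "ex-press"},
    2: {"_id": 2, "title": "Book Two", "publisher_id": "ex-press"},
}


def find_book_with_publisher(book_id):
    """Resolve a reference with two lookups: the book, then its publisher."""
    book = books[book_id]                         # first query
    publisher = publishers[book["publisher_id"]]  # second query
    return {**book, "publisher": publisher}


def find_books_by_publisher(publisher_id):
    """Query the referencing side independently of the parent document."""
    return [b for b in books.values() if b["publisher_id"] == publisher_id]


print(find_book_with_publisher(1)["publisher"]["name"])
print([b["title"] for b in find_books_by_publisher("ex-press")])
```

The publisher's details are stored once, so changing them touches one document, at the cost of the extra lookup when a book and its publisher are needed together.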

When to Reference

  1. One-to-Many Relationships (Large/Unbounded): These occur when the “many” side can expand considerably or can be accessed separately from the “one” parent. Consider the books stored for a publisher: if the publisher can have an unbounded number of books, embedding them would produce a mutable, ever-growing array, which is inefficient.
  2. Many-to-Many Relationships: Array fields that hold _id references to other documents are used to represent many-to-many relationships. Products and categories, for instance, where a product may belong to more than one category and a category may contain more than one product.
  3. Data Not Frequently Accessed Together: A field may belong in another collection if it is nearly always left out of your results when you search for a document.
  4. Volatile Data: To prevent updating numerous copies, it is preferable to reference data that changes often.
  5. Social Graph Data: Referencing is frequently more adaptable for social applications that connect individuals, followers, content, etc. For example, a separate collection that maps publishers to subscribers keeps big, often-changing “followers” lists out of the user documents, so those documents stay lightweight.
  6. DBRefs: When a document references documents from many collections and the target collection may differ, MongoDB’s DBRefs make it easier to reference documents across collections. Nevertheless, unless there is a strong case for DBRefs, manual references just keeping the _id are easier and typically adequate.
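
The many-to-many case above can be sketched with arrays of `_id` references. Plain Python lists stand in for the two collections, and the names `products`, `categories`, and `category_ids` are hypothetical.

```python
# Hypothetical collections; each product holds an array of category _id values.
products = [
    {"_id": "p1", "name": "Laptop", "category_ids": ["c1", "c2"]},
    {"_id": "p2", "name": "Phone", "category_ids": ["c1"]},
]
categories = [
    {"_id": "c1", "name": "Electronics"},
    {"_id": "c2", "name": "Computers"},
]


def categories_of(product_id):
    """Follow the references from one product to its categories."""
    product = next(p for p in products if p["_id"] == product_id)
    wanted = set(product["category_ids"])
    return [c["name"] for c in categories if c["_id"] in wanted]


def products_in(category_id):
    """Find every product whose reference array contains the category."""
    return [p["name"] for p in products if category_id in p["category_ids"]]


print(categories_of("p1"))  # ['Electronics', 'Computers']
print(products_in("c1"))    # ['Laptop', 'Phone']
```

Keeping the reference array on the side that has fewer links per document (here, a product's handful of categories) helps avoid the unbounded-array problem.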

Trade-offs/Drawbacks of Referencing

  • No Native Joins: Unlike relational databases (RDBMS), MongoDB has historically not supported joins, so several queries were needed to obtain documents from different collections. MongoDB v3.2 added the $lookup aggregation stage for left outer joins, and later versions permit more intricate joins, but limiting reliance on joins remains an architectural choice that prioritises scalability.
  • Performance for Reads: Reads may be slower than single-document reads because resolving references requires several round trips to the server.
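
For reference, the basic equality form of a `$lookup` stage takes the four fields shown below. The collection and field names are hypothetical, and `left_outer_join` is only a simplified in-memory emulation of the stage's semantics, written to show what a left outer join produces.

```python
# Shape of a $lookup stage in an aggregation pipeline.
# "publishers", "publisher_id", and "publisher" are hypothetical names.
pipeline = [
    {
        "$lookup": {
            "from": "publishers",          # collection to join with
            "localField": "publisher_id",  # field in the input documents
            "foreignField": "_id",         # field in the "from" collection
            "as": "publisher",             # output array field
        }
    }
]


def left_outer_join(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Simplified emulation: every input document is kept, matches go in an array."""
    joined = []
    for doc in local_docs:
        matches = [f for f in foreign_docs if f.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})
    return joined


books = [
    {"_id": 1, "title": "Book One", "publisher_id": "ex-press"},
    {"_id": 2, "title": "Book Two", "publisher_id": "unknown"},
]
publishers = [{"_id": "ex-press", "name": "Example Press"}]

stage = pipeline[0]["$lookup"]
result = left_outer_join(
    books, publishers, stage["localField"], stage["foreignField"], stage["as"]
)
for doc in result:
    print(doc["title"], doc["publisher"])
```

The second book has no matching publisher, so it is still returned, with an empty array in the output field. That is the left outer join behaviour.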

In conclusion, the choice between embedding and referencing depends on balancing considerations such as read/write performance, data access patterns, document size restrictions, and data volatility. Embedding is frequently used to achieve quick reads, while referencing is better suited to managing big or unbounded relationships and facilitating rapid writes. As application requirements change, developers should be aware of these trade-offs and be ready to experiment and refine their schema designs.