Data partitioning
Overview
In many large-scale solutions, data is divided into partitions that can be managed and accessed separately. Partitioning can improve scalability, reduce contention, and optimize performance. It can also provide a mechanism for dividing data by usage pattern. For example, you can archive older data in cheaper data storage.
The partitioning strategy must be chosen carefully to maximize the benefits while minimizing adverse effects.
Why partition data?
- Improve scalability: You might overcome the physical hardware limits of using a single database system.
- Improve performance: Data access operations take place over a smaller volume of data. Operations that affect more than one partition can run in parallel.
- Improve security: You can separate sensitive data into a different partition and apply different security controls.
- Improve availability: Avoids a single point of failure. If one instance fails, only the data in that partition is unavailable.
- Provide operational flexibility: You can fine-tune operations and define different strategies for management, monitoring, backup, and restore, based on the importance of the data in each partition.
Designing partitions
There are three typical strategies for partitioning data:
- Horizontal partitioning (sharding): Each partition (shard) is a separate data store, but all partitions have the same schema.
- Vertical partitioning: Each partition holds a subset of the fields for items in the data store. The fields are divided according to their pattern of use.
- Functional partitioning: Data is aggregated according to how it is used by each bounded context in the system. For example, invoice data is stored in one partition and product inventory in another.
These strategies can be combined. For example, you might divide data into shards and then use vertical partitioning to further subdivide the data in each shard.
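As a minimal illustration of horizontal partitioning, the following Python sketch routes each item to a shard by hashing its key. The in-memory dict shards and the `shard_for` helper are illustrative stand-ins for real data stores, not part of any particular product.

```python
import hashlib

NUM_SHARDS = 4
# Each shard is a separate store with the same "schema" (here, a dict).
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key):
    # Hash the sharding key so that items spread evenly across shards
    # and the same key always maps to the same shard.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("customer:42", {"name": "Ada"})
print(get("customer:42"))  # prints {'name': 'Ada'}
```

Because the routing logic lives in one function, changing the sharding key later only requires changing `shard_for` (plus migrating the data, as discussed later in this article).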
Designing for scalability
Consider size and workload for each partition and balance them so that data is distributed to achieve maximum scalability. However, you must also partition the data so that it does not exceed the scaling limits of a single partition store.
Follow these steps when designing partitions for scalability:
- Analyze the application to understand the data access patterns, such as the size of the result set returned by each query, the frequency of access, the inherent latency, and the server-side compute processing requirements. In many cases, a few major entities will demand most of the processing resources.
- Use this analysis to determine the current and future scalability targets, such as data size and workload. Then distribute the data across the partitions to meet the scalability target (e.g., choose the right sharding key).
- Make sure each partition has enough resources to handle data size and throughput. If the requirements are likely to exceed these limits, you may need to refine your partitioning strategy or split data out further, possibly combining two or more strategies.
- Monitor the system to verify that data is distributed as expected and that partitions can handle the load. Actual usage does not always match what an analysis predicts. If so, it might be possible to rebalance some of the partitions.
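The distribution check described in the last step can be prototyped before deployment. The sketch below is illustrative: `shard_for` is a hypothetical hash-based sharding key, and the loop simulates 10,000 keys and counts how many land on each shard.

```python
import hashlib
from collections import Counter

def shard_for(key, num_shards=4):
    # Hypothetical sharding key: hash of the item's identifier.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % num_shards

# Simulate 10,000 keys and count how many land on each shard.
# A good sharding key yields roughly 2,500 items per shard here.
counts = Counter(shard_for(f"order:{i}") for i in range(10_000))
print(sorted(counts.items()))
```

If the simulated counts are badly skewed, the sharding key is a poor fit for the data, which is cheaper to discover at design time than after the system is live.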
Designing for query performance
Query performance can often be boosted by using smaller data sets and by running parallel queries. Each partition should contain a small proportion of the entire data set. However, partitioning is not an alternative for designing and configuring a database appropriately. You must still use indexes and other techniques to optimize query performance.
Follow these steps when designing partitions for query performance:
- Examine the application requirements and performance:
  - Use business requirements to determine the critical queries that must always perform quickly.
  - Monitor the system to identify any queries that perform slowly.
  - Find which queries are performed most frequently.
- Partition the data that is causing slow performance:
  - Limit the size of each partition so that the query response time is within target.
  - Design a sharding key so that the application can easily select the right partition.
  - Consider the location of a partition. Try to optimize based on geography.
- If an entity has throughput and query performance requirements, use functional partitioning based on that entity. If this is not enough, combine with horizontal partitioning.
- Consider running queries in parallel across partitions.
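The last step can be sketched as follows: the same query is fanned out to every partition in parallel, and the partial results are aggregated in the application. The in-memory partitions and the `query_partition` function are illustrative stand-ins for queries against real data stores.

```python
from concurrent.futures import ThreadPoolExecutor

# Three partitions of an inventory data set (illustrative in-memory data).
partitions = [
    [{"sku": "A", "qty": 3}, {"sku": "B", "qty": 1}],
    [{"sku": "A", "qty": 2}],
    [{"sku": "C", "qty": 5}],
]

def query_partition(rows, sku):
    # Stand-in for a query against one partition's data store.
    return [row for row in rows if row["sku"] == sku]

def fan_out(sku):
    # Run the same query against every partition in parallel,
    # then aggregate the partial results in the application.
    with ThreadPoolExecutor() as pool:
        partial = pool.map(query_partition, partitions, [sku] * len(partitions))
    return [row for rows in partial for row in rows]

print(fan_out("A"))  # prints [{'sku': 'A', 'qty': 3}, {'sku': 'A', 'qty': 2}]
```

The elapsed time of the fan-out is roughly the latency of the slowest partition, rather than the sum of all partition queries run sequentially.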
Designing for availability
Partitioning data can improve availability by ensuring that the entire dataset does not constitute a single point of failure and that individual subsets can be managed independently.
Consider the following factors that affect availability:
- How critical the data is to business operations. Identify which data is critical, such as transactions, and which data is less critical operational data, such as log files.
  - Consider storing critical data in highly available partitions with an appropriate backup plan.
  - Establish separate management and monitoring procedures.
  - Place data with the same level of criticality in the same partition so that it can be backed up together at an appropriate frequency.
- How individual partitions can be managed. Supporting independent management and maintenance provides several advantages:
  - If a partition fails, it can be recovered independently without affecting applications that access data in other partitions.
  - Partitioning by geographical area allows scheduled maintenance tasks to occur at off-peak hours for each location.
- Whether to replicate critical data across partitions. This can improve availability and performance, but can also introduce consistency issues. It takes time to synchronize changes with every replica.
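The replication trade-off in the last point can be sketched with in-memory dicts standing in for replicas: a synchronous fan-out write keeps every replica consistent, at the cost of higher write latency (an asynchronous alternative would be faster but only eventually consistent).

```python
# Three replicas of a partition holding critical data (illustrative).
replicas = [dict(), dict(), dict()]

def write_critical(key, value):
    # Synchronous fan-out: the write is not complete until every
    # replica has it, so reads from any replica are consistent.
    for replica in replicas:
        replica[key] = value

def read_critical(key):
    # Any replica can serve the read, so the failure of one replica
    # does not make the data unavailable.
    return replicas[0].get(key)

write_critical("config:tax_rate", 0.2)
print(read_critical("config:tax_rate"))  # prints 0.2
```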
Application design considerations
Partitioning adds complexity to the design and development of your system. If you address partitioning as an afterthought, it will be more challenging, so consider partitioning as a fundamental part of system design, even if the system initially contains only a single partition.
- Minimize cross-partition data access operations: Querying across partitions can be more time-consuming, but optimizing partitions for one set of queries might adversely affect other sets of queries. If you must query across partitions, minimize query time by running parallel queries and aggregating the results within the application.
- Consider replicating static reference data: For data such as postal code tables or product lists, consider replicating the data in all of the partitions to reduce the need for separate lookup operations in different partitions.
- Minimize cross-partition joins: Minimize requirements for referential integrity across vertical and functional partitions. In these schemes, the application is responsible for maintaining referential integrity across partitions. Consider replicating or de-normalizing the relevant data if needed. As a last resort, run parallel queries over the partitions and join the data within the application.
- Embrace eventual consistency: The data in each partition is updated separately, and the application logic ensures that the updates are all completed successfully. It also handles the inconsistencies that can arise.
- Consider periodically rebalancing shards: It can help distribute the data evenly by size and by workload. However, this is a complex task that often requires a custom tool or process.
- Replicate partitions: Provides additional redundancy and availability.
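The application-side join mentioned above can be sketched as follows, assuming invoice and product data live in two functional partitions (in-memory dicts here, and a hypothetical `invoice_with_product_name` helper):

```python
# Two functional partitions: invoices in one store, products in another.
invoices = {"inv-1": {"product_id": "p-1", "total": 9.5}}
products = {"p-1": {"name": "Widget"}}

def invoice_with_product_name(invoice_id):
    # The application performs the "join": one lookup per partition,
    # because referential integrity is not enforced across partitions.
    invoice = invoices[invoice_id]
    product = products[invoice["product_id"]]
    return {**invoice, "product_name": product["name"]}

print(invoice_with_product_name("inv-1"))
# prints {'product_id': 'p-1', 'total': 9.5, 'product_name': 'Widget'}
```

If this lookup pattern is hot, de-normalizing the product name into the invoice record avoids the second round trip entirely.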
Rebalancing partitions
As a system matures, you might have to adjust the partitioning scheme. For example, partitions might start getting a disproportionate volume of traffic and become hot, leading to excessive contention. Or, you might have underestimated the volume of data in some partitions.
Some data stores, such as Cosmos DB, can automatically rebalance partitions. In other cases, rebalancing is an administrative task that consists of two stages:
- Determine a new partitioning strategy.
  - Which partitions need to be split or merged?
  - What is the new partition key?
- Migrate data from the old partitioning scheme to the new one.
Offline migration
Offline migration is typically simpler because it reduces the chances of contention occurring.
- Mark the partition offline, or mark it as read-only.
- Split, merge, and move the data to the new partitions.
- Verify the data.
- Bring the new partitions online.
- Remove the old partition.
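The offline steps above can be sketched end to end. The dict shards and the `new_shard_for` routing rule are illustrative assumptions, not a real migration tool.

```python
import hashlib

# The shard to split, and a flag marking it read-only (step 1).
old_shard = {"user:1": "a", "user:2": "b", "user:3": "c", "user:4": "d"}
old_shard_read_only = True

def new_shard_for(key, num_shards=2):
    # Hypothetical routing rule for the new partitioning scheme.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % num_shards

# Step 2: split and move the data into the new partitions.
new_shards = [dict() for _ in range(2)]
for key, value in old_shard.items():
    new_shards[new_shard_for(key)][key] = value

# Step 3: verify that every item landed in exactly one new partition.
assert sum(len(shard) for shard in new_shards) == len(old_shard)
assert all(new_shards[new_shard_for(k)][k] == v for k, v in old_shard.items())

# Steps 4-5: bring the new partitions online and remove the old one.
old_shard = None
```

Because the old partition stays read-only throughout, the verification in step 3 can compare against a stable source, which is the main reason offline migration is simpler.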
Online migration
Online migration is more complex to perform but less disruptive. Depending on the granularity of the migration process (for example, item by item or shard by shard), the data access code in the client applications might have to handle reading and writing data that's held in two locations: the original partition and the new partition.
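A common pattern for the dual-location reads described above is to check the new partition first and fall back to the original. A minimal sketch, assuming in-memory stores and that writes during the migration go to the new partition:

```python
# During an online migration, some items live in the new partition
# while the rest are still in the original one (illustrative data).
old_partition = {"k1": "v1", "k2": "v2"}
new_partition = {"k2": "v2-migrated"}  # k2 has already been moved

def read(key):
    # Prefer the new partition; fall back to the original location
    # for items that have not been migrated yet.
    if key in new_partition:
        return new_partition[key]
    return old_partition.get(key)

def write(key, value):
    # Writes go to the new partition so the old one can drain.
    new_partition[key] = value

print(read("k2"))  # prints v2-migrated
print(read("k1"))  # prints v1
```

Once the background copy has moved every remaining item, the fallback branch becomes dead code and the old partition can be retired.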