architect-handbook

Software Architect Handbook

View on GitHub

Data partitioning

Overview

In many large-scale solutions, data is divided into partitions that can be managed and accessed separately. Partitioning can improve scalability, reduce contention, and optimize performance. It can also provide a mechanism for dividng data by usage pattern. For example, you can archive older data in cheapar data storage.

The partitioning strategy must be chosen carefully to maximize the benefits while minizing adverse effects.

Why partition data?

Designing partitions

There are three typical strategies for partitioning data:

These strategies can be combined. For example, you might divide data into shards and then use vertical partitioning to further subdivde the data in each shard.

Designing for scalability

Consider size and workload for each partition and balance them so that data is distributed to achieve maximum scalability. However, you must also partition the data so that it does not exceed the scaling limits of a single partition store.

Follow these steps when designing partitions for scalability:

  1. Analyze the application to understand the data access patterns, such as the size of the result set returned by each query, the frequency of access, the inherent latency, and the server-side compute processing requirements. In many cases, a few major entities will demand most of the processing resources.
  2. Use this analysis to determine the current and future scalability targets, such as data size and workload. Then distribute the data across the partitions to meet the scalability target (e.g., choose the right sharding key).
  3. Make sure each partition has enough resources to handle data size and throughput. If the requirements are likely to exceed these limits, you may need to refine your partitioning strategy or split data out further, possibly combining two or more strategies.
  4. Monitor the system to verify that data is distributed as expected and that partitions can handle the load. Actual usage does not always match what an analysis predicts. If so, it might be possible to rebalance some of the partitions.

Designing for query performance

Query performance can often be boosted by using smaller data sets and by running parallel queries. Each partition should contain a small proportion of the entire data set. However, partitioning is not an alternative for designing and configuring a database appropriately. You must still use indexes and other techniques to optimize query performance.

Follow these steps when designing partitions for query performance:

  1. Examine the application requirements and performance.
    • Use business requirements to determine the critical queries that must always perform quickly.
    • Monitor the system to identify any queries that perform slowly.
    • Find which queries are performed most frequently.
  2. Partition the data that is causing slow performance:
    • Limit the size of each partition so that the query response time is within target.
    • Design a sharding key so that the application can easily select the right partition.
    • Consider the location of a partition. Try to optimize based on geography.
  3. If an entity has throughput and query performance requirements, use functional partitioning based on that entity. If this is not enough, combine with horizontal partitioning.
  4. Consider running queries in parallel across partitions.

Designing for availability

Partitioning data can improve availability by ensuring that the entire dataset does not constitute a single point of failure and that individual subsets can be managed independently.

Consider the following factors that affect availability:

Application design considerations

Partitioning adds complexity to the design and development of your system. If you address partitioning as an afterthought, it will be more challenging so consider it as a fundamental part of system design even if the system initially contains a single partition.

Rebalancing partitions

As a system matures, you might have to adjust the partitioning scheme. For example, partitions might start getting a disproportionate volume of traffic and become hot, leading to excessive contention. Or, you might have underestimated the volume of data in some partitions.

Some data stores, such as Cosmos DB, can automatically rebalance partitions. In other cases, it is an administrative task that consists on two stages:

  1. Determine a new partitioning strategy.
    • Which partitions need to be split or merged?
    • What is the new partition key?
  2. Migrate data from the old partitioning scheme to the new one.

Offline migration

Offline migration is typically simpler because it reduces the chances of contention occurring.

  1. Mark the partition offline or mak as read-only.
  2. Split-merge and move the data to the new partitions.
  3. Verify the data.
  4. Bring the new partitions online.
  5. Remove the old partition.

Online migration

Online migration is more complex to perform but less disruptive. Depending on the granularity of the migration progress (e.g., item by item, shard by shard), the access code in the client applications might have to handle reading and writing data that’s held in two locations, the original partition and the new partition.