Imagine we need to design a key-value store. There are two orthogonal techniques we can apply to achieve scalability.
Replication: copy the same piece of data to multiple places, so that requests can be served from multiple places instead of one, increasing the capacity to handle concurrent requests. Moreover, if one copy is lost, we still have the other copies, so data won't be lost in the case of individual machine failures.
There are multiple replication strategies:
Primary-backup: all writes go through a leader (primary); read requests can be served by the leader or any backup. Write requests are effectively state-machine changes that need to be replicated to the backups, either synchronously or asynchronously.
In the case of synchronous replication, the system does not scale to write-heavy workloads, because every write has to go through the same machine. In addition, adding more backups makes replication slower, because the leader now has to wait for more machines (the straggler problem), and if any backup machine fails, replication is blocked.
In the case of asynchronous replication, write throughput at the leader can be significantly improved. But backups can fall behind the leader, so clients can observe different values if they happen to query machines that are not yet in sync.
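The trade-off above can be made concrete with a minimal in-memory sketch (hypothetical `Primary`/`Backup` classes, not any particular system): with `sync=True` the leader blocks on every backup before acknowledging a write; with `sync=False` it acknowledges immediately and backups lag until replication catches up.

```python
# Minimal sketch of primary-backup replication (illustrative, in-memory only).
class Backup:
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value


class Primary:
    def __init__(self, backups, sync=True):
        self.store = {}
        self.backups = backups
        self.sync = sync
        self.pending = []  # writes not yet replicated (async mode)

    def write(self, key, value):
        self.store[key] = value
        if self.sync:
            # Synchronous: block on every backup (slowest one gates the write).
            for b in self.backups:
                b.apply(key, value)
        else:
            # Asynchronous: acknowledge now, replicate later.
            self.pending.append((key, value))

    def flush(self):
        """Drain pending writes to the backups (async mode)."""
        for key, value in self.pending:
            for b in self.backups:
                b.apply(key, value)
        self.pending.clear()
```

In async mode, a client reading from a backup before `flush()` runs sees a stale (missing) value, which is exactly the inconsistency described above.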
Multi-Leader: multiple leaders can serve write requests; read requests are served by backups. In case of a partition between the leaders, clients that have a connection to one of the leaders can still make progress. However, since this allows concurrent updates, conflicts can occur frequently. There is no option but to resolve the conflicts, either automatically in the system or by the users when read requests come in.
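One common automatic resolution strategy is last-write-wins (LWW): each leader tags a write with a (timestamp, leader_id) pair, and on merge the highest tag wins. A sketch under that assumption (the data layout here is illustrative, not from any particular system):

```python
# Minimal sketch of last-write-wins conflict resolution between two
# multi-leader replicas. Each stored value is a (value, (timestamp,
# leader_id)) tuple; the higher tag wins on merge.
def merge(replica_a, replica_b):
    merged = {}
    for key in replica_a.keys() | replica_b.keys():
        va = replica_a.get(key)
        vb = replica_b.get(key)
        if va is None:
            merged[key] = vb
        elif vb is None:
            merged[key] = va
        else:
            # Both leaders wrote this key: keep the later write.
            merged[key] = va if va[1] >= vb[1] else vb
    return merged
```

Note that LWW silently discards one of the concurrent writes; systems that cannot afford that keep both versions and push the conflict to the user at read time, as mentioned above.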
Quorum-based: there is no leader in the system; every copy is equal. Clients write to and read from multiple copies. If the read and write quorums overlap, clients can read the latest value, in theory. But in practice, when a machine fails, a new machine may step in without the up-to-date values of the failed one. What that effectively means is that the read and write quorums do not necessarily overlap in such cases.
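The overlap condition is simple arithmetic: with n replicas, write quorum w, and read quorum r, a read is guaranteed to see the latest write whenever r + w > n, because the two quorums must share at least one replica. A toy sketch (illustrative only, replicas are plain dicts and values carry a version number):

```python
# Minimal sketch of quorum reads/writes. Values are (value, version) tuples;
# the reader picks the highest version among the r replicas it contacts.
def quorum_write(replicas, key, value, version, w):
    for rep in replicas[:w]:  # any w replicas would do; first w for simplicity
        rep[key] = (value, version)

def quorum_read(replicas, key, r):
    # Query r replicas (here: the last r, so overlap holds iff r + w > n)
    # and return the value with the highest version seen.
    seen = [rep[key] for rep in replicas[-r:] if key in rep]
    return max(seen, key=lambda v: v[1])[0] if seen else None
```

With n = 3, w = 2, r = 2 the quorums always share a replica, so the read returns the latest write; shrink either quorum and stale reads become possible.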
Partition:
The goal of partitioning is to divide a large dataset into small, independent chunks that can be stored and served on different machines.
There are two properties we'd like to achieve: 1) balance (i.e., avoiding partition skew): each node takes a fair share, so that there is no hotspot; 2) efficient serving of user queries (get a key-value pair, or scan a range of key-value pairs).
Partition Strategies:
1) Key-range partition: each machine is assigned a contiguous range of keys. Range scans are supported efficiently, but some access patterns can lead to hotspots (e.g., write requests for contiguous keys are all routed to the same partition).
2) Hash-range partition: keys are hashed with a good hash function that distributes them uniformly over the hash-value space, and each partition is assigned a contiguous range of hash values. Range scans are no longer efficient, but hotspots are much less likely.
However, both are prone to hotspots if all reads and writes target the same key.
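Hash-range partitioning can be sketched in a few lines (illustrative; `md5` stands in for "a good hash function" and the 32-bit hash space is an arbitrary choice):

```python
# Minimal sketch of hash-range partitioning: hash each key into a fixed
# hash space, then assign each partition a contiguous slice of that space.
import hashlib

HASH_SPACE = 2 ** 32

def key_hash(key):
    # First 4 bytes of md5, interpreted as a 32-bit integer.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def partition_for(key, num_partitions):
    # Each partition owns a contiguous range of hash values of equal width.
    width = HASH_SPACE // num_partitions
    return min(key_hash(key) // width, num_partitions - 1)
```

Because the hash scrambles key order, adjacent keys land on different partitions, which is exactly why range scans become scatter-gather under this scheme.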
Index Partition Strategies:
For indexing, key-value pairs define a one-to-many relationship: for a given value, there can be multiple keys paired with it. Thus, in general, index entries are of the form v1: {k1, k3, kn}.
To partition indexes, we can do:
1) Horizontal partition: each index entry is split across partitions, and each piece is colocated with its key-value partition. To read a full index entry, you need to query all key-value partitions and combine the results, a pattern called scatter-gather.
2) Vertical partition: each index entry is located on a single partition, and the index key space is partitioned just like the key-value pairs. This approach supports efficient reads, but a single write may need to update indexes located on multiple partitions. Updating multiple indexes in a consistent fashion requires a distributed transaction, which is very expensive, so index updates are asynchronous in practice.
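The scatter-gather read over a horizontally partitioned index can be sketched as follows (illustrative: each partition's local index is a plain dict from value to the set of keys stored on that partition):

```python
# Minimal sketch of a scatter-gather read over a horizontally partitioned
# (local) index: each partition indexes only its own keys, so one logical
# index entry must be assembled from all partitions.
def scatter_gather(partitions, value):
    keys = set()
    for part in partitions:            # scatter: one query per partition
        keys |= part.get(value, set())
    return keys                        # gather: union of the partial results
```

Note the read cost grows with the number of partitions even when the answer lives on just one of them; that is the price paid for making index writes local.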
Rebalance Partition:
Data needs to be moved from one node to another when nodes are added (to handle more data or requests) or removed (e.g., after a node failure). There are a few requirements:
1) Reads and writes continue to be processed while rebalance is in flight
2) Load is evenly distributed after rebalance
3) Data movement is minimized.
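Requirement 3 is why naive `hash(key) % num_nodes` assignment is a poor fit: adding one node remaps almost every key. One common alternative is consistent hashing, sketched below (illustrative: one point per node on the ring; real systems typically use many virtual nodes per machine for better balance).

```python
# Minimal sketch of consistent hashing: nodes and keys hash onto the same
# ring; a key belongs to the first node clockwise from its position. When a
# node joins, only the keys between it and its predecessor move.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((self._h(n), n) for n in nodes)

    @staticmethod
    def _h(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def add(self, node):
        bisect.insort(self.ring, (self._h(node), node))

    def node_for(self, key):
        # First node at or after the key's ring position, wrapping around.
        i = bisect.bisect(self.ring, (self._h(key),))
        return self.ring[i % len(self.ring)][1]
```

The key property: after adding a node, every key either stays where it was or moves to the new node, so the data moved is roughly 1/n of the total rather than nearly all of it.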
Rebalance Strategies: