The main idea behind high availability is to identify single points of failure in the system and eliminate them. By a single point of failure I mean any component that can cause a service interruption if it becomes unavailable.
We eliminate single points of failure through redundancy, i.e. we scale the system horizontally by adding more nodes. Each added node increases the overall capacity of the system. If one node fails, the load is distributed among the remaining ones, which often means the pressure on the remaining nodes will increase.
Conversely, if one node is removed from the system, the load is distributed across the remaining nodes. The important point is that the system should not fail if one of the nodes is removed. This attribute is typically referred to as redundancy.
We organize nodes into clusters to simplify system health monitoring. In case of heavy load due to a spike in traffic we can add more nodes to the cluster. The newly added nodes take over some of the work and relieve the pressure on the existing nodes.
Let's consider a simple scenario: an application consisting of a web server connected to a database. Without redundancy the entire system will be down if the server crashes. To mitigate the risk we can add several more servers and put a load balancer in front of them. The load balancer distributes the incoming requests between the servers.
Failure of a single server wouldn't cause the entire system to go down, although it will increase the pressure on the remaining live servers.
With this setup, though, the load balancer becomes a single point of failure. To mitigate this risk we can add another load balancer. A virtual IP address helps here: the domain always resolves to the same IP, but the IP is reassigned between load balancers depending on which one is available. When the first load balancer goes down, the second one picks up the load and distributes the requests.
And once again, at this point you can say that DNS itself becomes a single point of failure, so we need to prepare for that.
Overall the pattern should be pretty clear: by solving one problem we introduce another one in a different component. This is a common situation when talking about system design.
The load balancer sits in front of the nodes and serves as an entry point into the cluster. Clients access the load balancer, which forwards requests to one of the servers. The server then processes the request and responds directly to the client.
Load balancer helps with solving two problems:
It can route requests depending on the defined rules and strategies, such as:
The load balancer constantly monitors the pool of servers by sending health check requests. Health checks regularly attempt to connect to servers using the protocol and port defined by the forwarding rules to ensure that the servers are listening.
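To make the routing and health-check behaviour concrete, here is a minimal Python sketch of a round-robin balancer that skips backends failing a TCP health check. The backend addresses and timeout are made up for illustration; real load balancers implement this far more robustly.

```python
import itertools
import socket


class LoadBalancer:
    """Minimal round-robin balancer with TCP health checks (illustrative only)."""

    def __init__(self, backends):
        self.backends = backends                 # list of (host, port) tuples
        self._cycle = itertools.cycle(backends)  # round-robin iterator

    def is_healthy(self, backend, timeout=1.0):
        """Health check: try to open a TCP connection on the forwarding port."""
        host, port = backend
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def next_backend(self):
        """Return the next healthy backend, skipping nodes that fail the check."""
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if self.is_healthy(backend):
                return backend
        raise RuntimeError("no healthy backends available")


# Usage (hypothetical addresses):
# lb = LoadBalancer([("10.0.0.1", 8080), ("10.0.0.2", 8080)])
# host, port = lb.next_backend()
```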
As discussed before, a single load balancer can become a single point of failure. To mitigate this risk another load balancer can be added. In such a setup they monitor each other's health. If one of them fails, DNS must direct users to the second one.
Replicas are just copies of the data distributed across multiple nodes. Replication allows you to offload read traffic from a single node.
Replication can improve the following characteristics of the system:
Those improvements don’t come for free. By introducing replication you’re making a tradeoff between consistency on one hand and latency and availability on the other.
Broadly speaking, replication can be achieved synchronously or asynchronously. Synchronous replication offers stronger consistency, while asynchronous replication provides higher availability and performance.
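As an illustration of that tradeoff, below is a toy in-memory leader-follower store in Python. The class and method names are invented for the example: in synchronous mode a write returns only after every follower has applied it, while in asynchronous mode the leader acknowledges immediately and followers may briefly lag.

```python
class Node:
    """A single replica holding an in-memory copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Leader(Node):
    """Toy leader that replicates every write to its followers."""

    def __init__(self, followers, synchronous=True):
        super().__init__("leader")
        self.followers = followers
        self.synchronous = synchronous

    def write(self, key, value):
        self.apply(key, value)
        if self.synchronous:
            # Synchronous: the write "commits" only after every follower applied it.
            # Stronger consistency, but latency is bound by the slowest follower.
            for follower in self.followers:
                follower.apply(key, value)
            return "committed on all replicas"
        # Asynchronous: acknowledge immediately, replicate in the background.
        # Lower latency and higher availability, but followers may briefly lag.
        return "acknowledged, replication pending"


followers = [Node("follower-1"), Node("follower-2")]
leader = Leader(followers, synchronous=True)
print(leader.write("user:42", {"name": "Alice"}))
```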
We say that data is consistent if it is in the same state across every node in the system. Consistency implies the accuracy, completeness, and correctness of data.
Consistency allows us to avoid two well known problems:
I’ll save discussion of those problems for another post as this one is getting too big already.
Data replication can be done by using different techniques:
In a leader-follower replication scheme there is one node dedicated to accepting all writes, called the leader. The other nodes, known as followers, replicate the data from the leader. Even if the leader node fails, the data is still available for reads through the follower replicas.
In write-heavy applications a multi-leader scheme can be used, where multiple nodes can accept writes. It is considerably more complicated, as it requires a mechanism for resolving conflicts efficiently.
Possible conflict-resolution strategies:
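As one concrete example of such a strategy (chosen here for illustration, not necessarily what a given database uses), below is a sketch of last-write-wins resolution: each write carries a timestamp and the version with the latest timestamp survives, silently discarding the other.

```python
from dataclasses import dataclass


@dataclass
class VersionedValue:
    value: str
    timestamp: float  # wall-clock or logical time attached by the writer


def resolve_last_write_wins(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Keep the version with the latest timestamp; the losing write is dropped."""
    return a if a.timestamp >= b.timestamp else b


# Two leaders accepted conflicting writes for the same key:
v1 = VersionedValue("alice@old-mail.example", timestamp=1700000000.0)
v2 = VersionedValue("alice@new-mail.example", timestamp=1700000050.0)
print(resolve_last_write_wins(v1, v2).value)  # the newer write wins
```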
Leaderless replication takes a quorum-based approach. In such a setup any node in the network can accept write operations. It aims for consensus among a certain number of nodes, which is called a quorum. The primary benefit of this approach is high availability, but it also has drawbacks related to conflict resolution.
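The quorum idea is easy to state in code: with N replicas, requiring W acknowledgements for writes and R for reads guarantees that read and write sets overlap whenever W + R > N. A tiny check, with the numbers chosen purely for illustration:

```python
def has_quorum_overlap(n: int, w: int, r: int) -> bool:
    """True if every read quorum intersects every write quorum (W + R > N)."""
    return w + r > n


# With 3 replicas, writing to 2 and reading from 2 guarantees that at least
# one replica in the read set has seen the latest write.
print(has_quorum_overlap(n=3, w=2, r=2))  # True
# Writing to 1 and reading from 1 gives no such guarantee.
print(has_quorum_overlap(n=3, w=1, r=1))  # False
```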
There are many ways to partition a system. Databases are one obvious candidate for partitioning: it allows you to avoid limits on database size, throughput, or the number of concurrent sessions.
A database can be partitioned in several ways:
For example, you can vertically partition a Product data table into several different tables like basic information, item availability, item restrictions, etc. A good signal that you should vertically partition a table is when you notice lots of queries only requesting a few of the columns at a time.

At a large scale one node cannot hold the entire data set, and you're required to use sharding. It allows you to scale far beyond the constraints of a single traditional database. As a side effect it also makes the application more reliable and resilient to failure.
Sharding comes with additional complexity, as you need to deal with data spread across all the partitions. The biggest issue is keeping the data consistent across all of them. Sharding should always be a last resort when it comes to scaling your database. Alternatives like read replicas and caching should be tried first because they are much easier to implement. An important note about sharding, which makes it different from clustering or replication, is that shards know about each other.
Partitions share the same data schema. The distribution of data is based on some kind of logic or identifier, often called a shard key. For example, the customer name can serve as such a key: customers whose names start with A–M go into one partition, and those with N–Z into another.
Range: To make efficient scans possible, the data can be partitioned into ordered and contiguous value ranges by range-sharding. However, this approach requires some coordination through a master that manages assignments. To ensure elasticity, the system has to be able to detect and resolve hotspots automatically by further splitting an overburdened shard.
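A minimal sketch of range-based routing, assuming the boundaries below were handed out by such a coordinator (the boundaries and shard names are made up):

```python
import bisect

# Upper bound of each contiguous key range, kept sorted; a coordinator/master
# would manage these assignments and split ranges that become hotspots.
range_bounds = ["g", "n", "t", "~"]                    # "~" sorts after lowercase letters
shards = ["shard-0", "shard-1", "shard-2", "shard-3"]  # a-g, h-n, o-t, u-z


def shard_for(key: str) -> str:
    """Route a key to the shard owning its range via binary search over the bounds."""
    idx = bisect.bisect_left(range_bounds, key[0].lower())
    return shards[idx]


print(shard_for("Alice"))   # shard-0
print(shard_for("Oliver"))  # shard-2
```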
Hash: In hash-sharding every data item is assigned to a shard server according to a hash value computed from the primary key. The goal of the hashing function is to split the data evenly.
serverid = hash(id) % servers
This hashing scheme requires all records to be reassigned every time a server joins or leaves, because serverid changes with the number of shard servers (servers).
This approach does not require a coordinator and also guarantees that the data is evenly distributed across the shards, as long as the hash function used produces an even distribution.
A downside of this strategy is the need to remap data to hash values when servers are added or removed. For this reason elastic systems use consistent hashing instead.
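To illustrate why consistent hashing helps, here is a compact ring sketch (no virtual nodes, MD5 purely for illustration). When a server is added, only the keys on the arc claimed by the new server move, instead of nearly all keys as with the modulo scheme above.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hash ring (no virtual nodes), for illustration only."""

    def __init__(self, servers):
        self.ring = sorted((self._hash(s), s) for s in servers)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server: str) -> None:
        bisect.insort(self.ring, (self._hash(server), server))

    def server_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the first server point on the ring."""
        points = [point for point, _ in self.ring]
        idx = bisect.bisect(points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]


ring = ConsistentHashRing(["db-1", "db-2", "db-3"])
keys = [f"user:{i}" for i in range(10)]
before = {k: ring.server_for(k) for k in keys}
ring.add("db-4")
moved = [k for k in keys if ring.server_for(k) != before[k]]
print(f"keys remapped after adding db-4: {len(moved)} of {len(keys)}")
```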
Entity group: This is a data partitioning scheme with the goal of enabling single-partition transactions on co-located data.
Lookup service: This sharding strategy works by implementing a lookup table that sits in front of the sharded databases. The service tracks the current partitioning scheme and maps keys to the locations of each shard.
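A minimal sketch of the idea, with a plain dictionary standing in for the lookup service (the customer ids and connection strings are invented; in practice the mapping would live in its own highly available store and be cached by the application layer):

```python
# Hypothetical lookup table mapping a customer id to its shard's connection string.
shard_map = {
    "cust-001": "postgres://shard-1.internal/app",
    "cust-002": "postgres://shard-2.internal/app",
}
DEFAULT_SHARD = "postgres://shard-3.internal/app"


def shard_for_customer(customer_id: str) -> str:
    """Resolve which shard holds the customer's data, with a default fallback."""
    return shard_map.get(customer_id, DEFAULT_SHARD)


print(shard_for_customer("cust-002"))  # postgres://shard-2.internal/app
```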
A carefully selected partition key allows you to avoid hotspots and is crucial when designing a system. If you partition a database but one shard still gets the majority of the requests, then you haven't solved your problem. Ideally, load gets distributed evenly across all the partitions. For example, hash by customer ID and not by the first letter of the customer name, because some letters are more frequent than others. The same principle applies when partitioning a message queue: pick a partition key that leads to an even distribution of messages across the set of queues.
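A quick illustration of the difference, using made-up customer data: bucketing by the first letter piles the common "S" surnames into one partition, while hashing the numeric customer ID spreads the load roughly evenly.

```python
from collections import Counter

# Made-up customers: (customer_id, surname). Note how many surnames start with "S".
customers = list(enumerate(
    ["Smith", "Jones", "Sanchez", "Silva", "Santos", "Brown", "Singh", "Sato"]))
PARTITIONS = 4

# Partitioning by first letter sends every "S" customer to the same partition.
by_letter = Counter(ord(name[0]) % PARTITIONS for _, name in customers)

# Hashing the numeric customer id spreads the customers across partitions.
by_id = Counter(hash(customer_id) % PARTITIONS for customer_id, _ in customers)

print("by first letter:", dict(by_letter))  # one hot partition
print("by customer id: ", dict(by_id))      # even spread
```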
It’s possible to have a small number of power users who create much, much more data than others. Sometimes it’s reasonable to allocate them a shard of their own. Random assignments are usually easiest, but are by no means the only option. For example, users can be assigned to shards randomly, except for the user with ID 367823, who is assigned to shard 15, which no other user is assigned to. Source: Systems design for advanced beginners | Robert Heaton
Other candidates for partitioning include file storage, caches, queues, and compute instances. Partitioning a queue or message bus, for example, allows you to avoid limits on the number of requests or the number of concurrent connections.