Scaling up a network surfaces a lot of issues that don’t appear when it’s small. In this post, let’s look at what those problems are.
One of the most important decisions that made the Internet possible was to keep the network stateless and push applications, which usually need to store state, to the end hosts. As networks developed, stateful applications were introduced for certain purposes. The following are the three most common ones.
Besides the network applications mentioned above, active/active failover is almost always a stateful feature as well. A group of devices needs to maintain internal state and exchange it regularly to achieve the fastest possible failover.
Network convergence is defined as the process of a network becoming stable again after a change. The goal, in general, is to minimize the convergence time so as to maximize the stability of the network. This factor can often be ignored in a relatively small network, because current computational power is strong enough and the applications usually aren’t critical enough. As a network scales up, there are more chances for it to carry important traffic where any loss should be kept as small as possible, and the other factors contributing to convergence delay should be taken into consideration.
The convergence time can be divided into four parts:
We’ll discuss these factors using OSPF as an example; other protocols are more or less similar. For the same reason, a network change will be simplified to a failure in the network.
Detecting a change quickly may be the most important task for fast convergence. Ideally, a change should start propagating to devices the moment it happens. Keepalive messages can be used for this task: they are sent repeatedly to a connected device, and a predefined number of missed keepalives is interpreted as a failure in the network. Similarly, a new link can be detected by the same method. But this alone isn’t enough for maximum speed.
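To make the detection delay concrete, here is a minimal sketch of keepalive-based failure detection, loosely modeled on OSPF’s hello/dead timers. The class name and the specific timer values are assumptions for illustration, not any vendor’s implementation.

```python
HELLO_INTERVAL = 10  # seconds between keepalives (OSPF broadcast default)
DEAD_INTERVAL = 40   # 4 missed hellos => declare the neighbor down

class Neighbor:
    """Hypothetical neighbor state machine: tracks when we last heard a hello."""

    def __init__(self, now):
        self.last_heard = now

    def on_hello(self, now):
        # Any received keepalive resets the dead timer.
        self.last_heard = now

    def is_down(self, now):
        # A failure is only *inferred* after DEAD_INTERVAL of silence, so the
        # worst-case detection lags the actual failure by up to ~40 seconds.
        return now - self.last_heard >= DEAD_INTERVAL

n = Neighbor(now=0)
n.on_hello(now=35)        # a hello arrives just in time
print(n.is_down(now=40))  # False: the dead timer was reset at t=35
print(n.is_down(now=80))  # True: 45 s of silence >= 40 s dead interval
```

The point of the sketch is the gap it exposes: with timer-based detection alone, the network can sit on a dead link for an entire dead interval before reacting.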
Depending on the network type, the detection delay on a multi-access network is usually larger than on a physical point-to-point link. A lower-layer technology in the OSI model should be able to detect the loss of a link at the shortest interval possible; for example, a point-to-point Ethernet link can report a failure almost instantly by detecting link pulses, but there can be hardware-related timers that delay reporting layer 1 events. On Cisco Gigabit Ethernet, there is a carrier-delay timer, set per interface. This timer is typically used to ignore transient lower-layer events, so a non-critical L1 failure that heals itself below layer 3 is never noticed. In most cases it makes sense to rely on a lower-layer mechanism when one is available, but more often there isn’t one.
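The carrier-delay behavior can be sketched as a simple debounce: a “link down” event is only reported to layer 3 if it persists longer than the timer. The function name, event model, and timer value below are assumptions for illustration.

```python
CARRIER_DELAY = 2.0  # seconds a "down" must persist before layer 3 is told

def report_down_events(events, carrier_delay=CARRIER_DELAY):
    """events: time-ordered list of (timestamp, state), state 'up' or 'down'.
    Returns the timestamps at which layer 3 is actually notified of a failure."""
    reported = []
    down_since = None
    for ts, state in events:
        if state == "down":
            down_since = ts
        else:  # link came back up
            if down_since is not None and ts - down_since >= carrier_delay:
                reported.append(down_since + carrier_delay)
            down_since = None
    if down_since is not None:  # link never recovered
        reported.append(down_since + carrier_delay)
    return reported

# A 0.5 s flap is absorbed; a persistent failure is reported 2 s late.
print(report_down_events([(1.0, "down"), (1.5, "up"), (10.0, "down")]))
# -> [12.0]
```

This shows the trade-off in one place: the timer hides harmless flaps, but every real failure now pays the full debounce delay before convergence can even begin.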
After a change is detected, the routing protocol generates an LSA and sends it to its neighbors. To completely converge the network, the notification needs to reach every router in its flooding scope. In a properly designed network, flooding is strictly controlled: in OSPF, the scope is a single area, unless the LSAs are flooded as external. This delay can be influenced by the following factors:
The SPF algorithm is bounded by $O(L+N\cdot \log N)$, where $N$ is the number of nodes and $L$ is the number of links in the topology under consideration. The worst-case complexity for dense topologies can be as high as $O(N^2)$, but this is rare in real-world scenarios. For comparison, the Bellman-Ford algorithm used in RIP has a complexity of $O(N \cdot L)$. This delay was a major limiting factor in the ’80s and ’90s, when routers used slow CPUs and the calculation could take seconds to complete. Moore’s Law has allowed modern hardware to significantly reduce its impact, but it is still one of the major obstacles to reaching sub-second convergence.
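A minimal heap-based Dijkstra sketch of the SPF computation is below. Note that with a binary heap it runs in $O((N+L)\log N)$; the $O(L+N\log N)$ bound assumes a Fibonacci heap. The router names and link costs are made up for illustration.

```python
import heapq

def spf(graph, root):
    """graph: {node: [(neighbor, cost), ...]}. Returns shortest costs from root."""
    dist = {root: 0}
    pq = [(0, root)]  # min-heap of (cost-so-far, node)
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue  # stale entry left behind by a later, cheaper relaxation
        for neighbor, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(pq, (nd, neighbor))
    return dist

graph = {
    "R1": [("R2", 10), ("R3", 1)],
    "R2": [("R4", 1)],
    "R3": [("R2", 2), ("R4", 100)],
}
print(spf(graph, "R1"))  # {'R1': 0, 'R2': 3, 'R3': 1, 'R4': 4}
```

The log factor comes from the heap operations; on a few hundred nodes this finishes in microseconds on modern CPUs, which is why raw SPF time is no longer the dominant cost.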
The other factor is SPF throttling, implemented for routing protocols on some platforms. It uses an exponential backoff algorithm to schedule SPF calculations, in order to avoid excessive recomputation during periods of high network instability while keeping the SPF reaction fast when the network is stable.
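The throttling logic can be sketched as follows. It is loosely modeled on the (initial, increment, maximum) timer triples found on some platforms, but the class name, values, and the quiet-period reset are assumptions, not any vendor’s exact behavior.

```python
class SpfThrottle:
    """Hypothetical exponential-backoff scheduler for SPF runs."""

    def __init__(self, initial=0.05, increment=0.2, maximum=5.0, quiet=10.0):
        self.initial = initial      # delay before the first SPF after stability
        self.increment = increment  # first backoff step after that
        self.maximum = maximum      # the backoff never exceeds this cap
        self.quiet = quiet          # stability period that resets the backoff
        self.wait = initial
        self.last_event = None

    def next_delay(self, now):
        """How long to wait before running SPF for a change seen at `now`."""
        if self.last_event is not None and now - self.last_event >= self.quiet:
            self.wait = self.initial  # network was stable: react fast again
        delay = self.wait
        # Double the wait for the next event, bounded by the cap.
        self.wait = min(max(self.wait * 2, self.increment), self.maximum)
        self.last_event = now
        return delay

t = SpfThrottle()
print([t.next_delay(now) for now in (0, 1, 2, 3, 4)])  # delays grow per event
print(t.next_delay(now=60))  # after 10 s of quiet, back to the initial delay
```

The design intent is visible in the two prints: a burst of flaps pays an increasing price, while a change after a quiet period is handled almost immediately.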
As soon as the SPF computation is completed, a sequential RIB update is scheduled to reflect the changed topology. The update is then propagated to the FIB by a centralized or distributed process, depending on the platform architecture. The RIB/FIB update may contribute the most to the convergence time in a topology with a large number of prefixes. In such a network, updating the RIB/FIB may take considerable time, on the order of ten seconds.
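A toy model makes the scaling intuition clear: with a roughly fixed per-prefix programming cost, a sequential update grows linearly with the number of affected prefixes. The per-prefix cost below is an assumed illustrative number, not a measurement of any platform.

```python
PER_PREFIX_COST_US = 20  # assumed microseconds to program one FIB entry

def fib_update_time(prefixes, per_prefix_us=PER_PREFIX_COST_US):
    """Seconds to sequentially reprogram `prefixes` FIB entries."""
    return prefixes * per_prefix_us / 1_000_000

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} prefixes -> {fib_update_time(n):.1f} s")
```

Even with an optimistic per-entry cost, a full-table event on a router carrying around a million prefixes lands in the tens-of-seconds range, which is why prefix count, not SPF, often dominates convergence at scale.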
As a reader, you may have come up with ideas to overcome the problems mentioned above, such as using BFD for link detection. It is true that there is always another protocol or tool to fix a problem, but this method stacks protocols on top of each other. That wouldn’t be an issue if the number of protocols were carefully limited, but it usually isn’t. The booming number of protocols will eventually exceed what can be effectively managed.
Also, a larger network means more devices and links in it, making the network harder to manage effectively. Virtual layers, such as the IPv4 and IPv6 layers, are usually entangled and grow exponentially more complex with the size of the network.