Standards and backwards compatibility revisited

Everyone loves Standards

There are so many to choose from, which sadly is where problems and resultant costs arise.

As far as networks are concerned, any network-connected device - even an outwardly simple network switch - is very likely to be running several different software components.

Each of these components will perform a distinct function within the overall operation of the device, with each supposedly conforming to its own set of Standards.

Problems, and hence costs, will arise if the equipment doesn't conform to the best-practice Standards, or if different manufacturers have interpreted the Standards documents slightly differently, leading to incompatibilities between devices.

A similar set of problems can arise where different devices are using different versions of the same Standard, but there's no backwards compatibility between those versions, so the older versions can't happily co-exist on the same network infrastructure.

Generic problems with Clusters

A 'Cluster' is usually two or more devices working together to provide some kind of failure resilience. If one device fails, then another device takes over; three or more devices give greater resilience still. However, only one device can be in control at any one time, often termed the master or primary device.

A major potential problem can occur with Clusters of devices: the so-called "Split-Brain" condition. This is where the overall Cluster is supposed to be a single logical "entity" with a single "master" node, but for some reason or another it ends up with two or more devices simultaneously assuming "master" status.

A bitter and bloody fight may then ensue between the warring "kings of the castle", leading to all manner of potential chaos.

One common cause of such a "Split-Brain" scenario is where two different nodes become temporarily isolated from each other by a networking fault somewhere between them; if the outage lasts for an extended period, each node will assume the other has died and will claim the "throne" for itself.

Another, more subtle, way to induce a "Split-Brain" condition may arise if two or more nodes are upgraded to a new version of the Clustering protocol one at a time. The Clustering protocol effectively elects the "master" with a regular exchange of messages. If, as we have seen on several occasions, the format of the Clustering protocol message itself changes between versions, then the old version may be completely deaf to the new version's messages until both are running the same version. As a result, the deaf old node ascends the throne in complete ignorance of the fresh new upstart.
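As a purely illustrative sketch - not any particular vendor's clustering protocol - the Python fragment below shows how this happens. The class and field names (ClusterNode, HEARTBEAT_TIMEOUT and so on) are invented for the example: an old node insisting on an exact version match silently discards the upgraded peer's heartbeats, times out, and promotes itself.

    import time

    class ClusterNode:
        """Toy model of a cluster member; names and fields are illustrative only."""

        HEARTBEAT_TIMEOUT = 3.0  # seconds without a recognised heartbeat before claiming master

        def __init__(self, name: str, protocol_version: int):
            self.name = name
            self.protocol_version = protocol_version
            self.is_master = False
            self.last_heard = time.monotonic()

        def on_heartbeat(self, message: dict) -> None:
            # An old node that insists on an exact version match is "deaf" to an
            # upgraded peer: the heartbeat arrives but is silently discarded.
            if message.get("version") != self.protocol_version:
                return
            self.last_heard = time.monotonic()

        def tick(self) -> None:
            # Having heard no valid heartbeat, the node assumes its peer is dead
            # and ascends the throne - even though the peer may be alive and well.
            if time.monotonic() - self.last_heard > self.HEARTBEAT_TIMEOUT:
                self.is_master = True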

The overall point is that when running any form of Clustering, you should be aware of the potential for Split-Brain conditions and design resilience into the underlying network to avoid it if at all possible. Similarly, when upgrading nodes within a Cluster, beware of the potential for Split-Brain and plan and test the upgrade project with a view to avoiding it.

By taking care with the design and upgrade of the network, you can avoid the pain and costs which might arise from a broken Cluster.

Some Further Examples

Precision Time Protocol

Precision Time Protocol (PTP) is a somewhat specialised timing protocol used primarily within broadcast video networks. Simpler networks may not use it at all, but the compatibility issue with PTP is a cautionary tale which should be borne in mind for any protocol and its use.

PTP is effectively able to synchronise network components' timestamps to within a few nanoseconds (billionths of a second), in contrast to the older Network Time Protocol (NTP), which is usually only capable of achieving millisecond accuracy.

Sadly, PTP suffers from a potential backwards-compatibility issue. The original PTP version 1 will simply not safely co-exist on the same network with devices talking PTP version 2.

Until quite recently, some manufacturers didn't support PTPv2, which meant that where a customer had both PTPv1 and PTPv2 devices, they had to provide a completely separate set of hardware for the PTPv1 devices, with all the consequential extra cost and administrative overhead that entailed.

When designing a network which requires PTP, you must therefore be very sure to clarify which version of PTP each manufacturer's equipment can support.
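As a rough, practical aid, the Python sketch below listens for PTP general messages and reports the version each device is announcing. It assumes PTP is running over UDP/IPv4 multicast (default group 224.0.1.129, general port 320); PTP can also run over raw Ethernet, which this sketch will not see. In the PTPv2 header the low nibble of the second byte is versionPTP, and a PTPv1 header happens to yield 1 in the same position, so the same test distinguishes the two.

    import socket
    import struct

    PTP_GENERAL_PORT = 320             # PTP "general" messages (Announce etc.) over UDP
    PTP_PRIMARY_GROUP = "224.0.1.129"  # default PTP multicast group for UDP transport

    def ptp_version(payload: bytes) -> int:
        """Best-effort guess of the PTP version from a UDP payload."""
        return payload[1] & 0x0F       # 2 for PTPv2; a PTPv1 header gives 1 here

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PTP_GENERAL_PORT))  # ports below 1024 may need elevated privileges

    # Join the PTP multicast group on the default interface.
    mreq = struct.pack("4s4s", socket.inet_aton(PTP_PRIMARY_GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, (src, _) = sock.recvfrom(2048)
        if len(data) >= 2:
            print(f"{src} is talking PTP version {ptp_version(data)}")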

In general, we would strongly argue that any equipment that can't use PTPv2 should be avoided entirely; it potentially requires a complete duplicate set of network hardware to be purchased just to support the legacy equipment.

Virtual Router Redundancy Protocol

Another example which occurred recently relates to VRRP (Virtual Router Redundancy Protocol).

VRRP is effectively used to cluster a pair of network switches together to provide an overall resilient gateway service.

The original standard is effectively Version 2, which dates from before 2004. VRRP Version 3, which was standardised in 2010, is potentially not backwards-compatible with VRRPv2. This implies that extreme caution should be observed when running a mixed VRRPv2/VRRPv3 system or when upgrading nodes from VRRPv2 to VRRPv3.

The relevant VRRPv3 standard explicitly warns:

"Mixing VRRPv2 and VRRPv3 should only be done when transitioning from VRRPv2 to VRRPv3. Mixing the two versions should not be considered a permanent solution."

As with other clustering protocols, there is the potential for chaos - and hence cost - if there is any confusion between the nodes running the clustered VRRP service. The basic caveat is to plan the upgrade of any such "cluster" of VRRP-capable devices carefully, and to remove the VRRPv2 workarounds once the migration is complete.
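One practical precaution is to audit what each gateway is actually sending before and during the migration. The short Python helper below - an illustrative sketch, not part of any product - parses a captured VRRP advertisement (VRRP rides directly on IP as protocol 112, with IPv4 advertisements sent to multicast 224.0.0.18). In both VRRPv2 and VRRPv3 the version sits in the high nibble of the first byte, so a mixed deployment shows up immediately.

    VRRP_PROTOCOL_NUMBER = 112      # VRRP rides directly on IP, not TCP/UDP
    VRRP_IPV4_GROUP = "224.0.0.18"  # destination of VRRPv2/v3 advertisements over IPv4

    def parse_vrrp_advertisement(vrrp_bytes: bytes) -> dict:
        """Extract version, virtual router ID and priority from a raw VRRP payload.

        The payload is the VRRP message itself, i.e. with the IP header already
        stripped (as delivered by a packet capture tool or raw socket helper).
        """
        version = vrrp_bytes[0] >> 4       # 2 = VRRPv2, 3 = VRRPv3
        msg_type = vrrp_bytes[0] & 0x0F    # 1 = ADVERTISEMENT
        virtual_router_id = vrrp_bytes[1]
        priority = vrrp_bytes[2]
        return {
            "version": version,
            "type": msg_type,
            "vrid": virtual_router_id,
            "priority": priority,
        }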

IP Header Checksum

TL;DR: A subtle low-level issue, where several major manufacturers failed to strictly adhere to the relevant Standard.

Every IP packet carries a header checksum, used to detect corruption, but some manufacturers were creating those checksums in slightly different ways, which led to one manufacturer's equipment not recognising another's values and hence dropping packets. What this would mean for your network is that certain types of traffic would be impacted - Video and Audio streams in particular would suffer loss. This is highly noticeable as freezes, Video glitches and Audio break-up!

When manufacturers tried to work round the problem with a software patch, they tended to make things worse because the root cause was inside the hardware and it was too expensive to completely rebuild the devices.

This shows why manufacturers must conform strictly to all the relevant protocol Standards.

We first came across the issue in 2010, so it has most likely been fixed on all modern hardware, but the wider caveat applies: interoperability between multiple vendors' devices can be fraught with difficulty.

This issue was originally found after some investigations of long-lived RTP Streams where a small percentage of traffic was being lost; RTP is a protocol used to transport Audio or Video across networks.

Following some extensive bench testing, capturing the traffic entering and leaving various equipment, it became obvious that the dropped packets all had a checksum value of 0xFFFF, which is illegal according to the strict guidelines of the relevant standard, RFC1624, which uses the slightly esoteric "one's complement" style of arithmetic. Because the checksum is 16 bits in size, the chance of this value occurring was about 1 in 64K, and hence, on average, the packet loss was effectively 1 in 64K for a single stream. For most traffic such a small packet loss is barely noticeable, and is usually corrected automatically by other mechanisms. However, on connectionless, jitter-sensitive RTP media streams, even a small amount of loss is unacceptable; in simple terms it leads to clicks, bangs and "wobbly" Audio.

The problem was that several manufacturers - notably some Cisco hardware and Windows Vista, at least - performed the one's-complement arithmetic wrongly, giving rise to the "invalid" 0xFFFF checksum, which generally caused the packet to be unceremoniously dropped by any kit running in strict adherence to the Standard. Had the originating kit behaved correctly, it would have generated a 0x0000 checksum on the relevant packets.
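For reference, a minimal Python sketch of the from-scratch calculation in the style of RFC791 / RFC1071 is shown below. Because the one's-complement sum of a real header's words can never be zero, complementing it can never legitimately produce 0xFFFF - which is why a compliant device emits 0x0000 on the packets in question.

    def ones_complement_sum(words):
        """One's-complement sum of 16-bit words, folding the carry back in each time."""
        total = 0
        for word in words:
            total += word
            total = (total & 0xFFFF) + (total >> 16)  # end-around carry
        return total

    def ip_header_checksum(header: bytes) -> int:
        """From-scratch IP header checksum (RFC791 / RFC1071 style).

        The checksum field within `header` is assumed to be zeroed first. For a
        real header, whose words are not all zero, the one's-complement sum can
        never be 0x0000, so the complemented result can never be 0xFFFF.
        """
        words = [int.from_bytes(header[i:i + 2], "big") for i in range(0, len(header), 2)]
        return ~ones_complement_sum(words) & 0xFFFF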

A potential fix would have been for all equipment to have an option to grudgingly accept packets with an 0xFFFF checksum - on the basis that some hardware still performed the calculation erroneously - while recomputing it correctly before handing the packet on to the next device in the network. That could potentially have corrected any errant checksums floating about on the network, but any quick fix would probably have led to the modified devices processing the packets in their central CPUs rather than in the dedicated - and much faster - ASIC hardware. The side-effect would have been that the corrected packets themselves suffered unacceptable timing jitter from being processed by the slower CPU, potentially a worse condition than the original broken checksum.

Accommodating the fix in the ASICs might have been possible, but would have been much more expensive, potentially requiring the hardware itself to be completely redesigned - and all with the aim of solving a problem caused by another, less than careful, party!

RFC1624 - last updated in 1994 - provides an optimised calculation for routers which need to compute the new checksum as fast as possible: they can update it incrementally, working from the received checksum and the change to the TTL field, rather than recalculating it from scratch. The original IP Standard, RFC791, from 1981, by contrast, outlines the non-incremental version of the algorithm which end-nodes use when creating the original packet.
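For the curious, a sketch of the safe incremental form (RFC1624, Eqn. 3: HC' = ~(~HC + ~m + m')) might look like this in Python. Unlike the earlier RFC1141 formulation, it cannot produce the illegitimate 0xFFFF result where the true answer is 0x0000.

    def fold16(value: int) -> int:
        """Fold a sum back into 16 bits with end-around carry."""
        while value >> 16:
            value = (value & 0xFFFF) + (value >> 16)
        return value

    def incremental_checksum_update(old_checksum: int, old_word: int, new_word: int) -> int:
        """RFC1624 Eqn. 3: HC' = ~(~HC + ~m + m'), all in 16-bit one's complement.

        old_word / new_word are the before and after values of the 16-bit header
        word that changed - for example the word containing the TTL when a
        router decrements it.
        """
        total = fold16((~old_checksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word)
        return ~total & 0xFFFF

    # Worked example from RFC1624: a header word changes from 0x5555 to 0x3285 and
    # the old checksum is 0xDD2F; the correct new checksum is 0x0000, not 0xFFFF.
    assert incremental_checksum_update(0xDD2F, 0x5555, 0x3285) == 0x0000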

For those who care to understand the actual algorithm in use, the pertinent standard is RFC1624: "Computation of the Internet Checksum via Incremental Update" http://www.ietf.org/rfc/rfc1624.txt

Summary

Different network configurations and applications make use of numerous underlying network protocols. In order to avoid compatibility issues, it is as well to be aware of all the various protocols that the network and its applications need to use, and the version number - if appropriate - used by all the nodes on the network. Where old or legacy equipment is in use, it will be prudent to check its compatibility as much as possible before "go live".

Where a major upgrade is planned, which potentially updates the version number or format of the "on the wire" protocol, it is as well to bench test what happens during the upgrade where some of the kit has been updated, but some is still running the old software.

Become Diplomatic with your Protocols

At Layer3 Systems, we provide the attention to detail necessary to ensure all the various protocols are compatible throughout your network.

We can liberate your business by highlighting and correcting such discrepancies on your network so that you can confidently deliver what you do best.

If you want to improve uptime, remove uncertainties, and increase the performance of your systems and networks, then we can provide resources to manage such protocol-related issues. Come and talk to us. We can help. Call 0203 805 7795 or email us and talk directly to one of the Layer3 Systems team.

