No matter how big or small your networking project, you need to be certain that the design and its implementation are fit for purpose.
Employing a consultancy will very often produce fantastic results, but how do you ensure that what is being designed and built will meet your requirements and won't be over-designed (and hence more expensive and more difficult to maintain than it needs to be)?
Here are some top tips for things to look for:
Are there any single points of failure?
Edge Devices
Obviously, if you have an edge device such as a server with a single network connection, that connection is always going to be a single point of failure (SPOF). It's worth checking whether the device can be connected to two upstream switches at once, which protects against switch failure and makes firmware updates on switches a bit easier.
There are actually several ways edge devices can connect like this, and it's worth checking which are available to you as they have different performance and resilience implications.
Some of the most common options are:
- Link Aggregation Control Protocol (LACP) across multiple connections - this is definitely the best option, as traffic can be dynamically shared across all the network connections in both directions and failover time if you lose a link is around 1 ms. However, it does require support and configuration within the switches, and if the links go to two different switches, those switches must be able to present themselves to the device as a single LACP partner (for example by stacking).
- Fail-over or Active/Backup - this is where only one of two connections is used at any one time. If the 'active' link goes down (e.g. if the switch dies or the network cable is unplugged), the other link will come up and take over.
- 'VMware style' - the device will accept traffic on either link, with a guarantee that traffic sent up one leg will never emerge on the other. Each Virtual Machine (VM) can be bound to a particular interface, with fail-over as described in the point above.
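To make the difference between the first two options more concrete, here is a minimal Python sketch. It assumes flows are spread over the aggregated links by hashing their endpoints, which is a common approach but not something LACP itself mandates. The interface names, addresses and both function names are hypothetical.

```python
import zlib

def pick_link(flow, active_links):
    """Pick one link for a flow by hashing its endpoints (illustrative only)."""
    if not active_links:
        raise RuntimeError("no active links left")
    key = "|".join(flow).encode()
    return active_links[zlib.crc32(key) % len(active_links)]

def active_backup(links, is_up):
    """Return the first link that is up: only one link carries traffic."""
    for link in links:
        if is_up[link]:
            return link
    raise RuntimeError("no active links left")

links = ["eth0", "eth1"]  # hypothetical interface names
flows = [("10.0.0.%d" % i, "10.0.1.1") for i in range(1, 9)]

# Aggregated: each flow is assigned to one of the two links by its hash.
before = {flow: pick_link(flow, links) for flow in flows}
# eth0 fails: every flow is re-hashed onto the surviving link.
after = {flow: pick_link(flow, ["eth1"]) for flow in flows}

# Active/backup: eth0 carries everything until it fails.
print(active_backup(links, {"eth0": True, "eth1": True}))   # eth0
print(active_backup(links, {"eth0": False, "eth1": True}))  # eth1
```

In the aggregated case both links can carry traffic at once, whereas in the active/backup case the second link sits idle until it is needed.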
Network Switches
Most modern switches support LACP, which makes it easier to share traffic load across multiple connections to two or more switches.
The overall network design should allow for diverse routes between any two switches so as to accommodate the hardware failure, or firmware upgrade, of any given switch.
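One way to check a design for diverse routes is to model the switches as a graph and ask, for each switch in turn, whether the rest stay connected without it. The sketch below does exactly that with a simple breadth-first search. The topology and switch names are made up for illustration, and the check only considers whole-switch failures, not individual links.

```python
from collections import deque

def reachable(graph, start, removed):
    """Breadth-first search from start, ignoring the removed switch."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in graph[node]:
            if peer != removed and peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen

def single_points_of_failure(graph):
    """Return switches whose loss disconnects some of the remaining switches."""
    spofs = []
    for candidate in graph:
        others = [n for n in graph if n != candidate]
        if not others:
            continue
        if len(reachable(graph, others[0], candidate)) < len(others):
            spofs.append(candidate)
    return sorted(spofs)

# Hypothetical topology: access1 is dual-homed, access2 hangs off dist1 only.
topology = {
    "core1": {"core2", "access1", "dist1"},
    "core2": {"core1", "access1", "dist1"},
    "access1": {"core1", "core2"},
    "dist1": {"core1", "core2", "access2"},
    "access2": {"dist1"},
}
print(single_points_of_failure(topology))  # ['dist1']
```

Here `dist1` is flagged because `access2` has no other way to reach the rest of the network.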
The choice of best route is usually controlled dynamically by Layer 2 and Layer 3 protocols, such as Rapid Spanning Tree Protocol (RSTP) at Layer 2 or Open Shortest Path First (OSPF) at Layer 3, all of which need to be carefully designed and configured.
However, it's important to prevent traffic 'looping' between switches, as this will inevitably lead to broadcast storms. These are extremely undesirable, as they can bring your network shuddering to a halt while it tries to deal with all the packets. We have actually seen a large network brought to a complete halt for several hours like this.
Professional-quality network switches should be fitted with multiple power supply units (PSUs) and supplied from more than one mains supply to guard against loss of functionality should a PSU or an upstream mains supply fail.
What's the contention ratio like?
Contention ratio: if all the connected devices sent data at once, how much more data is that than the uplink capacity?
If you have your end-user devices connected with gigabit Ethernet to (say) a 24-port access switch with a single 1 Gb/s uplink then, assuming you are using all your ports (which may not be a good idea; I will write something about that in a later post), your contention ratio would be 24:1.
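The arithmetic above can be written as a small helper. This is only a sketch: the function name and parameters are my own, and the second example (a 48-port switch with a single 10 Gb/s uplink) is a hypothetical configuration added for comparison.

```python
def contention_ratio(port_count, port_speed_gbps, uplink_count, uplink_speed_gbps):
    """Total edge-port capacity divided by total uplink capacity."""
    downstream = port_count * port_speed_gbps
    upstream = uplink_count * uplink_speed_gbps
    return downstream / upstream

# 24 x 1 Gb/s ports with one 1 Gb/s uplink, as in the example above.
print(contention_ratio(24, 1, 1, 1))   # 24.0
# Hypothetical: 48 x 1 Gb/s ports with one 10 Gb/s uplink.
print(contention_ratio(48, 1, 1, 10))  # 4.8
```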
It really does depend on what your users will be doing. For example, if your users are all office workers, perhaps editing documents, then 24:1 is lower than you need and you could probably manage a 48:1 contention ratio. On the other hand, if all your users are editing video off a central server, then you would probably need 24:1 or even less.
Picking the right contention ratio (how many users per switch) is a bit of a dark art TBH, more so with servers and even more so with virtual servers.
This principle can be applied to every switch on your network - what's the difference between the peak aggregate traffic in and the maximum link bandwidth out on the way to the traffic's destination?
Are there any hot-spots?
Look at the overall network diagram from a high level. Is there a single network switch, or a small number of switches, that all the other switches connect to? What's the contention ratio on those devices like, given what you know about your traffic flows? In practice, if you are in the acceptance test phase of your network build, you can look for high traffic levels with your network monitoring tools, and in particular for 'tail drops', where packets are discarded instead of being sent to a destination interface because of congestion.
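As a rough illustration of what a monitoring tool does with interface counters, the sketch below takes two readings of the same interface and flags it as a hot-spot if it was busy or dropped any packets in between. The field names, the sample figures and the 80% threshold are all assumptions for the example, and counter wrap-around is ignored.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """Counters read from one interface at one time (hypothetical fields)."""
    seconds: float  # time of the reading
    tx_bytes: int   # total bytes transmitted so far
    tx_drops: int   # total packets dropped on output so far

def utilisation(first, second, link_bps):
    """Average transmit utilisation between two samples, as a fraction."""
    elapsed = second.seconds - first.seconds
    bits = (second.tx_bytes - first.tx_bytes) * 8
    return bits / (elapsed * link_bps)

def is_hot_spot(first, second, link_bps, max_utilisation=0.8):
    """Flag the interface if it is busy or if any output drops occurred."""
    dropped = second.tx_drops - first.tx_drops
    return utilisation(first, second, link_bps) > max_utilisation or dropped > 0

# Made-up readings taken 60 seconds apart on a 1 Gb/s link.
a = Sample(seconds=0, tx_bytes=0, tx_drops=0)
b = Sample(seconds=60, tx_bytes=6_750_000_000, tx_drops=12)
print(round(utilisation(a, b, 1_000_000_000), 2))  # 0.9
print(is_hot_spot(a, b, 1_000_000_000))            # True
```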
How much of the kit is hot-swappable?
Modern servers and switches often incorporate various hot-swappable components, such as PSUs, disks, and small form-factor pluggable (SFP) network interfaces. Each of these helps to avoid unnecessary downtime when monitoring indicates that a given sub-component has failed. In the modern world one shouldn't have to completely power down a server just to swap a disk.
What's your failover time like?
If you lose a core device, how long will it take before one of the other core devices takes over? Basically, if you have a switched network, this will be determined by the Spanning Tree Protocol, which in its classic form can take up to around 50 seconds to fail over with default timers (RSTP is usually much faster). If you have a routed network, it will mostly depend on which routing protocol you are using and how it's configured:
- BGP - around 180 seconds with default timers and without any assistance from other protocols
- OSPF - around 40 seconds with default timers
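Technologies that let two or more switches operate as a single logical device also affect failover behaviour. Cisco examples include:
- Cisco StackWise or StackWise Virtual (using either a big thick stack cable or 10G Ethernet)
- Cisco virtual Port Channel (vPC) (Nexus switches - they stack at Layer 2 but not at Layer 3)
- Cisco Virtual Switching System (VSS) (very much like StackWise but for bigger systems)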
When it breaks
How much of the network would be out of action if you have to replace any given network device?
- If there's a network device failure, how long does it take before your network is available again?
- Is it easy to upgrade the firmware/software on any given device with minimal impact on the overall end-user experience?
- Do you know which of your services would be affected if any given network device dies or has to be taken down for maintenance?
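The last question in the list above can be explored with a simple model: record which switch each service sits behind, then check which services the users can still reach when a given device is taken away. The sketch below is a minimal version of that idea. The topology, service names and function names are all invented for illustration.

```python
def reachable_from(graph, start, failed):
    """Depth-first search from start, skipping the failed device."""
    if start == failed:
        return set()
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for peer in graph[node]:
            if peer != failed and peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return seen

def affected_services(graph, service_hosts, users_at, failed):
    """Services whose switch can no longer be reached from the users' switch."""
    ok = reachable_from(graph, users_at, failed)
    return sorted(svc for svc, switch in service_hosts.items() if switch not in ok)

# Hypothetical topology and service placement.
graph = {
    "access1": {"core1", "core2"},
    "core1": {"access1", "core2", "srv-sw1"},
    "core2": {"access1", "core1", "srv-sw1", "srv-sw2"},
    "srv-sw1": {"core1", "core2"},
    "srv-sw2": {"core2"},
}
service_hosts = {"email": "srv-sw1", "file-share": "srv-sw2"}

print(affected_services(graph, service_hosts, "access1", "core1"))  # []
print(affected_services(graph, service_hosts, "access1", "core2"))  # ['file-share']
```

In this made-up example, taking `core1` down for maintenance affects nothing, while losing `core2` cuts off the file share because its switch has only one uplink.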
Are there any ticking time bombs?
- Are any of your new network devices about to be end-of-lifed (EOL) by the supplier? That's generally not a good thing for a new installation, but may be appropriate if you are extending an existing network and don't want to have to re-train engineers.
- Can your suppliers explain how to do firmware updates on any given network device without breaking the network in the process? Get them to demonstrate this to your network engineers before you go live if possible.
Do you know how to monitor the network?
Some years ago we investigated a network that was having major stability issues. We found that the London network core (spread across several sites) was using a proprietary protocol called Metropolitan Ring Protocol (MRP), which the support team had no way of inspecting because their monitoring tools did not recognise it. The solution was to re-design the network to avoid use of this protocol and to simplify the design so that it was clear to everybody what paths the traffic was taking across the network in both non-fault and fault conditions. The takeaway is to be absolutely clear about what protocols are running on your network, what they are doing for you and how you will monitor them.