Is It Me Or Is All Software Pants?

You would probably agree that the Internet, mobile phones, tablets and computers are incredibly exciting, powerful tools and that they make the modern world tick. The network technology and the computer systems that underpin all of this functionality are crucial to the reliable operation of our connected world.

Why does it all keep going wrong?

On the surface, everything in your IT systems will look great and should be performing well. Under the surface, a great deal of processing power is spent handling failures and working around problems. Much of systems design goes into ensuring that there are no single points of failure and that resilience is built in, so that when a failure occurs, something somewhere takes over and keeps the wheels turning.

Your systems not only have to be reliable, but they also need to cope with large numbers of random failures that occur in this technological world.

Sometimes it can feel like everything is constantly failing. There are occasions when everyone says they've had "one of those days": everything that could go wrong has gone wrong, and if it had not been for the quick thinking and flexibility of some pretty smart technical staff, the demonstration to your biggest customer would have been an abject failure instead of the excellent sale that was achieved.

You, like us, will have often noticed the frequent failures in networks and systems. You will have wondered where all these failures come from and may even have formed the opinion that a sharp chisel and a mallet would produce a better, more reliable database than the flimsy, brittle and positively unreliable hardware and software we use today.

Our observation of the world, and ICT in particular, tells us that everything fails. The best we can achieve is to stop everything failing at once and to have some spares and backups just in case. The question remains: why do things fail?

It's Complicated

In the physical world faults still occur, but in most cases objects that need to be reliable are designed to support a single, well-defined function. Consequently they fail less often. Contrast that with general-purpose computing, where it is very easy to add a piece of software or change a configuration and suddenly find that things stop working!

Talking to engineering colleagues, it became clear that there are many reasons why things fail.

The act of creating something should produce a functional object, but it also embeds limits, tolerances and weaknesses. The same is true of software. Take any device outside its design specification and you may find that it won't operate, may fail spectacularly and could even be a risk to life and limb!

One of the key features that can lead to failure is complexity. The more sophisticated the code you need to write, say to handle a wide range of scenarios, the more software you will also need to write for internal organisation and management. You might think that adding one small extra feature would be a small change, yet it can in fact require changes to all the other features the software supports. Complexity is a problem for computer-based systems just as it is for physical systems. Usually this is mitigated by software techniques such as regression testing, code reviews, continuous integration and beta testing. Even so, it is very hard to test a system as thoroughly as it is tested when exposed to a very large number of real users. Software behaves differently on the test bench compared to its use in the real world.

Security plays a major part in failure. Our connected, always-on, Internet-based communities are constantly at risk of being hacked, impersonated, defrauded, socially engineered or disconnected. This means that security ideas and systems are constantly playing catch-up against threats that may quietly exist in many different forms: flaws in and across the computer systems we use, flaws in the software they run and even flaws within the people that use them.

Much of our system and network functionality depends upon software and network protocols that are buried within systems and network devices. Some security issues arise from faults which are found in the underlying network protocols themselves.

Most physical things are designed to work reliably for many more years than computers. The lifecycle of physical objects tends to start at around ten years, whilst computers, especially consumer equipment, may only last 3-5 years, though there are some exceptions such as very high quality servers or network components. Even these rarely go much beyond 7-10 years before they are declared "End of Life". Sometimes we see systems or devices being made "End of Life" for commercial reasons, usually because someone wants to sell you something different. These new "things" can bring new benefits, but sometimes they are a backward step, with less usable functionality and more cost!

So the way to minimise problems is to always use good quality hardware and software, working with capable suppliers that have relevant experience and technical knowledge.

  • Ensuring you use systems and equipment that are from a capable company with an excellent track record is critical to good operation.
  • They should provide good support and there should be a sizeable user group who are active and in themselves technically capable.
  • They should provide a good road map and make it clear where they are in the life-cycle of their products.
  • Do not use consumer grade devices in a professional service delivery role; this is likely to cause more problems than standard industrial equipment. In the long run, using professional equipment will be much more cost effective.
  • Design in good resilience and flexibility, minimising or eradicating any single points of failure.
  • Design in redundancy; this can be used to guard against failure and sometimes to support occasional high loads.
  • Ensure you design monitoring that can track the state of your systems and how they are working.
  • Understanding how your network is handling current failures is critical to supporting it.
  • Design as simply as you can, keeping within the design parameters of the components used to build your system.
  • Document your design from the start and note any changes from the original design.

The takeaway is that:

  • Complexity increases risk of failure
  • Quality reduces risk of failure
  • Testing is ultimately just the starting point; use in the real world by many customers is a better mark of reliability.
  • Expecting failure and designing in workarounds will initially be more expensive, but over the life of a project will save much time and cost later.

Reboot and Reset

I assume you are all too familiar with the common cry of "Have you tried turning it off and on again?". I seem to have learned to accept that, and on occasion I am happy to do it, but I do ask myself why I accept it.

When I've discussed this with colleagues they often smile and say it is the modern way. Over time, though, we began to realise that the lack of reliability in ICT systems has become endemic. We can find ourselves struggling with the software tools we take for granted and wasting much time working around problems.

The thing we love about computers (including tablets and smart phones) is that they are very general-purpose and can be used to do many things. Unfortunately, that is also one of their weaknesses. Doing many things with one device drives complexity. The more "apps" you run on your device, the greater the chance of an annoying fault hitting you, some of which can only be worked around by restarting the system. On occasion you will see or hear of a small software problem destroying a device and turning it into a "door stop" or a "house brick".

The ability to do lots of different things with your system has a downside: poorly designed and poorly written programs can pull down your entire system. It shouldn't happen of course, but it can. This applies to almost any complex system: a chain is only as strong as its weakest link.

Phone and tablet apps are particularly interesting. There is in effect an ecosystem where almost anyone with a bit of programming knowledge can create an app. Doubtless these are tested. However, as described above, testing can be as thorough as you like and some horrible problems may still be exposed once the app is in the real world. There is a kind of Darwinian "survival of the fittest app" in force. What this really means is that we as consumers do the final and most thorough testing. We shy away from faulty apps with bad reviews and head over to those that seem to garner the most praise. Obviously apps are cheap and so we can't expect too much from them, right? But is that right? They are certainly not cheap to create and test.

This lower-cost or free-to-use, lower-quality software would be OK if it were limited to cheap consumer devices, but we see the same issues happening on larger, more sophisticated systems such as operating systems, networks and databases. That's despite the effort that most larger, more capable companies put into building first-class software.

We are also seeing a trend for more computing integrated into cars and homes. Given the poor overall quality of software, it is easy to see what will happen. There are already examples of faults in embedded computers in cars causing strange problems that are impossible for garages to resolve. There are examples of battery problems on Teslas which seem to reduce battery life after a software update, and Apple were recently the subject of a lawsuit over a similar issue.

  • What would you say is the global cost of using systems and applications that are becoming more brittle?
  • What is the cost to industry of a fault which impacts the manufacture or delivery of "just in time" critical components?

Generally, when something goes wrong we like to blame the network. So at this point I am going to blame the Internet for throwing up a paradox.

The Internet makes it possible to do many amazing things: finding information, understanding how things work and learning from a vast array of capable people who create amazing things from Aircraft down to Xylophones.

The Internet also helps to drive marketing. It is increasingly important to get new products out "on time" and working perfectly. Often the first product to market will grab the lion's share of the market, unless of course something is faulty with the product. With the Internet it is possible to deliver a product that mostly works, then quietly update software to eradicate faults that should have ideally been fixed before release.

The Internet also has its dark side. The same connectivity that allows you to find interesting courses on anything from bricklaying to yachting also gives criminals the ability to attack systems. This drives the need to make systems secure. That is an extremely hard thing to achieve and a large amount of patching goes on fixing security flaws.

Then we have a new range of problems around commercial licensing of software products. Many suppliers' support and licensing systems (looking at you, Cisco and HP Nimble storage) insist on continuous internet connectivity to support their business model.

All of the above drives complexity: the more parts that are involved, the more there is to go wrong. We may be turning things "off and on again" for some time!

Where Do Bugs Come From

A software bug is simply a fault: an unintentional error. It may be a mistake in using the wrong command or instruction, or even a simple typo in a program. It can be a misunderstanding of a specification, or inexperience on the part of the designer or programmer. There are also weaknesses and problems within the programming languages we use to write software. Some modern programming languages do not have the basic safety checks and self-checks that equivalent languages in the 1950s and 1960s had as a matter of course. These checks would flag problems early in development and reduce errors before proper testing even started.

There are modern programming languages, such as Rust, that provide great performance and flexibility and build in a high degree of self-checking (type-safe and bounds-safe capabilities).
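
As a small illustration of what bounds safety buys you, here is a minimal Rust sketch (our own invented example, not taken from any particular product): an out-of-range access either stops the program immediately or is handled explicitly, instead of quietly reading past the end of a buffer as unchecked languages may allow.

    fn main() {
        let readings = vec![10, 20, 30];

        // Direct indexing past the end is caught at run time: `readings[10]`
        // would panic immediately rather than silently read adjacent memory.

        // The idiomatic approach makes the "might not exist" case explicit.
        match readings.get(10) {
            Some(value) => println!("value: {}", value),
            None => println!("index 10 is out of bounds; handled safely"),
        }
    }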

We also know that modern software development environments (such as compiler build toolchains) contain many different types of checks, which go a long way to improving code and assisting the programmer.

Software failures that break operation are one thing, but some bugs actually cause security vulnerabilities. We still suffer from a huge number of "memory boundary errors" in critical software. Sadly, the programming languages in widespread use still seem to be vulnerable to these, which results in a never-ending supply of patch updates to fix the resulting security vulnerabilities.

It is really rather worrying that these problems still exist even though we have evolved a more secure set of languages; perhaps the problem is simply that these new languages have not yet been widely deployed.

Regardless of the source of bugs, there are at least four opportunities for the software company to detect the problem.

  • Code review tools
  • Internal testing by the engineer doing the coding (unit testing)
  • Internal testing by a specialist test team
  • Pre-release software, effectively being tested by a helpful user community in the real world

If a bug is not picked up by any of the above groups doing "formal testing", then it will escape into the wild, waiting for the poor unsuspecting user to stumble over it. Some problems do not become evident during testing because they require a specific set of circumstances that could not easily be created in a test. For example, the system was tested for 200 concurrent users, but a bug hits when the 201st user logs in.
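
To make that concrete, here is a hypothetical Rust sketch of how such a boundary bug might look; the function, the names and the 200-user limit are invented purely for illustration.

    // Hypothetical sketch of the "201st user" kind of bug: the limit was
    // meant to be 200 concurrent users, but the comparison is off by one,
    // so the fault only appears when user 201 arrives, a case the test
    // plan (which stopped at 200) never exercised.
    const MAX_USERS: usize = 200;

    fn can_log_in(current_users: usize) -> bool {
        // Bug: this should be `current_users < MAX_USERS`.
        current_users <= MAX_USERS
    }

    fn main() {
        assert!(can_log_in(199)); // the 200th user is admitted: fine
        assert!(can_log_in(200)); // the 201st user is also admitted: the bug
        println!("200 users already connected, yet another login was accepted");
    }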

The number of bugs in a piece of software is thought to be roughly proportional to the number of lines of code that the software is made from.

  • So when your favourite software is released as a new, bigger, better, more featureful product, you can be sure that there will be a bigger, better, more featureful collection of bugs.

Sometimes a software program is redesigned. Perhaps it has hit some kind of design limitation and there is a requirement to do more.

  • This may in effect reset the count of fixed bugs back towards zero, and users will have to learn to live with a larger number of problems than in the previous version!
  • This means that the next release is, paradoxically, worse than the current version for a period of time.

We often say that the best code is code you have already debugged:

  • The older the code, the fewer the bugs (assuming it has been properly used and maintained)
  • Reusing the same code or libraries improves code quality
  • Refactoring code, by contrast, tends to introduce new bugs or cause regressions

Learning From Experience

I would expect software to get better and improve more quickly than physical objects or systems, largely because it is in theory easy to modify and update. But this does not seem to me to be the case.

I have a keen interest in classic cars. The thing I see is that old cars can be made to be highly reliable. Many classic cars are old enough that all the wrinkles, faults and odd behaviour have been engineered out. Many groups of car enthusiasts have developed solutions to problems beyond the ability of the original car manufacturer. These solutions often extend the life, durability and reliability of the cars involved. The other point to note is that at no stage has an upgrade to a car required that I learn to drive differently, demanded a different type of road to run on or suddenly offered me completely irrelevant functionality. I've never needed to agree to an EULA (End User License Agreement) to switch on the windscreen wipers, for example.

I have a set of tools some of which belonged to my father and grandfather and they still function perfectly. Provided they are maintained, not abused and only used for the purpose for which they were intended, they should last for another 100 years. I can easily imagine handing them down to my grandson.

This doesn't work with modern computers. My oldest work computer is 2+ years old and already failing. In fact it currently has two hardware failures: the printed characters on the keyboard are wearing out and the battery is losing its ability to hold charge. These are physical problems of course, not software, but somewhere along the line cost cutting and poor testing are producing more expensive computers of declining quality.

Whilst we are not always forced to install new versions of the operating system every year, there is very clear pressure from manufacturers to constantly upgrade operating systems. This usually drives me to distraction. What usually happens is that the upgrade causes a knock-on effect, with all sorts of upgrades and changes needed to applications and configurations. Sometimes annoying software problems are fixed, but all too often many bugs remain unfixed and new ones are added, just waiting for me to discover them. Currently I have a web browser that occasionally causes some kind of resource problem, and the only way to get back on the web is to completely reboot the system. Strangely, I once saw the same sort of problem on another Unix-like system, but in that particular case the problem was quickly resolved and the operating system never exhibited the same fault again. With macOS I know I have seen the same problem across multiple versions of the operating system.

One of the odd things about computers is that they are never around long enough to fully engineer out the bugs. Currently Apple introduces a new version of macOS every year.

The thing is that this is not long enough to allow the software to settle and all the bugs in the system to be understood and fixed. How can we establish long-lived businesses based on software and systems that require constant updates and upgrades? How can we rely on software that forces change towards an initially worse solution?

The truth is that we are sold the cheapest possible software and hardware, which means we have to put up with faults because the manufacturer cannot afford to "perfect" their products.

Having said that, there are some examples of operating systems where effort has been put into producing a longer-lived product. There are versions of the Linux operating system that are designed for longevity, keeping operation a bit simpler and focussing on fixing existing problems rather than reinventing completely new software. This certainly helps. We have run examples of highly resilient code that does not suffer as much from code changes and is appreciably better to work with. On the Microsoft platform you can also get LTSC (Long-Term Servicing Channel) builds of Windows, which evolve far more slowly and provide a better base to build upon. They can be very useful if you are working in the corporate world.

Can We Fix The Bugs

Certainly many manufacturers of equipment work very hard in trying to resolve problems. The trouble is that there is no money in it. So the "cunning plan" that often emerges is to put in a cheap fix (often called a patch) that may cure or mitigate the original bug, but sadly introduces two or three other bugs. This technique means that the manufacturer does not need to properly recode the software, but instead just adds more code to work around the problem.

Over the course of supporting our own software in-house we have shown that you can eradicate bugs, at least in reasonably small, simple programs on the scale of half a million lines; it is much more difficult with big, sprawling GUI applications consisting of millions of lines of code.

Sometimes bugs are easy to fix, and sometimes you end up redesigning large areas of code. What looks like a simple software flaw can go right to the heart of how your application works, and you may need to redesign it. This shows that bugs can be fixed, but sometimes there is a high price to pay. Most of our in-house applications are quite small, so in most cases we can implement change rapidly. We can see how fixing bugs in a large code base becomes very much more challenging as the size of the code increases.

We have on occasion investigated bugs that persist and evolve across multiple releases, and we can say that "some bugs get fixed, some do not, but always some more bugs are added".

How Does Layer3 Create Software

We do it in much the same way as everyone else; however, since the very start of our operations in 1995, we have chosen to run counter to most of the world.

We build software to four ideals: Simple, Autonomous, Continuous and Self Monitoring.

We are of course in an enviable position: our software is highly focussed, typically on a single paradigm, doing just one (or two) things really well, and our customer base is minute compared to, say, Microsoft or Apple. These circumstances allow us to provide extremely reliable solutions that won't let you or us down.

Simple Software

Software developers who work on small and relatively simple programs may have several advantages over large teams working on a huge code base.

  • All the ideas of how the software works can be held inside the head of a single programmer
  • It is easy to transfer that knowledge to another programmer and thereby gain assistance in reviewing or debugging software
  • Documentation is easier to write and diagrams describing aspects of operation are easier to draw
  • Adding debug code that can be left in place and switched on when needed allows deeper levels of feedback from the system (see the sketch after this list)
  • Debugging problems can be easier if the code is simple to understand
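
As an example of the debug-code point above, here is a minimal Rust sketch of the pattern we have in mind; the LAYER3_DEBUG variable name is invented for illustration. The tracing ships with the program, costs almost nothing when disabled, and can be switched on in the field when deeper feedback is needed.

    use std::env;

    // Debug tracing that stays in the code permanently but is only active
    // when the (hypothetical) LAYER3_DEBUG environment variable is set.
    fn debug_enabled() -> bool {
        env::var("LAYER3_DEBUG").is_ok()
    }

    fn debug(msg: &str) {
        if debug_enabled() {
            eprintln!("[debug] {}", msg);
        }
    }

    fn process_record(id: u32) {
        debug(&format!("processing record {}", id));
        // ... the real work happens here ...
    }

    fn main() {
        // Run normally: silent. Run with LAYER3_DEBUG=1: extra feedback.
        for id in 1..=3 {
            process_record(id);
        }
    }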

Autonomous Software

We work hard to create software that is highly autonomous. It shouldn't need to be heavily managed or constantly honed by users, nor constantly developed with ever more elaborate configurations.

  • We create code that works toward interpreting what is going on, rather than having a fixed idea that must be adhered to
  • We handle unexpected input gracefully, carrying out a wide range of input validation to filter out nonsensical data (see the sketch below)
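
Here is a minimal Rust sketch of what we mean by filtering out nonsensical data; the temperature reading and its plausible range are invented for illustration.

    // Reject nonsensical input early, with a clear reason, rather than
    // letting it propagate deeper into the system.
    fn parse_temperature(raw: &str) -> Result<f64, String> {
        let value: f64 = raw
            .trim()
            .parse()
            .map_err(|_| format!("'{}' is not a number", raw))?;

        // A reading outside a plausible physical range is treated as bad data.
        if !(-90.0..=60.0).contains(&value) {
            return Err(format!("{} degrees C is outside the plausible range", value));
        }
        Ok(value)
    }

    fn main() {
        for raw in ["21.5", "banana", "9999"] {
            match parse_temperature(raw) {
                Ok(v) => println!("accepted reading: {} C", v),
                Err(reason) => println!("rejected input: {}", reason),
            }
        }
    }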

Continuous Software

We write code that provides high levels of continuity and is designed to keep going or to fail gracefully. This gives high levels of consistency, where we confidently know what it is going to do.

  • Continuity - Using the same code base continuously means that as we identify bugs or problems and eradicate them, the benefits last for years rather than months.
  • Fail gracefully - software should never crash out, but instead it must produce meaningful errors that are safely logged somewhere and forwarded to a support team in near real time
  • Understanding - Document code design and review how it operates between multiple members of the team to keep operational knowledge alive.

Self Monitoring Software

To err is human; to really foul things up requires a computer. We know we create bugs just like everyone else. The trick for us is to have inbuilt monitoring that provides telemetry, enabling us to detect issues long before they impact the customer.

  • We say that we "build software around logging", so error handling is not an afterthought (see the sketch after this list)
  • This allows us to provide managed services that can give the user of our software a large commercial benefit
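
To show what "building software around logging" can look like in practice, here is a minimal Rust sketch; the field names and the log format are invented for illustration. Every significant operation reports what it did and how it ended, in a structured line that a collector could forward to a support team in near real time.

    use std::time::SystemTime;

    // Emit one structured line per significant event; a log shipper can
    // pick these up and forward them to the support team.
    fn log_event(component: &str, outcome: &str, detail: &str) {
        let ts = SystemTime::now()
            .duration_since(SystemTime::UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0);
        eprintln!("ts={} component={} outcome={} detail=\"{}\"", ts, component, outcome, detail);
    }

    fn import_file(path: &str) -> Result<usize, String> {
        std::fs::read_to_string(path)
            .map(|contents| contents.lines().count())
            .map_err(|e| e.to_string())
    }

    fn main() {
        // The failure path is logged and handled; the program never just crashes out.
        match import_file("/tmp/does-not-exist.csv") {
            Ok(lines) => log_event("importer", "ok", &format!("{} lines read", lines)),
            Err(reason) => log_event("importer", "failed", &reason),
        }
    }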

What About Software Updates and Upgrades

It is obvious that the larger the application, the greater the problem of fixing software bugs. Larger applications need coordinated software updates, and the bigger the software package, the less frequently these can happen. Hence the use of frequently released patches in between.

Each new release will attempt to provide fixes and workarounds for bugs and problems. Those problems that lead to redesign will then open the door to options in the way the design can go. It is at this point that an application suddenly introduces new or different functionality.

Additional features will be added to help address specific problems, but these too can introduce unintended flaws, and you can then often find that the new or different functionality actually works against you. This can cause an entire change in the way that your business operates, introducing unexpected delays and costs and forcing process redesigns onto your business to handle the changes.

Software updates also lead to ever-increasing application sizes and what is commonly called "software bloat". This is where an application starts to take up too much space on your storage, is very slow to start, runs slower than ever and leads to user exasperation and the purchase of a new device, which of course will seem faster and better, but will soon exhibit the same problems as it goes through a similar life cycle.

Again there are some manufacturers out there who do a great job of providing upgrades, but all too often they take away a whole slew of features we need and add extra "functionality" that we simply don't want.

Workarounds

You may at some point experience a problem and for whatever reason you cannot get the manufacturer of the hardware or the writer of the software to understand there is a problem. What can you do? One of the things that our customers have found useful is the concept of a workaround.

On many occasions a missing piece of functionality can be provided using a different technique.

  • If a third-party system consistently fails and needs its software to be restarted, that restart can be automated, hopefully only as a short-term fix (see the sketch after this list).
  • If a database has no SNMP monitoring capability, we can add an external device that uses SQL to query the database and report results via an SNMP proxy.
  • If a legacy system is on failing hardware that is beyond support, we can virtualise the system and keep it going almost indefinitely.
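
As a sketch of the first workaround above, here is a minimal Rust supervisor loop that restarts a consistently failing program whenever it exits. The path /usr/local/bin/flaky-app is a placeholder; a real deployment would also log each restart and alert support so the underlying fault still gets chased.

    use std::process::Command;
    use std::thread::sleep;
    use std::time::Duration;

    fn main() {
        loop {
            // Launch the flaky third-party program and wait for it to exit.
            match Command::new("/usr/local/bin/flaky-app").status() {
                Ok(status) => eprintln!("flaky-app exited with {}, restarting shortly", status),
                Err(err) => eprintln!("could not start flaky-app: {}", err),
            }
            // Back off briefly so a crash loop does not hammer the machine.
            sleep(Duration::from_secs(30));
        }
    }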

Networking is an area where good thorough design should produce a very resilient solution that never entirely fails.

  • Multiple devices can be integrated to remove single points of failure and can sometimes provide performance benefits
  • Links can be triangulated to ensure that between any two points there is at least one good working path
  • Redundancy can be used to increase throughput and protection against failure
  • Adding telemetry to determine how much resilience is available can be invaluable protection against unexpected failure

What Is The Cost Of Bugs

At this point I could probably throw any old statistic at you. I could claim that effectively the equivalent technical manpower used to get a man to the moon is lost each week by people trying to solve, work around or otherwise resolve whatever immediate problem they may have. I believe it costs us a lot in terms of lost productivity, frustration and ultimately burn-out in dealing with these minor issues that can turn into vast problems.

I would be pleased if you would try an experiment for me. Over the next week, jot down all the incidents where you are hit by a bug or some software failure and estimate how long you spent fixing it. Please add up the total time and let me know.

What Can You Do

For the majority of us, the options are limited: we can complain on forums, write to manufacturers, post on social media and warn each other of potential problems.

In most cases, especially if you have a support contract, the suppliers of equipment will help you to resolve serious issues.

If you are running free or open source software, you can often find a solution online, although sometimes the only answer is to fix it yourself or use other software. You could also pay a developer to fix the issue for you and then submit the patch upstream for others to benefit from.

As companies running software and systems professionally, we need to get to the bottom of problems fast and get the manufacturers involved early on. This of course works up to a point, but it can take a long time for a large company to fix problems. Even then, you must be able to provide concrete examples of the problem and you may still find yourself held in a very long support ticket queue.

How Can Layer3 Help

You need a company that has worked hard for over 25 years to build some amazing and very reliable solutions. Our customers rely upon us to create designs that, not very many years ago, you could only have dreamt of. You can trust in our ability to create applications that will run non-stop for periods in excess of 2 years without failure. You can rely on our ability to create resilient clusters that will run without unplanned outages for 10 years plus, ensuring that you can confidently and consistently deliver what you do best.

You need a creative and adaptable organisation that can be relied upon to help drive the way forward. Someone able to remove doubt and uncertainty and instil confidence, take on the challenge of making your systems fit for purpose and fix unexpected problems fast by using our extensive experience and knowledge.

If you need some assistance to understand what is and isn't going on, we can provide the attention to detail and the service excellence necessary to liberate your business, allowing you to confidently deliver what you do best.

If you are suffering from poor performance in your systems or networks, or from technical problems, bugs, doubts or uncertainties, come and talk to us: we can help!

