Critical Systems: Three serious errors in the history of critical systems.

Mobile devices, smart domestic appliances, autonomous vehicles, energy production systems… Technology advances every day, and the complexity of the software required to operate and control it grows with it.

Major infrastructure and even human lives depend on the correct functioning of the software that controls these technologies. Systems of this kind are known as critical systems or safety-critical systems, and ensuring that they operate correctly is essential to preventing accidents.

Although the development of these systems is highly oriented towards the final quality of the product, this does not make them 100% fail-safe. In the case of critical systems, an error can have catastrophic consequences, as we want to show in this article.

Therac-25 and its excess radiation.

This is one of the most talked-about case studies. Therac-25 was a radiotherapy machine for cancer treatment used in the 1980s. Using a small built-in computer, it generated and delivered radiation to destroy tumors.

The machine had two operating modes: the first, already used in previous models, was a low-power electron beam; the second, a new development, was a much more powerful x-ray beam. The machine also had an energy diffusion system employed as a means of screening before the radiation reached the patient.

What happened? Firstly, due to a race condition, the system allowed the high-power beam to be used without the diffusion system in place, leaving the patient fully exposed to the high-energy beam.
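A race condition of this kind can be illustrated with a small, deterministic Python sketch. It does not reproduce the Therac-25 code: the `MachineState` class, the mode names "low" and "high", and the two task functions are hypothetical, and the two orderings are written out by hand rather than produced by real concurrent tasks.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    # Hypothetical model: "low" or "high" beam, plus the diffuser position.
    beam_mode: str = "low"
    diffuser_in_place: bool = False


def configure(state: MachineState, requested_mode: str) -> None:
    """Task A: position the diffuser for the mode requested at this moment."""
    state.diffuser_in_place = (requested_mode == "high")


def operator_edit(state: MachineState, new_mode: str) -> None:
    """Task B: the operator changes the mode; the diffuser is not re-positioned."""
    state.beam_mode = new_mode


def is_safe(state: MachineState) -> bool:
    """The high-power beam is only safe with the diffuser in place."""
    return state.beam_mode != "high" or state.diffuser_in_place


# Ordering 1: the edit arrives before the hardware is configured.
s1 = MachineState()
operator_edit(s1, "high")
configure(s1, s1.beam_mode)

# Ordering 2: the edit arrives after the configuration has read the mode.
s2 = MachineState()
configure(s2, s2.beam_mode)   # diffuser positioned for "low"
operator_edit(s2, "high")     # mode changes afterwards, unnoticed

print(is_safe(s1), is_safe(s2))
```

The same two operations give a safe state in one order and an unsafe state in the other, which is why defects of this type are hard to find with conventional testing.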

The second problem appeared in the control software. It used a one-byte (8-bit) variable as a flag to signal that a safety check on the machine’s setup was still pending, and it incremented this variable on every pass through the setup routine instead of setting it to a fixed value. Each time the variable rolled over from 255 to 0, the software read the zero as “nothing to check” and skipped the verification, so the machine could operate with incorrect parameters for the patient. Together, these defects caused radiation overdoses that killed at least three people.
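The wrap-around from 255 to 0 can be reproduced in a few lines. The following Python sketch is only an illustration, not the Therac-25 code: the names `increment_u8` and `check_required`, and the convention that a zero value means "no check needed", are assumptions made for this example.

```python
def increment_u8(value: int) -> int:
    """Increment a value the way an unsigned 8-bit register would."""
    return (value + 1) & 0xFF  # 255 + 1 wraps around to 0


def check_required(flag: int) -> bool:
    # Hypothetical convention: any non-zero value means "run the safety check".
    return flag != 0


flag = 0
skipped_on_pass = []
for pass_number in range(1, 513):
    flag = increment_u8(flag)
    if not check_required(flag):
        # The counter has wrapped to 0, so the check is silently skipped.
        skipped_on_pass.append(pass_number)

print(skipped_on_pass)
```

Setting the flag to a fixed non-zero value, rather than incrementing it, would avoid the wrap-around altogether.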

Why did this happen? Further investigation showed that the software development process was inadequate, so that the software could not be properly tested. In addition, no independent review of the critical software was conducted.

Ariane 5 and its navigation systems.

Another high-profile case is that of the Ariane 5 launch vehicle and its maiden flight in 1996, known as Flight 501. The Ariane 5 was designed to replace the Ariane 4, boasting greater payload capacity. It went on to become the European Space Agency’s standard heavy launch vehicle, completing 117 launches before its retirement in 2023.

Despite its later success, its inaugural launch was a catastrophe. Only 37 seconds after takeoff, its two navigation systems failed. The two systems ran on separate computers but used the same software, so the same defect knocked out both. The result was the loss of both the vehicle and its cargo, at a cost of $370 million.

The cause of the error was the assumption that the inertial navigation software of its predecessor, Ariane 4, would also be valid for Ariane 5. The new model reached a much higher horizontal velocity than its predecessor. This value was held as a 64-bit floating-point number and then converted by the software to a 16-bit signed integer. The higher values no longer fit in 16 bits, so the conversion overflowed.
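The conversion problem is easy to reproduce. The sketch below is written in Python and is not the flight software: the function names and the two sample values (20000.0 and 40000.0) are invented for illustration, chosen only so that one fits in a 16-bit signed integer and the other does not.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Alternative: clamp out-of-range values to the nearest representable one."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


old_profile_value = 20000.0   # hypothetical value within the 16-bit range
new_profile_value = 40000.0   # hypothetical value beyond the 16-bit range

print(to_int16_checked(old_profile_value))
try:
    to_int16_checked(new_profile_value)
except OverflowError as exc:
    print("conversion failed:", exc)

print(to_int16_saturating(new_profile_value))
```

Whether to raise an error or clamp the value is a design decision that depends on the system; what matters is that the out-of-range case is handled deliberately rather than left to a range assumption inherited from an earlier project.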

In addition, the software that caused the error performed an alignment function that served no purpose after liftoff, so it should not even have been running during launch.

Therefore, caution must be exercised when reusing software from previous projects. It must always be verified that it will work properly in the new environment, which is not always easy, as we have seen in the cases above. In addition, a careful assessment should be made of whether there are any reasons to execute software that is unnecessary, since it can be a source of problems.

The Patriot missile system and the desynchronization of its clocks.

The third and final case study in this post is the Patriot missile system. It was originally created to shoot down enemy aircraft and was adapted during the Gulf War to shoot down Scud missiles. Everything seemed to be running smoothly until a catastrophe occurred.

On February 25, 1991, an Iraqi Scud missile hit a barracks housing American soldiers in Dhahran, Saudi Arabia, after a Patriot battery failed to intercept it because of a system failure. As a result, 28 soldiers lost their lives and roughly a hundred others were wounded.

The system failed because of a drift in its internal clock that grew the longer the system remained in operation. In fact, two weeks before the incident it had already been reported that the longer the system was running, the less accurate it became. The only solution provided was to reboot in order to reset the operating time.

One of the modifications made to the system to apply it to Scud missile defense was aimed at controlling timing measurements more precisely, but this modification was not applied to all parts of the software. As a consequence, time measurements were made accurately on one side but were compared with other not-so-accurate measurements made by the non-updated part of the system, resulting in a discrepancy.

The Patriot system’s clock counted time in tenths of a second as an integer, and the software converted that count to seconds by multiplying it by 0.1. The problem: 0.1 has no exact finite representation in binary, so the value held in the system’s 24-bit register was slightly too small. Therefore, every conversion from clock ticks to seconds carried a small truncation error that grew in proportion to the time elapsed.

At the time of the incident, the system had been operating for about 100 hours without interruption, so the accumulated time lag had reached about 1/3 of a second, which translated into a tracking error of about 600 meters.
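The arithmetic behind these figures can be checked with a short Python sketch. Two values in it are assumptions rather than facts taken from this article: that 0.1 was chopped to 23 binary fractional digits inside the register, and that a Scud travels at roughly 1,676 meters per second. The function names are hypothetical.

```python
from fractions import Fraction
import math

FRACTION_BITS = 23          # assumption: fractional bits kept for 0.1
TICKS_PER_SECOND = 10       # the clock counts tenths of a second
SCUD_SPEED_M_PER_S = 1676   # assumption: approximate Scud speed


def truncated_tenth(bits: int) -> Fraction:
    """Return 1/10 chopped (not rounded) to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction(math.floor(Fraction(1, 10) * scale), scale)


def clock_drift_seconds(hours: float, bits: int = FRACTION_BITS) -> float:
    """Accumulated error after `hours` of ticks converted with the chopped 0.1."""
    ticks = int(hours * 3600 * TICKS_PER_SECOND)
    error_per_tick = Fraction(1, 10) - truncated_tenth(bits)
    return float(ticks * error_per_tick)


drift = clock_drift_seconds(100)
print(f"drift after 100 h: {drift:.4f} s")
print(f"tracking error: {drift * SCUD_SPEED_M_PER_S:.0f} m")
```

Under these assumptions the drift comes out at roughly a third of a second and the tracking error at several hundred meters, the same order of magnitude as the figures quoted above. The error per tick is under a ten-millionth of a second, which is why it only became dangerous after many hours of continuous operation.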

The system worked with different data types, which can be problematic. When monitoring any type of measurement it is important to be consistent with units and data types so that accuracy, rounding, data conversion, etc. are not affected.

Conclusion

Even critical systems, developed to the highest quality standards, are not exempt from errors, which, if undetected, can have serious consequences. Software testing is necessary to detect and correct these problems before the product is released. At Centum we can help you with all these quality procedures, improving software performance and the user experience. If you would like more information, please do not hesitate to contact us.

Article property of CENTUM Solutions, S.L.
