Discussion of Control System Attributes Relevant to Personal Rapid Transit Control Requirements 

by Jeff Davis



In response to some discussions regarding control systems that may be suitable for PRT systems, the following discussion is offered.
 
First some proposed definitions:

Fault-Tolerant - The control system is designed to continue to operate normally even if some portion of it is turned off or faulted.  This is usually achieved through redundant subsystems and components.

Failsafe - The control system is designed such that single point failures will always cause the system to go to a known safe state and single point failures cannot go undetected.

Vital - The same basic principle as failsafe with the addition that certain elements are designed such that they will never (or almost never) fail in a certain or known manner.
 
The above proposed definitions are applicable to the control system, not individual components.  Note that fault tolerant systems may or may not be failsafe or vital, and vice versa.

Fault-Tolerant:
One of the basic design criteria for a fault-tolerant system is that the overall system continues to function normally in the event of a failure, or failures, of subsystems or components.  Any failures that do occur must be alarmed so that the failed portion can be removed from service while a redundant unit assumes control, and then replaced to restore the desired redundancy.  The level of redundancy is determined by the application.  Note that basic fault-tolerant design criteria do not include requirements that the software and hardware perform correctly, only that the system remain functional if subsystems or components have failed.  For example, a fault-tolerant system with an improperly designed mathematical or logical function will continue to provide incorrect or undesirable answers or outputs even in the presence of faults.
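
The failover behavior described above can be sketched roughly as follows.  This is only an illustration: the function name select_active_unit and the idea of reporting health as a list of booleans are assumptions made for the example, not part of any particular supplier's design.

```python
from typing import List, Optional, Tuple


def select_active_unit(healthy: List[bool]) -> Tuple[Optional[int], List[int]]:
    """Pick the first healthy unit and list the failed units that need an alarm.

    `healthy` is a hypothetical health report, one flag per redundant unit.
    """
    # Every failed unit is reported so it can be removed from service and replaced.
    failed = [i for i, ok in enumerate(healthy) if not ok]
    # The first healthy unit assumes control; None means no redundancy is left.
    active = next((i for i, ok in enumerate(healthy) if ok), None)
    return active, failed


if __name__ == "__main__":
    # Unit 0 has failed, so unit 1 takes over and unit 0 is alarmed.
    print(select_active_unit([False, True]))
```

Note that nothing in this sketch checks whether the selected unit computes the right answer, which is the point made above: fault tolerance keeps the system running, not correct.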
 
Failsafe:
Failsafe systems are systems designed such that if a component failure is detected the system will set its outputs into a known safe state.  For failsafe systems that are not fault tolerant this action typically stops the system from operating.  Some failsafe or safety critical systems must perform certain additional shutdown actions in a pre-defined order to avoid a catastrophic failure.
 
Since failsafe systems are designed such that single point failures will not cause an unsafe condition, it is important to detect and react to every single point failure: if a single component failure goes undetected, a second failure might then lead to an unsafe condition.  Note that it is not practical to design failsafe systems such that no combination of multiple failures can cause an unsafe condition.

For a system to be certified as failsafe we need to analyze how the system will react if a single component fails in any of its possible failure modes.  For example, transistors may fail open or shorted, resistors may open, increase or decrease in value, capacitors may open, short, increase or decrease in value, etc.  This also applies to processing, memory, and all support chips for computers.  For a system to be certified as failsafe, single bit errors, or failures of a component such as a chip on a board, cannot cause the system to go into an unsafe state, and these failures must be detected and alarmed.
 
Additionally, for computer systems it is necessary to prove that the operating system will prevent different software processes from altering the memory contents (stored values) or executable code of other processes, since doing so might lead to an incorrect result.  Failsafe computer systems also perform other safety checks such as verifying that the software programs have not been altered (e.g. performing CRC checks on all software during the start up process), version control (e.g. each computer board checks the versions of its software against another computer), and the use of watchdog timers that must be reset periodically to verify that none of the software programs is stuck in an infinite loop (and therefore may not be able to perform a safe shutdown).
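
As a rough illustration of two of these checks, the sketch below models a start-up CRC check and a watchdog timer in Python.  The names image_is_intact and Watchdog, the use of CRC-32, and the software clock are all assumptions made for the example; a real failsafe system would normally implement the watchdog in independent hardware.

```python
import time
import zlib


def image_is_intact(image: bytes, expected_crc: int) -> bool:
    """Return True only if the CRC-32 of the image matches the recorded value."""
    return (zlib.crc32(image) & 0xFFFFFFFF) == expected_crc


class Watchdog:
    """Software model of a watchdog timer that must be reset periodically."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable clock, so the model can be tested
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by a healthy control loop to restart the timeout."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True if the control loop has not reset the timer within the timeout."""
        return (self._clock() - self._last_kick) > self._timeout_s
```

In this sketch, a failed CRC check at start up or an expired watchdog would be treated as a detected failure, and the system would be driven to its known safe state.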

All software used in failsafe systems is documented and reviewed by independent reviewers, all generated code is compared against design requirements by independent reviewers, and finally the software is proven to perform as required through exhaustive testing that is thoroughly documented.
 
Testing all possible combinations of inputs to verify safety has been, and will continue to be, a debatable item given the increased complexity of the software and hardware of current systems.  For the more advanced vehicle protection systems it is not technically feasible to test all combinations to determine whether a certain combination might prove unsafe, and therefore some subset will have to suffice.  But which subset, and how large should the selected subset be?  These are the questions safety certification guys sweat over, and the answer is by no means straightforward.
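
To give a feel for the scale of the problem, the short calculation below estimates how long exhaustive testing would take.  It assumes, purely for illustration, that every input is a simple on/off signal and that tests run at a fixed rate; the function name and the figures used are hypothetical.

```python
def exhaustive_test_years(binary_inputs: int, tests_per_second: float) -> float:
    """Years needed to try every combination of the given number of on/off inputs."""
    combinations = 2 ** binary_inputs          # each input doubles the count
    seconds = combinations / tests_per_second
    return seconds / (365 * 24 * 3600)         # seconds in a 365-day year


if __name__ == "__main__":
    # 64 on/off inputs at one million tests per second.
    print(round(exhaustive_test_years(64, 1e6)))
```

Under those assumptions, 64 on/off inputs at a million tests per second would take more than half a million years, and that ignores internal state and timing, which only enlarge the space.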
 
Once a failsafe design is shown to work in the positive manner, in that the vehicles will move around, it is also tested in the negative manner by introducing single faults to verify failsafe reaction, and by inducing potentially unsafe events to verify failsafe reaction.  Testing examples include disabling a single sensor in a vital or failsafe input and checking that the protection system stops vehicle movement, or manually moving a vehicle too close behind another and verifying that the control system sets the brakes when switched to automatic.  Occasionally testing shows that the vehicle protection system would have allowed a potentially unsafe condition to occur, in which case the failed test is documented and reviewed, design revisions are performed, and the system is then re-tested.  These types of exhaustive tests are documented and submitted with the safety case documentation.
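
The positive and negative tests described above might look something like the following when applied to a very simplified protection function.  The function brakes_required, the use of None to represent a disabled sensor, and the distances are assumptions made for this sketch only.

```python
from typing import Optional


def brakes_required(gap_m: Optional[float], min_safe_gap_m: float) -> bool:
    """Return True (set the brakes) unless a valid reading shows a safe gap."""
    if gap_m is None:
        # Sensor disabled or reading invalid: fail to the safe state.
        return True
    return gap_m < min_safe_gap_m


def run_tests() -> None:
    # Positive test: with a healthy sensor and a safe gap, the vehicle may move.
    assert brakes_required(50.0, 20.0) is False
    # Negative test 1: disable the single sensor and check that movement stops.
    assert brakes_required(None, 20.0) is True
    # Negative test 2: place the vehicle too close behind another one.
    assert brakes_required(5.0, 20.0) is True


if __name__ == "__main__":
    run_tests()
```

The key design choice in the sketch is that the absence of a valid reading is treated the same as an unsafe reading.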
 
Vital:
Vital systems have the same basic criteria as failsafe systems, with the addition that components designated as 'vital' cannot reasonably be expected to fail in a certain mode within the lifetime of the system.  For example, vital relays are designed with contacts made of dissimilar materials that will never weld together, i.e. the contacts may burn open but will never weld closed.  Another type of vital relay is designed such that it can only be energized with pulsating DC current, i.e. a steady state DC current will not energize the relay.  For computer systems, timers are used for certain vital functions such that actions are performed only after the counter reaches zero.  These vital timers are designed such that all reasonable failure modes increase the time count, or multiple independent timers are used and the results compared.
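
The second timer approach, using multiple independent timers and comparing the results, can be sketched as below.  The class Countdown and the helper functions are hypothetical names for this example, and in a real vital system the two timers would run on independent hardware rather than in one program.

```python
from dataclasses import dataclass


@dataclass
class Countdown:
    """One countdown counter; the action is allowed only at zero."""
    remaining: int

    def tick(self) -> None:
        # Count down by one tick, never going below zero.
        if self.remaining > 0:
            self.remaining -= 1


def action_permitted(timer_a: Countdown, timer_b: Countdown) -> bool:
    """Permit the action only when both independent timers have reached zero."""
    return timer_a.remaining == 0 and timer_b.remaining == 0


def timers_disagree(timer_a: Countdown, timer_b: Countdown, tolerance: int = 0) -> bool:
    """Flag a fault when the two counts differ by more than the tolerance."""
    return abs(timer_a.remaining - timer_b.remaining) > tolerance
```

With this arrangement a timer that runs fast or jumps to zero cannot release the action early on its own, because the other timer must also have expired, and the disagreement itself is detectable.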

Reliable Failsafe or Vital Systems:
If the system must be both vital (or failsafe) and fault tolerant, then the design of the system becomes more complex.  A typical example of a design option to increase reliability of a failsafe or vital system might be 2oo3 (two out of three) voting, implemented such that as long as any two of the processing units declare the system safe to operate then it continues to operate.  If one and only one of the units is faulted, an alarm is sent to flag the problem, but the system continues to operate normally.  A second failure occurring before the first failure is repaired will shut the system down.  This type of scheme is offered by some critical control suppliers in an effort to increase uptime while still maintaining the level of safety desired.  The added complexity comes with increased costs.  There are other methods and options that can be used, but this presents the basic idea.
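
A minimal sketch of the 2oo3 voting logic described above follows.  The function name vote_2oo3 and the convention that a faulted unit simply fails to declare the system safe are assumptions for the example; actual products implement the voter in ways that are themselves subject to failsafe analysis.

```python
from typing import Sequence, Tuple


def vote_2oo3(safe_votes: Sequence[bool]) -> Tuple[bool, bool]:
    """Return (continue_operating, raise_alarm) for three processing units.

    Each entry is True if that unit declares the system safe to operate.
    A faulted unit is modelled here as one that does not vote safe.
    """
    if len(safe_votes) != 3:
        raise ValueError("2oo3 voting needs exactly three votes")
    agreeing = sum(1 for vote in safe_votes if vote)
    continue_operating = agreeing >= 2   # any two units are enough to keep running
    raise_alarm = agreeing < 3           # any dissenting or faulted unit is alarmed
    return continue_operating, raise_alarm


if __name__ == "__main__":
    # One faulted unit: keep operating, but raise an alarm.
    print(vote_2oo3([True, False, True]))
```

With all three units healthy the system runs without an alarm; with one faulted unit it runs with an alarm; with two faulted units it shuts down.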
 
So in summary, an extremely reliable system (fault tolerant or redundant) is not necessarily failsafe or vital, and a system certified as failsafe or vital is not necessarily extremely reliable.  Different design goals call for different design criteria.  Most suppliers of modern vehicle control systems offer fault tolerant versions of their failsafe and vital control systems in order to increase the reliability/availability of the overall transportation system.  If there are multiple vehicles in operation on the system, and there is a failure in one zone, all the vehicles in that zone will be stopped safely.
 
For further technical reading and requirements on the topic of failsafe and vital systems, the reader is referred to applicable sections in the Association of American Railroads (AAR) or the American Railway Engineering and Maintenance-of-Way Association (AREMA) standards, or equivalent European standards.
 
There are other standards that can be applied to control systems that provide equivalent levels of safety such as those used for amusement rides, oil/gas refineries, and chemical processing plants. These systems are classified as critical control systems, and a review of the design of these systems will show that they could also be certified as failsafe or possibly vital since the same basic system design rules apply.



Last modified: August 29, 2012