Just a few days after Christmas last year, AirAsia Flight 8501, en route to Singapore, tragically plummeted into the sea. Indonesia has completed its investigation of the crash and just released the final report. Media coverage, especially in Asia, has been extensive. The stories are headlined by pilot error but, as technologists, there are lessons to be learned deeper in the report.
The Airbus A320 is a fly-by-wire system, meaning there are no mechanical linkages between the pilots and the control surfaces. Everything is electronic, and much of a flight is under automatic control. Unfortunately, this also means pilots don’t spend much time actually flying the plane, possibly less than a minute, according to one report.
Here’s the scenario laid out by the Indonesian report: a rudder travel limiter computer system raised an alarm four times. The pilots cleared the alarms following normal procedures. After the fifth alarm, the plane rolled beyond 45 degrees, climbed rapidly, stalled, and fell.
Pilot Error?
The media headlines focus on the latter steps in the failure chain, in part because the pilots were never trained to deal with the type of upset that occurred. It wasn’t just AirAsia that omitted this training on the A320. All airlines did, because Airbus, the aircraft manufacturer, did not expect the aircraft to ever experience such an extreme upset. Note that France, as the home country of Airbus, participated in the investigation.
As technologists we need to look further. The technical root cause was cracked solder joints on circuit boards in the rudder travel limiter system, which restricts the amount of rudder movement at high speeds. A key point is that this same system failed 23 times in 2014. The fault was considered minor and never properly fixed.
As in so many cases, the failure chain is a cascade of human failures to respond correctly to a technical fault. Little discussed in most reports is how the pilots attempted to fix the fifth rudder control fault. They followed normal procedures for the first faults, but the last time they opened and reset a circuit breaker while in flight. Somehow that caused the autothrust and autopilot to disconnect, and they were never restored. This put the pilots solely in control of the plane through the fly-by-wire system.
Tragic Sequence of Events
To summarize, here are the three key failures:
Bad solder joint,
Cycling the circuit breaker,
Inadequate recovery training.
We’ll disregard the mistake of not properly troubleshooting the board. That is a human failure but also a larger policy issue for AirAsia and not directly technical.
Bad solder joints occur despite our best efforts to avoid them in manufacturing. Diagnosing an intermittent joint failure can be a nightmare, so we can sympathize with the aircraft maintainers. How should we deal with intermittent failures in critical systems? Clearly the system was checking its own integrity, because it kept issuing warnings throughout 2014. Is it possible to have a system refuse to function after a certain number of failures? I’d suggest that after, say, six faults it could escalate, for example by refusing to boot when powered on in a safe environment (i.e. parked on the ground). In essence the system says, “I know I’m bad, now fix me.”
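Here’s a minimal sketch of that idea in C, purely for illustration: the persistent fault counter, the six-fault threshold, and the on_ground() check are all assumptions of mine, not anything taken from the actual avionics.

```c
#include <stdbool.h>
#include <stdio.h>

#define FAULT_LOCKOUT_THRESHOLD 6  /* hypothetical limit, not from the report */

/* In a real system these would read non-volatile storage and a
 * weight-on-wheels sensor; here they are stubbed for illustration. */
static unsigned read_stored_fault_count(void) { return 7; }
static bool on_ground(void) { return true; }

/* Called at power-up: refuse to start if too many faults have accumulated
 * and we are in a safe place to demand maintenance. */
static bool self_check_allows_boot(void)
{
    unsigned faults = read_stored_fault_count();

    if (faults >= FAULT_LOCKOUT_THRESHOLD && on_ground()) {
        printf("Fault count %u exceeds limit: I know I'm bad, now fix me.\n", faults);
        return false;  /* stay offline until maintenance clears the counter */
    }
    return true;  /* in flight, keep running and only warn */
}

int main(void)
{
    if (!self_check_allows_boot())
        return 1;
    puts("System operational.");
    return 0;
}
```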
Aircraft Circuit Breaker
Why did the pilots mess with the circuit breaker? One report says the pilot had seen a maintenance worker cycle a circuit breaker to clear a fault. That’s fine on the ground but not in the air. Why would a pilot try this, especially since there are advisories telling pilots not to reset circuit breakers unless the system is flight critical? The control system here is a safety feature, but not flight critical, so why not just leave it off?
People in general get overly comfortable with technology because it is everywhere. There are all kinds of jokes about non-technical relatives doing something crazy to a computer because the same action fixed something else.
Unfortunately, this often means people don’t know what they don’t know. In this case, the pilots appeared not to know that cycling that breaker would disrupt other systems. Yes, it sounds odd that it would, and I can’t explain it because I don’t know why it happens. If true, it appears to be a systemic problem that should be addressed. In our own work, we need to make sure that failures in one part of a system do not upset critical parts elsewhere.
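As a rough illustration of that principle, here is a hedged C sketch with hypothetical subsystem flags standing in for the real avionics: the point is simply that any reset path should record and restore the state of everything it can disturb, rather than silently leaving other systems disconnected.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical subsystem states, for illustration only. */
static bool autopilot_engaged  = true;
static bool autothrust_engaged = true;

/* Resetting the (non-critical) limiter may knock out other systems as a
 * side effect, much as apparently happened on the A320. */
static void reset_rudder_limiter(void)
{
    puts("Rudder limiter reset.");
    autopilot_engaged  = false;   /* unwanted side effect */
    autothrust_engaged = false;
}

/* The design lesson: capture dependent state before the reset and
 * restore it afterwards, so one subsystem's fault handling cannot
 * quietly degrade the rest of the system. */
static void reset_rudder_limiter_safely(void)
{
    bool ap_was_on = autopilot_engaged;
    bool at_was_on = autothrust_engaged;

    reset_rudder_limiter();

    if (ap_was_on && !autopilot_engaged) {
        autopilot_engaged = true;
        puts("Autopilot re-engaged.");
    }
    if (at_was_on && !autothrust_engaged) {
        autothrust_engaged = true;
        puts("Autothrust re-engaged.");
    }
}

int main(void)
{
    reset_rudder_limiter_safely();
    return 0;
}
```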
The pilots weren’t trained to deal with the flight upset because even Airbus, the aircraft manufacturer, did not expect the aircraft to ever experience such an extreme upset. I guess since Murphy isn’t French they don’t expect his effects to occur there. This assumption probably derived from the aircraft being fly-by-wire: the expectation was that the aircraft would not let itself become upset to this degree. But the automatic flight systems were disrupted by the cycling of the circuit breaker.
Wrap Up
Failures in complex systems take a lot of effort to trace. In this case we see how three separate actions caused the failure, with a fourth, the maintenance lapse, contributing greatly. That means the overall failure might have been avoided at several points:
If the solder joints had not failed,
If the pilots had not cycled the circuit breaker,
If the pilots had restored the automatic flight computers,
If the pilots had responded properly after the upset.
Even as hackers we need to consider when and how failures can occur. We’ve written articles about electronic door locks built by hackers. How do you get in if the power goes out, or a bad solder joint fails after a few hundred openings and closings of the door? Hopefully a physical key can bypass the electronics. Fortunately, many of the hacks we see are not critical, so failures will not be life threatening. Let’s keep it that way.