ncmncm 5 years ago

This is a very good analysis, but fatally incomplete.

One really essential reason those planes crashed was that each time MCAS triggered, it acted as if it were the first time. If it added one degree of trim last time, it adds a second this time, a third the next, up to the five degrees that run the trim all the way to the stops.

A second reason is that, under the design still on file at the FAA, it could only add a maximum of 0.8 degrees each time. This was raised to 2.4 degrees after testing, so just two activations could, in principle, put you almost at the stops.

A third was that the only way to override MCAS was to cut power to the motor that drove the trim. But above 400 knots, the strength needed to dial the trim back with the hand crank was more than actual, live pilots have, especially when it is taking all their strength just to pull back on the yoke.

A fourth was that, with two flight control computers, the pilot could (partly) turn off a misbehaving one, but there was no way to turn on the other one. You have to land first to switch over, even though the other computer is doing all the work needed to be ready to fly the plane.

A fifth was that it ignored the fact that the pilots were desperately pulling back on the yoke, which could have been a clue that it was doing the wrong thing.

A sixth was that, besides comparing redundant sensors, it could have checked what the other flight computer thought it should be doing.
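The compounding behavior in the first point is easy to sketch. A minimal Python illustration, with hypothetical numbers for the per-activation increment and the trim stops:

```python
# Hypothetical numbers: an MCAS-like routine that keeps no memory of its
# previous activations compounds nose-down trim on every re-trigger.
TRIM_STOP_DEG = 5.0   # assumed travel limit (the "stops")
INCREMENT_DEG = 2.5   # assumed per-activation authority

def retrigger(trim_deg, activations):
    """Each activation adds a full increment, as if it were the first."""
    for _ in range(activations):
        trim_deg = min(trim_deg + INCREMENT_DEG, TRIM_STOP_DEG)
    return trim_deg

# Starting from neutral trim, just two re-triggers reach the stops.
print(retrigger(0.0, 2))  # 5.0
```

With these assumed numbers, two activations from neutral already run the trim to the stops; a system that remembered its earlier inputs would not keep adding.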

  • kakwa_ 5 years ago

    This analysis is completely right, but in my opinion, focuses too much on the technical aspects.

    Is MCAS a hack? Yes. Is it fixable? Yes. Will the 737 MAX continue to fly for two to three decades after all the items above have been addressed? Yes.

    But from an engineering perspective, adding a system to "fix" another system always feels a bit off. Sometimes it's unavoidable (e.g. cooling), but when it is avoidable, something is at least a bit wrong. A few hacks like that are manageable, but with too many, you dramatically increase the chances of one of them misbehaving.

    And if an organization pushes hard for this kind of hack, as Boeing did, the issue is not even technical anymore.

    The story of MCAS reminds me a little of the MD-11. As a tri-jet, the DC-10 could not really compete with the new twin-engine airplanes of the late '80s/early '90s in terms of fuel consumption, but McDonnell Douglas tried anyway. They optimized the wings, changed the engines, added winglets and, more significantly, reduced the horizontal stabilizer size. This made the MD-11 quite hard to land, as it needed to come in at a very high speed for a wide-body jet. It was a contributing factor in several accidents (FedEx 14, FedEx 80, Lufthansa Cargo 8460), and pilot training and technical fixes never fully compensated for the design flaw. In the end, the aircraft also largely failed to reach its fuel consumption target. However, it's still flying today, and it's still a workhorse for cargo companies.

    • ncmncm 5 years ago

      Ultimately it was not a technical failure: the failure was in allowing the plane to be sold with such a thoroughly bad design. Thus, a management and regulatory failure. Regulatory, because the FAA signed off on it, obviously without applying any of the process that would have prevented it. Management, because the cost of this debacle will be many, many times what they saved by trying to skate by with a faulty design.

      • reacweb 5 years ago

        I think the blame should fall entirely on the management of Boeing. The engineering details mainly amount to a case against management.

      • robertAngst 5 years ago

        You watched the Vox video and are reposting the summary here?

        They are going to fix the software with no changes to the design.

        They don't need to change the design; that would be expensive overkill. Software updates are as close as we can get to free. No company or human has infinite money to spend on everything; we can't always have the fantasy.

        • kakwa_ 5 years ago

          The MAX can be fixed by addressing the flaws in MCAS, and in the end it will be. Production will resume, the MAX will be slightly less of a commercial success than it should have been, Boeing will be fined a few billion for its failings, the FAA will have to take a hard look at itself, and everything will be OK.

          I just hope the right lessons will be learned from these crashes:

          * have a truly independent certification process

          * don't fix your physical design issues with software

          * don't write and maintain software running on an airplane like software from the start-up world: significant design flaws are not tolerable in software running on an airplane, even if they can be fixed easily.

          * don't hide new systems; if there is a new system, then pilots should be trained on it

          By the way, it's not the first time a new system has caught pilots by surprise because they didn't know about it. SAS Flight 751 is another example: ice ingestion damaged the engines, and the pilots reduced thrust to reduce the stress, but an automatic system (ATR, Automatic Thrust Restoration), unknown to the pilots, pushed the thrust back to full power, disintegrating the engines. The plane fortunately crash-landed without fatalities, thanks to the pilots' skill and a bit of luck.

          * Boeing should really think about replacing the 737; the old design imposes too many constraints, preventing a clean design.

          • blackflame7000 5 years ago

            It's not very easy to redesign the 737. Bigger engines need more ground clearance. More clearance requires longer landing gear. Longer landing gear requires a wider fuselage to fold up into. A wider plane is less fuel efficient and needs bigger engines, and suddenly you are a 757, 767, 787, or 777.

            • pm24601 5 years ago

              > Its not very easy to redesign the 737.

              Too bad. I know I am now permanently reluctant to fly 737s

              • blackflame7000 5 years ago

                You really shouldn't be. There is a lot of misconception going around that the 737 is inherently unstable. This is not true. The 737-600/700/800/900 was one of the safest designs ever built (http://www.airsafe.com/events/models/rate_mod.htm). Yet of the few crashes it had, a surprising number were caused by pilot disorientation leading them to accidentally stall the plane.

                MCAS is not a feature to correct an unstable aircraft; it's there to correct a confused pilot. MCAS is triggered when the following circumstances are right: 1) the airspeed is near stall speed (takeoffs and landings), and 2) the AoA (as reported by both sensors, now) is greater than the aircraft can safely sustain.

                The problem was that 1) the AoA data was faulty, 2) the system did not check for a disagreement with the backup sensor, 3) it failed to reset each time it took corrective measures, so the corrections compounded, and 4) the corrective action was three times larger than approved.

                All of these things have been fixed, but it also shows how many things have to go wrong for something to be catastrophic. And the MAX has been built on a solid foundation, which is much better than having to start from scratch.

      • Cacti 5 years ago

        The airlines knew what they were doing when they didn’t pay for the “upgrade” and they doubly knew when they didn’t pay for it after the first crash. Airline management is just as culpable in this debacle.

        • InclinedPlane 5 years ago

          This is bonkers. Nobody anywhere in the world should expect that the airplane they buy is unsafe by design. Also, the idea that there are two versions of a plane, one that kills people and one that doesn't, and that if you buy the wrong one it's completely your fault, is just insane.

          • Cacti 5 years ago

            Frankly, you have no idea what you're talking about. This is how the airline industry works. Airlines WANT cheap planes and they ARE WILLING to and DO skimp on some safety features to save money. Airlines WANT safety features, at least some of them, to be optional so that they aren't forced to spend money on them if they don't want them.

            For example, backup fire extinguishers in the cockpit? Optional. Extra oxygen masks? Optional. Advanced radar? Optional. There are hundreds of items like this on every model, whether it's Boeing or Airbus or someone else, because their customers WANT them to be optional.

            And by the way, not all airlines do this. American paid for the upgrades. Southwest paid for the upgrades.

            So maybe you should ask yourself why it is that Lion Air and Ethiopian Air were willing to spend huge amounts of money on a new jet, then spend some $1-2 million on optional upgrades, and yet not include the MCAS upgrade in those options (and they were made well aware of it, as you can see from other airlines' purchases of these upgrades). Or, for that matter, maybe you should ask why Lion Air knew the plane was having trouble with the AoA sensors for several flights prior to the crash, yet did not perform the required maintenance on them (a serious violation) and did not pass the information along to its next crews (also a serious violation), who were caught completely unaware.

            Airlines run at 2-3% margins, and they do so by trimming costs in every possible area. Sometimes those areas are safety related. Airlines are not some doe-eyed, naive little pups who just trust whatever Big Boeing/Airbus tells them. They know how the planes work, they have their own pilots, they have their own engineering teams, they have their own specialists and experts, and they make decisions to sacrifice safety for savings, and they do it ALL THE TIME.

            • InclinedPlane 5 years ago

              There's plenty I could say in answer to this but most of it is unnecessary.

              There is no excuse for a company such as Boeing, working under the regulation of the FAA, to produce any version of any model of plane which is fundamentally unsafe to fly. Period. None.

              We're done here.

              • Cacti 5 years ago

                Life must be great in black and white.

        • ReGenGen 5 years ago

          The $80K upgrade was simply an AoA-disagree warning light and had zero effect on MCAS behavior. There is no evidence that airlines were told about MCAS until the Lion Air accident.

        • ncmncm 5 years ago

          There is more than enough blame to go around. E.g., Congress, for the last N decades, failing to fund the FAA at the levels clearly needed.

          • schraeds 5 years ago

            Beyond funding there is the issue of regulatory capture. The FAA is too cozy with the plane manufacturers, allowing them to help test, certify, and document their own products.

            The biggest overarching change I would like to see is an end to letting manufacturers extend the type rating of airframes indefinitely, designing to avoid costly recertification and training rather than truly focusing on safety and innovation.

        • kadendogthing 5 years ago

          Yeah, they willingly bought these planes. Airlines should be held equally responsible.

          • bronson 5 years ago

            You’re saying that Boeing intentionally sold unsafe airplanes? And the airlines, knowing this, still bought them?

            • kadendogthing 5 years ago

              I'm giving money to the airlines. Not Boeing.

              >You’re saying that Boeing intentionally sold unsafe airplanes?

              Yes.

              >And the airlines, knowing this

              You know airlines have their own engineering teams, right? And the planes were bought explicitly to save money.

    • jussij 5 years ago

      > Will the 737 MAX continue to fly for two to three decades after all the items above have been addressed? yes.

      There is no doubt the MCAS can be fixed.

      But I would say that, with all the bad press the 737 MAX has received, there must be some doubt as to whether it will fly for decades to come.

      I would say the flying public will need some convincing before they consider the plane safe.

      > However, it's still flying today, and it's still a workhorse for cargo companies.

      One of the reasons the DC10 is used for cargo is that, very early on, the DC10 faced its own 'bad design' issues, which resulted in several fatal crashes.

      A fault in the DC10's cargo door meant it sometimes did not close properly, which could then result in an explosive decompression.

      That fault and those early crashes greatly helped the 747 win the race to be the dominant wide body passenger jet of that time.

    • throw7 5 years ago

      I'm confused with your response. So nothing to see here? Nothing to learn and change? You seem to say we should accept design flaws (like we did with the MD11?)... correct? Or?

      • kakwa_ 5 years ago

        Basically, yes. The airlines learned to mitigate the risks associated with the design, and there were a few small modifications (a light indicator to detect bounced landings, for example). There is a risk, but it's at an acceptable level.

        The fact is, an airplane is a huge investment, made to last 20, 30, and in some cases even 40 years (not necessarily with the same owner). You cannot exactly throw it away and buy a new one just because it has defects. At most, it ends up lasting a bit less (e.g. 15 years instead of 20) or is relegated to a specific usage where the consequences of failure are less dramatic (e.g. freight).

        There are still 20-25 year old airplanes flying with passengers, which are inherently less safe than (properly designed) new airplanes because of their age and their avionics. Yet they are still flying.

    • Jonnax 5 years ago

      Will it fly? I think we live in a different world now. The 737 MAX via social media has a perception of being a "Boeing death plane"

      Public opinion could ground it.

      • LandR 5 years ago

        From the people I've spoken to about this, I'd say less than half even know that the 737Max is a thing, never mind that it's crashed twice.

        • jrockway 5 years ago

          That is very interesting, as it's been front-page news regularly in pretty much every newspaper (OK, I only read the NYT and WSJ, but it's quite the topic of discussion in those two papers).

          Every time there is news my entire office lights up with discussion about it. Then again, I work for a company called "Pilot" so perhaps people are more interested than average. It's not an aviation company though ;)

      • jrockway 5 years ago

        The DC-10 did fine after its rough start. It's still flying today.

        • kakwa_ 5 years ago

          That's also the interesting part.

          The DC-10 ended up being a reliable airplane, but its early crashes damaged its reputation heavily, and MD's handling of those issues was poor.

          Commercially, MD never completely recovered and was absorbed by Boeing in the '90s. (To be honest, it's not the only factor; there was also competition from the L1011, and the fact that a trijet was somewhat of an evolutionary dead end.)

          Boeing is much larger and much stronger; it can probably cope with this, but it will be a hit on their best-selling aircraft. It's basically the 737 that finances the new 777 or 787, costly programs not certain to recoup their design costs (same with Airbus: the A320 is basically financing the A380 failure). At the same time, part of the 737 market is a bit captive, with the biggest low-cost companies (Southwest, Ryanair) using it and unlikely to switch.

    • Cacti 5 years ago

      There are thousands of “hacks” like this in every modern airliner. That's how complicated problems are solved: you come up with a basic idea, and you iterate on it thousands of times until you squash all the edge cases.

      • antris 5 years ago

        You don't cover edge cases mainly by writing specific code for everything and piling onto the existing load. You do it mainly by creating a single elegant solution that covers all cases. Every line of code and every technical layer you add makes you susceptible to even more bugs and edge cases.

        And just because the airline industry does software backwards doesn't mean that you should too.

      • tigershark 5 years ago

        Or, you know, you don’t put engines way too big for your frame.

      • ulfw 5 years ago

        There really aren't. Please don't state opinion as fact. There aren't "thousands" of "hacks" to make an airframe fly reliably. That is preposterous.

        • airbreather 5 years ago

          OK, I am going to comment, as a certified and practising functional safety engineer with a TUV number and experience designing and building industrial systems to IEC61508 (functional safety parent standard) and IEC61511 (process industries).

          Functional safety is the engineering discipline concerned with designing instrumented machinery safeguarding systems that protect humans from harm, to a deterministic safety performance level.

          E.g. just the right amount of safety, so that all the safety money is spent in the right places and in the right amounts, reducing risk across the board to the required level without over-investing in one area and neglecting another (or so the dream goes; in practice it is a moving target based on a lot of guesses, and you hope the swings and roundabouts balance out more or less).

          Aeronautics design is one of the closely aligned fields within this overall discipline. The closest thing I have worked on in terms of risk/consequences is mine winders: a safety failure can kill 10-100 people in one go, and they ride multiple times every day.

          Right now, this very minute, before needing a break at 1am to browse Hacker News, I was trying to wade through a mess of a fault tree analysis that my current project owner's "specialised" consultant has produced for the systems I currently need to instrument for safety.

          Most people in general, but especially Americans who live primarily with prescriptive standards, struggle to come to grips with the nature of performance based safety standards. There is no "do it like this and you have met code and have no problems" - you have to analyse and build everything up from scratch.

          It is all about layers, layers of risk reduction that eventually (whether by perception or reality) get the risk down to an acceptable level. So there are kludgey little things that get stuck on as hacks to address this issue or that, not uncommonly pet issues of someone on the review panel. Repeat this several hundred or thousand times and any hope of a uniformly elegant and simplified solution is pretty slim.

          The general reliance is on redundancy and independence, e.g. layers of protection, "defense in depth", or as it is more commonly known, "the Swiss cheese model": you get a bunch of slices of Swiss cheese, and when the holes line up to allow a path through, that is when an accident can occur. More layers, fewer chances (also smaller holes, but that is another story again).
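          The "more layers, fewer chances" intuition can be sketched numerically. A minimal illustration, under the idealized assumption that the layers fail independently:

```python
# Idealized sketch: if protection layers fail independently, the probability
# that an initiating event passes through every layer is the product of the
# per-layer failure probabilities -- more layers, fewer chances.
def breach_probability(layer_failure_probs):
    p = 1.0
    for q in layer_failure_probs:
        p *= q
    return p

print(breach_probability([0.1, 0.1]))       # two layers: about 1 in 100
print(breach_probability([0.1, 0.1, 0.1]))  # a third layer: about 1 in 1000
```

          Real layers are rarely fully independent (common-cause failures are exactly the holes lining up), which is why the independence assumption itself gets scrutinized in these analyses.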

          And, as almost always, the machines are actually the easy part most of the time. It is the humans who design, build, test, maintain, and certify the machines that are the weak point, over and over again. Add to that the creative ways humans can get around the systems in place to protect them, get their job done when the system is telling them to stop, or do a maintenance task a new "better way" despite a manual that might have cost over $100k of engineering time to write and approve telling them to do it a specific way, etc. etc. etc.

          90% of the time, overly conservative thinking creeps into the risk analysis (we might get hit by a meteorite; happened to my cousin once), which can layer complexity, uncertainty, and poor availability onto a solution.

          10% of the time there is the wishful thinking of "it will never happen because I have never seen it or heard of it", which allows the unexpected and unusual (black swans, often, if you will) to sneak through, at least the first time. Endless discussions occur about "credible scenarios"; sometimes the "discussion" is won by the dominant personality in the room, who might also be doing some of the pay reviews next month.

          It is incredibly difficult to be the person who has to herd the flock of cats that represents all the stakeholders in a hazard review and risk assessment. These workshops sometimes run for months, in extreme cases for years on and off, considering every system, subsystem, part, action, event, procedure, and so on, and all the possibilities, how they can go wrong, and what might mitigate failures and events, and on and on.

          I could write about this all night, but I guarantee you that any magical opinion or assumption you might have about graceful and elegant solutions to difficult and dangerous problems being the norm is unrealistic: there is always a consensus or committee to satisfy, often top-heavy with people who may have to own or operate the machine in question but have never designed anything in their lives. You fight for the things you know matter and concede some of the crap, hoping subsequent reviews will see it as pointless or not credible.

          All of this is the reason that grandfathering is so attractive. Applying the current internationally recognized performance-based safety standards from scratch to something as complex as a plane that can kill hundreds of people in one go is an incredibly difficult task, and from a business perspective it is fraught with immense dangers of totally unpredictable outcomes impacting budget, schedule, and even viability.

          This is a highly specialised field with what are often counter-intuitive outcomes (otherwise you would just let John out the back room design the whole plane from scratch, because "he knows what he is doing").

          While I am aghast at some of the information about design decisions that is emerging, none of it surprises me in the least. I can see directly how a number of them may effectively have resulted from the path of least resistance when a product had to be produced.

          I like flying older planes in general, as long as the airline does reasonable maintenance. The unexpected has often already been detected and corrected, and the chances of latent faults turning up decrease with hours in service. Plus, I always remind my fearful daughter that the taxi ride to the airport is more dangerous, by the numbers.

        • avh02 5 years ago

          Please don't state opinion as fact.

  • m463 5 years ago

    I don't know, I think the 3 points in the article make it glaringly obvious that the root cause is NOT engineering.

    The decisions made clearly ignored engineering and historical precedent at every turn.

    It's sad because Boeing has had some wonderful engineers, and Boeing aircraft have traditionally allowed the pilots to have the final say.

    • 1v1id 5 years ago

      > So Boeing produced a dynamically unstable airframe, the 737 Max. That is big strike No. 1. Boeing then tried to mask the 737’s dynamic instability with a software system. Big strike No. 2. Finally, the software relied on systems known for their propensity to fail .... Big strike No. 3.

      The article definitely does partially blame engineering.

      • ReGenGen 5 years ago

        We don't know that the MAX is inherently unstable; that's still conjecture at this point. MCAS may have altered flight characteristics merely to match older 737s and avoid additional pilot training. If the MAX is inherently unstable, then this scandal is much bigger and further reaching.

        • halter73 5 years ago

          I heard that MCAS was required to avoid a situation where the MAX, after reaching a certain pitch, would continue to pitch up into a stall even if both pilots released the yoke. I don't know if that's considered "inherently unstable", but from what I heard, this is against FAA regulations for any commercial airplane. That's probably a big reason why Boeing made it so difficult to completely disable the MCAS system.

          My understanding is that MCAS altered flight characteristics not only to match older 737s and avoid additional pilot training. MCAS altered flight characteristics so the FAA would approve the MAX as a commercial aircraft, period. The fact that they could match the older 737s' flight characteristics for a speedier FAA approval was just gravy.

        • cyphar 5 years ago

          The article is making the point that the MAX is inherently unstable because of the larger engines causing the "pitch up" problem with an increasing angle-of-attack:

          > Pitch changes with increasing angle of attack, however, are quite another thing. An airplane approaching an aerodynamic stall cannot, under any circumstances, have a tendency to go further into the stall. This is called “dynamic instability,” and the only airplanes that exhibit that characteristic—fighter jets—are also fitted with ejection seats.

          So arguably the existence of MCAS in the first place indicates that the aircraft design is dynamically unstable (otherwise MCAS wouldn't have been necessary).

          • ReGenGen 5 years ago

            There are several respected industry professionals that believe the MAX is inherently unstable (and maybe they have insider information...) but this is not considered "fact" right now. (AFAIK)

            MCAS was needed to maintain the original 737 type rating, which allows 737 pilots to fly any 737: significant operational flexibility and cost savings for airlines.

            Commentary from blancolirio who is a current 777 pilot. https://www.youtube.com/watch?v=zGM0V7zEKEQ&t=0s

            • cyphar 5 years ago

              I'm not in any way an authority when it comes to aeroplanes.

              However, my understanding is that the reason MCAS was needed to maintain the original 737 specification is the "pitch up" behaviour at increased AoA (which is what TFA describes as "dynamic instability").

              The video you linked doesn't disagree with this -- though it's phrased as MCAS being primarily there to "replicate the same feel as earlier versions of the 737, by giving a little bit of nose-down trim". The article claims that being dynamically unstable means that at high AoA you get nose-up lift (I'm not a pilot or aeronautics expert, so this might be an incorrect definition, but I've not seen anyone disputing that definition or disputing that it's against FAA guidelines).

              If you need an additional system to "replicate the feel" of not having nose-up lift at high AoA, that tells me your plane design must have nose-up lift at high AoA. The guy in the video then goes on to say that it's an inherently stable design, but he doesn't really qualify that (other than saying that all other 737s are stable designs), and he adds that "the nose goes a little bit light".

              Obviously we should hold back judgement until we know all the facts, but "the 737 MAX is an inherently stable design" is not someone holding back judgement.

        • yason 5 years ago

          But wasn't this explained in the article already? A dynamically stable plane doesn't aggressively rotate about any of its axes when, for example, you increase or decrease throttle. The location of the MAX's engines generates additional pitch-up force when engine power is increased, and even more so when the plane is already pitched high. Earlier 737s could do without MCAS because they were designed with smaller engines in lower positions, in order to be dynamically stable.

          • pps43 5 years ago

            You are referring to thrust asymmetry, not dynamic stability. Thrust asymmetry is actually not that different between MAX 8 and NG. MAX 8 engines have more thrust, but they are also mounted higher (closer to the centerline), reducing torque.

            Dynamic stability is a tendency of the plane that flies straight and level to maintain this straight and level flight. MAX 8 still has this property, MCAS or not.

      • masklinn 5 years ago

        The engineering blame is that they couldn't engineer themselves out of a political problem. I wouldn't call that "engineering blame".

        Deep ethical failure, maybe, but this here place usually has a hard-on for epically failing the most basic ethical non-challenges.

    • airbreather 5 years ago

      See my other comment in this thread: engineering is done by humans, and they are almost always the weakest point in all phases of engineering.

    • ncmncm 5 years ago

      Of course, that much engineering failure is not possible without even more management and regulatory failure. But documenting the engineering failures is necessary to quantify the management failures, at least until discovery in the wrongful-death lawsuits begins.

      • tristanstcyr 5 years ago

        I find that engineering vs. management is not a very useful framing. Management is composed of engineers too, and engineers are responsible for the output of their work. Producing something that doesn't work is an engineering failure.

        It's depressing that everyone went along with this. Clearly there's a lack of impartial checks and balances in the process.

  • fma 5 years ago

    Regarding #6...Boeing charged $80k to unlock the software feature to gather data from more than 1 angle of attack sensor.

    • outworlder 5 years ago

      > Regarding #6...Boeing charged $80k to unlock the software feature to gather data from more than 1 angle of attack sensor.

      $80k for a _light_ in the panel, not to have MCAS take the two sensors into account, which it couldn't do anyway. With two sensors, how do you know which one is right and which one is faulty?

      • kortilla 5 years ago

        Well you can’t get quorum but you can certainly shut off when you detect a conflict.
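        A minimal sketch of that shut-off-on-conflict idea, with a hypothetical disagreement threshold (not Boeing's actual figure):

```python
# Hypothetical threshold and values: with two AoA sensors there is no quorum,
# but the automation can disable itself when the sensors disagree.
DISAGREE_LIMIT_DEG = 5.5  # assumed figure, for illustration only

def mcas_enabled(aoa_left_deg, aoa_right_deg):
    """Fail safe: hand control back to the pilots on sensor conflict."""
    return abs(aoa_left_deg - aoa_right_deg) <= DISAGREE_LIMIT_DEG

print(mcas_enabled(12.0, 12.4))  # True: sensors agree, automation may act
print(mcas_enabled(12.0, 74.5))  # False: conflict detected, shut off
```

        The point is that two sensors cannot tell you which reading is right, but they can tell you that at least one is wrong, which is reason enough to stand down.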

        • ReGenGen 5 years ago

          You can't shut MCAS off AND also NOT train the pilots on the system and the MAX's flight characteristics. Conflicting system goals.

          • bm98 5 years ago

            To me this is the crux of the issue. And it's not uncommon in software development:

            The developer fails to check for an error condition or raise an exception because doing so would add too much complexity to the system. So instead it is assumed (or hoped) that it simply can't or won't happen. Problem solved...

            (Edit: Or that the user will just have to reboot if it happens.)

            • ReGenGen 5 years ago

              Yes. Not checking for errors is a feature of the MCAS "solution", which hides the system from pilots' training AND from involvement with other 737 system/team engineers. A bureaucratic solution to a marketing problem. I wonder how many programmers they had to go through to find one.

    • aerospace133 5 years ago

      That's 100% false. There are people below you correcting this, and you still haven't changed it. The paid option to display AoA information was meant for military pilots, who are familiar with how to interpret information from AoA sensors.

    • cjbprime 5 years ago

      The amount of misinformation on this point is astounding. That's not true in any way.

    • ncmncm 5 years ago

      In court that will cost them dearly: it demonstrates they were aware using two sensors would be safer.

      Probably AA or SWA demanded it, and Boeing compromised by charging for it.

      But... #6 was about paying attention to the other computer, not the other sensor. That would involve some big and expensive changes to the flight computer software, which they probably should have done long before the MAX project started.

      More management failure.

      • cjbprime 5 years ago

        It's nonsense. There was no paid option that affected the use of the second sensor.

  • AWildC182 5 years ago

    tl;dr the first thing you learn doing control systems is NEVER use += 1; when it relates to an electro-mechanical device.

    I learned this a long time ago by putting a robot through a door-frame at max speed.

    Edit: To add to this in a way that might actually make it useful to someone, motor outputs should almost always be a continuous function of something rather than their own internal state. MCAS should have been coded as something like (vastly oversimplified)

    if(AoA >= MAX_AOA) trimPosition = g(f(AoA, IAS), trimPosition, dt);//AoA/IAS dependent ramp function

    else trimPosition = g(trimInput, trimPosition, dt);//ramp function

    not

    if(AoA >= MAX_AOA) trimPosition += 1;

    Because the latter is very likely to result in a runaway even if you have bounds checking somewhere else. Worst case you don't have bounds checking and the position value/register value loops over and the tail just starts spazzing out, flapping up and down as fast as the motor will drive it.
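    A minimal C sketch of the "continuous function, not a bare increment" idea above -- all limits, rates, and names here are invented for illustration and have nothing to do with the real flight software:

```c
#include <assert.h>

#define TRIM_MIN (-5.0)   /* hypothetical trim range, degrees */
#define TRIM_MAX 5.0
#define MAX_RATE 0.5      /* hypothetical max trim change per tick, degrees */

/* Ramp the trim toward a commanded target, rate-limited and clamped to
   the physical range. The motor command is always a function of the
   target and the current state, never a bare "+= 1", so a stale or
   repeated trigger cannot accumulate past the stops. */
double ramp_toward(double target, double current)
{
    double delta = target - current;
    if (delta >  MAX_RATE) delta =  MAX_RATE;   /* rate limit */
    if (delta < -MAX_RATE) delta = -MAX_RATE;

    double next = current + delta;
    if (next > TRIM_MAX) next = TRIM_MAX;       /* position limit */
    if (next < TRIM_MIN) next = TRIM_MIN;
    return next;
}
```

    Calling this every control cycle converges on the target and then holds there, no matter how many times the triggering condition fires.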

    • amluto 5 years ago

      PID controls do exactly what you are objecting to.

      • AWildC182 5 years ago

        It's more nuanced than that. My ramp functions technically have a += 1 somewhere in them. You just don't want to += a motor value directly or otherwise add to state out of a feedback loop. You can verify the PID function in simulation/unit test. It's much harder to unit test the motor/driver/controller on a stand.

        • jschwartzi 5 years ago

          It seems like what you're really getting at is this: when a linear change in the input quantity produces a non-linear response, what you really want is a non-linear change in input that yields a linear response. In that case, you would characterize the effect of the control on the response variable and arrive at a table of acceptable values. Then your +=1 becomes an increment of an index which returns the next-highest acceptable value, and you no longer produce a non-linear change in the response.
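          A sketch of what that table-driven scheme might look like in C (the table values and helper names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table of pre-characterized, acceptable trim commands.
   The "+= 1" only ever increments an index into this table, so the
   output is always one of a finite set of bounded values. */
static const double trim_table[] = {0.0, 0.3, 0.7, 1.2, 1.9, 2.4};
#define TABLE_LEN (sizeof trim_table / sizeof trim_table[0])

double next_trim(size_t *idx)
{
    if (*idx + 1 < TABLE_LEN)
        (*idx)++;                /* increment the index, not the output */
    return trim_table[*idx];     /* output is always a bounded table value */
}

/* Demo helper: trim value after `presses` successive increments. */
double trim_after(int presses)
{
    size_t idx = 0;
    double v = trim_table[0];
    for (int k = 0; k < presses; k++)
        v = next_trim(&idx);
    return v;
}
```

          No matter how many times it is called, the command saturates at the last table entry instead of running away.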

          • kortilla 5 years ago

            No, you’re missing the point a bit. It’s not about the magnitude of the response, it’s about the propensity for a feedback loop.

            += 1 is shorthand for “ignore everything going on in the world and increase your value”. This is almost never what anyone actually wants, so people end up spending a bunch of time guarding against calling it when the value is already at its maximum.

            Instead, the safe thing to do is only assign to it from a function with a ceiling.

            val = min(CEIL, val+1)

            It’s way to easy to get runaways with =+1 even in serious systems like this one. Every time I see that in code I review where the value is some long-lived thing, I just confirm with the author that they don’t care if it overflows, because it’s probably gonna happen.

      • Gibbon1 5 years ago

        All practical PID controls have anti-windup features.
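        One common anti-windup scheme is simply clamping the integrator. A minimal PI sketch, with gains and limits invented for illustration (nothing here reflects any real controller):

```c
#include <assert.h>

/* PI controller with clamped-integrator anti-windup. */
typedef struct { double kp, ki, integ, integ_max, out_max; } pi_t;

double pi_step(pi_t *c, double error, double dt)
{
    c->integ += error * dt;
    /* anti-windup: bound the integrator so a persistent error
       cannot accumulate an unbounded command */
    if (c->integ >  c->integ_max) c->integ =  c->integ_max;
    if (c->integ < -c->integ_max) c->integ = -c->integ_max;

    double out = c->kp * error + c->ki * c->integ;
    if (out >  c->out_max) out =  c->out_max;   /* output clamp */
    if (out < -c->out_max) out = -c->out_max;
    return out;
}

/* Demo helper: drive the controller with a constant error for `steps`
   ticks and return the final output -- it saturates instead of winding up. */
double pi_run(double error, double dt, int steps)
{
    pi_t c = {1.0, 1.0, 0.0, 2.0, 5.0};   /* illustrative gains/limits */
    double out = 0.0;
    for (int i = 0; i < steps; i++)
        out = pi_step(&c, error, dt);
    return out;
}
```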

  • Cacti 5 years ago

    Stab trim cutout is not the only way to disable MCAS. Extending the flaps any amount also disables MCAS.

    Turning on autopilot also disables MCAS, though this isn’t entirely effective since spurious AoA readings may quickly disable the autopilot again.

    • cjbprime 5 years ago

      Boeing put out an emergency airworthiness directive after Lion Air. It doesn't tell pilots to lower flaps. AoA sensor failure causes IAS Unreliable warnings, and the checklist for that item demands that flaps be left alone -- if you don't know how fast you're going, lowering flaps could cause a wing stall.

      It's not reasonable to expect pilots to disobey checklists. We would all be less safe if they did. If pilots are following Boeing's instructions and planes are crashing, that's on Boeing.

    • metanoia 5 years ago

      One question I've had is why Boeing changed the design of the stab cutout switches from the 737NG's pair — the right-side switch that disabled the autopilot's control of trim and the left-side switch that disabled the trim motors entirely [1] — to a design with a similar pair of switches, but with each controlling the primary and backup control motors of the elevator trim [2] instead of controlling the source of the input. (If you look closely at the yoke trim switches, there are two independent switches grouped together by a frame so that they move simultaneously. One drives the primary motor, and the other the secondary.)

      Assuming they'd kept MCAS on the switch associated with the automatics, the procedure would have been to throw the AUTO PILOT switch, disabling MCAS, but keeping the MAIN ELEC switch on, allowing them to trim back to neutral for that speed.

      [1] https://www.airliners.net/photo/Australia-Air-Force/Boeing-7...

      [2] http://www.b737.org.uk/mcas.htm#stcs

      • metanoia 5 years ago

        EDIT: Oops, there's only one trim motor on the NG. Both trim switches must be actuated on the yoke switches though. But the input cutoff switches still work the same way I originally stated.

        Not sure about the Max's # of trim motors, but the Max definitely doesn't have input source cutoff switches.

  • acqq 5 years ago

    > A sixth was that, besides comparing redundant sensors, it could have compared what the other flight computer thought it should be doing.

    If you mean AoA sensors: AFAIK there was absolutely no redundancy at all in the way MCAS was designed. Exactly one sensor was ever used for MCAS. And the last time I read about Boeing's upcoming software changes, they wanted to keep it that way, only adding a notification to the pilot when the sensors disagree.

    • ncmncm 5 years ago

      Comparing sensors and comparing the judgment of the other computer were both things they could have done, and they utterly failed to do either.

      • acqq 5 years ago

        Because their main premise was that they were selling the "same old" plane, and any change in any workflow would make it obvious that it is not.

  • d-sc 5 years ago

    Architecting solutions is hard. In this case, you need knowledge of motors, flight controls, sensor fusion, etc.

    It’s easy to find edge cases when they present themselves (tragically here). But most electrical-mechanical-software assemblies have similar issues.

    • ncmncm 5 years ago

      This case is unusual because, rather than a whole series of things that all had to go wrong before the plane would crash, this system has numerous failure modes that individually were almost enough by themselves to cause a crash. It is only astonishing that it took so long for it to happen.

      In the Lion Air case, painfully inadequate maintenance contributed.

    • ncmncm 5 years ago

      Thus the standards and procedures that, if followed, would have prevented the problem.

      There is a standard that says automated controls are not allowed "authority" such that the pilot cannot counteract it.

      There is a standard that says failure of a single sensor must not cause a critical failure.

      Either violation alone makes the design not airworthy, and not certifiable for civil aviation, preventing both crashes.
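      With only two AoA vanes you can't vote, but you can at least detect disagreement and inhibit the automation. A hedged sketch of such a check (the 5.5-degree threshold is the figure publicly reported for Boeing's announced fix; treat it here as illustrative):

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

#define AOA_DISAGREE_DEG 5.5   /* reported disagree threshold; illustrative */

/* True if the two vanes disagree enough that automation relying on
   either reading should be inhibited. */
bool aoa_disagree(double left_deg, double right_deg)
{
    return fabs(left_deg - right_deg) > AOA_DISAGREE_DEG;
}
```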

    • gizmo385 5 years ago

      > In this case, you need knowledge of motors, flight controls, sensor fusion, etc.

      Expertise which should be readily available at one of the world's foremost designers of high performance commercial aircraft.

DuskStar 5 years ago

This article reads well. Unfortunately, it's filled with fundamental mistakes.

> In the old days, when cables connected the pilot’s controls to the flying surfaces, you had to pull up, hard, if the airplane was trimmed to descend. You had to push, hard, if the airplane was trimmed to ascend. With computer oversight there is a loss of natural sense in the controls. In the 737 Max, there is no real “natural feel.”

Whoops. The 737 is one of the few airliners produced today that DOES still directly connect the pilot's controls to the flying surfaces. There are literally 12mm cables connecting the yoke to the control surfaces. So, in fact, the 737 does have "natural feel". In fact, that's the whole problem MCAS was designed to solve - how to add force to the yoke at high AOA in certain speed regimes, in order to ensure a linear AOA response. (Certification requires something along the lines of "5 pounds force -> 5 degrees AOA, 10 pounds force -> 10 degrees AOA, etc.") If the 737 were fly-by-wire, this force could be added directly in software. Instead, they added it by changing the aircraft's trim.

> In a pinch, a human pilot could just look out the windshield to confirm visually and directly that, no, the aircraft is not pitched up dangerously.

No, you can't. There isn't a visual indication for AOA - the author is confusing this with pitch. You can have a 45 degree AOA while still having the nose of the plane pointed at the horizon. (The author mentions using the artificial horizon for this purpose, too) If you could read AOA from sensors that didn't require being exposed to the airflow, planes would use them.

I'm sure there's more, and it greatly reduces my confidence in the article.

  • lisper 5 years ago

    > There isn't a visual indication for AOA

    Private pilot here. What you say is, strictly speaking, true, but that doesn't mean you can't tell an awful lot about what's going on by looking out the window and, more importantly, at the other flight instruments. If your attitude, airspeed, and rate of climb all look normal, then if the AOA says you're stalling, it's almost certainly wrong. And if it says your AOA is 70 degrees (which the Ethiopian Airlines AOA sensor did), then it is definitely wrong.

    MCAS was designed to blindly trust a single AOA sensor, which was known to be prone to failures. To call that inexcusable would be quite the understatement.
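    The cross-check described above can be sketched as a simple plausibility gate. Every threshold here is invented for illustration (real envelopes vary by type and flight phase):

```c
#include <assert.h>
#include <stdbool.h>

/* True if an AoA reading is consistent with the other instruments.
   If attitude, airspeed, and climb rate all look normal, a stall-level
   AoA reading is suspect. All limits are hypothetical. */
bool aoa_reading_plausible(double aoa_deg, double pitch_deg,
                           double ias_knots, double climb_fpm)
{
    bool instruments_normal =
        pitch_deg > -10.0 && pitch_deg < 15.0 &&
        ias_knots > 200.0 &&
        climb_fpm > -1000.0 && climb_fpm < 3000.0;

    bool aoa_extreme = aoa_deg > 15.0;   /* near/above stall for an airliner */

    /* An extreme AoA while everything else reads normal is implausible. */
    return !(instruments_normal && aoa_extreme);
}
```

    A 70-degree reading against normal pitch, speed, and climb would fail this gate; a 70-degree reading with the airspeed also abnormal would not, which is exactly the ambiguity DuskStar raises below.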

    • DuskStar 5 years ago

      >If your attitude, airspeed, and rate of climb are all looking normal, then if the AOA says you're stalling it's almost certainly wrong.

      Agreed - you can certainly sanity check the AOA data with other data sources. Unfortunately, I'm not sure that will always let you pinpoint AOA as the cause - imagine some nice dirt-loving wasps have built nests in all of your pitot tubes, or they've iced over, and now those values are locked in place. So your IAS and altimeter could both be reading normal despite the fact that you're in a dangerously fast descent at a high angle of attack. (A radar altimeter wouldn't have this issue, of course) Things like this have happened before. Of course, this would require your engines to have failed at approximately the same time as your airspeed and altitude data, but I can imagine failure modes in which that would be the case. (A particularly stupid autopilot attempting to reduce speed but not seeing the speed decrease, repeat in loop)

      > And if it says your AOA is 70 degrees (which the Ethiopian Airlines AOA sensor did), then it is definitely wrong.

      Or it's accurate and you're about to die unless you fix it. Air France Flight 447 comes to mind. (though 70 degrees is probably past the point of being recoverable)

      > MCAS was designed to blindly trust a single AOA sensor, which was known to be prone to failures. To call that inexcusable would be quite the understatement.

      No disagreements here.

      • lisper 5 years ago

        > imagine some nice dirt-loving wasps have built nests in all of your pitot tubes

        All of them? And I didn't notice any of them during preflight? And none of them were there on the previous flight? And I didn't notice that the airspeed was not alive on the takeoff roll? Not going to happen.

        > or they've iced over

        Again, pretty freakin' unlikely on takeoff flying out of Addis Ababa. And this is another thing a pilot can rule out by looking out of the window. If you're not in clouds, you're not picking up ice.

        Also, there are OAT (outside air temperature) sensors.

        > your engines to have failed

        Another thing that would be pretty apparent to the pilots.

        No matter how you slice it, Boeing screwed the pooch bigly.

        • DuskStar 5 years ago

          > > imagine some nice dirt-loving wasps have built nests in all of your pitot tubes

          > All of them? And I didn't notice any of them during preflight? And none of them were there on the previous flight? And I didn't notice that the airspeed was not alive on the takeoff roll? Not going to happen.

          You underestimate the abilities of people to fuck things up. Birgenair Flight 301 - one tube blocked by a mud dauber wasp nest, leading to a crash. [0] Later that year came the crash of Aeroperú Flight 603 - the static ports were covered with tape, which the pilots never realized. [1] (They continued flying as if their altitude and airspeed readings were accurate)

          > > or they've iced over

          > Again, pretty freakin' unlikely on takeoff flying out of Addis Ababa. And this is another thing a pilot can rule out by looking out of the window. If you're not in clouds, you're not picking up ice.

          > Also, there are OAT (outside air temperature) sensors.

          Not applicable in this instance, sure. But "are there clouds around" isn't something I'd expect the flight computer to be able to reliably detect, and that's what would have to make the decision here.

          > > your engines to have failed

          > Another thing that would be pretty apparent to the pilots.

          I'd agree that it SHOULD be.

          > No matter how you slice it, Boeing screwed the pooch bigly.

          Yep. But there's still more complexity here than people seem to want to acknowledge.

          0: https://en.wikipedia.org/wiki/Birgenair_Flight_301

          1: https://en.wikipedia.org/wiki/Aeroper%C3%BA_Flight_603

          • lisper 5 years ago

            No technology is ever going to completely protect you against an incompetent pilot.

  • tzs 5 years ago

    > No, you can't. There isn't a visual indication for AOA - the author is confusing this with pitch. You can have a 45 degree AOA while still having the nose of the plane pointed at the horizon. (The author mentions using the artificial horizon for this purpose, too) If you could read AOA from sensors that didn't require being exposed to the airflow, planes would use them.

    I think you may have missed the context of that part of the article. The author was talking about diagnosing a possible AoA sensor malfunction.

    Changes in AoA should generally correlate with changes in attitude according to the artificial horizon and with changes to what you see looking out the window. So if you have an AoA sensor that claims you are at, say, a 70 degree AoA for several minutes (like the one in the Lion Air crash did), but you can see out the window that the plane is alternating between pitching up and down relative to the ground, you've probably got a busted AoA sensor.

    • DuskStar 5 years ago

      I wouldn't say that you can't correctly diagnose the failed AOA sensor in the majority of cases, but the conditionals to disable a safety system are probably going to have to be more rigorous than "AOA changing between extremes without significant change in pitch, airspeed or altitude". Just for starters, imagine you've hit a particularly violent updraft and then a downdraft... Though I'm not sure if such weather conditions would exist anywhere you'd plan to fly an airliner.

      (And I'm kind of being "perfect is the enemy of good" here, sorry. Any "AOA sensor is wonky, disable MCAS" methodology would probably have resulted in better outcomes than the one they used. It's just that none of them are really perfect)

  • FabHK 5 years ago

    > I'm sure there's more

    How about this one:

    > "The solution was to extend the engine up and well in front of the wing. However, doing so also meant that the centerline of the engine’s thrust changed. Now, when the pilots applied power to the engine, the aircraft would have a significant propensity to “pitch up,” or raise its nose."

    The centerline of the engine’s thrust moves up, doesn't it (the constraint is ground clearance; a bigger engine requires you to go higher, the wings are in the way, so you go higher and forward), so presumably it moves closer to the centre of gravity, giving you a shorter lever, thus full throttle should give you less nose-up moment.

    In reality, the issue is (as far as I understand) that the nacelles generate lift that is further out in front of the centre of gravity, giving you a longer lever, and that's what produces a stronger pitch up (under high AoA) than before.

    EDIT to add:

    The article mentions that later:

    > "And the lift they produce is well ahead of the wing’s center of lift, meaning the nacelles will cause the 737 Max at a high angle of attack to go to a higher angle of attack."

    The relevant property is not that the lift of the nacelles is ahead of the centre of lift, but ahead of the centre of gravity.

  • ncmncm 5 years ago

    Yes, it is unfortunate that these errors detracted from the thrust of the article.

rkagerer 5 years ago

  I believe the relative ease — not to mention the lack of
  tangible cost — of software updates has created a cultural
  laziness within the software engineering community.
-- This --^

As someone who carefully crafts their code to strive for perfection, seeing sloppy work out there in the wild drives me nuts. I know folks here will deride me for being "inefficient", but from my experience I still maintain that in the long term it's less efficient to push out buggy software and try to fix it after the fact.

  • anbop 5 years ago

    Well, great code is inefficient if the impact of bad code is minimal. If you are shipping an animated emoji you should be sloppier with your code than if you are building flight control software. Overengineering software is like over-specing building materials by 10x and spending way more money on a building than it needs.

    • Raidion 5 years ago

      I agree, but in practice a lot of bad software practices happen when middle management doesn't fully communicate the tradeoffs made when designing and building software. I find it hard to believe that no one saw this problem coming; more likely it just didn't get flagged at the level needed to investigate it.

    • tremon 5 years ago

      And in that same line of argument, great code is inefficient if the bad code can always be fixed post-release.

      Never mind the consequences of the first release, if we sell enough of it we might be bothered to pick up our slack 2 years from now.

  • adreamingsoul 5 years ago

    Recently, I had someone join my team who was instantly popular; everyone enjoyed this person's personality. I trusted this person more than I should have, and during a period when I was unable to give enough attention to their code reviews, they introduced several critical bugs and defects into the production environment.

    I never figured out if it was ignorance or laziness. But, either way I learned a valuable lesson. It was also a reminder to me that not everyone has the same level of ownership or commitment that I would expect.

    • crescentfresh 5 years ago

      > not everyone has the same level of ownership or commitment that I would expect

      This is a good way of putting it. A dev might be junior and won't have the knowledge or skill to build stuff with minimal defects, but they can still have the sense of ownership and commitment to a quality product that a senior/trusted dev on the team has.

    • underwater 5 years ago

      If the only gatekeeper against shipping bugs is a senior engineer catching bugs in review, then that's possibly a sign that your processes need improvement, rather than the people.

      • tremon 5 years ago

        Not rather; in addition to would be more appropriate.

    • quickthrower2 5 years ago

      They were new to the team and were not getting sufficient code reviews, so I wouldn't put the blame on them. There are a lot of gotchas in an unfamiliar codebase. Although, if they are at a senior level, they probably should have known this and insisted on a full review.

      • megaremote 5 years ago

        Maybe, but code reviews don't catch everything. They never can. Relying on them can also be a mistake.

        • toomuchtodo 5 years ago

          If you can't trust code reviews for reviewing code of someone relatively new to a codebase, what can you trust (besides tests)? Despite tests and reviews, you might still introduce bugs into production code.

  • blunte 5 years ago

    I think you vastly underestimate the scale of an aircraft as a whole. Your code is one thing, but the many systems it interacts with, the systems those interact with, and the possible outcomes are almost unfathomable to a human.

    And even in simple cases where the number of possible situations is calculable, you probably haven't covered all possibilities.

  • baddox 5 years ago

    It definitely depends what you're doing. Designing aircraft control systems? Yep, probably strive for perfection. Building a social network site, or a photo-sharing site, or an online forum? You're probably okay sacrificing some level of reliability for iteration speed.

    • Retric 5 years ago

      It’s less about the domain than the impact of the code.

      Bad code can kill your company or go unnoticed, depending on where it is and what it’s doing.

      • Gibbon1 5 years ago

        I wrote a buggy POS test program that had race conditions because I didn't really understand what I was doing. And sometimes it would assign two units the same serial number. It would check once the test was done and just fail one of the two if that happened. And the tech would just rerun the test on that unit.

        In the big scheme of things no one cared.

      • baddox 5 years ago

        Of course, but the solution for the vast majority of companies and projects isn't "never release code that isn't perfect."

  • rubicon33 5 years ago

    If only managers and CEOs saw it the same way.

    • WalterBright 5 years ago

      You'll see it in any piece of software written by any developer who is self-managed. The notion that software quality would greatly improve without managers is a myth.

      • ken 5 years ago

        We seem to have very different experiences. Every manager I ever had (including one at Boeing) has pushed me to ship software that wasn't at all ready, on the basis that the UI of the prototype looked good.

        • inflatableDodo 5 years ago

          I am terrified these days of showing anything that looks like something that works to a project manager or client, without covering it in warning graphics.

          Even photoshop mockups of UIs, introduced as, "Here's a photoshop mockup of the UI, what do you think?" will get some people to demand that it be shipped immediately.

          • outworlder 5 years ago

            One thing that I've learned (from someone else) is that unfinished software should look unfinished. Even if you have to muck around with CSS, disable some buttons here and there, maybe add something out of place.

            If the UI looks perfect, then everyone will deem it to be finished, even people who should know better.

            Yes, I'm advocating knowingly crippling demos. Your master branch may have something prettier, but take a sledgehammer and uglify the bastard before showing to decision makers.

        • WalterBright 5 years ago

          If you were your own manager for your own code, you'd do it too. I'm unaware of any data showing that managed software development is buggier than programmers working by themselves, and in my experience it is less buggy.

          Programmers by default tend to want to work on the fun stuff in a program, and neglect the unfun pedestrian stuff. This is where managers step in to improve things.

      • blunte 5 years ago

        Funny you should present it this way. My view is the opposite. As a self-managed, often solo developer, I find that team developers seem sloppy and very willing to choose not to consider the full scope of possibilities because they figure it's not their problem. Many of them work "to spec" and (rightfully?) choose not to consider the bigger picture.

      • rubicon33 5 years ago

        It's the artificial timeline, not the manager. The manager just enforces the artificial timeline.

        A developer themselves can hold themselves to an artificial timeline. It takes discipline, and patience, to choose quality over speed to market.

        CEOs and managers almost always prioritize time to market, over quality. That's the problem.

        • WalterBright 5 years ago

          Most projects (not just software) never ship unless there's a deadline. For example, how many students complete their term papers weeks before they're due? If there was no due date, how many would ever complete them? I know I never would have learned much of anything in college without deadlines and exams.

          • rubicon33 5 years ago

            Of course. I'm not advocating for no deadlines. I'm advocating for flexibility.

            If your deadline arrives and you can honestly say you've worked hard but haven't met it, consider extending the deadline in the name of quality.

            Again, this takes discipline. You have to be honest with yourself if you are just moving the deadline back because you were slacking. In my case many times, the deadline was not hit because the scope of the project was underestimated. We were working hard on a new technology stack, but missed the deadline. Rather than delay the product, CEO decided to launch (despite our warnings).

            Needless to say, the brand suffered.

            I understand its a tug of war. Managers need to put out something. Engineers are never ready. The art in it all, is finding the balance. Being flexible to listen to your engineers, and knowing them well enough to trust them.

      • LordHog 5 years ago

        In the various industries I have worked in (DO-178B avionics, industrial controls/IEC 61508, and storage), the one consistency shared by managers is meeting schedules and milestones. Generally, a manager's insight into development/engineering is for the product to be just good enough. Developers/engineers have a tendency to over-engineer the solution. Good peer reviews have the most influence on software quality, as does independent testing. Managers rarely have much impact, from what I have witnessed over my 25-year career thus far. For industries that adhere to DO-178B or IEC 61508 guidelines, it is the process that imbues greatly improved quality.

        • airbreather 5 years ago

          Yep, 61508 is basically a specialized form of ISO 9000 quality management.

          And independent testing really makes a difference. I am trying to introduce automated testing into my company at the moment; they still spend a month testing a safety system with buttons and lights.

          I have been getting blank looks when I ask how they test for single-scan events like this. But I know of several industrial incidents resulting from common single-scan software design failures in safety systems (usually order-of-execution issues, but sometimes the filter logic of one-shots is the culprit).

          One single scan incident in particular was from equipment in service for over 10 years, and then the stars suddenly aligned and the resulting software failure event ended up costing a big miner well over a billion dollars.

        • ncmncm 5 years ago

          I have frequently seen standards, guidelines, and "process" prevent improvements to code. Since change is considered risky, it takes more time and effort to document improvement than the individual change is worth.

          But software improves, when it does, by a series of small changes, each by itself hardly worth doing, but in sum producing a wholly better product.

          It is the process that turned lizards into birds.

        • WalterBright 5 years ago

          Businesses would not employ expensive managers if the dev teams worked better without them.

      • andbberger 5 years ago

        [citation needed]

        idk man, my code is pretty damn clean

        • WalterBright 5 years ago

          I think my code is cool, too, until I look at it again 5 years in the future.

          • D-Coder 5 years ago

            There's your problem — you're releasing too early. :-)

        • alacombe 5 years ago

          Code cleanliness has no correlation with lack of bugs. Also, 1) your sample size is pretty small, and 2) what looks clean to you might very well be despicable to someone else.

          • blunte 5 years ago

            And more importantly in this case, the appearance of your code is relatively meaningless compared to you failing to consider all cases of inputs.

            Honestly, the appearance of code is irrelevant. The goal is to absolutely accurately meet the requirements - the true, actual requirements. It may not be within scope to know all the information you need to know to cover all the cases, and that's certainly a problem. But the cleanliness of your code, assuming it functions correctly, is unimportant to the operation. (It certainly will be a maintenance cost, though.)

          • onemoresoop 5 years ago

            The actual argument is that code cleanliness improves readability, which in turn makes finding bugs easier and eases conceptualizing what the code does. I'd say it's the first line of defense against bugs. Tests can miss bugs too, but that doesn't mean we shouldn't test thoroughly.

          • andbberger 5 years ago

            I guess that was a poorly chosen word on my part then. How about 'elegant'.

            • ncmncm 5 years ago

              "Elegant" does't mean "right" either.

              But with elegant code it is often easier to tell if it's not right, and also how to get it right if it's not. Often that requires making it less elegant, because the real world rarely poses elegant problems. You want to ensure that as much as possible of the necessary inelegance is at the top level, with everything below clean.

    • winrid 5 years ago

      At my current company all leadership is in agreement that it is more expensive to deal with bugs in production.

      • Frost1x 5 years ago

        But CD...agile...

        It really depends on the scenario: whether you can bill those fixes to the client or have to eat the cost of making them. Most modern development processes seem to push this model simply because they can essentially pass the costs to the client and charge other or future clients for the improvements.

        • winrid 5 years ago

          We charge upfront for someone to use the product and negotiate any feature requests.

    • alacombe 5 years ago

      On the other side: unless you are an artist and not dependent on your software actually being used, you have to get the product out at some point.

      It's easy to demonize CEOs and managers. We, as devs, will always find an excuse as to why the software is not ready, and QA will always find new things to test to justify not putting the software out.

      • airbreather 5 years ago

        If you have fully defined what your software will do up front, then you are finished when you have made it do what you said it would.

        Safety critical software is usually designed, built and tested to the V-model.

        When done properly every piece of logic/code can be traced back to a requirement through all steps and phases of development and testing.

        There is no "move fast and break things" in the safety instrumented field, the engineering hours per byte/instruction delivered would absolutely flabbergast most software devs, by many orders of magnitude.

      • Frost1x 5 years ago

        It depends on the real-world effects of pushing software out. If the side effects are minor and it's perfectionism, then sure, push it out ugly.

        If it's a safety control mechanism for a car or airplane, then perfectionism needs to be applauded over faster time-to-market. If it's another chat application and changes can be easily rolled back, then push it out warts and all.

      • Gibbon1 5 years ago

        I think that 'can safely ship' is a primary feature of any software project. What that means in practice is that it puts pressure on new features and 'refactoring'.

  • amelius 5 years ago

    > As someone who carefully crafts their code to strive for perfection, seeing sloppy work out there in the wild drives me nuts.

    Do you use formal methods to prove the correctness of your code?

    Because that is what Boeing engineers do (I hope).

    • philjohn 5 years ago

      You would indeed hope. But then again, look at the Toyota ECU issues ... and the subsequent code review as part of the court case ...

    • philipov 5 years ago

      "Beware of bugs in the above code; I have only proved it correct, not tried it." - Donald Knuth

    • Raidion 5 years ago

      It's not the logic that's the problem as much as the fact that the systems are meant to work with clean data, and the possibility that data could be not just missing but straight-up incorrect was never considered valid. This was made worse by the fact that the poorly formulated "solution" was barely communicated to the pilots. I saw somewhere that the plane could be unrecoverable in 40 seconds. You need bigger safety margins than that.
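      The "incorrect data was never considered" point suggests the kind of input plausibility check that was apparently absent. A minimal sketch, with invented limits (the function name, the 25-degree envelope, and the 5-degree rate limit are illustrative assumptions, not Boeing's values):

```python
def plausible_aoa(reading, previous, max_aoa=25.0, max_step=5.0):
    """Reject angle-of-attack readings outside a physically plausible
    envelope, or jumping faster than the vane could actually move.
    All limits here are invented for illustration."""
    if not -max_aoa <= reading <= max_aoa:
        return False  # outside the plausible flight envelope
    if previous is not None and abs(reading - previous) > max_step:
        return False  # implausible jump between samples
    return True

print(plausible_aoa(74.5, 5.0))   # False: a 74.5-degree AoA reading is garbage
print(plausible_aoa(5.1, 4.8))    # True: small, smooth change
```

      Even a single-sensor system can flag data this obviously wrong before acting on it.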

      • ncmncm 5 years ago

        40 seconds is an eternity to a pilot taking off or landing. Even Air France, taken down by a plugged pitot tube from cruise altitude, was doomed after only two minutes.

        Airbus has a lot to answer for on that one: averaging inputs from the pilot's and copilot's game controllers? Game controllers? Turning off the stall warning in deep stall, so that starting to recover sounds it again? Failing to teach pilots what stalls are, what they feel like, and how to prevent and recover from them?

        There is more than enough blame to go around for that one.

        The Boeing software, like the Airbus's, apparently performed as specified, so there was no problem with execution of the spec. The problem was that it was a bad spec, in too many ways to count.

  • rdiddly 5 years ago

    The most efficient way to do it is to do it once.

    • atoav 5 years ago

      Which leads to the heart of the discourse: programming is many things at once. Programming a single-purpose library or module that has to do one thing, and do that one thing as well as possible, demands efficient and sustainable solutions. Theoretically, if you get that kind of code just right, you don't have to touch it again for a long time.

      On the other hand, you have code whose environment is so volatile or ephemeral that developing the appropriate code isn't possible, because you are aiming at a moving target. This often demands faster solutions, shims, code that was never really meant to be maintained, etc. Sadly, many companies have a culture where these shims become like glorified tradition, and a few years down the line you have a pile of shims that nobody sane is willing to touch.

      However there are situations where fast solutions make sense, e.g. because it is a one time thing or it doesn’t really matter that much etc.

    • baddox 5 years ago

      That is definitely not true for all tasks, and especially not all software engineering tasks.

    • blunte 5 years ago

      Indeed, but the prereq is that you know the need absolutely and correctly initially. And you usually don't.

  • mathgenius 5 years ago

    This was not buggy software. This was a bad specification. The software did what it was supposed to do.

    • ncmncm 5 years ago

      Anyway, specified to do.

wpietri 5 years ago

What a stellar example of an article on a complex topic written to be clear enough for the audience to understand. I especially like the way he brought it back repeatedly to hands out the window and bitey dogs.

  • wpietri 5 years ago

    I also heartily agree with him that software's general laxity with regard to reliability is contagious. I've come to think that calling it all "software" is dangerous, like thinking of all things made with atoms as the same.

    I think we as an industry should get together, divide the work into various domains, and establish professional and ethical standards for the domains that matter. Standards with teeth, such that developers who want to do the right thing in the face of bosses insisting otherwise have the backing of their peers. And also such that developers who don't care about the right thing fear the professional consequences.

    I honestly think this is something we should do regardless. But as a practical matter, if we don't governments will do it for us soon enough. Software keeps getting more important, as do its failures.

    • ubertakter 5 years ago

      Something like (or exactly) software systems safety methods should be applied to critical systems (such as aircraft systems). DoD does this for all of their critical software. And I say "something like" only to indicate there may be something particular about software in aircraft. I doubt it though. And as the article indirectly points out, analysis was severely lacking.

      DoD Software System Safety handbook https://www.acq.osd.mil/se/docs/Joint-SW-Systems-Safety-Engi...

      Full disclosure: The company I work at does this type of work. I don't work in that group.

    • xiphias2 5 years ago

      We already have standards, one of which is clear technical documentation of the system, which the users were not getting. We also have standards for writing fault-tolerant systems (I would say even two measurement devices are not enough; you need at least three to be able to decide).
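      The "at least three" point can be sketched as a simple median voter. This is a toy illustration, not avionics code, and the tolerance value is an arbitrary assumption:

```python
def vote(a, b, c, tolerance=2.0):
    """Return the median of three redundant sensor readings, plus any
    sensor that disagrees with the median by more than `tolerance`.
    With only two sensors you can detect a disagreement but cannot
    tell which reading to trust; three lets you outvote the outlier."""
    median = sorted([a, b, c])[1]
    suspects = [r for r in (a, b, c) if abs(r - median) > tolerance]
    return median, suspects

print(vote(4.8, 5.1, 74.5))   # (5.1, [74.5]) - the bad vane is outvoted and flagged
```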

      In Boeing it's clear that software best practices and standards were not followed at all.

      As an example, here's a description of the Google Test Certification process:

      https://mike-bland.com/2011/10/18/test-certified.html

    • uxp100 5 years ago

      I don't know that this hasn't been done already. If you're working on automotive software, for example, you will comply with ISO 26262. Is this the type of professional standards you are referring to?

ReGenGen 5 years ago

The 737 Max keeps getting viewed as an "Engineering Failure" but we should consider if this really was a "Management" failure. The unstated goal w/MCAS was to avoid additional pilot type certification. (If you tell pilots about MCAS, give them an off switch, or if the system switches off... then pilots need to be trained specifically on the MAX.) If management gave the MCAS project to Senior Boeing Engineers they would likely push back jeopardizing unstated goals. Executives likely steered this project to junior engineering or yes-men... who delivered a solution which avoided pilot training.

  • theclaw 5 years ago

    The article states that MCAS was implemented "on the hush-hush," which makes me wonder if it could even be subject to the same level of quality control as other features of the software.

    It might have had to bypass some of the more stringent parts of Boeing's development process to avoid appearing in documentation that the customer or the FAA might see.

    • ncmncm 5 years ago

      This appears to be what happened.

      If so, it amounts to criminal negligence. People should go to jail, but if anybody does, it will certainly not be the ones ultimately responsible. Most likely Boeing will pay fines and court judgments, something probably already factored into their stock price, impacting people holding the stock this year, not those who might have demanded better management five years ago.

      Certainly the whole top tier of management should be fired, but that won't happen either.

      • jeremyjh 5 years ago

        How about the fact that one plane full of people going down was not enough to wake everyone up? No, we needed two planes to go down, and still the FAA was telling us we had nothing to be concerned about. The biggest problem here is not the engineering or even the management. It's regulatory capture.

  • masklinn 5 years ago

    > The 737 Max keeps getting viewed as an "Engineering Failure" but we should consider if this really was a "Management" failure. The unstated goal w/MCAS was to avoid additional pilot type certification.

    The failures were first political, second managerial, and third ethical. The actual engineering failure is a far fourth.

  • _bxg1 5 years ago

    This is true, but any number of engineers still should have blown the whistle. The driving force may have been managerial greed, but other factors still could have averted it.

_bxg1 5 years ago

"Various hacks (as we would call them in the software industry) were developed."

Dear God. That's a sentence I never, ever wanted to hear about an aircraft.

  • WalterBright 5 years ago

    All engineering designs are full of compromises (the article uses "hacks" meaning compromises), as there are a large number of competing issues at work. Pretty much none of those issues are ever aligned along the same axis.

    For just a taste of this, an airliner flies at high altitude, and at low altitude. It flies at low speeds, and high speeds. It flies heavily loaded and empty. Optimizing for any one of these regimes means unacceptable behavior on the others. So compromises are necessary.

    Ever notice the flaps on the wings? They're a compromise (a "hack" if you will) to change the shape of the wing to make it work better across different flight regimes. Because metal isn't very flexible, and a long list of other issues with the machinery that operates the flaps, the shape of them is hardly anything but compromises.

    • _bxg1 5 years ago

      I don't even work on life-or-death machinery, only JavaScript UI's, but if a higher-up requested a new feature from me that would require multiple cascading changes just to keep other things from breaking, I would fight tooth and nail to talk them out of it. The compromises you're talking about are fundamentally unavoidable ones that come directly from the constraints at hand. A hack is something that degrades the integrity of the overall system for the sake of a short-term feature. Hacks are the payday loans of technical debt.

      • WalterBright 5 years ago

        They're the same thing. BTW, many aircraft crashes were due to a pilot doing what would have been the right thing on another design they were familiar with. Boeing's plan to make the MAX behave like other airplanes is a reasonable plan to improve safety.

        The "multiple cascading changes" is always an issue with a complex design, and in fact Boeing's approach with the Max was choosing a route which minimized those cascading changes.

        • mcny 5 years ago

          » They're the same thing. BTW, many aircraft crashes were due to a pilot doing what would have been the right thing on another design they were familiar with. Boeing's plan to make the MAX behave like other airplanes is a reasonable plan to improve safety.

          » The "multiple cascading changes" is always an issue with a complex design, and in fact Boeing's approach with the Max was choosing a route which minimized those cascading changes.

          I think the charge people are bringing is that Boeing wanted to make the new airplane acceptable to airlines by making the 737 Max the same as the old plane, so there was no need to certify pilots for the new plane. My vote is that this is criminal negligence at best, and under an ideal government the then-CEO and board would be in jail pending charges right now.

          • WalterBright 5 years ago

            You wouldn't have any planes (trains or automobiles) under such a standard. Keep in mind that horses killed far more people per mile than planes, trains or automobiles ever did.

        • Gibbon1 5 years ago

          > "multiple cascading changes"

          That's aircraft in a nutshell. I read a book on light aircraft design by a now long-dead professor[2]. He had a penciled-out analysis of what would happen to a 2-seater light plane if you replaced the simple and light stick-with-cables design[1] with a hydraulic one. It only added about 40lbs more weight directly. Or 300lbs once that was propagated through the design. And it needed a bigger engine.

          [1] Control surfaces operated by a system of cables and pulleys controlled by a stick.

          [2] KD Wood https://en.wikipedia.org/wiki/Karl_Dawson_Wood

    • tigershark 5 years ago

      No, MCAS was absolutely not a compromise. It was definitely a hack to avoid pilot recertification on a plane that would have handled differently from the previous model. And it was a hack put in place to correct for the other hack of moving forward engines that are way too big for that frame, which caused the different handling. The only "compromise" we can see is that safety was effectively compromised, causing the death of hundreds of people, just to avoid pilot recertification.

  • nutjob2 5 years ago

    The modern 737 is a collection of hardware and software hacks, flying in close formation.

    • penagwin 5 years ago

      > flying in close formation

      "conveniently in close proximity most of the time, usually".

      I wonder what the equivalent of "Developing in Production" looks like for planes, or if this is it?

    • WalterBright 5 years ago

      So is any airplane that was ever built that is capable of flying.

    • cmauniada 5 years ago

      cloud formation*

      Sorry, I couldn't resist.

laydn 5 years ago

The article states: "When MCAS senses that the angle of attack is too high, it commands the aircraft’s trim system to lower the nose. It also does something else: It pushes the pilot’s control columns downward"

No, it does not push the pilot's control columns downward.

  • Declanomous 5 years ago

    I think that the author is mistaking the fact that trimming the plane forward means that more force is required to pull back on the yoke for the plane actively pushing the yoke forward.

    • cjbprime 5 years ago

      Yeah, maybe also some Boeing/Airbus confusion. The author doesn't seem to understand the extent to which the 737 cockpit is mechanical, with cables running to control surfaces. It's an important point because it likely explains why the Ethiopian pilots were unable to regain control after disabling electric trim, becoming overpowered by aerodynamic load on the stabilizer.

raz32dust 5 years ago

Could someone please ELI5 what exactly went wrong that caused the crash? I read the article but I still don't understand if the crash was caused by

(a) The MCAS doing something it wasn't programmed to do (by specification)

(b) MCAS was working as expected, but the expectations/assumptions were wrong (excluding pilot mistakes).

(c) MCAS worked as designed, and the design was correct IF the pilots behaved like Boeing expected them to.

Or has this question not been answered yet? I am talking purely from the software correctness point of view, irrespective of the fact that using software to work around this problem was already bad design.

  • ncmncm 5 years ago

    It is none of the above.

    The MCAS worked as specified. The specification was criminally stupid. It violated at least two bedrock principles of avionics design. It would not have been approved at all if Boeing had not bent over backwards to draw attention away from it, for fear that pilots might have needed extra training, or the plane more examination.

  • tgsovlerkhgsel 5 years ago

    "All of the above"

    The MCAS itself did exactly what it was programmed to do. One of the sensors feeding it, which was not redundant (!), malfunctioned, and MCAS acted on the data - garbage in, garbage out.

    By not having sufficient redundancy, MCAS was designed in a failure-prone way. This was justified because the failure would be like another kind of known failure mode (runaway trim), and pilots already had a checklist for it (the key point being to turn off electric trim).

    The problems with that are:

    1. The problem a malfunctioning MCAS (acting "correctly" based on bad data) introduces is intermittent/erratic, making it harder to correctly react to it. Especially since the pilots weren't aware that a system that would act like this was on board!

    2. Humans make mistakes in high-stress situations, which is why accepting something that introduces additional problems and relies on humans to handle them is a bad idea.

    3. Turning off the electric trim also made it hard for the pilots to correct the situation, as it disables the manually-controlled but motor-powered trim. One theory is that the forces may have been too great for the backup solution (trimming manually) to work, a situation that old (but not current) manuals addressed.

    MCAS has a limit on how much it will trim, but any trim input by the pilot resets it and lets it move the trim by the limit amount again.

    This is just a layman's interpretation:

    From the flight data recorder, it seems like the Ethiopian ET302 pilots only partially corrected the wrong trim, then did turn trim off. After this, there was no noticeable adjustment in trim until the pilots turned electric trim back on to be able to use it. They used it to trim a little bit, and 5 seconds after their last trim input, MCAS triggered again and drove them into the ground. [1]

    On the other hand, the Lion Air JT610 crew doesn't seem to have turned trim off, just corrected it each time MCAS trimmed down (only for MCAS to trim back down 5 seconds later) [2] - until they once only trimmed a little against it, so that they hadn't corrected what the previous MCAS trim had done, but unlocked it for another one. MCAS applied more trim, and that was it.

    This could likely have been avoided by telling the pilots about MCAS and how it works, having a separate switch that kills MCAS but not electric trim, some sort of indication that MCAS is working, some sort of indication that the AoA sensor is broken, redundant sensors, and countless other things. The most likely reason why those weren't implemented seems (for me) that adding them would likely trigger the certification/training requirements that Boeing was trying to avoid.

    [1] https://leehamnews.com/2019/04/05/bjorns-corner-et302-crash-... [2] https://static.seattletimes.com/wp-content/uploads/2018/11/L...
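    The reset-and-retrigger cycle described above can be captured in a toy model. The increment and stop values below are illustrative assumptions, not actual flight-control parameters:

```python
MCAS_INCREMENT = 2.5   # nose-down trim added per activation (illustrative)
TRIM_STOP = 5.0        # mechanical nose-down trim limit (illustrative)

def net_trim(pilot_corrections):
    """Each element is how far the pilot trims back up between MCAS
    activations. Any pilot input resets MCAS's budget, so MCAS may
    apply its full increment again on the next activation."""
    trim = 0.0  # net nose-down trim
    for correction in pilot_corrections:
        trim = max(0.0, trim - correction)            # pilot trims nose up
        trim = min(TRIM_STOP, trim + MCAS_INCREMENT)  # MCAS retriggers
    return trim

print(net_trim([2.5, 2.5, 2.5]))  # 2.5 - full corrections hold a stalemate
print(net_trim([1.0, 1.0, 1.0]))  # 5.0 - under-correcting ratchets to the stop
```

    The ratchet is the key point: a pilot who under-corrects even slightly each cycle ends up at the stop within a few activations.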

  • xigency 5 years ago

    The answer is B.

wnevets 5 years ago

>Unfortunately, the current implementation of MCAS denies that sovereignty. It denies the pilots the ability to respond to what’s before their own eyes.

>In the MCAS system, the flight management computer is blind to any other evidence that it is wrong, including what the pilot sees with his own eyes and what he does when he desperately tries to pull back on the robotic control columns that are biting him, and his passengers, to death.

Wow, I had no idea the issue with the 737 Max was so nuts. I can't imagine what the pilots were going through when this was happening.

  • ncmncm 5 years ago

    Well, no robotic control columns, but the effect of silently adding uncommanded trim is the same.

NoNameHaveI 5 years ago

So, the linked article is an updated version of an earlier article in the EE Times. In the comments section, a reader wrote: "Frankly, I am astonished that a single point of failure (AOA) could make it through a FMEA (Failure Mode Effect Analysis)." For those who are unfamiliar, a FMEA is (basically) looking at each part of a system and asking "What happens if it breaks?" Having worked in commercial vehicle software development, I am astonished as well.

  • cjbprime 5 years ago

    It appears it made it through because Boeing didn't reveal that MCAS can incrementally mistrim until it has overpowered the pilots. They may even have added that capability after going through the initial certification.

  • theclaw 5 years ago

    I don't work in this field. Do you have to publish the findings from this analysis? Is it possible that no such analysis was done because the MCAS system had to be kept quiet?

Xcelerate 5 years ago

Just a meta-comment, but I've noticed that for every single article posted about the 737 Max on HN, the top comments all tend to say that something in the linked article is dramatically wrong. I'm not sure what this means, but I find it interesting.

  • cjbprime 5 years ago

    I think the takeaway is that it's not usually the case that something this technical makes it to world news. Everyone wants to have an opinion, but even in this article's case of someone who's both a pilot and programmer, they apparently know next to nothing about how a 737 MAX is controlled (mechanically, not with forces artificially applied to inputs) because they fly a Cessna. And there aren't that many 737 pilots with the time or ability to write something frank about this without perhaps getting in trouble with their employers.

  • mncharity 5 years ago

    > I'm not sure what this means

    Medicine is struggling to adopt modern best practices for quality processes from other industries, including aviation. Journalism hasn't yet even noticed they're needed?

murkle 5 years ago

> In the 737 Max, only one of the flight management computers is active at a time—either the pilot’s computer or the copilot’s computer. And the active computer takes inputs only from the sensors on its own side of the aircraft.

  • paulopontesm 5 years ago

    Also wondered what was the logic behind this decision...

    • nutjob2 5 years ago

      Like everything else, they were maintaining certification. The original avionics worked that way. The same reason that the LCD screens display simulations of the original avionics hardware such as the artificial horizon.

    • digikata 5 years ago

      I suspect combined reliability of a simple switch and two independent systems is likely higher than one composite system and pilots or software trying to estimate and select which combinations of computers & sensors are "good" in the middle of an emergency.

      • airbreather 5 years ago

        And on the surface this looks like a reasonable comment, but it is exactly why there is a whole branch of engineering dedicated to understanding how to build safer systems. Counter-intuitive results abound.

        So many issues: a simple switch usually has poor diagnostics, at least in one mode of failure, so you don't know it has failed until it is too late. A continuous measurement device connected to a computer will have a vast array of available diagnostics, most probably leading to fewer "dangerous undetected failures" than a simple switch, or a combination of them.

        And "independent systems", sounds easy, but in practice full independence is almost impossible to achieve, and messy unpredictable humans dominate the common cause failures that overlap these systems.

        There is more, much more, but this is why it is hard to write readable articles about these things; so much devil is in the detail that it is hard to explain in bite-sized portions.

        • digikata 5 years ago

          "A continuous measurement device connected to a computer/s will have a vast array of available diagnostics, 'most probably leading to less "dangerous undetected failures" than a simple switch, or combination of."

          Isn't this exactly the approach that failed in the MCAS system? And if you had a switchable independent system, a copilot would have righted the plane and flown on.

          But really I agree with your overall comment, it's very difficult to know why a given safety design decision was made unless you are well steeped in the system - there are almost always little corner tradeoffs. That's why I added the "I suspect" to the front of my comment.

          • airbreather 5 years ago

            This is the approach that IEC 61508 leads you down by the numbers, but it is also always better to cover off the unknowns with redundancy (multiple sensors) and diversity (different kinds of sensors) wherever practical.

            However, more instruments mean more potential disagreements, so more complexity of possible outcomes/actions/diagnostics etc.

            It becomes a balance for the best outcome and surprisingly when you go through all the factors there is still quite a bit of subjectiveness and sometimes the numbers for failure rates are so low that the calcs become extremely sensitive.

            Additionally, there is always the beta factor, which allows for common-cause failures between instruments/systems. Beta factors are often the dominant factor numerically in a performance calculation, but they are a) essentially traceable back to issues with humans (design, installation, maintenance) and b) often vastly underestimated and represented as an average value, where the worst cases are rare but very severe: one tech installs both instruments incorrectly, so they both read wrong, but identically.
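            As a rough numeric illustration of why the beta factor can dominate (the failure probability and beta value here are invented for illustration, not drawn from any real assessment):

```python
p = 1e-3     # per-demand failure probability of one sensor (illustrative)
beta = 0.05  # fraction of failures that are common-cause (illustrative)

# Both redundant sensors failing independently:
independent = ((1 - beta) * p) ** 2
# Both failing from a single shared cause (e.g. the same bad install):
common_cause = beta * p

print(round(common_cause / independent))  # 55 - common cause dominates
```

            Even a modest beta swamps the independent-failure term, which is why duplicating a sensor buys far less safety than the naive squared-probability math suggests.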

blunte 5 years ago

I think a fundamental concept that is relatively ignored in press (mainstream or non-fringe at least) is in business priorities.

As the essay points out, Boeing made decisions based on market (financial) factors all along the way. Of course they did - almost all companies do. Because it's so common, we forget it's not a RULE that cost, shareholder value, profit, etc. must be the final judgement.

Certainly any company that prioritizes safety (as in the case of a people-carrier) or some other non-financial focus may not be as attractive to investors. In some cases - particularly where human life is concerned - maybe that's ok.

I know this is a bit sensational of me to suggest, but Walmart could make an airplane. They have the funds to do so if they wanted (they would just buy a company; that's the quick way to get rolling). And if the Walmart airline charged half of what other airlines charged, you can be sure a whole lot of people would fly it. And sure, after enough flights, the fatalities per flight would become uncomfortable to most flyers.

The point is, it's a very long game. If people are not willing to consider the long game, they are basically gambling. I fly a lot. I almost always choose my flights based on cost/convenience. I didn't previously avoid 737 MAXs. (I would now, but they're all grounded anyway.) That said, if I know an airline cuts every corner to lower the price and increase the profits, I will certainly choose the next more expensive airline that doesn't behave so poorly. In this case it's the manufacturer, and the consumer has much less choice where that is involved.

But let's get back to the point. If Boeing were to fall behind Airbus in the 737/A320 race, would that really be such a terrible thing? Would the cost be human life, or might it be some stock price level? As an investor, do you really care about your shareholder value more than human life?

I like to fly. It can actually be fun. And I really like to visit lots of places in the world, eat lots of awesome food, and experience different cultures. I don't want to die because of some shareholder value goal (fuck you). I will die, and maybe it's tomorrow, but it shouldn't be for a stupid reason unless I choose it (here, hold my beer.)

  • bumby 5 years ago

    I agree that people often seem to miss that this may be rooted in a business decision. The problem is it isn't a risk-informed decision. I would doubt Boeing was accurately able to assess the actual risk of MCAS causing a catastrophic failure or else the decision to rush to market wouldn't have happened.

    I think just as large a problem is misaligned incentives. Management is almost assuredly not singly focused on the extreme long term. They are graded quarter-by-quarter, year-by-year. This pushes uncertain risks to the periphery in favor of short-term profits. I'm worried that unless the incentive structure is re-evaluated (by jail terms, as an example), management will continue to make these types of decisions, because schedule and cost will remain king.

    • linuxftw 5 years ago

      > I would doubt Boeing was accurately able to assess the actual risk of MCAS causing a catastrophic failure or else the decision to rush to market wouldn't have happened.

      That's most likely because upper management has been cutting staff and not investing in their people.

      They should be sued and fined out of business. Whoever picks up the pieces will know: "Don't cut corners, or it's absolute ruin."

      • bumby 5 years ago

        I have no clue about Boeing's staffing practices but it's often the case in large organizations that the people trying to prudently hold up a project because it's not ready are looked at less favorably by those with "go-fever".

        I agree they should be held accountable but there's also blowback from bankrupting one of a nation's major aerospace manufacturers

        • linuxftw 5 years ago

          > I agree they should be held accountable but there's also blowback from bankrupting one of a nation's major aerospace manufacturers

          So let there be blowback. Flight costs will go up if they have to, nbd. If they're too big to fail, then just nationalize them and get on with it.

          • bumby 5 years ago

            I meant blowback bigger than just consumers' pocketbooks. Issues related to national security, because the nation just lost one of its primary aerospace contractors.

            The "too big to fail" issue is an important one, though many Americans tend to hate the idea of nationalization because in their minds it's a social evil.

            • linuxftw 5 years ago

              Yeah, it's a tough lesson. We shouldn't put all of our eggs in one basket. We have other aerospace contractors anyway.

  • xigency 5 years ago

    > Certainly any company that might prioritize safety (as in the case of a people-carrier) or some other non financial focus may not be as attractive to investors, in some cases - particularly where human life is concerned - maybe that's ok.

    This reminds me of the Takata air bag recall. If your product is a life saving device, margins should not be the highest priority.

sisu2019 5 years ago

Sorry, I don't believe that our perspective on this is especially valuable or insightful. Anyone with an average IQ can explain in five sentences what went wrong with this plane. Yes, the new engines didn't fit the plane, it was fixed with software; well done.

But why? It's not profits. Boeing has been for profit from the start.

It's something more. We are getting bad at doing hard things. Here in Germany we can't seem to finish a new airport (BER). Or a new train station (Stuttgart 21). Or the new ICE trains. Or keep the autobahn bridges maintained.

And software? There is an article about problems with Win 10 updates every day now, it seems. And what about all these new SPA-type websites that pull down 30 megs of JavaScript in order to keep failing at basic tasks in new and exciting ways? On the latest Android there is a packet fragmentation bug that has gone unfixed for six months and counting. Oh, it only prevents people from using Amazon and Netflix, so no big deal.

I grew up in a time when we expected stuff to get better and better, but lately it seems we'd do well to just hang on to what we have.

  • theclaw 5 years ago

    It was profits. Airbus did effectively the same thing with the A320neo in 2010, and it became the fastest-selling commercial aircraft in history [0]. Boeing clearly was unhappy with that and needed to do something to compete.

    [0] https://en.wikipedia.org/wiki/Airbus_A320neo_family#Orders_a...

    • sisu2019 5 years ago

      Okay, let's back up and remember that you can't compete with a plane that crashes all the time. So the reason is clearly not profit; in fact, they will lose a big chunk of money from this, and that wasn't hard to predict at all.

      • bumby 5 years ago

        I think it's a mistake to imply the managers making this decision had our hindsight knowledge. Of course if they knew this would happen they would've taken a different course, for profits and other reasons.

        To your larger point, I think as systems get more complex it's much more difficult for management to make accurate risk/benefit decisions. Take the Shuttle Challenger disaster. In Feynman's report, the management estimated something along the lines of a 1-in-100,000 chance of catastrophic failure. I think the actual number was ultimately reported around 1-in-1,000. The exact numbers from memory may be off, but the point is that as systems get highly complex, increased interfaces and interactions lead to more failure modes. Understanding them all is really tough.

      • quickthrower2 5 years ago

        You are assuming long term strategic thinking. I guess that didn’t happen here.

  • Frost1x 5 years ago

    From my experience I disagree. It's not that we can't do hard things, it's that there's frequently more profitable approaches than taking difficult challenges. This is becoming systemic across our culture as ideas drawn from capitalism begin invading all aspects of our lives (including how we interact in personal relationships), not just business.

    Pushing quicker time to market and cost-saving solutions to the brink is part of the optimization problem businesses aim to solve.

    This search is done across various contributing costs, but inevitably, safety becomes one of the factors on the board. I have previous experience with the coal mining industry and I can attest that safety is frequently hit in the name of cost savings.

    When you think about it, you rarely converge on an optimization solution across a search space from one direction; you typically have to cross boundaries to find an optimum, especially for highly complex problems with many variables. Inevitably in this search, business interests will conflict with their profit motives as costs are cut down tighter and tighter in convergence. If we're lucky, it's a local optimum and some new technology or finding makes cost savings possible outside of cutting safety.

    • sisu2019 5 years ago

      Boeing was capitalist from the start. Why fail now in such a stupid way?

      • Frost1x 5 years ago

        Obviously Boeing didn't think they'd have planes falling out of the air (that's bad for business), but I suspect the drive to remain competitive and keep shareholders happy led those further down in the organization to push sloppy solutions (as the article described). That's been my lifelong experience working professionally in the US. It's hard to believe those design oversights mentioned were missed by multiple teams of people.

        As the article pointed out, Boeing did their best to keep these software system changes as quiet as they could, so they wouldn't have to go through a new certification process, which is extremely costly (less costly than having planes crash though). That's not something you do when safety is the top priority but it is when the priority is profit.

        My central point though is that inherently in the framework of capitalism, the endless pursuit to increase profits by reducing costs (and increasing revenue) will inevitably lead you to the opposite in order to find the optimums you seek. Sometimes, business choices that seem good lead to losses or choices that seem bad lead to gains. You often have to fail (cross the boundaries) in order to see where the boundaries of success lie for various cost saving attributes. All businesses play this game and safety is one of many factors always on the chopping block. Sometimes you have to push safety to see exactly how much investment in safety is truly needed.

rdiddly 5 years ago

Despite the possible negative consequences for our salaries, we need to work to remove the divide between the software people and the subject-matter experts. Those who use, but don't necessarily build, software, seem to place undue trust in "the computer" to be always right, whereas we all know the computer is just being instructed by ordinary human schmoes like us. And we (the schmoes) are reliant on what to me seems like a tiny bottleneck of 2-way communication with the SMEs.

It doesn't need to be binary technical/non-technical; everybody is, to some degree, technical. If you can use a knife, you're technical. And it doesn't need to be this rigid specialization, I do computers / I do aviation. If we removed that divide we might start to widen and deconstrict that bottleneck, and we might even start to find more people like this author who know both, and who don't need to have an aviation SME tell them to check more than one sensor for example, because when they're writing it, and coding the part where you set or decide the value for AOA, they automatically say to themselves "hey let's add a cross-check here like I always do when I'm flying." Specialization seems like a magic bullet but it also creates a big burden of oversight & communication, I guess is my point. It's like splitting the monolith of the human world into microservices, and with the same resulting problems.

cmurf 5 years ago

The author suggests the 737 is fly-by-wire, by saying there's no direct feedback in the control stick forces as they relate to control surface forces, but that forces are rather artificially presented by computer. That's simply not true.

Even in the case of MCAS, the side effect of stabilizer nose-down trim is to make elevator backpressure on the yoke more forceful. It is not a simulation. That's presumably its design goal, and why it was so simple, without any safeguards for overcorrection or a failed sensor.

I also don't like the suggestion that the plane is longitudinally unstable due to the engines. That is simply not consistent with FAR 25.173(a). A central feature of this stability requirement is that the plane will recover from a stall merely by releasing back pressure. Not all airplanes do this, because not all airplanes are designed that way nor are they required to be, but FAR 23 (normal category) and FAR 25 (air transport category) aircraft are required to exhibit this kind of stability as well as lateral static stability. Yet people keep saying the airplane isn't stable and MCAS makes it stable.

You can't have a damn switch that instantly makes the airplane unairworthy, with a damn airworthiness directive that tells pilots to solve one problem (runaway trim, or MCAS upset) by making the airplane unairworthy. If MCAS is there to make the plane airworthy, turning off autotrim makes the plane unairworthy. That's why I don't buy any claim that MCAS is there to make the plane airworthy, until there's a preponderance of evidence from reputable sources that includes an explanation how in the world such a thing is not a violation of FAR 25.

The sensible explanation is it's a stick force moderator, in order to ensure the FAA didn't require a type certification for this make/model derivative. A plane with a type certificate triggers a requirement in FAR 61 for the pilot to obtain a type rating to fly planes with that particular type certification. I do find the suggestion of conspiracy plausible, among Boeing, the FAA, and airlines, to avoid 737 MAX type certification different from prior 737s. Consistent with that, is this week's ass covering by an FAA board saying they see no reason for additional 737 simulator training for pilots to fly 737 MAX, i.e. paving the way for a software update only solution for the current problem.

_bxg1 5 years ago

"Neither such coders nor their managers are as in touch with the particular culture and mores of the aviation world as much as the people who are down on the factory floor, riveting wings on, designing control yokes, and fitting landing gears. Those people have decades of institutional memory about what has worked in the past and what has not worked. Software people do not."

I wonder if, once basic programming becomes a part of standard education, these kinds of problems will be mitigated. Right now if you're a developer, you're not anything else. So developers come in as outsiders to any given industry and have to learn about it. But software is applicable to nearly every industry, so we end up with lots of domain-specific code written by people who don't know those domains very well. What if the domain experts could not only speak the language of code, but could do some of the coding themselves?

  • hinkley 5 years ago

    I’m attacking this problem from the other side.

    I reject the notion that one must become a monk to be a programmer. There’s a happy medium between cave troll and brogrammer to be found.

    As developers we should have more hobbies. We should find the technical aspects of those hobbies, and we should be able to walk into a company serving that community and serve as a backup SME from day one. If Software Is Going to Eat the World, that is how it will be. Not from a bunch of smart stupid people “disrupting” industries they know nothing about.

    There’s a long tail of subject matters without a lot of money that could still use software. Getting software to them is going to take cheaper labor, like 2-person teams or possibly volunteer work.

  • Sharlin 5 years ago

    > What if the domain experts could not only speak the language of code, but could do some of the coding themselves?

    Some can, especially in the academia. It… sort of works but is certainly not optimal. The truth is, it is not enough to be a coder who knows something about the domain, and neither is it enough to be a domain expert who can code. You have to be an expert at both, like the author of TFA. Or at the very least, your team has to be.

    • _bxg1 5 years ago

      Could there not be domain experts who know how to express their general ideas in terms of code (pseudocode, even), paired with coding experts who can architect the overall software system? You'd still need software experts, but you wouldn't have just software experts.

      • Sharlin 5 years ago

        Yes, as I said: teams are a superorganism, a force multiplier. But a team of coders isn't enough, you need a close-knit team of people with diverse areas of expertise, and those people need to be great at communicating.

  • linuxftw 5 years ago

    That would be an overworked engineer IMO. Should the engineer that designs the air frame go weld it together?

    One issue: the system resetting and not honoring its maximum authority would not have been caught by a half-software engineer. That's definitely something the software team should have caught if there was any amount of reasonable QA involved.

    It's clear to me Boeing doesn't employ capable people on the software side.
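    A hedged sketch of that reset bug (values and names are illustrative, not Boeing's code): a per-activation authority limit is no limit at all if each activation forgets the previous ones.

```python
MAX_AUTHORITY_DEG = 2.5   # per-activation trim limit (illustrative value)
TRIM_STOP_DEG = 5.0       # physical stop of the stabilizer (illustrative)

def mcas_buggy(current_trim_deg, commanded_deg):
    """Each activation clamps only its own increment, so repeated
    activations can still walk the trim all the way to the stop."""
    increment = min(commanded_deg, MAX_AUTHORITY_DEG)
    return min(current_trim_deg + increment, TRIM_STOP_DEG)

def mcas_capped(current_trim_deg, commanded_deg, total_applied_deg):
    """Track cumulative MCAS authority across activations instead."""
    budget = max(0.0, MAX_AUTHORITY_DEG - total_applied_deg)
    increment = min(commanded_deg, budget)
    return current_trim_deg + increment, total_applied_deg + increment

# Three activations: the buggy version reaches the physical stop.
trim = 0.0
for _ in range(3):
    trim = mcas_buggy(trim, 2.5)
# trim is now 5.0 (at the stops); the capped version never exceeds 2.5
```

    The point of the sketch is only that the fix is a few lines of bookkeeping, which makes it all the more striking that QA missed it.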

    • tonyarkles 5 years ago

      Not necessarily go weld it together, but they should be going and spending time with the welders. Ben Rich’s memoir from the Skunk Works talks about this a fair bit. Around there, having the engineers close to the fabricators was one of their central tenets, and the engineers did occasionally roll up their sleeves in the shop.

      With the SR71, at one point when they were fighting with a nasty engine problem, Ben himself was going to go up on a flight so that the pilot could show him what was happening.

      • linuxftw 5 years ago

        I agree, they should work as closely together as possible. But it's still a team effort, not one person involved in large aspects of the process by themselves.

adreamingsoul 5 years ago

Something that stood out to me in this article was the author's opinion of how software developers are removed from the subject matter.

From my experience I tend to agree but I hope we can change this.

For context, I'm a software engineer and designer with a UX focus. As part of my process to design a system, interface, feature, or fix I first immerse myself into the world of the person who is using the software.

What's the environment that these people are working in? What are their motivations and frustrations? What allows them to be successful in their job?

Over the last couple of years I've had less and less time to answer those questions. Either because of a high workload or because management didn't see the value in spending time to research the people using our software. That trend is alarming and concerning to me. If we as software engineers are not aware of the people who are using our software, how are we solving the right problems?

  • mch82 5 years ago

    You’re asking great questions. Keep fighting for the time you need to answer them. I’m 13 years out of college & my perception is companies care more about stakeholder research than they did when I started working, so I think the overall trend is better.

  • linuxftw 5 years ago

    We're agile now. We don't have a solid view of the problems we need to tackle, we're just given a little user story that says "Plane noses down with these inputs" and we code it up, send it over the wall.

    The larger the org, the less likely the actual engineers that will design the system are invited to the planning and arch meetings. So when stupid ideas come up in those meetings, there's no one to say "no, that's dumb" because management only knows how to say yes to their boss.

    • mch82 5 years ago

      The Agile Manifesto calls for frequent conversations and demos with customers so developers understand customer needs and acceptance criteria. If you’re given a story card & there is no customer interaction, then there’s a chance the team is calling itself Agile without actually using Agile.

amluto 5 years ago

One thing I really don’t get about MCAS: even disregarding all its bugs, it seems like a really awful style of envelope protection. I’m neither a pilot nor an expert, but I can imagine several more reasonable strategies when the pilot pulls up too hard: ignore the problematic yoke input, offset the elevators a bit, or apply more force to the yoke. Moving the stabilizer out of trim seems totally wrong.

As an analogy: almost every car has envelope protection. But no car designer in their right mind would build an ABS system that responded to wheel lock by moving the brake pedal or loosening the master cylinder (and not putting it back!). Similarly, it would be crazy for a stability control system to counter oversteer by offsetting the steering wheel (and leaving it offset).

  • cjbprime 5 years ago

    Elevator probably isn't powerful enough to avoid the impending stall once you get there.

    • cjbprime 5 years ago

      ... and the 737 MAX control surfaces are mechanical. I don't think there is any ability to ignore yoke input! It's connected to the elevator with a cable. This isn't an Airbus-style fly by wire aircraft where input signals are being interpreted as suggestions for actions for a computer to take.

fogetti 5 years ago

Let me tell you one thing: I absolutely hate sloppy engineering. And by sloppy I also mean rushing. There is a class of engineers and engineering managers who think that they are rockstars because they can churn out code quickly. I am absolutely disgusted and sick of this attitude.

And why do I bring it up here? Because I suspect this kind of attitude was partly to blame for this fatal and tragic accident. Of course this is pure speculation. But my 12 years of experience in the software industry make me believe that's what happened.

Also a relevant talk by Uncle Bob: https://youtu.be/ecIWPzGEbFc

  • robertAngst 5 years ago

    >I absolutely hate sloppy engineering.

    >Of course this is pure speculation

    Cool ideas, but people have already figured out this was designed into the system. Sure, more testing can catch things, but how much money are you allowed to spend on every single feature?

    There isn't a 'right' answer to these questions; engineers are literally doing cutting-edge, never-before-done jobs.

    Everyone wants more time, more money, and better suppliers.

    You need a call to action, not just a YouTube video talking about groups not being responsible.

tlc1970 5 years ago

This article increases my concern that the primary issue here is a failure in process to ensure the safety of passengers before the 737 Max went to market.

We have world class expertise in the aviation industry, and there is no excuse for rushing a product to market without fully vetting it for safety.

Boeing, the FAA, and the airlines moved too quickly to place the 737 Max in the air and too slowly to ground the planes after the second crash. In both cases it appears that these decisions were influenced by an over-reliance on the capabilities of technology over people.

outworlder 5 years ago

> When MCAS senses that the angle of attack is too high, it commands the aircraft’s trim system (the system that makes the plane go up or down) to lower the nose. It also does something else: It pushes the pilot’s control columns (the things the pilots pull or push on to raise or lower the aircraft’s nose) downward.

Wait a minute. Since when does MCAS include a stick pusher? I haven't flown anything outside simulators yet, but AFAIK 737's do not have stick pushers of any kind.

  • ncmncm 5 years ago

    This is an error in the article. 737-MAX is not a fly-by-wire system. The forces felt by pilots are derived directly from aerodynamic forces on the control surfaces.

HankB99 5 years ago

I'm unclear on how moving the engine up causes an application of power to make the attitude tend to go nose up. I would expect the opposite to happen. I think something else must have changed, such as the center line of the engine relative to the center line of the plane. Or did moving the engine forward at the same time cause this effect?

  • Declanomous 5 years ago

    The new engine is more powerful, and at high angles of attack the nacelles also generate lift. The added power, along with the greater displacement from the center of lift/drag causes the plane to pitch up more easily, and the lift from the nacelle causes the controls to get lighter as you approach stall speed.

    The FAA requires the controls to get heavier as you approach a stall. MCAS was primarily designed to address this flaw, which is why the added lift from the nacelles is an issue.

  • leoedin 5 years ago

    I don't think it does. Other sources I've read suggest the larger nacelle introduces a pitch up during certain flap configurations and at a high angle of attack. It's not the engine thrust that's the problem, but the drag due to the large and further forward nacelle. That's the justification for the MCAS software.

    https://theaircurrent.com/aviation-safety/what-is-the-boeing...

    • ncmncm 5 years ago

      It's both. The engine thrust center is also farther below the center of pressure of the airframe. And it can apply a lot more thrust than the original engines.

      The point about lift from the engine nacelles was that it acts as positive feedback: if you are already pitched too high, it acts to worsen the problem. And, because it is farther forward, it has a greater lever arm to act.

      • HankB99 5 years ago

        Thanks, those factors make sense.

  • mannykannot 5 years ago

    I think it is true to say that all low-slung engines produce a pitch-up moment that increases with power, which is a desirable trait, in moderation (for one thing, if the power fails, pitching up would be more dangerous than pitching down.) Also, all engine nacelles produce a small amount of lift when at an angle to the incident air. It's just that the 737 MAX has become so different from the original 737 that this has finally become a problem - the bigger engines further forward were the last straw (and it is the lift, specifically, that is the problem, hence a solution that takes angle of attack as its input.)

  • matt4077 5 years ago

    Forward, where the wing is higher off the ground. The rest is like a skateboard suddenly accelerating, which will tend to pitch you nose up (and on the ground).

  • fwip 5 years ago

    They moved it forward so they could move it up, but the centerline of the engine was still under the original engine's centerline.

torgian 5 years ago

“ I believe the relative ease—not to mention the lack of tangible cost—of software updates has created a cultural laziness within the software engineering community. Moreover, because more and more of the hardware that we create is monitored and controlled by software, that cultural laziness is now creeping into hardware engineering—like building airliners. Less thought is now given to getting a design correct and simple up front because it’s so easy to fix what you didn’t get right later.”

This is why I feel the software industry has too many problems with security, performance, etc.

I’d like to think that engineers care how good and efficient their code is. But too often, it’s up to managers or customers how quickly software needs to be completed.

This introduces bugs, incomplete features, and (in the case of Very Important Things that keep you Alive) potentially dangerous failures.

It sounds like Big Corp was trying to push that laziness and lack of foresight into other engineering disciplines. If so, how many of our cars, planes, and other items that potentially directly affect our lives are affected by mechanical design flaws and software errors?

Hopefully the industry gets its shit together as a whole. As it stands, if I ever work on anything that affects a life, I’m damn well blowing a whistle if I feel like something is off.

PaulAJ 5 years ago

One issue about the "bitey dog" MCAS not mentioned was the story of Air France 447 (https://en.wikipedia.org/wiki/Air_France_Flight_447). That's the one that stalled into the Atlantic in 2009. The flight crew, plus a pilot who was riding in the back seat, spent minutes trying to debug the flight computer, and only realised too late that the copilot was pulling back on the stick the whole time, keeping the aircraft stalled.

This happened because of mode confusion: the aircraft computer realised it had compromised sensors and had switched to "alternate law", in which the computer would not override a stall. I have no doubt that Boeing knew about this incident and did not want to create a repeat.

euske 5 years ago

Slightly off-topic, but I have generally found IEEE Spectrum to have more interesting articles than Communications of the ACM. Maybe they're more inclined to the industrial side, whereas CACM is more academic?

zerogvt 5 years ago

"So the FAA said to the airplane manufacturers, “Why don’t you just have your people tell us if your designs are safe?”" This

tlc1970 5 years ago

Frankly, as a traveler, I have lost trust in Boeing, the FAA, and the airlines.

This article underscores my concerns that the 737 Max was rushed to the market too quickly, and that there was an unacceptable delay in grounding these planes after two crashes.

The expertise of the airline and software industries is world class. However, the process by which the 737 Max was determined to be safe has failed us as travelers.

salawat 5 years ago

Couple pedantic quibbles, forgive me:

>I will leave a discussion of the corporatization of the aviation lexicon for another article, but let’s just say another term might be the “Cheap way to prevent a stall when the pilots punch it,” or CWTPASWTPPI, system. Hmm. Perhaps MCAS is better, after all.

It's not actually just intended for when pilots pour on the thrust; it's for any time they pour on the AoA. There need not be any throttle change involved to bring this about. The most readily imagined example, however, is low speed with a large throttle increase, like you'd have if you were aborting a landing attempt to go around.

>When MCAS senses that the angle of attack is too high, it commands the aircraft’s trim system (the system that makes the plane go up or down) to lower the nose. It also does something else: It pushes the pilot’s control columns (the things the pilots pull or push on to raise or lower the aircraft’s nose) downward.

Nothing I've read characterizes MCAS as having an inbuilt stick pusher. I've mentioned before that MCAS shares a spot with stick pushers in terms of what they are trying to do and the problem they are employed to solve, but MCAS does not actively put force on the stick through an extra mechanism with the intent of actuating a control surface deflection and alerting the pilot through the haptic response, which is the defining characteristic of that type of system as I understand it. All MCAS does is modify the trim, which has the effect of passively modifying the flight characteristics of the aircraft.

>In the 737 Max, like most modern airliners and most modern cars, everything is monitored by computer, if not directly controlled by computer. In many cases, there are no actual mechanical connections (cables, push tubes, hydraulic lines) between the pilot’s controls and the things on the wings, rudder, and so forth that actually make the plane move. And, even where there are mechanical connections, it’s up to the computer to determine if the pilots are engaged in good decision making (that’s the bitey dog again).

As far as I am aware, Boeing has maintained manual reversion with regards to the trim system and yoke in the 737 MAX 8, i.e. there are direct connections to the control surfaces from the pilot's hand-actuated controls. Boeing's concession to FBW is implementing parallel, automation-managed control circuits that allow for electrically driven manipulation of control surfaces, fed back to the pilot through the mechanical linkage. The envelope-protection -> bitey-dog analogy in this case is still an accurate characterization.

>But it’s also important that the pilots get physical feedback about what is going on. In the old days, when cables connected the pilot’s controls to the flying surfaces, you had to pull up, hard, if the airplane was trimmed to descend. You had to push, hard, if the airplane was trimmed to ascend. With computer oversight there is a loss of natural sense in the controls. In the 737 Max, there is no real “natural feel.”

As far as I can ascertain, there still is "natural feel" in regards to pitch control. There are no hydraulic boosters in place. The problem, though, is that MCAS actually hides the aberrant behavior at high AoA from the pilot by intentionally down-trimming the plane. This gets the job done (by some definitions), but does it through the trim system (which operates automatically often enough in other circumstances that the pilot may mistake the behavior for some other system) and without doing anything to the stick (the primary instinctual control for the plane).

Other than those quibbles, this is a beautiful write-up that is a treat to read. The Supplemental Type Certification section was also informative for me, as it illustrates unambiguously that there was a process to handle this exact type of augmentation which was sidestepped by Boeing for whatever reason.

Bravo!

ggm 5 years ago

DNS: changing the engines mid-flight.

We've used this analogy.

torgian 5 years ago

But did they make it mobile first?

Apparently not.

stevespang 5 years ago

EXECUTIVE SUMMARY: So Boeing produced a dynamically unstable airframe, the 737 Max. That is big strike No. 1. Boeing then tried to mask the 737’s dynamic instability with a software system. Big strike No. 2. Finally, the software relied on systems known for their propensity to fail (angle-of-attack indicators) and did not appear to include even rudimentary provisions to cross-check the outputs of the angle-of-attack sensor against other sensors, or even the other angle-of-attack sensor. Big strike No. 3.
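The cross-check missing in strike No. 3 is not exotic. A minimal sketch of the idea (thresholds and names are mine, not Boeing's; the actual fix reportedly does compare both vanes):

```python
def aoa_valid(aoa_left_deg, aoa_right_deg, max_disagree_deg=5.5):
    """Cross-check the two redundant angle-of-attack vanes. If they
    disagree beyond a threshold, neither reading is trustworthy."""
    return abs(aoa_left_deg - aoa_right_deg) <= max_disagree_deg

def mcas_should_engage(aoa_left_deg, aoa_right_deg, stall_threshold_deg=14.0):
    if not aoa_valid(aoa_left_deg, aoa_right_deg):
        return False  # sensor disagreement: fail safe and alert the crew
    return max(aoa_left_deg, aoa_right_deg) > stall_threshold_deg
```

With a check like this, a single vane reading an absurd 74.5 degrees against a sane reading from the other side would disengage the system rather than drive the trim.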

crimsonalucard 5 years ago

The irony of software is that it is the only discipline in engineering where results can be proven with logic. Yet we still test it as if it were a black box.

  • teddyh 5 years ago

    I refer you to Joel Spolsky about formal proofs of programs:

    […]

    So in the first day of that class, Dr. Zuck filled up two entire whiteboards and quite a lot of the wall next to the whiteboards proving that if you have a light switch, and the light was off, and you flip the switch, the light will then be on.

    The proof was insanely complicated, and very error-prone. It was harder to prove that the proof was correct than to convince yourself of the fact that switching a light switch turns on the light. Indeed the multiple whiteboards of proof included many skipped steps, skipped because they were too tedious to go into formally. Many steps were reached using the long-cherished method of Proof by Induction, others by Proof by Reductio ad Absurdum, and still others using Proof by Graduate Student.

    For our homework, we had to prove the converse: if the light was off, and it’s on now, prove that you flipped it.

    I tried, I really did.

    I spent hours in the library trying.

    After a couple of hours I found a mistake in Dr. Zuck’s original proof which I was trying to emulate. Probably I copied it down wrong, but it made me realize something: if it takes three hours of filling up blackboards to prove something trivial, allowing hundreds of opportunities for mistakes to slip in, this mechanism would never be able to prove things that are interesting.

    https://www.joelonsoftware.com/2005/01/02/advice-for-compute...
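    For what it's worth, proof assistants have come a long way since that class. The light-switch example is a few lines of Lean 4 today (a toy model, with the light as a Bool):

```lean
-- Toy model: the light's state is a Bool, flipping the switch negates it.
def flip (light : Bool) : Bool := !light

-- The two whiteboards: off, then flipped, is on -- true by computation.
theorem flip_off_is_on : flip false = true := rfl

-- The homework direction, for flip itself: if flipping some state
-- produced `on`, the state must have been `off`.
theorem flipped_to_on : ∀ s, flip s = true → s = false
  | false, _ => rfl
  | true,  h => nomatch h
```

    The harder version of the homework (proving that a flip is the *only* way the light could have changed) still requires axiomatizing what operations exist, which is where the real modeling work lives.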

    • nextos 5 years ago

      I have a MSc in formal methods, and this is really misleading. Surely you cannot make some proofs easily. However, how did Airbus verify that some errors simply do not exist in their fly-by-wire software (which is ~100 KLOC implemented in a subset of C)? (I'm sure Boeing also employs these techniques internally).

      Using abstract interpretation. There are tons of formal methods, ranging from type systems to formal proofs. Lots of compromises can be made to make them practical and useful for a particular domain. Look into [1,2] for some quick introductory examples. Going straight into formal proofs is in general a really bad idea.

      I have worked on railway control systems, with really nasty potential race conditions and managed to prove the absence of large classes of errors. Then derived implementations formally. It's really not that hard. There's even a subfield of CS looking into verifying formal properties of biological systems [3], which are really complex.

      [1] http://adam.chlipala.net/frap/

      [2] http://www.concrete-semantics.org/

      [3] http://lucacardelli.name/Papers/Abstract%20Machines%20of%20S...

      • teddyh 5 years ago

        I wonder if anyone can follow Feynman’s style of proof?

        By the end of that summer of 1983, Richard had completed his analysis of the behavior of the router, and much to our surprise and amusement, he presented his answer in the form of a set of partial differential equations. To a physicist this may seem natural, but to a computer designer, treating a set of boolean circuits as a continuous, differentiable system is a bit strange. Feynman's router equations were in terms of variables representing continuous quantities such as “the average number of 1 bits in a message address.” I was much more accustomed to seeing analysis in terms of inductive proof and case analysis than taking the derivative of “the number of 1’s” with respect to time.

        http://longnow.org/essays/richard-feynman-connection-machine...

      • teddyh 5 years ago

        As explained in this talk by Robert Martin, proving programs correct was the big goal which Edsger W. Dijkstra tried to reach by eliminating GOTO in favor of structured programming (which is essentially having if-else statements, loops and iteration as part of a language as replacement for all practical uses of GOTO).

        https://www.youtube.com/watch?v=SVRiktFlWxI#t=2h9m38s

        Software could be proven correct, but we abandoned that, just gave up on it, it’s too hard. But we can test it. We can use science, and we can write tests that demonstrate that the software is not failing. We treat software like a science, not like mathematics.

        — Robert Martin

        • nextos 5 years ago

          I think Robert Martin doesn't fully grasp the field of formal methods and program semantics, which has evolved a lot since the early 1980s. Paraphrasing Alan Perlis, beware of the Turing tar pit, where everything is possible but nothing of interest is easy.

          So one trick is to work on DSLs with restricted semantics that are good enough for your domain. That makes proofs plus other formal techniques, and hence security guarantees, much much easier.

          But if you insist on Turing complete languages, as I explained above, you can e.g. build abstract interpreters or data flow analyses that are able to prove really sophisticated things. I have implemented a C abstract interpreter for an industrial client that among many other things detected whether you were making potential out of bounds accesses to arrays, or potentially using uninitialized pointers. Of course it erred on the safe side. But it was really precise (few false positives). Not a walk in the park as it relies on Galois connections. See the seminal Cousot & Cousot 1977 paper [1].

          [1] https://www.di.ens.fr/~cousot/COUSOTpapers/POPL77.shtml

  • bollu 5 years ago

    I'm not sure you understand how difficult it is to prove software correct. I've written a decent amount of Coq code. It's quite bonkers how much proof one needs to write to get anything done. For reference, the certified compiler CompCert's code base is something like 10% code and 90% proofs.

    • umvi 5 years ago

      Even then it doesn't matter if your software is "correct" if you don't understand/have not captured all the requirements.

      You could write a formally verified MCAS system in Coq or whatever that still kills people because you didn't consider the case of sensor failure (because that wasn't in the requirements). Or because you didn't consider the case of double sensor failure or a combination of extremely rare hardware failures (that wasn't in the requirements either!).

      Your code is mathematically perfect, yes! But it was built with the incorrect assumption that the hardware is perfect as well!

      I'll take a robust set of tests written by someone thinking outside of the box in terms of what could fail, how it will be used in operation, etc. over a piece of software "guaranteed unsinkable" because it is formally verified.

      • AnimalMuppet 5 years ago

        This. MCAS was the wrong thing to build. The wrong thing, correctly built, is still wrong, no matter how perfect the formal verification.

        • no-s 5 years ago

          So true. The problem is in the system at a higher level. It is not really a software defect that killed all those folks.

    • tluyben2 5 years ago

      It would ensure people thought about it a lot longer and harder than without those proofs, so the code would get a whole lot more critical thinking done over it. That seems worth it in some areas, like airplanes.

      • mikeash 5 years ago

        The trouble with formal proofs is that you can only prove that the code does what you say it does, not that it does what it needs to do.

        MCAS performed as designed. Only reading one AoA sensor, acting on bad readings, not rejecting values that are clearly out of bounds, operating continuously without any limits on its pitch authority, all of this was how it was designed to work. You could have proved MCAS “correct” and ended up with the exact same result.

        A lot of software problems are due to discrepancies between the design and the implementation, of course. But it’s not a panacea.

        • tluyben2 5 years ago

          Yes, agreed, but for life-or-death situations, as a programmer, I would prefer to use all tools at our disposal; besides the cost, spending this time probably does improve the quality.

          • mikeash 5 years ago

            I think this is likely to be true, but I do wonder if there might be a better way to spend those resources. For this particular example, you’d have been better off putting more money into flight testing with various sensor errors.

            • tluyben2 5 years ago

              Well, yes. But, and who knows if it is true, you would think you could work out that sensor errors need to be tested, and maybe figure out that you need to cover three sensor errors rather than two, for instance, to handle all cases; formal systems can help you model and reason about that, and come up with the cases that are not covered.

              I think the point these disasters show, along with some others in the past (wasn't there a recalled Japanese car with a software issue?), is that a few million more for the right resources will save you a lot more down the line. The problem is that this is not really a statement we can prove for formal verification, because we do not have enough to compare with; I have a very strong feeling it would make quite a significant difference, if not through the proofs themselves then through the sheer number of hours the proof writers spent thinking about the system by the time of delivery.

              • mikeash 5 years ago

                Two sensors is enough if you can shut off the system when they disagree. MCAS didn’t doom the plane if it shut off, so that was enough. The problem was that they didn’t do this, and instead used only one sensor at a time. Basic sense and standard practice tells you not to do that. I don’t know if more formality would have helped them not do something so boneheaded.

                However, this looks like an unusual case. Based on the fact that they had such a dumb design for this system, that shouldn’t have passed muster in any aviation context no matter your software methodology, it’s probably not representative enough to draw a larger lesson about engineering. The larger lesson seems to be business and regulatory.

                Regarding Japanese cars, Toyota went through a big thing with unintended acceleration that got quite a few people killed. Software was suspected, but the cause was ultimately determined to (mostly?) be a combination of drivers mixing up the accelerator and the brake (happens more often than you might think) and unsecured floor mats pushing the accelerator down. I bought a Toyota not too long after this and the dealer made a special point to show me how the floor mat attached to the floor and that I must be sure it was solidly connected.

                Their software was audited and apparently it was really badly made. It was described as “spaghetti-like” and had global variable abuse, ignored errors, failed to restart unresponsive tasks, and had potential memory corruption problems. It’s quite possible that this really was the cause, and a jury even found this to be the cause in a civil trial over one of the crashes, but it was never definitively linked.

      • stefan_ 5 years ago

        It also makes absolutely no sense in any software that works entirely with, and relies on, external mechanical hardware sensors. There is a reason people use this to certify that their operating system or algorithm works: you can forever stay in a bubble where your processor is just a big math machine, with no external inputs and no external outputs. We have "certified secure" operating systems that nonetheless are trivially attacked through something like rowhammer.

        How do you solve it? Assume your AoA sensor is always correct? Congratulations, MCAS is a provably correct solution! Make the sensor behavior more complex? Sorry, the problem is now intractably complex (and still doesn't model the actual hardware).

    • zestyping 5 years ago

      If the proof is more complex than the software, I don't see how that gains you anything. It's harder to verify that you've proved the right thing than to verify the software itself, so how does that help?

      • crimsonalucard 5 years ago

          In a type-checked language, your software is guaranteed to have no type errors. How hard was that?

        • zestyping 5 years ago

          That's a great example! It illustrates the point in a better way than I tried to.

          A necessary part of why types are so useful is that understanding the types is easier than understanding the code itself.

          If the types become so complex that they start to get harder to understand than reading your code (e.g. C++ template errors can sometimes get this bad), then they are no longer helpful.

    • crimsonalucard 5 years ago

      No, I get it, but still... it's ironic. For software on a plane, though, I would want 1% code and 99% proofs.

      • scrape_it 5 years ago

        Well, to support the top parent's comment, here's what John Carmack had to say about SAAB's fighter jet:

        > The fly-by-wire flight software for the Saab Gripen (a lightweight fighter) went a step further. It disallowed both subroutine calls and backward branches, except for the one at the bottom of the main loop. Control flow went forward only. Sometimes one piece of code had to leave a note for a later piece telling it what to do, but this worked out well for testing: all data was allocated statically, and monitoring those variables gave a clear picture of most everything the software was doing. The software did only the bare essentials, and of course, they were serious about thorough ground testing.

        > No bug has ever been found in the “released for flight” versions of that code.

        So it's not quite that tests are useless because the system is a black box; it's more that, yes, it is possible to write software in a way that facilitates showing the code will do what it says it will do.

        On another note, I do feel sorry for the engineers, but Airbus also has a similar MCAS-like system; the thing that saves them is an extra angle-of-attack sensor, whereas I believe the Boeing relied on just one or two. Airbus has 3 AoA sensors, and if 2 agreed, that was the data fed into the system.

        It still boggles my mind that they would fit a bigger engine when the original plane was not built for it at all. This was a 100% profit-oriented move to prevent Airbus from taking Boeing's majority market share; it's highly ironic that the opposite resulted.

        Even more infuriating is that Boeing passed it off as the exact same plane, claiming not much extra training was necessary, with the FAA giving its blessing; now the FAA is caught up in the mess too.

        Canada, for instance, is now looking to the EU for an independent audit, as any credibility the FAA has built up over the past years has taken a significant hit.

        • mongol 5 years ago

          Interesting quote. Seems it was actually a quote of Henry Spencer by Carmack. http://number-none.com/blow/john_carmack_on_inlined_code.htm...

          Note, however, that there were several software-related crashes in the early days of the Gripen. They might not be related to exactly this code, but to the problem of regulating the feedback loops that manage the plane's purposeful instability.

          • scrape_it 5 years ago

            huh? interesting, do you have more info? I love reading stuff like this. I don't know much about the Gripen, but it is my all-time favorite after the F-16.

            Off topic but here's a neat youtube video of Gripen jets undergoing rapid turn-around, on a normal street road with just a handful of people, and special tools for the crew:

            https://www.youtube.com/watch?v=49L9BlYQSjw

            • mongol 5 years ago

              The first crash: https://www.youtube.com/watch?v=k6yVU_yYtEc

              Second crash (over middle of Stockholm) https://youtu.be/mkgShfxTzmo?t=133

              From Wikipedia: During the test programme, concern surfaced about the aircraft's avionics, specifically the fly-by-wire flight control system (FCS), and the relaxed stability design. On 2 February 1989, this issue led to the crash of the prototype during an attempted landing at Linköping; the test pilot Lars Rådeström walked away with a broken elbow. The cause of the crash was identified as pilot-induced oscillation, caused by problems with the FCS's pitch-control routine.[22][33][34]

              In response to the crash Saab and US firm Calspan introduced software modifications to the aircraft. A modified Lockheed NT-33A was used to test these improvements, which allowed flight testing to resume 15 months after the accident. On 8 August 1993, production aircraft 39102 was destroyed in an accident during an aerial display in Stockholm. Test pilot Rådeström lost control of the aircraft during a roll at low altitude when the aircraft stalled, forcing him to eject. Saab later found the problem was high amplification of the pilot's quick and significant stick command inputs. The ensuing investigation and flaw correction delayed test flying by several months, resuming in December 1993.[22]

              • scrape_it 5 years ago

                can't believe the first one didn't result in a ball of fire...

                the second was even more shocking, all of a sudden it looked like he was trying a Cobra maneuver until he ejected! that's insane.

        • cjbprime 5 years ago

          Isn't Airbus's a stick pusher, though? i.e. it puts force on the stick that is hard to overpower but can be overpowered, so the pilot is still the control authority?

          That's extremely different to a system that silently severely mistrims you, then tells you to disable your ability to trim electrically before you can fix the mistrim, as if that even makes sense, all while overpowering the yoke with aerodynamic load.

          For the record, Airbus's three sensors haven't prevented crashes completely. Here's one where two sensors were damaged, causing the correct sensor to be treated as erroneous and lose the quorum vote:

          https://en.wikipedia.org/wiki/XL_Airways_Germany_Flight_888T

        • drewrv 5 years ago

          What saves Airbus is the extra sensor on their equivalent system, and a stronger regulatory environment, and most importantly a jet that has the engines mounted in the proper place, so it doesn't have a tendency to stall.

        • chopin 5 years ago

          I don't know about modern systems, but PLCs (industrial controllers) worked exactly this way when I worked with them in the 90s. You can introduce subtle bugs with these too, though.

        • rightbyte 5 years ago

          Interesting about the SAAB fighter. Essentially the opposite of spaghetti code? One really long spaghetti straw that is uncooked?

          I like really long function bodies, so I get turned on by the idea.

  • WalterBright 5 years ago

    Proving that software matches the spec is one thing, proving that the spec is free of bugs is quite another.

    • bdamm 5 years ago

      Indeed. It is normal that implementation finds bugs in the spec. Anecdotally, I have never found a flawless RFC. Every real-world implementation of an RFC I've ever done has turned up some discrepancy when it came to compatibility testing.

    • crimsonalucard 5 years ago

      Yes. But for critical software systems, what excuse does one have not to prove that the software matches the spec? There is no excuse not to test the spec, but equally, what excuse is there not to prove the software?

  • jsiepkes 5 years ago

    Well, we can prove that some things behave under certain conditions in a certain way. We can't prove the absence of bugs.

    • crimsonalucard 5 years ago

      No, we can literally prove it correct based off of a specification.

      https://en.wikipedia.org/wiki/Correctness_(computer_science)

      In science nothing can be proven, but in the world of logic and math, things can be proven. Bugs can arise where programs intersect with the real world.

      • tluyben2 5 years ago

        If the spec is complex you are probably proving the bugs in the spec are indeed correctly implemented...

        Edit: that said, spending the formal spec time will probably reduce the number of bugs far below than what we find normal now. But money...

      • softwaredoug 5 years ago

        Yeah, and then the specification at that level of specificity IS the code. How do you prove the specification is bug-free?

        • crimsonalucard 5 years ago

          Define what a bug in a spec is. I defined the program to do one thing and one thing only. What does it mean when I have a bug in my definition? Is there a definition for the definition? Makes no sense.

          • throwaway2048 5 years ago

            You can argue that no code ever has bugs by that logic, after all you defined the program to do one thing, and it did it, it was merely human expectation that was in error.

            • crimsonalucard 5 years ago

              That's what I'm arguing. Code can be proven correct against a formal spec.

              However, because we don't do proofs in software, there are bugs.

          • aassddffasdf 5 years ago

            It means that the one thing you defined it to do was not the right thing.

      • AnimalMuppet 5 years ago

        > Bugs can arise where programs intersect in the real world.

        Well, the context is flight control software for airplanes. If that code is bug-free but doesn't intersect the real world, that's rather useless. And if it intersects the real world but therefore is not bug-free, that's not a great argument for proofs of formal correctness.

        But all of that is kind of beside the point. The MCAS specification was the wrong thing. Proving the implementation correct is useless. (You asked what is a bug in a spec? MCAS shows you the answer. Only taking input from one sensor is the wrong thing. Repeatedly applying nose-down is the wrong thing. It seems like I'm missing one or two more.)

        • crimsonalucard 5 years ago

          You can prove the code correct against the spec. However you can't prove that the spec will do what you expect it to do because the spec is where the intersection with the real world occurs. Therefore, you must test the spec. Everything that falls under the spec can be formally verified to 100% correctness.

  • alacombe 5 years ago

    Proving something correct is only part of the problem. A random bit flip (no matter the amount of error correction and physical hardening) and your proven-correct software goes amok. And this is just one of the many issues you can encounter dealing with the physical world.

    • samatman 5 years ago

      Ironically, the solution to this in spacecraft, where radiation is a given, is 2-of-3 voting.

      The system which should have been used for attitude detection in the 737 MAX, and wasn't.

      • throwaway2048 5 years ago

        Doesn't eliminate the single point of failure; after all, a bitflip in the voting mechanism can also result in error.

    • crimsonalucard 5 years ago

      Yeah, you're right. Can't completely get rid of tests but the irony still exists.

  • threatofrain 5 years ago

    Mathematicians often don't rely on the exact correctness of a proof's description, instead relying on a rough sketch of strategy.

  • mehrdadn 5 years ago

    Isn't digital logic testing a thing for hardware?

    • Gibbon1 5 years ago

      Complex hardware often contains bugs many of which map to software bugs. Hardware though tends to be highly constrained in how it's operated.

      Man: Doc it hurts when I bend my knee this way.

      Doc: Well don't do that.

      EE: The hardware state machine runs into the weeds when given these inputs.

      Tech Rep: Yeah don't do that.