Editor’s Note: this post was penned offline early Friday evening, before the author knew that the issue had been resolved and that Fiserv was processing the backlog as of 16:45. Now that we are online to retrieve them, we have chosen simply to add some links. The post follows.
As I write, on Friday, 2025-05-02, Fiserv has been offline all day, or substantially all day. The company acts as a third party for a number of banks, providing wire transfers, ACH, and/or direct deposit services, and possibly even online/mobile banking. A number of large banks, including Ally Bank, Bank of America, Capital One, and Synchrony, have been affected in some way by this outage, as was my regional bank.
I don’t know anything about the root causes yet. It would be irresponsible to speculate about those causes, so of course I am going to.
Ways to Deadlock a System
This incident occurred on the first Friday of a month, typically a heavy day for payroll direct deposits, so the system might simply have collapsed under load. Sometimes the failure of one part of a service cascades into taking the whole thing down, and leaves it in a state where it also cannot be restored piecemeal.
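To make the cascade concrete, here is a toy sketch; the node counts, capacities, and even-redistribution rule are invented for illustration, not anything known about Fiserv’s systems. When one node in a pool fails, the survivors inherit its traffic; if that pushes any of them past capacity, they fail too, and the remainder inherit even more.

```python
# Toy model of a cascading overload. When a node fails, its share of the
# traffic is redistributed evenly across the survivors; any survivor pushed
# past its capacity fails too, and the process repeats.
def simulate_cascade(capacities, total_load, first_failure):
    alive = dict(enumerate(capacities))
    failed = [first_failure]
    del alive[first_failure]

    while alive:
        per_node = total_load / len(alive)  # naive even redistribution
        overloaded = [i for i, cap in alive.items() if per_node > cap]
        if not overloaded:
            return failed                   # the survivors held
        for i in overloaded:
            del alive[i]
        failed.extend(overloaded)
    return failed                           # total outage

# Ten nodes of uneven capacity carry 1000 units comfortably (100 each),
# but losing a single node sets off a chain reaction that takes out the rest.
print(simulate_cascade(
    capacities=[105, 110, 115, 120, 125, 130, 135, 140, 145, 150],
    total_load=1000,
    first_failure=0,
))
```

The end state is the nasty part: in this toy model, if client traffic comes right back, no node brought up by itself can carry the full load, which is exactly the “cannot be restored piecemeal” trap.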
Core infrastructure with (unintended) circular dependencies could have gone down. If your higher-level services depend on your file servers, but the file servers accidentally started depending on one of those higher-level services, great pain awaits once both of those go down simultaneously. If most pieces of the cycle usually work, caching information from each other to cover a minor failure/maintenance window, this problem can remain latent for a surprisingly long time.
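For a sketch of why the cycle hurts during recovery, consider writing the dependencies down and asking for a restart order; the service names and graph below are hypothetical, not anything known about Fiserv. A topological sort either yields a safe piecemeal restore order or reports a cycle, in which case no such order exists.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency graph: each service maps to the services that
# must already be up before it can start.
deps = {
    "file-server":  {"auth"},          # the accidental, unintended edge
    "auth":         {"database"},
    "database":     {"file-server"},   # the database lives on the file server
    "web-frontend": {"auth", "database"},
}

try:
    # A valid topological order is exactly a piecemeal restore plan:
    # bring up each service only after everything it needs is running.
    print("restore order:", list(TopologicalSorter(deps).static_order()))
except CycleError as err:
    # With a cycle, no such order exists; everything in the cycle has to
    # come up together, or with caches and fallbacks already primed.
    print("no piecemeal restore possible, cycle:", err.args[1])
```

In normal operation the cycle is invisible, because something in it is always already running; it only bites on the day everything is cold at once.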
There have also been notable outages where a company lost a component that both the external service and internal administration depended on, which impeded the repair because the company no longer had access themselves. They couldn’t run their monitoring tools or issue commands until they fixed the external site, but those were exactly the actions they’d take to fix it.
It’s Probably None of Those
The time to repair points to a more complicated failure than any of those simple scenarios. Unless something happens like “the physical hardware is gone, and the data with it,” or the conditions are “we are a small team of three people,” even large-scale failures are usually resolvable within a few hours.
Regardless of what went wrong technically, management is also culpable. Whether corners were cut to “accept” more risk or in the name of efficiency, risk management failed. If staffing was reduced to cut operational expenses without considering institutional knowledge, that was a spectacular failure of risk management. The same is true if management isn’t willing to pay for expert oversight of its systems. (Loosely related, but search/scroll to the “Schedule prediction” heading.)
It’s easy to have 20/20 hindsight, and it’s easy to criticize from an armchair. And it’s sometimes hard to remember that running reactively, rather than proactively, looks “efficient.” After all, if you create and test a plan for some contingency and never suffer that contingency, then logically, all that effort had zero return on investment.
But one would hope that, with possibly billions of dollars for millions of customers at thousands of client institutions on the line, a company could afford to think about some of this.
Unless I just take my own job too seriously?