All systems down

Among the 30-odd CIOs who serve Boston's world-famous health-care institutions, John Halamka is a star among stars. He has been CIO of the CareGroup health organization and its premier teaching hospital—the prestigious Beth Israel Deaconess Medical Center—since 1998. He helps set the agenda for the Massachusetts Health Data Consortium, a confederation of executives that determines health-care data policies for New England.

Until 2001, the 40-year-old Halamka also worked as an emergency room physician, but he gave that up to take on the additional responsibilities of being CIO of Harvard Medical School in 2002. However, as a globally recognized expert on mushroom and wild plant poisonings, he is still called when someone ingests toxic flora.

All of this has earned Halamka a considerable measure of renown. For two years running, InformationWeek named Halamka's IT organization number one among hospitals in its yearly ranking of innovative IT groups. In September 2002, CareGroup was ranked 16th on the InformationWeek 500 list.

Two months later, Beth Israel Deaconess experienced one of the worst health-care IT disasters ever. Over four days, Halamka's network crashed repeatedly, forcing the hospital to revert to the paper patient-records system it had abandoned years earlier. Lab reports that doctors normally had in hand within 45 minutes took as long as five hours to process. The emergency department diverted ambulance traffic for hours over the course of two days. Ultimately, the hospital's network would have to be completely overhauled.

This crisis struck just as health-care CIOs are extending their responsibilities into clinical care. Until recently, only ancillary systems such as payroll and insurance had been in the CIO's purview. But now, in part because of Halamka and his peers, networked systems such as computerized prescription order entry, electronic medical records, lab reporting and even Web conferencing for surgery have entered the life of the modern hospital. These new applications gave health-care CIOs something to boast about, and Halamka often did, even as the network that supported them was taken for granted.

"Everything's the Web," Halamka says now. "If you don't have the Web, you're down."

Until last Nov. 13, no one, not even Halamka, knew what it really meant to be down. Now, in the wake of the storm, the CIO is calling it his moral obligation to share what he's learned.

"I made a mistake," he says. "And the way I can fix that is to tell everybody what happened so they can avoid this."

Sitting in his office three weeks after the crash, Halamka appears relaxed and self-possessed. There's another reason he's opening up, talking now about the worst few days of his professional life at CareGroup. "It's therapeutic for me," he says, and then he begins reliving the disaster.

Wednesday: The network flaps

On Nov. 13, 2002, a foggy, rainy Wednesday, Halamka was alone in his office at Beth Israel when he noticed the network acting sluggishly. It was taking five or 10 seconds to send and receive e-mail. Around 1:45 p.m., he strolled over to the network team to find out what was up.

A few of his 250 IT staff members, who range from low-level administrators to senior application developers, had already noted the problem. They told him not to worry. There was a CPU spike on RCA, one of the core network switches: a sudden surge in traffic was pummeling it. Where the traffic was coming from, they didn't know. It might have something to do with a consultant who was working on RCA, preparing it for a network remediation project.

"We happened to have had a guy in there," recalls Russell Rusch of Callisma, the company leading the remediation project. "We knew the hospital had had similar incidents in the past few months." Those previous CPU spikes lasted anywhere from 15 minutes to two hours, he says. Then they worked themselves out. Like indigestion.

Halamka's team decided to begin shutting down virtual LANs, or VLANs. They would turn off switches to isolate the source of the problem, in much the same way one would go around a house shutting off lights to find out which one was buzzing. Halamka thought the plan sounded reasonable.

It was a mistake.

Shutting down switches forced the remaining switches to recalculate their traffic patterns. Those calculations were so complex that the switches gave up doing everything else.

Traffic stopped. The network was down.

Within 15 minutes, by 2 p.m., the team reversed course and turned all the switches back on. A sluggish network, they figured, was preferable to a dead one.

For the rest of the day and into the night, the network flapped—a term Halamka uses to describe the network's state of lethargy dotted by moments of availability and, more often, spurts of dead nothing. The team searched for the cause. Around 6 p.m., when most of the doctors, nurses, staff and students left, the network settled down. Finally, at 9 p.m., the IT staff found its gremlin: a spanning tree protocol loop.

Spanning tree protocol is like a traffic cop. Data arrives at a switch and asks spanning tree for directions. Say, from John's server to Mary's desktop. Spanning tree calculates the shortest route. It then blocks off every other possible route so that the data will go straight to its destination without having to make decisions at other crossroads along the way.

But spanning tree will look only as far out as seven intersections. Should data reach an eighth intersection, called a hop in networking, it will lose its way. Often, it will drive itself into a loop. This clogs the network in two ways. First, the looped traffic itself gums up the works. Then, other switches start to use their computing horsepower to recalculate their spanning trees—to make up for the switch that is directing traffic in a loop—instead of directing their own traffic.
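For readers who want to see the mechanics, here is a minimal sketch of that idea in Python. It is not how Cisco switches actually implement spanning tree; the switch names, the links and the daisy-chained PACS segment are invented for illustration, and the seven-hop figure simply follows the article's description.

```python
from collections import deque

# Toy model: one loop-free tree is computed over the switch fabric, every
# redundant link is blocked, and any switch more than seven hops from the root
# sits beyond the diameter the protocol expects. Topology is invented, not
# CareGroup's real network.
LINKS = {
    "Libby030":  {"RCA", "RCB", "Deaconess"},
    "RCA":       {"Libby030", "RCB"},
    "RCB":       {"Libby030", "RCA"},
    "Deaconess": {"Libby030", "PACS1"},
    "PACS1": {"Deaconess", "PACS2"},
    "PACS2": {"PACS1", "PACS3"},
    "PACS3": {"PACS2", "PACS4"},
    "PACS4": {"PACS3", "PACS5"},
    "PACS5": {"PACS4", "PACS6"},
    "PACS6": {"PACS5", "PACS7"},
    "PACS7": {"PACS6"},
}
MAX_HOPS = 7  # the limit described in the article

def spanning_tree(root):
    """Breadth-first search from the root: tree links forward traffic,
    every other link is blocked."""
    depth, parent = {root: 0}, {root: None}
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for neighbor in LINKS[switch]:
            if neighbor not in depth:
                depth[neighbor] = depth[switch] + 1
                parent[neighbor] = switch
                queue.append(neighbor)
    tree = {frozenset((s, p)) for s, p in parent.items() if p}
    every_link = {frozenset((s, n)) for s in LINKS for n in LINKS[s]}
    return depth, tree, every_link - tree

depth, forwarding, blocked = spanning_tree("Libby030")
print("blocked redundant links:", [tuple(link) for link in blocked])
for switch, hops in sorted(depth.items(), key=lambda kv: kv[1]):
    if hops > MAX_HOPS:
        print(f"{switch}: {hops} hops from the root, too deep for spanning tree")
```

In this toy fabric the redundant RCA-RCB link gets blocked, and the switch at the end of the daisy chain sits eight hops from the root, the same kind of too-deep condition the team would eventually find.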

That's what happened at Beth Israel Deaconess. On Wednesday, a researcher uploaded data into a medical file-sharing application, and it looped. The data was several gigabytes, so it clogged the pipes. Then, when Halamka's team turned off a switch at 1:45 p.m., it was as if one cop closed an intersection and every other cop stopped traffic in all directions to figure out alternate routes.

Halamka's team now knew what happened, if not where it happened. Standard troubleshooting protocol for spanning tree loops calls for cutting off redundant links on the network. "What you're doing is eliminating potential spots where there are too many hops, and creating one path from every source to every destination," Callisma's Rusch says. "It might make for a slower environment"—without backup—"but it should make for a stable environment."
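A tiny, self-contained check makes Rusch's goal concrete: once the redundant links are cut, the fabric should be a tree, still connected but with exactly one path between any two switches. The pruned topology below is invented, not CareGroup's.

```python
# After pruning, a healthy fabric is connected and loop-free (a tree).
PRUNED = {
    "Libby030":  {"RCA", "RCB", "Deaconess"},
    "RCA":       {"Libby030"},
    "RCB":       {"Libby030"},
    "Deaconess": {"Libby030"},
}

def is_single_path(links):
    """True if the topology is connected and has exactly one path between any two nodes."""
    nodes = set(links)
    edges = {frozenset((a, b)) for a in links for b in links[a]}
    seen, stack = set(), [next(iter(nodes))]
    while stack:                       # simple depth-first reachability check
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(links[node] - seen)
    return seen == nodes and len(edges) == len(nodes) - 1

print(is_single_path(PRUNED))  # True: one path from every source to every destination
```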

"We cut the links," Halamka says. "It seemed to work. We went home feeling great. We had figured it out."

Thursday: Clogged arteries

Hospitals come alive early. By 7 a.m., doctors and nurses started to send some of Beth Israel Deaconess's 100,000 daily e-mails. The pharmacy began filling prescriptions, transferring the first bits of the 40 terabytes that traverse the network daily. Some of the 3,000 daily lab reports were beginning to move.

By 8 a.m., the network again started acting as if it were flying into a headwind. Halamka realized the network had settled down the night before only because hardly anyone was using it. When the workday began in earnest, CPU usage spiked. The network started flapping. The problem hadn't been fixed.

Halamka's team scrambled to find other possible sources of the trouble. One suspect was CareGroup's network of outlying hospitals in Cambridge, Needham, Ayer and elsewhere in Massachusetts. They operated as a distinct network that plugged into Beth Israel Deaconess. The community hospitals' network was sluggish, and a billing application wasn't working, according to Jeanette Clough, CEO of Mount Auburn Hospital in Cambridge, which serves as the hub for the outlying hospitals' network.

The easiest thing to do would be to cut the links, eliminating the potential for spanning tree loops. But that would isolate the outlying hospitals. Instead, the IS team, along with Callisma engineers, chose a more complex option. They would try converting from switching to routing between the core network and the outlying hospitals. That would eliminate spanning tree issues while keeping those hospitals connected.

They tried for seven hours, and, for arcane reasons that have to do with VLAN Trunking Protocol (VTP), they never got the routing to work. The network flapped all day.
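Conceptually, the plan amounted to splitting one flat Layer 2 fabric into two independent spanning tree domains joined by a router, so that hop counts and recalculation storms stay local to each side while traffic still flows between them. The sketch below illustrates only that concept, with invented topologies; it says nothing about the VTP problems that actually sank the attempt.

```python
from collections import deque

# Two routed segments, each its own spanning tree domain. Names and links are
# invented for illustration.
CORE      = {"Libby030": {"RCA", "RCB"}, "RCA": {"Libby030"}, "RCB": {"Libby030"}}
COMMUNITY = {"MtAuburn": {"Needham", "Ayer"}, "Needham": {"MtAuburn"}, "Ayer": {"MtAuburn"}}

def max_depth(root, links):
    """Hop count of the farthest switch from the root (breadth-first search)."""
    depth, queue = {root: 0}, deque([root])
    while queue:
        node = queue.popleft()
        for nbr in links[node]:
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return max(depth.values())

# Each routed segment is measured against the seven-hop limit on its own,
# and a loop or recalculation in one segment stays out of the other.
print("core segment depth:", max_depth("Libby030", CORE))
print("community segment depth:", max_depth("MtAuburn", COMMUNITY))
```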

Around midmorning, as Halamka was explaining the routing strategy to CareGroup executives in an ad hoc meeting, a patient, an alcoholic in her 50s, was admitted to Beth Israel Deaconess's ICU. Dr. Daniel Sands, a primary care physician and director of the hospital's clinical computing staff, saw her. She had what Sands calls "astounding electrolyte deficiencies," a problem common to people who drink their meals. In fact, Sands says, "It was incredible she was alive.

"I needed to be careful with this woman. I needed to try treatments based on lab reports and then monitor progress and adjust as I went," recalls Sands. "But all of a sudden, we couldn't operate like that. Usually I get labs back in less than an hour; they were taking five hours, and here I have a patient who could die. I was scared." (The patient would survive.)

At 4 p.m., Halamka met with a minicrisis team that included the head of nursing, the heads of the lab and the pharmacy, and hospital COO Dr. Michael Epstein. "Even then," Halamka says, "I'm still saying, 'We're one configuration change away,' and my assumption is things will be back up soon."

But his team was tense and frustrated. CareGroup's help desk had been flooded with calls. They were hearing everything from "I can't check my e-mail" to "I don't know if the blood work I just requested went through."

At 3:50 p.m., Beth Israel closed its emergency room. It stayed closed for four hours, until 7:50 p.m., according to Massachusetts Department of Public Health documents.

It was at the 4 p.m. meeting that COO Epstein says he realized "this was more than a garden-variety down-and-up network." Clinical users, like Sands, were signaling that they were worried. Epstein and Halamka, along with hospital executives and network consultants, decided to take extreme measures. They called Cisco Systems, the hospital's San Jose, Calif.-based equipment and support vendor. Cisco responded by triggering its Customer Assurance Program (CAP), a bland name that belies how rare and how serious CAPs are. CAP means Cisco commits any amount of money and every resource available until a crisis is resolved.

CAP was declared shortly after 4 p.m. By 6 p.m., a local CAP team from nearby Chelmsford, Mass., had set up a command center at the hospital and initiated "follow the sun" support—meaning additional staff at Cisco's technical assistance centers would be plugged in to the crisis until their workday ended, when they'd hand off support to a similar group a few time zones behind them.

First, the CAP team wanted an instant network audit to locate CareGroup's spanning tree loop. The team needed to examine 25,000 ports on the network. Normally, this is done by querying the ports. But the network was so listless, queries wouldn't go through.

As a workaround, they decided to dial in to the core switches by modem. All hands went searching for modems, and they found some old US Robotics 28.8Kbps models buried in a closet. They blew the dust off them, like musty yearbooks pulled from an attic, ran them to the core switches around Boston's Longwood medical area and plugged them in. CAP was in business.

An outmoded network

By 9 p.m., they had pinpointed the problematic spanning tree loop. The Picture Archiving and Communication System (PACS) network, used for sharing high-bandwidth image files and other clinical data, was 10 hops away from the closest core network switch, three too many for spanning tree to handle.

And that's when the dimensions of the problem fully dawned on the team members: They were struggling with an outmoded network. In September 2002, Halamka had hired Callisma's Rusch to audit CareGroup's infrastructure. When Rusch finished, he told Halamka, "You have a state-of-the-art network—for 1996."

Halamka's network was all Layer 2 switches with no Layer 3 routing. Switching is fast, inexpensive and relatively dumb, and it relies on spanning tree protocol. Routing is more expensive but smarter: routers have quality-of-service throttles to control bandwidth and to isolate heavy traffic before it overwhelms the network. State-of-the-art networks in 2002 had routing at their core.
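The quality-of-service throttles mentioned here are, at bottom, rate limiters. The sketch below shows one classic throttling scheme, a token bucket, purely to illustrate the idea of capping a heavy sender; it is not Cisco's QoS implementation, and the rates and packet sizes are arbitrary.

```python
import time

class TokenBucket:
    """Toy bandwidth throttle: a packet may pass only while tokens remain, and
    tokens refill at a fixed rate. Routers apply mechanisms in this spirit to
    keep one heavy sender from starving everyone else; numbers are arbitrary."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s      # refill rate
        self.capacity = burst_bytes       # largest burst allowed through at once
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False                      # over budget: queue or drop instead of flooding

# A hypothetical bulk transfer, like the multi-gigabyte upload in the story,
# gets only its allotted share; everything beyond the burst must wait for tokens.
throttle = TokenBucket(rate_bytes_per_s=1_000_000, burst_bytes=100_000)
admitted = sum(1 for _ in range(1_000) if throttle.allow(1_500))
print(f"{admitted} of 1000 packets admitted immediately; the rest must wait")
```

The point is the isolation: a heavy flow is held to its budget instead of being allowed to saturate the links everyone else depends on.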

In 1996, CareGroup's network was simply Beth Israel Hospital's network, and at its core was a switch called Libby030. In October of that year, the hospital merged with Deaconess Hospital, and Deaconess's network was plugged into Libby030.

Other systems were tacked on in the same way. In 1998, CareGroup connected PACS to what used to be Deaconess Hospital. A year later, CareGroup linked a new data center and its two core switches (RCA and RCB) to Libby030. There would be a fourth core switch added and a skein of redundant links, but Libby030 remained the main outlet. Halamka now understands that this was a "network of extension cords to extension cords. It was very fragile," he says.
