Electrical Blackouts: A Systemic Problem
Although human error can be the proximate cause of a blackout, the real causes are found much deeper in the power system.
The day after the August 14, 2003, blackout, President George W. Bush and
Canadian Prime Minister Jean Chrétien directed a joint U.S.-Canada Power System
Outage Task Force to investigate the causes of the blackout and the ways to
reduce the possibility of future outages. On November 19, the task force
reported that the blackout was due to human error. In laying the blame on
operators in two control centers, this interim report follows a long tradition
of singling out individuals and companies who made the wrong decisions. Firing
individuals and suing companies might be satisfying to those who suffered in the
blackout, but it will do little to prevent future blackouts. Similar problems
are common in generation and transmission companies, as examination of the
frequency and geographic distribution of blackouts shows.
Promises to end blackouts have been made for decades, but they ignore the
reality that complex systems built and operated by humans will fail. Congress
and the Federal Energy Regulatory Commission (FERC) must implement a framework
that recognizes that individuals and companies will make errors and creates a
system that will limit their effects. Such a system would also be useful in
reducing the damage caused by natural disruptions (such as hurricanes) and is
likely to improve reliability in the face of deliberate human attacks as well.
Fortunately, air traffic control provides us a guide to a system designed to
minimize the effects of human error. The problems uncovered by the August
blackout can be addressed by the kind of change that transformed the air traffic
control system from one of frequent deadly accidents to a system that has
provided a relatively accident-free environment, despite enormous growth in the
number of daily flights and occasional errors by pilots and controllers.
August 14 and earlier blackouts
There were indeed many individual errors on August 14. A plant operator
pushed one generator near Cleveland too hard, exceeding its limits and resulting
in automatic shutdown at 1:31 that afternoon. With that generator lost, power
flowed over transmission lines to fill the need in Cleveland. The utility failed
to appreciate the seriousness of the situation, because it did not perform a
contingency analysis after the loss of the plant to see if another failure would
lead to serious trouble. Although the computer model that performs the
contingency analysis was running and could have provided results in a few
minutes, none of the operators consulted it at any time that afternoon,
according to interviews cited in the report.
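The logic behind such an analysis is straightforward, even though production tools are far more elaborate. The following sketch, written in Python with placeholder inputs, illustrates the basic "N-1" idea: remove each element in turn, recompute the flows, and flag any single outage that would leave some other element overloaded.

    def contingency_screen(elements, solve_flows, limits):
        # Illustrative "N-1" contingency screen. The arguments are placeholders:
        #   elements    - ids of the lines and generators to take out one at a time
        #   solve_flows - a power-flow solver; solve_flows(outaged_id) returns
        #                 {element_id: flow in MW} with that element removed
        #   limits      - {element_id: thermal rating in MW}
        violations = {}
        for outaged in elements:
            flows = solve_flows(outaged)  # re-solve the grid with one element out
            overloads = [eid for eid, mw in flows.items() if abs(mw) > limits[eid]]
            if overloads:
                violations[outaged] = overloads  # this single failure would cause trouble
        return violations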
The transmission system operations center had trouble with its computer
analysis tool starting at 12:15 p.m. Only a single analyst was on duty. That
person fixed the problem by 1:07, but then went to lunch without setting the
program to report automatically every five minutes. So the analysis tool was not
available until 4:04 p.m.
At 2:02, a transmission line failed: although it was carrying less than half the
power it was designed for, it sagged into a tree that had not been trimmed
recently, causing a short circuit that took the line out of service. Accident analysts
say most accidents are caused by a chain of events, and one weak link in this
chain was the failure to recognize that tree-trimming is an essential part of
the design of transmission systems.
With both the generator and the line out of commission, other lines were
overstressed and failed between 3:05 and 3:39 p.m. That led to more failures and
the blackout at 4:08. As the situation worsened, competing generating companies
and power transmission line operators did not share data. Even some data that
were shared could not be interpreted, because companies had not kept track of
changes in the grid that had been made by others.
Meanwhile, back at the utility, the operators did not notice that their own
alarms and graphs were disabled when the computers driving them froze at 2:14.
Having no training in recognizing and reacting to failures of their computer
systems, they believed their rosy frozen data even when calls came in reporting
trouble. The control displays were hard to interpret, even after they began
working again at 3:59. When it finally became clear that not enough electricity
was being generated or transmitted to supply all customers, the operators made
no attempt to shed load; that is, to black out a few customers in order to
prevent blacking out 50 million.
The task force interim report concludes: "training was inadequate for
maintaining reliable operation . . . internal control room procedures and
protocols did not prepare them adequately to identify and react to the August 14
emergency."
It is tempting to assume that these were isolated errors made by one or two
organizations. It is always easy to blame individuals, but such complacency is
misplaced. Operators at consoles in 140 control centers around the country
control the grid by calling for generators to ship power over the proper lines
at just the right time to meet demand. These operators work for some of the
3,000 power companies and transmission operators in the United States, but they
do not have timely access to the information they need to make wise decisions.
"Most control facilities do not receive direct line voltage and current
data on every facility for which they need visibility," the task force
concluded. To make up for the lack of real-time data, they sometimes use
computers to estimate the state of the grid. Use of such models is uneven in the
140 control centers. Says the report, "control areas that have them
commonly run a state estimator [a computer tool used to estimate what is working
and what is broken, since very little actual data is monitored] on regular
intervals or only as the need arises (i.e., upon demand). Not all control areas
use state estimators."
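The estimator itself is conceptually simple. In its most common form it solves a weighted least-squares problem: given sparse, noisy measurements related to the unknown system state, it chooses the state that best fits the measurements, trusting accurate sensors more than inaccurate ones. The sketch below is a linearized illustration of that idea, not any vendor's product; real estimators iterate over a nonlinear model of the network.

    import numpy as np

    def estimate_state(H, z, sigma):
        # Weighted least-squares state estimation (linearized illustration).
        #   H     - measurement matrix relating the state x to the measurements (z ~ H x)
        #   z     - the measurements actually received
        #   sigma - standard deviation of each measurement (smaller means more trusted)
        W = np.diag(1.0 / sigma ** 2)            # weight accurate sensors more heavily
        G = H.T @ W @ H                          # gain matrix
        x_hat = np.linalg.solve(G, H.T @ W @ z)  # best-fit estimate of the state
        residuals = z - H @ x_hat                # large residuals flag bad or stale data
        return x_hat, residuals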
The task force also listed additional factors that contributed to the August
2003 blackout. They include, "inadequate interregional visibility over the
power system; dysfunction of a control area's SCADA/EMS [data system]; and lack
of adequate backup capability to that system." This list of other factors
is important, because it draws attention to underlying weaknesses in the system
that were also identified as factors in earlier blackouts.
A few common themes emerge from investigations of the 2003 and earlier
blackouts:
* Monitoring of the power grid is sparse, and even these limited data are not
shared among power companies.
* Industry standards are lax; for example, vegetation under transmission
lines is trimmed only every five years.
* Operators are not trained routinely with realistic simulations that would
enable them to practice dealing with the precursors to cascading failures and
the management of large-scale emergencies.
* Power companies have widely varying levels of equipment, data, and
training. Some companies can interrupt power to customers quickly during an
emergency, whereas others are nearly helpless.
* Decades-old recommendations to display data in a form that makes it easy to
see the extent of a problem have been ignored. This was a contributing cause of
the 1982 West Coast blackout, where "the volume and format in which data
were displayed to operators made it difficult to assess the extent of the
disturbance and what corrective action should be taken."
* Monitoring of the power system is everywhere inadequate, both within
regions and between them.
How the grid is operated
The United States has attempted voluntary measures to prevent electrical
blackouts for much of the past century. Originally, vertically integrated
utilities planned for their own system reliability, with a few tie lines to
neighboring utilities that might be helpful in some emergencies. It became clear
in the 1965 Northeast blackout, when a failure in Ontario blacked out New York
City 11 minutes later, that growing electric demand had made regional issues
important. In the next two years, 10 voluntary regional reliability councils
were established to coordinate the planning and operation of their members'
generation and transmission facilities. In 1968, the North American Electric
Reliability Council (NERC) was formed to coordinate the regional councils. One
of NERC's primary functions is the development of reliability standards for the
regional generation and transmission of power. According to its Web site, "NERC
has operated successfully as a voluntary organization, relying on reciprocity,
peer pressure and the mutual self-interest of all those involved."
Consumers of electricity may have a different definition of success. Despite
the voluntary standards, large blackouts unrelated to storms occurred in
Pennsylvania, New Jersey, and Maryland on June 5, 1967 (affecting 4 million
people); Miami on May 17, 1977 (1 million); New York on July 17, 1977 (9
million); Idaho, Utah, and Wyoming on January 1, 1981 (1? million); four western
states on March 27, 1982 (1 million); California and five other western states
on December 14, 1994 (2 million); the Pacific Northwest on July 2, 1996 (2
million); 11 western states on August 10, 1996 (7.5 million); and San Francisco
on December 8, 1998 (0.5 million).
After the passage of the Public Utilities Regulatory Policies Act in 1978 and
the Energy Policy Act of 1992, the electricity industry became a hybrid of
vertically integrated utilities and new structures of multiple forms.
"Merchant generators," independent of utility companies, installed
their own plants and sought customers anywhere in the country. Aggregators
bargained for better rates on behalf of large numbers of customers. Energy
brokers used the open market and long-term contracts to buy and sell power.
Restructuring has transformed the operation of the electricity system.
Utilities formerly transmitted power from a nearby generation plant to
customers. Now, industrial customers can buy power from plants hundreds of miles
away, putting major burdens on the transmission system and increasing the
likelihood of a blackout. That has made a huge difference: The number of times
that the transmission grid was unable to transmit power for which a transaction
had been contracted jumped from 50 in 1997 to 1,494 in 2002. This metamorphosis
has done little to improve the physical system of transmission or its control
systems. The burden of making the new system operate reliably has instead fallen
on people.
No organization that generates, transmits, or distributes electric power
wants low reliability. But in a deregulated competitive electricity market,
companies have to pay for investments out of the revenues they earn. Unless
companies can find a way to bill customers for reliability, or unless regulators
mandate reliability investments and ensure that they are reimbursed, no
investments will be made. None of the 19 states that have implemented electric
restructuring has figured out how to pay for investments to prevent
low-probability events such as blackouts.
Eight years ago, reacting to that summer's two large outages in the West,
NERC's chief executive officer wrote that, "[a new model] must include
universal participation, more detailed and uniform reliability standards that
can be put in place quickly, independent monitoring of reliability performance,
and the obligation to support, promote, and comply with NERC's policies."
In 2002, NERC incorporated many of the new market participants that emerged
after restructuring (such as brokers and aggregators) in developing its
voluntary reliability standards. In 2003, NERC stated that, "the existing
scheme of voluntary compliance with NERC reliability rules is no longer adequate
for today's competitive electricity market." However, both a 1998
Department of Energy report and a complaint to FERC in 1997 question NERC's
authority to make its standards mandatory.
NERC has supported federal legislation that would establish an Electric
Reliability Organization (ERO) with power to establish and enforce mandatory
standards. A NERC panel first put forward this proposal in January 1997. Eight
months later, it was endorsed by a task force chartered by the Department of
Energy as a response to the 1996 blackouts. It was part of the energy bill that
passed the House on April 11, 2003, and subsequently appeared in section 1211 of
the conference committee language. The proposed ERO would be industry-led and
could level penalties for violations of standards, but its authority over grid
operations (as distinct from planning standards) is still to be defined.
Reducing blackouts
We can apply lessons from the history of air traffic control [see sidebar] to
the electric utility industry. Just as in air traffic control in the early
1960s, the time for ad hoc fixes and finger-pointing has passed. The United
States needs a national plan implemented through an organizational structure
that recognizes that human beings make mistakes and that preventing those
mistakes requires checks and balances.
A long-range plan should take into account engineering improvements. These
could include ways to control exactly where power flows through the lines,
electrical compensation for the strain on the system when a customer spins up a
large motor, and direct current transmission lines (which reduce the loss of
energy that occurs in transmitting alternating current power long distances).
Generating electricity in relatively small plants located close to consumers,
rather than in large central generation plants, will reduce blackouts. This
distributed generation holds promise, but for the foreseeable future the system
will rely on the existing transmission grid. Other technologies, such as robust
automatic control systems to reduce dependence on human operators, might be
feasible in a decade.
Long-range planning should not distract us from the significant improvements
we can make within the next few years. Some elements of such a near-term plan are
clear.
We need national standards for telemetry data on power flows and transmission
system components. Competitive pressures and changes in the way the grid is used
have led to a very sparse data system, and market pressures are not likely to
improve matters. Operators can no longer be expected to make the right decisions
without good data. Today's hodgepodge of individual capabilities resembles the
rudimentary air traffic control system of 1934-37, which was operated by a few
airlines. Control centers must have displays and tools that allow operators to
make good decisions and to communicate easily with operators in different
control areas. There must be backups for power and data, and clear indications
to all operators that data are fresh and accurate. The emphasis should be on
data and presentations that support decisions. The present representations of
system state, particularly indicators of danger, are too complex. They stress
accuracy over clarity. Grid operators need much clearer metrics of danger and
suggestions for action (like collision avoidance alarms in aircraft and in air
traffic control centers), even if they are a little less accurate. If the
existing 157,000 miles of transmission lines in the United States were fitted
with $25,000 sensors every 10 miles, and each sensor were replaced every five
years, the annual cost would be $100 million. This would increase the average
residential electricity bill (now 10 cents per kilowatt-hour) to 10.004 cents
per kilowatt-hour. The total would be roughly one-tenth the estimated annual cost
of blackouts.
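The arithmetic behind those figures is easy to reproduce. In the sketch below, the mileage, sensor price, spacing, and replacement cycle are the numbers given above; the annual retail-sales figure is an assumed round number used only to show how the per-kilowatt-hour adder is obtained. Replacement alone comes to roughly $80 million a year; the round $100 million cited above presumably allows for installation and upkeep.

    LINE_MILES = 157_000          # miles of U.S. transmission lines (from the text)
    SENSOR_SPACING_MILES = 10     # one sensor every 10 miles (from the text)
    SENSOR_COST_DOLLARS = 25_000  # cost per sensor (from the text)
    REPLACEMENT_YEARS = 5         # each sensor replaced every five years (from the text)
    ANNUAL_KWH_SOLD = 2.5e12      # assumed annual U.S. retail electricity sales, in kWh

    sensors = LINE_MILES / SENSOR_SPACING_MILES                      # about 15,700 sensors
    annual_cost = sensors * SENSOR_COST_DOLLARS / REPLACEMENT_YEARS  # about $79 million per year
    adder = annual_cost / ANNUAL_KWH_SOLD * 100                      # adder in cents per kWh

    print(f"{sensors:,.0f} sensors, roughly ${annual_cost / 1e6:.0f} million per year")
    print(f"bill impact: about {adder:.4f} cents per kilowatt-hour")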
All grid operators must be trained periodically in contingency recognition
and response using realistic simulators. These simulations must include all
operations personnel in a way that exposes structural deficiencies, such as poor
lines of authority and insufficient staffing. The goal should be to recognize
and act on signs of extreme system stress that may be well outside daily
operations experience. The description of flying as "years of boredom
interrupted by moments of stark terror" applies also to grid operations.
Grid operators must have the systems and training that only realistic
simulations, using their specific control center configuration, can provide.
Federal standards for training, licensing, and certification of grid operators
and control centers are warranted to ensure that a single weak control center
does not bring down a large area. No federal entity now mandates such realistic
training for grid operators, but the owners of nuclear generation plants proved
(after Three Mile Island) that it can be done.
Operations control centers must be able to control. The patchwork ability to
shed load is not appropriate to the current interdependent transmission grid.
Some systems do it automatically, but some cannot even do it manually from the
control center. Shedding of load in the near term will probably be in the form
of blacking out large areas. Some power companies have customers who have agreed
to be blacked out in emergencies, but this practice is not uniform. A decade
hence it may be possible on a large scale to provide signals to consumers to
shed parts of their load in exchange for lower tariffs, but this partial load
reduction solution has not been economically feasible with current systems.
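Automatic load shedding typically works by tripping pre-assigned blocks of load as system frequency falls below set thresholds. The sketch below illustrates that logic only; the thresholds and block sizes are placeholders, not any utility's actual settings.

    # Pre-assigned shedding steps: (frequency threshold in Hz, percent of load to drop).
    # These values are illustrative placeholders.
    SHED_STEPS = [
        (59.3, 10),
        (59.0, 10),
        (58.7, 10),
    ]

    def blocks_to_shed(frequency_hz, already_shed):
        # Return the load blocks that should be tripped now, given the measured
        # frequency and the set of blocks that have already been shed.
        trip = []
        for threshold, percent in SHED_STEPS:
            if frequency_hz <= threshold and (threshold, percent) not in already_shed:
                trip.append((threshold, percent))
        return trip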
Just as air navigation aids are monitored and flight-checked periodically,
sensors, load-shedding devices, and other system components must be checked on a
much more systematic basis than they are at present. In a competitive
environment, chief financial officers will frown on such periodic testing, which
is why it should be mandated by national standards.
Industry standards for such items as tree-trimming under transmission lines
must be set with the costs of failures in mind, not just by the competitive
constraints of the immediate marketplace. Companies that do not comply should be
penalized. These standards will vary by region and should be set by regional
bodies such as the Regional Transmission Operators.
A national grid coordination center should be established and run as a
national asset by a private body. It would stimulate R&D for the data needed
for grid monitoring. It would also monitor the situation at regional and larger
levels, provide national flow control, and perhaps act as a backup for computer
failures in individual control regions. As in air traffic control, the roles and
responsibilities of the local and national centers will be neither perfectly
optimum nor static, but they will complement each other so that we can avoid the
complete lack of situational awareness seen in so many blackouts.
A permanent government investigation body, including professional accident
investigators who are trained to look for systemic as well as discipline-related
causes, should be an entity separate from the operators or regulators of the
grid.
How can we evolve to such a system?
The current electricity reliability system was created and developed in an
environment of voluntary participation. Trying to get all companies to
participate and comply with the recommendations has meant that standards have
not been stringent. They have also, demonstrably, not worked. Creating a better
system is not simply a matter of making the current rules mandatory. Mandatory
rules are necessary but not sufficient. We need to set rules for operations as
well as for engineering by clarifying the goals of the transmission and
generation systems and the responsibilities of each party. The NERC standards
here are not at all specific, mandating only general guides such as "return
the system to a secure state." What will emerge from the rule-setting is
unlikely to have policy or engineering purity or complete coherence, but it will
be better than the present fuzzy goals that provide little guidance on difficult
tradeoffs.
The new rules for engineering and operations must be informed by the current
state of technology and the technology improvements that are likely in the next
few years. The size and complexity of any of the three U.S. interconnection
regions mean that the new system has to be flexible and adaptive, since there is
no mathematical formula that can be developed for so large a system. The need
for innovative thinking suggests that an expert commission should be created to
advise the body setting mandatory standards. The commission should have experts
from operating companies, systems operators, FERC, and academia to take a fresh
look at how to design both engineering and operations standards that will
satisfy the goals.
This is analogous to the long-range air traffic control study President
Kennedy ordered in 1961. But the experience of the air traffic control system
also provides insights about likely problems to avoid. Although its operations
have produced an admirably safe system, its investments in technology and
infrastructure have been far from satisfactory. This suggests that
infrastructure decisions should be informed, but not dominated, by current
operations decisions. The electricity industry has its own technology issues.
Industry funding of its Electric Power Research Institute has dropped by half
since restructuring began. Both R&D tax credits and detailed regulations
have been proposed as stimuli for lagging innovation.
Industry is struggling to avoid detailed federal oversight (through FERC) of
the transmission and generation of electricity. NERC is not a federal entity,
and FERC has very little authority to perform oversight of its voluntary
policies. FERC has limited jurisdiction over reliability issues, such as
reserve generation capacity requirements, and over the real-time operations of
the transmission grid. FERC is exploring its rather limited authority on
reliability and indicated in late 2003 that it would require public filings of
any violations of the existing voluntary grid reliability standards, which are
overseen by NERC. These standards deal with the planning of adequate generation
and transmission capacity to meet expected load.
The new FERC proposal does not appear to require reporting of noncompliance
with NERC's operating standards (which themselves do not require that data be
collected on the status of the grid more often than every 10 minutes). Even this
innocuous-sounding FERC proposal was opposed vigorously on January 9, 2004, by
the Edison Electric Institute, whose members (large utilities) want the
industry's NERC, not the federal government, to have responsibility for
standards. Three days later, FERC's chairman told the Wall Street Journal that
he "intends to hire 30 engineers in coming months to conduct performance
audits and bird-dog the work done by the reliability council."
If the legislation to create an ERO passes, it will be interesting to watch
the experiment of an industry organization chartered by the federal government
to enforce, with penalties, the standards it develops. The experiment aside, this
provision is only a start. If a body such as this is to make real progress, its
authority should be expanded to include certification of transmission operators
and systems to meet national standards of data and control, training, and
periodic testing.
The Federal Aviation Administration's (FAA's) certification and training
standards as well as its air traffic operations have been admirable. However,
the FAA and its predecessors have found the management of new technology systems
challenging. Better control center computers and precision upgrades for landing
navigation systems have been decades behind schedule. It seems reasonable that a
grid control system should be managed privately. Nevertheless, the past 40 years
have shown us that voluntary standards and individual operating practices are
not appropriate for the grid. Just as the FAA sets standards for airlines and
national standards for navigation data and control centers, a body (either the
proposed ERO or a federal agency) should set operations requirements and police
them. The same body could operate the national grid coordination center. A
separate agency, such as the Department of Energy, should house the permanent
investigation personnel.
The gross revenues of the electric sector and the airline sector are very
similar. FERC is currently funded by user fees, and the improvements in the
national grid control system that are required could be funded in a manner
similar to the air traffic control system, and would add perhaps one-tenth of a
percent to the typical electric bill.
The best parts of the air traffic control experience can be incorporated, and
the worst parts avoided, by implementing a strong set of mandatory federal rules
and certifications covering the seven elements discussed above. Unfortunately,
pending legislation misses most of these key points. The battle between industry
and FERC is not likely to be resolved without comprehensive legislative
attention to reliability, with the debate taking into account the lessons of
related critical infrastructures.
A plan comprising these elements, one recognizing that failures of complex
systems involve much more than operator error, better reflects reality and will
help keep the lights on.
Lessons from air traffic control
Electricity is not the only critical infrastructure in which safety conflicts
with economics. It is instructive to consider the history of the air traffic
control system as a framework that could reduce the errors leading to blackouts.
Federal regulation of air transportation expanded steadily through the
mid-20th century as traffic increased. Federal licenses for pilots and mechanics
were required in the late 1920s. In 1934, when planes began flying through
clouds, the Bureau of Air Commerce asked the airlines to develop rules to
control and separate air traffic, but in 1937 the government took direct control
of the system, and air traffic control for planes flying through clouds became
mandatory. The bureau and its successor, the Civil Aeronautics Authority (CAA),
had authority for operating the system and investigating accidents and other
mishaps.
Radar had demonstrated its value to air travel during World War II, but it
was not required for civilian flights until 1956, when a crash over the Grand
Canyon killed 128 people and spurred the government to build a national
radar-based air traffic control system for high-flying planes. Safety was the
goal, but radar also had the immediate effect of improving airway capacity,
because accurate data allowed the distance between aircraft to be decreased from
30 miles to 5 miles. The government later expanded the radar system to cover
low-flying planes near airports.
During and after World War II, the Army argued successfully that the CAA
should control only civil aircraft. Two parallel control systems-civil and
military-persisted until 1958, when two collisions between military jets and
civil aircraft killed 61. Later that year the new Federal Aviation Agency
(FAA) was established with a mandate to coordinate all civil and military air
traffic at high altitude. At about the same time that it assumed operational
responsibility for the FAA, the Department of Transportation moved investigative
authority to a new National Transportation Safety Board to eliminate the
possibility of a conflict of interest.
In 1961, President John Kennedy told the FAA to prepare a long-range plan
for the air traffic network and to perform associated R&D. The resulting
study recommended a system that could monitor aircraft throughout their flights
and improve information displayed on screens to give operators better data and
to make it much easier to interpret. Throughout the 1960s, the 21 regional
control centers retained wide flexibility to formulate rules appropriate to
their local conditions, but by 1970 it was clear that the system needed a
national coordination center. As a result, airliners have takeoff and landing
slots and are held at the gate, not in the air, until it is safe to fly.
Aircraft delays due to air traffic congestion fell by two-thirds after the
opening of the national command center. The safety and efficiency of this large
national network have been improved by federal standards for data, displays, and
certification.
These improvements in air safety have been costly. Beginning with the Airport
and Airway Revenue Act of 1970, funding for the system, as well as for
improvements to airports, has been raised largely by taxes on airline tickets
and fuel. But the money appears to have been well spent. In 1960, U.S. air
carriers had 44.2 fatalities per 100 million aircraft miles. In 2000 the rate
was 1.2. Although the goal of preventing all air crashes will never be achieved,
increases in safety have been impressive, because the incident investigations
have sought ways to make the system safer, not just to blame pilots for crashes
or near misses.
We draw the following lessons from the history of air traffic control:
* The federal government assumed control of a system that could not be
handled by state and local government or by a voluntary system run by the
airlines.
* The system moved beyond panic responses to a crash to a comprehensive
system that included R&D and facilities to handle future issues.
* A single agency should not be responsible for both operation and
investigation.
* Comprehensive monitoring of data is crucial, and so is the ability to
interpret the data in real time and take remedial action.
* Many of the actions are local or regional, but a national coordination
center is required to bring the controllers together.
* As a result of all these remedies, an extremely complicated and potentially
deadly system of air transport has been made very safe, much safer than driving.
Jay Apt (apt@cmu.edu) is Distinguished Service Professor in Engineering and
Public Policy and executive director of the Carnegie Mellon University
Electricity Industry Center. Lester B. Lave (lave@cmu.edu) is Higgins Professor
of Finance and codirector of the Carnegie Mellon University Electricity Industry
Center. Sarosh Talukdar (talukdar@cmu.edu) is professor of electrical and
computer engineering at Carnegie Mellon University. M. Granger Morgan (gm5d@andrew.cmu.edu)
is Lord Professor and department head of Engineering and Public Policy and
codirector of the Carnegie Mellon University Electricity Industry Center. Marija
Ilic (milic@andrew.cmu.edu) is professor of electrical and computer engineering
and engineering and public policy at Carnegie Mellon University.
Copyright Issues in Science and Technology Summer 2004