6.b. Receive report on April 21, 2017 IT server outage and present Certificates of Appreciation to County employees who assisted
NETWORK INTERRUPTION EVENT
ON APRIL 21, 2017
John Huie
Information Technology Manager
May 4, 2017
EVENT OVERVIEW
• On Friday, April 21 at 11:00am, Central San experienced a sudden, unexpected network outage.
• Most system servers were offline and unavailable, including file, print, and email servers.
• SCADA systems were unaffected.
• All systems are now fully operational.
• All lost data has been fully recovered.
04/27/17
OUTAGE TIMELINE
Friday
11:00am - Network connectivity was lost between two network switches. Access to files, printers, email, and other systems was down.
12:00pm - Nimble core storage system lost power and then restarted, resulting in file corruption and lost data.
Staff worked through the day to restore the Nimble and began migrating servers to the CSO data center as a "Plan B" option.
Staff developed a workaround to bypass the failed network switches.
OUTAGE TIMELINE CONTINUED
Saturday
9:00am - Staff continued to migrate all servers to the data center at CSO in case the Nimble could not be stabilized.
11:00am - Nimble storage in Martinez was successfully recovered but still considered unstable.
5:00pm - Replacement Nimble unit and Nimble engineers arrived on site.
11:30pm - New Nimble unit was installed, configured, and ready to go at the Martinez data center.
Sunday
9:00am - Staff began moving server storage from the old Nimble to the new Nimble device.
2:00pm - All services were restored and operational.
ROOT CAUSE OF OUTAGE
Most likely cause:
• Internal equipment malfunction that caused issues with
nearby equipment
Other Possibilities:
• Data Center power quality (Uninterruptible Power Supply)
• Malicious hacking (internal and external)
• Physical Data Center intrusion
COSTS & LABOR
• Five members of the IT Team worked through the weekend. Four of them collectively used 112 hours of overtime.
• 18 hours of KIS consulting time.
• Ed Woo and Rex Fujikawa from Contra Costa County volunteered late into Friday evening.
• Vendor support from Nimble, Cisco, and Dell was on hand by phone or on-site for many hours.
• No direct hardware or materials cost. All equipment was on hand or provided under our service contract.
BUSINESS IMPACT
• The outage impacted a partial day of normal work hours for Central San.
• Office staff were unable to do their primary work for the duration of the outage on Friday (6 hours).
• The email outage caused disruption throughout the weekend until Sunday afternoon.
• CSO and Plant Maintenance crews were unable to receive Cityworks work order information. This resulted in varied impact on field crew production, ranging from no impact to moderate impact.
• The outage did not impact SCADA systems.
NEXT STEPS
Step                                                                    Target to complete
Fully implement remote server replication (SRM)                         September 2017
Replace suspect equipment                                               May 2017
Evaluate UPS and power distribution in Data Center                      2 weeks
Share event information with CH2M                                       Immediately
Collaborate with SCADA group to improve systems and incident response   Ongoing
LONG-TERM/ONGOING IMPROVEMENTS
• Security Risk Assessment with CH2M
• Annual vulnerability penetration tests
• Ongoing internal hardening efforts including:
    - Password requirements
    - External device pre-authorizations
    - Site access improvements including secondary restrictions to server rooms
    - Video and network monitoring
• Construct new Data Center
    - Allows physical separation of redundant SCADA servers
    - Reduces risk of water intrusion
    - Improves temperature control
    - Improves Data Center monitoring
ANY QUESTIONS?