06.b. Receive report on April 21, 2017 IT server outage and present Certificates of Appreciation to County employees who assisted

6.b. NETWORK INTERRUPTION EVENT ON APRIL 21, 2017
John Huie
Information Technology Manager
May 4, 2017

EVENT OVERVIEW
• On Friday, April 21 at 11:00am, Central San experienced a sudden, unexpected network outage.
• Most systems and servers were offline and unavailable, including file, print, and email servers. SCADA systems were unaffected.
• All systems are now fully operational.
• All lost data has been fully recovered.

04/27/17

OUTAGE TIMELINE
Friday
• 11:00am - Network connectivity was lost between two network switches. Access to files, printers, email, and other systems was down.
• 12:00pm - The Nimble core storage system lost power and then restarted, resulting in file corruption and lost data.
• Staff worked through the day to restore the Nimble and began migrating servers to the CSO data center as a "Plan B" option.
• Staff developed a workaround to bypass the failed network switches.

OUTAGE TIMELINE CONTINUED
Saturday
• 9:00am - Staff continued to migrate all servers to the data center at CSO in case the Nimble could not be stabilized.
• 11:00am - Nimble storage in Martinez was successfully recovered but was still considered unstable.
• 5:00pm - The replacement Nimble unit and Nimble engineers arrived on site.
• 11:30pm - The new Nimble unit was installed, configured, and ready to go at the Martinez data center.
Sunday
• 9:00am - Staff began moving server storage from the old Nimble to the new Nimble device.
• 2:00pm - All services were restored and operational.

ROOT CAUSE OF OUTAGE
Most likely cause:
• Internal equipment malfunction that caused issues with nearby equipment
Other possibilities:
• Data Center power quality - Uninterruptible Power Supply
• Malicious hacking (internal & external)
• Physical Data Center intrusion

COSTS & LABOR
• Five members of the IT Team worked through the weekend.
Four of them collectively used 112 hours of overtime.
• 18 hours of KIS consulting time.
• Ed Woo and Rex Fujikawa from Contra Costa County volunteered late into Friday evening.
• Vendor support from Nimble, Cisco, and Dell was on hand over the phone or on site for many hours.
• No direct hardware or materials cost. All equipment was on hand or provided under our service contract.

BUSINESS IMPACT
• The outage impacted a partial day of normal work hours for Central San.
• Office staff were unable to do their primary work for the duration of the outage on Friday (6 hours).
• The email outage caused disruption throughout the weekend until Sunday afternoon.
• CSO and Plant Maintenance crews were unable to receive Cityworks work order information. This resulted in varied impact on field crew production, ranging from no impact to moderate impact.
• The outage did not impact SCADA systems.

NEXT STEPS
Step                                                                    Target to complete
Fully implement remote server replication (SRM)                         September 2017
Replace suspect equipment                                               May 2017
Evaluate UPS and power distribution in Data Center                      2 weeks
Share event information with CH2M                                       Immediately
Collaborate with SCADA group to improve systems and incident response   Ongoing

LONG-TERM/ONGOING IMPROVEMENTS
• Security Risk Assessment with CH2M
• Annual vulnerability penetration tests
• Ongoing internal hardening efforts, including:
  • Password requirements
  • External device pre-authorizations
  • Site access improvements, including secondary restrictions to server rooms
  • Video and network monitoring
• Construct new Data Center
  • Allows physical separation of redundant SCADA servers
  • Reduces risk of water intrusion
  • Improves temperature control
  • Improves Data Center monitoring

ANY QUESTIONS?