Duke ITAC - March 12, 2009 Minutes

ITAC Agenda
March 12, 2009 4:00-5:30
Allen Board Room

  • Announcements & Meeting Minutes
  • Storage Outage Update (Klara Jelinkova)
  • myResearch in Duke@Work (Todd Orr)
  • Disk-Based Backup Service Directions (Eric Johnson)

Announcements & Meeting Minutes

Terry Oas opened by asking ITAC members present at the February 26, 2009 meeting if they had comments on the minutes. Noting no objections, Terry accepted the minutes and stated that they would be posted on the ITAC web site.

Terry introduced Jimmy Chang from Cardiology and the School of Medicine as a new ITAC member.

Storage Outage Update (Klara Jelinkova)

Klara began by describing the storage environment that was impacted by last week’s storage outage.  OIT today supports multiple storage arrays.  A storage array located in FitzEast (EVA-01) was the one impacted. Duke has two storage array vendors, EMC and HP.  The EMC systems provide Tier 1 storage capability and are replicated across data centers.  The HP EVA array (Tier 2) that failed was installed in November 2006 and has been very stable.  The university side has a pair of unreplicated EVA arrays and the Health System has several more. 

Klara said that the expectation was that these arrays were redundant to 99.999%.  Specifically, these arrays have redundant controllers and hardware.  Alvin Lebeck noted that the expectation of redundancy has not always been met. 
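As a rough illustration of what a 99.999% target implies if it is read as an availability figure, the arithmetic below converts an availability percentage into permitted downtime per year. The helper name and the 365-day year are assumptions made for this sketch; the minutes do not state how the figure was defined or measured.

```python
# Illustrative arithmetic only; allowed_downtime_minutes is a hypothetical helper,
# not something described in the meeting.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(99.999):.2f} minutes per year")
```

Under these assumptions, "five nines" leaves a little over five minutes of downtime per year.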

Klara described the classification of applications and how that determines where they are stored. The four classifications are essential, critical, sensitive, and tolerant. “Essential” applications are replicated, said Klara. SAP and AMCOM, for example, are categorized as “essential”. These applications have Tier 1 storage provided to them. Klara stated that “Critical” applications have failover or cache mechanisms. The Duke web presence is in this category and has the capability of failing over as requested by the Office of News and Communication, she said. “Sensitive” and “tolerant” applications do not have that level of failover.
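The classification scheme described above could be represented as a simple lookup table. This is only a sketch: the four classification names and the three example applications come from the discussion, while the Python names (Classification, PROTECTION, APPLICATIONS, protection_for) and the wording of the protection descriptions are assumptions made for illustration, not OIT's actual configuration.

```python
from enum import Enum


class Classification(Enum):
    ESSENTIAL = "essential"
    CRITICAL = "critical"
    SENSITIVE = "sensitive"
    TOLERANT = "tolerant"


# Protection per classification, paraphrased from the discussion (wording is an assumption).
PROTECTION = {
    Classification.ESSENTIAL: "replicated on Tier 1 storage",
    Classification.CRITICAL: "failover or cache mechanism",
    Classification.SENSITIVE: "no failover at that level",
    Classification.TOLERANT: "no failover at that level",
}

# Only the applications named in the discussion; a real registry would be much longer.
APPLICATIONS = {
    "SAP": Classification.ESSENTIAL,
    "AMCOM": Classification.ESSENTIAL,
    "Duke web presence": Classification.CRITICAL,
}


def protection_for(application: str) -> str:
    """Look up the protection level implied by an application's classification."""
    return PROTECTION[APPLICATIONS[application]]


if __name__ == "__main__":
    for name in APPLICATIONS:
        print(f"{name}: {protection_for(name)}")
```

Keeping the classification separate from the protection description means an application can be re-categorized by changing a single entry.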

Klara provided a high level overview of the outage timeline and the technical response.  The first disruption, a controller reboot, was experienced on Monday, March 3rd at approximately 5:45pm.  OIT immediately opened a service request with HP, Klara said.  HP was onsite by 9:00pm and the affected EVA crashed at 10:00pm.   Klara said that OIT immediately activated a Disaster Recovery situation.  This situation uses the previously mentioned applications classifications to make failover or backup decisions. 

By 1:00 am, HP was able to restore the array. Klara said that OIT then began restoring services on the impacted array and performed full functional and technical tests. At 3:00 am, the controller rebooted again, before OIT was able to complete all these tests.

Klara said that the team decided to enact the disaster recovery situation again. The group split into two teams:

  • Service Continuity Team -- Led by Kevin Davis and Chris Meyer. This team made sure services and essential applications were online.
  • EVA Restoration Team -- Led by Klara. This team worked with HP on resolving the technical issues.

The team met a 6:00 am restore deadline due to the Time & Attendance application. Klara said that at 6:00am HP provided “all clear”; therefore, the team began restoring services again.  This restoration was completed at around noon that day, she said.

After the initial crash, Klara contacted HP to request a replacement due to concern that a root cause had not yet been identified. From Monday morning through Wednesday morning the system performed as expected. On Wednesday at 11:00 am the controller rebooted again. Klara said the team decided that the array was not stable and that OIT must move the 27TB of data to another array. If OIT had another matching-brand array on site with sufficient capacity, it could perform array-based replication; however, the restoration team was performing host-based copying, which was very time consuming, she said.

Plan B was to get a secondary array on site to ensure that we had the storage if we required it.  A third team, led by Eric Johnson, was the installation team for this backup approach. 

John Board asked what the normal time frame for bringing a new array online was. Klara responded that a similar project this summer took two months. 

Steve Woody asked what the SLA with HP stated.  Klara stated that this is being evaluated.  OIT staff performed a complete review of all hosts to determine if any changes had been made and one OS service pack on one host was rolled back.  Despite this rollback, the array failed again.

Klara said that on Friday HP delivered new hardware, and the contingency array was also delivered.  OIT proceeded to install both systems since it was unclear which path would be pursued. 

By 11:00pm on Friday, the team had configured parallel EVAs which were replicating data.  By Sunday, the data had been replicated to the new EVA and the impacted systems were configured to use it, she said. 

Terry asked to what extent the replication process impacted the server’s ability to offer its services.  Klara said that the server I/O would have been impacted; however, since it was spring break and the weekend, some non-essential services were not yet online.  In addition, to minimize the impact, the team worked with customers to identify the critical services that needed to come up and which ones could be delayed.

Klara emphasized that the impacted array had been up for two and a half years without incident. There were no SAN or array changes made and the only host change, an OS Service Pack, was rolled back; nevertheless, the controller failed again. Since the new EVA has been up, the system has been stable. HP is analyzing the root cause and plans to have the root cause analysis ready the week of March 16th.

Klara added that another outcome of this incident was the need to verify the current application classification and priority. Tracy noted that the date of the outage, the night of a storm, escalated the importance of Duke Today, due to its notification responsibility. This shows that Duke needs to periodically review the overall service criticality, as well as when that classification might shift.

Terry asked if Duke Today is considered part of the Duke “web presence”.  Klara said that it was not at the time of the outage; however, there was a great deal of real time re-categorization of criticality. One of the outcomes from this incident is to validate application classification with stakeholders. Ultimately, the level of redundancy expected for services is a business decision.

Terry suggested that “web presence” may reflect what external customers see. Duke Today faces internally as opposed to the main Duke web site which external customers would visit, he said. Kevin Davis said that as pertains to Duke Today, the team reached out to the Office of News and Communication to prioritize web sites. Andrew Tutt stated that the Duke web site implied the storm was the cause of the outage. Kevin D. stated that the Office of News and Communication crafted the message, and that did not seem to have been their intent, but that this would be a better question for them as functional owners.

Alvin L. asked about the classification and whether the suggested process will be the most effective. Given the need for cost savings, the financial impact, in terms of staff time, was significant. Maybe Duke needs to look at insuring against this type of event by utilizing additional arrays, he said. Tracy speculated that the financial climate might not lend itself to the level of capital outlay needed for all services to be made fully redundant. She added that Duke needs to clarify who OIT’s partners are when these decisions need to be made. For example, Todd Orr was the ASM representative, but OIT doesn’t necessarily have that contact for every other service.

Alvin asked if Duke is going to provide any additional redundancy beyond today’s capabilities. Tracy responded that “platinum services” have that level of redundancy. She said the question to ask was whether Duke has all the services classified correctly. In addition, the cost and risk associated with providing complete storage redundancy are business decisions for the executive leadership team.

Rafael Rodriguez asked what the risk was of not being able to restore data at all. Klara responded that on Monday night data loss was a concern; however, all the arrays have backups, so it was a question of when the data would be restored. Klara added that the new backup architecture based on disk rather than tape would expedite backup times.

Robert Wolpert asked how far back the backups go. Klara stated that a nightly differential is performed; thus, you have one day of backup. Rafael stated that the risk of data loss needs to be examined and managed. Alvin asked what the criteria were for an application or service to qualify for the highest level of storage availability. Klara stated that business needs drive classifications.

myResearch in Duke@Work (Todd Orr)

Todd Orr noted that his presentation continues previous conversations on employee self-service and iForms. This effort builds on work initiated by the Research Administration Continuous Improvement committee (RACI). The goal was to identify a single place to tie in all the information and systems supporting the compliance and administration of sponsored research with the investigator as the user. Todd said the goal was to be live in the beginning of May 2009.

Todd clarified that it was not a tool aimed at grant administrators.  Terry asked why this was not for grant administrators.  Todd stated that the focus of this application was for faculty research, and that grant administrators already have tools for their needs.

The scope of myResearch is all of Duke, both campus and Health System. It is a systems integration effort, not a set of new applications, intended to provide access to a number of existing applications and data sets. Todd mentioned that RACI has a faculty advisory group that provides input on this service. The goal has been to get a broad base of input from faculty members and researchers.

myResearch will be a new component of the Duke@Work web site.  Every faculty member at Duke will get the myResearch tab, even if they do not receive grants.  The data displayed in this space will be for projects for which faculty members are the primary investigator (PI) or co-PI.  This site will require authentication and will use Shibboleth.

Robert Wolpert asked if Safari was supported.  Todd responded that SAP’s web technologies do not provide full support for Safari, though some aspects may actually work.  Terry asked if Duke@Work could be turned into a URL.  The current site is http://work.duke.edu, said Todd. 

Andrew T. asked if the university could have an easier process for creating internal URLs that are more intuitive.  Tracy said that David Jarmul’s office owns the process, and that Mike Schoenfeld is examining this.

Todd provided a demonstration of the myResearch tab and its contents. 

Todd mentioned that the tabs across the top of the screen are determined by security roles based on information in SAP.  Rafael stated that some future efforts might require added granularity to the existing SAP design, and suggested convening a working group to examine this issue.

Todd stated that “projects” in myResearch are defined as “proposals that have been submitted into the sponsored projects system or awards that have resulted in fund codes setup in SAP.”  There is an initiative to look at a broader project registry, independent of sponsorship, said Todd. 

Steve Woody added that the concept of a project is either non-existent or different across the university. The goal of myResearch is to introduce the “project concept” and provide a mechanism for interfacing with the disparate administrative systems. Administrative systems have grown around specific administrative needs. The myResearch tab changes the traditional approach to providing information, Steve said, adding that faculty and researchers need to get access to these varied administrative systems.

Terry asked if compliance training would be part of the myResearch tab.  Todd said the Conflict of Interest statements in the quality assurance tab refer to compliance issues.  Todd added that the training section of the myResearch tab would list all required training with due dates from the Safety Office, OESO, and others.

Jim Siedow stated that the myResearch tab is effectively a dashboard of all the activities requiring action. Terry asked if faculty biographies and CVs could be included. Todd stated that once a system exists to track and collect this information, the myResearch portal could link to that information. Terry asked if the design was customizable, and Todd responded that the interface is all HTML and designed by the Duke team. Todd welcomed members to contact him with feedback and said that he would be willing to come back and provide an update.


Disk-Based Backup Service Directions (Eric Johnson)

Klara introduced Eric Johnson as the new manager for systems infrastructure within OIT.  Eric was previously with Nortel. 

Eric Johnson described the current Tivoli Storage Manager (TSM) backup system that has been in place for a number of years. Eric said OIT’s current backup architecture consists of machines transferring data over the network to a disk cache pool. The TSM servers process that data and prepare it for transfer to tapes. Today Duke keeps two tape copies, one on-site and one off-site. The offsite copy goes to Iron Mountain. A recent outside analysis determined that Duke’s backup success rate was 99.78%, which was described as best in class by the consultant.

Eric said that Duke is running into resource constraints that necessitate an upgrade. Currently servers send daily differentials to a disk pool after 5pm, and that “staged” data is copied off to tape; this process is intended to complete by 5pm the next day. The current system has difficulty processing all the data in 24 hours, and the tape library capacity is being reached. Capacity projections show that sometime in May 2009, TSM will exceed its capacity if current trends continue.

Eric described the new architecture as having a pair of new storage arrays, one in each data center. The on-site data copy will go to these storage arrays. The offsite copy will continue to go to tape. OIT will also add a new TSM server to improve performance.

Terry asked if in the new configuration the tape library would remain on-site.  Eric confirmed that it would.  John Board asked how this new configuration solves the “24 hours in a day” problem.  Eric stated that several factors will have an impact.  Specifically:

  • A larger disk cache pool will allow for more data
  • The new server will have more horsepower and process data faster
  • OIT will only make one copy of the tapes.

Klara clarified that the data is dumped to the disk pool and then backed up to tape. Robert W. suggested moving to a 48-hour backup and restore window, rather than the current 24-hour window. Klara added that migrating to disk backup will also enable simultaneous restores. Rafael followed up on Robert’s point by suggesting that having a 24-hour Disaster Recovery program may not be essential. Robert asked if there were any plans to move completely off of tape. Eric stated that that is a long-term possibility.

Rafael asked if Iron Mountain remote backup over-the-network was an option. Klara said that data validation might be a concern. She added that when the remote over-the-network option was analyzed, it was deemed to be cost prohibitive. Kevin D. suggested that the cost of current off-site Iron Mountain tape storage is fairly low. However, off-site disk backup would be the next logical evolution, he said.

Steve W. said that the amount of Duke research data is growing quickly and that researchers want to retain that data. Tape represents a low cost storage option. Steve asked if tape has been considered as a “near line storage” option. Rafael suggested that tape is not a long-term archival solution.

Terry asked if the backups were incremental or full backups. Eric stated that TSM does a nightly differential backup with periodic full backups. He then described some possible archival options.

Jimmy Chang asked if TSM was only for campus.  Eric responded it was for campus only.  Rafael suggested that the Health System has similar services as well as similar challenges. 

Terry contributed that NIH mandates that data be archived. One challenge is that today’s storage medium may not be recoverable in several years; therefore, central backup and storage is critical to all researchers, said Terry. Eric clarified that TSM is today meant as a backup and recovery solution in the event of equipment failure, not necessarily a near-line or archival option. Klara suggested that re-purposing the tape robot for more archival storage uses is a possibility once capacity is not a concern, and that the team would keep this in mind.

John Porrman stated that spinning disk drives use power that may not need to be used, whereas tapes only use power on-demand. Klara noted that in a scenario such as the one described earlier, restoring from tape would take months. Eric stated that, down the road, OIT will also upgrade TSM to a version that supports de-duplication. In addition, the money Duke spends on tapes annually will be saved by using the disk array solution.

John Board asked how long it would take to restore 27TB of data from tape.  Klara speculated that would take several months.  Alvin asked if a full recovery would ever occur from tape.  Eric said that the new array would provide that recovery.  Rafael suggested the group should evaluate what events would require a full recovery and what the business decisions would be to justify disaster recovery expenditures.
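To put the restore estimate in perspective, the arithmetic below relates data volume, sustained throughput, and elapsed time. It is a back-of-the-envelope sketch only: the 90-day figure is an assumption standing in for "several months", the 100 MB/s rate is a hypothetical comparison point, decimal units (1 TB = 10^12 bytes) are assumed, and real tape restores also depend on factors such as tape mounts and the number of drives, which are not modeled here.

```python
# Back-of-the-envelope sketch; both helpers are hypothetical and use decimal units.
BYTES_PER_TB = 10**12
BYTES_PER_MB = 10**6
SECONDS_PER_DAY = 24 * 60 * 60


def implied_throughput_mb_per_s(terabytes: float, days: float) -> float:
    """Average sustained rate needed to move `terabytes` of data in `days`."""
    return terabytes * BYTES_PER_TB / (days * SECONDS_PER_DAY) / BYTES_PER_MB


def restore_days(terabytes: float, mb_per_s: float) -> float:
    """Days needed to move `terabytes` of data at a sustained `mb_per_s`."""
    return terabytes * BYTES_PER_TB / (mb_per_s * BYTES_PER_MB) / SECONDS_PER_DAY


if __name__ == "__main__":
    # 90 days is an assumed stand-in for "several months".
    print(f"Implied average rate: {implied_throughput_mb_per_s(27, 90):.2f} MB/s")
    # 100 MB/s is a hypothetical sustained rate, not a measured figure.
    print(f"At 100 MB/s: {restore_days(27, 100):.3f} days")
```

The point of the comparison is that a multi-month estimate corresponds to a very low effective rate, so per-restore overhead rather than raw transfer speed is likely what dominates such an estimate.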

Alvin asked if Duke would have the server storage on one array and the backups on a second array rather than redundant storage for the servers across arrays.  He added that we have 210TB of data on the arrays, and another option could be to provide redundancy to existing systems using the array proposed for backups. Eric contributed that the new arrays would support 250TB of data. Kevin D. contributed that the TSM system provides for restoration of previous versions, not just the last version.

Eric said that as backup storage utilization approaches capacity in the future, OIT would be able to easily add more disks to the array to increase capacity.  In addition, de-duplication should reduce the amount of data that needs to be backed up.

John B. said some studies indicate that tape storage may not necessarily be cheaper than disk storage. Tracy asked if Alvin or John had specific suggestions for the proposed architecture.  Alvin said that the application classification is very important.  Alvin asked if the RAID solution was hardware or software.  Klara stated that it was hardware RAID.  John B. suggested that we gain speed in recovery while retaining the disaster recovery option.

Alvin asked about scratch storage – data that is intentionally not backed up. Terry added that most data storage is effectively “scratch storage” in terms of reliability and availability. Robert W. suggested different data classifications do not currently exist. John B. suggested that outside agencies’ data restore requirements are somewhat vague. Terry added that health care regulations also could drive some decisions that the campus should be able to learn from.

Klara thanked the group for the community effort in responding to the storage array outage issue.