Duke ITAC - March 3, 2011 Minutes

ITAC Meeting Minutes
March 3, 2011, 4:00-5:30
Allen Board Room
  • Announcements, Introductions, and Meeting Minutes
  • Scalable Computing Support Center and DSCR Support (John Pormann)
  • Research Computing (Julian Lombardi)
  • Alternatives to CATV in Student Rooms  (Angel Wingate, Robert Johnson, Joe Gonzalez)


Announcements, Introductions, and Meeting Minutes

Alvy Lebeck called the meeting to order. Tracy Futhey gave a brief summary of the prior night’s 9th annual Froshlife festival. The event was a huge success with as many as 4,800 people in attendance online via a Ustream channel.

Tracy also introduced Richard Biever, OIT’s recently appointed Chief Information Security Officer, and asked him to share his early observations on the job. Richard responded that he has been doing a lot of listening and learning. He feels there are a lot of bright people to work with here at Duke and he looks forward to doing so. He also noted that there are some significant differences in structure between Duke and Georgia Tech, where he had been working previously. Richard views his primary purpose as listening to what’s going on and supplying the tools and services Duke needs to protect itself.


Scalable Computing Support Center and DSCR Support (John Pormann)

John began by reporting that the DukeWiki space for the DSCR/SCSC is an authoritative source for research computing info, and that a new blog is in place with information on internships and developments in the field.  He also noted that the first videos in a new series have gone up, capturing information about the DSCR/SCSC for people like post-docs who come in off-cycle; these six to eight videos are 15-20 minutes each and have the content that’s in the seminars that John’s staff presents. 

John noted that Intel and Dell have put some press releases out based on information from SCSC’s evaluation of CPU load and power draws on machines; this work was done for internal evaluation and analysis to understand the resource use in the data center.  John said that the 2008 and 2009 series blade servers are drawing significantly less power, saving about $25,000 on the annual power bill versus 2007 series 1U servers.  Tracy asked what this could be attributed to. John noted a combination of improved CPUs and a new generation of blade servers.

The DSCR remains the primary platform and remains on CentOS 4, John said; CentOS 5 is the upgrade target, while the marketplace awaits the release of CentOS 6. John added that because this is a production service, it is not just acceptable but desirable to run somewhat behind the cutting edge.  The CentOS 5 transition will take place over the summer, and CentOS 6 may follow shortly after.  John noted that this pace is based on researcher feedback.  He added that Intel compilers, MKL/IPP and MPI tools are now installed on the DSCR, providing improved performance for cluster users who decide to develop and schedule cluster-based software. Researchers using cluster-certified software and hardware should be most likely to see future commercial research computing applications supporting their environment.

John noted that memory is the main constraint in research computing these days; if you can’t fit a problem in memory, you can’t compute it. 48 GB and 96 GB of RAM are common these days, with 512 GB being a sweet spot on the higher end at a cost of roughly $30,000. A terabyte of memory would more than double that cost.  He noted that Intel and AMD are fairly similar in terms of overall performance in the abstract, but that the SCSC has tried to track along with the Intel chips. On the non-technical side, John said that Intel has a software division producing the aforementioned compilers and a field engineering division, while AMD does not.  For hardware vendors, out of the Tier 1 vendors like IBM, HP and Dell, Dell tends to be the least expensive; a 10% reduction for “white box” systems is available, but that misses Dell’s extensive and growing resources for network and hardware engineering support.

The networking is now delivering 1 Gbps to the blades, with 16 machines sharing a single 10 Gbps uplink; it’s not clear that there’s anything to be done to improve bandwidth externally without improving core network speeds.  Infiniband, which runs at 20, 40 or 80 Gbps, is one technology being watched.  John Board said the price differential used to be massive, and asked if the price had come down.  John Pormann said that today it might cost just a few hundred dollars more to put in an Infiniband card versus a 10 Gbps card, though he reminded the committee this provides fast service only at the core, not beyond the edge connection.
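The uplink sharing described above can be expressed as an oversubscription ratio. The following is only a minimal sketch using the figures from the minutes (16 blades, 1 Gbps each, one 10 Gbps uplink); the function names are illustrative, and the equal-sharing model is an assumption, not a description of how the DSCR switches actually schedule traffic.

```python
# Figures from the minutes: 16 blades at 1 Gbps each share one 10 Gbps uplink.
BLADES_PER_UPLINK = 16
BLADE_LINK_GBPS = 1.0
UPLINK_GBPS = 10.0


def oversubscription_ratio(hosts, host_gbps, uplink_gbps):
    """Peak aggregate host demand divided by uplink capacity."""
    return hosts * host_gbps / uplink_gbps


def fair_share_gbps(hosts, host_gbps, uplink_gbps):
    """Per-host throughput if all hosts transmit at once.

    Assumes the uplink is divided equally, which is a simplification.
    """
    return min(host_gbps, uplink_gbps / hosts)


ratio = oversubscription_ratio(BLADES_PER_UPLINK, BLADE_LINK_GBPS, UPLINK_GBPS)
share = fair_share_gbps(BLADES_PER_UPLINK, BLADE_LINK_GBPS, UPLINK_GBPS)
print(f"oversubscription {ratio:.1f}:1, worst-case share {share:.3f} Gbps per blade")
```

Under these assumptions the uplink is only mildly oversubscribed, so each blade keeps more than half of its link speed even when every blade in the group is transmitting off-chassis at once.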

From a storage perspective, John said the DSCR is growing its storage at 65% per year. In the future, John foresees having high-cost, high performance storage at a top tier for perhaps $2,500 per terabyte; tier 2 and tier 3 storage might run $1,000 and $500 per TB, respectively. John added that there are emerging technologies to mix fast and slow data in order to pre-cache data being used on faster drives.  Additionally, he added that the SCSC is working with the research data initiatives underway on campus.  Terry Oas asked how much data is being stored in archival fashion on the cluster instead of being active storage. John said that the DSCR’s users have been accustomed to having to archive, since the growth available has been low and users are conditioned to only keep core active data there.  John Board said that that was scary to contemplate given the growth in overall storage.  John Pormann and Julian Lombardi noted that the new gene sequencing machines produce phenomenal amounts of data, as much as 1 TB of data per day.
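To put the 65% annual growth rate and the projected per-terabyte tier prices in perspective, here is a rough sketch. It assumes growth compounds annually at a constant rate, which the minutes do not state, and the helper names are hypothetical.

```python
import math

# From the minutes: storage grows 65% per year; projected tier prices per TB.
ANNUAL_GROWTH = 0.65
TIER_COST_PER_TB = {"tier 1": 2500, "tier 2": 1000, "tier 3": 500}


def growth_factor(years, rate=ANNUAL_GROWTH):
    """Multiple of today's capacity after `years`, assuming constant compounding."""
    return (1 + rate) ** years


def doubling_time_years(rate=ANNUAL_GROWTH):
    """Years needed for capacity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


def tier_cost(terabytes, tier):
    """Purchase cost in dollars for `terabytes` of storage in the given tier."""
    return terabytes * TIER_COST_PER_TB[tier]


print(f"doubling time: {doubling_time_years():.2f} years")
print(f"capacity multiple after 5 years: {growth_factor(5):.1f}x")
for tier in TIER_COST_PER_TB:
    # 10 TB is an arbitrary illustrative amount, not a figure from the minutes.
    print(f"10 TB on {tier}: ${tier_cost(10, tier):,}")
```

At this rate capacity doubles in well under two years, which is consistent with John Board's concern about archival data accumulating on the cluster.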

John referenced his February 2010 ITAC presentation, noting that GPU-based research computing systems are seeing big performance boosts, with 4 of the top 10 grid supercomputers now being GPU-based. However, software needs to be specifically written or rewritten to take advantage of them.  The SCSC does have blade-based systems that can support GPU functionality in case a researcher is interested. John indicated that at this time no one has approached him to purchase GPU-based machines, but that the SCSC is ready for these requests. Terry asked if Mathematica or Matlab were being run on a routine basis; John said they were not, but that the BDGrid could be accessed through the SCSC portal.  New and future versions of these applications would support GPU functionality, John added. Michael Ansel asked if you had to launch Matlab on the BDGrid to use it; John suggested that VCL could be used in conjunction with the grid to provide those compute cycles.

John reminded the committee of the http://wiki.duke.edu/display/SCSC site for more information on the SCSC.

John Board asked if the data center space was still full and did not allow for more systems. John Pormann responded that there would be space for up to 100 new machines, and that there are some older machines that have not been retired pending some upcoming changes in funding and support models for the cluster.  Robert asked if the heat densities are rising fast enough to impact replacement. John said that we are not maximizing physical use of the data center space because of heat density concerns, but added that still, we were now working with 48 machines in the rack space that 40 used to consume. 

Tracy stated that, now 5 years into the DSCR, we continue to be able to pack in more power without having to add more space, cooling, or capacity, and that not overprovisioning has paid off for us over time. John agreed.


Research Computing (Julian Lombardi)

Julian began by describing the organization of research computing support at Duke, which includes RCAC faculty, with SCAC faculty led by Jeff Chase and with John Pormann directing the SCSC. John’s position has oversight from the provost’s office and OIT. Research computing also encompasses training, the cluster itself, and consulting services.

Julian noted that Arts & Sciences, the medical center and Pratt were the main users of the service and that there has been tremendous growth in systems since 2003 and a leveling off in the last two years, but that latter number ignores growth in CPU cores; factoring that in, the growth is significant.  Today, OIT picks up about $400,000 per year towards research computing, the Provost and SIP fund $331,000, and researchers pick up the remaining $758,000. SIP, or strategic initiative, funds initiated by the provost’s office have paid a significant portion of the costs for starting up the SCSC, Julian said.  Stefano Curtarolo asked how you become associated with the cluster; Julian said interested faculty should talk with John Pormann.
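The funding figures quoted above can be totaled and expressed as shares. This is a small sketch that takes the three dollar amounts from the minutes at face value (the OIT figure is stated only as "about $400,000"); the variable names are illustrative.

```python
# Annual research computing funding by source, as quoted in the minutes.
FUNDING = {
    "OIT": 400_000,              # stated as "about $400,000"
    "Provost and SIP": 331_000,
    "researchers": 758_000,
}

total = sum(FUNDING.values())
# Percentage share of the total contributed by each source.
shares = {source: 100 * amount / total for source, amount in FUNDING.items()}

for source, pct in shares.items():
    print(f"{source:16s} ${FUNDING[source]:>9,}  {pct:5.1f}%")
print(f"{'total':16s} ${total:>9,}")
```

On these numbers, researchers already cover roughly half of the total.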

Julian said that the SIP funds are ending, creating a need to replace those dollars.  Robert Wolpert asked whether the OIT and SIP funds aren’t funded out of overhead that faculty pay on grants. Tracy and Julian said that they are not; Julian noted Jim Siedow was not present and would be needed to speak to that. Alvy Lebeck pointed out that when you buy into the cluster, you are essentially buying into a part of a larger computer and only paying for your machines, not the overhead.

Julian said that a group of research computing administrators came together at Cornell in May as part of an NSF workshop on the sustainability of funding and managing such university clusters, given the loss of “venture capital” funds such as Duke’s SIP program.  Julian said the issue at Duke is seen commonly throughout peer institutions, and that there is a common set of problems.  Out of the workshop emerged a white paper with recommendations.  The mid-range parallel processors and networked workstation clusters like the DSCR are seen as the systems that are at the greatest risk from funding challenges.

In all of these universities, Julian said there tend to be boom and bust cycles for funding followed by boom-bust cycles of participation, all built on top of an infrastructure that doesn’t exist. The main recommendation of the group is to go past boom-bust by raising the common cyber-infrastructure to a stable level that can be supported and cultivated over the long term. Besides raising this cyber-infrastructure base, Julian said that sustainable and stable support models for that base and greater collaboration across institutions (powered by the robust networks connecting them) are required.

For Duke, the recommendations are to continue support for the scalable computing cyber-infrastructure; provide support for visualization, which was cut in the last funding challenge; better leverage existing service structures, like the OIT service desk; and address lifecycle issues for research information and data.  The recommendation is for Duke to do those things through a number of changes, Julian said.  Effective July 1, scalable computing services will be expanded, while Duke will add a Condo Computing Service and launch a Cloud Computing Service and a ‘Fog’ Computing Service, along with establishment of visualization support services.

For expanding scalable computing resources, this involves adding another senior HPC analyst at the Ph.D. level, along with undergraduate and graduate student workers.  Funds for external training in applications like Matlab will also be provided.  A three-year refresh on major equipment is significant, and would be built in; it includes the core nodes, front ends and monitoring systems along with network, software and monitoring infrastructure.  There would also be dollars to provide limited support to initial explorations in new areas. Tracy explained that with permanent funding, as opposed to funding only a couple of years at a time, we can keep things moving.

Condo computing is new terminology, Julian said.  Similar to the DSCR, you would buy machines, add them to the cluster, and share excess cycles with others.  You would still have immediate access to your own systems and low-priority access to others’, but you would be paying $200 per system per year in the equivalent of “HOA” fees, from which the “condo” name comes.  Robert Wolpert asked what unit of system this applied to; John Pormann and Julian noted that it applies to a single blade, not at the core.  Julian said that while the costs may have to be revisited as technology and densities improve, those consultations would happen through the research computing advisory model. The annual fee would also support training, guest speakers and student employees; John Board emphasized that Jim Siedow is supporting the idea that these will be allowable expenses with NSF, based on his prior conversations with program directors there. Julian assured the committee that due diligence will continue on this with funding agencies.  Terry asked if this applied to NIH as well as NSF; Julian said it did not so far as he was aware, but that this would be an area of future exploration.  Alvy said that if overhead were charged on the blades and we charged annually per blade we might have issues, but no overhead is charged. Robert Wolpert said we will need high-level support at NSF to ensure program officers enforce these rules fairly.  Julian said there appears to be a consensus at NSF and other agencies that cloud services and the like are the direction in which the agencies see computing evolving. Robert asked that boilerplate text be provided for inclusion in grant applications. Julian indicated that preparing such text would be no problem.

In the DukeCloud service, Duke will be providing high-priority cycles as needed from a pool of machines. The charge is based on the actual cost of Duke operating a system, and initial estimates suggest it would be in the range of  $0.03 per CPU-hour; that cost compares favorably to Amazon charging $0.085 per CPU-hour for their EC2 cluster.  Robert asked whether we should be pricing at 100% utilization when we are unlikely to run at that intensity; Julian said it depends on how many machines are set aside to provide the resource to support that service. Those details are still being worked out. Terry asked if the systems could ever be used close to 100% of the time; Julian said probably not.
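Robert's utilization question can be made concrete with a little arithmetic. The sketch below uses the two per-CPU-hour rates quoted in the minutes; it assumes, purely for illustration, that the $0.03 estimate corresponds to machines running at 100% utilization and that the effective cost scales inversely with utilization. Neither assumption is confirmed in the minutes, and the function names are hypothetical.

```python
# Rates quoted in the minutes, in dollars per CPU-hour.
DUKECLOUD_PER_CPU_HOUR = 0.03   # initial estimate
EC2_PER_CPU_HOUR = 0.085        # Amazon EC2 comparison figure


def job_cost(cpu_hours, rate):
    """Cost in dollars of a job consuming `cpu_hours` at `rate` per CPU-hour."""
    return cpu_hours * rate


def effective_rate(rate_at_full_utilization, utilization):
    """Cost per *used* CPU-hour when machines are busy only part of the time.

    Assumes the quoted rate was computed at 100% utilization (an assumption).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return rate_at_full_utilization / utilization


def break_even_utilization(local_rate, external_rate):
    """Utilization at which the local effective rate equals the external rate."""
    return local_rate / external_rate


# 10,000 CPU-hours is an arbitrary example workload.
print(f"DukeCloud: ${job_cost(10_000, DUKECLOUD_PER_CPU_HOUR):,.2f}")
print(f"EC2:       ${job_cost(10_000, EC2_PER_CPU_HOUR):,.2f}")
print(f"break-even utilization: "
      f"{break_even_utilization(DUKECLOUD_PER_CPU_HOUR, EC2_PER_CPU_HOUR):.0%}")
```

Under these assumptions, the pool would need to stay busy a little over a third of the time for the local service to match the quoted Amazon rate.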

The “fog” computing service will allow the purchase of low-priority cycles only; those are expected to be available at roughly $0.01 per CPU-hour, vs. about $0.03 per CPU-hour on Amazon EC2 spot pricing.  This model has no need for machine ownership or “HOA dues.”  The caveat is that there is no guarantee of continuous service: fog jobs may be paused, migrated, stopped or restarted.  Julian described this as a good entry point for many researchers.  Terry said that software is the big barrier to entry for many users, as many potential users aren’t writing their own code; what resources for software like Matlab, R and Mathematica would be available?  John P. said that we’re talking cloud and fog, but that on day one this would look much more like the DSCR today.  The hope is to mirror things like Amazon’s cloud computing service, where you can install any OS and bring any software onto that image. Terry likened it to his decision to buy a new laptop and how many cores he should buy; if he knew this were available he’d save the money for CPU time on the fog service.  Julian said that this is about building business models and processes to support the growing need for these services.

Visualization support will include a visualization and HPC consultant in the Library GIS lab, along with software licensing support for visualization software like Amira and expanded support for the Visualization Forum. Robert asked where decisions on software would be made; Tracy and Alvy noted that there’s a visualization committee on RCAC that would look at these issues.  Terry asked if the DiVE was part of this plan; Julian shared that the DiVE was supported by Pratt and not part of this plan.

Multiple tiers of storage will be available, Julian said, referencing John Pormann’s earlier presentation.

Effective July 1, Siedow, Julian and the RCAC faculty will still have oversight over John Pormann as director of the Research Computing Support Center.  Two new positions, one for an HPC consultant and one for a Viz/HPC Consultant, will be created; the Library is a partner here, Julian noted.  Terry noted that Jeff Chase no longer appears on the org chart; Terry recalled that the original rationale was to have Jeff take care of administrative overhead, but Julian suggested that remained a challenge.  Julian emphasized there would be ongoing faculty direction in the form of RCAC, but this was a streamlining. John P. said he has dealt much more with the scalable computing subcommittee of RCAC, but that most of the work happened via email and that this should be doable.

The new revenues are foreseen as $2.2 million for FY2012; it includes a $300,000 commitment from the provost to stand up the storage infrastructure model, though as John B. noted, those dollars no longer go to fund routine operations.  Terry asked if that support remains ongoing; Julian noted that it did not, but was instead seed funding.  By 2012-2013, a $2 million operation would be supported with researchers paying direct system purchase fees as well as condo and other fees.  He emphasized that 50% of the operation is still supported by allocated costs paid by OIT and the deans; John B. noted that this is where overhead dollars on grants comes from. Julian said that this model reflects the provost’s goal of a 50/50 split.

Terry said that in two years this means $1 million will need to come from faculty; Julian said that this is not very different from today.  Terry asked whether this assumes grant funding remains constant. Julian noted that it does. Terry noted that grant regulations have strict restrictions on non-grant use of a computer, i.e., use of the web browser for anything except research. He suggested that the use of condo and cloud services from a dumb terminal allows the full expense of those monies to be spent on research without risk.  Julian added that there are researchers today who buy research computing systems but have only 5% utilization or so.


Alternatives to CATV in Student Rooms  (Angel Wingate, Robert Johnson, Joe Gonzalez)

Angel noted that student room cable TV service (DTV) cost up to $3 million and was established in 1991.  It was the funding source, through a subscription-based model, to pay for wiring the campus data network.  A few years ago to reduce cost we switched to require prepayment and to offer service by the semester or year.  We provide 8 free EdNet channels, 50 channels for $175/semester, and a premium package with some movie channels.  Costs exist from programming, maintenance (personnel and equipment), infrastructure support, and customer service and billing; Time Warner, satellites and content channels represent programming costs.

Angel described the infrastructure as complex, including a freestanding tower for capturing OTA channels; 10 satellite dishes and 8 DirecTV dishes on campus, a Time Warner feed, direct fiber link to Tel-Com via fiber transmitters, and a series of fiber splitters, 13 nodes, coaxial to RF amplifiers and other equipment.

Mark Elstein asked why there were some channels that traveled to Tel-Com twice; Angel said these supported the Chapel and Cable 13.  The Electroline system that is critical to the CATV system is based on 1990s technology and still runs on MS-DOS; it is no longer supported and hard to find parts for.  The system is also at its maximum channel capacity. 

Angel said there are business challenges in addition to technology challenges and described the ongoing collaboration with Student Affairs to evaluate various business models that could provide a sustaining base for CATV.  As an older technology, the system carries analog signal only and has limited channel capacity, plus there are quality issues with the signal.  Subscription rates are down 65% versus five years ago, eliminating the opportunity to generate upgrade funds.  With fewer subscribers, the rates cannot cover costs and the financial model is not sustainable.  Additionally, there is now increased availability of other options for students in their residence halls, and younger students are less interested in broadcast models and restrictions, TV ownership or non-demand usage.

Angel noted that only 14% of rooms subscribe, versus 39% five years ago. Service is available for free in commons rooms.
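As a quick consistency check, the two take-up rates above can be compared with the roughly 65% drop in subscriptions reported earlier in the presentation. This is only a sketch; it assumes the number of rooms was about the same in both years, which the minutes do not state.

```python
def relative_decline(before, after):
    """Fractional decline from `before` to `after`."""
    return (before - after) / before


# Share of rooms subscribing: 39% five years ago, 14% today (from the minutes).
decline = relative_decline(39, 14)
print(f"relative decline in take-up: {decline:.1%}")
```

The result is close to the 65% figure, so the two statements agree to within rounding.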

Rates could be raised, Angel said, but there is the risk of losing more subscribers and no guarantee costs will be fully covered, providing no way to address infrastructure issues.  Duke could upgrade the infrastructure, but this would likely require doing what other campuses do by baking the subscription into the room rent.  Tracy noted a couple of years of discussion with Student Affairs to evaluate the feasibility of moving subscription rates into room rent, but there was not sufficient support for these changes. Robert W. asked whether you would have to upgrade the coaxial to the rooms; speakers suggested the major impact would be on network integration on the back end, not the in-room wiring.  Tracy said we have a financial gap as a university with making this service work, not a technology problem; as the subscription base dwindles, so does the base over which we would need to spread the upgrade cost. Angel noted the irony that we installed CATV to pay for the data network, and that the network is now making CATV unnecessary.

The alternative, Angel said, is to close down the cable business, eliminating CATV service to student rooms and directing students to online options. Joe Gonzalez from Student Affairs said that they wanted to create a web site to provide information on how to access popular content, including “how to” information and an emphasis on legal sources for content.  He reiterated this was the choice of 85% of students today.  For commons rooms, Joe said that an IPTV solution supplementing online options with high-definition sets and 101 channels of service was being pursued. This would bring better picture quality and channels like NFL Network, he said.  John Board noted this would be a non-Duke service provider offering the IPTV option.  Joe said that they would continue to evaluate other services that might support this framework.

Ben Getson asked about Central Campus, where there are no common rooms. Joe said there are a couple of common rooms there now, but that more will be added under the house model beginning in 2012. Tracy asked if there would be some by fall 2011; Joe clarified that there would, and in response to Ben’s follow-up question, that these rooms would be accessible to all students, not simply those in selective living groups.  Ben asked if these would be accessible to all students; Joe said that at least one of these rooms would be located in every area of campus.

Robert said that one goal of CATV was making foreign language themes available supporting class needs, and asked whether this continued to be a need. Joanne van Tuyl noted that online options plus the language lab provided a significant number of these services instead. 

Bob Johnson said the internal IPTV system would use about 9 Mbps per stream, consuming perhaps 1.6 Gbps of WAN bandwidth out of 3 Gbps total; today’s use of services like Hulu uses a tenth to a third of the bandwidth levels per stream of IPTV.  The IPTV service will be multicast, Bob said.  Robert said that today’s computers that get SD should eventually draw down HD and that student internet video use will grow.  Bob said that changes to the network, including improvements with MCNC to significantly grow our external links, should address that growth.  Tracy clarified that an IPTV solution using a satellite dish downlink uses three and a half times the bandwidth of Internet video solutions, which would make the deployment of IPTV to individual dorm rooms versus common rooms challenging.
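The bandwidth figures in this discussion can be related with simple arithmetic. The sketch below uses only numbers quoted above (9 Mbps per IPTV stream, 1.6 Gbps of a 3 Gbps WAN, Internet video at a tenth to a third of the IPTV rate, and the factor of three and a half). The minutes do not say how the 1.6 Gbps figure was derived, so the stream count is an illustration rather than a reconstruction, and decimal units (1 Gbps = 1000 Mbps) are assumed.

```python
# Figures quoted in the minutes.
IPTV_STREAM_MBPS = 9.0
IPTV_BUDGET_MBPS = 1600   # "perhaps 1.6 Gbps"
WAN_TOTAL_MBPS = 3000     # "3 Gbps total"


def streams_that_fit(budget_mbps, stream_mbps):
    """Whole number of simultaneous streams that fit in the bandwidth budget."""
    return int(budget_mbps // stream_mbps)


streams = streams_that_fit(IPTV_BUDGET_MBPS, IPTV_STREAM_MBPS)
wan_fraction = IPTV_BUDGET_MBPS / WAN_TOTAL_MBPS

# Internet video at "a tenth to a third" of the IPTV per-stream rate.
internet_low_mbps = IPTV_STREAM_MBPS / 10
internet_high_mbps = IPTV_STREAM_MBPS / 3

# Per-stream Internet video rate implied by the "three and a half times" remark.
implied_internet_mbps = IPTV_STREAM_MBPS / 3.5

print(f"{streams} IPTV streams fit in the budget ({wan_fraction:.0%} of the WAN)")
print(f"Internet video: {internet_low_mbps:.1f} to {internet_high_mbps:.1f} Mbps per stream")
print(f"implied by the 3.5x factor: {implied_internet_mbps:.2f} Mbps per stream")
```

The rate implied by the 3.5x factor falls inside the tenth-to-a-third range, so the two remarks are consistent with each other.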

Angel showed slides demonstrating the types of content available for students online.  Johnny Bell demonstrated how the quality of an IPTV signal compared with that of Duke’s on-campus analog cable system.