Duke ITAC - March 10, 2016 Minutes

I. Announcements

Yesterday, between 2 and 4 PM, we set a network maximum; and at about 2 PM today we eclipsed yesterday’s all-time demand.  This was at the time of the Duke NCAA tournament game and we saw an aggregate demand to campus of right around 10 G of traffic compared to our normal 5 G of traffic.

We can prove from this data that our undergraduates do not go to sleep until about 3 AM.  We hope everyone’s streams were good.

The Duke Digital Initiative RFP is now available, and flyers are available in the meeting.

II. Agenda Items

4:05-4:25 – Digital Storage Working Group Report, Larry Carin, Tracy Futhey, Tim McGeary (10 minute presentation, 10 minute discussion)

What it is:  A faculty working group recommendation to the Provost of the workflow and services to support digital research data 

Why it’s relevant:  Working with the Deans, faculty committees, and senior officers, the Provost develops Duke’s intellectual priorities and oversees the implementation of research.  The storage of digital research data is a pivotal area of focus to support Duke’s research needs. 

The Federal Government (NSF and other agencies) is requiring that universities provide long-term storage for our projects, and in some cases that storage must be available almost indefinitely.  We have a working group to provide a framework for the Provost’s approval and funding.

We’ve had several meetings and have come up with this model.

OIT has technical expertise at delivering computing power and infrastructure.  The Library has expertise in archiving.  In addition, there are units across campus that have their own specialties, such as Genomics.  We wanted to craft a solution that would leverage all those strengths and not make it overly centralized. 

OIT will provide a computing substrate, shared by units across campus, including the School of Medicine.  Units within the campus who have particular needs, if they so desire, can provide a localized substrate for their unit, which would be tailored for their faculty and research needs.  For example, the Humanities have special needs; the Franklin Humanities Institute; SSRI; Engineering; each of these areas could provide a specialized resource as they choose, according to their needs.  

Federally funded grants require data management plans.  Two resources we recommend: a central research data specialist to help institutes, departments, and schools understand the requirements.  As of Jan 25, the NSF and Department of Energy now require all publications arising from their funded proposals to be submitted to a central repository that they're hosting.  This requirement is not well-publicized.

Some of the specialized substrates will be focused on use cases, not necessarily organizational cases.  Extensions of the substrate will exist for protected areas as well as open areas. 

Instead of looking at archiving as something you do at the end of a project, we want to constitute a computing research environment associated with projects that allows for archive preparation throughout a project.  A virtual machine (VM) used throughout the project could then be “freeze-dried” at the end of the project. 

In the coming academic year we are planning that every Duke faculty member and every Duke student will have access to a basic VM for research at no cost.  Beyond-basic needs may incur a cost.  This VM will be freeze-dried at the end of the project. 

The idea is to create a computing environment that leads to effective archiving. 

We expect to offer this to basic sciences in the School of Medicine, as well as to other university departments.  An offering will be available by the fall.

We believe this will be an excellent service, easy to adopt rather than mandatory.

Questions and Discussion 

Question: What about research groups with very large storage needs?

Answer: Basic VM provisioning immediately addresses 80% of needs.  We are also working on specialized solutions for the remaining 20%.  Some projects will need to decide which datasets are easily replicable, and which datasets require archiving.  This will inform decisions about what must be archived and what can be reproduced.

NIH, NSF, and DOE now have the same standards: you need to keep the primary data you have gathered, produced, or collected throughout your funded project, and make it reasonably available and sharable with other researchers.

This is essentially an unfunded mandate, and we are proceeding in good faith with finite resources.  Every university is facing the same challenge. 

Question: What about faculty who use individual laptops to manage their data, and aren’t especially interested in working in a virtual machine?

Answer: We envision a Research Data Archive Interface that will allow for easy ingestion of data: for example, ingesting data from Box in an automated fashion.  We’ll be able to freeze-dry data from a variety of sources.

Question: What timeframes do we have in mind?

Answer: We’re looking at a seven-year timeframe, which is beyond the 5-year timeframe NSF requires.  There are going to have to be difficult decisions.  Funding agencies have specific standards on "trusted copies", including three copies on three platforms in multiple geographic locations. 
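As a rough illustration of the "trusted copies" rule mentioned above, the check below tests whether a set of stored copies satisfies "three copies on three platforms in multiple geographic locations."  This is only a sketch: the function name, the platform and location labels, and the reading of "multiple" as "at least two" are assumptions, not agency definitions.

```python
def meets_trusted_copy_rule(copies, min_copies=3, min_platforms=3, min_locations=2):
    """Check a list of copies against the (assumed) trusted-copy thresholds.

    Each copy is a dict with a "platform" and a "location" key.
    The default thresholds are an interpretation of the minutes, not a standard.
    """
    platforms = {copy["platform"] for copy in copies}
    locations = {copy["location"] for copy in copies}
    return (
        len(copies) >= min_copies
        and len(platforms) >= min_platforms
        and len(locations) >= min_locations
    )


# Hypothetical example: three copies, three platforms, two locations.
good_plan = [
    {"platform": "on_prem_disk", "location": "campus"},
    {"platform": "tape", "location": "campus"},
    {"platform": "cloud_object_store", "location": "remote_region"},
]

# Hypothetical counter-example: three copies, but all on one platform in one place.
weak_plan = [
    {"platform": "on_prem_disk", "location": "campus"},
    {"platform": "on_prem_disk", "location": "campus"},
    {"platform": "on_prem_disk", "location": "campus"},
]

print(meets_trusted_copy_rule(good_plan), meets_trusted_copy_rule(weak_plan))
```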

The Digital Preservation Network, a higher-ed initiative on whose Board of Directors President Brodhead serves, is looking at a 20-year model, but that isn’t quite priced out yet.  There are other repositories, not at Duke, that we may be able to use alongside our on-premises and cloud options.

Duke is not the only university to have these challenges.  We are working with our peers, and we imagine national resources emerging in the future.

It may be useful for faculty to have language from this project that can be inserted into their applications.  Libraries Data Management already has some of this available.

4:25-4:40 – GPU Computing Update, Mark DeLong (10 minute presentation/5 minute discussion)

What it is:   GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, analytics, engineering, consumer, and enterprise applications.

Why it’s relevant:  Mark will discuss how GPU Computing resources are being made available to the Duke community.

Google’s AlphaGo computer, with 1202 CPUs and 176 GPUs, recently beat a human Go champion using machine learning.

It’s said that there are as many variations of a Go board as there are atoms in the universe. 

(Nvidia Mythbusters video on GPU parallelization – demonstrating the Mona Lisa being painted in a single massive shot rather than pixel-by-pixel, sequentially.)

https://www.youtube.com/watch?v=-P28LKWTzrI 

GPU computing has arrived at Duke in a big way.

We now have four Dell C4130 rack-mount servers; each has four Nvidia K80 GPUs.  Each K80 has 4,992 cores, but only 24 GB of memory.  Performance is rather astounding.

When I first took this job, our report to the NSF about Duke’s shared computing resources was: 81.1 teraflops of shared cluster computing capacity at Duke.

Just those four machines total about 114 teraflops.  This is a significant addition.

Each machine has nearly 20,000 GPU cores and 24 cores of regular CPU. 
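The core counts and capacity figures quoted above can be checked with simple arithmetic.  The sketch below uses only numbers stated in these minutes (four K80s per machine, 4,992 cores per K80, four machines, 81.1 and 114 teraflops); the variable names are illustrative.

```python
# Figures taken from the minutes; names are illustrative only.
K80S_PER_MACHINE = 4
CORES_PER_K80 = 4992
MACHINES = 4
PREVIOUS_SHARED_TFLOPS = 81.1
NEW_GPU_TFLOPS = 114

# 4 x 4,992 = 19,968 cores, i.e. "nearly 20,000" per machine.
cores_per_machine = K80S_PER_MACHINE * CORES_PER_K80

# Across all four machines.
total_gpu_cores = cores_per_machine * MACHINES

# How the new GPU capacity compares with the earlier shared-cluster figure.
capacity_ratio = NEW_GPU_TFLOPS / PREVIOUS_SHARED_TFLOPS

print(cores_per_machine, total_gpu_cores, round(capacity_ratio, 2))
```

On these figures, the four GPU machines alone add roughly 1.4 times the previously reported shared capacity.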

How are we presenting this hardware to researchers? 

We took our four machines and tried four different tactics:

    • Bare metal – this presents the entire machine to you, with all four K80s.
    • KVM
    • VMware – both KVM and VMware present two machines, with two GPUs each.
    • HyperV – this did not work for our purposes.

Many kinds of work must be done in serial – one operation following another – but for other kinds of work, many calculations can be done simultaneously, and massively parallel cores allow for great time savings.
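The distinction between serial and parallel work can be sketched in a few lines.  The example below is not GPU code: it uses CPU threads from the Python standard library purely to show the structural difference, and the function names and sample values are made up for illustration.  The per-pixel operation has no dependency between elements, so it could be handed to many cores at once; the recurrence needs each previous result and cannot be split up the same way.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(pixel):
    """Independent work: the result depends only on this one input value."""
    return min(pixel + 40, 255)


def serial_recurrence(values):
    """Inherently serial work: each step needs the previous step's result."""
    total = 0
    out = []
    for value in values:
        total = total * 0.5 + value
        out.append(total)
    return out


def parallel_map(func, values, workers=4):
    """Apply func to every value; order of results matches order of inputs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, values))


pixels = [0, 100, 200, 250]
brightened = parallel_map(brighten, pixels)
running = serial_recurrence([1, 1, 1])
print(brightened, running)
```

On a GPU the same idea applies at a much larger scale: thousands of cores each handle one element of the independent workload.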

On the Duke Computer Cluster, there’s a new partition called “gpu-common”.  This can be used for GPU-ready code. 

We’ll be doing a pilot into May 2016, to assess the userbase and fine-tune how the GPUs are used in the cluster.  If you have a Windows application, or a specific visualization application which can benefit but doesn’t necessarily fit into a clustered environment, contact us and we can make special arrangements.

A CUDA class will be offered this fall to help with getting coders up to speed. 

We will test-drive Intel’s “Phi” Knights Corner and Knights Landing acceleration chips, which allow GPU cores to access general memory. 

Questions and Discussion

Question: Is CUDA now the generally-accepted standard?

Answer: CUDA, an Nvidia technology, is dominant.  National labs are going with GPU in a big way; the Argonne lab will be using Intel’s Knights Landing.

Question: Is it hard to use?

Answer: This isn’t necessarily easy to program for.  Some commonly-used tools such as Matlab and Mathematica have sophisticated, automatic support.  Other tools require specific libraries, but offer more control over the use of the GPUs.  (Example: PyCUDA.)  Still other projects don’t fit; but applications that can use parallel code and can work with GPU memory constraints can benefit enormously.

4:40-5:10 – SoM Research Computing Update, Iain Sanderson, Rebecca Brouwer, Erich Huang (20 minute presentation/10 minute discussion)

What it is:  The Office of Research Informatics (ORI) is a comprehensive research hub that combines Information Technology, academic informatics, research faculty and curriculum development, and data stewardship.  ORI’s mission is to support Duke Health’s Learning Health and Personalized Health visions.  Two of the projects ORI is currently working on are the MyResearchHome Investigator portal and the Duke Data Service.

Why it’s relevant:

    • The MyResearchHome Investigator portal is an evolution of, and successor to, the MyResearch portal currently housed in the Duke@work infrastructure. The goal is a one-stop-shop for Duke researchers with full portal capability and tools to assist research discovery, administration and team science.
    • The Duke Data Service is a high provenance data store for managing research data from source to publication, while preserving the chain of provenance of the data. The first version is being tested using OIT’s OpenStack infrastructure as the destination storage system.

The Office of Research Informatics is tasked with building research IT infrastructure for Duke Health.  We have about 17 active programs.  We’re responsible for things like the Clinical Trials management system, IRB, and the new Animal Management system. 

Integrated researcher – Researchers couldn’t find where Duke resources were.  We have a collection of enabling applications, plus a portal to help folks navigate Duke’s complex research environment. 

Data provenance – Reproducibility of science is a huge problem, especially in the medical environment.  Our answer is the Duke Data Service. 

The Duke Data Service guards against malfeasance as well as research that’s difficult to reproduce.  It’s now in an alpha release, deployed to a limited group of users. 

If you ask a PI at Duke “where are your data”, it’s difficult for them to give a granular answer to that question.  “Where did your data come from?”  Also difficult to answer. 

Duke Data Service seeks to improve research data liquidity and guard integrity.

This is funded by the NIH, the Dean’s Office in the School of Medicine, and the Burroughs Wellcome Fund.

Philosophically, this is a service.  This means it needs an understandable interface, a standardized transmission protocol, and a reliable origin.  With these, you can do many things with the service. 

We’re providing both easy access and “blinking cursor” programmatic service.  We’re also able to store data on many physical endpoints. 

We’re providing a common toolset of resources for capturing and knowing about our data.  We’re providing flexible tools and a hub for storing them and linking them.  Data get unique resource locators; so do individual computers.  Join those, and you get scientific or analytic workflows.  The service also tracks provenance: when created; by whom; how.

(Demo.)

The entire project is open-sourced.  We’ve also extensively documented the application program interface (API), which allows others to extend its functionality. 

Provenance is revealed through a directed acyclic graph, which is also stored in a searchable form. 
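To make the idea of a provenance graph concrete, here is a minimal sketch of a directed acyclic graph in which each item points back to the items it was derived from.  The file names, the dictionary layout, and the helper functions are hypothetical; they are not the Duke Data Service's actual data model or API.

```python
# Hypothetical provenance graph: each item maps to the items it was derived from.
derived_from = {
    "raw_reads.fastq": [],
    "sample_sheet.csv": [],
    "aligned.bam": ["raw_reads.fastq"],
    "variants.vcf": ["aligned.bam", "sample_sheet.csv"],
    "figure_2.png": ["variants.vcf"],
}


def lineage(item, graph):
    """Return every upstream ancestor of item by walking edges backwards."""
    seen = set()
    stack = list(graph.get(item, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen


def is_acyclic(graph):
    """True if no item is, directly or indirectly, derived from itself."""
    return all(item not in lineage(item, graph) for item in graph)


print(sorted(lineage("figure_2.png", derived_from)))
print(is_acyclic(derived_from))
```

Answering "where did your data come from?" then becomes a graph query: the lineage of a published figure is the set of every dataset and intermediate file upstream of it.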

We’re also implementing a blockchain, which provides an open but cryptographically secure ledger of all transactions.
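The core property of such a ledger, that past entries cannot be altered without detection, can be shown with a simple hash chain.  This is a minimal sketch and not the project's implementation: it omits everything that makes a real blockchain distributed (peers, consensus, signatures), and the record fields and function names are assumptions for illustration.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(record, previous_hash):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps({"record": record, "prev": previous_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(ledger, record):
    """Add a record, linking it to the current last entry."""
    previous_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    ledger.append({
        "record": record,
        "prev": previous_hash,
        "hash": entry_hash(record, previous_hash),
    })


def verify(ledger):
    """Recompute every hash and link; any edit to an earlier entry breaks the chain."""
    previous_hash = GENESIS_HASH
    for entry in ledger:
        if entry["prev"] != previous_hash:
            return False
        if entry["hash"] != entry_hash(entry["record"], entry["prev"]):
            return False
        previous_hash = entry["hash"]
    return True


# Hypothetical transactions.
ledger = []
append(ledger, {"action": "upload", "file": "raw_reads.fastq", "by": "user_a"})
append(ledger, {"action": "derive", "file": "aligned.bam", "by": "user_b"})
print(verify(ledger))
```

Because each entry's hash covers the previous entry's hash, changing any earlier record invalidates every entry after it, which is what makes the ledger tamper-evident.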

MyResearchHome 

This is an integrated hub for the whole Duke research community.  It’s about finding resources and people in a way that is customized to your interests and preferences. 

Conversations with the research community revealed several areas of common interest.  These included discoverability of resources; transparency; sharing of information to avoid duplication of effort; centralized training; information tailored to individual interests; collaboration; and financial information.

By summer, we’ll deliver on “low-hanging fruit.”  This includes applications tailored to your area of research (e.g., not presenting MouseBase to researchers working only on human subjects); training; data now available in MyResearch; and presentation of several existing services. 

Questions and Discussion

Question: A few meetings ago there was discussion of enabling DOI with datasets.

Answer: That’s something we want to do.

Question: Data cleanup?

Answer: Programmatic clients allow an easy call out to the API to generate provenance.

Question: If I have a service that I want to present through MyResearchPortal, how do I do that?

Answer: One way is to create a standard widget that users can place on the portal.

5:10-5:30 – Lightweight Virtual Desktop Experiences, Chris Dwyer (10 minute presentation/10 minute discussion)

What it is:  Chris and his team have been re-experimenting with a virtual desktop strategy that scales well for mid-sized academic research groups using new commercial thin/zero-clients and centralized servers.

Why it’s relevant:   The goal has been to find a sustainable strategy for IT support that also enables realistic research computing environments that are under constant development.  The results indicate that improved network latency and thin-client optimizations have enhanced the user experience to a level that is virtually identical to heavy-client desktop performance.

Previously, we’d bought $2,000 desktop PCs for our 4-5 person research unit, but scaling up our needs to 15-20 introduced an opportunity to experiment. 

We’re using fifteen HP thin clients with reasonable specs, running Windows Embedded, connecting to four high-powered servers. 

We found that the user experience is not very different from a typical desktop.  We tested arbitrary multi-HD video streaming from YouTube.  It worked; that was impressive. 

We use Remote Desktop Protocol (RDP), compressed on both ends; we also newly have good network latency.  It’s easy to push out patches; the responsiveness provides just as good an experience as you’d want. 

We increase productivity by centralizing software installations.

This also forces the use of parallelism on a high-powered back-end, rather than depending on less powerful computation on personal laptops. 

The total cost is lower than fifteen desktop PCs. 

The experiment is working well, thanks to improved networking and recent advances in RDP.

Questions and Discussion 

Question: Windows vs Linux?

Answer: Linux works fine, but high-intensity interactive graphics doesn’t work as well with that back-end.  There may be a difference in compression or codecs that limits performance.  We believe the difficulty is with open-source RDP implementations as compared with the Windows implementation, which is the standard.

Question: How important is the thin-client device, and why not use the students’ laptops instead?

Answer: RDP 2.0 isn’t supported on every laptop.  This is offered for uniformity.

Question: How does your project speak to the possibility of extending our existing virtual machine efforts in new ways?

Answer: Infrastructure improvements have made this much more reasonable than it would have been in the past.  This also works well with the hybrid-device use case of graduate students, in which there are usually one or more additional devices in use alongside the main computing platform.