July 14, 2022 Minutes

4:00 - 4:05 p.m. - Announcements (5 minutes)

David MacAlpine announces that there will be back-to-back ITAC meetings in August. The dates are August 11th and August 18th. ITAC will continue to host a series of meetings to pull information together on faculty and research needs with the goal of better serving these needs. Then, a follow-up is planned to pull these findings together and present them. The meeting on August 18th will be a reception.

4:05 - 5:05 p.m. - Engineering Research and IT Support, Jessilyn Dunn, Volker Blum, Amanda Randles, Miroslav Pajic, Hai Li, Henry Pfister (40 minute presentation, 20 minute discussion)

What it is: Faculty representatives of Engineering will be joining us to present on their research/academic efforts, discuss the role IT currently plays in support of their work, and identify and review some of the areas for growth and additional opportunity between Engineering and IT support.

Why it’s relevant: In an effort to learn about the overall character of research and research IT support throughout the University, as well as to explore commonalities between the needs of individual domains and Duke as a whole, ITAC will be hosting a series of presentations/discussions over the course of the summer semester with key researchers and their colleagues. These discussions will aim to distinguish the most prevalent services, for which IT should aim to provide institutional-level support, from those that are surely essential for certain research but are not pervasively used, and so may be better supported at the school/institute/department/lab level. Ultimately, OIT is seeking to open better lines of dialogue with the major research efforts at Duke, to learn how to better support our researchers, overcome any gaps in the current system, and collaborate to identify new ways to assist in elevating Duke's Research Community as a whole.

David MacAlpine begins the meeting by introducing three engineering research teams who will present on how they use IT and what IT can do to continue supporting them and to support them better. Evan Levine adds that this is an effort both to determine what IT can do to better serve Duke faculty and researchers and, more broadly, to determine what the needs of faculty and researchers are.

Tracy Futhey acknowledges Jerry Lynch, Dean of Engineering, as well as Chris Freel, Associate Vice President of Research, and Jenny Lodge, Vice President for Research and Innovation, who may be joining this meeting. Tracy underscores that OIT is intent on listening and hearing what the issues are and then doing its best to address them. Tracy thanks everybody who has been involved throughout this dialogue.

Volker Blum, professor of Mechanical Engineering and Materials Science (MEMS), begins by speaking about high-performance computing (HPC). Volker’s research involves simulations of the quantum mechanics of materials. Volker develops simulation tools using density functional theory, which requires high-performance computing resources. Volker’s research also involves open-source projects funded by NSF. Volker’s lab simulates organic, inorganic, and hybrid materials. To model these materials, large computers that handle large data sets and large code are required. Volker shows examples of complex structures and the eigenvalue equations that are used in modeling electronic, atomic, and molecular structure. Using high-performance computing, matrices with dimensions in the range of tens of thousands to hundreds of thousands are processed with these linear algebraic equations. Recently, Duke, in collaboration with Nvidia, the Molecular Sciences Software Institute, Oak Ridge, and the Max Planck Society, was able to process large eigenvalue equations with reasonable efficiency. Normally, Volker must go to external facilities for supercomputer use as this type of facility must be built right.
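To give a rough sense of why matrices of this size call for HPC, the sketch below estimates the memory needed to hold a single dense matrix. It assumes dense storage in 8-byte double precision, which the minutes do not specify, and the function names are illustrative only.

```python
def dense_matrix_bytes(n, bytes_per_element=8):
    """Memory for one dense n x n matrix (assumes 8-byte doubles by default)."""
    return n * n * bytes_per_element


def to_gigabytes(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


if __name__ == "__main__":
    # Dimensions at the low and high end of the range mentioned in the minutes.
    for n in (10_000, 100_000):
        size_gb = to_gigabytes(dense_matrix_bytes(n))
        print(f"n = {n:>7,}: {size_gb:,.1f} GB per matrix")
```

Under these assumptions a single matrix of dimension 100,000 takes about 80 GB, and standard dense eigensolvers scale roughly with the cube of the dimension, so both memory and run time grow quickly.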

Volker then addresses what works well and where more could be enabled:

What Works Well (OIT)

• Database development

• Virtual machines

• Backup storage – this is critical to ensure data management, reproducibility, and availability

• Trivial but hugely important: unbureaucratic, well-functioning email

• That data won’t be lost

Where More Could Be Enabled – Access to High-Performance Computing

• Parallel simulations across nodes with 100s or 1,000s of tasks are the de facto standard in computational materials science.

• Access to very large High-Performance Computing (HPC) resources is possible through external proposals but can mean time-consuming, yearly repeated proposals at multiple centers, and resources can fluctuate wildly.

• HPC in teaching is not routinely possible at external HPC resources and key expertise cannot be adequately incorporated into classes.

• Undergraduate and Masters research is similarly hampered by reliance on external resources.

• Stable, accessible mid-scale HPC at Duke would address a significant gap and solve the competitive disadvantage for HPC-based research and teaching.

Henry Pfister, Electrical and Computer Engineering professor, addresses three areas:

1. Computing for theoretical research.

2. Simulations of communication systems.

3. Machine Learning and Signal Processing.

Everyone wants to start with someone else’s GitHub repository that is designed for some particular machine, and trying to make it work on anything that is not already bare-bones can be somewhat painful.

Computing for theoretical research

One example includes Jupyter Notebooks with Mathematica and Python. Henry uses standard OIT VMs for Python.

Using a standard VM for Python works pretty well, but one issue is that versions change all the time. Henry will get something up and working, then OIT will upgrade the version and Henry’s code will no longer work. Though it is not hard to fix, it is a continuous loop that requires time.
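One way to notice this kind of version drift early is to compare the installed packages against the versions the code was last tested with. The following is a minimal sketch using only the Python standard library; the function name and the pinned versions are hypothetical, not something presented at the meeting.

```python
from importlib import metadata


def find_version_drift(pinned):
    """Return {package: (pinned, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    drift = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            drift[name] = (wanted, installed)
    return drift


if __name__ == "__main__":
    # Hypothetical pins; replace with the versions the code was last tested with.
    pins = {"numpy": "1.23.1", "scipy": "1.8.1"}
    for name, (wanted, found) in find_version_drift(pins).items():
        print(f"{name}: pinned {wanted}, found {found}")
```

Running a check like this at the top of a notebook would not prevent an upgrade, but it would turn a confusing failure into an explicit message about which package changed.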

There have also been compatibility issues with getting JAX to work with Python. JAX is a Google library that is useful for automatic differentiation. These compatibility issues led to the use of Google Colab, which Henry’s students and post-docs requested and which costs $10/month.

Google Colab Advantages:

• More flexible for reconfiguration than OIT VMs

• Many repos include scripts designed for Google Colab

• Easy to share code with people outside of Duke

Google Colab Disadvantages:

• No common file storage

• Billing is complicated – Henry wonders if institutional access to Google Colab would be possible so he could be billed through Duke

Simulation of Communication Systems

Communication system simulations use Python code running on 100 to 200 cores in parallel. Henry has done this type of thing many times on the Duke Compute Cluster (DCC) and it works great. Henry uses the scavenger ability and at night can grab hundreds of CPUs. One challenge is that simulations have to be broken into pieces to run and then be put back together. One time, Henry’s team got a machine with 128 cores so that all of the parallelization could be done from one Jupyter notebook.

DCC pro: can grab a large number of scavenger/common nodes

DCC con: setup/teardown
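The split-and-merge pattern described above can be sketched as follows. This is an illustrative toy, not Henry’s code: it assumes a simple bit-error-rate simulation of two-level signaling over a noisy channel, and all names and parameters are hypothetical. Each chunk gets its own seed so the pieces are independent and repeatable, and the merge step only sums counts.

```python
import random


def simulate_chunk(task):
    """Run one independent piece: count bit errors over a noisy channel."""
    seed, num_bits, noise_std = task
    rng = random.Random(seed)  # per-chunk seed keeps pieces independent and repeatable
    errors = 0
    for _ in range(num_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 if bit else -1.0
        received = symbol + rng.gauss(0.0, noise_std)
        decided = 1 if received > 0.0 else 0
        errors += decided != bit
    return errors, num_bits


def merge(results):
    """Put the pieces back together by summing error and bit counts."""
    total_errors = 0
    total_bits = 0
    for errors, bits in results:
        total_errors += errors
        total_bits += bits
    return total_errors, total_bits


def run(num_chunks, bits_per_chunk, noise_std, mapper=map):
    """Split the work into chunks, run them with `mapper`, and merge."""
    tasks = [(seed, bits_per_chunk, noise_std) for seed in range(num_chunks)]
    return merge(mapper(simulate_chunk, tasks))


if __name__ == "__main__":
    errors, bits = run(num_chunks=8, bits_per_chunk=10_000, noise_std=0.5)
    print(f"estimated bit error rate: {errors / bits:.4f}")
```

The default `mapper` is the built-in serial `map`. On a single many-core machine it could be swapped for the `map` method of a `concurrent.futures.ProcessPoolExecutor`, and on a cluster each task could instead become its own job, with `merge` run once the results are collected.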

Machine Learning and Signal Processing

Henry teaches a senior design course for machine learning and signal processing. Projects for this class involve deciding on a repository and then applying machine learning. As Henry mentioned, matching all versions can be a nightmare on an OIT VM, so the easiest thing is a Google Cloud machine where root admin access is provided and everything can be installed with the required versions. But this can be expensive. Also, training with multiple GPUs has been problematic. Henry provides an example of a student using Jukebox, a neural network that generates pop music when provided with a prompt such as “I want a song by Drake with these lyrics.” This was expensive to do on Google because it runs for a long time. So, they got it running on the DCC with the help of Tom; this was a success story. Another group started with the Salesforce Research AI Economist package. This involves reinforcement learning that starts with simple agents that pretend to be an economy, and you see how the economy works. It uses a reinforcement learning library that sits on top of a parallelization library called Ray. Ray is very finicky, and Henry has never gotten it to work well on anything other than bare metal. Because this research is for a course, Google provides free credits of $100 to each student.

From Steffen Bass:

https://www.top500.org/resources/top-systems/summit-doescoak-ridge-national-laboratory/

Next, Amanda Randles, a professor of Biomedical Sciences who works in the field of computational fluid dynamics and creates patient-specific fluid dynamics simulations, presents. Amanda’s research involves taking data from CT or MRI scans and using commercial software to obtain a 3D geometry in which a fluid simulation can be run. These are large-scale simulations, and the Message Passing Interface (MPI) requirements are significant, but current bandwidth is limited. One of Amanda’s main concerns is these connections.

Amanda also cares a lot about visualizations. The simulations are personalized for each patient with the goal of diagnosing disease better. Simulations are real-time and can be used for virtual surgery and for treatment planning. Again, high throughput and high speed are needed.

Amanda shows a 3D flow visualization of red blood cells in a blood vessel. Cellular simulations and modeling have some of the largest computing requirements. This visualization used 140,000 different processors and took six hours just to run one heartbeat. This visualization involves between 200 and 300 million different red blood cells and captures interactions of the red blood cells with each other, with the fluid, and with the wall.

The biggest simulation that Amanda’s lab ran was a full-body simulation at that same scale. Just keeping the fluid in memory took about 140.7 TB. Maxing out memory is an issue. We need the capability of running on Amazon Aurora. We need to be able to store our data into the future as required by NIH grants, so long-term storage is needed. Blue Gene/Q total system memory is 1.6 PB. Then, if we are processing 1 PB of data, creating 1 PB of data, and running millions of steps, we run into lots of data problems.

Amanda says her team is also using devices like Oculus Rift, HTC Vive, and zSpace to investigate different modes of interaction for virtual surgery.

Amanda summarizes her thoughts on DCC:

• Use the DCC for dev and prod; need to be able to support: CUDA, MPI, and OpenMP; everything is bandwidth limited

• Quick deployment is helpful, and interaction is key

• Storage (esp., long-term) is critical and becoming more and more of an issue

• Maxed out at 700 TB.

• Need storage from DOE simulations.

• Need storage for quick access for Machine Learning (ML) workloads; also, need archival storage.

• The lab used Duke Wiki as an electronic lab notebook, so the fact that it is going away is a problem.

• Classes

• Support for classes through connection with Azure.

• But this has to be reestablished every year, and every year students leave not wanting to work on the cloud because of this experience, which is not good.

• XSEDE requests must be made each year for time on Stampede.

Jerome Lynch says NIH has a 5-year requirement for holding research data.

Tracy asks if there is a new Duke data storage policy that requires holding research data for 7 years after the last date of the publication. Mark Palmeri says yes, 7 years. Tracy says Charley Kneifel just submitted an NSF grant proposal on how we could do storage, looking at moving data between hot storage, not-so-hot storage, cold storage, and Arctic storage.
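As a rough illustration of how a retention period and tiered storage might fit together, the sketch below computes a hold-until date and picks a tier from how long data has gone unaccessed. The tier names follow the discussion above, but the day thresholds and function names are hypothetical and do not reflect any Duke policy or the NSF proposal.

```python
from datetime import date


def retention_end(publication_date, years=7):
    """Date until which data must be held, `years` after publication."""
    target_year = publication_date.year + years
    try:
        return publication_date.replace(year=target_year)
    except ValueError:
        # Feb 29 in a target year that is not a leap year: fall back to Feb 28.
        return publication_date.replace(year=target_year, day=28)


# Hypothetical thresholds: maximum idle days allowed in each tier.
DEFAULT_THRESHOLDS = ((30, "hot"), (180, "not-so-hot"), (730, "cold"))


def pick_tier(days_since_last_access, thresholds=DEFAULT_THRESHOLDS):
    """Pick a storage tier from access age; anything older goes to 'arctic'."""
    for limit, tier in thresholds:
        if days_since_last_access <= limit:
            return tier
    return "arctic"


if __name__ == "__main__":
    published = date(2022, 7, 14)
    print("hold until:", retention_end(published))
    print("tier after 400 idle days:", pick_tier(400))
```

The `years` parameter is left adjustable because the retention period depends on which requirement applies, for example the 5-year NIH figure Jerome Lynch mentions versus the 7-year Duke figure.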

Tracy says she is hearing 3 things:

1. The need for mid-scale computing resources that would provide more than the Duke Compute Cluster but less than provided by a national lab. This includes fast interconnection and a diversity of hardware.

2. Storage.

3. The need for more staff to help figure things out or help when things go wrong.

Amanda mentions that help is needed from people who could be more proactive with Microsoft. Help is also needed with hardware, especially around computation and networking, with storage issues, and with other things that faculty would say are the biggest needs.

Henry says yes, this captures a lot. Henry would like to see VMs expanded to provide many GPUs. Henry says he had a nice discussion with Katie Kilroy on needs and the staff have been great. But more staff would be great. Needs are moving so quickly, and the challenge is keeping up with these needs.

John Board says the bare-metal spin on hardware is one he has not heard before, so that is very valuable to hear. Volker Blum says bare metal is critical for his research as well. Colin Rundel underscores the need for bare-metal computing. Colin would like to be able to give a system to a student and let them mess with it for a while. Often, students don’t know what they need, so they need to be able to install a bunch of Python packages and deal with version conflicts and other headaches. Colin teaches a course involving machine learning packages. The rate at which these packages are advancing and changing is unsustainable, and there are many backward-compatibility issues at the moment, so the level of control available with bare metal would help this scale. Dealing with corporations is also not easy. For example, a student got a $1,000 bill; this was eventually resolved.

Steffen Bass contributes that Duke is part of the managing consortium with Oak Ridge National Lab. Steffen is a representative on the Board of Governors for the Oak Ridge National Lab. Oak Ridge is interested in strengthening its ties to Duke, and the director for research wants to visit this fall. Steffen will make sure these researchers get time with her.

Volker has spoken and collaborated with Oak Ridge, and this was helpful, but the collaborations were for specific single projects and there was not much flexibility. It would be nice to have a special allocation for Duke.

Tracy tells Steffen that a sustaining relationship with Oak Ridge would be nice. Duke could pay something but less than Duke would have to pay Amazon, etc., and less than the cost of setting something up ourselves.

John Board asks about any issues with sensitive data and otherwise protected data. Amanda says that many teams are using patient data where security is a big issue, but this is not an issue for Amanda’s team.

John Board asks about other pain points for Pratt.

James Daigle contributes that a lot of researchers find software in the wild. This is often difficult and requires time to get it to work. Sometimes they are successful and sometimes not. This needs attention. Also, storage is an issue as Amanda mentioned.

Jerry Lynch thinks more investment in staff is needed. He appreciates OIT's strength and looks forward to co-investing.

Tracy asks for feedback from Charley and Katie on what they have heard today.

Charley says:

More sharing is better but not so much that it becomes a problem.

The idea of hybrid storage environments enables us to move data into the cheapest storage possible, based on needs. Much of the data talked about today requires slow and cheap storage, but then there are issues with finding the data, which are being addressed. OIT is being deliberate about how to back up data. Amanda talked about fast vs. slow storage. Open-source Ceph storage may provide something different from Isilon storage; this is being looked at. Also, partnerships with national labs would be helpful so we can learn from them. For example, can we learn from CERN?

Katie Kilroy says there are a lot of unique challenges for engineering. Also, making technology across the university more accessible to more people is an issue. More flexibility and more training are needed.

Jerry Lynch underscores that the ability of faculty to bring this technology into the classroom is great; this is how we want to educate students.

Henry says educating students in this way would be easier with bare metal. It is painful to spin up a Google Cloud instance, and it would be nice to learn without this hurdle. Volker agrees with Henry but notes that teaching how to set up and run things in the real world is also important. Amanda would also like both options, as students sometimes get discouraged before they even start programming.

Jerry Lynch says some students have programming experience and some do not, and both must be considered.

Tracy concludes that it would be great to have some easy experiences that would then be followed by more challenging experiences.

Mark Palmeri believes the engineering school has a unique opportunity in that most students come through the introductory EGR 200 classes. This provides an opportunity to present consistent tool sets alongside relevant curricular offerings.

Jerome Lynch asks about the tension between spinning up local resources in one's own lab vs having centralized resources in OIT. How do you determine which you use?

Volker says that there are not many local resources. It is easy to burn through money going outside of Duke so local resources are preferred as long as machines are homogeneous and do not change rapidly over time.

Amanda says local is her choice as well. Amanda has a local cluster that is handled by OIT. Amanda’s team worked with OIT to get the needed performance. But it was set up in 2016. Amanda does not know what to do to update the cluster and have this functionality continue.

Mark Palmeri says his department bought a lot of hardware to see if it was cost-effective. A dedicated system administrator who kept the systems up-to-date was hired. In 15 years, $500,000 was spent. The system administrator was lost with the contraction of school-specific support.

Robert Wolpert says that if the VMs that we offer at Duke are insufficient for our needs, we need a method for communicating our needs to OIT.

Colin Rundel says that in the Stats department a lot of systems sat idle so sharing resources makes sense but doesn’t always fit the project. So having a larger pool to share amongst everyone helps but is not perfect.

Volker agrees that shared resources are great for homogeneous needs and wonders if there is a way to determine needs that could be met with shared resources.