September 21, 2017 Minutes

4:00 - 4:05 – Announcements

Introduction of Student Representatives:
- Joel Mire - Sophomore studying English and Computer Science – second year on Student Government
- Tommy Hessel - Freshman majoring in Economics and a member of Duke Student Government
- Sonya Kochhar - Sophomore studying Computer Science and Biology
A Graduate Student Report will be forthcoming.

4:05 - 4:25 – DataFest, Mine Cetinkaya-Rundel (10 minute presentation, 10 minute discussion)

What it is:  DataFest is a data analysis competition where teams of up to five students spend a weekend attacking a large, complex, and surprise dataset. This presentation will provide an overview of the event, share sample student work, and discuss logistics of hosting the event at Duke (including OIT support for computing resources).

Why it’s relevant:  Duke has been organizing DataFest annually since 2011, and the number of participants has increased from 25 to 350 over the years, with participation from a growing number of area universities. This competition encourages teamwork and collaboration, and helps position Duke as a leader in the growing field of data science.

DataFest is a weekend-long data hackathon built around a large (1.5 GB), open-ended dataset with complex relationships, incorporating data science and statistical analysis. The goal is to come up with useful conclusions based on the data and make recommendations either to a business owner or a policy maker. Students can participate on their own, but being part of a 2 – 5 person team works out better. The dataset is large enough to be a challenge but not so large as to overwhelm.

The event begins on a Friday evening and ends on a Sunday afternoon in the Spring Semester. There is lots of food, fun, friendly and collaborative competition, and winners. A representative from the client introduces the dataset and provides guidelines for successful projects, but with a general, open-ended problem statement.

DataFest started at UCLA and has been growing ever since. Last year over 2,000 people took part in similar events held over a 5-week window in the spring.  The event started at Duke in 2012 with 23 students.  Participants range from high school students to master’s students. In 2016 about 350 students from area schools such as Duke, UNC, NCSU, Wake Forest, Meredith, Elon, NC A&T, and the NC School of Science and Math took part.

The event continues to grow, and last year fire and safety training had to be incorporated due to its size. The workshops before the event are mostly held at Duke in collaboration with Data and Visualization Services. Consultants and judges do not need much technical knowledge; they include Ph.D. students, faculty, and industry representatives, which makes for a diverse group.

Distributing data is a challenge as 350+ people attempt to download it simultaneously. It is useful to have 30 - 40 USB sticks to hand out if the web server crashes. The rule is that work must be done on premises to preserve the community element and collaboration, but participants are free to come and go throughout the weekend. Prizes are given out throughout the weekend to encourage student presence. Consultants are also on hand for help and networking. At the end of the event, 48 hours of work has to be narrowed down into a 4-minute presentation with 3 content slides and a 1-minute Q&A, which can be “soul crushing” for participants. There are two rounds of judging, and prizes are awarded in four main categories: Best Insight, Best Visualization, Best Use of Outside Data, and one open category for Judge’s Choice.

Companies like Kiva, eHarmony, Gridpoint, Edmunds, Ticketmaster, and Expedia have contributed their datasets in the past. The datasets are mostly flat files with 150 - 200 variables, where multiple files can be linked via an ID.
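
As a rough illustration of what "linked via an ID" means in practice, the sketch below joins two small flat files on a shared key using only the Python standard library. The file contents, the column names, and the `link_by_id` helper are all hypothetical and are not taken from any DataFest dataset.

```python
import csv
import io

# Hypothetical sample data standing in for two flat files that share an "id" column.
USERS_CSV = """id,age,state
1,34,NC
2,27,VA
3,45,NC
"""

VISITS_CSV = """id,page,seconds
1,home,12
1,search,40
3,home,7
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def link_by_id(left_rows, right_rows, key="id"):
    """Inner join: merge each right-hand row with the left-hand row sharing its key."""
    left_index = {row[key]: row for row in left_rows}
    linked = []
    for row in right_rows:
        match = left_index.get(row[key])
        if match is not None:
            linked.append({**match, **row})
    return linked


linked = link_by_id(read_rows(USERS_CSV), read_rows(VISITS_CSV))
for row in linked:
    print(row)
```

With 150 - 200 variables per file, participants would more likely use a data-frame library for this step, but the underlying operation is the same key-based join.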

Other key points:

The issue of space may be solved (Penn Pavilion this year), but student engagement in organization, reliable funding, and bodies for consulting and judging still remain a challenge.
The technology used included virtual machines for RStudio and Jupyter, reinforced Wi-Fi, web downloads and USB sticks for data distribution, and event-specific SSIDs or open ports.
The next DataFest will take place April 6 – 8, 2018 at Penn Pavilion and will again use a surprise dataset.

Questions and Comments:

The space issue may be resolved as Penn Pavilion should be large enough but more will be known after the event.
The event is jump-started by the client representative, who introduces the dataset and the purpose of the data collection. As an example, Expedia had a pricing algorithm problem solved; other solutions can address travel patterns, business questions, social science, or data quality.
Participants do have help over the weekend and can check in with the client to see if they’re on the right track.
There is a need for resources to keep this excellent program going.
The Best Use of Outside Data prize tends to go to teams with diverse interests, and students from all backgrounds are encouraged to participate.
Teams mostly come pre-formed, but students who are not already part of a team can be placed in one, although the pre-formed teams tend to do best.
The methods of advertising to students include the Duke Card system, T-shirts from last year, announcements in undergraduate classes, flyers at the Statistics Majors Unions and the student cafeteria, first-year student newsletters, advisors, etc.
Google has been a major sponsor for the last two years as well as some local corporate sponsors.
This is the only ASA DataFest in the Region and both the DataPlus and Stats Major programs serve as feeder systems to this event.

4:25 - 4:55 – Camera Validation Analytics Project and Camera Policy Update, John Board, Joe Camilo, Leslie Collins, Stan Francis, Jordan Malof (20 minute presentation, 10 minute discussion)

What it is: The camera analytics project is a collaborative effort between Electrical & Computer Engineering and OIT to enhance and automate security camera readiness inspection. The goal is to ensure proper camera functionality and operational consistency especially as it relates to image quality and intended coverage area. Duke’s camera policy will also be briefly described.

Why it’s relevant: With growing use of campus security cameras, the need to validate the operational readiness of security equipment becomes all the more important. While there are current methods that validate the general availability of security cameras, including some limited checks on operational functionality, there appear to be no real solutions to ensure the consistency of security camera aim, coverage area, and focus. Additionally, there is little documented research addressing this particular need, which prompted the formation of this project. Further, the expansion of security camera infrastructure on campus also motivates having well-understood controls on access to and retention of video information.
Duke has been rolling out a significant security camera presence and broken security cameras pose concern.

Prior to the deployment of this project, OIT had the arduous, manual task of verifying whether the 900+ cameras, mostly installed in parking garages, were still present, aimed, and working correctly by comparing the current images with stored images. The entire process took about a week.

Stan Francis at OIT collaborated with the Applied Machine Learning Lab in Duke’s Electrical and Computer Engineering department to reduce the camera workload by building an API into the camera infrastructure using Cisco’s VSOM. The application houses the camera inventory along with its extensive metadata, consisting of an indexed list of cameras with approved reference images and daily snapshots with timestamps. The application GUI allows current images to be verified against stored images and then approved, rejected, or deferred. It can also report problems via a ticketing system, power cycle cameras, and maintain event logs.

It was noted that the Electrical Engineering team (Jordan Malof and Joe Camilo, under the leadership of Leslie Collins, Prof. of Surgery and Biomedical Research) was tapped to develop video surveillance and big data analytics tools for change detection and rapid image analysis because of their invaluable experience working in Afghanistan on Synthetic Aperture Radar data analysis while driving around trying to find IEDs beside the roads. The team used modern vision research to develop algorithms and decision modules, not only to reduce errors but also to improve quality by employing a simple technique: aggregating predictions over multiple days. Since bad cameras tend to be consistently bad over time, the team needed about 3 days to capture reliable data. They achieved massive gains, with a 90% filtering rate, and were pleasantly surprised by the results.
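
The idea of aggregating predictions over multiple days can be sketched as follows. The minutes do not describe how the team combined daily predictions, so the averaging rule, the 0.5 threshold, the score scale, and the camera names here are assumptions chosen only to illustrate why a multi-day window suppresses one-off glitches.

```python
from statistics import mean


def aggregate_flags(daily_scores, window=3, threshold=0.5):
    """Return ids of cameras whose mean score over the last `window` days exceeds `threshold`.

    `daily_scores` maps a camera id to a list of per-day anomaly scores in [0, 1],
    oldest first. Averaging is an assumed aggregation rule, not the team's method.
    """
    flagged = []
    for camera_id, scores in daily_scores.items():
        recent = scores[-window:]
        # Only judge cameras with a full window of observations.
        if len(recent) == window and mean(recent) > threshold:
            flagged.append(camera_id)
    return sorted(flagged)


# Hypothetical scores: higher means the snapshot differs more from the reference image.
daily_scores = {
    "garage-01": [0.9, 0.8, 0.95],   # consistently bad
    "garage-02": [0.1, 0.9, 0.2],    # one bad day, e.g. a truck blocking the view
    "garage-03": [0.2, 0.1, 0.15],   # consistently fine
}

flagged = aggregate_flags(daily_scores)
print(flagged)
```

A camera with a single anomalous day is not flagged, while one that is consistently off is, which matches the observation that bad cameras tend to stay bad.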

Emerging technologies for surveillance can be both scary and exciting. The algorithms include efficient search for object recognition (cars, people, trees, bikes) across the entire camera system, making it possible to look for lost or stolen objects, find a suspect, or track somebody and their path. Face detection and recognition are very feasible now but will require massive computational power.

In conclusion, there is a need for policies to regulate the use of this system, which holds highly sensitive and private information, just as there is for email. Such policies may contain language like that in the Acceptable Use Policy (AUP).

Questions and Comments:

Duke has to officially review requests and there is now a process and committee that approves or denies all camera requests.
The Police Department has the strongest say and has access to other systems for viewing video, such as those used for game day operations, including pedestrian and traffic flow. Also, DUPD personnel are the only ones authorized to make recordings.
The SLA with the Police Department is 7 days as compared to infinity.
Departments requesting cameras need to have a long term budget for the true cost of a camera so as to avoid “Sticker shock” as a camera is not just the cost of the equipment but encompasses the long term maintenance, backup of data, monitoring and security considerations.
There is an ironclad retention period of 30 days in general but 180 days for Libraries and Rare Book Rooms.
Three years ago the cameras were a hodgepodge and were not deployed in a structured way, but now there is a formal process to install them.

4:55 - 5:15 – Apache Spark Infrastructure, Mark McCahill (15 minute presentation, 5 minute discussion)

What it is:  Apache Spark is an open-source data analysis framework that coordinates a cluster of computers to run statistical and other analysis tools in parallel over large datasets.

Why it’s relevant:  Apache Spark is becoming popular in fields such as biostatistics, bioinformatics, and the social sciences, because Spark clusters can be scaled up to perform analysis that would be painfully slow to run on a single machine. In addition to running Python, Scala, and R code in parallel, Spark can also act as a large-scale parallel SQL engine for big data analysis. OIT recently worked with Cliburn Chan to provide a Spark cluster to students in the STA663 Statistical Computing and Computation course, and we are working on several other applications of this technology.

Big Data, Internet of Things (IoT) & ubiquitous sensors create mountains of data but since the storage is cheap, we can save the data. Also, networks are faster so the data can be moved around quickly. Although CPUs are cheaper, the individual CPU is still not much faster.

So how can we coordinate CPUs to run in parallel to speed analysis of large datasets?
We use Spark clusters to break the analysis into parts and spread the work around using a scatter/gather approach like Hadoop’s MapReduce technology.
Spark coordinates jobs to run in parallel across a cluster to process partitioned data.
Some of the advantages of Spark over Hadoop are that it is 10 – 100x faster, data gets cached in memory rather than read from and written to disk for each transformation operation, it supports multiple languages (Scala, Python, SQL, R, Java), and it is an open source framework.
Spark supports both semi-structured data (text files, gene sequencing data) and structured data (SQL, CSV files, tabular data).
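
The scatter/gather approach described above can be sketched without a cluster. The example below is plain Python, not Spark code: it partitions a list of sequences, counts k-mers (substrings of length k) in each partition separately, and merges the partial counts. The thread pool merely stands in for cluster workers, and the function names and sample sequences are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(items, n_parts):
    """Scatter: split the items round-robin into n_parts partitions."""
    return [items[i::n_parts] for i in range(n_parts)]


def count_kmers(sequences, k=3):
    """Map step: count every length-k substring within one partition."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def scatter_gather(sequences, n_workers=2, k=3):
    """Run the map step on each partition, then gather by summing the partial counts."""
    parts = partition(sequences, n_workers)
    # Threads stand in for cluster nodes; a real cluster would ship each
    # partition to a different machine.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(lambda part: count_kmers(part, k), parts))
    return reduce(lambda a, b: a + b, partial_counts, Counter())


sequences = ["ACGTAC", "GTACGT", "ACGACG"]
totals = scatter_gather(sequences)
print(totals.most_common(3))
```

Because each partition is counted independently and the merge is a simple sum, the result does not depend on how the data is split, which is what lets a framework like Spark spread the work across many machines.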

Lessons learned:

A course assignment computing k-mer counts ran in about 2 minutes on a 10-server cluster for 40 students.
Spark clusters are configured to enforce limits on student jobs.
Deployment with Hortonworks HDP Ambari is semi-automated and can install Spark software onto a cluster in about 20 minutes.
Microsoft Azure has a Jupyter+Spark cloud offering, but it is not optimized for parallel coursework by default due to speed and single threaded processors.

Case Study:

Office of Personnel Management longitudinal data on federal government employees spans 40 years.
Duke researchers (Jerry Reiter, Ashwin Machanavajjhala, John de Figueiredo, et al.) have been developing synthetic data and differential privacy techniques to allow broader audiences to develop models that run against the data in a privacy-preserving fashion.

Summary:
A Spark cluster provides researchers and students with powerful analysis tools for big data from Python, R, SQL, Scala, and Java. OIT is in the pilot phase of supporting Spark services for research and coursework.
OIT will offer a Spark for Beginners workshop later this fall (October/November).

Questions and Comments:

Configuration is a nightmare but once up and running, it is very useful.
The disk space is dependent on the dataset and can be transient based on how long the data is needed.
Users can stand up and tear down clusters on demand.
Designed for interactive analysis and quick turnaround vs. a batch process.
Looking for anchor tenants who need to use this technology.

5:15 - 5:30 – CSG Update, John Board, Mark McCahill, Tracy Futhey (10 minute presentation, 5 minute discussion)

What it is: The Common Solutions Group works by inviting a small set of research universities to participate regularly in meetings and project work. These universities are the CSG members; they are characterized by strategic technical vision, strong leadership, and the ability and willingness to adopt common solutions on their campuses.

Why it’s relevant: CSG meetings comprise leading technical and senior administrative staff from its members, and they are organized to encourage detailed, interactive discussions of strategic technical and policy issues affecting research-university IT across time. We would like to share our experiences from the recent September 2017 meeting.
About 40 major research university groups meet twice a year to discuss solutions, share observations, and steal ideas.

A day-long workshop was held on Campus Security and Technology. Police chiefs and IT leads from many universities attended. It reinforced that there is little consistency in the working relationship between university police chiefs and IT department leads. Some IT leads had never met their police; others worked closely with their public safety colleagues, with Duke well within the latter category.

Rice University’s IT department was instrumental in coordinating campus operations for public safety and assistance during Hurricane Harvey, using data analysis to match people in need with people who could help. They set up call banks staffed by undergrads, sent online surveys and targeted emails, and had a quick turnaround. They used graphics and visualization to see the affected areas and connect people to help.

It was noted that having robust communication technologies is only part of having good emergency communications. Simple things like text messages arriving out of order can increase confusion; this can be remedied by including a sequence number with each message. Several campuses reported on incidents that became more serious because the university was unable to counter or control misinformation spread on social media and by other means rapidly enough.
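
The sequence-number remedy mentioned above can be sketched as follows. This is a minimal illustration, assuming messages are numbered from 1 in the order they were sent; the `Alert` type, the helper names, and the message texts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    seq: int    # sequence number assigned by the sender, assumed to start at 1
    text: str


def in_order(received):
    """Return the alerts sorted by sequence number, regardless of arrival order."""
    return sorted(received, key=lambda alert: alert.seq)


def missing_sequence_numbers(received):
    """Return sequence numbers not yet seen, up to the highest one received."""
    seen = {alert.seq for alert in received}
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]


# Hypothetical arrival order: message 3 arrives first and message 2 is still in transit.
received = [
    Alert(3, "Update: north entrance reopened."),
    Alert(1, "Shelter in place."),
    Alert(4, "All clear."),
]

for alert in in_order(received):
    print(f"[{alert.seq}] {alert.text}")
print("Still missing:", missing_sequence_numbers(received))
```

Even without any software on the receiving end, a visible number such as "[3]" lets a reader tell that an earlier message has not arrived yet.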

In another session, a member university talked about how their analytics tools were being used to predict individual student success in specific classes; this has been so accurate as to raise a host of new ethical and policy questions. They reported on other efforts involving natural language processing, automated quizzes, and even textbooks.

There was a short policy session on computer labs. Universities are exploring extensive use of virtual machines (VMs), as we do at Duke, and also cloud services such as Amazon’s AppStream.

GDPR – The European Union’s General Data Protection Regulation aims to protect the personal data of citizens of the EU; its terms go into effect next year. It addresses standards for export of personal data outside the EU. A new set of data standards, such as the right to be forgotten, may apply to us if we have students from the EU or offer programs in Europe. American institutions are struggling to figure out how these regulations will affect their operations both in Europe and at home.

Questions and Comments:

What is the view on guiding advisors in student course selection? There are early hallway conversations on whether we need more means of electronic advising that can handle the well-known 80% of issues so that advisors can focus on the remaining 20%. Our new President and the dean of Pratt are interested in this and wonder what the predictors of students’ success or lack of success are in their freshman year. Are we making students as successful as they can be? Are there tweaks and adjustments that would allow students to do more fun things like DataFest?