June 13, 2018
Capstone projects: ‘A taste of the real world’
(The data science curriculum at the University of Rochester culminates in the semester-long capstone course. It complements the more structured and traditional course work that students take earlier in their program.)
Ricky Su and his teammates took on an interesting challenge from the New York State Attorney General’s office for their senior capstone project in data science.
Using initial filing information for 7.7 million previous cases, would it be possible to train computer models to predict, at the moment a new case is filed, how long that case might be expected to last?
This would help the attorney general’s office better allocate its resources and help litigators prepare better.
But before the students could create a model, they faced a daunting task: cleaning up the data.
Much of the data they received had been entered by hand and was riddled with inconsistencies and errors. For example, one court might enter “uncontested matrimonial” in lowercase letters. Another court might capitalize it, abbreviate it as “UM,” or use any number of other variations.
“That was one of the biggest challenges we faced,” Su said. “The entries have to be consistent for a machine to know if it’s comparing the same thing.”
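The kind of normalization Su describes can be sketched in a few lines of Python. This is a minimal illustration, not the team’s actual code: the alias table, the `normalize_case_type` helper, and the sample entries are all hypothetical, built only from the “uncontested matrimonial” and “UM” example above.

```python
import re

# Hypothetical alias table: maps known abbreviations to one canonical label.
ALIASES = {
    "um": "uncontested matrimonial",
}


def normalize_case_type(raw: str) -> str:
    """Return a canonical, lowercase label for a hand-entered case type."""
    # Lowercase, trim, and collapse runs of whitespace into single spaces.
    cleaned = re.sub(r"\s+", " ", raw.strip().lower())
    # Drop trailing periods, as in "UM."
    cleaned = cleaned.rstrip(".")
    # Expand a known abbreviation; otherwise keep the cleaned text.
    return ALIASES.get(cleaned, cleaned)


# Hypothetical variants of the same case type, as different courts might enter it.
entries = ["uncontested matrimonial", "Uncontested Matrimonial", "UM", " um. "]
labels = {normalize_case_type(entry) for entry in entries}
print(labels)
```

After normalization all four variants collapse to a single label, which is what lets a model treat them as the same category. A real data set would likely need a much larger alias table and some manual review of entries that match nothing.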
Master’s students Brooke Hamilton and Hehua Chi also faced data cleaning challenges during their capstone project. They worked with Origent Data Sciences on assessing the therapeutic benefits of a drug used to treat amyotrophic lateral sclerosis (Lou Gehrig’s disease).
Many of the more than 10,000 de-identified patient files they used from a clinical trials database lacked key data or needed to be condensed and simplified.
“The data cleaning effort was so much more complicated than I expected. This has been a really great experience,” Hamilton said.
That is exactly what the capstone projects are intended to provide.
“One of the main learning objectives of the (Data Science Practicum) course is to get the students exposed to real-world data and get a feel for how they can work with strategies to clean messy data,” says Ajay Anand, deputy director of the Goergen Institute for Data Science, who teaches the practicum for master’s students. He also teaches the Data Science Capstone course for seniors. “Data scientists spend around 60 to 80 percent of their time on preparing and managing data for analysis. Students hear this in the classroom and get to experience it firsthand in the capstone and practicum projects.
“Compared to a very structured data set they might work with in the classroom, this is giving students a taste of the real world.”
Partnering with companies
A hallmark of the Rochester data science curriculum for both undergraduates and master’s students is the chance to partner with companies on “real-world challenges.”
Consider these other projects from the spring of 2018.
Is it possible to build a computer model to further optimize an ad campaign’s performance on social media?
Master’s students Matt Trudeau, Josh Kolodny, and Karan Vombatkere worked with Brand Networks, which offers social and digital advertising software and services to enterprise brands and large agencies. The company’s Optimize Now program automates the management of bids, budgets, and ads for clients who launch campaigns on Facebook and other social media. The students drew on a database of 81,008 ads from 10,448 campaigns to identify which primary goals do the best job of optimizing campaign performance. They implemented two models from the Python library scikit-learn (sklearn): logistic regression and a multi-layer perceptron classifier.
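For readers unfamiliar with the two model types, the sketch below shows roughly how a logistic regression and a multi-layer perceptron classifier are fit and compared in scikit-learn. It is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), and the feature count, binary label, and hyperparameters are hypothetical and unrelated to the Brand Networks data set or the students’ actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 500 "ads" with 8 numeric features
# and a binary label such as "campaign met its goal".
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Both models sit behind a scaler, since each is sensitive to feature scale.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "mlp": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() reports mean accuracy on the held-out rows.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

Accuracy on a held-out split is only one possible yardstick; the article does not say which metric the students used.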
Can the internet of things be used to predict heat waves?
Undergraduate Jack Gallagher and master’s student Yihe Yang worked with Arable Labs, which has developed a weather and crop monitoring device that can track air temperature, dew point, radiation, and other variables on local farms. The students applied two time series models – Long Short-Term Memory (LSTM) and Autoregressive Integrated Moving Average (ARIMA) – to hourly air temperature readings from two locations, resulting in a model that could predict temperatures to within one to two degrees. This could help farmers predict heat waves or cold snaps that might affect individual fields more accurately than regional forecasts could.
Can a computer quickly identify suitable reviewers for a paper that has just been submitted to a scientific journal?
Master’s student Yue Zhao worked with Gaurav Sharma, a professor of electrical and computer engineering who is also editor-in-chief of IEEE Transactions on Image Processing. The goal: create a model that could compare key words from a submitted paper with key words from a corpus of previous papers by potential reviewers, to find suitable reviewers with expertise in the topic area. Zhao used a Latent Dirichlet Allocation (LDA) model and a bag-of-words model to identify key words that could match papers with similar topics.
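The bag-of-words half of that idea can be illustrated with a small Python sketch. It is not Zhao’s implementation: the reviewer names, the sample texts, and the choice of cosine similarity as the matching score are assumptions made for the example, and the LDA topic-modeling step is left out entirely.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reviewers, each represented by text from earlier papers.
reviewers = {
    "reviewer_a": "image denoising with sparse representations and wavelet transforms",
    "reviewer_b": "color halftoning and printer color calibration",
}
submission = "a wavelet approach to image denoising"

# Rank reviewers by how closely their past vocabulary matches the submission.
query = bag_of_words(submission)
ranked = sorted(
    reviewers,
    key=lambda name: cosine_similarity(query, bag_of_words(reviewers[name])),
    reverse=True,
)
print(ranked)
```

Here the first reviewer ranks highest because the past work shares the words “image,” “denoising,” and “wavelet” with the submission. A practical system would also remove common stop words and weight rare terms more heavily.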
‘A fantastic experience for us’
In addition to giving students experience with “messy” data and with applying computational models to real-world problems, the capstone projects reinforce the importance of teamwork.
“You have to pay attention to how you’re designating tasks,” says Anya Khalid, who worked with three other seniors on a better security system for the image collections of VisualDX. The company offers access to medical images spanning the full spectrum of human disease to aid in educating medical students and practitioners and to aid in diagnosis.
Khalid and her team succeeded in developing new security features the company was able to implement.
Partnering with students on capstone projects gives companies an extra resource in tackling thorny data science challenges.
“I think without question this has been a fantastic experience for us,” said Luke Guerrero ’05, executive vice president of technology at Brand Networks, after hearing Trudeau, Kolodny, and Vombatkere summarize their project.
“They came onto our site every week, they were highly engaged, and this is further than we’ve seen anyone get with this problem,” Guerrero said. “The students have really helped us push this forward.”
How capstones are organized
Capstone projects are a culmination of the work students do for both bachelor’s and master’s degrees at the Goergen Institute for Data Science. The projects enable students to apply what they’ve learned in the classroom to “real-world” challenges, working in partnership with sponsoring companies and other outside organizations.
During a semester-long course, students are organized in teams of three or four and follow this timeline:
Weeks 1-2: Students attend classes and lectures on project management concepts, data science lifecycle concepts, data visualization, and a review of data cleaning.
Weeks 3-4: Representatives of companies and other organizations present their projects to students. Students are assigned to projects based on their ranking of preferences, but also on the individual skills they can bring to a project.
Weeks 5-10: Students work on their projects, communicating directly with company sponsors weekly or biweekly.
Weeks 11-12: Students give final presentations, then provide code and write up individual final reports, which, for undergraduates, also satisfy an upper-level writing requirement.