UR-Cornell partnership seeks to bridge diverse approaches to data science

May 11, 2020

Lines of code on a computer screen.
Photo by Markus Spiske on Unsplash.

Vast datasets are now being generated from a variety of sensing systems, including brain monitoring and imaging devices, global positioning systems (GPS), radio-frequency identification (RFID) sensors, medical devices, emergency systems, and energy systems. Each provides rich information about unexplored aspects of the modern world.

New ways of analyzing and extracting useful information from these datasets are being developed within many disciplines—each with its own unique set of methods, perspectives, and problems.

What is lacking, however, is a rigorous, shared mathematical foundation that would integrate the different tools and viewpoints used by these various disciplines—so that advances in data science will be enduring and broadly available.

The goal of a new Greater Data Science Cooperative Institute at the University of Rochester and Cornell University is to develop that shared foundation, with a focus on applications in medicine and health care.

Grounding the work in medical applications “helps us ensure that the assumptions we are making about the availability and quality of data are realistic,” says principal investigator Mujdat Cetin, interim director of the Goergen Institute for Data Science at Rochester, who will take over the position full time on July 1. “And it allows us to test our methodology and results with real data.”

The work, funded by National Science Foundation grants totaling $1.5 million, will involve a core group of 27 faculty members at the two institutions with expertise in electrical and computer engineering, mathematics, statistics, and computer science. The principal investigator of the Cornell team is David Matteson. Four postdoctoral researchers will also be hired.

They will work in these broad topic areas:

  • Topological data analysis
  • Data representation
  • Network and graph learning
  • Decisions, control and dynamic learning
  • Diverse and complex modalities

Beneficial applications could include ways to help researchers better track the progression of infectious diseases like COVID-19, better understand how different brain structures are related, perform improved automated analysis of medical images, and achieve more efficient compression of the data needed for analyzing hospital usage to stay within the constraints of available computational tools. An ad hoc working group created within the Greater Data Science Cooperative Institute has just published the results of a study it performed on COVID-19 epidemiological data analysis.

The basic challenge

The “basic problem in data science is that you observe complicated, imperfect data and want to extract information from such data in a principled manner at various levels of abstraction,” Cetin says.

However, the myriad terminologies and methods used by different disciplines—and even the different focuses they bring to the challenges of data science—can sometimes become an impediment to advancing the field, Cetin says.

For example, electrical and computer engineers traditionally tackle the challenge of data analysis with detection and estimation theory, whereas computer scientists use machine learning.

A statistician might be primarily concerned with measuring certainty and confidence in results; a computer scientist, on the other hand, might be primarily focused on developing computationally efficient algorithms to process the data.

Developing a shared foundation, Cetin says, “gives you the freedom to essentially be both rigorous from a mathematical and statistical standpoint, connected to applications in a variety of domains, but to also come up with efficient algorithms that not only give you results, but also allow you to characterize the performance and understand the limitations of those algorithms.”