Data Science Q&A with the Dean


Patrick J. Wolfe became the Frederick L. Hovde Dean of the College of Science and Miller Family Professor of Statistics and Computer Science in 2017. He also holds a faculty appointment in electrical and computer engineering. A native of the Midwest and a 1998 graduate of the University of Illinois in electrical engineering and music, with a 2003 doctorate from the University of Cambridge, Dean Wolfe specializes in the mathematical foundations of data science.

After teaching at Cambridge and Harvard, he joined the faculty of University College London (UCL) in 2012, where he became founding executive director of its Big Data Institute. He currently serves as a trustee and non-executive director of the Alan Turing Institute, the United Kingdom’s national institute for data science and artificial intelligence, which he helped to found while at UCL.

He has received research awards from the Royal Society, the Acoustical Society of America, and the Institute of Electrical and Electronics Engineers (IEEE), and most recently, he was named the inaugural IEEE Distinguished Lecturer in Data Science.

A past recipient of the Presidential Early Career Award for Scientists and Engineers from the White House while at Harvard, Dean Wolfe provides expert advice on applications of data science to a range of public and private bodies in the U.S. and U.K. He has also forged strong bilateral international scientific connections between the U.S. and U.K., and with countries such as India, Japan and Singapore.

I sat down with Dean Wolfe to ask about his academic journey in data science and how this emerging field will transform the academy, the economy and everything we can imagine.

Q: You earned your degrees in electrical engineering and music from the University of Illinois. How did your path evolve from there? When did data science enter the picture?

A: When I was in high school and being recruited to Champaign-Urbana, I met Professor James Beauchamp (now emeritus) who had a dual appointment in music and electrical engineering. I thought that was very cool, because I also wanted to study music and mathematics, and electrical engineering as a discipline was said to offer the largest helping of mathematics.

In graduate school at Cambridge, I worked with a faculty member, Simon Godsill, who had started a company based on his PhD work in modeling music for the purpose of restoring old audio recordings. We used techniques in signal processing — a subfield of electrical engineering — to model noise in audio recordings and to develop algorithms to mitigate the noise. At the time, the company (CEDAR Audio Ltd.) was probably best known for remastering the Star Wars soundtracks when the originals were re-released with the second set of movies. Statistics is integral to developing audio and auditory models of any kind, whether for music or for speech, and through this work, I became interested in the theoretical aspects of statistics. I worked in audio and in imaging after I joined the faculty at Harvard. My group and I collaborated with Sony, for example, on digital cameras, which didn’t perform well under low-light conditions in the early days. This also turns out to be a type of noise-removal problem.

Over time, I became interested in networks — rather than just audio signals or images — as a type of data. Transportation networks, supply chain networks, Facebook, Twitter — networks had been theoretical and abstract for a long time as a subject of graph theory, and were suddenly very practical. I found myself sitting directly at the intersection of statistics, signal processing and computer science — in particular, its focus on algorithms. At the time when “big data” started to capture the public imagination, I realized that I’d already been working in the area. It just didn’t have a name. When “big data” and data science became part of the zeitgeist, there were opportunities for initiatives. For instance, at University College London, we had the opportunity to help form a new national institute focused on data science, which became the Alan Turing Institute.

Q: When did data become big?

A: Three things have come together to make this a pivotal area. The first is the ubiquity and low cost of computing. Well before services like Amazon’s cloud existed, computers were becoming more and more powerful, to the point where, around the time of my PhD work, they became a critical enabler for research. Inexpensive and plentiful on-demand computing, however, has arrived only recently.

The second is the low cost of data storage and transmission. You might generate quite a bit of data, but there isn’t much you can do with that data without cheap ways to store and transmit it. Consider the development of digital cameras, for example. The basic camera on your cellphone provides resolutions that are orders of magnitude greater than what was possible with first-generation digital cameras. And now you can send image files that carry those larger resolutions — something we wouldn’t be able to do without a low-cost, ubiquitous storage and transmission infrastructure.

And third, there are advances in algorithms. Ideas and algorithms that might once have been purely abstract, which we couldn’t previously test, now, thanks to the low cost of storage and the ubiquity of computing …

Q: … you now have data.

A: Exactly, you now have data. And many of the data algorithms we’re employing now are not new algorithms. Quite a lot of success currently is happening with algorithms that were conceived in the late ’70s or early ’80s — it just turned out those algorithms needed loads and loads more data and computing power in order to be able to work!

Q: As our research tools become more and more sophisticated, such as mass spectrometers that can see so much more of a protein than ever before, there is a whole new realm of data science.

A: Yes, and for this discussion, it’s worth considering how technology affects data science in different industries. On the one hand, there are specific, mature technology applications such as speech and audio processing. Before the development of MP3 encoding and the iPod playback device, we were using bigger and bigger hard disk storage devices to capture audio up to the resolution limit of human hearing. Then, when we hit that resolution limit, there was an enormous demand for putting quite a bit of music on a very small storage device. And now audio is routinely transmitted in a very, very compressed way through your cellphone and streaming music services.

On the other hand, there is scientific discovery. As sensors improve, and the quality of sensing technologies improves, we discover and confirm more about the nature of our universe. In imaging and microscopy in particular, these technologies can have relatively short shelf lives, because new modalities and hardware advances are always driving new scientific discoveries. Purdue, for example, has worked for a long time with health care imaging technology companies. These organizations do an awful lot of work with new imaging hardware, and much of the algorithm expertise that goes along with that comes out of Purdue. There are many Purdue alumni working at these places and many long-term Purdue research partnerships underpinning the flow of ideas and people.

“Those of us who choose to live and work in the academy, who live and breathe science all day long, can now advance discovery by sifting through these enormous haystacks of data to find the needles that will change the course of science.”

Frederick L. Hovde Dean of Science Patrick J. Wolfe

Imaging is a big strength at Purdue, and it cuts across a lot of areas, such as signal processing in electrical engineering, biology, neuroscience, materials science and chemistry. At a place like Purdue, algorithm developers and users of new imaging technologies tend to talk to each other quite a bit. These discussions can motivate new applications and collaborations, such as when College of Science mass spectrometry expert Graham Cooks, the Henry Bohn Hass Distinguished Professor of Chemistry, teams up with hospitals and clinicians.

Unlike in audio, the general trend is that as the resolution of all this instrumentation increases, more and more data is generated. In our new STEM teaching building, for example, which will bring chemistry and biology together, we are thinking a lot about how to combine more traditional “wet” labs with modern, computational “dry” ones. When we’re looking at tissue through a microscope, that microscope is generating a lot of data that an undergraduate might not have used in the labs of the past. In these labs of the future, we will be generating and collecting the kind of data that students might analyze as part of their coursework or as undergraduate researchers.

Q: Is data science the science that connects all our departments at the College of Science? Is it a thread that runs through everything?

A: Yes, and I do think, as an emerging field, it will stand the test of time. The discipline brings together a foundational core, which largely consists of mathematics, statistics and computer science, in any order that you want to put them in. At the same time, data science done right requires domain expertise. And so, in the best situation, you have a very healthy push-pull between people who do foundational work on the analysis and modeling side, who push out new tools and new ideas, and people who are working in scientific or other domains, who pull and drive the practical, in-depth needs and questions.

In the College of Science, we’re very fortunate to have the three core departments that comprise that foundational side. If you look at our departments in the experimental sciences, we cover the life sciences and the physical sciences, where there is an enormous need for and interest in how data science can help promote scientific discovery. Whether we’re discovering advances in gene therapies or new biofuels by way of chemical properties, data science clearly has significant applications across the College, and in many other areas of scientific discovery and research across Purdue, from clinical advances, to finance and business, through to what we now call the digital humanities.

Q: Well, certainly there is a huge application in art archives. Data science begs for collaborations.

A: Yes. And that is not easy to accomplish, in the sense that it is never enough simply to take people from two areas, put them in the same room and say, “Go.” I know from experience that there will be different jargon for the same phenomena, and there will be the same jargon but with different meanings. At the Alan Turing Institute I hosted a number of scientific workshops, and one of them was about data anomalies and change detection, terms which you might think sound universal. And yet, we spent an inordinate amount of time overcoming the fact that the statistician’s definition of an anomaly differed from the computer scientist’s, which was slightly different from whoever was working on particular application areas such as cyber-defense. Data science work in change detection can impact a large number of fields, but the work won’t have much of an impact at all when done in isolation. Collaboration requires an investment of time and effort to build understanding and overcome differences in technical language and approach.

Making the choice to devote energy and effort to a new collaboration means that you have to see a path to how revolutionary the corresponding discoveries might be. Universities must create an environment in which people are free from the kinds of constraints that can encourage scientific tunnel vision. At Purdue, we are asking: What are our core strengths in data science, how do we build upon these across the entire University, and how does this promote our land-grant mission, particularly workforce development and economic development for Indiana? For example, which sectors are economically pivotal, such as scientific discovery, digital agriculture and advanced manufacturing? Which sectors might be important in the future? And how might data science at Purdue play a role in helping to bring new industry to the state?

Q: Can we teach students these skills of communication — learning how to talk to someone in the language of another domain?

A: I think that is going to become more and more important. At the College of Science, we’re fortunate now to offer a data science degree for specialists — for people like me who would have enrolled in a degree program like this had it existed when I was in college. But for those who aren’t going to be specialists, being a productive member of the workforce is going to require, if not a detailed understanding of what is under the hood of every algorithm, at least the ability to draw evidence-based conclusions from data. At Purdue, we want to make sure data science becomes a part of every student’s education.

Our trustees recently declared data science to be a new Purdue strategic priority, and our new Integrative Data Science Initiative (IDSI) will be a primary vehicle for ensuring we equip every Purdue student with the key skills needed to make sense of data — and to understand the basis of sound decision-making based on data.

As chair of the initiative’s steering committee, I was very pleased indeed to be able to congratulate our longtime head of the Department of Computer Science, Sunil Prabhakar, on his appointment as our inaugural IDSI director. Sunil has done an outstanding job for us, and his appointment to this key role is quite simply a testament to the many great achievements of the Department of Computer Science, and a reminder that the time has come for all of us in the College to help lead more broadly across, and on behalf of, all of Purdue.

Q: On that note, you’ve been selected to co-chair one of the four topics for Purdue’s sesquicentennial anniversary — Giant Leaps in Artificial Intelligence, Algorithms and Automation. What does this role entail?

A: In agreeing to serve as co-chair, I’ve taken on the task of bringing together various aspects and strands of artificial intelligence research across Purdue, and of showcasing what we’re accomplishing in this exciting area. If you pick up a magazine or newspaper, the chances are high that on any given day, you’ll see an article describing the potential of these technologies to change our world. Whether in scientific discovery, health care, defense and security, manufacturing operations, or the formulation of public policy, the promise of this area is strong. We at Purdue have a number of internationally leading strengths in data science and artificial intelligence, and I can’t wait to showcase them!

Q: Of everything data science makes possible, what excites you in particular?

A: I personally am excited about the transformational nature of data science. It will transform economies, sectors and organizations. It will transform scientific discovery. Those of us who choose to live and work in the academy, who live and breathe science all day long, can now advance discovery by sifting through these enormous haystacks of data to find the needles that will change the course of science.

I am also excited about the implications for public policy and issues such as artificial intelligence and the law. As we get closer to a world of autonomous vehicles — and algorithms that make decisions about things like our credit-risk scores — how will we ensure fair and efficient decisions? This is an area of focus for me right now. I currently serve on the Science Council of the UK Food Standards Agency, which is comparable to the “F” component of the FDA. As the data science member of the council, I help to answer questions such as: How do new technologies help improve overall safety and security of the global food supply chain? We also look at how data science could help to prevent the spread of food-borne illness. This area has become very important and timely, and it’s been a wonderful opportunity for engagement, with a very clear route to impact.

Q: I imagine data science will provide many novel routes to impact. How do we ensure we aren’t going off in myriad directions where cooperation is possible?

A: Directly after arriving at Purdue, I was encouraged by discussions with the Office of the Provost and the Office of the Executive Vice President for Research and Partnerships about helping to build bottom-up activity and engagement. Through faculty town halls and other mechanisms, I wanted to ensure that everyone at Purdue continues to have a stake in what happens in data science, while also encouraging our freedom to pursue our academic interests.

I think it’s also very important that our data science research and education efforts stay tightly coupled and integrated. I’ve been pleased to find this to be such a clear priority at Purdue. By keeping these efforts together, I’m convinced we make both of them stronger.