The data science ecosystem
Actors, incentives, challenges
Data science initiatives have been popping up at an increasing pace in the last couple of years all around the world. In the US, following the announcement of the National Big Data R&D Initiative of the White House in 2012, both the national funding agencies and individual universities have engaged in large-scale top-down actions in order to promote data science and research on big data. New York University opens its Center for Data Science, the University of Washington founds its eScience Institute, Berkeley launches its Institute for Data Science, and the Moore and Sloan foundations announce a five-year $37.8M cross-institutional initiative to support these three institutes. Columbia opens its Institute for Data Sciences and Engineering, the University of Rochester announces a $100M commitment to create and house its Institute for Data Science, and UMass launches its Center for Data Science. Europe follows with the University of Amsterdam creating its Data Science Research Center, Edinburgh University launching its Center for Doctoral Training in Data Science, Delft University of Technology initiating Delft Data Science, and our newly inaugurated Université Paris-Saclay opening its Paris-Saclay Center for Data Science.
Whereas we hear plenty of similar top-down mission statements about what data science is and how we should bring it to domain sciences, there is relatively little information on what these centers do, what bottom-up tools they create to enable data science research, and especially on what they do to tackle the unique new challenges of this fast-moving multi-disciplinary domain.
At the PSCDS we actually believe in these mission statements. We believe that data science is the key to solving major scientific and societal problems. But we also know from experience that our current ways of doing science just do not work for this domain.
Even with the best intentions, nothing good will happen if we put five physicists and five computer scientists in a room and tell them: “go at it”. We will need new tools, in software but mainly in management, to get the juices of collective creativity going. Some of these tools exist, but most of them will have to be invented or adapted to the unique needs of this domain.
Our goal here is to initiate a conversation among those interested in the self-reflective research of how we will do data science. We start off the conversation with an essay defining the data science ecosystem with its constituents, actors, and challenges.
Data science is a deeply interdisciplinary domain. Besides the usual challenges of projects involving experts of two distinct domains, potentially successful data science projects also have to include a third pole of software and system engineers who can implement the methods developed by data scientists, maintain the tools, and sometimes run the software in production mode. For a complete ecosystem, we also need to define and fill the role of experts at the interfaces of these three poles. The following figure sketches a fully developed data science ecosystem, outlining the activities and the actors.
The roles in this figure should be filled in a fully functioning data science ecosystem. We do not mean to identify and develop experts for these roles in isolation; real profiles usually overlap so researchers and engineers can fill several of these roles at the same time. Nevertheless, it is useful to analyze the different roles and incentives in order to understand the challenges in building and running a data science ecosystem.
The domain scientist
Experimental domain scientists build instruments and detectors, collect data, and analyze the data in order to study new phenomena or to discover new laws of nature. They usually work with engineers both at the instrumentation/product development and at the data acquisition interfaces. They may also be acquainted with the data science and software engineering aspects of the analysis chain, but their main drive and career incentive is publishing scientific results. They are less interested in advancing the state of the art in data science as long as the data analysis gets done reasonably efficiently. They may be interested in developing and maintaining single-purpose software tools which can be reused in other experiments within the same scientific domain.
The data scientist
Data scientists design and analyze algorithms. Their main drive is to propose new or improved methods or to analyze them with new techniques. Improvement is measured on standard, well-known problems and benchmark data. Their career incentive is to publish technical papers on methods. They are less interested in solving actual problems as long as the motivation to develop the method is plausible and/or well accepted within the technical community. They are especially turned off when a problem can be solved by existing techniques which are not publishable in technical venues. They are interested in building tools which are flexible enough to allow wide methodological experimentation, but which do not necessarily have the quality and efficiency to be used in large-scale production.
The software engineer
Software engineers implement existing techniques, design and maintain software, and run large-scale production on large data and large computational resources. Their main drive and career incentive is to build tools that are used by a large community. Software engineers are often employed by domain science laboratories, so these tools often focus on the problems of a given scientific community. For the same reason, software engineers may also be acquainted with domain science but rarely with the latest research in data science. Software is built for particular application domains, so most of the developed tools are not shared in a horizontal fashion among different domains. Software engineers also work with system engineers who build and run large computational infrastructures, and who develop the middleware to provide flexible ways to access the infrastructure.
The data engineer
Data engineers are familiar with the latest developments in data science and, at the same time, with state-of-the-art software engineering tools and techniques. Their goal is to develop general-purpose software tools that can be reused across several domains.
The applied scientist
Applied scientists are familiar with the latest developments in data science and, at the same time, with the problems and vocabulary of a domain science or an application domain. Their goal is to bring state-of-the-art data science techniques to domain scientists, to formalize domain-specific problems in a language that can be understood by data scientists, and, in a strategic role, to identify innovative directions within a data-intensive application domain or domain science that can only be proposed by someone who has a broad vision of the available data science techniques.
The data trainer
Data trainers train domain scientists to use data science software tools for solving their domain science problems. They have a broad knowledge about existing data science techniques and tools, as well as about training techniques. They do not necessarily carry out research in either data science or domain science, and they also do not necessarily participate in the software development.
Today this ecosystem exists only in a small number of large multinational IT companies and in some data-intensive sciences with large experimental projects. In scientific domains with smaller experiments and in smaller companies with a primary focus other than data, challenges related to developing and managing the data science ecosystem come in all shapes and sizes. That said, we can identify and describe some of the typical challenges, generalizable across multiple domains.
To a large extent, the major bottleneck is the lack of manpower. We have arguably enough domain scientists and software engineers, but there is a major mismatch between supply and demand in any of the remaining four roles related to data. The recent success of data science in the IT industry means that a handful of multinational companies, largely concentrated in North America, are draining data scientists and engineers from the academic sector, making it hard to trigger an exponential growth of the number of data scientists. Most of these companies work on highly lucrative applications (e.g., optimizing advertisement and product recommendation), so it is increasingly difficult to find competent data scientists to work on less lucrative but highly important public societal domains (e.g., health care, transportation, energy), let alone (domain) science.
Most data scientists, like other scientists, are trained and incentivized to do research on highly specialized domains. They seek scientific visibility in their international community, which is equally highly specialized, because their career advancement is almost entirely based on peer-reviewed publications. Even when they have the expertise, they have little incentive to venture into the tool builder (data engineer) role since software authorship has little value in their evaluation, and it can only serve them implicitly through the visibility they gain in the community of tool users. By the same token, they have little incentive to venture into domain sciences and to tackle economic or societal challenges. It is possible that a domain science requires new techniques which can then be published in data science venues, but this is not guaranteed at all. It usually takes a heavy investment of time and effort to be able to understand domain problems, so excursions into domain sciences are highly risky. Even when such collaborations are established, data scientists have a strong prior to use their highly specialized expertise, which is not necessarily the best solution for a given problem. Finally, data scientists have little incentive to bring the project to full fruition, and they often “run away” with an abstract data science problem (and solution) extracted from the project.
Symmetrically, domain scientists have no incentive to advance data science and to develop and publish new techniques, as long as their data science problems get solved. When they venture into tool development, they have little incentive to develop general-purpose tools.
Finally, none of the researchers has an interest in taking on the crucial data trainer role. They are incentivized to teach (as part of their full-time or part-time contract), but teaching in general is usually secondary in their career development, and within teaching, they usually prefer to train highly skilled students (master and Ph.D.) who can participate in research. They are also not necessarily equipped to teach tools to domain scientists because they themselves are not familiar with a broad set of tools. Data and software engineers have, again, little incentive to venture into training.
Even though they are not interested in advancing data science, domain scientists are very much interested in collaborating with data scientists and data engineers to solve their data analysis problems. On the other hand, data scientists may be interested in application domains if they can transfer their narrow but deep expertise, but only if it takes relatively little effort (risk aversion). They may also be interested in collaborating with data engineers to implement their techniques in general-purpose tools, in order to reap the indirect benefits related to visibility. The difficulty is that there are no well-developed channels to identify the right experts for a given problem, so most of the collaboration happens through ad-hoc channels, essentially by chance. The matching usually requires a special expert, the applied scientist (see Figure), who has a broad overview of existing techniques and available experts, and who can also converse with domain experts. Such experts are rare.
Even when the right experts have been identified and they are motivated, there are few tools that can help them to collaborate efficiently. The process of a data scientist picking up domain science expertise or vice versa is long and laborious. Success stories usually involve some form of “embedding” researchers in each other’s teams, most often a data scientist visiting a domain science lab for an extended period of time.
This analysis has naturally been shaped by the realities of the French research system. Nevertheless, we believe that most of these challenges are universal, and that nobody with the ambition of running a successful data science ecosystem can ignore them. Solving some of the challenges is beyond what a local center can do. For example, changing deeply internalized incentive structures would require strong top-down signals; here our role is limited to lobbying and hammering the message. On other fronts, such as improving access and communication channels or building tools to use data science expertise efficiently, data science centers can have a crucial role. At the PSCDS, we have been designing and learning to manage such tools. We will soon start communicating about them so that other centers can learn from our experience. By the same token, it would also be great to read about the experience of other data science initiatives around the world.