New data science and machine learning platforms are popping up almost every week. That’s because vendors are building tools to optimize the data science workflow. They are creating better notebooks, making it easier to track the training of machine learning models, and facilitating the deployment of models or data apps in production. Along the way, they are often creating an end-to-end platform that covers everything from data ingest to productionization.
These platforms are not, however, seeing mass adoption yet. Most data scientists still prefer to work locally with small data sets and free open source tools, and to share their work through email and manual handoffs of code. This workflow may be old school, but it’s also convenient and familiar (and often cheaper on the surface!).
So how can you design a data science platform that will make them change their ways? In a nutshell, you need to match the efficiency of local development but also address its weaknesses, such as collaboration and reporting. If your tool can speed iteration cycles and make it easier for data scientists to showcase their work to external stakeholders, it should see rapid adoption. That’s why the following elements are the most important in designing a data science platform:
- Collaboration with business and engineering stakeholders
- Support for familiar workflows and toolsets
- Interoperability with standard workflows and tools
- Scalability to big data
- The right balance of collaboration and governance
Let’s look at each in detail:
Making it easy for data science teams to hand off work
Collaboration with non-data scientists has traditionally been one of the most challenging and overlooked tasks facing data scientists, but it’s also the way they deliver value to their businesses. After they’ve performed their analysis, they need to share their work with business end users or engineers, who will then productionize their ML models or deploy data applications.
In most cases today, data scientists do this the same way they did years ago. They email results to the business unit, share project files with their teams, or hand over a notebook to an engineer. Such a workflow is not only cumbersome; without a systematic way of transferring knowledge, projects also go stale.
Great data science platforms make it easy to hand off work. For business users, they offer slides that can be used in presentations or dashboards that update in real time. You can find good examples of this in startups like Streamlit and Hex, which are laser-focused on helping data scientists build beautiful data apps.
A data science platform should also facilitate handoffs from data scientists to engineers by enforcing the organization of a project such that it’s easy to understand and reproduce. Domino Data Lab, for example, makes it easy to share work by helping data scientists organize project files and environments. Collaboration workflows like this are where data science platforms really shine — and create the most value for companies.
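One concrete piece of a reproducible handoff is capturing the environment alongside the project files. The sketch below does this with only the standard library; the file name `requirements.lock` and the function name are illustrative, not any particular platform's API.

```python
"""Sketch: snapshot the Python environment next to a project's files so
an engineer can reproduce a data scientist's results later."""
from importlib.metadata import distributions
from pathlib import Path
import sys


def snapshot_environment(project_dir: str) -> Path:
    """Write the Python version and pinned installed packages to a
    lock file inside the project directory, and return its path."""
    out = Path(project_dir) / "requirements.lock"
    lines = [f"# python {sys.version_info.major}.{sys.version_info.minor}"]
    lines += sorted(
        f"{dist.metadata['Name']}=={dist.version}" for dist in distributions()
    )
    out.write_text("\n".join(lines) + "\n")
    return out
```

A platform can run something like this automatically on every commit or run, so the environment is versioned with the code rather than reconstructed from memory during handoff.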
Support for familiar workflows and toolsets
Data scientists are often reluctant to change their existing workflows, partly because they don’t want to spend time learning new tools and languages that might not turn out to be useful. Any organization that tries to force them to work in new ways may find that it creates a lot of friction. Even small differences can lead to big roadblocks. A notebook that has different shortcuts than Jupyter, for example, can greatly impede progress. Ditto for a machine learning library that has slightly different syntax from what a data scientist learned in school or an unfamiliar interface for managing data and other files.
That’s why a good data science platform supports the most common and important workflows, enabling users to work in the way they are most comfortable.
Interoperability with standard workflows and tools
Interoperability means that a data scientist can move a project to and from a platform seamlessly. Its absence leads to situations in which people need to rewrite code, throw away work, and sometimes fail to collaborate. However, you can achieve interoperability in a variety of ways. For example, Jupyter is the most popular notebook for data science. This does not mean that all data science platforms need to support Jupyter. In fact, there are plenty of compelling open source (Iodide, Polynote) and proprietary notebooks (Google Colab, Deepnote) emerging. The key is that these notebooks can seamlessly migrate to and from Jupyter. And, of course, the notebook experience has to be familiar to Jupyter users so that data scientists don’t feel like they are relearning things they already know.
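Seamless migration is practical because Jupyter's `.ipynb` format is plain JSON. The minimal converter below pulls the code cells out of a notebook with nothing but the standard library; a real implementation would use a library such as nbformat or jupytext, and this function name is illustrative.

```python
"""Sketch: extract the code cells of a Jupyter (nbformat 4) notebook,
which is just a JSON document, into a single plain script."""
import json


def notebook_to_script(ipynb_json: str) -> str:
    """Return the concatenated code cells of a notebook serialized as JSON."""
    nb = json.loads(ipynb_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            # Cell source may be stored as a list of lines or one string.
            src = cell["source"]
            chunks.append("".join(src) if isinstance(src, list) else src)
    return "\n\n".join(chunks)
```

Because the format is open, a platform's import/export path can round-trip notebooks without ever depending on Jupyter itself — which is exactly the interoperability property described above.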
Scalability to big data
A good platform can easily scale up to big data and back down. That’s because data scientists often want to work locally with samples of data pulled from a database. They typically learn statistics and machine learning on relatively small data sets and, as a result, tend to be more familiar with tools suited to data of that size.
However, there are several reasons why it is important for code to be able to scale to big data frameworks such as Spark and Dask. First, data scientists need to be able to see if their analysis holds true on the full data set. While it takes longer to perform computations on a large data set, it can also be more efficient overall to validate throughout the development process rather than be caught by surprise at the end. In addition, a data set may be so large that a single machine cannot handle the computations — and in fact, it is common for data scientists to have to rely on engineers to test their models on the full data set. Finally, it’s much easier to hand off machine learning models to engineers if the development environment is closer to production.
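The sample-then-validate loop described above can be sketched in a few lines. The data here is synthetic; in practice the "full" set would live in a database or distributed store, and the sample size is an arbitrary illustrative choice.

```python
"""Sketch: iterate locally on a small sample, then confirm the estimate
still holds on the full data before scaling up."""
import random
import statistics

random.seed(0)
# Stand-in for a large table; real data would be pulled from a database.
full = [random.gauss(100, 15) for _ in range(1_000_000)]

# Small enough to iterate on quickly on a laptop.
sample = random.sample(full, k=10_000)

sample_mean = statistics.fmean(sample)
full_mean = statistics.fmean(full)

# Validate during development, not as a surprise at the end.
assert abs(sample_mean - full_mean) < 1.0
```

A platform that supports both ends of this loop — the fast local sample and the full-scale validation run on the same code — removes the usual handoff to an engineer just to test on the complete data set.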
Databricks, whose founders created Spark, has made significant inroads here by making it easier to work with large data sets in an interactive notebook environment and by reducing compute and runtime complexity. This way, data scientists can run computations on large clusters without being an expert in distributed computing. Coiled Computing is a new startup tackling this problem by commercializing Dask, which natively scales Python.
Balancing collaboration and governance
Collaboration and governance may seem to be separate concepts, but they are two sides of the same coin. In most organizations, separate forces are in play here: legal and security look after governance, while data scientists want easier collaborative workflows. This often puts the two sides at odds. A good platform should enable the closest possible collaboration between users, while still meeting legal and security requirements, such as maintaining an audit trail and making sure users don’t have access to code, data, or infrastructure that they shouldn’t.
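The audit-trail requirement, for instance, can be as simple as intercepting every data access and recording who touched what. The sketch below uses an in-memory list and made-up function names purely for illustration; a real platform would persist these records to tamper-evident storage.

```python
"""Sketch: a minimal audit trail recording who accessed which dataset,
which action they took, and when."""
import functools
from datetime import datetime, timezone

AUDIT_LOG = []  # real systems persist this durably, not in memory


def audited(fn):
    """Decorator: record the caller's user and dataset before running."""
    @functools.wraps(fn)
    def wrapper(user, dataset, *args, **kwargs):
        AUDIT_LOG.append({
            "user": user,
            "dataset": dataset,
            "action": fn.__name__,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return fn(user, dataset, *args, **kwargs)
    return wrapper


@audited
def read_rows(user, dataset, n):
    # Hypothetical data access; a real version would query a store.
    return f"{n} rows of {dataset} for {user}"
```

The point of the decorator pattern is that governance is enforced in one place, so data scientists collaborate through the same calls they would use anyway and the audit trail comes for free.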
This may seem straightforward but it can require highly nuanced capabilities. A simple example involves a situation in which it’s permissible to share aggregated data but not the underlying data. If User A works with user-level data, but User B doesn’t, User A has to be able to share overall results without exposing the more sensitive information.
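One common way to encode that sharing rule is a minimum group size: aggregates are shareable only when they are backed by enough distinct users that no individual can be singled out. The sketch below is illustrative — the threshold of 10 is an arbitrary policy choice, not a standard, and the function name is made up.

```python
"""Sketch: share per-group averages computed from user-level data, but
suppress any group backed by too few distinct users."""
from collections import defaultdict

MIN_GROUP_SIZE = 10  # illustrative policy threshold


def shareable_averages(rows, min_size=MIN_GROUP_SIZE):
    """rows: iterable of (group, user_id, value) tuples.
    Returns per-group means, dropping groups with too few users."""
    values = defaultdict(list)
    users = defaultdict(set)
    for group, user_id, value in rows:
        values[group].append(value)
        users[group].add(user_id)
    return {
        g: sum(vs) / len(vs)
        for g, vs in values.items()
        if len(users[g]) >= min_size
    }
```

Under this rule, User A can hand User B the returned dictionary: the aggregates survive, while any group small enough to leak individual behavior is silently dropped.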
The elements of a data science platform are fairly straightforward. There’s typically a connector to a data source, an interactive notebook or an editor with a console, and some way of sharing work. There are, however, a lot of considerations to navigate in building one, as seen above. But if a platform can improve iteration speed and cross-functional collaboration, data scientists will love it and others across the entire organization will embrace it.
Have additional questions or your own ideas about what’s important for data science platforms? Feel free to reach out to me.