Data science has gained significant mindshare across many industries over the last decade. Today, organizations are accumulating vast amounts of data and want to efficiently extract as much value from it as possible. The interest in data science, however, has often not translated into business impact. Though data science has exploded in popularity and there has been a proliferation of tools for individuals, there is a big gap between the tools that data scientists prefer and those that meet requirements for the enterprise.
Data science tools are either not enterprise-ready or not user friendly
Why is this the case? While data has been in the limelight for quite some time, many organizations have just been laying the foundation for meaningful data analysis and data products. They have been focused on the basics – ingestion, storage, quality, and simple transformations. The most immediate value that organizations can extract from data is business intelligence (BI). On the other hand, data science is an investment that requires a longer horizon. As a result, both data engineering and BI ecosystems are much more mature than data science, which is still playing catch-up.
The reality is that most data scientists in organizations still download data to their laptops and run code locally with open source tools. While this is a very familiar workflow, it creates problems with collaboration and reproducibility. After all, these tools were not designed with collaboration as a first-class concern. Unfortunately, existing data science tools are either not enterprise-ready, because they are not conducive to collaboration and do not scale to large volumes of data, or not user friendly, because they require users to work with tools and languages that they are not familiar with.
Data scientists want to focus on data science – not DevOps and IT
Data scientists would rather solve the data problem at hand than worry about managing environments, configuring clusters, figuring out why a computation is running so slowly, monitoring resource usage or making sure they comply with security requirements. As I discussed in a previous article, a disruption to the data science workflow, even if it might appear small, can have a significant impact on productivity. Sometimes the disruptions are not so small: for example, an entire data science organization may be unable to use a fundamental data manipulation library because the organization has adopted a platform that requires them to learn a new one. I argue that one of the key reasons data science teams are not as productive as they should be is that they don’t have the tools that allow them to work productively and collaboratively.
A modern solution for seamlessly scaling data science
This is where Coiled comes in. Coiled accelerates data science adoption and increases the productivity of data science teams by removing distractions so that data scientists can focus on solving data problems. Founded by Matthew Rocklin, Hugo Bowne-Anderson and Rami Chowdhury, Coiled is uniquely positioned to solve the problem of data science for the enterprise. Matt is the creator of Dask, which is the most widely used Python library for parallelization. He has deep expertise in enabling data science in large organizations and in scaling data science workloads from his experiences at Continuum Analytics and NVIDIA. Hugo and Rami also have deep roots in the data science and open source communities through their work at DataCamp and Continuum Analytics.
Coiled solves the problems of enterprise-readiness and usability in many ways:
- Laser-focused on the Python experience. Python has emerged as the de facto language for data science, with libraries such as pandas, NumPy, scikit-learn and XGBoost in the toolkit of many data scientists. In fact, its popularity in data science has propelled it to become the second most popular programming language behind Java.
- Natively scales Python workloads to big data. With minimal code changes, data scientists can scale up their workloads to many machines. When an organization’s data becomes too large for single-node computations, it no longer has to worry about migrating its data infrastructure to a new platform and requiring users to learn new tools to do the same work.
- Data scientists do not need to change their existing workflow. They can choose whichever local IDE or notebook they prefer. Coiled manages cloud computing behind the scenes.
- Enhances collaboration among data teams. Coiled makes it easy to share virtual environments and also offers a set of template environments for different types of work. Data scientists can share projects with their team knowing that their colleagues will be working off of the same environment and compute infrastructure.
- Increases visibility into computation costs. Coiled uses the popular Dask dashboard and elevates it for the enterprise. Administrators can have a clear view into how resources are being allocated across users. Users can see whether or not their code is efficient.
Coiled will be the crucial connective tissue necessary for enterprises to do data science at scale. That is why we are incredibly excited to partner with them. Congratulations to Matt, Hugo, Rami and the rest of the Coiled team!