There has never been a better time to build a data infrastructure company. Despite all the buzz over the last decade about the need for businesses to be data-driven, the reality is that the infrastructure to enable meaningful data work at scale has lagged the hype. That is changing now. Why?
- Cloud storage and compute technologies have matured significantly.
- Standard tools and workflows have emerged for different data personas.
- Data infrastructure businesses like Snowflake and Databricks command astronomical valuations and are showing incredible growth, retention, and expansion.
There are multiple business models in data infrastructure, but one that I’m very excited about is the open core model: commercializing open source projects. Databricks and Confluent got their start by commercializing Apache Spark and Apache Kafka, respectively. A wave of companies, including Starburst (Trino) and Fishtown Analytics (dbt), have also followed this model.
Open core businesses have unique benefits and challenges. An open source project appeals to customers because it does not lock them into proprietary technology. Its community can also serve as an incredible base of evangelists for the open core company. However, figuring out how to add real commercial value beyond the open source project is a significant challenge. It’s hard to beat something that is “free.” Additionally, the principles of a great enterprise business still apply. Like traditional enterprise SaaS, open core businesses need to solve high-priority business problems that improve employee productivity, reduce costs, or unlock new business opportunities.
There are two additional criteria, however, that I look for in an open core company.
- Laying the foundation for strong, bottoms-up adoption
- Creating meaningful value in the commercial offering
Laying the Foundation for Strong, Bottoms-Up Adoption
The beauty of an open core business is that it naturally lends itself to a bottoms-up go-to-market (GTM) approach. The open source project allows customers to “try before they buy.” It also engages a community of contributors, often from different organizations, and end users who provide feedback, catalyzing an evangelist-focused GTM approach.
There are three key elements that drive bottoms-up adoption.
An easily adoptable open source project
The open source project has to be easy to set up. After all, the user needs to actually be able to implement the project to try it. Many open source projects require complicated configuration and installation of many dependencies just to get started. If the user has to struggle to get a working instance of the project to run locally, that is probably not a great sign for large-scale adoption. dbt does a great job because it only takes a few lines of command line code to set up and start building models with SQL.
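As a rough sketch of that low-friction setup (the project name here is hypothetical, and the exact adapter package depends on your warehouse), getting to a first dbt model takes only a few commands:

```shell
# Install the dbt CLI (pick the adapter for your warehouse, e.g. dbt-snowflake)
pip install dbt-core

# Scaffold a new project; the name is just an example
dbt init my_analytics_project
cd my_analytics_project

# A model is plain SQL in models/, e.g. models/active_users.sql:
#   select user_id, count(*) as sessions
#   from raw.events
#   group by 1

# Build all models against the warehouse configured in profiles.yml
dbt run
```

This is a CLI walkthrough rather than a runnable script, since `dbt run` needs a configured warehouse connection, but it illustrates how little ceremony stands between a new user and a working model.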
The open source tool also needs to be easy to use. The best open source projects achieve this by having a user interface that is very familiar to the target audience and meets users where they are. Dask, for example, natively fits into the Python data ecosystem. It only takes a few steps to change Python code to run parallel processes with Dask. It also has integrations with popular Python libraries, like TensorFlow and XGBoost, to make the transition to distributed computing even smoother.
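As a minimal sketch of that transition (assuming `dask` is installed; the `clean` function is a made-up stand-in for real work), ordinary Python code can be parallelized with `dask.delayed` in just a few lines:

```python
from dask import delayed

def clean(record):
    # Stand-in for an expensive per-record transformation
    return record * 2

# Wrap ordinary function calls so Dask can schedule them in parallel
tasks = [delayed(clean)(r) for r in range(10)]
total = delayed(sum)(tasks)

# Nothing has executed yet; .compute() runs the whole task graph
result = total.compute()
print(result)  # → 90
```

The appeal is that the code reads almost exactly like the serial version, which is what "meeting users where they are" looks like in practice.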
A community that cultivates evangelism and deep engagement
Community is critical to an open core company’s success. Evangelism from passionate members of the community goes hand-in-hand with the ability to use the open source project for free. When done right, these two become a marketing engine for the business.
Community engagement takes place along multiple dimensions.
- Who is engaging in project discussions? Contributors? Users? To what extent?
- Where are discussions taking place? GitHub, StackOverflow, Slack, Discord, Gitter?
- What types of discussions are taking place on each of these forums? Debugging? Discussing roadmap? Sharing best practices?
The reality is that there isn’t a single formula for building a great community. The unique characteristics of each project—the audience and the use cases—will foster a specific type of community. For example, a project that is a core piece of infrastructure might have a lot more contributor activity on GitHub with developers raising issues and submitting pull requests compared with a Python library that is likely to have a lot of users asking questions on StackOverflow.
A good community has answers to contributors’ or users’ questions. A simple Google search leads to results on StackOverflow or to a GitHub issue where members are actively discussing the topic. A great community proactively engages its members. New developers receive help and encouragement from more seasoned contributors. Members of the community go beyond discussing specific technical challenges of using the project to discussing other tools that make their lives easier.
There are many quantitative measures for open source adoption, but there is no single metric that can fully capture community engagement. Metrics from GitHub, Slack, and PyPI Stats, among others, can serve as useful indicators, but they are no substitute for compiling qualitative feedback in the form of articles written about the project, discussions, and conversations.
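As a toy illustration of blending such signals (the field names and weights here are entirely invented, not a real scoring model), one might roll several raw counts into a single rough indicator:

```python
def engagement_score(metrics):
    """Blend a few raw community signals into one rough indicator.

    `metrics` maps signal names to counts over some window (e.g. 90 days).
    The weights are arbitrary and purely illustrative: contributions are
    weighted more heavily than chat volume.
    """
    weights = {
        "github_issues_opened": 1.0,
        "github_prs_merged": 3.0,
        "slack_messages": 0.1,
        "stackoverflow_answers": 2.0,
    }
    return sum(weights.get(name, 0.0) * count for name, count in metrics.items())

# Hypothetical counts for a project over one quarter
sample = {
    "github_issues_opened": 40,
    "github_prs_merged": 12,
    "slack_messages": 500,
    "stackoverflow_answers": 25,
}
print(engagement_score(sample))  # → 176.0
```

Any formula like this is at best a screening heuristic; the qualitative reading of what people actually say about a project matters more.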
A founding team that is influential and empathetic towards the community
Although there are exceptions, the most successful open core businesses are generally founded by significant contributors to the open source project, if not the lead maintainers themselves. Maintainers (or members of the Project Management Committee for Apache projects) wield the most influence in the community. They typically have significant sway over the direction of the project and are in an ideal position to set the culture for the community. In the best cases, open source project leaders build a loyal base of evangelists. Databricks’ Data + AI Summits draw huge audiences, many of whom eagerly anticipate the founders’ keynotes unveiling the latest developments in the company’s open source projects.
Project maintainers also know the users and their pain points better than anyone else. They are aware of the pressing problems facing the project. As a result, they have the best intuition about what the community should build next and a unique understanding of what is better addressed by a commercial product rather than by the open source project.
Having leaders of the open source project as part of the open core business also lends outsized credibility to a commercial offshoot. Early in an open core business’s life, the product is often a straightforward managed deployment of the open source project without too many bells and whistles. One of the key reasons to buy a product from such a startup is that the world experts are behind it and that the visions of the company and the open source project are more likely to be aligned.
Creating meaningful value in the managed offering
Driving bottoms-up adoption of the open source project is critical for any open core company, but so is creating meaningful value in the managed offering. An open core platform should be able to serve high-value use cases that a basic deployment of the open source project cannot solve. The open source projects that fit this model well are either core to infrastructure or central to a common workflow. The most successful open core companies commercialize open source projects that have both of these attributes.
Managed deployment of the open source project
A common value proposition, particularly at the early stages of a company, is managed deployment. An open core company might offer a hosted cloud product, a product that deploys into its customers’ private clouds, or an on-prem product. These deployments tend to remove a lot of the headache of IT management, including maintenance, user administration, cost management, and support. However, solving the IT piece is just one piece of the puzzle. Open core companies have to build a commercial product that is so much better than the open source project for key use cases that end users are willing to disrupt their existing workflows to adopt the new product and convince their teams to pay for it.
For many successful open core companies, there comes a point when data teams find it challenging to deploy and manage the open source technology at scale. Teams that manage their own deployments can hit obstacles as data volumes, user adoption, and production use cases grow within their organization. Oftentimes, these are projects that involve highly complex infrastructure or serve critical applications. The alternative to buying a managed service is to hire people specifically to manage the open source deployment. In some cases, existing data scientists and engineers end up owning a piece of architecture that is not aligned with their skill set and is not the best use of their time.
Open source Kafka, which Confluent commercializes, is very performant but tricky to deploy and manage at scale. In particular, because customer use cases for Kafka are often mission-critical, uptime and performance are essential. As a customer scales Kafka to many clusters across multiple regions and cloud deployments, it becomes increasingly difficult to meet SLAs. At this scale and criticality, it often makes sense for companies to turn to Confluent to manage their deployments rather than run them in-house.
Infrastructure open source projects focus on core functionality and not on administration. Managed offerings usually have table-stakes features around user management, including secure access to applications and to data, permissions, and groups. The best ones can unlock organizational collaboration in a way that a simple deployment of the open source project cannot.
Spark is a general data processing engine that was created by Databricks’ founders. It has APIs for SQL, Python, Scala, and R, which enable it to reach a wide base of users. Because of the broad surface of Spark, Databricks is uniquely positioned to enable organizational collaboration. One such product is the Workspace, a hosted notebooks platform that allows data scientists and engineers to easily collaborate on the same platform in their language of choice. Just by sharing a link, a data scientist can hop into a colleague’s notebook and immediately interact with it. To many, this is a significant upgrade over sharing a local file over email or Slack, only for the colleague to run into issues reproducing the results because of inconsistent data sets and mismatched environments. Products like this, which facilitate collaboration across the organization, are very sticky and deliver a ton of value that the open source project cannot.
To be clear, an open core approach does not work for all data infrastructure companies. In fact, it’s probably a small subset that fits the model well. But companies that are commercializing open source projects should have a foundation for strong, bottoms-up adoption *and* solve meaningful use cases that the open source project is not well equipped to serve. Being great at one but not the other is not enough.
At Costanoa we’re excited about the opportunities that lie ahead for open core data infrastructure and are proud investors in Coiled and Jitsu. If you’re building an open core business or have other ideas of what it takes to build a great open core company, feel free to reach out to me. I’m curious to hear more about it!