The last five years have seen the machine learning industry explode into a variety of tools and services, both managed and open source. This article tells the story of how we navigated this space. If you're an engineering manager looking to build up teams and infrastructure and invest in AI-enabled products, it is for you.
Let's start at the beginning though. Every architecture and organizational decision should be founded on a business challenge, so here's ours: We are kineo.ai and we bridge the gap between industry and AI. That means working with a variety of customers across all verticals, diving into their data and implementing custom machine learning models for their challenges. Clearly, your specific needs might differ, but chances are you will face similar trade-offs.
The first step we took was all about speed. We went all in on AWS SageMaker, implemented AWS Glue as a managed Spark solution and glued everything together with AWS CodePipeline. In addition, we used Metaflow by Netflix as a tool to describe machine learning workflows in code. This was at the beginning of 2020, and the issues we encountered then — namely the lack of a pipeline solution from AWS and of easily extendable custom models on top of SageMaker — have largely been addressed. In fact, I would go as far as to say that in 2021, if you are already committed to one of the big three cloud providers, going all in on their respective managed cloud AI service might just be the right thing for you. The feature speed of the big cloud providers is impressive and can pay off quickly.
But then reality hits. We started working with health providers. Government. Automotive. Deutsche Bahn. The truth is that many of these industries are far from going all in on cloud, and never might. In the German market in particular there are some real concerns in terms of data regulation and ownership. For this reason many of our customers choose to run their workloads in on-premise or hybrid cloud environments. Even if you are not worried about data, your concerns might be economic or strategic: how much of a commitment are you willing to make to a single provider's ecosystem, and how much independence do you want to maintain? So, let's add two more requirements to our list.
At this point everyone who is left should have more or less the same question in mind: how can I benefit from the feature speed of the machine learning ecosystem, and sometimes even from cloud capabilities, without compromising on my requirements? Starting all the way from zero really isn't feasible, given the complexity of the workloads. Luckily, there is a middle ground.
In late 2020 we started adopting Kubernetes and Kubeflow in a big way. Together they solve the major issues outlined above: Kubernetes, as the de facto standard for running containerized applications, allows us to move our workloads flexibly. As a common denominator between cloud, hybrid and on-premise environments, Kubernetes gives us flexible access to compute power for both CPU- and GPU-based workloads. So if we manage to make machine learning feel at home on Kubernetes, we are in a great position. Luckily, that is exactly the mission of Kubeflow. Kubeflow is a collection of curated open source services, integrated into one cohesive experience on top of Kubernetes.
Kubeflow is supported and contributed to by both Microsoft and Google; in fact, you can run Kubeflow natively on Google Cloud. At this time, there is support for running Jupyter Notebooks, building Python-based, reproducible pipelines for training, hyperparameter search, and deploying your models as serverless functions in pretty much any machine learning framework you may already be using today. The project is being developed very actively; just recently, support for open source feature store darling Feast was announced. There are the typical issues with open source as well — sometimes things don't integrate perfectly, and not every component is perfect or final — but for each of the larger building blocks (training, pipelines, deployment) there is freedom of choice.
Finally, Kubeflow has been designed from the ground up to run on all clouds as well as on-premise, with specific manifests that make deployment on each of them similar and manageable.
Today we are successfully running multiple machine learning workloads in production across various environments. Kubeflow is the common ground, the shared contract that our data scientists can rely on. No matter where we go, our tooling, our pipelines, our deployment and our training will be the same. Yet, depending on where we are, we can leverage cloud capacity. S3 can stand in for in-cluster MinIO. Managed Spark jobs can run on Glue 2.0 instead of the in-cluster Spark operator. Kubernetes itself can run on the native managed Kubernetes services, for example AKS on Azure.
Of course the journey is far from over. Just a few weeks ago we started adding support for Great Expectations — a wonderful open source tool that lets you write assumptions and tests against your data. I suspect the next step will be adopting one of the major feature stores.
Where are you in this journey? Have you gone all in on cloud or do you find yourself somewhere in between, as we have? Let me know in the comments or get in touch directly!
Finally, if this is the kind of environment you’d love to work in and you care as much about people as you care about tech, we are hiring.