7 Nov 2023

CoreWeave Kubernetes: The AI Workload Solution You've Been Waiting For

CoreWeave Kubernetes delivers fast GPU servers for AI via Kubernetes. Direct access to Kubernetes, NVIDIA GPUs, and SLURM simplifies AI infrastructure. Scale optimized AI workloads easily on CoreWeave's platform.

Sarfaraz Rydhan

Business Development

Platform Engineering + DevOps


Launch Your AI Workloads with CoreWeave Kubernetes

Artificial intelligence and machine learning workloads like Deep Learning, Large Language Models (LLMs) and Generative AI require extensive model training, inference serving, data processing, and specialized infrastructure to operate at scale. Running GPU-accelerated and distributed AI workloads can be challenging without the right platform.

CoreWeave is a hybrid cloud that provides a purpose-built solution optimized for these AI/ML workloads. With CoreWeave + Zeet, you can deploy GPU servers and containers and orchestrate complex ML workflows easily and efficiently.

CoreWeave eliminates the operational headache of managing bare metal infrastructure so you can focus on developing models and applications instead of ops, and Zeet sits on top of this (and any other clouds you might have) to simplify Kubernetes, CI/CD, Infrastructure Provisioning, and more. The CoreWeave platform is tuned for performance, scale, and GPU optimization out of the box, and paired with Zeet, you get one dashboard that abstracts away complexity while still providing the flexibility and power of a native cloud dashboard.

With CoreWeave Kubernetes and Zeet’s managed Kubernetes feature, you can finally unleash your most demanding AI workloads and innovations.

Benefits of CoreWeave Kubernetes for AI

CoreWeave Kubernetes delivers powerful benefits for running artificial intelligence workloads compared to other Kubernetes solutions or cloud providers. The platform is purpose-built for unmatched performance, automated scalability, and optimized infrastructure management.

Blazing Fast Performance

CoreWeave Kubernetes runs on bare metal servers with direct access to GPUs, removing the overhead and throttling common in cloud environments while still reaping the benefits of cloud computing. This enables AI workloads to leverage the full power of the underlying hardware for maximum speed and reduced time-to-results.

NVIDIA GPUs like A100 and H100 are tightly integrated for accelerated model training and inferencing, and can be provisioned right from the Zeet dashboard. CoreWeave also offers bleeding-edge capabilities like NVIDIA HGX A100 for enterprise-scale AI needs.

Automated Scaling & Rapid Deployment

The platform includes intelligent auto-scaling capabilities to dynamically adapt to workload demands. This allows fast deployment of new resources in minutes instead of days or weeks.

Zeet automates the management of Kubernetes clusters, infrastructure, and software updates. The platform abstracts away these complexities so data scientists can focus on developing models.

Optimized for Distributed Training

With built-in support for distributed training frameworks like Horovod and PyTorch Elastic, data scientists can train neural networks faster and more efficiently. CoreWeave enables running distributed workloads across multiple bare metal nodes.

The integrated workload scheduler intelligently allocates resources for different training jobs. This ensures efficient utilization and avoids the need for manual optimization.

Bare Metal Performance for Peak AI Performance

CoreWeave Kubernetes is built on bare metal infrastructure, providing your AI workloads with unparalleled speed, throughput, and performance compared to virtualized options. By bypassing the hypervisor and running workloads directly on the metal, CoreWeave delivers unmatched training times and low-latency inference.

Key benefits of CoreWeave’s bare metal AI solutions include:

Max GPU Utilization: Make the most of your GPU investment by avoiding the virtualization tax. Bare metal allows you to utilize 100% of GPU resources for your models.
No Noisy Neighbors: Dedicated bare metal servers mean you don't have to compete for resources with other tenants. Consistent and reliable performance.
Reduced Latency: Direct metal access removes the latency penalty of virtualization for faster response times. Critical for real-time inference apps.
High Throughput: Bare metal maximizes throughput and bandwidth for data-intensive model training at scale.
Better Cost Efficiency: More workload density and utilization per server compared to virtualized resources.

By leveraging dedicated bare metal, CoreWeave Kubernetes delivers the speed, performance and predictability your AI workloads demand. The specialized NVIDIA GPUs and networking fabric are optimized specifically for machine learning, unlike general purpose clouds. Experience bare metal benefits without operational overheads.

Automated Scaling & Fast Deployment

CoreWeave Kubernetes with Zeet allows you to spin up GPUs and scale your infrastructure faster than ever before. Zeet leverages CoreWeave’s Kubernetes and their bare metal architecture to provide incredibly fast deployment times, while still getting the benefits of a cloud-based ecosystem with high-performance computing.

You can deploy a multi-node GPU cluster on CoreWeave Kubernetes in minutes, not hours or days like other platforms. Our proprietary system design and automation allow for near-instantaneous spin-up of new nodes.

We also offer the most responsive auto-scaling available. You can configure auto-scaling policies based on GPU utilization, allowing your infrastructure to scale up seamlessly as your workloads demand more compute resources. As GPU usage decreases, the cluster can automatically scale back down just as quickly.
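To make the idea of a utilization-based policy concrete, here is a minimal sketch of the scaling rule documented for the standard Kubernetes Horizontal Pod Autoscaler. It is an assumption that a CoreWeave or Zeet policy follows the same rule; the function name, the replica bounds, and the sample utilization figures are illustrative only.

```python
import math


def desired_replicas(current_replicas, current_gpu_util, target_gpu_util,
                     min_replicas=1, max_replicas=16, tolerance=0.1):
    """Replica count suggested by the Kubernetes HPA rule.

    desired = ceil(current_replicas * current_metric / target_metric)
    Utilization values are percentages averaged across replicas.
    """
    ratio = current_gpu_util / target_gpu_util
    # Inside the tolerance band, keep the current count to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (illustrative defaults).
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(4, 90, 60))  # GPUs busier than target: scale up
    print(desired_replicas(4, 30, 60))  # GPUs mostly idle: scale down
    print(desired_replicas(4, 63, 60))  # close to target: no change
```

The tolerance band is what keeps a cluster from oscillating when utilization hovers near the target, which matters when each added replica is an entire GPU.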

There's no need to guess ahead of time what resources you'll need or sit around waiting for instances to deploy. With CoreWeave Kubernetes, you can deploy exactly what you need, when you need it. This on-demand elasticity is ideal for the dynamic workloads of machine learning and AI applications.

You get the performance of bare metal with the automation and ease of use of Kubernetes and auto-scaling. This enables unprecedented agility and efficiency for your GPU-accelerated workloads. Rapid deployment and instant scaling mean you can focus on your AI projects, not your infrastructure.

Optimized for Distributed Training

CoreWeave Kubernetes enables cutting edge distributed training capabilities to handle the most demanding AI workloads. With CoreWeave's bare metal infrastructure and NVIDIA HGX H100 accelerators, you get the performance needed to scale model training to multiple nodes.

CoreWeave makes it easy to spin up GPU clusters optimized for distributed training in minutes. The platform's advanced scheduling efficiently allocates resources across nodes so you can run multiple concurrent training jobs with minimal wait times. Automated scaling allows your cluster to dynamically grow and shrink as your training workload demands.

Tight integration with popular distributed training frameworks like PyTorch and TensorFlow ensures you can leverage optimized communication libraries like NVIDIA NCCL for synchronous multi-GPU and multi-node training. CoreWeave's use of RDMA networking delivers the ultra-low latency and high bandwidth required for large-scale distributed training.
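For readers new to multi-node training, the sketch below shows how the rank-related environment variables that PyTorch's `env://` initialization reads are typically laid out when one process runs per GPU. The helper name, the `trainer-0` host name, and the 2-node by 8-GPU shape are assumptions for illustration, not CoreWeave specifics.

```python
def worker_env(node_rank, local_rank, nodes, gpus_per_node,
               master_addr="trainer-0", master_port=29500):
    """Environment for one training process (one process per GPU).

    master_addr is a hypothetical host name for the rank-0 node.
    """
    if not (0 <= node_rank < nodes and 0 <= local_rank < gpus_per_node):
        raise ValueError("rank out of range")
    return {
        "MASTER_ADDR": master_addr,        # where rank 0 listens
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(nodes * gpus_per_node),  # total processes
        "RANK": str(node_rank * gpus_per_node + local_rank),  # global rank
        "LOCAL_RANK": str(local_rank),     # GPU index on this node
    }


if __name__ == "__main__":
    # Fourth GPU on the second node of a 2-node, 8-GPU-per-node job.
    for key, value in worker_env(1, 3, nodes=2, gpus_per_node=8).items():
        print(f"{key}={value}")
```

With these variables set, a training script would usually call `torch.distributed.init_process_group(backend="nccl")` so that gradient all-reduce runs over NCCL. Launchers such as `torchrun` normally set the variables for you.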

For more complex workflows, CoreWeave offers managed Kubernetes clusters with pre-installed SLURM for simplified job scheduling and cluster management. This provides a powerful environment to develop and deploy distributed training pipelines at scale.
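As a rough picture of what a SLURM submission for such a job looks like, the helper below renders a batch script using standard `sbatch` directives. The job name, node counts, time limit, and `python train.py` command are placeholders, and whether a given CoreWeave cluster exposes GPUs through `--gres` is an assumption to verify against its configuration.

```python
def render_sbatch(job_name, nodes, gpus_per_node, command,
                  time_limit="04:00:00"):
    """Render a SLURM batch script that runs one task per GPU."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # GPUs requested per node
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",  # srun starts one copy of the command per task
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder job: 2 nodes with 8 GPUs each.
    print(render_sbatch("llm-finetune", 2, 8, "python train.py"))
```

The rendered text can be saved to a file and submitted with `sbatch`.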

With CoreWeave's optimized infrastructure for distributed training, you can accelerate time-to-accuracy for the most demanding AI models. The platform's automation and ease of use remove the headaches of managing infrastructure so you can focus on developing cutting-edge AI applications.

Flexible GPU Options

CoreWeave Kubernetes offers flexible GPU options to right-size compute resources for your AI workloads. Whether you need a single GPU or scaled-out multi-GPU clusters, CoreWeave has you covered.

NVIDIA A100 GPUs - A high-end data center GPU that provides massive acceleration for AI training and inference workloads. A100 combines high performance with scalable multi-GPU connectivity so you can scale your workloads as needed.
NVIDIA T4 GPUs - Balanced GPUs offering great performance for mainstream AI workloads like computer vision and Natural Language Processing. T4 provides cost-effective performance for production deployment of inferencing after models have been trained.
Mix and Match - CoreWeave allows you to deploy a mix of GPU types and counts per node to optimize price/performance. For example, you can train models on A100 GPUs then deploy for inference on more affordable T4 GPUs.

By offering a range of NVIDIA GPU options and supporting mixed configurations, CoreWeave Kubernetes provides the flexibility to right-size compute resources depending on your workload needs. Spin up the exact GPU resources you need and scale seamlessly as requirements evolve.
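In plain Kubernetes terms, choosing a GPU type and count usually comes down to a resource limit plus a node selector. The sketch below builds such a Pod manifest as a Python dictionary. `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin, while the label key `example.com/gpu-model`, the label values, and the image names are hypothetical placeholders; a real cluster defines its own node labels.

```python
import json


def gpu_pod(name, image, gpu_count, gpu_model,
            gpu_label_key="example.com/gpu-model"):
    """Pod manifest that requests gpu_count GPUs of one model.

    gpu_label_key is a hypothetical node label used for illustration.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Schedule only onto nodes labeled with the wanted GPU model.
            "nodeSelector": {gpu_label_key: gpu_model},
            "containers": [{
                "name": name,
                "image": image,
                # GPUs are requested as an extended resource limit.
                "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
            }],
        },
    }


if __name__ == "__main__":
    # Train on eight A100s, then serve on a single T4 (placeholder images).
    train = gpu_pod("train-job", "registry.example.com/train:latest", 8, "A100")
    serve = gpu_pod("serve-job", "registry.example.com/serve:latest", 1, "T4")
    print(json.dumps(train, indent=2))
    print(json.dumps(serve, indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed output can be saved to a file and applied directly once the label key and images match your cluster.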

Integrations & APIs

CoreWeave Kubernetes integrates seamlessly with the tools and pipelines you already use for model development, training, and deployment. This allows you to leverage your existing workflows and access CoreWeave's specialized infrastructure without business disruption.

For model building, CoreWeave supports all the leading open source ML frameworks like PyTorch, TensorFlow, Keras, and OpenCV. Bring your own tools or take advantage of CoreWeave's curated model development environments.

When it comes time for training, CoreWeave makes it easy to leverage distributed training across multiple bare metal nodes for maximum performance and scalability. Integrate with orchestration tools like SLURM or use CoreWeave's managed service for simplified cluster deployment.

For deployment, CoreWeave enables one-click model serving using Kubernetes and inference tooling such as TensorRT, Triton Inference Server, and Seldon Core. Easily transition models from training to production while monitoring with Prometheus and Grafana.

Zeet also provides REST APIs and a CLI for programmatic infrastructure management, meaning you can automate your ML workflows by provisioning CoreWeave Kubernetes clusters, GPU/CPU resources, and storage volumes as needed. The Zeet API enables full infrastructure-as-code capabilities.

With turnkey integrations, users can focus on the machine learning while CoreWeave handles the heavy lifting of deploying and managing the specialized infrastructure required to run AI workloads at scale. Get the most out of your data science investments by leveraging CoreWeave's deep integrations.

Simplified Infrastructure Management

Managing infrastructure for AI workloads can become complex and time consuming without the right tools. CoreWeave Kubernetes provides a simplified way to deploy and administer your AI applications without operational overhead.

With CoreWeave Kubernetes you get:

No ops overhead - CoreWeave's infrastructure is designed for automation and optimized for AI. There's no need to spend engineering resources on managing infrastructure.
Kubernetes abstractions - Kubernetes provides APIs and objects like Deployments, Services and Ingresses that abstract infrastructure details away from your applications. Developers can focus on the code.
Automatic scaling - Based on resource utilization or schedule, CoreWeave Kubernetes will automatically scale GPUs and other resources up and down to match your workload's needs. No more complex auto-scaling scripts.
Self-healing infrastructure - CoreWeave Kubernetes monitors infrastructure health and automatically restarts failed containers, replaces unhealthy nodes, and reroutes network traffic in case of outages.
Turnkey clusters - With a few clicks CoreWeave Kubernetes clusters can be deployed with GPU drivers, Kubernetes and tools like Helm pre-installed. No need to configure devops pipelines.
Workflow integrations - CoreWeave Kubernetes integrates with CI/CD tools like Argo and Kubeflow Pipelines to simplify ML workflows.
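To illustrate the Kubernetes abstractions and self-healing behavior listed above, here is a minimal sketch of a Deployment manifest for a GPU inference service, built as a Python dictionary. The field names are standard Kubernetes API fields; the service name, image, port, `/healthz` path, and probe timings are assumptions chosen for the example.

```python
import json


def inference_deployment(name, image, replicas=2, gpus_per_replica=1,
                         health_path="/healthz", port=8000):
    """Deployment whose containers are restarted when the probe fails."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_replica}},
        # The kubelet restarts the container if this HTTP check keeps failing.
        "livenessProbe": {
            "httpGet": {"path": health_path, "port": port},
            "initialDelaySeconds": 30,
            "periodSeconds": 10,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            # The Deployment controller keeps this many pods running.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    manifest = inference_deployment(
        "asr-server", "registry.example.com/asr:latest", replicas=3)
    print(json.dumps(manifest, indent=2))
```

The replica count covers lost pods and the liveness probe covers hung containers, which together account for most of what "self-healing" means at the application level.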

By leveraging Zeet's Managed Kubernetes and CoreWeave's optimized infrastructure, you remove the heavy lifting of managing the operational side of an ML stack. Resources that would be spent on devops can instead focus on high-value AI/ML development.

Security & Compliance

CoreWeave Kubernetes provides robust security and compliance capabilities to protect your data and meet industry regulations.

Role-based access control, certificates, VPC networking, firewalls, and encryption safeguard your assets and IP.
Integrated monitoring tools like Grafana provide observability into your infrastructure.
Data sovereignty compliance ensures your data stays within geographic boundaries.
HIPAA, SOC2, ISO 27001, NIST 800-53, and GDPR readiness means CoreWeave meets all key compliance standards.
Dedicated bare metal means your workloads are fully isolated at the hardware level.
Automatic security patching and hardening for Kubernetes and OS services maintain protection.
Control plane monitoring, log aggregation, and anomaly detection enhance threat visibility.

With CoreWeave's security-first architecture, you can confidently run sensitive ML workloads knowing your data is safe and compliant. The platform is purpose-built to exceed rigorous controls for finance, healthcare, government, and other regulated industries.

Case Study: Tarteel Migrates to CoreWeave

A customer that has benefited from CoreWeave is Tarteel. Tarteel leveraged Zeet to smoothly migrate from AWS to CoreWeave.

With CoreWeave and Zeet, Tarteel achieved:

1600 requests/min using 40 NVIDIA A4000/A5000 GPUs
22% improvement in latency (median request latency from 42 ms to 35 ms)
56% cost reduction

The migration took just 1-2 days. As Tarteel's CEO said, "Zeet made the switch to CoreWeave a breeze. We had minimal downtime and did not need any K8s or CW expertise; we just clicked a few buttons and we were live."

Now, through CoreWeave, Tarteel is leveraging NVIDIA's SDKs and toolkits to build new AI features like NLP, TTS, semantic search, and more to enrich the experience of its 3 million users.

Tarteel AI leveraged Zeet to smoothly move its deployment from AWS to CoreWeave, translating to a 22% improvement in latency and ~56% cost reduction.

You can read more in the full CoreWeave case study.

Get Started with CoreWeave Kubernetes for AI

We make it easy to get started with Kubernetes for your AI workloads. Here are some options to begin your AI-cloud journey with Zeet and CoreWeave:

Sign Up for Zeet and CoreWeave

Zeet is totally free for your first three Projects, and CoreWeave offers a fully featured free trial so you can test out our Kubernetes platform at no cost. The trial includes access to GPUs, managed Kubernetes, and all of our CI/CD workflow features. Sign up in minutes and have your cluster spinning in no time. Once you've signed up, connect your CoreWeave cloud to Zeet.

Already have a cloud? Zeet is multi-cloud enabled, so integrating CoreWeave into your existing cloud stack requires no extra work—just connect your CoreWeave cloud and hit deploy.

Request a Custom Quote

For larger deployments on CoreWeave specifically, you can request a custom quote from their sales team, tailored to your infrastructure needs. You may be surprised at just how much capacity is available, and at affordable prices. Their experts will help you determine the right GPUs, nodes, storage, and other requirements for your AI workloads, and can provide recommendations to maximize performance and value.

Explore Resources on the CoreWeave Website

Both the CoreWeave and Zeet websites have many useful resources to help you learn more about our combined AI-enabled Kubernetes offering. Browse through customer stories, product docs, blogs, and other material to learn about our capabilities. Our technical resources can help you get started with architectures, workflows, and integrations.

With these options, it's easy to get hands-on with CoreWeave Kubernetes for your machine learning and AI workloads. We offer the flexibility to trial at no cost, get expert guidance for larger deployments, and tap into our extensive self-service resources. Get started today!
