Provisioning the AI Stack: Using Ansible to Deploy GPU Clusters and Vector DBs
The rise of generative AI has transformed not only how we build applications but also the infrastructure they run on. Gone are the days of simple web servers; today’s AI stack demands powerful GPU clusters, specialized drivers, and complex data stores like vector databases. For DevOps and MLOps teams, this introduces significant challenges in consistency, scalability, and reliability. Manually configuring a single GPU node is tedious; provisioning a cluster is a recipe for drift and error.
Enter Ansible. With the recent release of the Ansible Content Collections for AI in 2025, automation for the entire AI/ML lifecycle is now a first-class citizen. These new, curated collections provide robust, idempotent roles for provisioning everything from the bare metal to the application layer, allowing teams to treat their AI infrastructure as code. This article dives into how you can leverage these tools to standardize your AI stack.
What You’ll Get
- An overview of the new Ansible Content Collections for AI.
- A clear, repeatable pattern for automating NVIDIA driver and CUDA installations.
- Example playbooks for deploying standardized PyTorch environments.
- A high-level architecture for provisioning vector databases like Milvus.
- A comparison of automating self-hosted vs. SaaS vector DBs.
The New Frontier: AI Infrastructure as Code
The core challenge of MLOps is bridging the gap between flexible data science experimentation and rigid production requirements. Infrastructure is a major part of this challenge. A slight mismatch in a CUDA driver version or a Python dependency can bring a multi-million dollar model training pipeline to a halt.
This is where Ansible’s new AI-focused collections shine. They abstract away the low-level, error-prone commands and provide a declarative interface for defining your stack.
Key Collections introduced in 2025 include:
- nvidia.gpu: For managing NVIDIA drivers, CUDA toolkits, and container runtimes.
- community.ml_frameworks: For deploying environments with PyTorch, TensorFlow, and JAX.
- community.vector_db: Roles for deploying and managing vector databases like Milvus and Weaviate, plus modules for interacting with SaaS APIs like Pinecone.
These collections enable teams to build a single, version-controlled source of truth for their entire AI platform.
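Assuming these collections are published to Ansible Galaxy or a private Automation Hub under the names above, a requirements.yml file makes their versions part of that source of truth. The version constraints below are placeholders; pin whatever your platform has validated.

# requirements.yml - collection names from this article, versions are placeholders
collections:
  - name: nvidia.gpu
    version: ">=1.0.0"
  - name: community.ml_frameworks
    version: ">=1.0.0"
  - name: community.vector_db
    version: ">=1.0.0"

A single ansible-galaxy collection install -r requirements.yml run on the control node then pulls everything in.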
Laying the Foundation: The GPU Node
Everything in the AI stack starts with a properly configured GPU-enabled compute node. This is notoriously difficult to get right due to the tight coupling between the Linux kernel, NVIDIA drivers, the CUDA toolkit, and the ML framework.
Taming the NVIDIA Driver
The nvidia.gpu collection simplifies this dramatically. The nvidia.gpu.driver role intelligently handles distribution-specific package dependencies, kernel module signing (for Secure Boot), and version pinning.
Here is a playbook for ensuring a specific NVIDIA driver and CUDA version are installed on all nodes in your gpu_cluster group.
---
- name: Configure GPU nodes with NVIDIA drivers and CUDA
  hosts: gpu_cluster
  become: true

  vars:
    nvidia_driver_version: "535.129.03"
    cuda_toolkit_version: "12.2"

  tasks:
    - name: Ensure NVIDIA drivers are installed
      ansible.builtin.include_role:
        name: nvidia.gpu.driver
      vars:
        driver_version: "{{ nvidia_driver_version }}"
        # The role handles the logic for different package managers
        # (e.g., dnf, apt) behind the scenes.

    - name: Ensure CUDA Toolkit is installed
      ansible.builtin.include_role:
        name: nvidia.gpu.cuda_toolkit
      vars:
        cuda_version: "{{ cuda_toolkit_version }}"
This simple, declarative playbook replaces dozens of shell commands and conditional checks, making your GPU fleet uniform and predictable.
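The playbook targets a gpu_cluster group, so it assumes an inventory along the following lines; the hostnames and connection variables are purely illustrative.

# inventory/production.yml - hostnames are illustrative
all:
  children:
    gpu_cluster:
      hosts:
        gpu-node-01:
        gpu-node-02:
        gpu-node-03:
      vars:
        ansible_user: ansible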
Standardizing ML Environments
Once the base OS and drivers are set, the next layer is the ML environment itself. Data scientists often use conda or venv to manage dependencies, but these can become inconsistent across a team. Ansible can enforce a standard, reproducible environment.
Deploying PyTorch with Ansible
The community.ml_frameworks collection doesn’t reinvent the wheel; it provides robust wrappers around tools like pip and venv. This lets you define your Python environment in a structured Ansible variable file.
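In practice, that variable file might live in group_vars so every playbook shares the same definition; the sketch below shows one hypothetical layout (the playbook that follows keeps the same values inline for readability).

# group_vars/gpu_cluster.yml - hypothetical shared environment definition
app_user: "ml_service"
project_path: "/opt/apps/inference_api"
venv_path: "{{ project_path }}/venv"
pytorch_version: "2.1.0"
cuda_version_short: "cu121"  # PyTorch wheel identifier for CUDA 12.1+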
This example playbook sets up a dedicated user, creates a virtual environment, and installs a specific version of PyTorch compatible with our CUDA toolkit.
---
- name: Deploy a standardized PyTorch environment
  hosts: gpu_cluster
  become: true

  vars:
    app_user: "ml_service"
    project_path: "/opt/apps/inference_api"
    venv_path: "{{ project_path }}/venv"
    pytorch_version: "2.1.0"
    cuda_version_short: "cu121" # PyTorch wheel identifier for CUDA 12.1+

  tasks:
    - name: Ensure the application user exists
      ansible.builtin.user:
        name: "{{ app_user }}"
        shell: /bin/bash
        create_home: true

    - name: Create the project directory
      ansible.builtin.file:
        path: "{{ project_path }}"
        state: directory
        owner: "{{ app_user }}"
        group: "{{ app_user }}"
        mode: '0755'

    - name: Create Python virtual environment
      ansible.builtin.pip:
        name: pip
        state: latest
        virtualenv: "{{ venv_path }}"
        virtualenv_python: python3.9
      become: true
      become_user: "{{ app_user }}"
      # The pip module creates the virtual environment if it does not already exist.

    - name: Install PyTorch and TorchVision
      ansible.builtin.pip:
        name:
          - "torch=={{ pytorch_version }}"
          - "torchvision"
        virtualenv: "{{ venv_path }}"
        extra_args: "--index-url https://download.pytorch.org/whl/{{ cuda_version_short }}"
      become: true
      become_user: "{{ app_user }}"
Now, every node in your cluster has an identical, isolated PyTorch environment ready for your application code.
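To confirm the environment actually reaches the GPU, a short verification play can be appended; this is a minimal sketch that reuses the user and paths defined above.

- name: Verify the PyTorch environment can see the GPU
  hosts: gpu_cluster
  become: true
  become_user: "ml_service"
  vars:
    venv_path: "/opt/apps/inference_api/venv"
  tasks:
    - name: Check that torch reports CUDA as available
      ansible.builtin.command:
        cmd: "{{ venv_path }}/bin/python -c 'import torch; assert torch.cuda.is_available()'"
      changed_when: false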
Automating the Vector Database Layer
Modern AI applications, especially those using Retrieval-Augmented Generation (RAG), rely heavily on vector databases. These systems have complex, distributed architectures. The community.vector_db collection is designed to manage this complexity.
Below is a Mermaid diagram illustrating the high-level workflow of Ansible deploying a self-hosted Milvus cluster.
graph TD;
A["Ansible Control Node"] --> B{"1. Provision Cloud<br/>Infrastructure (VPCs, VMs)"};
B --> C{"2. Deploy Dependencies<br/>(etcd, MinIO, Pulsar)"};
C --> D{"3. Deploy Milvus<br/>Coordinator & Worker Nodes"};
D --> E["4. Configure and<br/>Initialize Milvus"];
E --> F[✅ Milvus Cluster Ready];
This flow shows how Ansible acts as a single orchestrator for infrastructure, dependencies, and the application itself.
A Tale of Two Databases: Milvus vs. Pinecone
Your automation strategy will differ depending on whether you choose a self-hosted solution like Milvus or a SaaS platform like Pinecone. Ansible is flexible enough to handle both.
| Feature | Ansible for Milvus (Self-Hosted) | Ansible for Pinecone (SaaS) |
|---|---|---|
| Deployment | Provisions VMs/containers, installs dependencies (etcd, MinIO), and deploys Milvus components via the milvus_cluster role. | No deployment required; Ansible's role is configuration only. |
| Configuration | Manages the milvus.yaml configuration file, setting resource limits, storage endpoints, and replication factors. | Uses the pinecone_index module (part of the collection) to create, configure, and delete indexes via the Pinecone API. |
| Scaling | Adds new worker nodes to the inventory and re-runs the playbook; Ansible ensures the new nodes join the existing cluster correctly. | Manages the replicas and pod_type parameters for an index via the API to scale up or down. |
Here’s a conceptual playbook snippet for deploying a Milvus cluster using the new collection.
# Conceptual example - not a complete playbook
- name: Deploy self-hosted Milvus cluster
  hosts: milvus_nodes
  become: true

  tasks:
    - name: Deploy Milvus using the vector_db collection
      ansible.builtin.include_role:
        name: community.vector_db.milvus_cluster
      vars:
        milvus_etcd_endpoints: "etcd-1:2379,etcd-2:2379"
        milvus_storage_type: "minio"
        milvus_minio_endpoint: "minio.internal:9000"
        # ... other configuration parameters
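For the SaaS column of the table above, there is nothing to install; the work is API configuration from the control node. The snippet below is equally conceptual: the pinecone_index module name comes from the collection description earlier in this article, and its parameters are illustrative rather than a confirmed interface.

# Conceptual example - module parameters are illustrative
- name: Ensure a Pinecone index exists for the RAG application
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create the index if it is missing
      community.vector_db.pinecone_index:
        api_key: "{{ lookup('env', 'PINECONE_API_KEY') }}"
        name: "product-docs"
        dimension: 1536
        metric: "cosine"
        replicas: 2
        state: present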
The Power of Integration
The true strength comes from combining these roles. You can create a single master playbook that provisions a GPU node, installs the drivers, sets up a PyTorch environment, and connects it to a freshly deployed Milvus cluster, all in one automated, repeatable run.
Bringing It All Together: The Full Stack Playbook
A top-level playbook ties all these components together, creating a full AI stack from a single command. This modular approach is a core strength of Ansible.
# Filename: deploy_ai_stack.yml
---
- name: Provision GPU drivers and CUDA
  import_playbook: playbooks/01_configure_gpu.yml

- name: Deploy standardized Python environments
  import_playbook: playbooks/02_setup_ml_env.yml

- name: Deploy and configure the Milvus vector database
  import_playbook: playbooks/03_deploy_vector_db.yml

- name: Deploy the inference application
  import_playbook: playbooks/04_deploy_app.yml
This structure makes the entire platform definition easy to read, manage, and version in Git.
Key Takeaways
The complexity of AI infrastructure demands a mature Infrastructure as Code (IaC) solution. The 2025 Ansible Content Collections for AI provide the specialized tools needed to manage the modern AI stack effectively.
- Standardization is Key: By codifying your setup, you eliminate configuration drift and ensure every environment, from development to production, is identical.
- Speed and Agility: Ansible allows you to provision entire GPU clusters and their software stacks in minutes, not days.
- Reliability: Idempotent playbooks ensure that your infrastructure always converges to the desired state, reducing errors and increasing uptime.
Ansible is no longer just for configuring web servers and databases. It has evolved into an essential orchestration tool for MLOps, enabling teams to build, scale, and manage the powerful infrastructure that drives the next generation of artificial intelligence. ✨
Further Reading
- https://www.redhat.com/en/blog/whats-new-ansible-automation-platform-content
