@intentsolutionsio/jeremy-vertex-terraform

v2.0.0

Terraform configurations for Vertex AI platform and Agent Engine

Jeremy Vertex Terraform

🎯 VERTEX AI MODEL GARDEN & AI INFRASTRUCTURE

Terraform infrastructure specialist for broader Vertex AI services including Model Garden, Gemini endpoints, vector search, ML pipelines, and enterprise AI infrastructure (NOT Agent Engine - use jeremy-adk-terraform for that).

⚠️ Important: What This Plugin Is For

✅ THIS PLUGIN IS FOR:

  • Vertex AI Model Garden deployments (foundation models)
  • Gemini API endpoints (gemini-pro, gemini-2.0-flash)
  • Vector Search infrastructure (ScaNN-based similarity search)
  • Vertex AI Pipelines (Kubeflow Pipelines for ML workflows)
  • Endpoint deployment (model serving infrastructure)
  • Batch prediction jobs
  • ML model training infrastructure
  • Feature Store for ML feature management

❌ THIS PLUGIN IS NOT FOR:

  • Agent Engine infrastructure (use jeremy-adk-terraform for ADK agents)
  • Cloud Run deployments (use jeremy-genkit-terraform)
  • Self-managed ML infrastructure

Overview

This plugin provides Terraform modules for deploying Vertex AI services including Model Garden foundation models, Gemini API endpoints, vector search for RAG applications, ML pipelines, and production model serving infrastructure.

Key Infrastructure Components:

  • google_vertex_ai_endpoint for model serving
  • google_vertex_ai_deployed_model for model versions
  • google_vertex_ai_index for vector search
  • google_vertex_ai_index_endpoint for similarity search
  • google_vertex_ai_featurestore for feature management
  • Cloud Storage for model artifacts
  • BigQuery for ML model training
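
The module snippets in this README also reference a few shared resources (a VPC network, CMEK keys, and the project data source) that are assumed to exist rather than defined here. A minimal sketch of those prerequisites, with illustrative names:

# prerequisites.tf (assumed shared resources, not provided by the plugin)

data "google_project" "project" {
  project_id = var.project_id
}

# VPC used for private endpoints
resource "google_compute_network" "vertex_vpc" {
  name                    = "vertex-vpc"
  auto_create_subnetworks = false
}

# CMEK keys referenced by the modules below
resource "google_kms_key_ring" "vertex" {
  name     = "vertex-ai-keyring"
  location = var.region
}

resource "google_kms_crypto_key" "model_key" {
  name     = "model-key"
  key_ring = google_kms_key_ring.vertex.id
}

resource "google_kms_crypto_key" "feature_key" {
  name     = "feature-key"
  key_ring = google_kms_key_ring.vertex.id
}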

Installation

/plugin install jeremy-vertex-terraform@claude-code-plugins-plus

Prerequisites & Dependencies

Required Tools

1. Terraform:

# Install Terraform 1.5+
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
unzip terraform_1.6.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/

# Verify
terraform version  # Should show 1.5.0+

2. gcloud CLI:

# Install gcloud
curl https://sdk.cloud.google.com | bash
exec -l $SHELL

# Update to latest
gcloud components update

# Authenticate
gcloud auth application-default login

3. Terraform Google Provider:

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    google-beta = {
      source  = "hashicorp/google-beta"
      version = "~> 5.0"
    }
  }
}

Required Google Cloud APIs

# Enable all required APIs
gcloud services enable \
    aiplatform.googleapis.com \
    compute.googleapis.com \
    storage.googleapis.com \
    bigquery.googleapis.com \
    logging.googleapis.com \
    monitoring.googleapis.com \
    cloudtrace.googleapis.com \
    --project=YOUR_PROJECT_ID

Required IAM Permissions

# Service account for Terraform needs:
- roles/aiplatform.admin              # Deploy Vertex AI resources
- roles/storage.admin                 # Manage model artifacts
- roles/bigquery.admin                # ML training datasets
- roles/compute.networkAdmin          # VPC for private endpoints
- roles/monitoring.admin              # Observability
- roles/iam.serviceAccountAdmin       # Service account management
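
These roles can also be granted with Terraform itself. A sketch, assuming a dedicated deployer service account (the account_id is an assumption, not part of the plugin):

# terraform_sa.tf

resource "google_service_account" "terraform_sa" {
  account_id   = "terraform-vertex-deployer"
  display_name = "Terraform Vertex AI Deployer"
}

resource "google_project_iam_member" "terraform_permissions" {
  for_each = toset([
    "roles/aiplatform.admin",
    "roles/storage.admin",
    "roles/bigquery.admin",
    "roles/compute.networkAdmin",
    "roles/monitoring.admin",
    "roles/iam.serviceAccountAdmin"
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.terraform_sa.email}"
}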

Features

✅ Model Garden Deployment: Foundation models (Gemini, PaLM, Claude, Llama)
✅ Gemini API Endpoints: Dedicated endpoints with rate limiting
✅ Vector Search: ScaNN-based similarity search for RAG
✅ ML Pipelines: Kubeflow Pipelines for training workflows
✅ Model Serving: Production endpoints with auto-scaling
✅ Batch Predictions: Large-scale inference jobs
✅ Feature Store: Centralized feature management
✅ Monitoring: Model performance tracking and drift detection

Quick Start

Natural Language Activation

"Create Terraform for Gemini endpoint deployment"
"Deploy vector search for RAG application"
"Set up Vertex AI Pipeline for model training"
"Create Feature Store for ML features"
"Deploy custom model to Vertex AI endpoint"

Terraform Module Structure

1. Gemini API Endpoint

# gemini_endpoint.tf

# Gemini 2.0 Flash endpoint
resource "google_vertex_ai_endpoint" "gemini_endpoint" {
  display_name = "gemini-2-0-flash-endpoint"
  location     = var.region
  project      = var.project_id

  description = "Production Gemini 2.0 Flash endpoint"

  # Network configuration
  network = google_compute_network.vertex_vpc.id

  # Encryption
  encryption_spec {
    kms_key_name = google_kms_crypto_key.model_key.id
  }
}

# Deploy Gemini model
resource "google_vertex_ai_deployed_model" "gemini_flash" {
  endpoint = google_vertex_ai_endpoint.gemini_endpoint.id

  model = "publishers/google/models/gemini-2.0-flash-001"

  display_name = "gemini-2-0-flash-001"

  dedicated_resources {
    machine_spec {
      machine_type = "n1-standard-4"
    }

    min_replica_count = var.min_replicas
    max_replica_count = var.max_replicas

    autoscaling_metric_specs {
      metric_name = "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle"
      target      = 70
    }
  }

  # Traffic split
  traffic_split = {
    "0" = 100
  }
}

# Service account for endpoint
resource "google_service_account" "vertex_sa" {
  account_id   = "vertex-ai-endpoint-sa"
  display_name = "Vertex AI Endpoint Service Account"
}

resource "google_project_iam_member" "vertex_permissions" {
  for_each = toset([
    "roles/aiplatform.user",
    "roles/storage.objectViewer",
    "roles/logging.logWriter"
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.vertex_sa.email}"
}
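
To make the endpoint usable from other tooling or CI, a small outputs block can expose its identifiers. This is an optional addition, not part of the module above:

# outputs.tf

output "gemini_endpoint_id" {
  description = "Fully qualified resource ID of the Gemini endpoint"
  value       = google_vertex_ai_endpoint.gemini_endpoint.id
}

output "gemini_endpoint_name" {
  description = "Endpoint name, usable with gcloud ai endpoints commands"
  value       = google_vertex_ai_endpoint.gemini_endpoint.name
}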

2. Vector Search Infrastructure

# vector_search.tf

# Vector index for embeddings
resource "google_vertex_ai_index" "embeddings_index" {
  display_name = "${var.app_name}-embeddings-index"
  region       = var.region
  project      = var.project_id

  description = "Vector search index for RAG application"

  metadata {
    contents_delta_uri = google_storage_bucket.embeddings.url

    config {
      dimensions                  = var.embedding_dimensions  # 768 for text-embedding-gecko
      approximate_neighbors_count = 150
      distance_measure_type       = "DOT_PRODUCT_DISTANCE"

      algorithm_config {
        tree_ah_config {
          leaf_node_embedding_count    = 1000
          leaf_nodes_to_search_percent = 7
        }
      }

      shard_size = "SHARD_SIZE_MEDIUM"
    }
  }

  index_update_method = "STREAM_UPDATE"
}

# Index endpoint for queries
resource "google_vertex_ai_index_endpoint" "embeddings_endpoint" {
  display_name = "${var.app_name}-embeddings-endpoint"
  region       = var.region
  project      = var.project_id

  description = "Vector search endpoint"

  # Private VPC
  network = "projects/${data.google_project.project.number}/global/networks/${google_compute_network.vertex_vpc.name}"

  public_endpoint_enabled = false
}

# Deploy index to endpoint
resource "google_vertex_ai_index_endpoint_deployed_index" "deployed" {
  index_endpoint = google_vertex_ai_index_endpoint.embeddings_endpoint.id
  index          = google_vertex_ai_index.embeddings_index.id

  deployed_index_id = "deployed_embeddings_index"
  display_name      = "Deployed Embeddings Index"

  dedicated_resources {
    machine_spec {
      machine_type = "n1-standard-16"
    }

    min_replica_count = 2
    max_replica_count = 10

    autoscaling_metric_specs {
      metric_name = "aiplatform.googleapis.com/prediction/online/cpu/utilization"
      target      = 70
    }
  }

  enable_access_logging = true
}

# Storage bucket for embeddings
resource "google_storage_bucket" "embeddings" {
  name     = "${var.project_id}-${var.app_name}-embeddings"
  location = var.region

  uniform_bucket_level_access = true

  versioning {
    enabled = true
  }

  lifecycle_rule {
    condition {
      num_newer_versions = 3
    }
    action {
      type = "Delete"
    }
  }
}
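
Because public_endpoint_enabled is false, queries reach the index endpoint over private services access on the VPC. A hedged sketch of the peering that setup assumes (the range name and size are illustrative):

# private_service_access.tf

resource "google_compute_global_address" "vertex_peering_range" {
  name          = "vertex-ai-peering-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.vertex_vpc.id
}

resource "google_service_networking_connection" "vertex_peering" {
  network                 = google_compute_network.vertex_vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.vertex_peering_range.name]
}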

3. Custom Model Deployment

# custom_model.tf

# Upload model to Cloud Storage
resource "google_storage_bucket" "model_artifacts" {
  name     = "${var.project_id}-ml-models"
  location = var.region

  uniform_bucket_level_access = true
  versioning {
    enabled = true
  }
}

resource "google_storage_bucket_object" "model_artifact" {
  name   = "models/${var.model_name}/model.pkl"
  bucket = google_storage_bucket.model_artifacts.name
  source = var.model_path
}

# Register model
resource "google_vertex_ai_model" "custom_model" {
  display_name = var.model_name
  region       = var.region
  project      = var.project_id

  description = "Custom ML model"

  version_aliases = ["production"]

  # Model artifact location
  artifact_uri = "gs://${google_storage_bucket.model_artifacts.name}/models/${var.model_name}/"

  # Container spec for serving
  container_spec {
    image_uri = "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"

    env {
      name  = "MODEL_NAME"
      value = var.model_name
    }

    ports {
      container_port = 8080
    }

    predict_route = "/predict"
    health_route  = "/health"
  }

  # Encryption
  encryption_spec {
    kms_key_name = google_kms_crypto_key.model_key.id
  }
}

# Create serving endpoint
resource "google_vertex_ai_endpoint" "model_endpoint" {
  display_name = "${var.model_name}-endpoint"
  location     = var.region
  project      = var.project_id

  network = google_compute_network.vertex_vpc.id
}

# Deploy model to endpoint
resource "google_vertex_ai_deployed_model" "deployed" {
  endpoint = google_vertex_ai_endpoint.model_endpoint.id
  model    = google_vertex_ai_model.custom_model.id

  display_name = "${var.model_name}-v1"

  dedicated_resources {
    machine_spec {
      machine_type = "n1-standard-4"

      accelerator_type  = "NVIDIA_TESLA_T4"
      accelerator_count = 1
    }

    min_replica_count = 1
    max_replica_count = 5

    autoscaling_metric_specs {
      metric_name = "aiplatform.googleapis.com/prediction/online/cpu/utilization"
      target      = 60
    }
  }

  traffic_split = {
    "0" = 100
  }
}

4. Vertex AI Pipelines

# pipelines.tf

# Pipeline storage bucket
resource "google_storage_bucket" "pipeline_root" {
  name     = "${var.project_id}-pipeline-root"
  location = var.region

  uniform_bucket_level_access = true

  lifecycle_rule {
    condition {
      age = 30
    }
    action {
      type = "Delete"
    }
  }
}

# Artifact Registry for pipeline containers
resource "google_artifact_registry_repository" "pipeline_containers" {
  repository_id = "vertex-pipelines"
  location      = var.region
  format        = "DOCKER"

  description = "Container images for Vertex AI Pipelines"
}

# Service account for pipelines
resource "google_service_account" "pipeline_sa" {
  account_id   = "vertex-pipeline-runner"
  display_name = "Vertex AI Pipeline Runner"
}

resource "google_project_iam_member" "pipeline_permissions" {
  for_each = toset([
    "roles/aiplatform.user",
    "roles/storage.admin",
    "roles/bigquery.dataEditor",
    "roles/logging.logWriter"
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.pipeline_sa.email}"
}

# Example: Training pipeline trigger
resource "google_cloudfunctions2_function" "pipeline_trigger" {
  name     = "trigger-training-pipeline"
  location = var.region

  build_config {
    runtime     = "python312"
    entry_point = "trigger_pipeline"

    source {
      storage_source {
        bucket = google_storage_bucket.pipeline_root.name
        object = "functions/trigger.zip"
      }
    }
  }

  service_config {
    available_memory   = "256M"
    timeout_seconds    = 60
    service_account_email = google_service_account.pipeline_sa.email

    environment_variables = {
      PROJECT_ID    = var.project_id
      PIPELINE_ROOT = "gs://${google_storage_bucket.pipeline_root.name}"
      REGION        = var.region
    }
  }
}
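
The function above only defines the trigger; one common pattern (an assumption, not required by the plugin) is to invoke it on a schedule with Cloud Scheduler and an OIDC token:

# scheduler.tf

resource "google_cloud_scheduler_job" "nightly_training" {
  name      = "nightly-training-pipeline"
  region    = var.region
  schedule  = "0 2 * * *"  # 02:00 UTC daily
  time_zone = "Etc/UTC"

  http_target {
    http_method = "POST"
    uri         = google_cloudfunctions2_function.pipeline_trigger.url

    oidc_token {
      service_account_email = google_service_account.pipeline_sa.email
    }
  }
}

Note that the pipeline service account also needs roles/run.invoker on the function's underlying Cloud Run service for the OIDC call to succeed.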

5. Feature Store

# feature_store.tf

resource "google_vertex_ai_featurestore" "main" {
  name   = "${var.app_name}-featurestore"
  region = var.region

  online_serving_config {
    fixed_node_count = 1
  }

  encryption_spec {
    kms_key_name = google_kms_crypto_key.feature_key.id
  }

  force_destroy = false
}

# Entity type (e.g., users, items)
resource "google_vertex_ai_featurestore_entitytype" "users" {
  name         = "users"
  featurestore = google_vertex_ai_featurestore.main.id

  monitoring_config {
    snapshot_analysis {
      disabled = false
    }
  }
}

# Features
resource "google_vertex_ai_featurestore_entitytype_feature" "user_age" {
  name       = "age"
  entitytype = google_vertex_ai_featurestore_entitytype.users.id

  value_type = "INT64"

  description = "User age in years"
}

resource "google_vertex_ai_featurestore_entitytype_feature" "user_ltv" {
  name       = "lifetime_value"
  entitytype = google_vertex_ai_featurestore_entitytype.users.id

  value_type = "DOUBLE"

  description = "User lifetime value"
}

6. Batch Prediction

# batch_prediction.tf

# Batch prediction job
resource "google_vertex_ai_batch_prediction_job" "batch_inference" {
  display_name = "${var.model_name}-batch-prediction"
  location     = var.region

  model = google_vertex_ai_model.custom_model.id

  input_config {
    instances_format = "jsonl"

    gcs_source {
      uris = ["gs://${google_storage_bucket.model_artifacts.name}/batch-input/*.jsonl"]
    }
  }

  output_config {
    predictions_format = "jsonl"

    gcs_destination {
      output_uri_prefix = "gs://${google_storage_bucket.model_artifacts.name}/batch-output/"
    }
  }

  dedicated_resources {
    machine_spec {
      machine_type      = "n1-standard-4"
      accelerator_type  = "NVIDIA_TESLA_T4"
      accelerator_count = 1
    }

    starting_replica_count = 1
    max_replica_count      = 10
  }

  service_account = google_service_account.vertex_sa.email
}
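
The job reads JSONL instances from the batch-input/ prefix. As a small illustration (the field names are placeholders matching the prediction test later in this README), an input object could be seeded like this:

# batch_input.tf

resource "google_storage_bucket_object" "sample_batch_input" {
  name    = "batch-input/sample.jsonl"
  bucket  = google_storage_bucket.model_artifacts.name
  content = "{\"feature1\": 1.0, \"feature2\": \"value\"}\n"
}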

7. Monitoring & Observability

# monitoring.tf

# Dashboard for model endpoints
resource "google_monitoring_dashboard" "vertex_dashboard" {
  dashboard_json = jsonencode({
    displayName = "${var.app_name} Vertex AI Dashboard"

    mosaicLayout = {
      columns = 12

      tiles = [
        # Prediction requests
        {
          width  = 6
          height = 4
          widget = {
            title = "Prediction Requests"
            xyChart = {
              dataSets = [{
                timeSeriesQuery = {
                  timeSeriesFilter = {
                    filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/prediction_count\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
                  }
                }
              }]
            }
          }
        },

        # Prediction latency
        {
          xPos   = 6
          width  = 6
          height = 4
          widget = {
            title = "Prediction Latency (p95)"
            xyChart = {
              dataSets = [{
                timeSeriesQuery = {
                  timeSeriesFilter = {
                    filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/response_latencies\" resource.type=\"aiplatform.googleapis.com/Endpoint\""

                    aggregation = {
                      alignmentPeriod     = "60s"
                      perSeriesAligner    = "ALIGN_DELTA"
                      crossSeriesReducer  = "REDUCE_PERCENTILE_95"
                    }
                  }
                }
              }]
            }
          }
        },

        # Error rate
        {
          yPos   = 4
          width  = 6
          height = 4
          widget = {
            title = "Error Rate"
            xyChart = {
              dataSets = [{
                timeSeriesQuery = {
                  timeSeriesFilter = {
                    filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/error_count\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
                  }
                }
              }]
            }
          }
        },

        # Replica utilization
        {
          xPos   = 6
          yPos   = 4
          width  = 6
          height = 4
          widget = {
            title = "Replica Utilization"
            xyChart = {
              dataSets = [{
                timeSeriesQuery = {
                  timeSeriesFilter = {
                    filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/replicas\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
                  }
                }
              }]
            }
          }
        }
      ]
    }
  })
}

# Alert: High latency
resource "google_monitoring_alert_policy" "high_latency" {
  display_name = "${var.app_name} - High Prediction Latency"
  combiner     = "OR"

  conditions {
    display_name = "P95 latency > 5s"

    condition_threshold {
      filter          = "metric.type=\"aiplatform.googleapis.com/prediction/online/response_latencies\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
      duration        = "300s"
      comparison      = "COMPARISON_GT"
      threshold_value = 5000

      aggregations {
        alignment_period     = "60s"
        per_series_aligner   = "ALIGN_DELTA"
        cross_series_reducer = "REDUCE_PERCENTILE_95"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.id]
}

# Alert: High error rate
resource "google_monitoring_alert_policy" "high_errors" {
  display_name = "${var.app_name} - High Error Rate"
  combiner     = "OR"

  conditions {
    display_name = "Error rate > 5%"

    condition_threshold {
      filter          = "metric.type=\"aiplatform.googleapis.com/prediction/online/error_count\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
      duration        = "300s"
      comparison      = "COMPARISON_GT"
      threshold_value = 0.05

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.id]
}
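
Both alert policies reference google_monitoring_notification_channel.email, which is not defined in the snippets above. A minimal definition using the alert_email variable might look like this:

# notification_channel.tf

resource "google_monitoring_notification_channel" "email" {
  display_name = "${var.app_name} Alerts"
  type         = "email"

  labels = {
    email_address = var.alert_email
  }
}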

Variables

# variables.tf

variable "project_id" {
  description = "Google Cloud project ID"
  type        = string
}

variable "region" {
  description = "Region for Vertex AI resources"
  type        = string
  default     = "us-central1"
}

variable "app_name" {
  description = "Application name prefix"
  type        = string
}

# Endpoint configuration
variable "min_replicas" {
  description = "Minimum number of replicas"
  type        = number
  default     = 1
}

variable "max_replicas" {
  description = "Maximum number of replicas"
  type        = number
  default     = 10
}

# Model configuration
variable "model_name" {
  description = "Custom model name"
  type        = string
}

variable "model_path" {
  description = "Local path to model artifact"
  type        = string
}

# Vector search
variable "embedding_dimensions" {
  description = "Dimensions for embeddings (768 for gecko, 1536 for OpenAI)"
  type        = number
  default     = 768
}

# Alerting
variable "alert_email" {
  description = "Email for monitoring alerts"
  type        = string
}
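
Instead of passing -var flags as in the workflow below, the same values can be kept in a terraform.tfvars file. Sample placeholder values:

# terraform.tfvars

project_id           = "my-project"
region               = "us-central1"
app_name             = "my-app"
model_name           = "custom-model"
model_path           = "./model.pkl"
embedding_dimensions = 768
alert_email          = "alerts@example.com"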

Deployment Workflow

1. Initialize Terraform

terraform init

2. Plan Infrastructure

terraform plan \
  -var="project_id=my-project" \
  -var="app_name=my-app" \
  -var="model_name=custom-model" \
  -var="model_path=./model.pkl" \
  -var="[email protected]"

3. Apply Configuration

terraform apply \
  -var="project_id=my-project" \
  -var="app_name=my-app" \
  -var="model_name=custom-model" \
  -var="model_path=./model.pkl" \
  -var="[email protected]"

4. Verify Deployment

# Check endpoints
gcloud ai endpoints list --region=us-central1

# Check deployed models
gcloud ai models list --region=us-central1

# Check vector search indexes
gcloud ai indexes list --region=us-central1

# Test prediction
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT/locations/us-central1/endpoints/ENDPOINT_ID:predict \
  -d '{"instances": [{"feature1": 1.0, "feature2": "value"}]}'

Integration with Other Plugins

jeremy-adk-terraform

  • jeremy-adk-terraform: Agent Engine (ADK agents)
  • jeremy-vertex-terraform: Model serving & ML infrastructure (this plugin)

jeremy-vertex-engine

  • Terraform provisions endpoints → Engine inspector validates (for Agent Engine only)

jeremy-vertex-validator

  • Terraform provisions infrastructure → Validator checks production readiness

Use Cases

Gemini API Deployment

"Create Terraform for Gemini 2.0 Flash endpoint"
"Deploy Gemini Pro with auto-scaling"

Vector Search for RAG

"Set up vector search infrastructure for RAG application"
"Deploy embeddings index with 768 dimensions"

Custom Model Serving

"Deploy custom scikit-learn model to Vertex AI"
"Create endpoint for TensorFlow model with GPU"

Batch Predictions

"Set up batch prediction job for large dataset"
"Deploy batch inference with T4 GPUs"

Feature Store

"Create Feature Store for user features"
"Deploy feature serving for real-time predictions"

Best Practices

✅ Private Endpoints: Use VPC for production endpoints
✅ Auto-scaling: Configure based on traffic patterns
✅ Monitoring: Deploy dashboards and alerts
✅ Encryption: Use CMEK for sensitive models
✅ Version Control: Tag model versions
✅ Cost Optimization: Use preemptible VMs for batch jobs
✅ Traffic Splitting: Blue/green deployments
✅ Model Registry: Organize models in Vertex AI Model Registry

Requirements

  • Terraform >= 1.5.0
  • Google Cloud Provider >= 5.0
  • Google Cloud Project with billing enabled
  • Appropriate IAM permissions
  • Model artifacts prepared
  • gcloud CLI

License

MIT

Support

  • Issues: https://github.com/jeremylongshore/claude-code-plugins/issues
  • Discussions: https://github.com/jeremylongshore/claude-code-plugins/discussions

Version

1.0.1 (2025) - Comprehensive Vertex AI infrastructure (Model Garden, Vector Search, Pipelines)