
From Rules to Intelligence: Automating Network Infrastructure Capacity Delivery at Scale

How we evolved from policy-based automation to cost-optimized, self-decisive systems for delivering network infrastructure capacity across all machine racks, ensuring ML workloads never wait for connectivity.


At Google scale, every hour of delay in network infrastructure delivery costs real money. When thousands of machine racks are deployed across global data centers, and ML training runs are waiting for connectivity, manual capacity planning becomes the bottleneck that breaks everything.

I led a cross-functional program that transformed how we deliver network infrastructure capacity for all machine racks—with ML racks as the most critical and demanding use case. What began as rule-based automation evolved into self-decisive, cost-optimized systems that ensure network blocks are delivered precisely when needed, eliminating the dreaded scenario of ML machine racks sitting idle without connectivity.

This journey from reactive policies to intelligent resource allocation mirrors the challenges we’re solving today in Vertex AI GenAI resource management, where the stakes are even higher and the complexity even greater.

The Scale Challenge: Network Infrastructure for Global Data Centers

When you’re operating hundreds of data centers globally, network infrastructure capacity delivery becomes a complex optimization problem with multiple competing objectives:

  • Cost: Network equipment is expensive, and over-provisioning wastes millions of dollars
  • Timing: Under-provisioning creates bottlenecks that idle expensive compute hardware
  • Reliability: Network failures cascade to thousands of downstream services
  • Efficiency: Poor placement decisions create long-term operational overhead

Why ML Racks Changed Everything

While we automated capacity delivery for all machine racks, ML racks became our most demanding and highest-stakes use case:

Criticality: ML training runs represent massive investments—a single large model training can cost millions of dollars. Any network delay that stops training is immediately visible to executives.

Scale: ML deployments aren’t incremental. When you’re launching a new model, you need thousands of machines with full connectivity all at once. No graceful degradation, no partial deployment.

Complexity: ML interconnect topologies (fat trees, torus networks, custom fabrics) require careful planning to avoid hotspots and ensure predictable performance.

Evolution: Every new generation of ML accelerators brings new networking requirements that break existing assumptions.

The reality: ML workloads exposed every weakness in our manual processes and forced us to build truly intelligent automation.

Building the Network Switch Port Reservation System

The first automation we tackled was network switch port reservations—seemingly simple, but critically important for eliminating manual planning errors.

The Manual Process Was Brittle

Before automation, provisioning new ML hardware involved:

  1. Physical planning: Engineers manually mapping machines to rack locations
  2. Network planning: Calculating which switch ports each machine needed
  3. Reservation tracking: Spreadsheets (yes, spreadsheets) tracking port assignments
  4. Configuration generation: Manual creation of network configs

This process had a 30% error rate. Wrong port assignments meant machines that couldn’t communicate, failed training jobs, and expensive re-work.

The Automated Solution

We built a reservation system that became the source of truth for all ML machine connectivity:

# Simplified version of our reservation logic
class NetworkPortReservation:
    def __init__(self, datacenter_topology):
        self.topology = datacenter_topology
        self.reservations = ReservationStore()

    def reserve_ports_for_deployment(self, deployment_spec):
        """Reserve network ports for a new ML deployment"""
        required_bandwidth = deployment_spec.bandwidth_requirements
        machine_count = deployment_spec.machine_count
        locality_constraints = deployment_spec.locality_requirements

        # Find optimal switch placement
        switch_allocation = self.optimize_switch_placement(
            machine_count,
            required_bandwidth,
            locality_constraints
        )

        # Reserve specific ports atomically
        reservation_id = self.reservations.create_reservation(
            switches=switch_allocation.switches,
            ports=switch_allocation.ports,
            duration=deployment_spec.duration
        )

        return reservation_id
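
As a quick usage sketch building on the simplified class above (the DeploymentSpec shape and field values here are hypothetical, not the real internal types):

# Hypothetical caller-side sketch; real deployment descriptors are internal
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class DeploymentSpec:
    machine_count: int
    bandwidth_requirements: str
    locality_requirements: str
    duration: timedelta

spec = DeploymentSpec(
    machine_count=256,
    bandwidth_requirements="200Gbps",
    locality_requirements="same_superblock",
    duration=timedelta(days=90),
)

# `topology` would come from the datacenter inventory system
reservation_system = NetworkPortReservation(datacenter_topology=topology)
reservation_id = reservation_system.reserve_ports_for_deployment(spec)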

Key Design Decisions

Atomic Reservations: Port reservations are all-or-nothing. Either you get all the ports you need for a deployment, or you get none. This prevents partial allocations that leave deployments in limbo.

Topology Awareness: The system understands datacenter network topology—which switches can handle high-bandwidth ML traffic, which ports are already reserved, and how to minimize network hops for training workloads.

Temporal Planning: Reservations include time dimensions. You can reserve ports for future deployments, allowing capacity planning weeks ahead of actual hardware arrival.
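
To make the "atomic, time-aware" idea concrete, here is a minimal, self-contained sketch; it is illustrative only, since the production store was a replicated service rather than an in-memory dictionary:

# Toy all-or-nothing reservation store with time windows (illustrative)
import uuid
from datetime import datetime

class ToyReservationStore:
    def __init__(self):
        # port_id -> list of (start, end, reservation_id) holds
        self._holds = {}

    def _is_free(self, port_id, start, end):
        return all(hold_end <= start or hold_start >= end
                   for hold_start, hold_end, _ in self._holds.get(port_id, []))

    def create_reservation(self, ports, start, end):
        # Atomic: check every port first, commit only if all are free
        if not all(self._is_free(p, start, end) for p in ports):
            raise RuntimeError("conflict: refusing a partial allocation")
        reservation_id = str(uuid.uuid4())
        for p in ports:
            self._holds.setdefault(p, []).append((start, end, reservation_id))
        return reservation_id

store = ToyReservationStore()
rid = store.create_reservation(
    ports=["switch-a/port-1", "switch-a/port-2"],
    start=datetime(2024, 1, 1),
    end=datetime(2024, 4, 1),
)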

Cross-Functional Program Management

Technical solutions are only half the battle. Successful capacity delivery automation requires coordinating across multiple teams with different priorities and timelines.

The Teams Involved

  • Hardware Engineering: Defines physical requirements, power needs, cooling constraints
  • Network Engineering: Manages switch capacity, topology planning, configuration management
  • Data Center Operations: Handles physical installation, power/cooling provisioning
  • ML Infrastructure: Defines connectivity requirements, traffic patterns, SLA needs
  • Procurement: Manages hardware ordering, delivery scheduling, vendor coordination

Coordination Challenges

Each team operates on different timelines:

  • Procurement plans hardware orders 6+ months ahead
  • DC Operations schedules installations weeks in advance
  • Network Engineering needs days for configuration changes
  • ML teams often need capacity “yesterday” for urgent training runs

The automation had to accommodate all these constraints while remaining flexible enough for last-minute changes.

Program Management Approach

We implemented a staged delivery model:

  1. Quarterly Planning: High-level capacity forecasting aligned with ML roadmaps
  2. Monthly Reviews: Detailed scheduling of specific deployments
  3. Weekly Checkpoints: Status updates, risk mitigation, timeline adjustments
  4. Daily Automation: Automated port reservations, configuration generation, validation

This multi-timescale approach let us maintain strategic visibility while enabling tactical flexibility.

Handling NPI (New Product Introduction) Adaptability

The biggest test of any automation system is how it handles completely new hardware that wasn’t anticipated in the original design.

The NPI Challenge

Every new generation of ML hardware introduces surprises:

  • Different form factors requiring new rack layouts
  • New networking protocols not supported by existing switches
  • Unique power/cooling requirements affecting placement constraints
  • Novel interconnect topologies optimized for specific workload patterns

Traditional automation breaks when assumptions change. We needed a system that could adapt.

Building Adaptive Automation

Our approach was to parameterize everything and make the system data-driven:

# Hardware specification (simplified)
tpu_v5_spec:
  form_factor: "blade"
  power_consumption: 400W
  cooling_requirements: "liquid"
  network_interfaces:
    - type: "ethernet"
      ports: 2
      bandwidth: "100Gbps"
    - type: "infiniband"
      ports: 8
      bandwidth: "200Gbps"
  placement_constraints:
    - "same_rack": true
    - "network_distance": "single_hop"
  deployment_patterns:
    - name: "training_pod"
      size: 64
      topology: "torus"

Instead of hard-coding hardware assumptions, we created a hardware description language that could capture the requirements of any new ML accelerator.
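
A sketch of how such a spec might be consumed by the automation, assuming PyYAML and the hypothetical field names from the example above (the internal format was richer):

# Load and sanity-check a hardware description before planning (illustrative)
import yaml  # PyYAML, assumed available

REQUIRED_FIELDS = {"form_factor", "power_consumption", "cooling_requirements",
                   "network_interfaces", "placement_constraints"}

def load_hardware_spec(path, spec_name):
    with open(path) as f:
        specs = yaml.safe_load(f)
    spec = specs[spec_name]

    missing = REQUIRED_FIELDS - spec.keys()
    if missing:
        raise ValueError(f"{spec_name} is missing fields: {sorted(missing)}")

    # Derive planning inputs from the spec instead of hard-coding per-generation rules
    ports_per_machine = sum(nic["ports"] for nic in spec["network_interfaces"])
    return spec, ports_per_machine

# spec, ports_per_machine = load_hardware_spec("hardware_specs.yaml", "tpu_v5_spec")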

Validation and Testing

For each new hardware type, the automation system runs through:

  1. Simulation: Test port allocation algorithms against the new hardware spec
  2. Validation: Verify network configurations meet bandwidth/latency requirements
  3. Staged Rollout: Deploy small batches first to validate end-to-end connectivity
  4. Full Deployment: Scale to production volumes only after validation passes

This approach reduced NPI deployment errors by 90% compared to manual processes.
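
Concretely, the gating can be sketched as a short pipeline that stops at the first failure; the stage implementations below are placeholders for the internal simulation and config-checking tools:

# Placeholder stages standing in for internal tooling (illustrative)
def simulate_port_allocation(spec):
    return True, "allocation plan fits available switch ports"

def validate_network_configs(spec):
    return True, "generated configs meet bandwidth/latency targets"

def deploy_canary_batch(spec, batch_size):
    return True, f"{batch_size} machines brought up with full connectivity"

def validate_new_hardware(spec):
    """Run NPI validation stages in order; stop at the first failure."""
    stages = [
        ("simulation", simulate_port_allocation),
        ("config validation", validate_network_configs),
        ("staged rollout", lambda s: deploy_canary_batch(s, batch_size=4)),
    ]
    for name, stage in stages:
        ok, details = stage(spec)
        if not ok:
            return False, f"{name} failed: {details}"
    return True, "ready for full deployment"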

Evolution: From Rules to Intelligence

The most fascinating part of this journey was watching our automation evolve from reactive rules to proactive intelligence. This evolution happened in three distinct phases:

Phase 1: Rule-Based Automation

Our initial system was essentially sophisticated if-then logic:

# Simplified rule-based logic (Phase 1)
def allocate_network_capacity(deployment_request):
    if deployment_request.rack_type == "ML":
        priority = "HIGH"
        bandwidth_multiplier = 2.0
        redundancy_requirement = "DUAL_PATH"
    elif deployment_request.rack_type == "COMPUTE":
        priority = "MEDIUM"
        bandwidth_multiplier = 1.0
        redundancy_requirement = "SINGLE_PATH"

    # Apply static allocation rules
    return apply_allocation_rules(
        priority, bandwidth_multiplier, redundancy_requirement
    )

This worked, but it was brittle. Rules optimized for TPU v4 deployments failed spectacularly when we started deploying GPU clusters with different interconnect patterns.

Phase 2: Cost-Aware Optimization

We realized that optimal decisions required understanding costs, not just following rules. This led to our breakthrough insight: treat network capacity delivery as a cost optimization problem.

# Cost-optimization approach (Phase 2)
class NetworkCapacityOptimizer:
    def __init__(self, cost_model):
        self.cost_model = cost_model

    def optimize_allocation(self, deployment_requests, constraints):
        """Find minimum-cost allocation satisfying all constraints"""

        # Calculate costs for different allocation strategies
        strategies = self.generate_allocation_strategies(deployment_requests)

        best_strategy = None
        min_cost = float('inf')

        for strategy in strategies:
            cost = self.calculate_total_cost(strategy)
            if cost < min_cost and self.satisfies_constraints(strategy, constraints):
                min_cost = cost
                best_strategy = strategy

        return best_strategy

    def calculate_total_cost(self, strategy):
        """Calculate total cost including equipment, power, opportunity cost"""
        equipment_cost = self.cost_model.equipment_cost(strategy.switches_needed)
        power_cost = self.cost_model.power_cost(strategy.power_consumption)
        opportunity_cost = self.cost_model.delay_cost(strategy.deployment_time)

        return equipment_cost + power_cost + opportunity_cost
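
For illustration, a toy cost model plugged into this optimizer might look like the following; the constants are invented, while the real model drew on internal equipment, power, and opportunity-cost data:

# Invented numbers, purely to show the interface the optimizer expects
class ToyCostModel:
    def equipment_cost(self, switches_needed):
        return switches_needed * 50_000               # e.g. $50K per switch

    def power_cost(self, power_consumption_kw):
        return power_consumption_kw * 24 * 90 * 0.10  # 90 days at $0.10/kWh

    def delay_cost(self, deployment_time_days):
        # Idle accelerators dominate: every day of delay is expensive
        return deployment_time_days * 200_000

optimizer = NetworkCapacityOptimizer(cost_model=ToyCostModel())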

Phase 3: Self-Decisive Intelligence

The final evolution was making the system truly autonomous. Instead of humans setting optimization parameters, the system learned from historical deployments and adjusted its decision-making based on outcomes.

Key capabilities that emerged:

Dynamic Cost Modeling: The system learned that a $50K switch deployed early could prevent $2M in idle compute costs, and adjusted its decision-making accordingly.

Predictive Allocation: Instead of waiting for deployment requests, the system started pre-positioning capacity based on forecasted demand patterns.

Adaptive Constraints: The system learned when to bend rules (e.g., accepting single-path redundancy for urgent ML deployments) and when rules were non-negotiable.

Feedback Loops: Post-deployment metrics fed back into the optimization model, improving future decisions.
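
One way to picture the feedback loop: the delay-cost term stops being a hand-tuned constant and gets re-estimated from observed outcomes. A hedged sketch, using simple exponential smoothing for illustration (the production system used richer models):

# Learn the cost of a day of deployment delay from post-deployment metrics (illustrative)
class LearnedDelayCost:
    def __init__(self, initial_cost_per_day=100_000, smoothing=0.2):
        self.cost_per_day = initial_cost_per_day
        self.smoothing = smoothing

    def record_outcome(self, delay_days, observed_idle_compute_cost):
        # Feed observed idle-compute cost back into the estimate
        if delay_days > 0:
            observed_per_day = observed_idle_compute_cost / delay_days
            self.cost_per_day = ((1 - self.smoothing) * self.cost_per_day
                                 + self.smoothing * observed_per_day)

    def delay_cost(self, deployment_time_days):
        return deployment_time_days * self.cost_per_day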

The Parallel: Vertex AI GenAI Resource Management

This evolution from rules to intelligence directly applies to our current challenges with GenAI resource allocation on Vertex AI:

Similar Problem: We need to allocate GPU/TPU capacity across thousands of inference workloads with different SLA requirements, cost constraints, and performance characteristics.

Similar Evolution: We’re seeing the same progression—starting with policy-based allocation, moving toward cost-optimization, and ultimately building systems that make autonomous resource decisions.

Higher Stakes: GenAI workloads are even more dynamic and cost-sensitive than training workloads. The system needs to make allocation decisions in milliseconds, not hours.

The lesson: Intelligent infrastructure systems aren’t built—they’re evolved through iterations of automation, measurement, and optimization.

Impact and Results

After two years of development and deployment, the capacity delivery automation delivered measurable improvements:

Operational Metrics

Zero Downtime: Eliminated unplanned downtime for new ML hardware launches—previously we had 2-3 incidents per quarter where connectivity issues delayed major training runs.

Planning Error Reduction: Manual port assignment errors dropped from 30% to <1%. The few remaining errors were due to physical hardware failures, not planning mistakes.

Deployment Speed: Time from hardware arrival to ML workload readiness decreased from 5-7 days to 8-12 hours—a 10x improvement.

Strategic Impact

Cost Optimization: Faster deployment meant ML teams could start training runs sooner, reducing the effective cost per training token.

Reliability: Automated configuration generation eliminated configuration drift and human errors that previously caused mysterious network issues weeks after deployment.

Scalability: The system handled 3x growth in ML hardware deployments without requiring proportional growth in operational staff.

Lessons Learned

1. Automate the Critical Path, Not Everything

We initially tried to automate every aspect of capacity delivery. This was a mistake. Some processes—like handling physical hardware failures—still benefit from human judgment and flexibility.

Focus automation efforts on the critical path activities that are both high-risk and high-frequency. For us, that was network port reservations and configuration generation.

2. Design for Failure, Not Just Success

The most important part of our automation wasn’t the happy path—it was graceful degradation when things went wrong.

  • What happens if a switch fails during deployment?
  • How do you handle partial network connectivity?
  • Can the system recover from corrupted reservation state?

Building robust error handling and rollback capabilities was more valuable than optimizing the success case.
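
In practice that meant every reservation and configuration step was written so it could be undone. A simplified sketch of the pattern, with the step functions passed in because the real implementations are internal:

# Reserve, configure, validate; undo everything if any step fails (illustrative)
def deploy_with_rollback(reservation_store, deployment_spec,
                         generate_configs, push_configs, validate_connectivity):
    reservation_id = reservation_store.create_reservation(deployment_spec)
    try:
        push_configs(generate_configs(reservation_id))
        if not validate_connectivity(reservation_id):
            raise RuntimeError("post-deployment connectivity check failed")
        return reservation_id
    except Exception:
        # Roll back to a known-good state rather than leave a partial deployment
        reservation_store.cancel_reservation(reservation_id)
        raise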

3. Observability Is Non-Negotiable

You can’t manage what you can’t measure. We invested heavily in:

  • Real-time dashboards showing deployment progress and bottlenecks
  • Automated alerting when reservations failed or configurations were rejected
  • Post-deployment validation to catch issues before ML workloads started
  • Historical metrics to identify trends and capacity planning needs

4. Cross-Team Alignment Takes Time

The technical automation was the easy part. Getting five different teams to agree on processes, timelines, and priorities took months of meetings, documentation, and compromise.

Invest in clear interfaces and contracts between teams. Define who is responsible for what, when handoffs occur, and how to escalate when things go wrong.

Looking Forward

The capacity delivery automation we built was just the beginning. The next frontier is predictive capacity planning—using ML to forecast demand patterns and pre-position capacity before it’s needed.

Imagine a system that:

  • Predicts training run requirements based on model architecture and dataset size
  • Automatically triggers hardware orders when utilization patterns suggest future shortages
  • Optimizes placement across multiple datacenters to minimize network costs
  • Adapts to new hardware through learned patterns rather than manual configuration

We’re not there yet, but the foundation is in place. Reliable, automated capacity delivery is the prerequisite for intelligent capacity management.


Building infrastructure automation at scale requires balancing technical excellence with organizational reality. The most elegant technical solution is worthless if teams won’t use it, and the most politically acceptable process is worthless if it doesn’t solve the underlying technical problems.
