Your GPU is probably doing nothing right now.
Check your system monitor. If you're running AI agents — whether for DevOps, content generation, monitoring, or automation — there's a good chance your workflow looks like this: CPU at 100%, GPU at 0%, and your cloud API bill climbing. Every classification, every summary, every embedding goes over the wire.
I spent the last month fixing this. The result: 10 autonomous AI agents running 24/7 on an NVIDIA Jetson Orin Nano — an 8GB, 15-watt edge device that fits under a desk. About 60% of all inference runs locally on the GPU at zero marginal cost. The other 40% goes to cloud APIs for complex reasoning.
Total monthly infrastructure cost beyond my API subscription: $1.50 in electricity.
This post covers the full architecture — the tiered model strategy, the GPU services, the agent design patterns, and the hard-won lessons.
The Architecture: Three Tiers of Intelligence
The core insight is simple: not every AI task needs a frontier model.
When an agent checks if a service is healthy, it doesn't need GPT-5 or Claude Opus. When it classifies a log entry as "normal" or "error," it doesn't need 200 billion parameters. When it generates an embedding for search, it definitely doesn't need a $0.01/request API call.
So we split everything into three tiers:
- Tier 1 — 4B Parameters, Local GPU, $0 Cost. Classification, summarization, embeddings, health checks, log parsing. About 60% of all tasks.
- Tier 2 — 7-8B Parameters, Local GPU, $0 Cost. More complex local reasoning. Currently in development.
- Tier 3 — Frontier Models (Sonnet/Opus), Cloud API. Code review, security analysis, architecture decisions, complex writing. About 40% of tasks.
~60% of all agent tasks are Tier 1. Health checks, log classification, document indexing, status formatting — these are pattern-matching problems that a 4B model handles in milliseconds.
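The tiering above can be expressed as a small routing table. The task categories and model names follow the post; the table keys and the escalate-by-default rule are my own assumptions (Tier 2 is omitted here since it's still in development).

```python
# Illustrative sketch of the three-tier split. Task names and the default
# rule are assumptions; model names are the ones used in this post.
TIER_ROUTES = {
    "classify":     ("tier1", "gemma3:4b"),        # local GPU, $0
    "summarize":    ("tier1", "gemma3:4b"),
    "embed":        ("tier1", "nomic-embed-text"),
    "health_check": ("tier1", "gemma3:4b"),
    "code_review":  ("tier3", "cloud-frontier"),   # Sonnet/Opus via API
    "security":     ("tier3", "cloud-frontier"),
    "architecture": ("tier3", "cloud-frontier"),
}

def route(task_type: str) -> tuple[str, str]:
    """Unknown tasks escalate to the frontier tier rather than degrade."""
    return TIER_ROUTES.get(task_type, ("tier3", "cloud-frontier"))
```

Defaulting unknown work upward costs a little money but never quietly hands a hard problem to a 4B model.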
The GPU Setup: Two Services That Changed Everything
Service 1: Vector Search
Instead of grepping through thousands of files with keywords, we index everything into semantic vectors using a local embedding model. Search by meaning, not string matching.
How it works:
- Scan all markdown and text files in the workspace
- Chunk them into ~500-word segments with 50-word overlap
- Generate embeddings using `nomic-embed-text` on the local GPU
- Store vectors in a faiss index for fast similarity search
- Auto-reindex every 5 minutes — incremental, only re-embeds changed files
The embedding generation hits the local GPU via Ollama's API. Each file gets chunked with overlap to prevent splitting ideas at boundaries. An MD5 hash tracks which files have changed, so we only regenerate embeddings when content actually changes. A full reindex of 1,600+ chunks takes about 15 minutes. An incremental check takes less than 1 second.
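That loop can be sketched roughly as follows. The Ollama embeddings endpoint, the `nomic-embed-text` model, and the 500/50-word chunking come from the description above; the function names, the `*.md` glob, and the in-memory `seen` hash map are illustrative, and the faiss insertion step is omitted for brevity.

```python
import hashlib
import json
import pathlib
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"
CHUNK_WORDS, OVERLAP_WORDS = 500, 50

def chunk(text: str) -> list[str]:
    """Split text into ~500-word chunks with a 50-word overlap."""
    words = text.split()
    step = CHUNK_WORDS - OVERLAP_WORDS
    return [" ".join(words[i:i + CHUNK_WORDS]) for i in range(0, len(words), step)]

def file_hash(path: pathlib.Path) -> str:
    """MD5 of the file contents: cheap change detection between runs."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def embed(text: str) -> list[float]:
    """One embedding from the local GPU via Ollama's embeddings API."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps({"model": "nomic-embed-text", "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def reindex(workspace: pathlib.Path, seen: dict[str, str]) -> list[tuple[str, list[float]]]:
    """Re-embed only files whose hash changed since the last pass."""
    updates = []
    for path in workspace.rglob("*.md"):
        h = file_hash(path)
        if seen.get(str(path)) == h:
            continue  # unchanged: this is the sub-second incremental path
        seen[str(path)] = h
        for piece in chunk(path.read_text()):
            updates.append((str(path), embed(piece)))
    return updates
```

Persist `seen` between runs (a JSON file is enough) and the 5-minute cron becomes a near-free no-op whenever nothing has changed.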
Result: Semantic search across 1,600+ document chunks in under 200ms. Zero API cost.
Service 2: Batch Inference API
A lightweight HTTP wrapper around the local model with three endpoints: /classify for categorization, /summarize for condensing text, and /infer for general-purpose completions.
The classification endpoint is the workhorse. It takes a text input and a list of categories, prompts the model at temperature 0.0 with a 10-token limit, and returns a single-word answer. Fast and deterministic.
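A minimal sketch of that classification path: the temperature-0.0 and 10-token settings come from the text, and the endpoint and `options` fields are Ollama's generate API; the prompt wording and helper names are my own.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"

def build_prompt(text: str, categories: list[str]) -> str:
    """Constrain the model to a single-word answer."""
    return (
        f"Classify the following text as one of: {', '.join(categories)}.\n"
        "Answer with a single word only.\n\n"
        f"Text: {text}\nAnswer:"
    )

def classify(text: str, categories: list[str], model: str = "gemma3:4b") -> str:
    """Deterministic classification: temperature 0.0, 10-token cap."""
    body = {
        "model": model,
        "prompt": build_prompt(text, categories),
        "stream": False,
        "options": {"temperature": 0.0, "num_predict": 10},
    }
    req = urllib.request.Request(
        GENERATE_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip().lower()
```

Wrap `classify` in whatever HTTP framework you like for the `/classify` endpoint; the prompt shape is what makes the output a single clean label.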
Performance on a Jetson Orin Nano (8GB):
- Classification speed: ~24 tokens/second
- Classification latency: 7-9 seconds per request
- Warm inference startup: under 200ms when model is already loaded
- Summarization: ~24 tok/s, varies with output length
The Critical Trick: Pin Your Models Warm
The single biggest performance win is keeping your models loaded in GPU memory permanently.
By default, Ollama unloads models after 5 minutes of inactivity. Every cold start takes 7-8 seconds to load the model from disk back into VRAM. If you're running classification every few minutes, you're constantly paying that penalty.
The fix: set `keep_alive` to `-1` when making your first request. This tells Ollama to never unload the model. It stays in GPU VRAM permanently until you restart the service.
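In practice this is one request fired at service startup. The endpoint and `keep_alive` field are Ollama's documented API; the helper names are illustrative.

```python
import json
import urllib.request

def pin_payload(model: str = "gemma3:4b") -> dict:
    """Request body that loads the model and disables the idle unload."""
    return {"model": model, "prompt": "", "keep_alive": -1, "stream": False}

def pin_model(model: str = "gemma3:4b") -> None:
    """Fire once at startup; the model stays in VRAM afterwards."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(pin_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```

An empty prompt is enough to trigger the load, so pinning costs no tokens.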
Memory budget on an 8GB shared-memory device:
- OS and applications: ~1.8 GB
- Inference model (Gemma3 4B): ~3.3 GB
- Embedding model (nomic-embed-text): ~0.3 GB
- Available headroom: ~2.6 GB
On shared-memory architectures like Jetson, GPU VRAM and system RAM are the same pool — budget carefully. On discrete GPUs (RTX cards), VRAM is separate and you have significantly more headroom.
The Agent Architecture: 10 Headless Workers
Here's where it gets interesting. The agents aren't chatbots — they're headless cron workers that wake up on a schedule, check their domain, and go back to sleep.
The Org Structure
At the top is the human (CTO). Below that is the Conductor — an always-on orchestrator running on a frontier model. The Conductor doesn't write code; it conducts the team. It scopes work, spawns sub-agents for implementation, reviews output, and ships it.
Below the Conductor, agents split into two groups.
Cloud API agents (Tier 3 — Thinkers and Reviewers):
- Project Manager — every 30 min. Tracks all projects, checks for open PRs, build failures, blockers. Generates action plans, not status reports.
- Security Scanner — hourly. Runs dependency audits across all repos, scans for hardcoded secrets, checks auth patterns, flags architecture drift.
- QA Reviewer — every 30 min. Watches for open pull requests. Reviews for bugs, security issues, code quality. Can approve and merge clean PRs automatically.
- Brand Auditor — weekly. Audits all public-facing sites for visual consistency: color palettes, favicons, OpenGraph metadata, mobile responsiveness.
Local GPU agents (Tier 1 — Monitors):
- Document Indexer — hourly. Indexes all workspace files into semantic vectors. Maintains the search index.
- Health Monitor — hourly. Checks CPU temp, RAM, swap, disk, GPU thermals, service health. Triages by severity.
- Log Auditor — hourly. Reads application logs, system journal, SSH attempts. Classifies every entry. Flags anomalies.
- Ops Agent — daily. Manages secondary infrastructure, disk cleanup, connectivity checks.
Plus one planned Research Agent for threat intelligence.
The Communication Pattern
This is the most important design decision we made.
When an agent runs and everything is OK, it stays completely silent. No message, no status update, nothing. Only when it finds a real, actionable issue does it post to its assigned alert channel. The Conductor sees the alert, coordinates a fix — often by spawning a sub-agent to implement it — and the QA agent reviews the resulting pull request.
Silence is the feature. If your alert channels are quiet, everything is working. Agents only speak when they have something worth saying.
We learned this the hard way. In our first week, agents posted status updates every 30 minutes — even when nothing had changed. "All systems healthy!" twelve times a day across eight channels. It was unbearable.
Now the rule is absolute: if nothing needs human attention, say nothing. The absence of alerts IS the status report.
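The run loop that enforces this is tiny. This is a sketch with illustrative names; `post_alert` stands in for whatever channel integration you use, and each check returns an empty string when its domain is healthy.

```python
def run_agent(checks, post_alert) -> list[str]:
    """Run every check; speak only when something is actionable."""
    issues = [msg for check in checks if (msg := check())]
    if issues:
        post_alert("\n".join(issues))  # one message, only on real findings
    return issues  # empty list means healthy, and nothing was posted
```

The structural guarantee matters: the code path that posts a message simply does not exist when the issue list is empty, so the model never gets a chance to "helpfully" summarize.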
Workspace Isolation
Each agent gets its own isolated workspace directory containing an identity file (we call it SOUL.md), a memory directory for state and logs, and a model configuration file.
The identity file is the key. It defines who the agent is, what it does, and how it behaves — including its personality. When an agent wakes up for a cron run, it reads its identity first. This prevents identity bleed — your security scanner doesn't accidentally start acting like your project manager.
The Conductor maintains the master workspace. Agent workspaces are satellites with their own state.
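The wake-up sequence is the part worth showing. `SOUL.md` is the identity file named above; the workspace layout and function name are illustrative.

```python
import pathlib

def wake(workspace: pathlib.Path) -> str:
    """Read the agent's identity file before doing any work."""
    identity = (workspace / "SOUL.md").read_text()
    # The cron prompt is then built with this identity prepended, so each
    # run starts in character and agents don't bleed into each other.
    return identity
```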
Model Selection: What We Learned
We tested every 4B model available for Tier 1 tasks. Here's what we found:
- Gemma3 4B — The winner. Clean single-word classifications, excellent instruction following, 24 tok/s. When you ask it to classify something, it gives you one word, not a paragraph explaining its reasoning.
- Qwen3 4B — Solid alternative. Good instruction following at ~20 tok/s. A reliable second choice.
- Llama 3.2 3B — Decent but too small for reliable classification at ~25 tok/s. Sometimes gives inconsistent results.
- Nemotron 4B — Surprising disappointment. ~19 tok/s and poor instruction following. Has a tendency to echo the prompt back instead of generating a response. Fine for embeddings (`nomic-embed-text` is excellent), but not reliable for inference tasks.
Key tip: If you're using a reasoning model — anything that produces chain-of-thought thinking tags — strip the thinking tokens from the output before returning results to your agents. A simple regex replacement cleans them out.
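The cleanup looks like this, assuming the common `<think>...</think>` tag convention; adjust the tag name to whatever your model actually emits.

```python
import re

def strip_thinking(output: str) -> str:
    """Drop <think>...</think> blocks before handing results to agents."""
    return re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
```

The `re.DOTALL` flag matters: chain-of-thought blocks usually span multiple lines, and without it `.` won't cross newlines.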
The Numbers
Stress Test Results
We fired all agents simultaneously while issuing concurrent GPU requests, to see what would break:
- Peak GPU temperature: 71°C (26°C below the throttle threshold)
- Peak RAM usage: 4.5 GB out of 8 GB (recovered quickly after spike)
- Swap impact: Negligible — less than 5 MB of movement
- Service survival rate: 100% — everything recovered after the load spike
- Bottleneck: CPU contention, not thermal or memory
Monthly Cost
- Tier 1 agents (4 agents, ~24 runs per day): $0 — run entirely on the local GPU
- Tier 3 agents (4 agents via cloud API): Included in API subscription
- Conductor (always-on orchestrator): Included in API subscription
- Hardware power draw (15W continuous): ~$1.50/month
- Total infrastructure beyond subscription: $1.50
Task Distribution
About 60% of tasks run locally at zero cost. The breakdown:
- Health monitoring: ~20% of tasks (Tier 1, local, $0)
- Log classification: ~15% (Tier 1, local, $0)
- Document indexing: ~15% (Tier 1, local, $0)
- Status formatting: ~10% (Tier 1, local, $0)
- Code review: ~15% (Tier 3, cloud)
- Security scanning: ~10% (Tier 3, cloud)
- Architecture decisions: ~10% (Tier 3, cloud)
- Brand auditing: ~5% (Tier 3, cloud)
10 Lessons From Running This for a Month
- Start with one agent, not ten. Build the conductor first. Add agents one at a time. Validate each before adding the next. We started with two and scaled to ten over three weeks.
- Personality constrains behavior. Agents with defined personalities produce more consistent output than agents with just instructions. A "nervous project manager" consistently tracks blockers. A "stoic security scanner" consistently finds vulnerabilities. The personality becomes a behavioral guardrail.
- Silence must be explicitly enforced. AI models want to show their work. You have to tell them — in the identity file, in the cron prompt, AND in the rules section — "if nothing is wrong, say absolutely nothing." Say it three times in three places. They'll still try to sneak in a summary.
- 4B models cannot do complex reasoning. We tested coding tasks on a 4B model. It failed completely. Small models are Tier 1 only: classify, format, embed, simple Q&A. Don't ask them to think.
- Pin models warm in VRAM. Cold starts waste 7-8 seconds per inference. Keeping models permanently loaded eliminates this entirely. The memory cost is worth it.
- Incremental indexing is non-negotiable. A full reindex of 1,600+ chunks takes 15+ minutes. Hash-based change detection drops incremental checks to under 1 second. Only re-embed what changed.
- The conductor doesn't code. Separation of concerns: the orchestrator scopes work and reviews output. Implementation agents write code. This keeps the architecture clean and prevents the orchestrator from getting stuck in implementation details.
- All code goes through PR review. Every change an agent makes goes through a pull request. Another agent reviews it. No direct pushes to main. This creates an audit trail and catches bugs that coding agents introduce.
- Stress test before you trust it. Fire all agents simultaneously and watch thermals, RAM, and swap. Know your limits before they surprise you at 3 AM.
- Write everything down. Every decision, every architecture change, every breakthrough gets logged. The workspace is the team's shared memory. Agents without persistent memory make the same mistakes every session.
Getting Started
If you want to replicate this:
- Hardware: Any machine with an NVIDIA GPU — a Jetson ($200-500), an old gaming PC, or a cloud GPU instance. 8GB VRAM minimum for a 4B model plus embeddings.
- Software: Ollama for local inference, PM2 for process management, faiss for vector search. All free and open-source.
- Models: Pull `gemma3:4b` and `nomic-embed-text` through Ollama. Under 4GB total.
- Start small: One orchestrator agent plus one monitoring agent. Add more as you validate the pattern.
- The GPU services (vector search and batch inference) take about an hour to set up. The agent architecture takes a few days to tune — mostly figuring out the right silence rules.
What's Next
We're working on Tier 2 inference with 7-8B models for more capable local reasoning, cross-machine agent execution across networked devices, task routing middleware that automatically sends requests to the cheapest capable tier, and continuous learning loops where agents improve classification accuracy over time.
The edge AI future isn't about replacing cloud models. It's about using the right model for each task — and being honest about which tasks are zero-dollar problems.
Running 10 agents on 15 watts. The future is local.