Infrastructure Diaries

Building Call Centers from Zero

The journey of a solo operator building and running production call centers across 3 countries — the 3AM debugging sessions, the outages, the real fixes, and every system built along the way.

5 Prod Servers
130+ VMs Managed
3 Countries
The Starting Point
Every system solved a real problem.
No toy projects. No lab setups. Every tool, every script, every architecture decision documented here was born from a production incident, a scaling bottleneck, or an operational pain that had to be fixed.
01
Production-Forged
Every config, every script, every SQL query comes from systems handling real calls for real businesses right now.
02
Incident-Driven
Every architecture was born from a real problem — dropped calls at 3AM, dead firewalls, exhausted disks, silent trunk failures.
03
Battle-Tested
Running across 4 data centers, 7 SIP providers, and 3 countries. If it breaks in production, we know about it first.
From first server to full fleet.
Each milestone represents a real system built to solve a real operational challenge. This is the timeline of how a production call center infrastructure grew from nothing.
Phase 1 — Foundation
VoIP Monitoring Stack
SIP trunks silently dropped. Packet loss crept up overnight. Agents got stuck in zombie conferences. We only found problems when customers complained — hours too late.
Built a centralized monitoring stack on Docker — Grafana, Prometheus, Loki, Homer, Smokeping — that watches every server, every trunk, every call in real time. 14 alert rules catch failures within 5 minutes, before anyone notices.
Docker, Grafana, Prometheus, Loki, Homer, Smokeping
Custom Prometheus Exporter
Standard exporters don't understand ViciDial — agent states, campaign stats, trunk health are locked inside Asterisk AMI and MySQL. No visibility into the actual call center metrics that matter.
Python exporter that bridges Asterisk AMI events and ViciDial MySQL queries into Prometheus metrics. Agent call counts, pause durations, trunk registration status, queue depths — all as time-series data with Grafana dashboards.
Python, Prometheus, Asterisk AMI, Grafana
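As a rough illustration of the exporter's output side, the sketch below renders gauge samples in the Prometheus text exposition format using only the standard library. The metric names (`vicidial_agents`, `vicidial_trunk_registered`) and label names are hypothetical, and a real exporter would more likely use the `prometheus_client` package and fill the samples from AMI events and MySQL queries.

```python
from typing import Iterable, NamedTuple


class Sample(NamedTuple):
    name: str      # metric name (hypothetical, e.g. "vicidial_agents")
    labels: dict   # label name -> label value
    value: float


def escape_label(value: str) -> str:
    # The text format requires escaping backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(samples: Iterable[Sample]) -> str:
    """Render gauge samples as Prometheus text exposition output."""
    grouped = {}
    for s in samples:
        # Samples of one metric must appear together, so group by name first.
        grouped.setdefault(s.name, []).append(s)

    lines = []
    for name, group in grouped.items():
        lines.append(f"# TYPE {name} gauge")
        for s in group:
            label_str = ",".join(
                f'{k}="{escape_label(str(v))}"' for k, v in sorted(s.labels.items())
            )
            series = f"{name}{{{label_str}}}" if label_str else name
            lines.append(f"{series} {s.value}")
    return "\n".join(lines) + "\n"
```

Serving this string over HTTP on a `/metrics` path is all Prometheus needs to scrape it; the dashboards then query the resulting time series.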
Server Hardening & Optimization
Inherited servers with default configs, no fail2ban, open MySQL ports, and Apache serving phpMyAdmin to the internet. One server ran 839 days with zero backup cron and 134GB of unmanaged recordings.
Battle-tested hardening playbook: SSH lockdown, fail2ban with custom jails, MariaDB tuning, Apache security headers, automated backups to Hetzner storage boxes, recording cleanup with retention policies, and 240-day disk management.
fail2ban, MariaDB, Apache, iptables, Backup
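The recording cleanup step can be sketched as a small retention job. This is a minimal illustration, not the production script: the function names and the `dry_run` default are assumptions, and the retention window is left as a parameter because the actual policy is not stated here.

```python
import time
from pathlib import Path


def expired_recordings(root, retention_days, now=None):
    """Return files under root whose modification time is older than the retention window."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def purge(paths, dry_run=True):
    """Delete the given files and return the number of bytes they occupied.

    With dry_run=True nothing is deleted; the return value is what would be freed.
    """
    freed = 0
    for p in paths:
        freed += p.stat().st_size
        if not dry_run:
            p.unlink()
    return freed
```

Running it once with `dry_run=True` from cron and logging the byte count is a cheap way to check a retention policy before letting it delete anything.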
Scaling Operations
Phase 2 — Scale
Multi-Country Team Isolation
Three countries (Romania, Spain, Albania) needed isolated teams on one ViciDial server. Agents were seeing each other's campaigns, calls were routing to wrong teams, SIP credentials overlapped.
Complete isolation architecture: dedicated campaigns, inbound groups, user groups, 270 SIP peers, ring group fallback extensions per country. 15 issues found and fixed during setup — from missing SIP peers to wrong agent assignments.
ViciDial, Asterisk, SIP, MySQL, Dialplan
MariaDB Multi-Source Replication
Running reports on production databases was slowing down live call handling. Complex queries on 3 separate servers made cross-server reporting impossible without SSH-hopping and manual aggregation.
Fan-in replication from 3 production servers into a single reporting replica, with each source database renamed on the replica to keep the three apart. All reporting queries hit the replica with read-only access, with zero impact on live operations. A cron job monitors the health of the replication IO and SQL threads.
MariaDB, Replication, SQL, Cron
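The thread-health check can be sketched as a pure function over the rows returned by MariaDB's `SHOW ALL SLAVES STATUS`, one row per replication connection. The column names are the ones MariaDB reports; the connection names in the usage example and the lag threshold are assumptions, since only IO/SQL thread monitoring is described above.

```python
def unhealthy_connections(rows, max_lag_seconds=300):
    """Return human-readable problems for each replication connection.

    rows: list of dicts keyed by SHOW ALL SLAVES STATUS column names.
    max_lag_seconds: assumed alert threshold, not taken from the original setup.
    """
    problems = []
    for row in rows:
        name = row.get("Connection_name") or "(default)"
        io_state = row.get("Slave_IO_Running")
        sql_state = row.get("Slave_SQL_Running")
        lag = row.get("Seconds_Behind_Master")

        if io_state != "Yes":
            problems.append(f"{name}: IO thread is {io_state}")
        if sql_state != "Yes":
            problems.append(f"{name}: SQL thread is {sql_state}")
        # Lag is NULL while a thread is stopped, so only compare real values.
        if lag is not None and int(lag) > max_lag_seconds:
            problems.append(f"{name}: {int(lag)}s behind master")
    return problems
```

A cron wrapper would fetch the rows with any MySQL client library, call this function, and send an alert when the returned list is not empty.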
Cloud VM Fleet Management
Managing 130+ Windows VMs across 4 data centers through Kamatera's dashboard was unworkable. No bulk operations, no cost visibility, no way to assign VMs to agents without manual tracking in spreadsheets.
Custom web panel with role-based access: an admin fleet view with cost analysis and bulk power operations, and an agent panel showing only each agent's assigned servers. SQLite backend, Kamatera API integration, and a VM details cache that auto-refreshes every 5 minutes.
PHP, SQLite, REST API, JavaScript
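The 5-minute VM details cache can be illustrated with a small time-to-live cache. The panel itself is PHP with SQLite, so this Python version only sketches the idea. It refreshes lazily on read rather than on a background timer, and the `fetch` callback stands in for a call to the provider API.

```python
import time


class TTLCache:
    """Cache values per key and refetch them once they are older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callback: key -> fresh value
        self._ttl = ttl_seconds      # 300 s matches the 5-minute refresh
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # key -> (fetched_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # still fresh, no API call
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value
```

The benefit is that a fleet page showing 130+ VMs costs at most one API call per VM per five minutes, no matter how many people reload the panel.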
Intelligent Automation
Phase 3 — AI Integration
AI Voice Agents for Asterisk
Missed calls outside business hours meant lost leads. Human agents can't answer 24/7, and traditional IVR trees frustrate callers into hanging up. We needed something that could actually hold a conversation.
Real-time AI phone agent with sub-250ms latency through Asterisk AudioSocket. Deepgram for speech-to-text, Groq for LLM reasoning, Cartesia for text-to-speech. Also integrated ElevenLabs cloud agent with SIP trunk routing and per-call context injection.
Python, Deepgram, Groq, Cartesia, ElevenLabs, AudioSocket
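On the Asterisk side, AudioSocket delivers call audio over TCP as small framed messages: a 1-byte type, a 2-byte big-endian payload length, then the payload. The sketch below parses that framing as I understand it from the AudioSocket documentation (type 0x00 hangup, 0x01 call UUID, 0x10 signed-linear audio, 0xff error). Treat the constants as an assumption to verify against your Asterisk version; this is not the agent's actual code.

```python
import struct

KIND_HANGUP = 0x00   # call ended
KIND_UUID = 0x01     # 16-byte call identifier, sent first
KIND_AUDIO = 0x10    # 16-bit signed linear PCM, 8 kHz mono
KIND_ERROR = 0xFF

HEADER = struct.Struct(">BH")  # type (1 byte) + payload length (2 bytes, big-endian)


def build_frame(kind, payload=b""):
    """Encode one frame: header followed by the payload."""
    return HEADER.pack(kind, len(payload)) + payload


def parse_frames(buffer):
    """Split buffer into complete (kind, payload) frames.

    Returns (frames, leftover), where leftover holds the bytes of an
    incomplete trailing frame to prepend to the next socket read.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        kind, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append((kind, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]
```

Each audio payload would be forwarded to the speech-to-text stream as it arrives, and synthesized speech is written back to the same socket with `build_frame(KIND_AUDIO, pcm_chunk)`.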
AI Audio Analysis Service
Call quality complaints were impossible to diagnose at scale. Listening to recordings manually doesn't work with thousands of daily calls. We needed automated quality scoring to catch problems early.
FastAPI service with neural MOS scoring (NISQA), voice activity detection (Silero VAD), and AI-generated quality reports via Claude. Scores every call automatically, flags anomalies, provides root cause analysis for quality drops.
FastAPI, Python, NISQA, Silero VAD, Claude AI
QA Pipeline — Transcription + AI Scoring
QA supervisors could manually review maybe 5% of calls. The other 95% went unmonitored. Bad calls slipped through, agent coaching was guesswork, and compliance issues were found weeks too late.
Automated pipeline: Faster-Whisper transcribes every inbound call, then AI (Gemini/Groq) scores across 8 weighted categories — greeting, empathy, resolution, compliance, etc. Supervisors review flagged calls, not random samples.
Faster-Whisper, Gemini, Groq, FastAPI, SQLite
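The weighted scoring step reduces to a weighted average over per-category scores. The sketch below uses only the four categories named above; the weights, the 0 to 10 scale and the review threshold are illustrative assumptions, not the pipeline's real values.

```python
# Hypothetical weights for the four categories named in the text.
EXAMPLE_WEIGHTS = {"greeting": 1, "empathy": 2, "resolution": 3, "compliance": 4}


def weighted_score(scores, weights):
    """Return the weighted average of per-category scores.

    Raises ValueError if the model output is missing a weighted category,
    so an incomplete scoring response never produces a misleading total.
    """
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise ValueError(f"missing category scores: {missing}")
    total_weight = sum(weights.values())
    return sum(scores[cat] * w for cat, w in weights.items()) / total_weight


def needs_review(score, threshold=6.0):
    """Flag a call for supervisor review when its score falls below the threshold."""
    return score < threshold
```

For example, scores of 10, 5, 5 and 10 for greeting, empathy, resolution and compliance give (10·1 + 5·2 + 5·3 + 10·4) / 10 = 7.5, which would not be flagged at the assumed threshold of 6.0.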
Operational Excellence
Phase 4 — Operations
Advanced Inbound Call Flow
Inbound calls needed 5-stage filtering: DID lookup, hours check, prefix blocking, agent routing with rank-based distribution, and ring group fallback. One wrong config line once dropped 11+ calls overnight.
Production-hardened inbound pipeline with repeat caller detection (route returning callers to their previous agent), recording security, codec optimization, and ring group fallbacks that ensure calls never hit silence — even when no agents are logged in.
Asterisk, Dialplan, Perl AGI, MySQL, PHP
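The five filtering stages can be sketched as one routing function that falls through from DID lookup to the ring group. The production flow lives in the Asterisk dialplan and AGI scripts, so this Python version is only a model of the decision order. The data classes, return strings and the rule that the highest rank wins are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Agent:
    extension: str
    rank: int          # assumed: higher rank is offered calls first
    available: bool


@dataclass
class DidConfig:
    opens: time
    closes: time
    blocked_prefixes: tuple   # caller-number prefixes to reject
    agents: list              # Agent objects allowed to take this DID
    ring_group: str           # fallback extension when no agent is ready


def route_call(did, caller, now, dids):
    """Return a routing decision string for one inbound call."""
    cfg = dids.get(did)                               # stage 1: DID lookup
    if cfg is None:
        return "reject:unknown-did"
    if not (cfg.opens <= now < cfg.closes):           # stage 2: hours check
        return "after-hours"
    if caller.startswith(cfg.blocked_prefixes):       # stage 3: prefix blocking
        return "reject:blocked-prefix"
    ready = [a for a in cfg.agents if a.available]    # stage 4: rank-based routing
    if ready:
        best = max(ready, key=lambda a: a.rank)
        return f"agent:{best.extension}"
    return f"ringgroup:{cfg.ring_group}"              # stage 5: never hit silence
```

Because the final `return` is unconditional, a call that passes the first three stages always ends up ringing somewhere, which is the property the ring group fallback exists to guarantee.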
Softphone Mass Deployment
Deploying softphones to 100+ agent machines one by one was a multi-day task. Each needed SIP config, audio tuning, firewall rules, and a watchdog to restart on crash. Manual setup introduced config drift across the fleet.
PowerShell automation that installs, configures, and deploys softphones to entire agent fleets via WinRM. Includes SIP config templating, audio device auto-detection, watchdog service, and remote batch deployment from a central host.
PowerShell, tSIP, WinRM, SIP
WhatsApp Monitor & Email Reports
Critical business updates were scattered across WhatsApp groups. Important messages got buried in noise. No searchable history, no way to know if a team lead missed a key announcement.
FastAPI service that monitors WhatsApp groups, stores all messages in PostgreSQL with full-text search, generates AI daily summaries via Claude, and sends scheduled email digests. Web dashboard for browsing and searching group history.
FastAPI, PostgreSQL, Claude AI, Email
Platform Engineering
Phase 5 — Platform
ViciDial Cluster Architecture
Single-server ViciDial doesn't scale past a certain agent count. Web interface, telephony processing, and database compete for the same CPU. Upgrades require downtime on a live system.
Multi-server cluster with separated roles: dedicated web server, telephony server, database server. Real production configs for server-to-server communication, shared recording storage, and rolling upgrade procedures with rollback plans.
ViciDial, Asterisk, MariaDB, Clustering
Database Performance & Tuning
ViciDial's MyISAM tables lock during writes, freezing the entire agent interface during peak hours. Queries that should take milliseconds were taking 30+ seconds. Archiving was manual and inconsistent.
Complete MariaDB tuning: custom indexes that cut query times by 90%, automated archiving of historical data, MyISAM-to-InnoDB migration strategy for high-contention tables, and buffer pool sizing based on actual workload analysis.
MariaDB, Indexing, MyISAM, InnoDB, Query Optimization
Containerized Asterisk & WebRTC
Asterisk deployments are fragile — library version conflicts, manual dependency management, no reproducible builds. WebRTC setup for browser-based calling involves SSL, STUN/TURN, PJSIP WebSocket — each a separate minefield.
Production Docker Compose stack: multi-stage Asterisk build with PJSIP, ARI, WebRTC support. MariaDB, Redis, Nginx reverse proxy included. Plus a complete WebRTC ViciPhone setup with certificate management and TURN server configuration.
Docker, Asterisk, PJSIP, WebRTC, STUN/TURN
5 Production Servers
130+ VMs Managed
4 Data Centers
7 SIP Providers
Get in Touch
Let's build something.
Whether you need VoIP infrastructure consulting, a custom monitoring stack, AI voice agent integration, or help scaling your call center — let's talk.