# What 500 Cold Call Recordings Taught Us About Building AI Dialers
Here is the math that keeps call center operators up at night. You are running 50 agents on an outbound roofing campaign. Fully loaded cost per agent — wages, seats, dialer licenses, management overhead — lands somewhere between $15 and $25 an hour depending on whether you are running domestic or offshore. Call it $20 on average. That is $1,000 an hour in labor just to keep the floor running. Eight-hour shift, five days a week, and you are burning north of $160,000 a month before a single appointment hits the board.

Now look at what those 50 agents actually produce. VICIdial is pushing 200+ dials per hour per agent through predictive mode. Your contact rate — the percentage of dials that reach a human — sits between 3% and 7% on a typical residential list. That means out of 10,000 dials an hour, you are getting maybe 500 conversations.

Out of those 500 conversations, your top closer books appointments at three or four times the rate of your worst agent. But your top closer is one person. They call in sick. They burn out. They quit and take the job across the street for fifty cents more an hour.

This is the fundamental problem with human-powered outbound at scale: your best performance is not repeatable, and your average performance is expensive. We have watched this play out across hundreds of VICIdial deployments. The top 20% of agents generate 60-70% of the results. The bottom 30% barely cover their own cost. And the operational overhead of recruiting, training, and retaining enough warm bodies to maintain volume is a second full-time job that nobody signed up for.

So we decided to build something better. But we did not start where most people start. Most AI voice products are built on theoretical scripts — someone sits in a conference room, writes what they think a good cold call sounds like, feeds it into a language model, and hopes for the best. That approach produces agents that sound like they are reading a script because they are. Real cold calls are messy.
Prospects interrupt. They object in ways no script anticipates. The timing of a pause matters as much as the words.

We started with data. Specifically, we pulled 500+ real cold call recordings from a live roofing appointment-setting campaign running through VICIdial. Five human agents, all working the same lists with the same dialer settings. We transcribed every call. We tagged every disposition. We mapped which agents converted and which ones fumbled, then we went line by line through the conversations that actually booked appointments to figure out why they worked.

The insight was simple: do not invent a new playbook — clone the one that is already winning. Record everything. Transcribe everything. Analyze everything. Identify the exact phrases, objection handlers, pacing patterns, and conversation structures that your best human agents use on the calls that convert. Then build an AI agent that replicates those patterns at scale, across every dial, every hour, without calling in sick or asking for a raise.

That is what we built. And it started with those 500 recordings.

---

## Building a Transcription Engine That Actually Works on Phone Audio

Here is the thing nobody tells you about Whisper: it was trained on podcasts, audiobooks, and YouTube videos. Clean audio. Controlled environments. Maybe a little background music.

Phone audio is none of that. You are dealing with narrow-band codec compression (8kHz G.711, if you are lucky), background noise from job sites, crosstalk where both parties talk over each other, and the particular joy of a prospect on a cell phone standing next to a busy road. Feed that raw into Whisper and you get hallucinated bible verses, repeated phrases, and confident-sounding garbage.

We needed transcription that actually worked on VICIdial call recordings. No cloud APIs, no per-minute costs that scale into absurdity at volume. So we built the whole pipeline locally with faster-whisper.
### Choosing the Right Model

We tested every model worth testing. `base` scored maybe a 5 out of 10 on accuracy — fine as a supplement but completely unusable standalone. `medium` was better but still stumbled on domain vocabulary. `distil-large-v3` offered a decent speed-accuracy tradeoff. But for production transcription where the output actually matters, `large-v3` was the only model that consistently got it right.

### The Configuration That Made It Work

Getting the model right was only half the battle. The real gains came from tuning the transcription parameters:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    processed_audio_path,
    beam_size=10,
    initial_prompt=(
        "This is a phone call about roofing services. "
        "Terms: siding, fascia, eaves, gutters, downspouts, soffit, "
        "flashing, ridge cap, drip edge, ice dam, shingle."
    ),
    word_timestamps=True,
    vad_filter=True,  # Silero VAD
    vad_parameters=dict(
        min_silence_duration_ms=500,
    ),
    temperature=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    condition_on_previous_text=False,
)
```

Every parameter here exists for a reason. `beam_size=10` gives the decoder more room to explore candidate transcriptions instead of greedily committing to the first path. The `initial_prompt` is seeded with domain-specific vocabulary pulled directly from the client's calling script — without it, Whisper hears "soffit" and writes "saw fit" every single time. `vad_filter` engages Silero Voice Activity Detection to skip silence, which prevents the hallucination problem where Whisper invents text to fill quiet gaps. The `temperature` fallback starts at 0 for deterministic output but steps up automatically when confidence is low. And `condition_on_previous_text=False` stops one bad segment from poisoning everything that follows.

### Preprocessing: Fix the Audio Before Whisper Ever Sees It

Before any of that runs, every recording passes through an ffmpeg preprocessing pipeline.
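As a rough sketch, the shape of that ffmpeg invocation looks like the following. The specific filters and cutoff values here are illustrative assumptions, not the exact production settings:

```python
import subprocess

def build_preprocess_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that cleans up narrow-band phone audio.

    Illustrative chain: high-pass filter to cut low-frequency phone hum,
    FFT-based denoise (afftdn), loudness normalization (loudnorm), and a
    resample from 8 kHz up to the 16 kHz mono input Whisper expects.
    """
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", "highpass=f=200,afftdn,loudnorm",
        "-ar", "16000",  # upsample to 16 kHz
        "-ac", "1",      # collapse to mono
        dst,
    ]

cmd = build_preprocess_cmd("raw_call.wav", "clean_call.wav")
# subprocess.run(cmd, check=True)  # requires ffmpeg on PATH
```

Building the argv as a list keeps the filter chain easy to tweak per campaign without shell-quoting headaches.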
We normalize volume levels, apply noise reduction, run a high-pass filter to cut low-frequency phone hum, and — critically — upsample the audio. Most VICIdial recordings are narrow-band 8kHz. Whisper expects 16kHz. Upsampling with proper interpolation gives the model something closer to what it was trained on. This single step measurably improved word error rates.

Recordings process in parallel across CPU cores. A 5-minute call transcribes in roughly 30-40 seconds on modern hardware, and batch jobs distribute cleanly across available threads.

### Post-Processing: Where the Transcript Becomes Useful

Raw transcripts, even good ones, still need work. Every transcript gets post-processed by Claude, which fixes domain-specific mishearings that Whisper consistently gets wrong ("eve" back to "eave," "gutter" pluralization, brand names). It also corrects proper nouns — addresses, street names, homeowner names — by cross-referencing the matched lead data from the dialer. If the lead record says the prospect lives at 4217 Birchwood Lane, and Whisper wrote "4217 Birch Wood Lane," the post-processor fixes it.

Speaker labeling happens through a combination of script pattern matching and word-level timestamps. The agent's lines follow predictable script patterns — greetings, qualifying questions, rebuttals. By aligning those patterns against the word timestamps, we reliably tag each segment as Agent or Prospect without needing a separate diarization model.

The key insight is simple but easy to miss: phone audio transcription is a pipeline problem, not a model problem. No single model, no matter how large, will give you clean transcripts from dirty audio. You preprocess aggressively, configure carefully, and post-process intelligently. Then it actually works.

---

## Context Injection: Matching Every Recording to Its Lead

A raw transcript is just words on a page. Without context, you cannot tell whether the agent butchered the prospect's address or Whisper did.
You cannot tell if this is a first-contact cold call or the eighth redial on a lead that has been dodging callbacks for three weeks. You cannot tell if the agent went off-script or if the script itself is the problem.

We started by reading everything the agents were supposed to be working from. The cold calling script — the exact words they were told to say on every dial. The objection handling playbook — prescribed responses to "I'm not interested," "I already have coverage," "How did you get my number," and every other common pushback. The leads CSV with phone numbers, names, addresses, and cities. The campaign context briefing with company details and service offerings. All of it went into the pipeline before a single recording was processed.

Every recording filename in VICIdial contains the phone number. That phone number is a direct key into the `vicidial_list` table. So step one was automated matching: for each recording, pull the full lead record. Name, street address, city, state, insurance status, every field VICIdial stores. Then pull the call metadata — talk time, wait time, hold time, who hung up (agent or prospect), disposition, timestamps. Then pull the agent ID and cross-reference it against the VICIdial user table to get the actual agent name. Then pull campaign and list metadata for additional context about what this agent was supposed to be selling and to whom.

This context fed directly into transcription. Whisper accepts an `initial_prompt` parameter that biases its vocabulary toward expected words and phrases. We loaded script terminology, objection playbook language, company names, and product terms into that prompt. The model stopped hallucinating "solar panel" when the agent said "supplemental plan."

The matched lead data fed into transcript cleanup. When Whisper output "123 Oak," and the lead record showed "123 Oakwood Drive, Tallahassee," the correction was automatic.
Prospect names, street addresses, city names — all verified against the actual lead data and fixed in the final transcript.

But the most valuable context was call history. Some leads in this dataset were called eight or more times before they booked an appointment. A first-contact cold call and a fifth redial to the same prospect are completely different conversations with completely different dynamics. The agent's tone is different. The prospect's patience is different. What counts as "good" objection handling is different. We identified every call as either first-contact or follow-up and tagged it accordingly, because analyzing a redial with first-contact expectations produces garbage insights.

The key insight is simple: a transcript without context is just words. When you know the lead's name, address, insurance status, and complete call history, you can fix transcription errors AND understand what is actually happening in the conversation. VICIdial stores all of this data. Most shops never connect it back to their recordings. That connection is where the analysis starts.

---

## Win/Loss Analysis: What Makes a Call Convert

Every call in the dataset got a tag. WIN meant an appointment was booked (disposition code APPTBK). LOSS got a sub-reason: not interested, already has a contractor, bad timing, wrong number, voicemail, gatekeeper block, do-not-call request, hostile, or disconnected. No ambiguity. No "maybe callback." Every call lands in a bucket so we can count what actually happens on the phones.

Then we went deeper. We extracted the exact moment a prospect commits. Not the general vibe of the call — the precise words and the timestamp. In winning calls, commitment sounds like this: "Yeah, that'd be fine" at 1:47. "Sure, go ahead and put me down" at 2:12. "What time works?" at 1:33. These aren't enthusiastic yeses. They're quiet surrenders. The prospect stops resisting and agrees to a slot. We captured every one of them.
In losing calls, we extracted the exact moment of disengagement — the sentence where the prospect checks out and the call is effectively over, even if it keeps going for another thirty seconds.

Then we mapped what the agent said immediately before each turning point. The sentence right before the prospect books. The sentence right before the prospect shuts down. This is where the data gets sharp.

Here is a real pattern from the dataset. Prospect says: "I'm not really interested." Agent A responds: "I totally understand — most folks say the same thing until they see what the estimate looks like. It's free either way. Would morning or afternoon work better for you?" Prospect books. Agent B, same objection, responds: "Oh, okay. Well, if you change your mind..." Call over.

The difference between a booked appointment and a burned lead is often a single sentence. The agent who recovers from "I'm not interested" with the right words books the appointment.
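Given labeled, timestamped segments, the "what came right before the turning point" lookup is a simple scan. The `(speaker, start_seconds, text)` tuple shape here is a simplified stand-in for the labeled transcript format, not the exact internal schema:

```python
def agent_line_before(segments, turning_point_s):
    """Return the last Agent utterance that started before the turning point.

    `segments` is a list of (speaker, start_seconds, text) tuples in call
    order — a simplified stand-in for the labeled transcript format.
    """
    last = None
    for speaker, start, text in segments:
        if start >= turning_point_s:
            break
        if speaker == "Agent":
            last = text
    return last

call = [
    ("Agent", 95.0, "It's free either way. Would morning or afternoon work better?"),
    ("Prospect", 107.0, "Yeah, that'd be fine."),  # commitment at 1:47
]
# agent_line_before(call, 107.0) returns the rebuttal that preceded the booking
```

Run across the whole dataset, this is what surfaces the exact sentences that separate the Agent A pattern from the Agent B pattern.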
---

## About

Built by ViciStack — enterprise VoIP and call center infrastructure.

- VICIdial Hosting & Optimization
- Call Center Performance Guides
- Full Article: What 500 Cold Call Recordings Taught Us About Building AI Dialers

## License

MIT
