How we built a conversational voice agent that turns natural dialogue into structured property searches using OpenAI's Realtime API, WebRTC, and real-time data extraction.
Kristian Elset Bø
Engineer
"I'm looking for a two-bedroom apartment in Brooklyn, somewhere around $3,000 a month. My partner and I are moving from San Francisco, and we really need a place with good light and outdoor space."
That's how humans describe what they want. Not by filling out 15 form fields, but through natural conversation. We built a voice agent that listens, understands, and transforms this dialogue into a structured property search—all in real-time.
This is the story of how we implemented it using OpenAI's Realtime API, WebRTC, and some clever extraction patterns.
Traditional property search interfaces make you think like a database. Select from dropdowns. Enter min/max values. Check boxes. It's efficient for computers but exhausting for humans.
Voice flips this. You talk about your life situation, and the system figures out the query. "We're expecting our first kid in June" implies you need more bedrooms. "I work from home a lot" suggests you value a home office or quiet neighborhood.
Our voice agent, Homi, has one job: have a friendly conversation and extract everything needed to create a property search collection.
The system runs two tracks in parallel:

Track 1: Voice Conversation
User speaks → OpenAI Realtime transcribes → AI responds with voice + text

Track 2: Data Extraction
Transcript updates → GPT-4o extracts structured data → UI shows live preview
Here's the flow:
┌─────────────┐      Ephemeral Token      ┌──────────────────────┐
│   Browser   │ ◄───────────────────────► │  /api/voice/session  │
│             │                           │   (Auth + OpenAI)    │
└─────┬───────┘                           └──────────────────────┘
      │
      │ WebRTC (Audio + Data Channel)
      │
      ▼
┌─────────────────────────────────┐
│      OpenAI Realtime API        │
│   - Whisper transcription       │
│   - GPT-4o conversation         │
│   - Voice synthesis             │
└─────────────────────────────────┘
                │
                │ Transcript Events
                ▼
┌─────────────────────────────────┐
│  Extraction Loop (Debounced)    │
│      /api/voice/extract         │
│   - Structured outputs          │
│   - Incremental merge           │
└─────────────────────────────────┘
OpenAI's Realtime API supports both WebSocket and WebRTC connections. We chose WebRTC for one reason: latency.
WebRTC establishes a peer-to-peer-style connection (though it still goes through OpenAI's servers) with built-in optimizations for real-time audio. The difference is noticeable—responses feel instantaneous rather than delayed.
The trade-off is complexity. WebRTC requires:
- Minting an ephemeral token on the server so the browser can connect directly
- An SDP offer/answer handshake with OpenAI's realtime endpoint
- Manually wiring up microphone tracks, audio playback, and a data channel for events
But for voice, that sub-100ms latency is worth it.
Before any voice connection happens, we need an ephemeral token. This keeps our OpenAI API key secure on the server while letting the browser connect directly.
// api/voice/session/route.ts
import { NextResponse } from "next/server";
// auth, env, and HOMI_PERSONA come from elsewhere in the app

export async function POST(request: Request) {
  // Verify user session first
  const session = await auth.api.getSession({
    headers: request.headers,
  });

  if (!session?.user.id) {
    return NextResponse.json(
      { error: "No session found", needsAuth: true },
      { status: 401 },
    );
  }

  // Get ephemeral token from OpenAI
  const response = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview-2024-12-17", // same model the client connects to below
      voice: "verse", // Dynamic, engaging male voice
      instructions: HOMI_PERSONA.systemPrompt,
      input_audio_transcription: {
        model: "gpt-4o-transcribe",
      },
      turn_detection: {
        type: "semantic_vad",
        eagerness: "medium",
        create_response: true,
      },
    }),
  });

  const data = await response.json();

  return NextResponse.json({
    sessionToken: data.client_secret.value,
    expiresAt: data.client_secret.expires_at,
  });
}
The key configuration here is turn_detection. We use semantic_vad (Voice Activity Detection) instead of the simpler server_vad. Semantic VAD understands conversation flow—it won't interrupt when you pause to think, but it knows when you've actually finished your thought.
Establishing the connection involves the classic WebRTC handshake, but with OpenAI as the "remote peer":
const startSession = async () => {
  // 1. Get ephemeral token
  const { sessionToken } = await fetch("/api/voice/session", {
    method: "POST",
  }).then((r) => r.json());

  // 2. Get microphone access
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      sampleRate: 24000,
    },
  });

  // 3. Create peer connection
  const pc = new RTCPeerConnection();

  // 4. Set up audio output
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  pc.ontrack = (e) => {
    audioEl.srcObject = e.streams[0];
  };

  // 5. Add microphone track
  pc.addTrack(stream.getAudioTracks()[0], stream);

  // 6. Create data channel for events
  const dc = pc.createDataChannel("oai-events");
  dc.onmessage = (e) => {
    const event = JSON.parse(e.data);
    handleRealtimeEvent(event);
  };

  // 7. SDP exchange
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const sdpResponse = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${sessionToken}`,
        "Content-Type": "application/sdp",
      },
      body: offer.sdp,
    },
  );

  const answerSdp = await sdpResponse.text();
  await pc.setRemoteDescription({ type: "answer", sdp: answerSdp });
};
Once connected, the data channel receives a stream of events: session.created, input_audio_buffer.speech_started, response.audio_transcript.delta, etc.
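For the assistant side of the transcript, here is a rough sketch of how those delta events can be folded into the message list, inside the same switch that handles the user transcription event shown later. The message fields mirror that handler; exact shapes are illustrative, not the verbatim implementation.

// Sketch: accumulate assistant transcript deltas into the message list.
case "response.audio_transcript.delta":
  setMessages((prev) => {
    const idx = prev.findIndex(
      (m) => m.id === event.item_id && m.role === "assistant",
    );
    if (idx >= 0) {
      // Append the new text chunk to the in-progress assistant message
      const updated = [...prev];
      updated[idx] = { ...updated[idx], content: updated[idx].content + event.delta };
      return updated;
    }
    // First chunk: start a new, not-yet-final assistant message
    return [
      ...prev,
      { id: event.item_id, role: "assistant", content: event.delta, isFinal: false },
    ];
  });
  break;

case "response.audio_transcript.done":
  // Mark the assistant message as complete
  setMessages((prev) =>
    prev.map((m) => (m.id === event.item_id ? { ...m, isFinal: true } : m)),
  );
  break;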
Here's where it gets interesting. While the conversation flows naturally, we're continuously extracting structured data in the background.
Every 3 seconds after a message, we send the transcript to our extraction endpoint:
const extractFromTranscript = async (msgs: TranscriptMessage[]) => {
  // Debounce: don't extract if we just did
  const now = Date.now();
  if (now - lastExtractionRef.current < 5000) return;

  // Need at least 2 messages for meaningful extraction
  if (msgs.length < 2) return;

  // Record this extraction so the guard above works next time
  lastExtractionRef.current = now;

  const response = await fetch("/api/voice/extract", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: msgs,
      previousData: extractedData, // For incremental merge
    }),
  });

  const result = await response.json();

  // Merge new data with existing
  setExtractedData((prev) => ({ ...prev, ...result.data }));
};
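The "every 3 seconds" part is just a timer wrapped around this function. A minimal sketch of how it might be scheduled whenever the transcript changes (extractionTimerRef is an illustrative ref; the 5-second guard inside extractFromTranscript still prevents back-to-back calls):

// Sketch: re-run extraction 3 seconds after the transcript last changed.
useEffect(() => {
  if (messages.length < 2) return;

  if (extractionTimerRef.current) clearTimeout(extractionTimerRef.current);
  extractionTimerRef.current = setTimeout(() => {
    void extractFromTranscript(messages);
  }, 3000);

  return () => {
    if (extractionTimerRef.current) clearTimeout(extractionTimerRef.current);
  };
}, [messages]);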
The extraction uses OpenAI's structured outputs with a carefully crafted Zod schema:
import { z } from "zod";
// MAJOR_CITY_IDS and AMENITIES are app-level constant tuples

export const extractionSchema = z.object({
  intention: z
    .enum(["buy", "rent", "rent_short"])
    .nullable()
    .describe(
      "User's goal: 'buy' (purchasing, investment), 'rent' (lease), " +
        "'rent_short' (vacation). Default to 'buy' if ambiguous.",
    ),
  collectionGroupType: z
    .enum(["solo", "couple", "family", "friends"])
    .nullable()
    .describe(
      "Who will live there. Infer from pronouns: " +
        "'we' with partner = couple, mentions kids = family",
    ),
  majorCityId: z
    .enum(MAJOR_CITY_IDS)
    .nullable()
    .describe(
      "Match neighborhoods to parent cities: " +
        "Hudson Yards → new-york-us, Grünerløkka → oslo-no",
    ),
  budget: z
    .number()
    .nullable()
    .describe("Budget as number. Monthly for rent, total for buy."),
  preferredBedroomCount: z
    .number()
    .int()
    .nullable()
    .describe("Number of bedrooms. 'studio' = 0"),
  niceToHaveFeatures: z
    .array(z.enum(AMENITIES))
    .nullable()
    .describe(
      "Extract ANY amenities mentioned positively, even in passing. " +
        "'I'd love a view' → view, 'parking would be nice' → parking",
    ),
  suggestedFollowUp: z
    .string()
    .nullable()
    .describe("What key info is still missing? Suggest ONE natural topic."),
});
The .describe() calls are crucial—they guide the model to make reasonable inferences rather than requiring explicit statements. From "We're moving together" the model correctly infers couple, even though the user never said that word.
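On the server, this schema is handed to OpenAI's structured outputs. The post doesn't show the full route, but a minimal sketch of /api/voice/extract could look like this, using the openai SDK's Zod helper (the schema import path and prompt wording are illustrative):

// api/voice/extract/route.ts (sketch)
import { NextResponse } from "next/server";
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { extractionSchema } from "@/lib/voice/extraction-schema"; // illustrative path

const openai = new OpenAI();

export async function POST(request: Request) {
  const { messages, previousData } = await request.json();

  const completion = await openai.beta.chat.completions.parse({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "Extract property-search details from this conversation. " +
          `Previously extracted data: ${JSON.stringify(previousData ?? {})}. ` +
          "Only add or refine fields; return null for anything not mentioned.",
      },
      {
        role: "user",
        content: messages
          .map((m: { role: string; content: string }) => `${m.role}: ${m.content}`)
          .join("\n"),
      },
    ],
    // Structured outputs: the response is validated against the Zod schema
    response_format: zodResponseFormat(extractionSchema, "extraction"),
  });

  return NextResponse.json({ data: completion.choices[0].message.parsed });
}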
Here's a subtle bug that took hours to figure out: user transcriptions complete after the AI has already started responding.
The timeline looks like this:
1. The user finishes speaking
2. Homi starts responding, and response.audio_transcript.delta events begin streaming in
3. Only then does conversation.item.input_audio_transcription.completed arrive with the user's transcript
If you just append messages in order, you get:
Assistant: "Great choice! Brooklyn has..."
User: "I'm looking for a place in Brooklyn" // Wrong order!
The fix: when inserting user messages, look for incomplete assistant messages and insert before them:
case "conversation.item.input_audio_transcription.completed":
if (event.transcript) {
setMessages((prev) => {
const userMessage = {
id: event.item_id,
role: "user",
content: event.transcript,
isFinal: true,
};
// Find first incomplete assistant message
const firstIncompleteIdx = prev.findIndex(
(m) => m.role === "assistant" && !m.isFinal
);
if (firstIncompleteIdx >= 0) {
// Insert BEFORE the incomplete response
return [
...prev.slice(0, firstIncompleteIdx),
userMessage,
...prev.slice(firstIncompleteIdx),
];
}
return [...prev, userMessage];
});
}
break;
The extraction doesn't just feed the UI—it also guides the conversation. When we extract data, we send a system message back to the AI:
if (dc.readyState === "open" && result.suggestedFollowUp) {
  const currentCompleteness = calculateCompleteness(newData);

  if (currentCompleteness < 100) {
    const contextUpdate = {
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "system",
        content: [
          {
            type: "input_text",
            text:
              `[Search is ${currentCompleteness}% complete. ` +
              `To finish: ${result.suggestedFollowUp}. ` +
              `Ask naturally, don't list what's missing.]`,
          },
        ],
      },
    };

    dc.send(JSON.stringify(contextUpdate));
  }
}
This keeps Homi on track without making the conversation feel like an interrogation.
We calculate a weighted completeness score to show progress:
export function calculateCompleteness(data: ExtractedCollectionData): number {
  const fields = [
    // Core fields - the essentials (85 points)
    { key: "intention", weight: 20 },
    { key: "majorCityId", weight: 20, alt: "city" },
    { key: "collectionGroupType", weight: 15 },
    { key: "budget", weight: 20 },
    { key: "preferredBedroomCount", weight: 10 },
    // Nice to haves (15 points)
    { key: "preferredHousingTypes", weight: 5 },
    { key: "niceToHaveFeatures", weight: 10 },
  ];

  let score = 0;
  for (const field of fields) {
    // Fall back to the alternate key (e.g. "city") when the primary one is empty
    const value = data[field.key] ?? (field.alt ? data[field.alt] : null);
    if (value != null && (!Array.isArray(value) || value.length > 0)) {
      score += field.weight;
    }
  }

  return Math.min(100, score);
}
When completeness hits 90%+, Homi automatically wraps up and suggests creating the collection.
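A sketch of that hand-off, reusing the conversation.item.create pattern from the guidance messages above (the exact threshold check and wording are assumptions):

// Sketch: once the search is essentially complete, nudge Homi to wrap up.
if (dc.readyState === "open" && calculateCompleteness(newData) >= 90) {
  dc.send(
    JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "system",
        content: [
          {
            type: "input_text",
            text:
              "[The search details are essentially complete. " +
              "Wrap up warmly and suggest creating the collection now.]",
          },
        ],
      },
    }),
  );
}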
Voice interfaces need visual feedback. You can't see sound, so users need to know:
- Is the system connected and ready?
- Did it hear me speak?
- Is Homi thinking, or already talking back?
We created a GlowingOrb component that reacts to both state and audio levels:
const STATE_CONFIGS: Record<VoiceAgentState, StateConfig> = {
  idle: {
    gradient: "radial-gradient(circle, rgba(59,130,246,0.4) 0%, ...)",
    glowColor: "rgba(59, 130, 246, 0.4)",
    animation: "animate-pulse-slow",
    label: "Ready to chat",
  },
  listening: {
    gradient: "radial-gradient(circle, rgba(6,182,212,0.6) 0%, ...)",
    glowColor: "rgba(6, 182, 212, 0.5)",
    animation: "", // Dynamic based on audio level
    label: "Listening...",
  },
  speaking: {
    gradient: "radial-gradient(circle, rgba(139,92,246,0.6) 0%, ...)",
    glowColor: "rgba(139, 92, 246, 0.5)",
    animation: "animate-pulse-speak",
    label: "Homi is speaking",
  },
  // ... more states
};
The orb scales dynamically based on microphone input levels, creating a visual "breathing" effect that mirrors the user's speech.
One subtle detail: we track a recentlySpeaking state with an 800ms delay after the transcript completes. Why? Audio playback lags behind text. Without this, the orb would switch to "listening" while Homi's voice was still playing.
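A sketch of that delay, assuming the flag lives in the same hook as the state machine (names like recentlySpeaking and markSpeaking are illustrative):

// Sketch: hold the "speaking" visual briefly after the transcript finishes,
// because audio playback trails the text stream.
const [recentlySpeaking, setRecentlySpeaking] = useState(false);
const speakingTimeoutRef = useRef<ReturnType<typeof setTimeout> | null>(null);

// Called on transcript events; the flag clears 800ms after the last one
const markSpeaking = () => {
  setRecentlySpeaking(true);
  if (speakingTimeoutRef.current) clearTimeout(speakingTimeoutRef.current);
  speakingTimeoutRef.current = setTimeout(() => setRecentlySpeaking(false), 800);
};

// The orb stays in "speaking" mode while either is true
const orbIsSpeaking = agentState === "speaking" || recentlySpeaking;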
Voice sessions have more states than a typical boolean isConnected:
type VoiceAgentState =
  | "idle"        // Not connected, ready to start
  | "connecting"  // Establishing connection
  | "connected"   // Active but waiting
  | "listening"   // User is speaking
  | "speaking"    // AI is speaking
  | "processing"  // Processing user input
  | "error";      // Connection error
The state machine handles transitions carefully:
input_audio_buffer.speech_started → listening
input_audio_buffer.speech_stopped → processing
response.audio_transcript.delta → speaking
response.done → connected
We also map connected to listening in the UI—when you're in an active call waiting, you're effectively listening.
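In code, this is a small mapping from event types to states; a sketch (stateForEvent and displayState are illustrative names):

// Sketch: map Realtime event types onto the VoiceAgentState machine above.
const stateForEvent = (type: string): VoiceAgentState | null => {
  switch (type) {
    case "input_audio_buffer.speech_started":
      return "listening";
    case "input_audio_buffer.speech_stopped":
      return "processing";
    case "response.audio_transcript.delta":
      return "speaking";
    case "response.done":
      return "connected";
    default:
      return null; // other events don't change the visible state
  }
};

// In the UI, an active-but-idle call reads as "listening"
const displayState = agentState === "connected" ? "listening" : agentState;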
The system prompt shapes the entire experience:
const HOMI_SYSTEM_PROMPT = `You are Homi, a friendly real estate assistant...
## Your Vibe
- Relaxed and conversational, like chatting with a friend
- Keep responses SHORT - this is a voice conversation, not a lecture
## Conversation Flow
Start by understanding their STORY - the "why" behind their search:
1. What's sparking this search? (new job, growing family?)
2. What kind of place are they picturing?
3. Who will be living there?
4. Where do they want to be?
5. What's their budget range?
## How to Talk
- One question at a time
- React naturally - show genuine interest in their story
- Don't repeat back everything they said
- Keep it to 1-2 sentences per response
- Sound like a real person, not a customer service bot`;
The key insight: voice AI needs to be more concise than text AI. Long responses feel like lectures. Short, natural responses feel like conversation.
The voice connection was straightforward (thanks OpenAI). The real complexity is in extracting structured data from unstructured conversation. Invest heavily in your extraction prompts and schemas.
Real-time systems generate a lot of events. Debounce extraction, debounce state transitions, debounce UI updates. Users don't notice 3-second delays, but they notice janky interfaces.
Voice is invisible. Without strong visual feedback, users don't know if the system is working. The glowing orb wasn't a nice-to-have—it was essential.
Synthetic test cases don't capture how humans actually talk. "I guess maybe around three thousand or so?" is very different from "Budget: $3000".
Not everyone wants voice. Some prefer typing. Some are in public. Some have accessibility needs. Voice should be an option, not a requirement.
This is just the beginning, and we have a long list of improvements planned.
The vision: property search that feels like talking to a knowledgeable friend, not filling out forms.
Want to try it yourself? Launch the voice agent and have a conversation with Homi. Tell us about your dream place—we'll turn it into a search.

Engineering at Homi, building the future of real estate technology.