
Building a Voice-First Property Search with OpenAI Realtime

How we built a conversational voice agent that turns natural dialogue into structured property searches using OpenAI's Realtime API, WebRTC, and real-time data extraction.

Kristian Elset Bø

Engineer

12 min read
#engineering #voice-ai #openai #webrtc #real-time

"I'm looking for a two-bedroom apartment in Brooklyn, somewhere around $3,000 a month. My partner and I are moving from San Francisco, and we really need a place with good light and outdoor space."

That's how humans describe what they want. Not by filling out 15 form fields, but through natural conversation. We built a voice agent that listens, understands, and transforms this dialogue into a structured property search—all in real-time.

This is the story of how we implemented it using OpenAI's Realtime API, WebRTC, and some clever extraction patterns.

The Vision: Conversation as Interface

Traditional property search interfaces make you think like a database. Select from dropdowns. Enter min/max values. Check boxes. It's efficient for computers but exhausting for humans.

Voice flips this. You talk about your life situation, and the system figures out the query. "We're expecting our first kid in June" implies you need more bedrooms. "I work from home a lot" suggests you value a home office or quiet neighborhood.

Our voice agent, Homi, has one job: have a friendly conversation and extract everything needed to create a property search collection.

Architecture Overview

The system has two parallel tracks running simultaneously:

Track 1 (Voice Conversation): User speaks → OpenAI Realtime transcribes → AI responds with voice + text

Track 2 (Data Extraction): Transcript updates → GPT-4o extracts structured data → UI shows live preview

Here's the flow:

┌─────────────┐     Ephemeral Token      ┌──────────────────────┐
│   Browser   │ ◄──────────────────────► │  /api/voice/session  │
│             │                          │  (Auth + OpenAI)     │
└─────┬───────┘                          └──────────────────────┘
      │
      │ WebRTC (Audio + Data Channel)
      │
      ▼
┌─────────────────────────────────┐
│   OpenAI Realtime API           │
│   - Whisper transcription       │
│   - GPT-4o conversation         │
│   - Voice synthesis             │
└─────────────────────────────────┘
      │
      │ Transcript Events
      ▼
┌─────────────────────────────────┐
│   Extraction Loop (Debounced)   │
│   /api/voice/extract            │
│   - Structured outputs          │
│   - Incremental merge           │
└─────────────────────────────────┘

Why WebRTC Over WebSocket?

OpenAI's Realtime API supports both WebSocket and WebRTC connections. We chose WebRTC for one reason: latency.

WebRTC establishes a peer-to-peer-style connection (though it still goes through OpenAI's servers) with built-in optimizations for real-time audio. The difference is noticeable—responses feel instantaneous rather than delayed.

The trade-off is complexity. WebRTC requires:

  • SDP offer/answer negotiation
  • ICE candidate handling (handled by OpenAI in this case)
  • Separate data channel for events

But for voice, that sub-100ms latency is worth it.

Session Management: Security First

Before any voice connection happens, we need an ephemeral token. This keeps our OpenAI API key secure on the server while letting the browser connect directly.

// api/voice/session/route.ts
export async function POST(request: Request) {
  // Verify user session first
  const session = await auth.api.getSession({
    headers: request.headers,
  });

  if (!session?.user.id) {
    return NextResponse.json(
      { error: "No session found", needsAuth: true },
      { status: 401 },
    );
  }

  // Get ephemeral token from OpenAI
  const response = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      voice: "verse", // Dynamic, engaging male voice
      instructions: HOMI_PERSONA.systemPrompt,
      input_audio_transcription: {
        model: "gpt-4o-transcribe",
      },
      turn_detection: {
        type: "semantic_vad",
        eagerness: "medium",
        create_response: true,
      },
    }),
  });

  const data = await response.json();

  return NextResponse.json({
    sessionToken: data.client_secret.value,
    expiresAt: data.client_secret.expires_at,
  });
}

The key configuration here is turn_detection. We use semantic_vad instead of the simpler server_vad. Semantic voice activity detection understands conversation flow—it won't interrupt when you pause to think, but it knows when you've actually finished your thought.
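
For comparison, the two modes differ only in the turn_detection block of the session config. A rough sketch, with illustrative server_vad numbers rather than tuned values:

// server_vad: ends the turn purely on silence. Values are illustrative.
const serverVad = {
  type: "server_vad",
  threshold: 0.5, // energy level that counts as speech
  prefix_padding_ms: 300, // audio kept before detected speech
  silence_duration_ms: 500, // silence required to end the turn
};

// semantic_vad: also weighs whether the utterance sounds finished.
const semanticVad = {
  type: "semantic_vad",
  eagerness: "medium", // how quickly to assume the user is done
  create_response: true,
};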

The WebRTC Dance

Establishing the connection involves the classic WebRTC handshake, but with OpenAI as the "remote peer":

const startSession = async () => {
  // 1. Get ephemeral token
  const { sessionToken } = await fetch("/api/voice/session", {
    method: "POST",
  }).then((r) => r.json());

  // 2. Get microphone access
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      sampleRate: 24000,
    },
  });

  // 3. Create peer connection
  const pc = new RTCPeerConnection();

  // 4. Set up audio output
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  pc.ontrack = (e) => {
    audioEl.srcObject = e.streams[0];
  };

  // 5. Add microphone track
  pc.addTrack(stream.getAudioTracks()[0], stream);

  // 6. Create data channel for events
  const dc = pc.createDataChannel("oai-events");
  dc.onmessage = (e) => {
    const event = JSON.parse(e.data);
    handleRealtimeEvent(event);
  };

  // 7. SDP exchange
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const sdpResponse = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${sessionToken}`,
        "Content-Type": "application/sdp",
      },
      body: offer.sdp,
    },
  );

  const answerSdp = await sdpResponse.text();
  await pc.setRemoteDescription({ type: "answer", sdp: answerSdp });
};

Once connected, the data channel receives a stream of events: session.created, input_audio_buffer.speech_started, response.audio_transcript.delta, etc.
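
handleRealtimeEvent is essentially a switch over event.type. Here's a simplified sketch of the assistant-side transcript handling (setMessages and setAgentState are the hook's React setters, trimmed down for the post):

// Simplified sketch of the data channel dispatcher.
const handleRealtimeEvent = (event: any) => {
  switch (event.type) {
    case "response.audio_transcript.delta":
      // Stream partial assistant text into the matching message
      setAgentState("speaking");
      setMessages((prev) => {
        const existing = prev.find((m) => m.id === event.item_id);
        if (!existing) {
          return [
            ...prev,
            { id: event.item_id, role: "assistant", content: event.delta, isFinal: false },
          ];
        }
        return prev.map((m) =>
          m.id === event.item_id ? { ...m, content: m.content + event.delta } : m,
        );
      });
      break;

    case "response.done":
      // Finalize assistant messages and drop back to "connected"
      setMessages((prev) => prev.map((m) => ({ ...m, isFinal: true })));
      setAgentState("connected");
      break;

    // User transcription is handled in "Transcript Ordering" below
  }
};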

Real-Time Data Extraction

Here's where it gets interesting. While the conversation flows naturally, we're continuously extracting structured data in the background.

Three seconds after each new message, we send the transcript to our extraction endpoint (with a hard floor of five seconds between extractions):

const extractFromTranscript = async (msgs: TranscriptMessage[]) => {
  // Debounce: don't extract if we just did
  const now = Date.now();
  if (now - lastExtractionRef.current < 5000) return;

  // Need at least 2 messages for meaningful extraction
  if (msgs.length < 2) return;

  lastExtractionRef.current = now;

  const response = await fetch("/api/voice/extract", {
    method: "POST",
    body: JSON.stringify({
      messages: msgs,
      previousData: extractedData, // For incremental merge
    }),
  });

  const result = await response.json();

  // Merge new data with existing
  setExtractedData((prev) => ({ ...prev, ...result.data }));
};
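
The 3-second trigger itself is just a timer that re-arms whenever the transcript changes. Something like this, using React's useEffect and the same messages state as above:

// Sketch: re-arm a 3s timer whenever the transcript changes; extraction
// only fires once the conversation has been quiet for that long.
useEffect(() => {
  if (messages.length < 2) return;

  const timer = setTimeout(() => {
    void extractFromTranscript(messages);
  }, 3000);

  return () => clearTimeout(timer);
}, [messages]);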

The extraction uses OpenAI's structured outputs with a carefully crafted Zod schema:

export const extractionSchema = z.object({
  intention: z
    .enum(["buy", "rent", "rent_short"])
    .nullable()
    .describe(
      "User's goal: 'buy' (purchasing, investment), 'rent' (lease), " +
        "'rent_short' (vacation). Default to 'buy' if ambiguous.",
    ),

  collectionGroupType: z
    .enum(["solo", "couple", "family", "friends"])
    .nullable()
    .describe(
      "Who will live there. Infer from pronouns: " +
        "'we' with partner = couple, mentions kids = family",
    ),

  majorCityId: z
    .enum(MAJOR_CITY_IDS)
    .nullable()
    .describe(
      "Match neighborhoods to parent cities: " +
        "Hudson Yards → new-york-us, Grünerløkka → oslo-no",
    ),

  budget: z
    .number()
    .nullable()
    .describe("Budget as number. Monthly for rent, total for buy."),

  preferredBedroomCount: z
    .number()
    .int()
    .nullable()
    .describe("Number of bedrooms. 'studio' = 0"),

  niceToHaveFeatures: z
    .array(z.enum(AMENITIES))
    .nullable()
    .describe(
      "Extract ANY amenities mentioned positively, even in passing. " +
        "'I'd love a view' → view, 'parking would be nice' → parking",
    ),

  suggestedFollowUp: z
    .string()
    .nullable()
    .describe("What key info is still missing? Suggest ONE natural topic."),
});

The .describe() calls are crucial—they guide the model to make reasonable inferences rather than requiring explicit statements. "We're moving together" correctly infers couple even though the user never said that word.
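
On the server, the extraction route boils down to a single structured-output call with that schema. Here's a condensed sketch using the OpenAI Node SDK's Zod helper (the production prompt is longer, and the schema import path is illustrative):

// api/voice/extract/route.ts (condensed sketch)
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { NextResponse } from "next/server";
import { extractionSchema } from "@/lib/voice/extraction"; // illustrative path

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function POST(request: Request) {
  const { messages, previousData } = await request.json();

  const completion = await openai.beta.chat.completions.parse({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "Extract property search criteria from this conversation. " +
          `Previously extracted (keep unless contradicted): ${JSON.stringify(previousData)}`,
      },
      {
        role: "user",
        content: messages
          .map((m: { role: string; content: string }) => `${m.role}: ${m.content}`)
          .join("\n"),
      },
    ],
    response_format: zodResponseFormat(extractionSchema, "extraction"),
  });

  return NextResponse.json({ data: completion.choices[0].message.parsed });
}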

The Tricky Bits

Transcript Ordering

Here's a subtle bug that took hours to figure out: user transcriptions complete after the AI has already started responding.

The timeline looks like this:

  1. User finishes speaking
  2. AI starts generating response (streaming audio + transcript)
  3. User's transcription completes
  4. AI's response continues

If you just append messages in order, you get:

Assistant: "Great choice! Brooklyn has..."
User: "I'm looking for a place in Brooklyn"  // Wrong order!

The fix: when inserting user messages, look for incomplete assistant messages and insert before them:

case "conversation.item.input_audio_transcription.completed":
  if (event.transcript) {
    setMessages((prev) => {
      const userMessage = {
        id: event.item_id,
        role: "user",
        content: event.transcript,
        isFinal: true,
      };

      // Find first incomplete assistant message
      const firstIncompleteIdx = prev.findIndex(
        (m) => m.role === "assistant" && !m.isFinal
      );

      if (firstIncompleteIdx >= 0) {
        // Insert BEFORE the incomplete response
        return [
          ...prev.slice(0, firstIncompleteIdx),
          userMessage,
          ...prev.slice(firstIncompleteIdx),
        ];
      }

      return [...prev, userMessage];
    });
  }
  break;

Context Injection

The extraction doesn't just feed the UI—it also guides the conversation. When we extract data, we send a system message back to the AI:

if (dc.readyState === "open" && result.suggestedFollowUp) {
  const currentCompleteness = calculateCompleteness(newData);

  if (currentCompleteness < 100) {
    const contextUpdate = {
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "system",
        content: [
          {
            type: "input_text",
            text:
              `[Search is ${currentCompleteness}% complete. ` +
              `To finish: ${result.suggestedFollowUp}. ` +
              `Ask naturally, don't list what's missing.]`,
          },
        ],
      },
    };
    dc.send(JSON.stringify(contextUpdate));
  }
}

This keeps Homi on track without making the conversation feel like an interrogation.

Completeness Scoring

We calculate a weighted completeness score to show progress:

export function calculateCompleteness(data: ExtractedCollectionData): number {
  const fields = [
    // Core fields - the essentials (85 points)
    { key: "intention", weight: 20 },
    { key: "majorCityId", weight: 20, alt: "city" },
    { key: "collectionGroupType", weight: 15 },
    { key: "budget", weight: 20 },
    { key: "preferredBedroomCount", weight: 10 },
    // Nice to haves (15 points)
    { key: "preferredHousingTypes", weight: 5 },
    { key: "niceToHaveFeatures", weight: 10 },
  ];

  let score = 0;
  for (const field of fields) {
    // Fall back to the alternate key (e.g. a free-text city) when present
    const value = data[field.key] ?? (field.alt ? data[field.alt] : null);
    if (value != null && (!Array.isArray(value) || value.length > 0)) {
      score += field.weight;
    }
  }

  return Math.min(100, score);
}

When completeness hits 90%+, Homi automatically wraps up and suggests creating the collection.
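
The wrap-up nudge reuses the conversation.item.create pattern from the context injection above, gated on the score. Roughly, with an illustrative threshold and wording:

// Illustrative: once the search is essentially complete, nudge Homi to wrap up.
if (calculateCompleteness(newData) >= 90 && dc.readyState === "open") {
  dc.send(
    JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "system",
        content: [
          {
            type: "input_text",
            text:
              "[The search is essentially complete. Summarize it in one " +
              "sentence and offer to create the collection.]",
          },
        ],
      },
    }),
  );
}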

Visual Feedback: The Glowing Orb

Voice interfaces need visual feedback. You can't see sound, so users need to know:

  • Is the system listening?
  • Is the AI thinking?
  • Is the AI speaking?

We created a GlowingOrb component that reacts to both state and audio levels:

const STATE_CONFIGS: Record<VoiceAgentState, StateConfig> = {
  idle: {
    gradient: "radial-gradient(circle, rgba(59,130,246,0.4) 0%, ...)",
    glowColor: "rgba(59, 130, 246, 0.4)",
    animation: "animate-pulse-slow",
    label: "Ready to chat",
  },
  listening: {
    gradient: "radial-gradient(circle, rgba(6,182,212,0.6) 0%, ...)",
    glowColor: "rgba(6, 182, 212, 0.5)",
    animation: "", // Dynamic based on audio level
    label: "Listening...",
  },
  speaking: {
    gradient: "radial-gradient(circle, rgba(139,92,246,0.6) 0%, ...)",
    glowColor: "rgba(139, 92, 246, 0.5)",
    animation: "animate-pulse-speak",
    label: "Homi is speaking",
  },
  // ... more states
};

The orb scales dynamically based on microphone input levels, creating a visual "breathing" effect that mirrors the user's speech.
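
The level metering itself is standard Web Audio: an AnalyserNode on the microphone stream, sampled every animation frame. A trimmed sketch (setAudioLevel is a hypothetical setter that drives the orb's scale):

// Measure microphone level and feed it to the orb.
const meterMicrophone = (stream: MediaStream, setAudioLevel: (v: number) => void) => {
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  const analyser = audioContext.createAnalyser();
  analyser.fftSize = 256;
  source.connect(analyser);

  const data = new Uint8Array(analyser.frequencyBinCount);

  const tick = () => {
    analyser.getByteFrequencyData(data);
    // Average magnitude, normalized to 0..1
    const level = data.reduce((sum, v) => sum + v, 0) / data.length / 255;
    setAudioLevel(level);
    requestAnimationFrame(tick);
  };
  tick();
};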

One subtle detail: we track a recentlySpeaking state with an 800ms delay after the transcript completes. Why? Audio playback lags behind text. Without this, the orb would switch to "listening" while Homi's voice was still playing.
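
In practice that's a short reset-on-completion timeout (names simplified for the post):

// Sketch: hold the speaking visuals for ~800ms after the transcript ends,
// since audio playback trails the text stream.
const handleTranscriptDone = () => {
  setRecentlySpeaking(true);
  if (speakingTimeoutRef.current) clearTimeout(speakingTimeoutRef.current);
  speakingTimeoutRef.current = setTimeout(() => {
    setRecentlySpeaking(false);
  }, 800);
};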

State Management: Seven States of Voice

Voice sessions have more states than a typical boolean isConnected:

type VoiceAgentState =
  | "idle" // Not connected, ready to start
  | "connecting" // Establishing connection
  | "connected" // Active but waiting
  | "listening" // User is speaking
  | "speaking" // AI is speaking
  | "processing" // Processing user input
  | "error"; // Connection error

The state machine handles transitions carefully:

  • input_audio_buffer.speech_started → listening
  • input_audio_buffer.speech_stopped → processing
  • response.audio_transcript.delta → speaking
  • response.done → connected

We also map connected to listening in the UI—when a call is active and simply waiting for input, you're effectively listening.
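
In the component that mapping is a one-line alias over the raw state, roughly:

// An active but idle call reads as "listening" to the user.
const uiState: VoiceAgentState =
  agentState === "connected" ? "listening" : agentState;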

The Homi Personality

The system prompt shapes the entire experience:

const HOMI_SYSTEM_PROMPT = `You are Homi, a friendly real estate assistant...

## Your Vibe
- Relaxed and conversational, like chatting with a friend
- Keep responses SHORT - this is a voice conversation, not a lecture

## Conversation Flow
Start by understanding their STORY - the "why" behind their search:
1. What's sparking this search? (new job, growing family?)
2. What kind of place are they picturing?
3. Who will be living there?
4. Where do they want to be?
5. What's their budget range?

## How to Talk
- One question at a time
- React naturally - show genuine interest in their story
- Don't repeat back everything they said
- Keep it to 1-2 sentences per response
- Sound like a real person, not a customer service bot`;

The key insight: voice AI needs to be more concise than text AI. Long responses feel like lectures. Short, natural responses feel like conversation.

Lessons Learned

1. Extraction is the Hard Part

The voice connection was straightforward (thanks OpenAI). The real complexity is in extracting structured data from unstructured conversation. Invest heavily in your extraction prompts and schemas.

2. Debounce Everything

Real-time systems generate a lot of events. Debounce extraction, debounce state transitions, debounce UI updates. Users don't notice 3-second delays, but they notice janky interfaces.

3. Visual Feedback is Non-Negotiable

Voice is invisible. Without strong visual feedback, users don't know if the system is working. The glowing orb wasn't a nice-to-have—it was essential.

4. Test with Real Conversations

Synthetic test cases don't capture how humans actually talk. "I guess maybe around three thousand or so?" is very different from "Budget: $3000".

5. Graceful Degradation

Not everyone wants voice. Some prefer typing. Some are in public. Some have accessibility needs. Voice should be an option, not a requirement.

What's Next

This is just the beginning. Future improvements:

  • Multi-language support (Whisper supports 50+ languages)
  • Voice-activated property browsing ("Show me the next one")
  • Collaborative voice sessions (search together with your partner)
  • Proactive suggestions based on market data

The vision: property search that feels like talking to a knowledgeable friend, not filling out forms.


Want to try it yourself? Launch the voice agent and have a conversation with Homi. Tell us about your dream place—we'll turn it into a search.

About the Author

Kristian Elset Bø

Engineering at Homi, building the future of real estate technology.
