The Technology

Conversational AI
meets the street.

citypal combines state-of-the-art language models, real-time spatial intelligence, and audio streaming technology to create a tour guide that actually understands you.

The Problem

Why this is hard.

Real-Time Performance

Conversational AI needs to respond in under 2 seconds. At walking speed, context changes constantly. Traditional chatbots can't keep up.

📍

Spatial Awareness

"What's that building?" only makes sense if the AI knows exactly where you are, what's visible, and what you're likely pointing at.

🎯

Contextual Memory

The AI must remember your interests, route, previous questions, and time constraints—all while maintaining natural conversation flow.

Architecture

How citypal works.

Five interconnected layers working in real time to deliver contextual, conversational walking tours.

[Architecture diagram]
  • 📱 CLIENT: User Voice Input
  • 🧠 INTELLIGENCE: Context, Personalization, Orchestration
  • 🗺️ KNOWLEDGE: Graph; Stories, History, POIs; Semantic Search
  • 📍 SPATIAL: Location; GPS, Compass; What You're Looking At
  • 🎧 AUDIO SYNTHESIS: Natural Voice, Streaming, Real-time
Flow: 👤 Voice → Query → Context → Stories → Location → Audio

📱

Client Layer

Audio-first React Native app optimized for background operation and low battery drain.

🧠

Intelligence Layer

Proprietary context engine that manages conversation, personalization, and fact-checking.

🗺️

Knowledge Layer

Vector database with city knowledge graphs for semantic search in milliseconds.

📍

Spatial Layer

GPS and sensor fusion to understand location, direction, and points of interest.

🎧

Audio Layer

Streaming voice synthesis with natural prosody and ultra-low latency.

Our Secret Sauce

The Intelligence Layer.

Foundation models like GPT-4 are powerful, but they don't understand where you are, what you're seeing, or what you care about. Our intelligence layer bridges that gap.

🎯

Context Management

We maintain a rich, dynamic context window that includes:

  • User state: Location, heading, speed, time of day
  • Visible POIs: What's in view, based on GPS + compass
  • Conversation history: Recent topics, unanswered questions
  • Learned preferences: Architecture? Food? History?
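
To make the "Visible POIs" item concrete, here is a minimal Python sketch of how a field-of-view filter could work from GPS position and compass heading. It is an illustration under stated assumptions, not citypal's actual implementation: the names `UserState`, `Poi`, and `visible_pois`, the 90-degree field of view, and the 300 m range are all hypothetical, and a production app would run equivalent logic on the device.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0


@dataclass
class UserState:          # hypothetical shape of the "user state" above
    lat: float            # degrees
    lon: float            # degrees
    heading_deg: float    # compass heading, 0 = north, clockwise


@dataclass
class Poi:                # hypothetical point-of-interest record
    name: str
    lat: float
    lon: float


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(y, x)) % 360


def visible_pois(user, pois, fov_deg=90.0, max_range_m=300.0):
    """Return POIs inside the assumed field of view, nearest first."""
    hits = []
    for poi in pois:
        d = distance_m(user.lat, user.lon, poi.lat, poi.lon)
        if d > max_range_m:
            continue
        b = bearing_deg(user.lat, user.lon, poi.lat, poi.lon)
        # Smallest signed angle between heading and bearing, in [-180, 180).
        offset = (b - user.heading_deg + 180) % 360 - 180
        if abs(offset) <= fov_deg / 2:
            hits.append((d, poi))
    return [poi for _, poi in sorted(hits, key=lambda h: h[0])]
```

This sketch ignores occlusion (a building hidden behind another still counts as "in view"), so a real system would likely need building footprints or similar data on top of it.
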
🔍

Semantic Retrieval

When you ask a question, we don't just search keywords—we understand intent:

  • Vector search: Find semantically similar stories/facts
  • Spatial filtering: Prioritize nearby, visible locations
  • Personalization: Surface content matching your interests
  • RAG architecture: Inject relevant facts into LLM context
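
The four steps above can be sketched as one small ranking function plus a prompt builder. This is a simplified illustration, not the production retrieval code: the `Story` record, the scoring weights, and the prompt wording are assumptions, embeddings are taken as given (no embedding model is shown), and a real system would query a vector database rather than sort a Python list.

```python
import math
from dataclasses import dataclass


@dataclass
class Story:              # hypothetical knowledge-base record
    title: str
    text: str
    embedding: list       # vector from some embedding model (not shown)
    distance_m: float     # distance from the user, computed elsewhere
    topics: frozenset     # e.g. frozenset({"architecture", "history"})


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def score(story, query_vec, interests, w_sem=0.6, w_near=0.3, w_pref=0.1):
    """Blend semantic, spatial, and personal relevance (weights are assumed)."""
    semantic = cosine(query_vec, story.embedding)
    nearness = 1.0 / (1.0 + story.distance_m / 100.0)  # 1.0 at 0 m, 0.5 at 100 m
    preference = 1.0 if story.topics & interests else 0.0
    return w_sem * semantic + w_near * nearness + w_pref * preference


def retrieve(stories, query_vec, interests, k=2):
    """Return the k highest-scoring stories."""
    ranked = sorted(stories, key=lambda s: score(s, query_vec, interests), reverse=True)
    return ranked[:k]


def build_prompt(question, facts):
    """Inject the retrieved facts into the LLM context (the RAG step)."""
    lines = [f"- {s.title}: {s.text}" for s in facts]
    return (
        "Answer using only the facts below. If they do not cover the question, say so.\n"
        "Facts:\n" + "\n".join(lines) + f"\nQuestion: {question}"
    )
```

A linear blend keeps the trade-off between "semantically relevant" and "right in front of you" easy to tune; other combinations (for example, a hard spatial filter followed by vector search) would be equally plausible.
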
🛡️

Hallucination Prevention

LLMs can "hallucinate" false facts. Our system prevents this:

  • Fact-checking layer: Verify claims against knowledge graph
  • Source attribution: Track where each fact comes from
  • Confidence scoring: AI admits when it doesn't know
  • Human review: Local experts curate and verify content
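
One way to picture the fact-checking, attribution, and confidence steps together is the sketch below. It assumes the claims in a draft answer have already been extracted into structured form (that extraction is the hard part and is not shown), and the `Fact` and `Claim` types, the exact-match comparison, and the fallback sentence are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:               # hypothetical knowledge-graph entry
    subject: str
    attribute: str
    value: str
    source: str           # where the fact came from, kept for attribution


@dataclass(frozen=True)
class Claim:              # hypothetical claim extracted from a draft answer
    subject: str
    attribute: str
    value: str


def check_claims(claims, facts):
    """Split claims into supported (with source) and unsupported; score the draft."""
    index = {(f.subject, f.attribute): f for f in facts}
    supported, unsupported = [], []
    for claim in claims:
        fact = index.get((claim.subject, claim.attribute))
        if fact is not None and fact.value == claim.value:
            supported.append((claim, fact.source))
        else:
            unsupported.append(claim)
    # With no claims there is nothing to contradict, so treat the draft as safe.
    confidence = len(supported) / len(claims) if claims else 1.0
    return supported, unsupported, confidence


def respond(draft, claims, facts, threshold=1.0):
    """Speak the draft only if enough of its claims are backed by known facts."""
    _, _, confidence = check_claims(claims, facts)
    if confidence < threshold:
        return "I'm not sure about that one."
    return draft
```
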
🎭

Personality & Voice

citypal isn't just accurate—it's engaging and human:

  • Tone adaptation: Playful, scholarly, or practical based on you
  • Story selection: Choose anecdotes over dry facts
  • Natural pacing: Pauses, emphasis, conversational flow
  • Local voices: City-specific personalities and accents
Audio Infrastructure

Low-latency audio streaming.

Traditional text-to-speech waits for the entire response before speaking. That's too slow for natural conversation.

We use streaming synthesis: citypal starts speaking about a second after your question, even while the AI is still generating the rest of the response.

The Result:

  • <2s from question to first word spoken
  • Interruption support: Stop the AI mid-sentence to ask a follow-up
  • Adaptive bitrate: Works on 3G, optimized for 5G
  • Background mode: Phone locked, audio continues
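
The streaming idea described above can be sketched in a few lines of Python: buffer the LLM's tokens, hand each sentence to the speech synthesizer as soon as it is complete, and stop if the user interrupts. This is an assumption-laden illustration, not the real pipeline. `synthesize` and `interrupted` are hypothetical callbacks standing in for a TTS engine and a voice-activity detector, and the sentence splitter is deliberately naive (it would split wrongly on abbreviations such as "St. Paul").

```python
import re

# A sentence ends at ., ! or ? followed by whitespace (naive on purpose).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


def speak_streaming(tokens, synthesize, interrupted=lambda: False):
    """Send each sentence to TTS as it completes; stop if the user cuts in."""
    for sentence in sentences_from_tokens(tokens):
        if interrupted():
            break
        synthesize(sentence)
```

For example, feeding the tokens `["That is ", "the Old", " Bridge. It was", " built in 1607", ". Want more?"]` produces three calls to `synthesize`, the first of which fires as soon as the third token arrives rather than at the end of the response.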

Audio Pipeline

  1. User speaks: 0ms
  2. Speech-to-text (Whisper): ~300ms
  3. Intelligence layer processes: ~100ms
  4. LLM generates (streaming): ~400ms
  5. TTS starts speaking: ~200ms

~1 second total
Feels instantaneous
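
As a quick check of the arithmetic, the per-stage estimates from the pipeline above can be summed and compared with the 2-second target. The stage names below are hypothetical labels, and reading the LLM figure as "time to first streamed tokens" is an assumption.

```python
# Per-stage estimates in milliseconds, copied from the audio pipeline above.
STAGES_MS = {
    "speech_to_text": 300,
    "intelligence_layer": 100,
    "llm_first_tokens": 400,   # assumed to mean time to the first streamed tokens
    "tts_first_audio": 200,
}
BUDGET_MS = 2000  # the "<2s from question to first word spoken" target

total_ms = sum(STAGES_MS.values())
headroom_ms = BUDGET_MS - total_ms
print(f"time to first word: ~{total_ms} ms, headroom: {headroom_ms} ms")
# prints "time to first word: ~1000 ms, headroom: 1000 ms"
```
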
Built Right

Privacy-first. Performance-obsessed.

🔒

Privacy by Design

  • Precise Location Never Leaves Device
    GPS coordinates processed locally. Server only sees city-level data.
  • Ephemeral Conversations
    Audio not stored. Transcripts auto-delete after session.
  • No Third-Party Tracking
    No analytics SDKs, no ad networks, no data brokers.
  • Anonymous by Default
    No login required for basic tours. Data tied to device, not identity.
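
A minimal sketch of the "server only sees city-level data" idea: resolve precise coordinates to a city on the device, then build a request that carries only the city identifier. Everything here is assumed for illustration, including the `CityBounds` type, the hand-picked bounding boxes, and the payload shape. It is not a description of citypal's actual protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CityBounds:         # hypothetical on-device lookup entry
    city_id: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


# Rough, hand-picked boxes for illustration only.
CITIES = [
    CityBounds("paris", 48.80, 48.91, 2.22, 2.47),
    CityBounds("london", 51.28, 51.70, -0.51, 0.33),
]


def city_for(lat, lon, cities=CITIES):
    """Resolve precise coordinates to a coarse city id, entirely on-device."""
    for c in cities:
        if c.min_lat <= lat <= c.max_lat and c.min_lon <= lon <= c.max_lon:
            return c.city_id
    return None


def server_request(question, lat, lon):
    """Build the payload sent off-device; precise coordinates are never included."""
    return {"question": question, "city": city_for(lat, lon)}
```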

Optimized Performance

  • Edge Computing
    Content cached at CDN edge nodes. <50ms latency worldwide.
  • Offline Mode
    Download cities ahead of time. Core features work without network.
  • Battery Optimization
    4+ hours active use. Background GPS managed intelligently.
  • Adaptive Quality
    Degrades gracefully on slow connections. Never stops working.
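
Graceful degradation can be as simple as choosing an audio quality tier from the measured connection speed and falling back to downloaded content when the link is too slow. The sketch below shows that pattern only; the thresholds and bitrates are invented for illustration and are not citypal's real settings.

```python
# Hypothetical tiers: (minimum measured throughput, audio bitrate), both in kbit/s.
# Ordered from best quality to worst.
TIERS = [(256, 64), (96, 32), (32, 16)]


def pick_bitrate_kbps(throughput_kbps, tiers=TIERS):
    """Return the highest audio bitrate the link can sustain, or None to go offline."""
    for min_throughput, bitrate in tiers:
        if throughput_kbps >= min_throughput:
            return bitrate
    return None  # too slow or no network: fall back to downloaded content
```
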
Differentiation

Why we're ahead.

Foundation models are commoditized. Our competitive advantage is the intelligence layer and years of R&D on spatial AI.

🎯

Spatial-First Architecture

Built from the ground up for location-aware AI. Not a chatbot with GPS bolted on—every system designed for spatial context.

📚

Curated Knowledge

Local experts, historians, and storytellers contribute content. Not scraped Wikipedia—verified, compelling narratives.

🔬

R&D in Conversational UX

Years invested in how people actually talk while walking. Interruption handling, pause detection, natural turn-taking.

Interested in the tech?

We're hiring engineers, researchers, and technical leaders. Or if you're an investor, let's talk about our technical moat.