The Technology

Conversational AI meets the street.

citypal combines state-of-the-art language models, real-time spatial intelligence, and audio streaming technology to create a tour guide that actually understands you.

The Problem

Why this is hard.

⚡

Real-Time Performance

Conversational AI needs to respond in under 2 seconds. At walking speed, context changes constantly. Traditional chatbots can't keep up.

📍

Spatial Awareness

"What's that building?" only makes sense if the AI knows exactly where you are, what's visible, and what you're likely pointing at.

🎯

Contextual Memory

The AI must remember your interests, route, previous questions, and time constraints—all while maintaining natural conversation flow.

Architecture

How citypal works.

Five interconnected layers working in real time to deliver contextual, conversational walking tours.

📱

Client Layer

Audio-first React Native app optimized for background operation and low battery drain.

🧠

Intelligence Layer

Proprietary context engine that manages conversation, personalization, and fact-checking.

🗺️

Knowledge Layer

Vector database with city knowledge graphs for semantic search in milliseconds.

📍

Spatial Layer

GPS and sensor fusion to understand location, direction, and points of interest.

🎧

Audio Layer

Streaming voice synthesis with natural prosody and ultra-low latency.

Our Secret Sauce

The Intelligence Layer.

Foundation models like GPT-4 are powerful, but they don't understand where you are, what you're seeing, or what you care about. Our intelligence layer bridges that gap.

🎯

Context Management

We maintain a rich, dynamic context window that includes:

•User state: Location, heading, speed, time of day
•Visible POIs: What's in view, based on GPS + compass
•Conversation history: Recent topics, unanswered questions
•Learned preferences: Architecture? Food? History?
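
As a rough illustration of how "what's in view, based on GPS + compass" could be computed, the sketch below takes the user's position and heading, works out the bearing to each point of interest, and keeps those inside a field of view. Everything here is an assumption made for illustration: the names `Poi` and `visible_pois`, the 150 m range, and the 60-degree half-angle are not taken from citypal's actual implementation.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0


@dataclass(frozen=True)
class Poi:
    """Hypothetical point of interest (name plus WGS84 coordinates)."""
    name: str
    lat: float
    lon: float


def distance_m(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in metres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(y, x)) % 360.0


def visible_pois(lat, lon, heading_deg, pois, max_range_m=150.0, half_fov_deg=60.0):
    """Return POIs within range and within the field of view, nearest first."""
    hits = []
    for poi in pois:
        d = distance_m(lat, lon, poi.lat, poi.lon)
        if d > max_range_m:
            continue
        b = bearing_deg(lat, lon, poi.lat, poi.lon)
        # Smallest signed angle between heading and bearing, in [-180, 180).
        offset = (b - heading_deg + 180.0) % 360.0 - 180.0
        if abs(offset) <= half_fov_deg:
            hits.append((d, poi))
    return [poi for _, poi in sorted(hits, key=lambda t: t[0])]
```

A real system would also need occlusion data (a building behind another building is not "visible") and a smoothed compass reading; this sketch only covers the geometric filter.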

🔍

Semantic Retrieval

When you ask a question, we don't just search keywords—we understand intent:

•Vector search: Find semantically similar stories/facts
•Spatial filtering: Prioritize nearby, visible locations
•Personalization: Surface content matching your interests
•RAG architecture: Inject relevant facts into LLM context
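
The four steps above can be sketched as a single ranking function followed by prompt assembly. This is a minimal sketch under stated assumptions: embeddings are taken as precomputed by some embedding model, and the `Story` type, the score weights (0.6 / 0.25 / 0.15), and the 300 m cut-off are invented for illustration rather than drawn from citypal's system.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    """Hypothetical knowledge-base entry."""
    text: str
    embedding: tuple      # precomputed vector (assumed to come from an embedding model)
    distance_m: float     # distance from the user's current position
    topic: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank(query_embedding, stories, interests, max_distance_m=300.0, top_k=3):
    """Combine semantic similarity, proximity, and interest match into one score."""
    scored = []
    for s in stories:
        if s.distance_m > max_distance_m:
            continue  # spatial filtering: drop anything too far away
        semantic = cosine(query_embedding, s.embedding)
        proximity = 1.0 - s.distance_m / max_distance_m
        interest = 1.0 if s.topic in interests else 0.0
        scored.append((0.6 * semantic + 0.25 * proximity + 0.15 * interest, s))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [s for _, s in scored[:top_k]]


def build_prompt(question, stories):
    """RAG step: inject the retrieved facts into the LLM context."""
    facts = "\n".join(f"- {s.text}" for s in stories)
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {question}"
```

In production the linear scan would be replaced by an approximate nearest-neighbour query against the vector database; the scoring idea stays the same.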

🛡️

Hallucination Prevention

LLMs can "hallucinate" false facts. Our system guards against this:

•Fact-checking layer: Verify claims against knowledge graph
•Source attribution: Track where each fact comes from
•Confidence scoring: AI admits when it doesn't know
•Human review: Local experts curate and verify content
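
One way to picture the fact-checking, attribution, and confidence steps together is shown below. It is a simplified sketch, not citypal's implementation: it assumes claims have already been extracted from the draft answer as (subject, predicate, value) triples, models the knowledge graph as a plain dictionary, and uses hypothetical names (`Fact`, `check_claims`, `answer_or_decline`). The example entry is invented placeholder data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A curated fact plus where it came from (source attribution)."""
    value: str
    source: str


def check_claims(claims, graph):
    """Verify (subject, predicate, value) claims against the graph.

    Returns (confidence, sources, unsupported), where confidence is the
    fraction of claims that matched a curated fact.
    """
    sources, unsupported = [], []
    for subject, predicate, value in claims:
        fact = graph.get((subject, predicate))
        if fact is not None and fact.value == value:
            sources.append(fact.source)
        else:
            unsupported.append((subject, predicate, value))
    confidence = len(sources) / len(claims) if claims else 0.0
    return confidence, sources, unsupported


def answer_or_decline(draft, claims, graph, threshold=1.0):
    """Speak the draft only if enough of its claims are supported."""
    confidence, sources, _ = check_claims(claims, graph)
    if confidence < threshold:
        return "I'm not sure about that one.", []
    return draft, sources


# Placeholder data, invented for this example.
graph = {("Example Tower", "built"): Fact("1900", "curator note")}
```

With this graph, a draft claiming `("Example Tower", "built", "1900")` passes with its source attached, while a draft claiming a different year falls back to the "not sure" response.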

🎭

Personality & Voice

citypal isn't just accurate—it's engaging and human:

•Tone adaptation: Playful, scholarly, or practical based on you
•Story selection: Choose anecdotes over dry facts
•Natural pacing: Pauses, emphasis, conversational flow
•Local voices: City-specific personalities and accents

Audio Infrastructure

Low-latency audio streaming.

Traditional text-to-speech waits for the entire response before speaking. That's too slow for natural conversation.

We use streaming synthesis: citypal starts speaking as soon as the first words are ready, about one second after your question, while the AI is still generating the rest of the response.

The Result:

→<2s from question to first word spoken
→Interruption support: Stop AI mid-sentence to ask follow-up
→Adaptive bitrate: Works on 3G, optimized for 5G
→Background mode: Phone locked, audio continues
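
The core idea of streaming synthesis, and of interruption support, can be sketched in a few lines: buffer the LLM's tokens only until a sentence is complete, hand that sentence to the speech synthesizer immediately, and check for an interruption before each one. The function names and the `synthesize` / `interrupted` callbacks are placeholders assumed for illustration; the sentence splitter is deliberately naive.

```python
from typing import Callable, Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence


def speak_streaming(
    tokens: Iterable[str],
    synthesize: Callable[[str], None],
    interrupted: Callable[[], bool] = lambda: False,
) -> int:
    """Send sentences to TTS one at a time; stop if the user interrupts.

    Returns the number of sentences that were spoken.
    """
    spoken = 0
    for sentence in sentences_from_tokens(tokens):
        if interrupted():
            break
        synthesize(sentence)
        spoken += 1
    return spoken
```

Because `sentences_from_tokens` is a generator, the first sentence reaches the synthesizer before the model has produced the second. A production splitter would also need to handle abbreviations such as "St." that end in a period.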

Audio Pipeline

→User speaks: 0ms
→Speech-to-text (Whisper): ~300ms
→Intelligence layer processes: ~100ms
→LLM generates (streaming): ~400ms
→TTS starts speaking: ~200ms

~1 second total. Feels instantaneous.
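
The pipeline figures can be checked as a simple latency budget. The sketch below sums the approximate per-stage timings quoted above and compares the total with the 2-second target; the dictionary keys and function names are made up for the example.

```python
# Approximate per-stage latencies from the pipeline above, in milliseconds.
STAGES_MS = {
    "speech_to_text": 300,
    "intelligence_layer": 100,
    "llm_first_tokens": 400,
    "tts_first_audio": 200,
}

BUDGET_MS = 2000  # the "<2s from question to first word spoken" target


def time_to_first_word(stages):
    """Total latency when stages run strictly one after another."""
    return sum(stages.values())


def slowest_stage(stages):
    """Name of the stage that contributes the most latency."""
    return max(stages, key=stages.get)


total = time_to_first_word(STAGES_MS)
print(total)                      # 1000
print(BUDGET_MS - total)          # 1000 ms of headroom
print(slowest_stage(STAGES_MS))   # llm_first_tokens
```

The remaining headroom is what absorbs network jitter and slow devices before the 2-second target is missed.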

Built Right

Privacy-first. Performance-obsessed.

🔒

Privacy by Design

✓Precise Location Never Leaves Device: GPS coordinates are processed locally. The server only sees city-level data.
✓Ephemeral Conversations: Audio is not stored. Transcripts auto-delete after each session.
✓No Third-Party Tracking: No analytics SDKs, no ad networks, no data brokers.
✓Anonymous by Default: No login required for basic tours. Data is tied to the device, not your identity.

⚡

Optimized Performance

→Edge Computing: Content cached at CDN edge nodes. <50ms latency worldwide.
→Offline Mode: Download cities ahead of time. Core features work without network.
→Battery Optimization: 4+ hours active use. Background GPS managed intelligently.
→Adaptive Quality: Degrades gracefully on slow connections. Never stops working.

Differentiation

Why we're ahead.

Foundation models are commoditized. Our competitive advantage is the intelligence layer and years of R&D on spatial AI.

🎯

Spatial-First Architecture

Built from the ground up for location-aware AI. Not a chatbot with GPS bolted on—every system designed for spatial context.

📚

Curated Knowledge

Local experts, historians, and storytellers contribute content. Not scraped Wikipedia—verified, compelling narratives.

🔬

R&D in Conversational UX

Years invested in how people actually talk while walking. Interruption handling, pause detection, natural turn-taking.

Interested in the tech?

We're hiring engineers, researchers, and technical leaders. Or if you're an investor, let's talk about our technical moat.

Join the Team →

View Business Plan