TL;DR
Problem
KAYAK's standalone AI chatbot had strong capabilities but near-zero retention and almost no organic traffic.
Solution
Pivoted to an embedded conversational AI experience, de-risked with a low-cost A/B test before scaling site-wide.
Team
Design: Tai H, Jeongmin K
PM: John L
Front end: Jokūbas S, Goda G
Back end: Elica F
Impact
+14% booking revenue
+123% engagement
At a Glance
Onboarding to AI Chat
Standalone page → embedded drawer, zero context switching
Multi-intent output
Single rich text response → simultaneous flight + hotel widgets with follow-ups
Responsive chat
Congested mobile modal → streamlined single-line input with curated prompts
👻 The AI Ghost Town
Why does no one come back?
In 2024, KAYAK launched a standalone AI chatbot: KAYAK.ai. As a separate product from the core booking platform, it could answer complex travel questions, match deals to niche preferences, and check live flight status, then hand users off to KAYAK.com to complete their booking.
However, even with heavy investment in marketing and AI infrastructure, KAYAK.ai gained very little organic traffic and, most critically, almost no users came back.
KAYAK.ai became a ghost town within a month of launch.
Signals of opportunity
We dug into the referral data between KAYAK.ai and the core platform. A small group of KAYAK.ai users did make it to core KAYAK, and their behavior stood out: 5x higher click-through rates than users from other referral channels.
The users who experienced AI-assisted search showed dramatically stronger intent. But the standalone model couldn't bring enough of them to the core platform.
KAYAK.ai's technology wasn't the problem. Asking users to go somewhere new for it was.
🧪 Testing the waters: NLP Search
From finding to hypothesis
If the problem was asking users to go somewhere new, not the AI itself, then the fix wasn't improving KAYAK.ai. It was removing the destination barrier entirely, and reframing the goal around user value: how might an AI feature help travelers find the right travel deals more accurately and faster?
That reframed the question from “How do we make KAYAK.ai better?” into two testable hypotheses: that AI features embedded in KAYAK.com would drive adoption on par with standalone KAYAK.ai, and that user engagement with them would contribute positively to core revenue.
AI Chat vs. AI search mode: Long-term goal and quick test
We had a larger vision. But a persistent chat experience (multi-turn conversation, new UI components, new interaction patterns) carried significant dev cost and migration risk. If that experiment failed, the sunk cost would be high and the learnings hard to isolate.
We needed the cheapest possible test of the core thesis before making the expensive architectural commitment.
The obvious move was to let users type natural language directly into KAYAK's search form. But the existing form routes to structured queries: origin, destination, dates, passengers. Accepting freeform input would require a new backend pathway that constantly decides whether a user is typing a natural language prompt or a structured airport query. That's a much larger scope than adding a parallel mode.
A separate AI mode gave us two things: risk isolation and signal clarity. If the experiment broke, only the experiment broke, not the primary search funnel. And a distinct mode produced clean XP data, making it possible to attribute any conversion lift specifically to AI-assisted search.
The test
NLP Search: natural language input embedded directly in the core KAYAK search flow. Same site. Same results page. One new input type. The AI meets users where they already are, inside the flow they already trust.
Through this fast and low-cost MVP experiment, we were able to validate whether a larger vision was worth building.
One input field. Same search flow. Cheapest way to know if the thesis holds.
🦾 Prompt Engineering: Classify vs. Infer
Before shipping anything, we had to solve the hardest design problem: not what the AI looks like, but how it works in our product.
We explored two approaches to generating AI responses to user prompts: a parametric classifier and an intent inference engine.
Parametric classifier
The parametric classifier maps every prompt, by date and destination completeness, into 9 fixed scenarios. It was precise but brittle: it assumed users think in search parameters, couldn't handle prompts outside the search paradigm, and would grow exponentially with every new dimension.
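The scaling problem can be sketched with a little arithmetic. This is a minimal illustration, not the team's classifier: it assumes each dimension has three completeness levels (the names "missing", "partial", "complete" are hypothetical), which is one way dates and destination would produce 9 scenarios. The extra dimensions "travelers" and "budget" are also assumptions, added only to show the growth.

```python
from itertools import product

# Assumed completeness levels; the write-up only states that dates and
# destination together yield 9 fixed scenarios.
LEVELS = ("missing", "partial", "complete")


def scenarios(dimensions):
    """Enumerate every combination of completeness levels across dimensions."""
    return [
        dict(zip(dimensions, combo))
        for combo in product(LEVELS, repeat=len(dimensions))
    ]


base = scenarios(["dates", "destination"])
# Hypothetical extra dimensions, used only to show how the table grows.
with_travelers = scenarios(["dates", "destination", "travelers"])
with_budget = scenarios(["dates", "destination", "travelers", "budget"])

print(len(base), len(with_travelers), len(with_budget))  # 9 27 81
```

Under these assumptions every added dimension triples the number of scenarios that need a hand-written response, which is the brittleness described above.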
Intent inference engine
We decided to build an intent inference engine with much more resilience: sort what the user is trying to do into a rough bucket (search, explore, ask a question, plan across verticals), then resolve ambiguity within each bucket through follow-up questions and clarification.
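A rough sketch of the bucket-then-clarify idea follows. It is illustrative only: the real engine relies on a model and a system prompt, while `infer_intent` here is a hypothetical keyword stand-in, and the clarifying questions and the `known_dates` flag are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Intent(Enum):
    SEARCH = "search"
    EXPLORE = "explore"
    QUESTION = "question"
    PLAN = "plan"  # plan across verticals


@dataclass
class Turn:
    intent: Intent
    clarifying_question: Optional[str] = None  # set when the bucket still has gaps


def infer_intent(prompt: str) -> Intent:
    """Keyword stand-in for the model call (hypothetical, illustration only)."""
    text = prompt.lower()
    has_flight = "flight" in text
    has_hotel = "hotel" in text
    if has_flight and has_hotel:
        return Intent.PLAN
    if has_flight or has_hotel:
        return Intent.SEARCH
    if text.rstrip().endswith("?"):
        return Intent.QUESTION
    return Intent.EXPLORE


def respond(prompt: str, known_dates: bool = False) -> Turn:
    """Pick a rough bucket first, then ask a follow-up if details are missing."""
    intent = infer_intent(prompt)
    if intent in (Intent.SEARCH, Intent.PLAN) and not known_dates:
        return Turn(intent, "When are you planning to travel?")
    if intent is Intent.EXPLORE:
        return Turn(intent, "What kind of trip are you in the mood for?")
    return Turn(intent)
```

For example, `respond("flight to Lisbon")` lands in the search bucket and asks for dates, while `respond("Do I need a visa for Japan?")` is treated as a question and needs no follow-up. Unlike the fixed scenario table, a new kind of prompt only needs a bucket and a clarification rule.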
Simulate user flow in a custom GPT
To communicate the AI system behavior with ENG and PM, I started in Figma, mapping the intent inference engine as a jobs-to-be-done flow: how the system parses a user's input into different query types, and where a vertical search branches off to a second API call that generates the search URL. I then translated that static flow into a custom GPT so the behavior could actually be run.
Unlike a static Figma file, the custom GPT let stakeholders type real prompts and see simulated responses immediately, which gave us a much more efficient way to iterate on the system prompt before committing any dev effort.
This custom GPT simulation aligned ENG and PM as a shared behavioral contract before any design mocks were created, and became the playbook for design iteration.
🟢 Green Light to the Final Lap
We launched NLP Search as a tracer-bullet experiment in weeks. Here's what the data showed.
- +14% real booking revenue (excluding ads)
- +2% lift in total conversion
- Neutral impact on core metrics: no downside risk materialized
The revenue signal confirmed the thesis: embedding AI in a familiar flow converts. But the data also revealed a structural ceiling. Users who engaged with NLP Search submitted one query and stopped. There was no way to refine, follow up, or explore across verticals. The architecture was single-turn by design, and usage patterns reflected exactly that.
Scaling to Chat Drawer
NLP Search was a scoped experiment. What came next was a platform shift.
With the revenue signal and the single-turn ceiling identified, we built what we originally envisioned: an omnipresent AI chat experience across KAYAK.com. The Chat v1 drawer launched site-wide, accessible from the front door, results pages, and detail pages. For the first time, users could have a conversation with KAYAK, not just search it.
Left vs. Right?
In emerging AI chat interfaces, panel placement signals the AI's role to users.
Left panels (Lovable, ChatGPT Canvas) position AI as a creative partner. Users rely on multiple rounds of prompting and co-creation to fine-tune the artifact the LLM generates.
Right panels, on the other hand, position AI as a contextual assistant. It allows users to stay focused on the main content while keeping the assistant easily discoverable and non-intrusive. KAYAK Chat is the latter: users focus on search results, chat assists on demand.
Integrating with existing architecture
KAYAK already had a right-panel drawer for Trips. Reusing it meant zero new interaction models, and co-locating Chat with Trips creates a natural handoff: plan in chat, save to Trips in one motion.
🩹 Three Problems in One Reply
Chat v1 was live across the site. Then a single flight reply showed everything still wrong with it.
Three problems surfaced in one response:
- Unstable layout. The summary is the last thing the model generates, yet it rendered at the top, so every answer reflowed and the hierarchy collapsed the moment results arrived.
- Lost in the panel. The full results list was crammed into the narrow chat column. Fullstory replays showed users scrolling the results away and losing the page they came from.
- Orphaned saving. The save control lived inside chat, cut off from the core booking flow where users actually manage trips.
The third was the easy call: we removed saving from chat entirely and let the existing Trips drawer own it. The other two were real design problems to tackle: a response structure that stops reflowing, and a navigation model that keeps the page in view.
🧱 Defining a Coherent Response
First problem: when the AI can say anything, how does a response stay coherent?
Multi-turn chat made the output non-deterministic. Where NLP Search returned one structured result set, an open conversation could return anything: a list, a comparison, a clarifying question, a multi-vertical plan. With the summary landing last and jumping to the top, every response reflowed. So instead of designing one layout, I defined a fixed response anatomy, a constant order and lifecycle that any answer could pour into: a transient status and acknowledgement hold the wait, then a persistent block settles in fixed order, the result widget first, the summary pinned beneath it, then a follow-up. The summary still generates last, but it renders in its reserved slot, so the response stops jumping.
The lifecycle also controls cost. The status and acknowledgement are static, generic strings with no per-turn model call, while the widget and summary are the dynamic, generated parts. Holding the wait with a cheap static layer keeps the API cost flat no matter how many turns a user takes.
One slot stayed deliberately loose. The result widgets are variable and optional: a question returns none, a single search returns one, a multi-vertical prompt returns several. The anatomy held that flex, but what each widget should actually be, and how a user moves across several result pages without losing their place, was the next problem.
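The response anatomy can be sketched as a set of reserved slots whose render order never depends on the order in which content arrives. This is a simplified model under stated assumptions, not the shipped front end: the slot names, the `Response` class, and the static strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Fixed render order of the persistent block: widgets, then summary, then follow-up.
SLOT_ORDER = ("widgets", "summary", "follow_up")

# Static strings for the transient layer; showing them needs no model call.
# The wording is a placeholder, not the product copy.
STATUS = "Searching..."
ACKNOWLEDGEMENT = "Got it, looking into that."


@dataclass
class Response:
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def fill(self, slot: str, *parts: str) -> None:
        """Store generated content in its reserved slot, whenever it arrives."""
        if slot not in SLOT_ORDER:
            raise ValueError(f"unknown slot: {slot}")
        self.slots[slot] = list(parts)

    def render(self) -> List[str]:
        """Show the transient layer until content exists, then the fixed order."""
        if not self.slots:
            return [STATUS, ACKNOWLEDGEMENT]
        lines: List[str] = []
        for slot in SLOT_ORDER:
            lines.extend(self.slots.get(slot, []))  # empty slots render nothing
        return lines


reply = Response()
waiting = reply.render()
reply.fill("widgets", "flight results widget", "hotel results widget")
reply.fill("follow_up", "follow-up prompt")
reply.fill("summary", "summary text")  # generated last, rendered in its own slot
settled = reply.render()
```

Here `settled` lists both widgets, then the summary, then the follow-up, even though the summary was filled last. A question-only reply would fill just the summary slot, so the widget slot stays empty, which mirrors the optional, variable widget count described above.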
Reflections
Chat v1 Results
After launching Chat v1 as an embedded drawer on KAYAK.com, user engagement jumped 123% compared to NLP Search. Messages per user increased by 80%, signaling that users found enough value to keep the conversation going.
Key takeaways
- Ask before answering. One clarifying step costs seconds and saves users from dead-end sessions.
- Classify intent, not query. A deal-hunter and a trip-planner typing the same prompt need entirely different responses.
- Replace recaps with actions. 7/9 users called text summaries “noise”; proactive follow-ups outperformed them every time.
- Prototype with real data. Static mocks can't test intent-driven systems. Design for unknown outputs, not known inputs.
The most valuable work wasn't the UI. It was the behavioral specs, classifier logic, and research that made the UI possible. When those pieces clicked, the pixels followed.