Building MLS-grounded agentic search
Why Agentic Search for Real Estate?
When we started building Rockhood’s AI assistant, we faced a fundamental choice: build a retrieval-augmented generation (RAG) system or design something more deliberate. RAG works well for static knowledge bases, but real estate queries require something different. A question like “What are the top 3 comparable sales for a 4-bedroom home in Queen Anne under $1M?” isn’t answered by semantic search over documents—it requires structured queries, multiple data sources, ranking logic, and Fair Housing compliance checks.
We chose an agentic approach: a small planner orchestrates targeted searches over licensed MLS data, assembles evidence, and only then generates an answer with citations. This architecture prevents hallucination at the source and makes every claim traceable to MLS records. The trade-off? Higher latency and implementation complexity. But for a regulated industry where accuracy matters more than fluency, it’s the right call.
Design Trade-offs: What We Optimized For
Real estate AI sits at the intersection of three constraints: regulatory compliance (Fair Housing Act, MLS policies), data fragmentation (every MLS has different schemas and access rules), and user expectations (fast, natural answers). Our design goals reflect these realities.
Accuracy over fluency. We never output facts we can’t back with sources. This means sometimes saying “I don’t have enough data to answer that” instead of generating plausible-sounding nonsense. Users trust the system more when it admits uncertainty.
First-class citations. Every claim must be traceable to specific MLS records. This isn’t just good UX—it’s a compliance requirement. Agents need to verify sources before sharing analysis with clients, and audit trails matter when regulators ask questions.
Bounded latency. We target 2–10 seconds for typical queries. That’s slow compared to ChatGPT, but fast enough for professional workflows. The trade-off: we parallelize MLS queries aggressively and use intelligent caching, which introduces staleness (15-minute TTL for market aggregates). Real-time accuracy isn’t always necessary—yesterday’s median DOM is usually fine.
Compliance by construction. Fair Housing Act violations can destroy a brokerage. We build guardrails into the architecture—filtering protected class language before generation and on output—rather than patching compliance as an afterthought.
Operable in production. This isn’t a demo. We need clear logs, metrics, and failure modes. Every response includes query traces, evidence diffs across retries, and audit trails. When something breaks at 2am, we need to debug it.
The Real Estate Context: Why This Is Hard
Most AI systems deal with static knowledge or web-scale data. Real estate is different. MLS data is licensed, not scraped—we have contractual obligations and rate limits for every integration. Every MLS has different schemas—NWMLS uses different field names than CRMLS, which uses different filters than SFAR. Fair Housing compliance is non-negotiable—we can’t just fine-tune on MLS data and hope for the best.
These constraints shaped our architecture. Instead of training on MLS data (which would violate licensing), we treat MLSs as external APIs. Instead of semantic search over listings, we use structured queries with explicit constraints. Instead of relying on post-hoc content filters, we build compliance into the agent loop.
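To make the "MLSs as external APIs" point concrete, here is a minimal sketch of a connector interface; the class and field names are illustrative, not our production code:

from dataclasses import dataclass
from typing import Protocol

@dataclass
class SoldListing:
    # Normalized record shape shared by every connector, regardless of
    # how the underlying MLS names its fields.
    listing_id: str
    address: str
    close_price: int
    close_date: str  # ISO-8601
    beds: int
    baths: float
    sqft: int
    dom: int

class MLSConnector(Protocol):
    # Each licensed MLS (NWMLS, CRMLS, SFAR, SDMLS) gets its own implementation
    # that handles auth, rate limits, and field mapping behind this interface.
    def search_sold(self, location: str, beds: int,
                    max_price: int, days: int) -> list[SoldListing]: ...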
System Architecture: How It Works
The system runs a short agent loop that separates planning, retrieval, and generation into distinct phases.
Phase 1: Planning and Constraint Extraction
When a user asks “Top 3 comps for a 4-bed home in Queen Anne under $1M,” the planner must extract structured constraints: location=Queen Anne, beds=4, price<1,000,000, type=sold, horizon=180 days. This is harder than it sounds. “Queen Anne” could mean the neighborhood or the hill. “Under $1M” could mean list price or close price. “Top 3” implies ranking, but ranking by what—proximity, recency, similarity?
We use a small classification model to identify query intent (search, compare, analyze, explain) and extract constraints through a combination of entity recognition and slot filling. The planner refuses to guess—if it can’t extract key constraints with confidence, it asks clarifying questions instead of hallucinating.
The output: a structured plan with tool calls and timeouts. For the comparable sales query, the planner might issue three parallel searches: sold_listings(location=Queen Anne, beds=4, days=180), market_stats(location=Queen Anne, metric=median_dom), and neighborhood_boundaries(name=Queen Anne) to validate geography.
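To make that concrete, here is a sketch of what a planner output could look like, assuming a simple dict-based plan format (the exact schema, field names, and the explicit max_price argument are illustrative, not our production wire format):

plan = {
    "intent": "compare",
    "constraints": {
        "location": "Queen Anne",
        "beds": 4,
        "max_price": 1_000_000,
        "status": "sold",
        "horizon_days": 180,
        "top_k": 3,
    },
    # Tool calls run in parallel, each with its own timeout (see Phase 2).
    "tool_calls": [
        {"tool": "sold_listings",
         "args": {"location": "Queen Anne", "beds": 4,
                  "max_price": 1_000_000, "days": 180},
         "timeout_s": 3},
        {"tool": "market_stats",
         "args": {"location": "Queen Anne", "metric": "median_dom"},
         "timeout_s": 3},
        {"tool": "neighborhood_boundaries",
         "args": {"name": "Queen Anne"},
         "timeout_s": 3},
    ],
}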
Phase 2: Parallel MLS Queries
The planner executes searches in parallel with per-tool timeouts (typically 3 seconds). This is where MLS fragmentation hurts—we need separate connectors for NWMLS, CRMLS, SFAR, and SDMLS, each with different authentication, rate limits, and field mappings. If one MLS times out, we proceed with partial results rather than blocking the entire response.
Each connector returns structured records: listing IDs, addresses, close prices, close dates, beds, baths, square footage. We also fetch supporting data like neighborhood boundaries and market aggregates. All of this happens in parallel, orchestrated with deadline propagation—if the overall query has 10 seconds, each sub-query gets 3 seconds max.
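Here is a minimal asyncio sketch of that fan-out, with per-tool timeouts and graceful degradation when a connector times out; execute_tool stands in for a hypothetical dispatcher into the per-MLS connectors:

import asyncio

async def run_tool(call: dict) -> dict | None:
    # call = {"tool": ..., "args": ..., "timeout_s": ...}; execute_tool is a
    # hypothetical dispatcher into the per-MLS connectors.
    try:
        return await asyncio.wait_for(
            execute_tool(call["tool"], call["args"]),
            timeout=call["timeout_s"],
        )
    except asyncio.TimeoutError:
        # One slow MLS should not block the whole response: proceed with
        # partial results and let the evidence builder note the gap.
        return None

async def run_plan(plan: dict) -> list[dict]:
    results = await asyncio.gather(*(run_tool(c) for c in plan["tool_calls"]))
    return [r for r in results if r is not None]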
Phase 3: Ranking and Aggregation
Raw MLS results are messy. We need to normalize schemas, deduplicate (same listing from multiple sources), and rank by relevance. For comparable sales, “relevance” means similarity: how close is this property to the target in beds, baths, square footage, location, and recency?
We compute a weighted similarity score:
- Beds/baths exact match: High weight (you can’t comp a 4-bed with a 2-bed)
- Square footage: Normalized distance with decay (±20% is fine, ±50% is not)
- Location: Distance in meters with sharp cutoff (same neighborhood matters)
- Recency: Exponential decay (sales from last month > sales from 6 months ago)
This produces a ranked list. We take the top K (typically 3-10 depending on query type) and pass them to the evidence builder.
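For illustration, a simplified version of that scoring function might look like the following; the weights, cutoffs, and decay constant are placeholders rather than our tuned production values:

import math

def similarity(candidate: dict, target: dict,
               close_age_days: int, distance_m: float) -> float:
    # Hard requirement: bed count must match (you can't comp a 4-bed with a 2-bed).
    if candidate["beds"] != target["beds"]:
        return 0.0

    # Square footage: full credit within ±20%, decaying to zero by ±50%.
    sqft_ratio = abs(candidate["sqft"] - target["sqft"]) / target["sqft"]
    sqft_score = max(0.0, 1.0 - max(0.0, sqft_ratio - 0.2) / 0.3)

    # Location: sharp cutoff outside roughly the same neighborhood.
    loc_score = 1.0 if distance_m < 1_500 else 0.2

    # Recency: exponential decay, roughly halving every 90 days.
    recency_score = math.exp(-close_age_days / 130)

    # Baths: exact match scores highest.
    bath_score = 1.0 if candidate["baths"] == target["baths"] else 0.5

    # Illustrative weights; in practice these are hand-tuned per market.
    return (0.35 * sqft_score + 0.25 * loc_score
            + 0.25 * recency_score + 0.15 * bath_score)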
Phase 4: Evidence Assembly
The evidence builder compresses ranked results into a compact, citation-rich context that fits within the model’s token budget. This is a balancing act—we need enough detail for the model to generate accurate answers, but not so much that we blow the context window.
For each listing, we preserve: listing_id, address, close_price, close_date, beds, baths, sqft, days_on_market. We also attach aggregates: median_dom, trend_dom_90d, count_sold. Every fact gets a citation handle—a machine-readable reference back to the source record.
The output looks like this:
{
  "evidence": [
    {
      "id": "NWMLS-123456",
      "type": "sold",
      "address": "123 Queen Anne Ave N",
      "close_price": 985000,
      "close_date": "2025-07-18",
      "beds": 4,
      "baths": 2.5,
      "sqft": 2100,
      "dom": 8
    },
    ...
  ],
  "aggregates": {
    "median_dom": 12,
    "trend_dom_90d": -0.15,
    "count_sold": 47
  }
}
This structured format makes it easy for the model to reference specific listings and for us to trace every claim back to source data.
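One simple way to keep that context inside the token budget is to add evidence greedily in rank order until the budget runs out. The sketch below assumes a count_tokens callable standing in for whatever tokenizer the serving stack uses:

import json

def build_evidence(ranked: list[dict], aggregates: dict,
                   budget_tokens: int, count_tokens) -> dict:
    # count_tokens: callable(str) -> int, a stand-in for the real tokenizer.
    context = {"evidence": [], "aggregates": aggregates}
    used = count_tokens(json.dumps(aggregates))
    for listing in ranked:
        cost = count_tokens(json.dumps(listing))
        if used + cost > budget_tokens:
            break  # stop before blowing the context window
        context["evidence"].append(listing)
        used += cost
    return context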
Phase 5: Grounded Generation
Now we finally call the LLM. The prompt includes the user’s original query, the evidence context, and strict instructions: “Only use facts from the provided evidence. Reference listings by ID. If evidence is insufficient, say so.”
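A rough sketch of how such a prompt can be assembled; the instruction wording here is paraphrased, not our exact production prompt:

import json

SYSTEM_INSTRUCTIONS = (
    "Only use facts from the provided evidence. "
    "Reference listings by ID. "
    "If the evidence is insufficient to answer, say so explicitly."
)

def build_prompt(user_query: str, evidence_context: dict) -> list[dict]:
    # Chat-style message list; the evidence block is the structure from Phase 4.
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        {"role": "user", "content": (
            f"Question: {user_query}\n\n"
            f"Evidence:\n{json.dumps(evidence_context, indent=2)}"
        )},
    ]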
The model generates a concise answer:
Based on recent sales in Queen Anne, the top 3 comparable properties for a 4-bedroom home under $1M are:
- 123 Queen Anne Ave N (NWMLS-123456) — Sold for $985,000 on July 18, 2025. 4 beds, 2.5 baths, 2,100 sqft. Closed in 8 days.
- 456 Ward St (NWMLS-123987) — Sold for $975,000 on July 5, 2025. 4 beds, 2 baths, 1,950 sqft. Closed in 14 days.
- 789 Bigelow Ave N (NWMLS-124321) — Sold for $950,000 on June 29, 2025. 4 beds, 2.5 baths, 2,050 sqft. Closed in 11 days.
The median days on market for this area is 12 days, down 15% from 90 days ago.
Every number is grounded in the evidence, and the listing IDs make each comp clickable and verifiable.
Phase 6: Compliance and Safety Checks
Before returning the answer, we run it through compliance guardrails. These filters scan for Fair Housing Act protected class language (race, religion, familial status, etc.), MLS policy violations (exposing seller-side data), and unsafe content. If a violation is detected, we either rewrite the text or escalate for human review.
This happens after generation, not during training. Why? Because compliance requirements change—new policies, new protected classes, new MLS rules. We can’t retrain the model every time. By separating compliance into a post-processing step, we maintain flexibility.
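As a deliberately simplified illustration of the post-generation pass, imagine a pattern check over the generated text. The real guardrails are broader and context-aware, and the patterns below are illustrative only:

import re

# Illustrative patterns only; real Fair Housing screening is much broader and
# context-aware (e.g., "great for families" can raise a familial-status issue).
FHA_PATTERNS = [
    r"\bno (kids|children)\b",
    r"\b(perfect|ideal) for (families|singles|couples)\b",
    r"\b(christian|muslim|jewish) (neighborhood|community)\b",
]

def check_compliance(text: str) -> list[str]:
    # Returns the matched violations; an empty list means the answer can be
    # returned, otherwise it is rewritten or escalated for human review.
    return [p for p in FHA_PATTERNS if re.search(p, text, flags=re.IGNORECASE)]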
Phase 7: Return with Citations
Finally, we attach machine-readable citations to the response. The frontend renders these as clickable links back to MLS records. Agents can verify every claim. Compliance officers can audit the trail. Users trust the system because they can see the sources.
Total latency for this example: 2–4 seconds. Constraint extraction takes about 200ms, parallel MLS queries 1.5–2 seconds, ranking and evidence assembly 300ms, LLM generation 800ms, and guardrails 200ms.
Handling Uncertainty: When Data Isn’t Enough
One of the hardest problems in agentic systems is admitting “I don’t know.” Generic LLMs will happily hallucinate an answer. Our system refuses.
If the planner can’t find enough comparable sales (fewer than 3 matches in 180 days), it returns a structured “insufficient evidence” response:
I found only 2 comparable 4-bedroom sales in Queen Anne in the last 6 months, which isn’t enough for a reliable analysis. Would you like me to:
- Expand the search to nearby neighborhoods (Fremont, Wallingford)?
- Increase the time window to 12 months?
- Relax the bedroom constraint to include 3-bedroom homes?
This prevents hallucination at the source and guides users toward actionable next steps. It also reveals gaps in MLS coverage—if we consistently lack data for a market, we know we need better integrations.
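The gate that produces this behavior can be a small check on the ranked results from Phase 3; the thresholds and suggested relaxations below are illustrative:

def check_sufficiency(comps: list[dict], min_comps: int = 3) -> dict:
    # Returns either the evidence to ground the answer or a structured
    # "insufficient evidence" payload with suggested relaxations.
    if len(comps) >= min_comps:
        return {"status": "ok", "comps": comps}
    return {
        "status": "insufficient_evidence",
        "found": len(comps),
        "suggestions": [
            {"action": "expand_location", "to": ["Fremont", "Wallingford"]},
            {"action": "expand_horizon_days", "to": 365},
            {"action": "relax_beds", "to": 3},
        ],
    }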
Citations: Why Structured Metadata Matters
We don’t just say “source: listing.” We preserve listing_id, type (sold/active), close_date, and all relevant attributes. Why?
Compliance audits require traceability. When a Fair Housing complaint comes in, we need to show exactly which MLS records contributed to which claims. Unstructured citations (“Based on recent sales…”) don’t cut it.
Agents need to verify before sharing. Before sending a CMA to a client, an agent clicks through to verify the comps. If the citation points to a listing ID, they can pull it up in their MLS. If it’s just “recent sales,” it’s useless.
Users can validate accuracy. Transparency builds trust. When users see clickable listing IDs, they understand the system isn’t making things up—it’s grounded in real data.
This is the core insight: grounding isn’t just about retrieval; it’s about traceability and auditability. In regulated industries, that matters.
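Concretely, a citation handle can be as small as a typed record the frontend resolves back to the MLS. The fields below are an illustrative sketch, not our exact schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    # Machine-readable reference back to the source MLS record.
    mls: str                 # e.g. "NWMLS"
    listing_id: str          # e.g. "NWMLS-123456"
    record_type: str         # "sold" or "active"
    close_date: str | None   # ISO-8601, None for active listings
    retrieved_at: str        # when we pulled the record, for audit trails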
System Architecture Diagram
The diagram shows the full data flow:
- MLS Connectors (left): Multiple licensed APIs with rate limits and schema normalization
- Ranking & Aggregation (center): Similarity scoring, deduplication, top-K selection
- Evidence Builder (right): Compacts structured facts with citation handles
- LLM with Grounded Context (top right): Generates answers that reference evidence by ID
- Compliance Guardrails (bottom): FHA filters, MLS policy checks, audit logging
What’s Next: Future Directions
This architecture gets us production-ready accuracy and compliance. But there’s room for improvement.
Learned ranking models. Right now, similarity scoring uses hand-tuned weights. We could learn market-specific ranking functions from agent feedback—what comps do agents actually use?
Richer toolset for the planner. We could expose geospatial joins (properties within 0.5 miles of a school), risk layers (flood zones, crime stats), and economic indicators. This would enable more complex queries.
Automated testing for hallucination resistance. We need counterfactual tests: if we remove a key piece of evidence, does the model still claim the fact? If so, it’s hallucinating.
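A sketch of what such a counterfactual test could look like, assuming a hypothetical answer(query, evidence) callable that wraps the full pipeline:

def test_no_claim_without_evidence(answer, query, evidence, key_listing_id):
    # Remove one listing from the evidence and re-run generation.
    reduced = {
        **evidence,
        "evidence": [e for e in evidence["evidence"] if e["id"] != key_listing_id],
    }
    response = answer(query, reduced)
    # If the model still cites the removed listing, it is hallucinating.
    assert key_listing_id not in response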
Closing Thoughts
Grounding an LLM with an agentic search loop over MLS data gives us answers that are verifiable, compliant, and fast enough for professional workflows. The core idea is simple: plan, search, assemble evidence, then generate—with citations and guardrails by default.
This architecture reflects a deliberate trade-off: higher latency and implementation complexity in exchange for accuracy and auditability. For a regulated industry like real estate, where a single Fair Housing violation can cost millions, that’s the right choice.
If you’re building AI for regulated domains—healthcare, finance, legal—the lessons here apply: separate retrieval from generation, make citations first-class, and build compliance into the architecture from day one.
In a follow-up post, we’ll dive into the operational challenges: running this system in production, scaling across markets, and the compliance and observability infrastructure that keeps it reliable.