
Under the hood of Rockbase

Rockhood Engineering Team
9 min read

Why Real Estate Data Infrastructure Matters

Most AI companies start with a model and then figure out the data. In real estate, that approach doesn’t work. The data problem is the hard problem, and it has to be solved first.

Here’s why: there are over 100 MLSs (Multiple Listing Services) nationwide, each with different schemas, update frequencies, access rules, and licensing restrictions. NWMLS uses different field names than CRMLS. ARMLS structures addresses differently than SFAR. Some MLSs update every 15 minutes; others update once a day. Some expose APIs; others require FTP downloads of CSV files.

You can’t just scrape Zillow or Redfin. That violates their terms of service, and more importantly, you’d be working with stale, incomplete data that’s already been filtered through their business logic. If you want to build serious real estate AI—comps engines, market analytics, agentic search—you need licensed, direct MLS access. And that means solving the infrastructure problem.

This post walks through Rockbase: the data infrastructure platform we built to transform chaotic MLS feeds into clean, normalized, AI-ready data. It’s not sexy. It’s not a breakthrough in machine learning. But it’s the foundation that makes everything else possible.

Rockbase Architecture

Design Principles: What We Built For

Before diving into architecture, here’s what we optimized for:

Accuracy over speed. Real estate is a high-stakes domain. A wrong price or a missed listing can cost an agent a deal. We’d rather take an extra second to validate and normalize data correctly than serve fast but incorrect results.

Compliance by design. MLS licensing agreements are contracts. Violating them means losing access, which kills the product. We enforce MLS policies at the infrastructure level—per-MLS allowlists, data retention rules, audit trails—so violations are structurally impossible.

Horizontal scalability for new MLSs. Adding NWMLS shouldn’t require rewriting code that handles CRMLS. Each MLS integration is a plugin with a standardized interface. We normalize at ingestion, not at query time.

Tool-first API design. We don’t expose raw database tables. Instead, we expose tools: Property Search, Market Stats, Comp Engine. This abstraction layer lets us optimize queries, enforce business logic, and provide a stable API even as the underlying schema evolves.

The Data Flow: From Chaos to Clean Data

The diagram above shows the full pipeline. Let’s walk through each stage.

Stage 1: Change Detection

MLSs don’t push updates—we have to poll them. Most MLSs update their feeds every 15-30 minutes, but some lag by hours. Our change detection layer runs on a schedule, fetching new data and comparing it against what we already have.

Why change detection first? Because we don’t want to reprocess the entire feed every time. If NWMLS has 50,000 active listings and only 200 changed in the last hour, we only process those 200. This keeps latency low and reduces API load.

The challenge: Detecting changes reliably. Some MLSs provide a “last modified” timestamp. Others don’t. When there’s no timestamp, we compute a hash of the record and compare it to the stored hash. If they differ, it’s a change.

Failure modes: What if an MLS is down? We log the failure, back off exponentially (15 min → 30 min → 1 hour), and proceed with other MLSs. One MLS outage shouldn’t block the entire pipeline.
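
A minimal sketch of the hash-based fallback and the backoff schedule (illustrative Python, not the production pipeline; record_hash, detect_changes, and the listing_id key are placeholder names):

import hashlib
import json

def record_hash(record: dict) -> str:
    """Stable fingerprint of a listing record, used when the MLS provides no
    'last modified' timestamp."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def detect_changes(feed_records: list[dict], stored_hashes: dict[str, str]) -> list[dict]:
    """Return only the records whose content differs from what we last stored."""
    changed = []
    for rec in feed_records:
        key = rec["listing_id"]
        h = record_hash(rec)
        if stored_hashes.get(key) != h:
            changed.append(rec)
            stored_hashes[key] = h
    return changed

# Backoff schedule when an MLS feed is down: 15 min -> 30 min -> 1 hour.
BACKOFF_SECONDS = [15 * 60, 30 * 60, 60 * 60]

def next_poll_delay(consecutive_failures: int) -> int:
    """Exponential backoff, capped at one hour."""
    return BACKOFF_SECONDS[min(consecutive_failures, len(BACKOFF_SECONDS) - 1)]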

Stage 2: Schema Validation

Once we’ve detected changes, we validate them against the MLS schema. This catches:

  • Missing required fields. If a listing has no address or price, we reject it and log the error.
  • Data type mismatches. Some MLSs send prices as strings (“$500,000”). We parse and validate them.
  • Enumeration violations. A listing status should be “Active,” “Pending,” or “Sold”—not “ACT” or “closed.”

Here’s a real example. NWMLS represents status as integers (1 = Active, 2 = Pending, 3 = Sold). CRMLS uses strings (“A,” “P,” “S”). ARMLS uses full words (“Active,” “Pending,” “Sold”). Schema validation catches these differences and flags records that don’t conform.

Why validate before normalization? Because bad data upstream causes compounding errors downstream. If we normalize a malformed record, it pollutes the database, breaks search indexes, and surfaces in user queries. Better to reject it early and alert the MLS provider.
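
Here is a rough sketch of the validation step (hedged: the required-field sets, status values, and field names below are illustrative, not the real NWMLS/CRMLS schemas, and the production validator works against each MLS's raw field names):

# Illustrative per-MLS schema definitions.
SCHEMAS = {
    "NWMLS": {"required": {"address", "list_price", "status"}, "status_values": {"1", "2", "3"}},
    "CRMLS": {"required": {"address", "list_price", "status"}, "status_values": {"A", "P", "S"}},
    "ARMLS": {"required": {"address", "list_price", "status"}, "status_values": {"Active", "Pending", "Closed"}},
}

def parse_price(raw) -> int:
    """Accept prices sent as strings like '$500,000' and return an integer."""
    if isinstance(raw, (int, float)):
        return int(raw)
    cleaned = str(raw).replace("$", "").replace(",", "").strip()
    if not cleaned.isdigit():
        raise ValueError(f"unparseable price: {raw!r}")
    return int(cleaned)

def validate(mls: str, record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    schema = SCHEMAS[mls]
    errors = []
    missing = schema["required"] - record.keys()
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    if "list_price" in record:
        try:
            parse_price(record["list_price"])
        except ValueError as exc:
            errors.append(str(exc))
    status = record.get("status")
    if status is not None and str(status) not in schema["status_values"]:
        errors.append(f"status {status!r} not in {mls} enum")
    return errors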

Stage 3: Normalization

This is the hard part. Normalization transforms heterogeneous MLS records into a canonical schema that the rest of the system can rely on.

Address normalization. “123 Queen Anne Ave N, Seattle, WA” in NWMLS might appear as “123 QA Ave North Seattle 98109” in another MLS. We geocode addresses, standardize street suffixes (Ave, Avenue, Av.), and deduplicate listings that appear in multiple MLSs.

Field mapping. NWMLS calls it ListPrice; CRMLS calls it OriginalListPrice. We map both to list_price in our canonical schema. Same with beds, baths, square footage, photos, listing dates—every field gets mapped to a standard name.

Enumeration harmonization. We define a fixed set of enums for status, property type, etc., and map MLS-specific values to them:

{
  "status": {
    "NWMLS": {"1": "active", "2": "pending", "3": "sold"},
    "CRMLS": {"A": "active", "P": "pending", "S": "sold"},
    "ARMLS": {"Active": "active", "Pending": "pending", "Closed": "sold"}
  }
}

The output: A normalized record that looks the same whether it came from NWMLS, CRMLS, or any other MLS. This is what we store in the database.
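
In code, normalization amounts to applying those mapping tables. A simplified sketch (the field map is trimmed and partly hypothetical: ListPrice and OriginalListPrice come from the example above, the bedroom field names are placeholders):

# Trimmed mapping tables; the production config covers every canonical field.
FIELD_MAP = {
    "NWMLS": {"ListPrice": "list_price", "Bedrooms": "beds"},              # "Bedrooms" is a placeholder name
    "CRMLS": {"OriginalListPrice": "list_price", "BedroomsTotal": "beds"}, # so is "BedroomsTotal"
}

STATUS_MAP = {
    "NWMLS": {"1": "active", "2": "pending", "3": "sold"},
    "CRMLS": {"A": "active", "P": "pending", "S": "sold"},
    "ARMLS": {"Active": "active", "Pending": "pending", "Closed": "sold"},
}

def normalize(mls: str, record: dict) -> dict:
    """Map an MLS-specific record onto the canonical schema."""
    out = {}
    for source_field, value in record.items():
        canonical = FIELD_MAP[mls].get(source_field, source_field.lower())
        out[canonical] = value
    out["status"] = STATUS_MAP[mls][str(out["status"])]   # enum harmonization
    out["source_mls"] = mls                                # keep provenance for dedup and compliance
    return out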

Stage 4: Storage and Indexing

Normalized data flows into two systems: a database (our ground truth) and a search index (for fast queries).

Why two storage layers? Because they serve different purposes. The database is optimized for transactional consistency—when a listing changes, we update it atomically. The search index is optimized for queries—geospatial search, filter by beds/baths/price, full-text search on listing descriptions.

We use PostgreSQL with PostGIS for the database (geospatial queries are first-class) and Elasticsearch for the search index. When a listing is updated in the database, we re-index it asynchronously. This means there’s a small lag (typically <5 seconds) between database updates and search visibility, but it keeps query latency low.

Partitioning strategy. We partition by MLS and status. Active listings from NWMLS go in one partition; sold listings from CRMLS go in another. This speeds up queries (“show me active listings in Seattle”) and simplifies retention policies (sold listings older than 7 years get archived per NAR guidelines).
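
Conceptually, the write path looks something like this (a sketch, assuming a psycopg2-style connection and the Elasticsearch Python client; the listings table and its columns are illustrative):

import json
import queue

reindex_queue: "queue.Queue[dict]" = queue.Queue()

def upsert_listing(db_conn, listing: dict) -> None:
    """Write the normalized listing transactionally, then schedule async re-indexing."""
    with db_conn:                       # commit on success, roll back on error
        with db_conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO listings (id, mls, status, payload)
                VALUES (%s, %s, %s, %s)
                ON CONFLICT (id) DO UPDATE
                  SET status = EXCLUDED.status, payload = EXCLUDED.payload
                """,
                (listing["id"], listing["source_mls"], listing["status"], json.dumps(listing)),
            )
    reindex_queue.put(listing)          # search visibility lags the commit by a few seconds

def indexer_worker(search_client) -> None:
    """Background worker: drain the queue and push documents into the search index."""
    while True:
        listing = reindex_queue.get()
        search_client.index(index="listings", id=listing["id"], document=listing)

The important property is ordering: the database commit happens first, so if indexing fails we retry from the queue instead of losing ground truth.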

The API Layer: Tools Over Tables

Most data platforms expose SQL or REST endpoints that return raw table rows. We don’t. Instead, we expose tools—higher-level abstractions that encapsulate business logic and query optimization.

Why Tools?

Stability. The underlying schema changes as we add new MLSs and fields. Tools provide a stable interface. Property Search returns the same JSON structure whether we’re querying one MLS or ten.

Optimization. Tools let us optimize queries in ways users can’t. Market Stats pre-computes aggregates (median price, average DOM) and caches them with a 15-minute TTL. From the user’s perspective, it’s instant. Behind the scenes, we’re hitting a cache, not running an expensive aggregation.

Compliance. Tools enforce MLS-specific restrictions. Some MLSs prohibit exposing certain fields (seller motivation, days-to-close incentives). Tools apply per-MLS allowlists so violations are impossible.
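
The compliance piece is mostly a per-MLS allowlist applied at the tool boundary. A sketch (the allowed field sets below are made up; the real lists come from each licensing agreement):

# Hypothetical per-MLS field allowlists, derived from each licensing agreement.
FIELD_ALLOWLIST = {
    "NWMLS": {"id", "address", "price", "beds", "baths", "sqft", "status", "close_date", "dom"},
    "CRMLS": {"id", "address", "price", "beds", "baths", "sqft", "status"},
}

def apply_allowlist(mls: str, record: dict) -> dict:
    """Drop any field the MLS license does not permit us to expose."""
    allowed = FIELD_ALLOWLIST[mls]
    return {field: value for field, value in record.items() if field in allowed}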

Example Tools

Property Search: Takes filters (location, beds, baths, price range, status) and returns ranked listings. It handles geocoding (if you pass “Queen Anne,” it resolves to coordinates), applies MLS-specific restrictions, and deduplicates across MLSs.

Market Stats: Returns aggregates for a location: median price, median DOM, inventory count, month-over-month trends. These are pre-computed and cached because they’re queried frequently.

Comp Engine: Given a property, finds the most similar recently sold listings (the “comps”). It uses a weighted similarity function—beds/baths match, square footage within ±20%, same neighborhood, sold in the last 6 months—and returns ranked results.
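
A toy version of that scoring function (the weights and exact heuristics here are illustrative; the production ranker is tuned per market):

# Hypothetical weights for the weighted similarity function.
WEIGHTS = {"beds": 0.3, "baths": 0.2, "sqft": 0.3, "neighborhood": 0.2}

def comp_score(subject: dict, candidate: dict) -> float:
    """Score a sold listing's similarity to the subject property (0.0 to 1.0)."""
    score = 0.0
    if candidate["beds"] == subject["beds"]:
        score += WEIGHTS["beds"]
    if candidate["baths"] == subject["baths"]:
        score += WEIGHTS["baths"]
    sqft_delta = abs(candidate["sqft"] - subject["sqft"]) / subject["sqft"]
    if sqft_delta <= 0.20:                      # within +/-20% square footage
        score += WEIGHTS["sqft"] * (1 - sqft_delta / 0.20)
    if candidate["neighborhood"] == subject["neighborhood"]:
        score += WEIGHTS["neighborhood"]
    return score

def rank_comps(subject: dict, sold_last_6_months: list[dict], top_k: int = 5) -> list[dict]:
    """Rank candidates already filtered to sales in the last 6 months."""
    return sorted(sold_last_6_months, key=lambda c: comp_score(subject, c), reverse=True)[:top_k]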

More tools. We have tools for listing history (price changes, status changes), school data, flood zones, and property tax assessments. Each tool is a separate service with its own caching, rate limiting, and optimization logic.

The API Contract

Tools are called via a REST API:

POST /tools/property-search
{
  "location": "Queen Anne, Seattle",
  "beds": 4,
  "price_max": 1000000,
  "status": "sold",
  "days": 180
}

Response:

{
  "results": [
    {
      "id": "NWMLS-123456",
      "address": "123 Queen Anne Ave N",
      "price": 985000,
      "beds": 4,
      "baths": 2.5,
      "sqft": 2100,
      "close_date": "2025-07-18",
      "dom": 8
    },
    ...
  ],
  "count": 47,
  "cache_hit": false
}

The API returns structured data with citations (listing IDs) so downstream systems (like the LLM in our agentic search loop) can trace every claim back to source records.

The Residual Connection: When to Skip the LLM

In the diagram, notice the dashed line from the Database directly to the User. This is a residual connection—a shortcut that bypasses the full pipeline (search index → API → tools → LLM).

Why? Because not every query needs an LLM. If a user asks “Show me the listing for 123 Queen Anne Ave N,” we don’t need to:

  1. Search the index
  2. Call the Property Search tool
  3. Pass results to an LLM
  4. Generate a natural language response

We can just query the database directly, return the listing, and render it. This is faster, cheaper, and more reliable.

When to use the residual path:

  • Direct lookups. User provides a listing ID or address.
  • Simple filters. User wants raw data with no analysis (“All 4-bed homes under $1M”).
  • Bulk exports. User downloads a CSV of listings for offline analysis.

When to use the full pipeline:

  • Complex queries. “What are the best comps for this property?” requires ranking logic.
  • Natural language. “Is the Queen Anne market heating up?” needs synthesis and trend analysis.
  • Explanations. “Why is this property priced below market?” requires reasoning.

The residual connection is a pragmatic concession: sometimes the simplest path is the best.
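
The routing decision itself is simple. A sketch (the query shape and heuristics are illustrative, not our actual router):

def route(query: dict) -> str:
    """Decide whether a request can take the residual path (straight to the
    database) or needs the full tool + LLM pipeline."""
    if "listing_id" in query or "address" in query:
        return "residual"        # direct lookup by ID or address
    if query.get("format") == "csv":
        return "residual"        # bulk export, no analysis needed
    if query.get("filters") and not query.get("question"):
        return "residual"        # simple structured filter, raw data only
    return "full_pipeline"       # natural language, ranking, or explanation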

Operational Challenges: Running This at Scale

Building the pipeline was hard. Operating it is harder.

Challenge 1: Monitoring Data Freshness

We monitor each MLS feed independently. If NWMLS hasn’t sent new data in over an hour (typical is 15 minutes), we alert. If CRMLS sends a batch with 90% schema validation errors (typical is <1%), we alert.

The dashboard: We run a real-time dashboard showing feed health: last update time, record count, error rate, API latency. When something breaks, we see it in minutes, not hours.
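
The checks behind those alerts are straightforward. A sketch (the thresholds match the numbers above; the feed_stats shape is illustrative):

from datetime import datetime, timedelta, timezone

FRESHNESS_THRESHOLD = timedelta(hours=1)   # most feeds update every 15-30 minutes
ERROR_RATE_THRESHOLD = 0.01                # typical validation error rate is <1%

def check_feed_health(feed_stats: dict) -> list[str]:
    """Return alert messages for any feed that looks stale or unusually noisy."""
    alerts = []
    now = datetime.now(timezone.utc)
    for mls, stats in feed_stats.items():
        if now - stats["last_update"] > FRESHNESS_THRESHOLD:
            alerts.append(f"{mls}: no new data since {stats['last_update'].isoformat()}")
        if stats["records"] and stats["errors"] / stats["records"] > ERROR_RATE_THRESHOLD:
            alerts.append(f"{mls}: validation error rate {stats['errors'] / stats['records']:.1%}")
    return alerts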

Challenge 2: Handling Schema Changes

MLSs change their schemas without warning. NWMLS might rename a field, add a new enum value, or change a data type. When this happens:

  1. Schema validation starts rejecting records.
  2. Alerts fire.
  3. We update the schema mapping and redeploy.
  4. We backfill recent records that were rejected.

This happens 2-3 times per month across all MLSs. It’s unavoidable. The key is detecting it fast and fixing it fast.

Challenge 3: Compliance and Audit Trails

NAR (National Association of Realtors) guidelines require 7-year retention of transaction records for audit purposes. We log every query, every tool call, and every generated response. Logs are encrypted at rest, redacted for PII, and archived to cold storage after 90 days.

When a Fair Housing complaint comes in, we can reconstruct exactly what data was accessed, when, and by whom. This isn’t paranoia—it’s operational necessity in a regulated industry.
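
A stripped-down version of the audit record emitted per tool call might look like this (a sketch only; the real pipeline also encrypts logs at rest and ships them to cold storage after 90 days):

import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("rockbase.audit")

PII_FIELDS = {"email", "phone", "name"}   # illustrative redaction list

def record_audit_event(user_id: str, tool: str, request: dict, result_ids: list[str]) -> None:
    """Log who called which tool, with what parameters, and which listings came back."""
    redacted = {k: ("<redacted>" if k in PII_FIELDS else v) for k, v in request.items()}
    audit_log.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "tool": tool,
        "request": redacted,
        "result_ids": result_ids,
    }))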

Challenge 4: Cost Modeling

MLS licensing fees are per-user or per-query, depending on the MLS. NWMLS charges a flat monthly fee. CRMLS charges per API call. We track usage per MLS and per tool to understand our cost structure.

Storage costs: We store ~50,000 active listings per MLS, ~200,000 sold listings (6-month rolling window), and ~1M historical records. At current scale, PostgreSQL + Elasticsearch costs ~$2,000/month.

Compute costs: ETL pipeline processing costs ~$500/month. API and tool serving costs ~$1,000/month.

Total: Rockbase costs $3,500/month in infrastructure, plus MLS licensing fees ($10,000/month for 4 MLSs). As we scale to more MLSs, licensing dominates.

What This Enables

Rockbase is infrastructure. It doesn’t generate revenue directly. But it makes everything else possible.

Agentic search. Our MLS-grounded agentic search system relies on Rockbase tools to fetch comps, compute market stats, and retrieve listings. Without clean, normalized data, the agent loop would surface garbage.

Market analytics. We run trend detection algorithms over normalized time-series data. “Is inventory rising or falling?” “Are prices accelerating?” These questions require consistent data across MLSs and time.

Comparable sales engines. Real estate agents spend hours manually pulling comps. Our Comp Engine tool does it in seconds because it queries a unified, normalized database.

Future capabilities. With Rockbase as the foundation, we can build:

  • Predictive pricing models (train on normalized sold listings)
  • Cross-market insights (“Seattle inventory is rising, but Tacoma is flat”)
  • Automated CMAs (Comparative Market Analyses)
  • Fair Housing compliance dashboards (audit MLS data for policy violations)

None of this works without Rockbase.

Lessons Learned: What We Got Right and Wrong

Start with schema validation, not machine learning. We initially thought the hard part would be building a smart comp-ranking algorithm. Wrong. The hard part was getting clean data. We spent the first three months just building schema validators and normalization logic. Once the data was clean, everything else got easier.

Every MLS integration teaches you something new. We thought after integrating NWMLS and CRMLS, the third and fourth would be easy. Wrong. ARMLS uses a completely different address format. SFAR embeds metadata in listing descriptions. Each integration revealed new edge cases. Now we assume every new MLS will surprise us.

Compliance is a feature, not a tax. We initially treated MLS policy enforcement as a “nice to have.” Then we realized: if we violate a policy and lose MLS access, the product dies. Compliance became a first-class feature with tests, documentation, and eng review. It’s tedious, but it’s existential.

Build for observability from day one. We launched without detailed query tracing. When latency spiked, we had no idea why—slow MLS? Slow database? Slow cache? Adding tracing later was painful. If we could redo it, we’d instrument everything from the start.

Caching is a double-edged sword. Cache hit rates are great (78% for market stats), but cache invalidation is hard. When a listing changes, we invalidate related cache entries. When an MLS sends a bulk update, we invalidate aggressively. Too aggressive, and we lose cache benefits. Too conservative, and users see stale data. We’re still tuning this.

What’s Next: Scaling Rockbase

Right now, we integrate 4 MLSs covering Washington, California, and Arizona. Our goal: nationwide coverage (100+ MLSs).

Expanding MLS coverage. Each new MLS requires contract negotiation, schema mapping, and compliance review. We’re targeting 10 MLSs by Q2 2026, 25 by EOY 2026.

Real-time updates. Most MLSs update every 15-30 minutes. We poll on that cadence. But some MLSs support webhooks for instant updates. We’re building a webhook ingestion layer to reduce latency for time-sensitive queries (new listings, price reductions).

Richer data types. Right now, we ingest listings, photos, and basic attributes. Next: floor plans, 3D tours, showing availability, and transaction documents. This requires new storage (blob storage for media) and new tools (image search, document parsing).

Cross-market analytics. With normalized data across multiple markets, we can answer questions like: “Which markets have the fastest sales velocity?” “Where is inventory growing fastest?” These insights aren’t possible with raw MLS feeds—they require Rockbase’s normalization layer.

Closing: We Did the Heavy Lifting

Real estate AI is unforgiving. Integrating 100+ MLS feeds, surviving schema surprises at 2 a.m., enforcing contracts that read like legal exams, and keeping latency inside single-digit seconds is the hard work most teams never see.

Rockbase is the product of more than a decade building data-intensive AI and ML systems at Amazon and Meta scale. The same engineers who shipped globally distributed catalogs, real-time ranking pipelines, and privacy-hardened logging systems built this stack, and they brought those playbooks with them.

We built Rockbase to carry that load. The ingestion jobs, validators, normalization rules, audit trails, and cost controls are already wired together so you can focus on the surface area your customers touch. We have done the heavy lifting—wiring compliance into the pipeline, sanding off the MLS inconsistencies, and hardening the ETL—so you can build user experiences that are unique, trustworthy, and fast.

If you’re trying to ship real estate AI, you don’t have to re-fight every battle we’ve already won. Plug into Rockbase, spend your time crafting workflows, and let the infrastructure keep the ground truth straight.


This post is part of a series on building real estate AI. Read about MLS-grounded agentic search and running agentic search in production.

Tags: data infrastructure, ETL, MLS integration, schema normalization, API design