Real Estate / PropTech

Unifying 2.3 Million Customer Records from 12 Data Sources: A Real Estate CDP Implementation

March 30, 2026
11 min read

The same buyer was appearing as five different leads across twelve disconnected systems. Agents from three different sales teams were calling the same person at the same time. Before you can market to a customer, you need to know they are the same person.

CDP · Identity Resolution · Databricks · Python · Azure Functions · Data Engineering · Real Estate · PySpark · Delta Lake

The Problem: A Fragmented Customer Universe

A large real estate developer was running a sophisticated marketing operation — Google and Facebook ads, listings on three major property portals, WhatsApp Business campaigns, walk-in registrations, referral networks, call center inbound, and email campaigns.

The problem: every channel created its own customer record, in its own system, in its own format.

The same buyer — call him Rahul — might appear as:

  • 99acres: "Rahul Sharma", phone 9876543210, interested in 2BHK
  • Website form: "R. Sharma", phone +91-9876543210, email rahul.sharma@gmail.com
  • Facebook Lead Ad: "RAHUL SHARMA", phone 987-654-3210
  • Call center log: "Rahul S.", phone 9876543210, site visit scheduled
  • WhatsApp: same phone digits, different field format

Rahul was five different leads. He was being called by three different agents simultaneously. One agent had told him pricing. The second called to offer a site visit he had already scheduled. The third called to introduce the project he had already visited.

Multiply this across 2.3 million records over five years, and you have a fundamentally broken customer data foundation — one that made every downstream system (CRM, marketing, analytics) unreliable.

This was a previous company engagement. Client details are anonymised.

The 12 Data Sources

| Source | Integration method | Volume |
| --- | --- | --- |
| 99acres | REST API (webhook + daily export) | ~800K records |
| MagicBricks | SFTP daily export | ~420K records |
| Housing.com | REST API | ~310K records |
| Website forms | REST API (real-time webhook) | ~180K records |
| Facebook Lead Ads | Meta Marketing API | ~210K records |
| Google Ads | Google Ads API | ~95K records |
| WhatsApp Business | Meta Cloud API | ~105K records |
| Call centre (Freshdesk) | REST API | ~140K records |
| Walk-in registrations | CSV upload + API | ~55K records |
| Referral partner network | Batch file (daily SFTP) | ~80K records |
| Email campaigns (Mailchimp) | REST API | ~75K records |
| Historical SQL database | One-time ETL + ongoing CDC | ~830K records |

What We Built

A Customer Data Platform (CDP) that:

  1. Ingests continuously from all 12 sources
  2. Standardizes and cleans records at ingestion
  3. Resolves duplicate identities using deterministic and probabilistic matching
  4. Maintains a single golden record per customer with full journey history
  5. Propagates unified profiles to downstream systems (CRM, marketing platform, analytics layer)

Identity Resolution: The Engineering Core

This was the hardest problem — and the one where most CDP implementations fail.

Step 1: Standardization

Every record is standardized before matching:

  • Phone numbers: all formats normalized to E.164 (+91XXXXXXXXXX). 9876543210, +919876543210, 091-9876-543210, and 987-654-3210 all resolve to +919876543210
  • Email addresses: lowercased, whitespace stripped, common domain typos corrected (gmial.com → gmail.com) using edit-distance threshold < 2
  • Names: lowercased, titles stripped (Mr., Mrs., Dr., Er.), initials expanded where inferable

Step 2: Deterministic Matching

Two records are definitively the same person if they share:

  • Exact normalized phone number, OR
  • Exact email address (post-standardization)

Deterministic matches receive match_confidence: 1.0 and are merged immediately, no review required.
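The deterministic pass is transitive: if record A shares a phone with B, and B shares an email with C, all three belong to one identity. A minimal union-find sketch of that clustering (field names `phone` and `email` are assumed; the production job ran this as a distributed join, not in-memory):

```python
def deterministic_merge(records: list[dict]) -> dict[int, int]:
    """Cluster records that share a normalized phone OR email.

    Returns {record_index: cluster_id}. Clusters formed here carry
    match_confidence 1.0 and merge without review.
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        parent[find(i)] = find(j)

    seen: dict[str, int] = {}               # identifier token -> first record index
    for idx, rec in enumerate(records):
        for key in ("phone", "email"):
            value = rec.get(key)
            if not value:
                continue
            token = f"{key}:{value}"
            if token in seen:
                union(idx, seen[token])     # exact identifier match -> same person
            else:
                seen[token] = idx
    return {i: find(i) for i in range(len(records))}
```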

Step 3: Probabilistic Matching at Scale

For records without an exact phone or email match, we compute a match probability score using a weighted feature set:

| Feature | Weight |
| --- | --- |
| Name similarity (Jaro-Winkler distance) | 0.30 |
| Location / project interest overlap | 0.25 |
| Partial phone match (last 7 digits) | 0.25 |
| Source + timing proximity (same source, ±48 hours) | 0.20 |

This computation runs as a PySpark job on Databricks — probabilistic matching at 2.3M records requires distributed join operations that a single-machine solution cannot handle within acceptable SLA windows.

Decision thresholds:

  • Score ≥ 0.82 → auto-merge, match_confidence recorded on the golden record
  • Score in [0.65, 0.82) → flagged for human review (processed weekly by data team; typically < 2,000 records/week)
  • Score < 0.65 → treated as distinct individuals
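The weighted score and the threshold routing can be sketched as below. This is an illustrative, stdlib-only version: the production job used Jaro-Winkler (available in libraries such as `jellyfish`) inside a PySpark job, whereas here `difflib.SequenceMatcher` stands in for the name-similarity feature, and the record fields (`name`, `project`, `phone`, `source`, `ts`) are assumed names.

```python
from difflib import SequenceMatcher

# Feature weights from the matching model described above.
WEIGHTS = {"name": 0.30, "location": 0.25, "phone_partial": 0.25, "timing": 0.20}

def match_score(a: dict, b: dict) -> float:
    """Weighted match probability for two standardized records."""
    # Stand-in for Jaro-Winkler similarity on normalized names.
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    location = 1.0 if a["project"] == b["project"] else 0.0
    phone_partial = 1.0 if (
        a.get("phone") and b.get("phone") and a["phone"][-7:] == b["phone"][-7:]
    ) else 0.0
    timing = 1.0 if (
        a["source"] == b["source"] and abs(a["ts"] - b["ts"]) <= 48 * 3600
    ) else 0.0
    return (WEIGHTS["name"] * name_sim
            + WEIGHTS["location"] * location
            + WEIGHTS["phone_partial"] * phone_partial
            + WEIGHTS["timing"] * timing)

def route(score: float) -> str:
    """Apply the decision thresholds: auto-merge / human review / distinct."""
    if score >= 0.82:
        return "auto_merge"
    if score >= 0.65:
        return "human_review"
    return "distinct"
```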

Step 4: Golden Record Construction

When source records are merged into one identity, the golden record is a materialized view — not a destructive merge. Every source record and its contributing field values are preserved in full in a contact_sources junction table.

For each field on the golden record, the displayed value is the most recent high-confidence value, with source_priority weights applied where recency ties.
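The survivorship rule for a single field can be sketched as a pure function. The priority table below is an assumed example ordering (the source does not specify the actual weights), and confidence filtering is omitted for brevity:

```python
# Assumed example ordering: lower number = higher-priority source on a tie.
SOURCE_PRIORITY = {"call_centre": 1, "website": 2, "portal": 3}

def pick_field_value(candidates: list[dict]) -> object:
    """Choose the displayed value for one golden-record field.

    candidates: [{"value": ..., "ts": epoch_seconds, "source": ...}, ...]
    Most recent value wins; source priority breaks recency ties.
    """
    best = max(
        candidates,
        key=lambda c: (c["ts"], -SOURCE_PRIORITY.get(c["source"], 99)),
    )
    return best["value"]
```

Because the underlying `contact_sources` rows are all preserved, this selection can be recomputed at any time when priorities change, without data loss.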

Step 5: Change Propagation

New records from any source trigger identity resolution within 5 minutes via Azure Service Bus. If the new record matches an existing golden record, the profile is updated. Downstream systems receive a webhook notification for every golden record change event.
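The core decision inside that handler, stripped of the Service Bus trigger and webhook delivery, reduces to a lookup against an identifier index. A pure-function sketch (the index shape and event names here are assumptions for illustration):

```python
def handle_new_record(record: dict, golden_index: dict[str, str]) -> dict:
    """Decide how an incoming record affects the golden-record store.

    golden_index maps normalized identifier tokens ("phone:+91...",
    "email:...") to golden-record IDs. Returns the change event that
    would be published to downstream webhook subscribers.
    """
    for key in ("phone", "email"):
        value = record.get(key)
        if value and f"{key}:{value}" in golden_index:
            # Deterministic hit: update the existing golden record.
            return {"event": "golden_record_updated",
                    "golden_id": golden_index[f"{key}:{value}"]}
    # No deterministic hit: a new golden record is created (probabilistic
    # matching would still run against it asynchronously).
    return {"event": "golden_record_created", "golden_id": None}
```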

Technical Architecture

Results

After implementation:

| Metric | Value |
| --- | --- |
| Total records processed | 2.3 million |
| Duplicate identities resolved | 340,000 |
| Inbound leads identified as existing customers | 23% |
| Records flagged for human review (weekly) | < 2,000 |
| Cross-source opt-out propagation accuracy | 99.8% |
| Latency from new record to golden record update | < 5 minutes (p95) |

The business impact beyond the numbers:

  • Sales agents pulling up a "new" enquiry now see the full history: previous site visits, past agent interactions, earlier pricing conversations. They stop cold-calling people who have already been through the funnel.
  • Marketing suppression became reliable for the first time. Previously, a customer who opted out from one portal's lead form was still being called because their opt-out only applied to that source's record. Post-CDP, a single opt-out propagates across all systems within minutes.
  • The analytics team went from "we can't trust the lead count" to having a single, clean source of truth for all funnel metrics.

What This Demonstrates

Identity resolution is the unsexy foundational work that makes every downstream data and AI initiative actually work. Segmentation, personalization, lead scoring, lifetime value — none of these produce reliable outputs when built on a fractured identity layer.

This engagement demonstrates the full CDP stack: multi-source ingestion engineering, domain-specific data standardization (India phone formats, Indian name patterns), probabilistic matching at scale on Databricks, golden record design with full lineage preservation, and real-time downstream propagation.

The same implementation pattern applies to any multi-channel business with fragmented customer data: retail, financial services, telecom, healthcare.

Not Sure Where to Start?

Book a free 30-minute strategy session with a senior data architect — no pitch, no obligation.

Schedule Your Free Strategy Session


Typically responds within 1 business day · Available for India, US, UK & Canada