Real Estate / PropTech

Unifying 2.3 Million Customer Records from 12 Data Sources: A Real Estate CDP Implementation

March 30, 2026
11 min read

The same buyer was appearing as five different leads across twelve disconnected systems. Agents from three different sales teams were calling the same person at the same time. Before you can market to a customer, you need to know they are the same person.

CDP · Identity Resolution · Databricks · Python · Azure Functions · Data Engineering · Real Estate · PySpark · Delta Lake

The Problem: A Fragmented Customer Universe

A large real estate developer was running a sophisticated marketing operation — Google and Facebook ads, listings on three major property portals, WhatsApp Business campaigns, walk-in registrations, referral networks, call center inbound, and email campaigns.

The problem: every channel created its own customer record, in its own system, in its own format.

The same buyer — call him Rahul — might appear as:

  • 99acres: "Rahul Sharma", phone 9876543210, interested in 2BHK
  • Website form: "R. Sharma", phone +91-9876543210, email rahul.sharma@gmail.com
  • Facebook Lead Ad: "RAHUL SHARMA", phone 987-654-3210
  • Call center log: "Rahul S.", phone 9876543210, site visit scheduled
  • WhatsApp: same phone digits, different field format

Rahul was five different leads. He was being called by three different agents simultaneously. One agent had told him pricing. The second called to offer a site visit he had already scheduled. The third called to introduce the project he had already visited.

Multiply this across 2.3 million records over five years, and you have a fundamentally broken customer data foundation — one that made every downstream system (CRM, marketing, analytics) unreliable.

This was a previous company engagement. Client details are anonymised.

The 12 Data Sources

| Source | Integration method | Volume |
| --- | --- | --- |
| 99acres | REST API (webhook + daily export) | ~800K records |
| MagicBricks | SFTP daily export | ~420K records |
| Housing.com | REST API | ~310K records |
| Website forms | REST API (real-time webhook) | ~180K records |
| Facebook Lead Ads | Meta Marketing API | ~210K records |
| Google Ads | Google Ads API | ~95K records |
| WhatsApp Business | Meta Cloud API | ~105K records |
| Call centre (Freshdesk) | REST API | ~140K records |
| Walk-in registrations | CSV upload + API | ~55K records |
| Referral partner network | Batch file (daily SFTP) | ~80K records |
| Email campaigns (Mailchimp) | REST API | ~75K records |
| Historical SQL database | One-time ETL + ongoing CDC | ~830K records |

What We Built

A Customer Data Platform (CDP) that:

  1. Ingests continuously from all 12 sources
  2. Standardizes and cleans records at ingestion
  3. Resolves duplicate identities using deterministic and probabilistic matching
  4. Maintains a single golden record per customer with full journey history
  5. Propagates unified profiles to downstream systems (CRM, marketing platform, analytics layer)

Identity Resolution: The Engineering Core

This was the hardest problem — and the one where most CDP implementations fail.

Step 1: Standardization

Every record is standardized before matching:

  • Phone numbers: all formats normalized to E.164 (+91XXXXXXXXXX). 9876543210, +919876543210, 091-9876-543210, and 987-654-3210 all resolve to +919876543210
  • Email addresses: lowercased, whitespace stripped, common domain typos corrected (gmial.com → gmail.com) using edit-distance threshold < 2
  • Names: lowercased, titles stripped (Mr., Mrs., Dr., Er.), initials expanded where inferable

Step 2: Deterministic Matching

Two records are definitively the same person if they share:

  • Exact normalized phone number, OR
  • Exact email address (post-standardization)

Deterministic matches receive match_confidence: 1.0 and are merged immediately, no review required.
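The deterministic pass is transitive: if record A shares a phone with B, and B shares an email with C, all three belong to one identity. A minimal union-find sketch of that clustering (field names `phone` and `email` are assumed; the production job ran this as a distributed join, not in-memory):

```python
def deterministic_merge(records: list[dict]) -> dict[int, int]:
    """Cluster records that share a normalized phone OR email.

    Returns {record_index: cluster_id}. Clusters formed here carry
    match_confidence 1.0 and merge without review.
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        parent[find(i)] = find(j)

    seen: dict[str, int] = {}               # identifier token -> first record index
    for idx, rec in enumerate(records):
        for key in ("phone", "email"):
            value = rec.get(key)
            if not value:
                continue
            token = f"{key}:{value}"
            if token in seen:
                union(idx, seen[token])     # exact identifier match -> same person
            else:
                seen[token] = idx
    return {i: find(i) for i in range(len(records))}
```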

Step 3: Probabilistic Matching at Scale

For records without an exact phone or email match, we compute a match probability score using a weighted feature set:

| Feature | Weight |
| --- | --- |
| Name similarity (Jaro-Winkler distance) | 0.30 |
| Location / project interest overlap | 0.25 |
| Partial phone match (last 7 digits) | 0.25 |
| Source + timing proximity (same source, ±48 hours) | 0.20 |

This computation runs as a PySpark job on Databricks — probabilistic matching at 2.3M records requires distributed join operations that a single-machine solution cannot handle within acceptable SLA windows.

Decision thresholds:

  • Score ≥ 0.82 → auto-merge, match_confidence recorded on the golden record
  • Score in [0.65, 0.82) → flagged for human review (processed weekly by data team; typically < 2,000 records/week)
  • Score < 0.65 → treated as distinct individuals
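The weighted score and the threshold routing can be sketched as below. This is an illustrative, stdlib-only version: the production job used Jaro-Winkler (available in libraries such as `jellyfish`) inside a PySpark job, whereas here `difflib.SequenceMatcher` stands in for the name-similarity feature, and the record fields (`name`, `project`, `phone`, `source`, `ts`) are assumed names.

```python
from difflib import SequenceMatcher

# Feature weights from the matching model described above.
WEIGHTS = {"name": 0.30, "location": 0.25, "phone_partial": 0.25, "timing": 0.20}

def match_score(a: dict, b: dict) -> float:
    """Weighted match probability for two standardized records."""
    # Stand-in for Jaro-Winkler similarity on normalized names.
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    location = 1.0 if a["project"] == b["project"] else 0.0
    phone_partial = 1.0 if (
        a.get("phone") and b.get("phone") and a["phone"][-7:] == b["phone"][-7:]
    ) else 0.0
    timing = 1.0 if (
        a["source"] == b["source"] and abs(a["ts"] - b["ts"]) <= 48 * 3600
    ) else 0.0
    return (WEIGHTS["name"] * name_sim
            + WEIGHTS["location"] * location
            + WEIGHTS["phone_partial"] * phone_partial
            + WEIGHTS["timing"] * timing)

def route(score: float) -> str:
    """Apply the decision thresholds: auto-merge / human review / distinct."""
    if score >= 0.82:
        return "auto_merge"
    if score >= 0.65:
        return "human_review"
    return "distinct"
```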

Step 4: Golden Record Construction

When source records are merged into one identity, the golden record is a materialized view — not a destructive merge. Every source record and its contributing field values are preserved in full in a contact_sources junction table.

For each field on the golden record, the displayed value is the most recent high-confidence value, with source_priority weights applied where recency ties.
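The survivorship rule for a single field can be sketched as a pure function. The priority table below is an assumed example ordering (the source does not specify the actual weights), and confidence filtering is omitted for brevity:

```python
# Assumed example ordering: lower number = higher-priority source on a tie.
SOURCE_PRIORITY = {"call_centre": 1, "website": 2, "portal": 3}

def pick_field_value(candidates: list[dict]) -> object:
    """Choose the displayed value for one golden-record field.

    candidates: [{"value": ..., "ts": epoch_seconds, "source": ...}, ...]
    Most recent value wins; source priority breaks recency ties.
    """
    best = max(
        candidates,
        key=lambda c: (c["ts"], -SOURCE_PRIORITY.get(c["source"], 99)),
    )
    return best["value"]
```

Because the underlying `contact_sources` rows are all preserved, this selection can be recomputed at any time when priorities change, without data loss.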

Step 5: Change Propagation

New records from any source trigger identity resolution within 5 minutes via Azure Service Bus. If the new record matches an existing golden record, the profile is updated. Downstream systems receive a webhook notification for every golden record change event.
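The core decision inside that handler, stripped of the Service Bus trigger and webhook delivery, reduces to a lookup against an identifier index. A pure-function sketch (the index shape and event names here are assumptions for illustration):

```python
def handle_new_record(record: dict, golden_index: dict[str, str]) -> dict:
    """Decide how an incoming record affects the golden-record store.

    golden_index maps normalized identifier tokens ("phone:+91...",
    "email:...") to golden-record IDs. Returns the change event that
    would be published to downstream webhook subscribers.
    """
    for key in ("phone", "email"):
        value = record.get(key)
        if value and f"{key}:{value}" in golden_index:
            # Deterministic hit: update the existing golden record.
            return {"event": "golden_record_updated",
                    "golden_id": golden_index[f"{key}:{value}"]}
    # No deterministic hit: a new golden record is created (probabilistic
    # matching would still run against it asynchronously).
    return {"event": "golden_record_created", "golden_id": None}
```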

Technical Architecture

Results

After implementation:

| Metric | Value |
| --- | --- |
| Total records processed | 2.3 million |
| Duplicate identities resolved | 340,000 |
| Inbound leads identified as existing customers | 23% |
| Records flagged for human review (weekly) | < 2,000 |
| Cross-source opt-out propagation accuracy | 99.8% |
| Latency from new record to golden record update | < 5 minutes (p95) |

The business impact beyond the numbers:

  • Sales agents pulling up a "new" enquiry now see the full history: previous site visits, past agent interactions, earlier pricing conversations. They stop cold-calling people who have already been through the funnel.
  • Marketing suppression became reliable for the first time. Previously, a customer who opted out from one portal's lead form was still being called because their opt-out only applied to that source's record. Post-CDP, a single opt-out propagates across all systems within minutes.
  • The analytics team went from "we can't trust the lead count" to having a single, clean source of truth for all funnel metrics.

What This Demonstrates

Identity resolution is the unsexy foundational work that makes every downstream data and AI initiative actually work. Segmentation, personalization, lead scoring, lifetime value — none of these produce reliable outputs when built on a fractured identity layer.

This engagement demonstrates the full CDP stack: multi-source ingestion engineering, domain-specific data standardization (India phone formats, Indian name patterns), probabilistic matching at scale on Databricks, golden record design with full lineage preservation, and real-time downstream propagation.

The same implementation pattern applies to any multi-channel business with fragmented customer data: retail, financial services, telecom, healthcare.

Not Sure Where to Start?

Book a free 30-minute strategy session with a senior data architect — no pitch, no obligation.

Schedule Your Free Strategy Session


Typically responds within 1 business day · Available for India, US, UK & Canada