Campaign Fulfillment at Scale: Migrating from SQL Server to Databricks — 4 Hours to 14 Minutes
A campaign fulfillment engine that worked fine at 100,000 contacts collapsed under the weight of 2 million. SQL Server was not the problem — the architecture was. Here is how we redesigned it for Databricks and cut fulfillment time by 94%.
The Breaking Point
The client operated one of the largest automotive marketing platforms in the US — serving 4,000+ dealerships with targeted campaign automation, including coupon-based offers delivered to prospect audiences segmented by vehicle interest, geography, purchase intent, and financing history.
The scale trajectory:
- 2019: A campaign to 200,000 contacts ran in 18 minutes
- 2021: A campaign to 600,000 contacts was taking 1.5 hours
- 2022: A campaign to 800,000 contacts was taking 2.5 hours with occasional timeouts
- Early 2023: Campaigns targeting 2 million+ contacts were failing to complete within the business day
The contractual SLA to dealerships was "campaign live within 2 hours of approval." That commitment was being broken weekly.
The system was collapsing under its own success. The question was whether to patch or redesign.
Previous company engagement. Client details are anonymized.
What the Campaign Fulfillment Engine Did
The fulfillment pipeline had four stages:
- Audience resolution — evaluate the campaign's targeting rules against the full contact database to produce a recipient list
- Coupon assignment — for coupon-based campaigns, assign a unique, one-time-use coupon code to each recipient (strict no-duplicate constraint)
- Personalization — render each recipient's message with their name, vehicle interest, assigned coupon, dealership details, and offer copy
- Delivery queuing — push personalized payloads to the delivery service (email, SMS, direct mail)
All four stages were implemented as stored procedures in SQL Server. Written when campaigns had 50,000 contacts. Patched and extended for three years. By 2023, they were a labyrinth of temporary tables, cursor loops, and cross-database joins that no single engineer could fully trace.
Root Cause Analysis
Before touching the migration, we spent two weeks profiling the existing system under production-representative load.
Finding 1: Coupon assignment was a cursor loop
Assigning unique coupons to 2M recipients was done row-by-row:
-- Simplified pseudocode of the original stored procedure:
-- one SELECT and one UPDATE per recipient, single-threaded
DECLARE recipient_cursor CURSOR FOR
    SELECT recipient_id FROM campaign_recipients;  -- table name simplified
OPEN recipient_cursor;
FETCH NEXT FROM recipient_cursor INTO @recipient_id;
WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT TOP 1 @coupon_id = coupon_id
    FROM coupons
    WHERE used = 0
      AND pool_id = @pool_id;

    UPDATE coupons
    SET used = 1, assigned_to = @recipient_id
    WHERE coupon_id = @coupon_id;

    FETCH NEXT FROM recipient_cursor INTO @recipient_id;
END
CLOSE recipient_cursor;
DEALLOCATE recipient_cursor;
At 2M recipients, this is 2 million sequential SELECT-and-UPDATE iterations executed one row at a time on a single thread. This loop alone accounted for ~68% of total fulfillment time.
Finding 2: Audience resolution was doing full table scans
The contact database held 8.5M records. The audience filter queries were missing composite indexes on the most common filter dimensions, resulting in multiple full-table scans per campaign execution.
Finding 3: Lock contention under concurrent load
Multiple campaigns running simultaneously (standard during business hours) were acquiring table-level locks on shared reference tables (coupon pools, contact attributes). Concurrent campaigns queued behind each other, causing cascading delays and occasional deadlocks.
Finding 4: Zero parallelism
The entire four-stage pipeline was single-threaded. SQL Server was sitting at 12% CPU utilization during campaign runs. The bottleneck was sequential stored procedure logic — not compute resource saturation.
The Redesign: Databricks + Delta Lake
We redesigned the pipeline as distributed Spark jobs on Databricks, with Delta Lake as the data layer.
Key Design Decision: Coupon Assignment Without Duplicates at Scale
This was the most complex problem. We had to guarantee that:
- Every recipient gets exactly one coupon
- No coupon is assigned to more than one recipient
- This holds under concurrent campaign execution (multiple campaigns running simultaneously)
- At 5M+ recipient scale
The solution: partition-aware batch join with Delta Lake optimistic concurrency.
The available coupons table is partitioned by coupon pool ID. Each recipient is assigned to a partition bucket using a deterministic hash. Within each partition, coupon assignment is a single shuffle-free join — no row-by-row iteration. The assignment write is a conditional Delta Lake MERGE:
# Simplified — actual implementation handles retry and partition logic.
# delta_coupons is a DeltaTable handle; assignments_df pairs each
# recipient with a candidate coupon_id from its partition bucket.
(
    delta_coupons.alias("coupons")
    .merge(
        assignments_df.alias("assignments"),
        "coupons.coupon_id = assignments.coupon_id AND coupons.assigned_to IS NULL",
    )
    .whenMatchedUpdate(
        set={
            "assigned_to": "assignments.recipient_id",
            "assigned_at": "current_timestamp()",
        }
    )
    .execute()
)
If a concurrent job claims a coupon in the same partition before this write commits, Delta Lake's optimistic concurrency detects the conflict and the affected assignments retry against the next available partition. Tested and verified: zero duplicate assignments at 5M recipient load across 50 concurrent test campaigns.
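The conflict-and-retry behavior can be sketched generically. Delta Lake surfaces a losing optimistic write as a concurrent-modification exception (e.g. `ConcurrentAppendException` in delta-spark); the wrapper below is a library-agnostic sketch using a stand-in exception, with illustrative backoff values:

```python
import time


class WriteConflict(Exception):
    """Stand-in for Delta's concurrent-modification exceptions."""


def commit_with_retry(write_fn, max_attempts=5, base_delay=0.01):
    """Retry an optimistic write; each retry re-reads state and
    (in the real job) targets the next available partition."""
    for attempt in range(max_attempts):
        try:
            return write_fn()
        except WriteConflict:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff


# Demo: a write that conflicts twice, then commits.
attempts = {"n": 0}


def flaky_write():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise WriteConflict()
    return "committed"


result = commit_with_retry(flaky_write)
```

The key property is that a retry never re-claims the coupons it lost; it re-reads the current table state, so the uniqueness guarantee holds across concurrent campaigns.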
Audience Resolution with Predicate Pushdown
The contacts Delta table is Z-ordered on the four most common filter dimensions: geography_state, vehicle_category, purchase_intent_score, and last_contact_date. Audience resolution queries push filter predicates down to file-level skipping — a typical campaign filter touches 3–8% of data files instead of scanning the full 8.5M-record table.
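File-level skipping works because Delta stores per-file min/max statistics for each column; the planner prunes any file whose range cannot contain a matching row. A toy model of that pruning (file names and stats are illustrative):

```python
# Each Delta data file carries min/max stats per column; a filter can
# skip any file whose [min, max] range cannot contain matching rows.
files = [
    {"path": "part-000", "state_min": "AZ", "state_max": "CA"},
    {"path": "part-001", "state_min": "FL", "state_max": "GA"},
    {"path": "part-002", "state_min": "NY", "state_max": "TX"},
]


def files_to_scan(files, state):
    """Keep only files whose min/max range could contain `state`."""
    return [
        f["path"]
        for f in files
        if f["state_min"] <= state <= f["state_max"]
    ]


# A filter on geography_state = 'FL' touches one file, not three.
print(files_to_scan(files, "FL"))
```

Z-ordering makes this effective: it clusters rows so each file covers a narrow range of the filter columns, which is why most files can be skipped outright.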
Migration Approach: Shadow Mode, Not Big Bang
We ran parallel pipelines for 6 weeks:
- SQL Server continued processing all production campaigns
- Databricks processed the same campaigns in shadow mode
- Output comparison: recipient lists, coupon assignments, and personalized payloads were diff'd against the SQL Server output after each campaign
After zero discrepancies across 200+ shadow-mode campaigns, we cut over. The SQL Server pipeline was kept on cold standby for 30 days post-cutover as rollback insurance.
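The per-campaign comparison amounts to a symmetric diff of keyed outputs. A minimal sketch of the check run after each shadow campaign (record shapes are illustrative):

```python
def diff_outputs(legacy_rows, shadow_rows, key="recipient_id"):
    """Return records present in one pipeline's output but not the
    other, plus keys whose records differ between the two."""
    legacy = {r[key]: r for r in legacy_rows}
    shadow = {r[key]: r for r in shadow_rows}
    missing_in_shadow = [legacy[k] for k in legacy.keys() - shadow.keys()]
    missing_in_legacy = [shadow[k] for k in shadow.keys() - legacy.keys()]
    mismatched = [
        k for k in legacy.keys() & shadow.keys() if legacy[k] != shadow[k]
    ]
    return missing_in_shadow, missing_in_legacy, mismatched


# Demo: same recipients, one divergent coupon assignment.
legacy = [{"recipient_id": 1, "coupon": "A"}, {"recipient_id": 2, "coupon": "B"}]
shadow = [{"recipient_id": 1, "coupon": "A"}, {"recipient_id": 2, "coupon": "C"}]
gone, extra, changed = diff_outputs(legacy, shadow)
```

Any non-empty result fails the shadow run for that campaign, so cutover confidence rests on concrete record-level evidence rather than aggregate counts.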
Results
| Metric | SQL Server (before) | Databricks (after) | Change |
|---|---|---|---|
| 2M audience campaign time | 4.2 hours | 14 minutes | −94% |
| Max practical audience size | ~800K | 5M+ | 6.25x capacity |
| Coupon duplication errors | Occasional under load | Zero | Eliminated |
| Concurrent campaign capacity | 2 (lock contention) | 8+ | 4x |
| Infrastructure cost per campaign run | Baseline | −74% | 3.8x cheaper per run |
| SLA compliance (2-hour delivery) | 67% | 99.2% | +32 percentage points |
The cost reduction is structural: the Databricks job cluster auto-scales, runs the campaign, and terminates. There is no always-on legacy database instance sized for peak load sitting idle during off-hours.
What This Demonstrates
This is a canonical data engineering migration: a system built for one scale, hitting the wall at 10x that scale, requiring architectural redesign rather than incremental patching.
The skills demonstrated here are transferable to any data platform migration:
- Profile before migrating — the cursor loop and missing index findings came from measurement, not assumption
- Partition-aware distributed job design — the coupon assignment pattern applies to any distributed write with uniqueness constraints
- Shadow mode validation — running parallel pipelines to validate correctness before cutover sharply reduces migration risk
- Phased cutover with rollback — the 30-day cold standby gave the business confidence to cut over
The industry is automotive. The pattern applies everywhere.
Not Sure Where to Start?
Book a free 30-minute strategy session with a senior data architect — no pitch, no obligation.
Schedule Your Free Strategy Session