
Incremental Synchronization System

Overview

The CDP runs an automated incremental synchronization system. Rather than reloading all data on every run, it fetches only new and updated orders, which keeps data current while limiting API and database load.

Architecture

Components

  1. Smart Sync Worker (Railway Service)

    • Repository: https://github.com/NomadaDigital01/nerdistan-worker
    • Service: Runs independently on Railway
    • Schedule: Automated execution every hour
  2. Incremental Sync Manager

    • Location: scripts/incremental_sync.py
    • Purpose: Captures new and updated orders
    • Batch size: 500 orders per execution
  3. Database Tables

    • cdp.order_events: Main order storage
    • cdp.tenants: Tenant configuration and status
    • cdp.sync_checkpoints: Progress tracking

Synchronization Schedule

Daily Incremental Sync

  • Frequency: Every 6 hours
  • Times: 00:00, 06:00, 12:00, 18:00 (Argentina time, UTC-3)
  • Scope: Last 24 hours of data
  • Duration: ~5-10 minutes per tenant
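The schedule above can be sketched in Python. This is an illustration only, not the worker's actual scheduler: the names `ART`, `SLOT_HOURS`, `next_sync_slot` and `daily_window` are hypothetical, and Argentina time is modeled as a fixed UTC-3 offset.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Assumption: Argentina time modeled as a fixed UTC-3 offset.
ART = timezone(timedelta(hours=-3), "ART")
SLOT_HOURS = (0, 6, 12, 18)  # the four daily slots listed above


def next_sync_slot(now: datetime) -> datetime:
    """Return the next scheduled slot strictly after `now` (must be timezone-aware)."""
    local = now.astimezone(ART)
    for hour in SLOT_HOURS:
        candidate = local.replace(hour=hour, minute=0, second=0, microsecond=0)
        if candidate > local:
            return candidate
    # Past 18:00 local time: the next slot is midnight of the following day.
    tomorrow = local + timedelta(days=1)
    return tomorrow.replace(hour=0, minute=0, second=0, microsecond=0)


def daily_window(run_at: datetime) -> Tuple[datetime, datetime]:
    """Return the (start, end) range covering the 24 hours before a run."""
    return run_at - timedelta(hours=24), run_at
```

Because each run covers 24 hours but runs occur every 6 hours, consecutive windows overlap, so an order missed by one run can still be picked up by the next.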

Historical Sync

  • Trigger: Automatically when gaps are detected
  • Batch size: 7 days at a time
  • Priority: Lower than daily sync
  • Checkpoints: Resumable after interruptions
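A minimal sketch of how 7-day batching with resumable checkpoints might work. The function name `historical_windows` and its parameters are hypothetical, and the sketch assumes the checkpoint stores the end date of the last completed window.

```python
from datetime import date, timedelta
from typing import Iterator, Optional, Tuple


def historical_windows(
    go_live: date,
    today: date,
    checkpoint: Optional[date] = None,
    batch_days: int = 7,
) -> Iterator[Tuple[date, date]]:
    """Yield (start, end) date windows, end exclusive, up to `today`.

    Resumes from `checkpoint` when one exists, otherwise from `go_live`.
    """
    start = checkpoint if checkpoint is not None else go_live
    step = timedelta(days=batch_days)
    while start < today:
        end = min(start + step, today)  # the final window may be shorter
        yield start, end
        start = end
```

After an interruption, passing the last saved window end as `checkpoint` skips the windows already processed.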

Full Sync

  • Frequency: Weekly (Sundays at 2 AM)
  • Scope: Complete 30-day refresh
  • Purpose: Data integrity verification

Configuration

Tenant Settings

-- Each tenant has these sync configurations
sync_enabled: boolean -- Enable/disable sync
sync_frequency_hours: integer -- Hours between syncs (typically 6)
last_sync_at: timestamp -- Last successful sync
go_live_date: date -- Start date for historical data
initial_sync_completed: boolean -- Full sync status
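As an illustration of how these settings could drive scheduling, the sketch below decides whether a tenant is due for a sync. The `TenantSyncConfig` class and `is_sync_due` function are hypothetical; only the field names come from the settings above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TenantSyncConfig:
    sync_enabled: bool
    sync_frequency_hours: int
    last_sync_at: Optional[datetime]  # None if the tenant has never synced


def is_sync_due(cfg: TenantSyncConfig, now: datetime) -> bool:
    """Return True when the tenant should be synced at time `now`."""
    if not cfg.sync_enabled:
        return False
    if cfg.last_sync_at is None:
        return True  # assumption: a tenant that never synced is always due
    return now - cfg.last_sync_at >= timedelta(hours=cfg.sync_frequency_hours)
```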

Current Active Tenants

| Tenant | Sync Frequency | Last Sync Status | Daily Orders |
|---|---|---|---|
| PetBaar | 6 hours | ✅ Active | ~40-50 |
| Seven Sport | 6 hours | ✅ Active | ~55-60 |
| Chelsea | 6 hours | ✅ Active | ~60-65 |
| Mundo Juguete | 6 hours | ✅ Active | ~55-60 |
| Kangoo Pet Food | 6 hours | ✅ Active | ~40-45 |
| Celada SA | 6 hours | ✅ Active | ~60-65 |
| Ferreira | 6 hours | ✅ Active | ~60-65 |
| Digital Farma | 6 hours | ✅ Active | ~40-45 |
| Zapatos Net | 6 hours | ✅ Active | ~15-20 |
| Bercovich SA | 6 hours | ✅ Active | ~10-15 |
| Essential | 6 hours | ✅ Active | ~60-65 |

Sync Process Flow

graph TD
A[Smart Worker Starts] --> B{Check Time}
B -->|Business Hours| C[Daily Sync]
B -->|Off Hours| D[Historical Sync]
B -->|Sunday 2AM| E[Weekly Full Sync]

C --> F[Get Recent Orders]
D --> G[Process Historical Gaps]
E --> H[Complete Refresh]

F --> I[Update CDP Tables]
G --> I
H --> I

I --> J[Update Checkpoints]
J --> K[Update last_sync_at]
K --> L[Complete]
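The time-based branch at the top of the flow could look roughly like the sketch below. The diagram does not define "business hours", so the 08:00-20:00 range here is an assumption, as are the function name `choose_sync_mode` and the mode strings.

```python
from datetime import datetime

# Assumption: the source does not say what counts as business hours.
BUSINESS_START_HOUR = 8
BUSINESS_END_HOUR = 20


def choose_sync_mode(now: datetime) -> str:
    """Pick a sync mode from the local time, mirroring the diagram's first branch."""
    # datetime.weekday(): Monday is 0, Sunday is 6.
    if now.weekday() == 6 and now.hour == 2:
        return "full"
    if BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR:
        return "daily"
    return "historical"
```

The Sunday 2 AM check comes first so that the weekly full sync takes precedence over the off-hours historical branch it would otherwise fall into.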

Data Volume Statistics

Daily Incremental Load

  • Average: ~1,000 orders/day across all tenants
  • Peak: ~1,500 orders/day (weekdays)
  • Low: ~700 orders/day (weekends)

Current Database Size

  • Total Orders: 155,777
  • Unique Customers: 30,272
  • Products: 49,831
  • Date Range: January 2023 - Present

Monitoring

Key Metrics

  1. Sync Status Check
SELECT
    tenant_name,
    last_sync_at,
    CASE
        WHEN last_sync_at > NOW() - INTERVAL '6 hours' THEN '✅ Current'
        WHEN last_sync_at > NOW() - INTERVAL '12 hours' THEN '⚠️ Behind'
        ELSE '❌ Stale'
    END AS status
FROM cdp.tenants
WHERE is_active = true;
  2. Daily Order Growth
SELECT
    DATE(order_date) AS date,
    COUNT(*) AS new_orders
FROM cdp.order_events
WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(order_date)
ORDER BY date DESC;
  3. Sync Gaps Detection
SELECT
    tenant_name,
    MAX(order_date) AS last_order,
    CURRENT_DATE - DATE(MAX(order_date)) AS days_behind
FROM cdp.tenants t
JOIN cdp.order_events o ON t.tenant_id = o.tenant_id
GROUP BY tenant_name
HAVING CURRENT_DATE - DATE(MAX(order_date)) > 1;
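The thresholds from the Sync Status Check query can also be applied in application code, for example in an alerting script. The sketch below mirrors the CASE expression; the function name `sync_status` and the plain-string labels are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def sync_status(last_sync_at: Optional[datetime], now: datetime) -> str:
    """Classify sync freshness with the same 6-hour and 12-hour thresholds as the SQL."""
    if last_sync_at is None:
        return "stale"  # a NULL last_sync_at falls through to ELSE in the SQL too
    age = now - last_sync_at
    if age < timedelta(hours=6):
        return "current"
    if age < timedelta(hours=12):
        return "behind"
    return "stale"
```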

Troubleshooting

Common Issues

Orders Not Updating

  1. Check last_sync_at in the cdp.tenants table
  2. Verify sync_enabled = true
  3. Check Railway worker logs: railway logs --service smart-sync-worker

Duplicate Orders

  • The system checks for an existing order_id before each insert
  • Multi-policy orders are distinguished by the sales_channel field
  • Only 3 duplicates detected in 155K orders (0.002%)
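A minimal sketch of the duplicate check described above. It assumes the uniqueness key is the pair (order_id, sales_channel), which is an interpretation of the two bullets rather than a confirmed schema detail; `filter_new_orders` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List, Optional, Set, Tuple

OrderKey = Tuple[str, Optional[str]]


def filter_new_orders(
    incoming: Iterable[Dict[str, Any]],
    existing_keys: Set[OrderKey],
) -> List[Dict[str, Any]]:
    """Return only the orders whose (order_id, sales_channel) key is not already stored."""
    seen = set(existing_keys)  # copy so the caller's set is not mutated
    fresh = []
    for order in incoming:
        key = (order["order_id"], order.get("sales_channel"))
        if key in seen:
            continue  # already stored, or repeated within this batch
        seen.add(key)
        fresh.append(order)
    return fresh
```

In a relational database the same guarantee is usually enforced with a unique constraint on the key columns, so that concurrent workers cannot both insert the same order.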

Missing Historical Data

  1. Verify go_live_date is set correctly
  2. Check initial_sync_completed status
  3. Review checkpoint table for gaps
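Step 3 can be sketched as a coverage check. The layout of cdp.sync_checkpoints is not documented here, so this example assumes each checkpoint can be reduced to a (start, end) date window with an exclusive end; `find_gaps` is a hypothetical helper.

```python
from datetime import date
from typing import Iterable, List, Tuple

Window = Tuple[date, date]  # (start, end), end exclusive


def find_gaps(covered: Iterable[Window], go_live: date, today: date) -> List[Window]:
    """Return the date windows between go_live and today that no checkpoint covers."""
    if go_live >= today:
        return []
    gaps: List[Window] = []
    cursor = go_live  # everything before the cursor is known to be covered
    for start, end in sorted(covered):
        if start > cursor:
            gaps.append((cursor, min(start, today)))
        cursor = max(cursor, end)
        if cursor >= today:
            break
    if cursor < today:
        gaps.append((cursor, today))
    return gaps
```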

Manual Sync Trigger

# Force full sync for specific tenant
SYNC_MODE=full python scripts/smart_sync_hybrid_inverse.py --tenant-id 20

# Run incremental sync manually
python scripts/incremental_sync.py

# Check sync progress
python check_sync_progress.py

Performance Optimization

Best Practices

  1. Batch Processing: 50 orders per page for quick responses
  2. Checkpoint System: Resume from last successful point
  3. Off-peak Historical: Heavy processing during night hours
  4. Connection Pooling: Reuse database connections
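The first two practices can be combined in a small generator. This is a sketch only: `pages` is a hypothetical helper, and it assumes the checkpoint records the number of the next page to process.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pages(
    items: Sequence[T],
    page_size: int = 50,
    start_page: int = 0,
) -> Iterator[Tuple[int, Sequence[T]]]:
    """Yield (page_number, page) pairs, skipping pages before `start_page`."""
    for page_no, offset in enumerate(range(0, len(items), page_size)):
        if page_no < start_page:
            continue  # already processed before the interruption
        yield page_no, items[offset:offset + page_size]
```

A caller would save `page_no + 1` as the checkpoint after each page is written, then pass that value as `start_page` when resuming.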

Resource Usage

  • CPU: ~10-15% during sync
  • Memory: ~200-300MB per worker
  • Network: ~50-100 requests/minute to VTEX API
  • Database: ~500-1000 inserts/minute

Integration with CDP

After synchronization completes:

  1. Real-time Events trigger customer profile updates
  2. RFM Segmentation recalculates automatically
  3. CLV Predictions update for affected customers
  4. Journey Stages progress based on new activity

Future Enhancements

  • Real-time webhook integration
  • Parallel tenant processing
  • Automatic retry mechanism
  • Data quality validation
  • Sync performance dashboard

Last updated: September 20, 2025