Integration Layer - Data Normalization & Consolidation

Integration Layer: Core Idea & Novel Contributions

What is the Integration Layer?

The Integration Layer takes data from the Source Layer in all its messy diversity and transforms it into a standardized, consolidated format that downstream systems can work with.

Core Responsibilities

1. Schema Standardization

Challenge: Each data source uses different field names, units, and structures

Salesforce calls it “Employees”: 500
HR system calls it “Headcount”: 5.0E+2 (scientific notation)
CSV calls it “Staff_Count”: “approx 500”

Solution: Map all sources to canonical ESGETC schema

interface StandardizedDataPoint {
  // Identification
  organizationId: string;
  dimension: 'ECONOMIC' | 'SOCIAL' | 'ENVIRONMENTAL' | 'GOVERNANCE' | 'TECHNOLOGICAL' | 'CONNECTEDNESS';
  indicator: string;  // Canonical indicator name

  // Value
  value: number;
  valueType: 'ACTUAL' | 'TARGET' | 'TREND' | 'INDEX';
  unit: string;  // Normalized (employees, tons, percentage, etc)

  // Context
  timeperiod: {
    start: Date;
    end: Date;
    frequency: 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'ANNUAL';
  };

  // Quality & Lineage
  sourceSystem: string;
  collectionMethod: 'SURVEY' | 'API' | 'IoT' | 'FILE_UPLOAD' | 'MANUAL';
  qualityScore: number;  // 0-100%
  provenance: {
    originalValue: string;
    transformationApplied: string;
    transformedBy: string;
    transformedAt: Date;
  };

  // Relationships
  geoLocation?: {
    country: string;
    region?: string;
    city?: string;
    coordinates?: [number, number];
  };

  metadata: Record<string, any>;
}

2. Unit Conversion

Converts all measurements to standard units:

Category	Conversions Supported
Energy	kWh, MWh, BTU, Joules, natural gas units
Mass	kg, tonnes, lbs, oz
Volume	m³, gallons, liters, acre-feet
Distance	km, miles, meters
Time	hours, days, months, years
Money	USD, EUR, GBP, JPY, etc (with FX rates)
Concentration	ppm, ppb, mg/L, ug/m³

3. Data Deduplication

Problem: Same organization appears in multiple sources

“Apple Inc” from Company database
“Apple” from survey response
“AAPL” from stock data
“Apple Inc.” from Wikidata

Solution: Pairwise matching algorithm

Fuzzy string matching (handles typos, abbreviations)
Geographic proximity (entities in same location likely same org)
Relationship inference (if both reference same parent company)
Manual verification (flagged for human review)

Result: Single unified org record with all data merged.

4. Time Series Alignment

Consolidates data with different collection frequencies:

Source A (quarterly):        Q1 --- Q2 --- Q3 --- Q4
Source B (monthly):          M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
Source C (IoT continuous):   ████████████████████████████

Unified model: Store all data at finest granularity (daily)
Last-value-carry-forward for sparse sources
Interpolation available for time-series models

5. Geographic Harmonization

Aligns geographic references:

“New York” → State code plus city
“USA” → Country code
Coordinates (39.7817, -89.6501) → Administrative hierarchy
“Springfield, IL” → Standardized to: USA/IL/Springfield

Creates unified geographic schema enabling:

Regional aggregation
Distance calculations
“50-mile radius” queries
Heat mapping

Novel Contributions

1. Federated Data Model

Unlike traditional data warehouses that force normalization upfront, we maintain multiple simultaneous schemas:

Canonical Schema: Standardized ESGETC model
Source Schema: Original data kept as-is
Transformed Schema: Intermediate transformations

Benefits:

No data loss through normalization
Audit trail preserved (how we got from original to canonical)
Flexibility to maintain source-specific context
Reversible transformations for data recovery

2. Intelligent Default Assumptions

When data is incomplete, the system makes intelligent assumptions:

Pattern Recognition

If organization reported $5M revenue for 3 years, trends upward → estimate next month
If competitor has 20% employee turnover, similar company likely has similar rate
If region has $X per-capita income, adjust individual org benchmarks

Contextual Inference

If org is in renewable energy sector, environmental dimension likely more important
If NGO operates in rural Kenya, social dimension likely most material
If manufacturer in California, governance dimension affected by state regulations

Conservative Default

When in doubt, flag for human review
Use weighted averages (trust good data more than estimated)
Confidence intervals provide uncertainty bounds

3. Adaptive Consolidation Levels

Customize integration depth per organization type:

Quick Integration (Small NGO, 20 hours effort)

Basic CSV upload
Manual data entry for key fields
No external system connections

Standard Integration (Mid-size company, 40 hours)

One ERP system connected
Quarterly data refresh
Semi-automated

Deep Integration (Large enterprise, 100+ hours)

Multiple systems (ERP, CRM, HR, IoT)
Real-time or daily sync
Advanced transformations

4. Supply Chain Traceability

Goes beyond simple data transformation to track lineage:

Which original field came from which source system
What transformation was applied
When the transformation was applied
Who approved the transformation
What is the confidence level in the result

Enables:

“Why is my economic score 65?” → See exact calculation
Data quality investigation (trace bad data to source)
Audit compliance (full provenance trail)
Reproducibility (exactly replicate transformation)

Technical Architecture

ETL Pipeline

Extract → Transform → Load → Validate
   ↓           ↓          ↓        ↓
Source    Normalize  UnifiedDB  QualityCheck
Systems   + Dedupe             ↓
          + Harmonize      Analytics DB
          + Enrich ←── Reference Data

Reference Data Management

Central repository of standardized values:

├─ Organizations (Wikidata, World Bank)
├─ Geographic Hierarchies (country/region/city)
├─ Economic Sectors (NAICS, ISIC codes)
├─ Sustainability Indicators (GRI, SASB, SDG targets)
├─ Unit Conversions (energy, mass, distance)
├─ Foreign Exchange Rates (updated daily)
└─ Benchmarks (industry averages by size/sector/region)

Data Quality Pipeline

Input → Validation → Scoring → Decision
         ├─ Type
         ├─ Range
         ├─ Format
         ├─ Uniqueness
         └─ Referential

Score ≥ 80% → Auto-accept to Analytics DB
Score 50-80% → Flag for review
Score < 50% → Reject, request data re-submission

Integration Patterns

Pattern 1: Real-Time Streaming

For high-frequency data (IoT, financial):

Data Source → Kafka → Stream Processor → Cache → Dashboard
(1000x/sec)  (buffer)  (validate/enrich)  (seconds latency)

Pattern 2: Batch Processing

For periodic updates (financial reports, HR systems):

Data Source → Scheduled Job → Transform → Database
(daily/weekly) (cloud function)  (SQL)

Pattern 3: API Polling

For external APIs without webhooks:

Schedule → API Call → Rate Limit → Transform → Store
(hourly)  (Salesforce)  (queue)    (mapping)

Pattern 4: File Uploads

For user-submitted data:

Upload → Virus Scan → Parse → Validate → Queue → Process
(CSV)   (ClamAV)     (format) (schema)   (async) (transform)

Supported Transformations

Automatic

Unit conversion (BTU → kWh)
Date/time harmonization (parse formats)
case normalization (ABC Corp → ABC Corp)
Whitespace trimming

Semantic

Sector classification (code → NAICS category)
Geographic mapping (“California” → USA/CA)
Currency conversion (JPY → USD)
Percentage/basis point conversion

Statistical

Outlier detection and handling
Missing value imputation
Scaling/normalization for comparisons
Index calculation

Custom

Organization-specific business logic
Domain-specific calculations
Custom mapping tables
Plugin architecture for advanced transformations

Best Practices

1. Start Simple

Begin with basic CSV, add integrations gradually as needs grow.

2. Document Assumptions

Every transformation should be explicit and documented.

3. Validate Early

Check data quality at each step, not just at the end.

4. Preserve Lineage

Never lose track of how data was transformed.

5. Test Transformations

Before applying to production data, test on samples.

6. Version Control

Keep transformation code in Git, enable rollback if needed.