Skip to content

Integration Layer - Data Normalization & Consolidation

Integration Layer: Core Idea & Novel Contributions

What is the Integration Layer?

The Integration Layer takes data from the Source Layer in all its messy diversity and transforms it into a standardized, consolidated format that downstream systems can work with.

Core Responsibilities

1. Schema Standardization

Challenge: Each data source uses different field names, units, and structures

  • Salesforce calls it “Employees”: 500
  • HR system calls it “Headcount”: 5.0E+2 (scientific notation)
  • CSV calls it “Staff_Count”: “approx 500”

Solution: Map all sources to canonical ESGETC schema

interface StandardizedDataPoint {
// Identification
organizationId: string;
dimension: 'ECONOMIC' | 'SOCIAL' | 'ENVIRONMENTAL' | 'GOVERNANCE' | 'TECHNOLOGICAL' | 'CONNECTEDNESS';
indicator: string; // Canonical indicator name
// Value
value: number;
valueType: 'ACTUAL' | 'TARGET' | 'TREND' | 'INDEX';
unit: string; // Normalized (employees, tons, percentage, etc)
// Context
timeperiod: {
start: Date;
end: Date;
frequency: 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'ANNUAL';
};
// Quality & Lineage
sourceSystem: string;
collectionMethod: 'SURVEY' | 'API' | 'IoT' | 'FILE_UPLOAD' | 'MANUAL';
qualityScore: number; // 0-100%
provenance: {
originalValue: string;
transformationApplied: string;
transformedBy: string;
transformedAt: Date;
};
// Relationships
geoLocation?: {
country: string;
region?: string;
city?: string;
coordinates?: [number, number];
};
metadata: Record<string, any>;
}

2. Unit Conversion

Converts all measurements to standard units:

CategoryConversions Supported
EnergykWh, MWh, BTU, Joules, natural gas units
Masskg, tonnes, lbs, oz
Volumem³, gallons, liters, acre-feet
Distancekm, miles, meters
Timehours, days, months, years
MoneyUSD, EUR, GBP, JPY, etc (with FX rates)
Concentrationppm, ppb, mg/L, ug/m³

3. Data Deduplication

Problem: Same organization appears in multiple sources

  • “Apple Inc” from Company database
  • “Apple” from survey response
  • “AAPL” from stock data
  • “Apple Inc.” from Wikidata

Solution: Pairwise matching algorithm

  • Fuzzy string matching (handles typos, abbreviations)
  • Geographic proximity (entities in same location likely same org)
  • Relationship inference (if both reference same parent company)
  • Manual verification (flagged for human review)

Result: Single unified org record with all data merged.

4. Time Series Alignment

Consolidates data with different collection frequencies:

Source A (quarterly): Q1 --- Q2 --- Q3 --- Q4
Source B (monthly): M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
Source C (IoT continuous): ████████████████████████████
Unified model: Store all data at finest granularity (daily)
Last-value-carry-forward for sparse sources
Interpolation available for time-series models

5. Geographic Harmonization

Aligns geographic references:

  • “New York” → State code plus city
  • “USA” → Country code
  • Coordinates (39.7817, -89.6501) → Administrative hierarchy
  • “Springfield, IL” → Standardized to: USA/IL/Springfield

Creates unified geographic schema enabling:

  • Regional aggregation
  • Distance calculations
  • “50-mile radius” queries
  • Heat mapping

Novel Contributions

1. Federated Data Model

Unlike traditional data warehouses that force normalization upfront, we maintain multiple simultaneous schemas:

  • Canonical Schema: Standardized ESGETC model
  • Source Schema: Original data kept as-is
  • Transformed Schema: Intermediate transformations

Benefits:

  • No data loss through normalization
  • Audit trail preserved (how we got from original to canonical)
  • Flexibility to maintain source-specific context
  • Reversible transformations for data recovery

2. Intelligent Default Assumptions

When data is incomplete, the system makes intelligent assumptions:

Pattern Recognition

  • If organization reported $5M revenue for 3 years, trends upward → estimate next month
  • If competitor has 20% employee turnover, similar company likely has similar rate
  • If region has $X per-capita income, adjust individual org benchmarks

Contextual Inference

  • If org is in renewable energy sector, environmental dimension likely more important
  • If NGO operates in rural Kenya, social dimension likely most material
  • If manufacturer in California, governance dimension affected by state regulations

Conservative Default

  • When in doubt, flag for human review
  • Use weighted averages (trust good data more than estimated)
  • Confidence intervals provide uncertainty bounds

3. Adaptive Consolidation Levels

Customize integration depth per organization type:

Quick Integration (Small NGO, 20 hours effort)

  • Basic CSV upload
  • Manual data entry for key fields
  • No external system connections

Standard Integration (Mid-size company, 40 hours)

  • One ERP system connected
  • Quarterly data refresh
  • Semi-automated

Deep Integration (Large enterprise, 100+ hours)

  • Multiple systems (ERP, CRM, HR, IoT)
  • Real-time or daily sync
  • Advanced transformations

4. Supply Chain Traceability

Goes beyond simple data transformation to track lineage:

  • Which original field came from which source system
  • What transformation was applied
  • When the transformation was applied
  • Who approved the transformation
  • What is the confidence level in the result

Enables:

  • “Why is my economic score 65?” → See exact calculation
  • Data quality investigation (trace bad data to source)
  • Audit compliance (full provenance trail)
  • Reproducibility (exactly replicate transformation)

Technical Architecture

ETL Pipeline

Extract → Transform → Load → Validate
↓ ↓ ↓ ↓
Source Normalize UnifiedDB QualityCheck
Systems + Dedupe ↓
+ Harmonize Analytics DB
+ Enrich ←── Reference Data

Reference Data Management

Central repository of standardized values:

├─ Organizations (Wikidata, World Bank)
├─ Geographic Hierarchies (country/region/city)
├─ Economic Sectors (NAICS, ISIC codes)
├─ Sustainability Indicators (GRI, SASB, SDG targets)
├─ Unit Conversions (energy, mass, distance)
├─ Foreign Exchange Rates (updated daily)
└─ Benchmarks (industry averages by size/sector/region)

Data Quality Pipeline

Input → Validation → Scoring → Decision
├─ Type
├─ Range
├─ Format
├─ Uniqueness
└─ Referential
Score ≥ 80% → Auto-accept to Analytics DB
Score 50-80% → Flag for review
Score < 50% → Reject, request data re-submission

Integration Patterns

Pattern 1: Real-Time Streaming

For high-frequency data (IoT, financial):

Data Source → Kafka → Stream Processor → Cache → Dashboard
(1000x/sec) (buffer) (validate/enrich) (seconds latency)

Pattern 2: Batch Processing

For periodic updates (financial reports, HR systems):

Data Source → Scheduled Job → Transform → Database
(daily/weekly) (cloud function) (SQL)

Pattern 3: API Polling

For external APIs without webhooks:

Schedule → API Call → Rate Limit → Transform → Store
(hourly) (Salesforce) (queue) (mapping)

Pattern 4: File Uploads

For user-submitted data:

Upload → Virus Scan → Parse → Validate → Queue → Process
(CSV) (ClamAV) (format) (schema) (async) (transform)

Supported Transformations

Automatic

  • Unit conversion (BTU → kWh)
  • Date/time harmonization (parse formats)
  • case normalization (ABC Corp → ABC Corp)
  • Whitespace trimming

Semantic

  • Sector classification (code → NAICS category)
  • Geographic mapping (“California” → USA/CA)
  • Currency conversion (JPY → USD)
  • Percentage/basis point conversion

Statistical

  • Outlier detection and handling
  • Missing value imputation
  • Scaling/normalization for comparisons
  • Index calculation

Custom

  • Organization-specific business logic
  • Domain-specific calculations
  • Custom mapping tables
  • Plugin architecture for advanced transformations

Best Practices

1. Start Simple

Begin with basic CSV, add integrations gradually as needs grow.

2. Document Assumptions

Every transformation should be explicit and documented.

3. Validate Early

Check data quality at each step, not just at the end.

4. Preserve Lineage

Never lose track of how data was transformed.

5. Test Transformations

Before applying to production data, test on samples.

6. Version Control

Keep transformation code in Git, enable rollback if needed.


Next Steps