Source Layer - Data Collection & Integration

Source Layer: Core Idea & Novel Contributions

What is the Source Layer?

The Source Layer is responsible for collecting data from every supported source. It is the entry point all data passes through before integration and processing.

Core Responsibilities

1. Data Collection Methods

Direct Input

  • Web forms and surveys (assessments, stakeholder input)
  • CSV/Excel file uploads (batch data import)
  • API submissions from external systems
  • Manual data entry (for specialized fields)

Automated Integration

  • REST API connections from ERP/CRM/HR systems
  • IoT sensors (production facilities, environmental monitors)
  • Webhooks from third-party services (Salesforce, HubSpot)
  • Database direct connections (read-only access)

Third-Party Databases

  • World Bank Indicators API
  • UNDP SDG databases
  • ESG rating agencies
  • OpenStreetMap geographic data
  • Wikidata entity information
  • Government statistical bureaus

2. Production-Grade IoT Data Pipeline

The platform includes a production-ready IoT infrastructure for real-time operational data:

Current Capabilities

  • Supports 1,000+ simultaneous device connections
  • Processes 1M+ events weekly
  • <5 second latency from sensor to dashboard
  • 99.9% uptime SLA
  • Hardware compatibility: Modbus, MQTT, HTTP/HTTPS

Data Types

  • Temperature, humidity, air quality sensors
  • Equipment operational metrics (runtime, cycles, errors)
  • Power consumption and energy metrics
  • Production volume and quality metrics
  • Safety and compliance indicators

Reliability & Monitoring

  • Automatic retry with exponential backoff
  • Device heartbeat monitoring
  • Anomaly detection for sensor malfunctions
  • Data validation and sanitization
  • Compression for bandwidth optimization
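
The retry behavior above can be sketched as a delay that doubles on each failed attempt. `sendWithRetry`, its defaults, and the jitter term are illustrative, not the platform's actual implementation:

```typescript
// Sketch of automatic retry with exponential backoff for event delivery.
// The `send` callback stands in for the real transport (MQTT/HTTP/etc.).
async function sendWithRetry(
  send: () => Promise<void>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await send();
      return; // delivered successfully
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // give up after the final attempt
      // Exponential backoff: base, 2x, 4x, ... plus a little jitter
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 50;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

The jitter term spreads out retries from many devices so they do not all hit the server at the same instant.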

3. Data Format Normalization

All incoming data is converted to standard formats:

```typescript
interface RawDataPoint {
  sourceId: string;               // Which system it came from
  timestamp: Date;                // When it was collected
  organizationId: string;         // Which org it belongs to
  dimension: string;              // ESGETC dimension
  value: number;                  // The actual data
  unit: string;                   // Unit of measurement
  metadata: Record<string, any>;  // Context
}
```
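
As a sketch of how an incoming payload might be mapped onto this shape, the following normalizes a hypothetical temperature-sensor message. The `SensorPayload` fields and the `"environmental"` dimension label are assumptions for illustration, and the interface is repeated so the example is self-contained:

```typescript
interface RawDataPoint {
  sourceId: string;
  timestamp: Date;
  organizationId: string;
  dimension: string;
  value: number;
  unit: string;
  metadata: Record<string, any>;
}

// Hypothetical incoming payload from a temperature sensor.
interface SensorPayload {
  deviceId: string;
  orgId: string;
  readingC: number;
  takenAt: string; // ISO 8601 timestamp
}

// Map a source-specific payload onto the standard RawDataPoint shape.
function normalize(p: SensorPayload): RawDataPoint {
  return {
    sourceId: p.deviceId,
    timestamp: new Date(p.takenAt),
    organizationId: p.orgId,
    dimension: "environmental",
    value: p.readingC,
    unit: "°C",
    metadata: { transport: "mqtt" },
  };
}
```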

4. Data Validation & Quality

Automatic Checks

  • Range validation (values within expected bounds)
  • Type validation (string vs number vs date)
  • Required field validation
  • Format validation (email, URL, date formats)
  • Referential integrity (foreign keys exist)
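
A minimal sketch of a few of these checks on a single data point; the required-field list and the temperature-style bounds are illustrative stand-ins for the real rules:

```typescript
type CheckResult = { ok: boolean; errors: string[] };

function validatePoint(point: Record<string, unknown>): CheckResult {
  const errors: string[] = [];

  // Required-field validation
  for (const field of ["sourceId", "timestamp", "value", "unit"]) {
    if (point[field] === undefined || point[field] === null) {
      errors.push(`missing required field: ${field}`);
    }
  }

  // Type validation: value must be a number
  if (point.value !== undefined && typeof point.value !== "number") {
    errors.push("value must be a number");
  }

  // Range validation: e.g. a temperature reading within plausible bounds
  if (typeof point.value === "number" && (point.value < -90 || point.value > 60)) {
    errors.push("value outside expected range");
  }

  return { ok: errors.length === 0, errors };
}
```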

Quality Scoring

Each data point gets a quality score (0-100 points):

  • Completeness: 0-25 points (all required fields)
  • Accuracy: 0-25 points (matches known benchmarks)
  • Timeliness: 0-25 points (how recent is the data)
  • Consistency: 0-25 points (aligns with related data)

Data below 50% quality triggers a review workflow before inclusion in analysis.
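
The rubric above adds four 0-25 components into a 0-100 score, with the 50-point review threshold. In this sketch the component values are taken as given rather than computed by the platform's real heuristics:

```typescript
interface QualityComponents {
  completeness: number; // 0-25: all required fields present
  accuracy: number;     // 0-25: matches known benchmarks
  timeliness: number;   // 0-25: how recent the data is
  consistency: number;  // 0-25: aligns with related data
}

const REVIEW_THRESHOLD = 50;

function qualityScore(c: QualityComponents): number {
  // Clamp each component to its 0-25 band before summing.
  const clamp = (n: number) => Math.max(0, Math.min(25, n));
  return clamp(c.completeness) + clamp(c.accuracy) + clamp(c.timeliness) + clamp(c.consistency);
}

function needsReview(c: QualityComponents): boolean {
  return qualityScore(c) < REVIEW_THRESHOLD;
}
```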


Novel Contributions

1. Federated Data Collection

Unlike competitors that force all data into a single model, we support multiple data schemas simultaneously:

  • Organization A uses quarterly reports
  • Organization B has monthly IoT streams
  • Organization C uploads raw spreadsheets
  • All can be assessed on same ESGETC framework

2. NLP-Based Entity Discovery & Classification

We don’t just accept what organizations tell us. The platform:

  • Crawls public data (Wikidata, web, databases) to discover organizations
  • Extracts mentions of sustainability work through NLP
  • Classifies entities automatically (business, NGO, university, etc.)
  • Deduplicates organizations that appear multiple times
  • Verifies claims against World Bank and UN partnerships

Learn more about entity discovery →

3. Adaptive Intelligence Weighting System

The data collection itself adapts:

  • Learns from past assessments which questions are most predictive
  • Allocates survey effort to high-impact areas
  • Skips or elaborates based on initial answers
  • Personalizes collection to organizational context

Instead of “ask 300 questions,” the system asks 30-80 optimized questions.
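
One way to picture the optimized question set: rank the candidate pool by a learned predictiveness weight and keep only a fixed budget. This is a toy sketch; the real weighting model behind the ranking is not shown here:

```typescript
interface Question {
  id: string;
  text: string;
  weight: number; // learned predictiveness: higher = more informative
}

// Keep only the `budget` most predictive questions instead of asking them all.
function selectQuestions(pool: Question[], budget: number): Question[] {
  return [...pool].sort((a, b) => b.weight - a.weight).slice(0, budget);
}
```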

4. Baseline Analysis & AI Recommendations

When data is first collected, the system:

  • Detects baseline values for each dimension
  • Identifies trends (improving/declining over time)
  • Surfaces anomalies (this organization differs sharply from its peers)
  • Recommends next steps (where to focus improvement efforts)

All before the organization answers a single planning question.


Technical Architecture

Data Ingestion Pipeline

```
┌─ API Endpoint
│   ├─ POST /api/v1/data/ingest
│   ├─ Validation layer
│   ├─ Authentication check
│   └─ Rate limiting
├─ Async Job Queue
│   ├─ Parse format
│   ├─ Transform to standard schema
│   ├─ Run quality checks
│   └─ Store raw data
├─ Real-time Stream
│   ├─ Event deduplication
│   ├─ Immediate validation
│   ├─ Cache warm-up
│   └─ Trigger alerts if needed
└─ File Upload
    ├─ Virus scan
    ├─ Parse (CSV/Excel/JSON)
    ├─ Validate structure
    └─ Queue for batch processing
```
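
The gate order on the API path (authentication, rate limiting, validation, hand-off to the async queue) can be sketched as a plain function. The key set, limit, and status codes here are illustrative, not the platform's real API contract:

```typescript
interface IngestRequest {
  apiKey: string;
  body: unknown;
}

const queue: unknown[] = [];
const VALID_KEYS = new Set(["demo-key"]);
const RATE_LIMIT = 100;
let requestsThisWindow = 0;

function ingest(req: IngestRequest): { status: number } {
  // Authentication check
  if (!VALID_KEYS.has(req.apiKey)) return { status: 401 };
  // Rate limiting (per time window; window reset omitted here)
  if (++requestsThisWindow > RATE_LIMIT) return { status: 429 };
  // Validation layer: reject non-object payloads
  if (req.body === null || typeof req.body !== "object") return { status: 400 };
  // Hand off to the async job queue for parsing and quality checks
  queue.push(req.body);
  return { status: 202 };
}
```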

Data Sources Currently Integrated

| Source Type | Examples                             | Frequency     |
|-------------|--------------------------------------|---------------|
| APIs        | Salesforce, HubSpot, Stripe, Shopify | Real-time     |
| Databases   | PostgreSQL, MySQL, SQL Server        | On-demand     |
| Files       | CSV, Excel, JSON, PDF                | Manual upload |
| Forms       | Web surveys, mobile app              | On submission |
| IoT         | Sensors, equipment, devices          | Real-time     |
| Public Data | World Bank, UNDP, OpenStreetMap      | Weekly sync   |
| Webhooks    | Slack, Zapier, custom                | Real-time     |

Security at the Source

Data Protection

  • TLS/SSL Encryption: All data in transit encrypted
  • Field-Level Encryption: Sensitive fields encrypted at rest
  • API Key Rotation: Automatic key management
  • Audit Logging: Every data point logged with timestamp and user

Access Control

  • Source Authorization: Each integration requires explicit approval
  • Scope Limiting: Integrations get minimum necessary permissions
  • IP Whitelisting: Optional for on-premise connections
  • Rate Limiting: Prevent abuse and DoS attacks

Data Validation

  • Malware Scanning: File uploads checked for malware
  • SQL Injection Prevention: Parameterized queries
  • XSS Protection: Input sanitization
  • CORS Protection: Only trusted domains allowed

Supported Data Sources

Enterprise Systems

  • SAP - Sales, procurement, financials
  • Oracle - General enterprise data
  • NetSuite - Multi-subsidiary data
  • Salesforce - Customer and sales data
  • HubSpot - Marketing and sales data

Operational Systems

  • MES (Manufacturing) - Production metrics
  • SCADA - Equipment monitoring
  • BMS (Building) - Energy and operations
  • WMS - Warehouse management

Financial Systems

  • QuickBooks - Accounting data
  • FreshBooks - Invoice and expense data
  • Xero - Multi-currency accounting

Geographic/Environmental

  • OpenWeather - Climate data
  • AirVisual - Air quality
  • OpenStreetMap - Geographic boundaries
  • Sentinel-2 - Satellite imagery

Best Practices

1. Data Quality First

Start with high-quality data. Garbage in = garbage out.

2. Start with APIs

Automated data is more reliable than manual entry.

3. Plan Schema Early

Understand which fields you need before collecting vast amounts.

4. Verify Baseline

Check initial data looks reasonable before diving into analysis.

5. Integrate Gradually

Start with one system, add others as comfort grows.


Next Steps