Source Layer - Data Collection & Integration
Source Layer: Core Idea & Novel Contributions
What is the Source Layer?
The Source Layer is responsible for collecting data from every supported source. It is the entry point data passes through before integration and processing.
Core Responsibilities
1. Data Collection Methods
Direct Input
- Web forms and surveys (assessments, stakeholder input)
- CSV/Excel file uploads (batch data import)
- API submissions from external systems
- Manual data entry (for specialized fields)
Automated Integration
- REST API connections from ERP/CRM/HR systems
- IoT sensors (production facilities, environmental monitors)
- Webhooks from third-party services (Salesforce, HubSpot)
- Database direct connections (read-only access)
Third-Party Databases
- World Bank Indicators API
- UNDP SDG databases
- ESG rating agencies
- OpenStreetMap geographic data
- Wikidata entity information
- Government statistical bureaus
2. Production-Grade IoT Data Pipeline
The platform includes a production-ready IoT infrastructure for real-time operational data:
Current Capabilities
- Supports 1,000+ simultaneous device connections
- Processes 1M+ events weekly
- <5 second latency from sensor to dashboard
- 99.9% uptime SLA
- Hardware compatibility: Modbus, MQTT, HTTP/HTTPS
Data Types
- Temperature, humidity, air quality sensors
- Equipment operational metrics (runtime, cycles, errors)
- Power consumption and energy metrics
- Production volume and quality metrics
- Safety and compliance indicators
Reliability & Monitoring
- Automatic retry with exponential backoff
- Device heartbeat monitoring
- Anomaly detection for sensor malfunctions
- Data validation and sanitization
- Compression for bandwidth optimization
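The retry behaviour above can be sketched as follows; `retryWithBackoff`, `backoffDelayMs`, and their defaults are illustrative, not the platform's actual API:

```typescript
// Sketch of automatic retry with exponential backoff (illustrative names/defaults).
function backoffDelayMs(attempt: number, baseDelayMs = 200): number {
  // Delay doubles on each attempt: 200ms, 400ms, 800ms, ...
  return baseDelayMs * 2 ** attempt;
}

async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt >= maxAttempts - 1) throw err; // give up after the last attempt
      const jitter = Math.random() * 100;        // spread out retry storms
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt) + jitter));
    }
  }
}
```

Jitter keeps a fleet of devices that failed at the same moment from retrying in lockstep.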
3. Data Format Normalization
All incoming data is converted to standard formats:
```typescript
interface RawDataPoint {
  sourceId: string;               // Which system it came from
  timestamp: Date;                // When it was collected
  organizationId: string;         // Which org it belongs to
  dimension: string;              // ESGETC dimension
  value: number;                  // The actual data
  unit: string;                   // Unit of measurement
  metadata: Record<string, any>;  // Context
}
```
4. Data Validation & Quality
Automatic Checks
- Range validation (values within expected bounds)
- Type validation (string vs number vs date)
- Required field validation
- Format validation (email, URL, date formats)
- Referential integrity (foreign keys exist)
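A minimal sketch of the required-field, type, and range checks (the field names and the [0, 100] bound are hypothetical examples, not the platform's actual rules):

```typescript
type CheckResult = { field: string; ok: boolean; reason: string };

// Run required-field, type, and range checks on one incoming record.
function validateRecord(rec: Record<string, unknown>): CheckResult[] {
  const results: CheckResult[] = [];
  // Required fields must be present and non-null.
  for (const field of ["organizationId", "dimension", "value"]) {
    results.push({
      field,
      ok: rec[field] !== undefined && rec[field] !== null,
      reason: "required",
    });
  }
  const v = rec["value"];
  if (typeof v === "number" && Number.isFinite(v)) {
    results.push({ field: "value", ok: true, reason: "type" });
    // Range check: e.g. a percentage metric must sit in [0, 100].
    results.push({ field: "value", ok: v >= 0 && v <= 100, reason: "range" });
  } else {
    results.push({ field: "value", ok: false, reason: "type" });
  }
  return results;
}
```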
Quality Scoring
Each data point receives a quality score from 0 to 100:
- Completeness: 0-25 points (all required fields)
- Accuracy: 0-25 points (matches known benchmarks)
- Timeliness: 0-25 points (how recent is the data)
- Consistency: 0-25 points (aligns with related data)
Data below 50% quality triggers a review workflow before inclusion in analysis.
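The four-part rubric above can be sketched like this; the component scoring rules (e.g. the linear timeliness decay over a year) are illustrative stand-ins, not the platform's exact formulas:

```typescript
// Inputs to the 4 x 25-point quality score (fractions in 0..1 except ageDays).
interface QualityInputs {
  requiredFieldsPresent: number; // completeness: fraction of required fields filled
  accuracy: number;              // accuracy: agreement with known benchmarks
  ageDays: number;               // timeliness: how old the data point is
  consistency: number;           // consistency: agreement with related data
}

function qualityScore(q: QualityInputs): number {
  const completeness = 25 * q.requiredFieldsPresent;
  const accuracy = 25 * q.accuracy;
  // Illustrative policy: timeliness decays linearly to 0 over a year.
  const timeliness = 25 * Math.max(0, 1 - q.ageDays / 365);
  const consistency = 25 * q.consistency;
  return Math.round(completeness + accuracy + timeliness + consistency);
}

// Scores below 50 trigger the review workflow.
const needsReview = (score: number) => score < 50;
```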
Novel Contributions
1. Federated Data Collection
Unlike competitors who force data into a single model, we support multiple data schemas simultaneously:
- Organization A uses quarterly reports
- Organization B has monthly IoT streams
- Organization C uploads raw spreadsheets
- All can be assessed on the same ESGETC framework
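Concretely, each source gets its own adapter that maps its native shape onto one common record (mirroring the RawDataPoint interface defined earlier). The adapter and field names below are hypothetical:

```typescript
// Common target shape, mirroring the RawDataPoint interface.
interface NormalizedPoint {
  sourceId: string;
  organizationId: string;
  timestamp: Date;
  dimension: string;
  value: number;
  unit: string;
}

// Org A: a row from a quarterly report.
function fromQuarterlyReport(
  orgId: string,
  row: { quarter: string; co2Tonnes: number },
): NormalizedPoint {
  return {
    sourceId: "quarterly-report",
    organizationId: orgId,
    timestamp: new Date(row.quarter),
    dimension: "environment",
    value: row.co2Tonnes,
    unit: "tCO2e",
  };
}

// Org B: an event from a monthly IoT stream.
function fromIotEvent(
  orgId: string,
  ev: { ts: number; kwh: number },
): NormalizedPoint {
  return {
    sourceId: "iot-stream",
    organizationId: orgId,
    timestamp: new Date(ev.ts),
    dimension: "environment",
    value: ev.kwh,
    unit: "kWh",
  };
}
```

Downstream analysis then only ever sees the common shape, regardless of how each organization reports.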
2. NLP-Based Entity Discovery & Classification
We don’t just accept what organizations tell us. The platform:
- Crawls public data (Wikidata, web, databases) to discover organizations
- Extracts mentions of sustainability work through NLP
- Classifies entities automatically (business, NGO, university, etc.)
- Deduplicates organizations that appear multiple times
- Verifies claims against World Bank and UN partnerships
Learn more about entity discovery →
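The deduplication step can be sketched as key-based merging over normalized names; a production system would use fuzzier matching (edit distance, embeddings, registry identifiers), so treat this as an illustration only:

```typescript
// Normalize an organization name into a dedup key: lowercase,
// drop common legal suffixes, strip punctuation.
function normalizeName(name: string): string {
  return name
    .toLowerCase()
    .replace(/\b(inc|ltd|llc|gmbh|corp)\.?\b/g, "") // drop legal suffixes
    .replace(/[^a-z0-9]+/g, " ")                    // strip punctuation
    .trim();
}

// Merge records that collapse to the same key, keeping their sources.
function dedupe(orgs: { name: string; source: string }[]): Map<string, string[]> {
  const merged = new Map<string, string[]>();
  for (const org of orgs) {
    const key = normalizeName(org.name);
    const sources = merged.get(key) ?? [];
    sources.push(org.source);
    merged.set(key, sources);
  }
  return merged;
}
```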
3. Adaptive Intelligence Weighting System
The data collection itself adapts:
- Learns from past assessments which questions are most predictive
- Allocates survey effort to high-impact areas
- Skips or elaborates based on initial answers
- Personalizes collection to organizational context
Instead of “ask 300 questions,” the system asks 30-80 optimized questions.
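One way to sketch this selection: each question carries a predictiveness weight learned from past assessments, and the survey asks only the top questions within a budget for the organization's context (the weights and context filter here are illustrative):

```typescript
interface Question {
  id: string;
  text: string;
  weight: number;      // learned predictiveness from past assessments
  appliesTo: string[]; // org types this question is relevant for
}

// Pick the `budget` most predictive questions relevant to this org type.
function selectQuestions(bank: Question[], orgType: string, budget: number): Question[] {
  return bank
    .filter((q) => q.appliesTo.includes(orgType))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, budget);
}
```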
4. Baseline Analysis & AI Recommendations
When data is first collected, the system:
- Detects baseline values for each dimension
- Identifies trends (improving/declining over time)
- Surfaces anomalies (an organization that differs sharply from its peers)
- Recommends next steps (where to focus improvement efforts)
All before the organization answers a single planning question.
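Peer anomaly detection of this kind can be sketched with a z-score against the peer group; the 3-standard-deviation threshold is an illustrative choice, not the platform's documented setting:

```typescript
// How many standard deviations `value` sits from the peer mean.
function zScore(value: number, peers: number[]): number {
  const mean = peers.reduce((s, x) => s + x, 0) / peers.length;
  const variance = peers.reduce((s, x) => s + (x - mean) ** 2, 0) / peers.length;
  const sd = Math.sqrt(variance);
  return sd === 0 ? 0 : (value - mean) / sd;
}

// Flag values more than `threshold` standard deviations from the peer mean.
const isAnomalous = (value: number, peers: number[], threshold = 3) =>
  Math.abs(zScore(value, peers)) > threshold;
```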
Technical Architecture
Data Ingestion Pipeline
```
┌─── API Endpoint ─────┐
│        ↓
│   ├─ POST /api/v1/data/ingest
│   ├─ Validation layer
│   ├─ Authentication check
│   └─ Rate limiting
│
├─── Async Job Queue ──┐
│        ↓
│   ├─ Parse format
│   ├─ Transform to standard schema
│   ├─ Run quality checks
│   └─ Store raw data
│
├─── Real-time Stream ─┐
│        ↓
│   ├─ Event deduplication
│   ├─ Immediate validation
│   ├─ Cache warm-up
│   └─ Trigger alerts if needed
│
└─── File Upload ──────┐
         ↓
    ├─ Virus scan
    ├─ Parse (CSV/Excel/JSON)
    ├─ Validate structure
    └─ Queue for batch processing
```
Data Sources Currently Integrated
| Source Type | Examples | Frequency |
|---|---|---|
| APIs | Salesforce, HubSpot, Stripe, Shopify | Real-time |
| Databases | PostgreSQL, MySQL, SQL Server | On-demand |
| Files | CSV, Excel, JSON, PDF | Manual upload |
| Forms | Web surveys, mobile app | On submission |
| IoT | Sensors, equipment, devices | Real-time |
| Public Data | World Bank, UNDP, OpenStreetMap | Weekly sync |
| Webhooks | Slack, Zapier, custom | Real-time |
Security at the Source
Data Protection
- TLS/SSL Encryption: All data in transit encrypted
- Field-Level Encryption: Sensitive fields encrypted at rest
- API Key Rotation: Automatic key management
- Audit Logging: Every data point logged with timestamp and user
Access Control
- Source Authorization: Each integration requires explicit approval
- Scope Limiting: Integrations get minimum necessary permissions
- IP Whitelisting: Optional for on-premise connections
- Rate Limiting: Prevent abuse and DoS attacks
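Rate limiting is commonly implemented as a token bucket; a minimal sketch of that approach (not necessarily the platform's exact mechanism):

```typescript
// Token bucket: allows bursts up to `capacity`, then throttles to
// a sustained `refillPerSec` rate.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,     // burst size
    private refillPerSec: number, // sustained requests per second
    now: number = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true if the request may proceed, false if rate-limited.
  allow(now: number = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

One bucket per integration (or API key) keeps a single noisy source from starving the rest of the pipeline.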
Data Validation
- Malware Scanning: File uploads checked for malware
- SQL Injection Prevention: Parameterized queries
- XSS Protection: Input sanitization
- CORS Protection: Only trusted domains allowed
Supported Data Sources
Enterprise Systems
- SAP - Sales, procurement, financials
- Oracle - General enterprise data
- NetSuite - Multi-subsidiary data
- Salesforce - Customer and sales data
- HubSpot - Marketing and sales data
Operational Systems
- MES (Manufacturing) - Production metrics
- SCADA - Equipment monitoring
- BMS (Building) - Energy and operations
- WMS - Warehouse management
Financial Systems
- QuickBooks - Accounting data
- FreshBooks - Invoice and expense data
- Xero - Multi-currency accounting
Geographic/Environmental
- OpenWeather - Climate data
- AirVisual - Air quality
- OpenStreetMap - Geographic boundaries
- Sentinel-2 - Satellite imagery
Best Practices
1. Data Quality First
Start with high-quality data. Garbage in = garbage out.
2. Start with APIs
Automated data is more reliable than manual entry.
3. Plan Schema Early
Understand which fields you need before collecting vast amounts.
4. Verify Baseline
Check initial data looks reasonable before diving into analysis.
5. Integrate Gradually
Start with one system, add others as comfort grows.