Storage Layer - Persistent Data & Audit Trail
Storage Layer: Core Idea & Novel Contributions
What is the Storage Layer?
The Storage Layer persists all data, maintains complete audit trails, enables compliance, and supports efficient querying for analytics and reporting.
Core Responsibilities
1. Primary Data Storage
PostgreSQL Database
- Relational database storing all operational data
- ACID compliance (Atomicity, Consistency, Isolation, Durability)
- Full backup and recovery capabilities
- Read replicas for query distribution
- Optimized indices for common queries
Data Includes
- Organization profiles and user accounts
- Assessment responses and scoring
- Action plans and progress tracking
- Stakeholder data and engagement records
- Financial and operational metrics
- IoT readings and processed analytics
2. Audit Trail & Compliance Logging
Immutable Log Every action recorded permanently:
Timestamp: 2026-03-15 14:32:04 UTCUser: john.smith@company.comAction: Create assessment "Q1 2026 Baseline Assessment"Organization: Acme CorpIP Address: 203.45.67.89Status: SUCCESSDuration: 245msDetails: Assessment type=ESGETC, Framework=Standard31, Locale=en_USRetention: 7 years (compliant with financial/regulatory requirements)
Access: Only compliance officers can view full audit logs
Immutability: Cannot be modified, only appended to (tamper-proof)
3. Data Encryption at Rest
Field-Level Encryption Sensitive data encrypted even in storage:
Standard fields (unencrypted):- Organization name: "Acme Corp"- Assessment date: "2026-03-15"
Sensitive fields (encrypted):- Revenue: [encrypted]- Employee email: [encrypted]- Benefits data: [encrypted]- Financial metrics: [encrypted]Encryption Key Management
- Keys stored in separate HSM (Hardware Security Module)
- Keys rotated annually
- No keys in source code or configuration
- Access to keys logged and audited
4. Data Backup & Recovery
Backup Strategy
Production Database ↓ (Continuous) ↓ Real-time failover replica (hot standby) (Can take over in <1 second) ↓ (Hourly) ↓ Daily snapshot (30-day retention) ↓ (Weekly) ↓ Long-term archive (7-year retention)Recovery Objectives (RTO/RPO)
- RTO (Recovery Time Objective): 99.9% uptime = ~45 minutes downtime/year
- RPO (Recovery Point Objective): Real-time replication = <1 second data loss
Testing: Disaster recovery drils quarterly
Novel Contributions
1. Federated Data Model
Platform maintains both:
- Canonical model: Standardized ESGETC schema for analysis
- Source model: Original data kept as-is for audit trail
- Lineage: Complete path from original → transformation → canonical
Benefits:
- No data loss through normalization
- Can recreate any calculation (reproducibility)
- Audit trail shows exactly how metric was derived
- Regulatory compliance (showing source data)
2. Temporal Data Storage
Stores not just current state, but complete history:
2026-03-15: Environmental score = 682026-02-15: Environmental score = 65 (historical)2026-01-15: Environmental score = 62 (historical)
Can answer:- "What's the current score?" (68)- "How has this changed?" (trend: +6 over 2 months)- "What was it on specific date?" (exact historical values)Enables:
- Trend analysis
- Before/after assessment
- Learning from past decisions
3. Row-Level Security (RLS)
Fine-grained access control:
User: CEO of Acme CorpCan see: Only Acme Corp's data
User: Analyst at consulting firmCan see: Only organizations that consented to share with consultants
User: Researcher at universityCan see: Anonymized aggregated data only
User: AdminCan see: Everything (with full audit logging)Implemented via:
- Database role permissions
- Application-level checks
- Encryption key restrictions for sensitive fields
4. Data Deduplication & Versioning
Handles: Same organization appears multiple times
"Apple Inc" from Company database"Apple" from survey response"AAPL" from stock data"Apple Inc." from Wikidata
System resolves to:"Apple Inc" (canonical)All historical versions kept for auditCan revert to old version if neededBenefits:
- True single source of truth
- No duplicate data inflating counts
- Clear audit trail of merges
Technical Architecture
Database Schema
Core Tables:
-- Organizationsorganizations (org_id, name, sector, country, ...)
-- Assessmentsassessments (assessment_id, org_id, date, dimensions, ...)assessment_responses (response_id, question_id, answer, ...)
-- Action Plansaction_plans (plan_id, org_id, status, ...)actions (action_id, plan_id, description, owner, ...)
-- Users & Permissionsusers (user_id, email, org_id, ...)user_roles (user_id, role, permissions, ...)
-- Audit Trailaudit_log (log_id, user_id, action, resource, timestamp, ...)
-- Time Series Datametric_timeseries (metric_id, org_id, date, value, ...)Query Optimization
Indices
- Org + Date (for fast org-specific queries)
- Metric + Date (for time series queries)
- User + Organization (for access control)
Materialized Views
- Pre-computed summaries (dimension scores, trends)
- Refreshed nightly
- Dramatically faster for reporting
Partitioning
- Monthly partitions for audit logs
- Quarterly partitions for assessment data
- Allows fast pruning of old data
Performance
Query Performance
- 90% of queries complete in <100ms
- Dashboard loads in <1 second
- Bulk exports (1M rows) complete in <30 seconds
Scalability
- Current: 10M+ records
- Handles 1000+ concurrent users
- Can scale to 100M+ records with partitioning
Data Governance
Data Dictionary
All fields documented:
- Field name and type
- Business definition
- Unit of measurement (if numeric)
- Range of valid values
- Data owner
Example:
Field: annual_revenueType: decimalUnit: USD (thousands)Definition: Total annual revenue in previous fiscal yearValid Range: 0 to 999,999Data Owner: CFO/Finance teamUpdate Frequency: AnnualData Quality Controls
Validation Rules
- Revenue: Must be positive- Headcount: Must be integer >= 1- Email: Must match email format- Country: Must be valid ISO code- Date: Cannot be in futureConstraints
- Referential integrity (no orphaned records)
- Uniqueness (no duplicate organizations)
- Not null (all required fields filled)
Correction Process
- Data error detected (human or automated)
- Flagged for review
- Correction made with audit trail
- Downstream analytics recalculated
- Stakeholders notified if score changed significantly
Compliance & Regulations
SOC2 Compliance
- Encryption in transit (TLS 1.3)
- Encryption at rest (AES-256)
- Access controls (role-based)
- Audit logging (immutable)
- Incident response (24-hour response)
GDPR Compliance
- Personal data stored separately
- Encryption of personal fields
- Data subject rights (access, deletion, portability)
- Data protection by design
- Privacy impact assessments
Backup & Disaster Recovery
- Backups in geographically distributed locations
- Regular recovery drills
- <1 hour RTO for critical failures
- Testing and validation quarterly
Best Practices
1. Data Quality First
Investment in data quality pays dividends forever.
2. Document Everything
Data dictionary, lineage, transformations all documented.
3. Test Thoroughly
Backup/recovery tested regularly before disaster strikes.
4. Monitor Continuously
Disk space, query performance, backup status all monitored.
5. Plan for Growth
Design storage to handle 10x current volume.