Skip to content

Storage Layer - Persistent Data & Audit Trail

Storage Layer: Core Idea & Novel Contributions

What is the Storage Layer?

The Storage Layer persists all data, maintains complete audit trails, enables compliance, and supports efficient querying for analytics and reporting.

Core Responsibilities

1. Primary Data Storage

PostgreSQL Database

  • Relational database storing all operational data
  • ACID compliance (Atomicity, Consistency, Isolation, Durability)
  • Full backup and recovery capabilities
  • Read replicas for query distribution
  • Optimized indices for common queries

Data Includes

  • Organization profiles and user accounts
  • Assessment responses and scoring
  • Action plans and progress tracking
  • Stakeholder data and engagement records
  • Financial and operational metrics
  • IoT readings and processed analytics

2. Audit Trail & Compliance Logging

Immutable Log Every action recorded permanently:

Timestamp: 2026-03-15 14:32:04 UTC
User: john.smith@company.com
Action: Create assessment "Q1 2026 Baseline Assessment"
Organization: Acme Corp
IP Address: 203.45.67.89
Status: SUCCESS
Duration: 245ms
Details: Assessment type=ESGETC, Framework=Standard31, Locale=en_US

Retention: 7 years (compliant with financial/regulatory requirements)

Access: Only compliance officers can view full audit logs

Immutability: Cannot be modified, only appended to (tamper-proof)

3. Data Encryption at Rest

Field-Level Encryption Sensitive data encrypted even in storage:

Standard fields (unencrypted):
- Organization name: "Acme Corp"
- Assessment date: "2026-03-15"
Sensitive fields (encrypted):
- Revenue: [encrypted]
- Employee email: [encrypted]
- Benefits data: [encrypted]
- Financial metrics: [encrypted]

Encryption Key Management

  • Keys stored in separate HSM (Hardware Security Module)
  • Keys rotated annually
  • No keys in source code or configuration
  • Access to keys logged and audited

4. Data Backup & Recovery

Backup Strategy

Production Database
(Continuous)
Real-time failover replica (hot standby)
(Can take over in <1 second)
(Hourly)
Daily snapshot (30-day retention)
(Weekly)
Long-term archive (7-year retention)

Recovery Objectives (RTO/RPO)

  • RTO (Recovery Time Objective): 99.9% uptime = ~45 minutes downtime/year
  • RPO (Recovery Point Objective): Real-time replication = <1 second data loss

Testing: Disaster recovery drils quarterly


Novel Contributions

1. Federated Data Model

Platform maintains both:

  • Canonical model: Standardized ESGETC schema for analysis
  • Source model: Original data kept as-is for audit trail
  • Lineage: Complete path from original → transformation → canonical

Benefits:

  • No data loss through normalization
  • Can recreate any calculation (reproducibility)
  • Audit trail shows exactly how metric was derived
  • Regulatory compliance (showing source data)

2. Temporal Data Storage

Stores not just current state, but complete history:

2026-03-15: Environmental score = 68
2026-02-15: Environmental score = 65 (historical)
2026-01-15: Environmental score = 62 (historical)
Can answer:
- "What's the current score?" (68)
- "How has this changed?" (trend: +6 over 2 months)
- "What was it on specific date?" (exact historical values)

Enables:

  • Trend analysis
  • Before/after assessment
  • Learning from past decisions

3. Row-Level Security (RLS)

Fine-grained access control:

User: CEO of Acme Corp
Can see: Only Acme Corp's data
User: Analyst at consulting firm
Can see: Only organizations that consented to share with consultants
User: Researcher at university
Can see: Anonymized aggregated data only
User: Admin
Can see: Everything (with full audit logging)

Implemented via:

  • Database role permissions
  • Application-level checks
  • Encryption key restrictions for sensitive fields

4. Data Deduplication & Versioning

Handles: Same organization appears multiple times

"Apple Inc" from Company database
"Apple" from survey response
"AAPL" from stock data
"Apple Inc." from Wikidata
System resolves to:
"Apple Inc" (canonical)
All historical versions kept for audit
Can revert to old version if needed

Benefits:

  • True single source of truth
  • No duplicate data inflating counts
  • Clear audit trail of merges

Technical Architecture

Database Schema

Core Tables:

-- Organizations
organizations (org_id, name, sector, country, ...)
-- Assessments
assessments (assessment_id, org_id, date, dimensions, ...)
assessment_responses (response_id, question_id, answer, ...)
-- Action Plans
action_plans (plan_id, org_id, status, ...)
actions (action_id, plan_id, description, owner, ...)
-- Users & Permissions
users (user_id, email, org_id, ...)
user_roles (user_id, role, permissions, ...)
-- Audit Trail
audit_log (log_id, user_id, action, resource, timestamp, ...)
-- Time Series Data
metric_timeseries (metric_id, org_id, date, value, ...)

Query Optimization

Indices

  • Org + Date (for fast org-specific queries)
  • Metric + Date (for time series queries)
  • User + Organization (for access control)

Materialized Views

  • Pre-computed summaries (dimension scores, trends)
  • Refreshed nightly
  • Dramatically faster for reporting

Partitioning

  • Monthly partitions for audit logs
  • Quarterly partitions for assessment data
  • Allows fast pruning of old data

Performance

Query Performance

  • 90% of queries complete in <100ms
  • Dashboard loads in <1 second
  • Bulk exports (1M rows) complete in <30 seconds

Scalability

  • Current: 10M+ records
  • Handles 1000+ concurrent users
  • Can scale to 100M+ records with partitioning

Data Governance

Data Dictionary

All fields documented:

  • Field name and type
  • Business definition
  • Unit of measurement (if numeric)
  • Range of valid values
  • Data owner

Example:

Field: annual_revenue
Type: decimal
Unit: USD (thousands)
Definition: Total annual revenue in previous fiscal year
Valid Range: 0 to 999,999
Data Owner: CFO/Finance team
Update Frequency: Annual

Data Quality Controls

Validation Rules

- Revenue: Must be positive
- Headcount: Must be integer >= 1
- Email: Must match email format
- Country: Must be valid ISO code
- Date: Cannot be in future

Constraints

  • Referential integrity (no orphaned records)
  • Uniqueness (no duplicate organizations)
  • Not null (all required fields filled)

Correction Process

  1. Data error detected (human or automated)
  2. Flagged for review
  3. Correction made with audit trail
  4. Downstream analytics recalculated
  5. Stakeholders notified if score changed significantly

Compliance & Regulations

SOC2 Compliance

  • Encryption in transit (TLS 1.3)
  • Encryption at rest (AES-256)
  • Access controls (role-based)
  • Audit logging (immutable)
  • Incident response (24-hour response)

GDPR Compliance

  • Personal data stored separately
  • Encryption of personal fields
  • Data subject rights (access, deletion, portability)
  • Data protection by design
  • Privacy impact assessments

Backup & Disaster Recovery

  • Backups in geographically distributed locations
  • Regular recovery drills
  • <1 hour RTO for critical failures
  • Testing and validation quarterly

Best Practices

1. Data Quality First

Investment in data quality pays dividends forever.

2. Document Everything

Data dictionary, lineage, transformations all documented.

3. Test Thoroughly

Backup/recovery tested regularly before disaster strikes.

4. Monitor Continuously

Disk space, query performance, backup status all monitored.

5. Plan for Growth

Design storage to handle 10x current volume.


Next Steps