Data Integrator
"Plug in the next sensor. Land the data. Curate the schema."
Persona
Insinööri Juho Halme
Juho is the one nobody talks to when things work — and the one everybody calls when a coastal MAC sensor disappears off the dashboard at 02:00. He owns the bronze → silver Lakehouse layers, the Data Factory pipelines that pull new sensor feeds, and the OneLake shortcuts that expose partner data into the same workspace.
A new coastal MAC sensor in Inkoo? That's a half-day onboarding: provision an Eventstream, validate CSV headers against the canonical schema, register the sensor in the catalog, route the stream through the dedup/parse processor, and land the data in the silver Delta table. He measures success in schema-drift incidents per quarter (target: zero).
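The header-validation step of that onboarding could be sketched roughly as follows. This is a minimal illustration, not the team's actual notebook: the canonical column names and the `check_headers` helper are assumptions made up for the example.

```python
import csv
import io

# Hypothetical canonical header list for a coastal MAC sensor feed (assumption).
CANONICAL_HEADERS = ["sensor_id", "timestamp_utc", "mac_hash", "rssi_dbm"]


def check_headers(csv_text, canonical=CANONICAL_HEADERS):
    """Compare the first CSV row with the canonical header list."""
    reader = csv.reader(io.StringIO(csv_text))
    # Normalise case and whitespace so cosmetic differences are not flagged.
    headers = [h.strip().lower() for h in next(reader, [])]
    missing = [h for h in canonical if h not in headers]
    unexpected = [h for h in headers if h not in canonical]
    return {
        "ok": not missing and not unexpected,
        "missing": missing,
        "unexpected": unexpected,
    }


# Illustrative sample: the partner renamed rssi_dbm to signal.
sample = (
    "sensor_id,timestamp_utc,mac_hash,signal\n"
    "MAC-INK-COAST-01,2024-01-01T00:00:00Z,ab12,-70\n"
)
result = check_headers(sample)
print(result)
```

A check like this would typically run before the stream is routed onward, so that a renamed or missing column is rejected at the bronze boundary rather than surfacing later in silver.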
⚠ synthetic persona
Daily workflow
- 08:00 Reviews the overnight pipeline run summary: bronze-layer record counts, silver-layer rejection counts, any schema-validation failures.
- 09:00 Onboarding ticket: new partner AIS feed from EU Naval Tracking. Sets up an Eventstream source and defines the header mapping into ais_bronze.
- 11:00 Runs the schema validation notebook against the last 24 h of bronze data; three columns have drifted on the partner side. Files a ticket back to the partner.
- 13:30 OneLake shortcut maintenance: the Border Guard's drone-radar feed moved to a new ADLS container; reissues the shortcut.
- 15:00 Pair-programs with the Intelligence Analyst on a new gold-layer table, incident_evidence_bundle, joining AIS + MAC + radar around any incident timestamp.
- 16:30 Updates the catalog: registers the new coastal MAC sensor MAC-INK-COAST-01 in sensors.geojson and propagates the change to the live dashboards.
- 17:00 Quarterly metric check: schema-drift incidents this quarter: 1 (partner side). Pipeline failure rate: 0.04%. Within SLO.
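The idea behind incident_evidence_bundle, collecting AIS, MAC and radar records around an incident timestamp, can be sketched in plain Python. The record fields, the 15-minute window and the `evidence_bundle` helper are assumptions for illustration; the real gold-layer table would be built over the silver Delta tables.

```python
from datetime import datetime, timedelta


def evidence_bundle(incident_time, feeds, window_minutes=15):
    """Return, per feed, the records within +/- window_minutes of the incident."""
    window = timedelta(minutes=window_minutes)
    return {
        name: [r for r in records if abs(r["ts"] - incident_time) <= window]
        for name, records in feeds.items()
    }


# Illustrative records only; field names are hypothetical.
incident = datetime(2024, 1, 1, 2, 0)
feeds = {
    "ais": [
        {"ts": datetime(2024, 1, 1, 1, 50), "mmsi": "000000001"},
        {"ts": datetime(2024, 1, 1, 3, 0), "mmsi": "000000002"},
    ],
    "mac": [{"ts": datetime(2024, 1, 1, 2, 10), "sensor_id": "MAC-INK-COAST-01"}],
    "radar": [{"ts": datetime(2024, 1, 1, 0, 30), "track_id": "T-1"}],
}
bundle = evidence_bundle(incident, feeds)
print({name: len(rows) for name, rows in bundle.items()})
```

A symmetric window keeps the sketch simple; an asymmetric window (more lead-up than aftermath, say) would only need separate before and after offsets.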
Key data products
| Data product | Source scenario(s) | Fabric tool | Refresh cadence |
|---|---|---|---|
| Sensor onboarding pipeline: Eventstream → bronze Delta → silver Delta with parse/dedup, per sensor family | all (infra) | Fabric Data Factory + Eventstream | continuous |
| Schema validation report: CSV header drift checks, type-cast failures, null-rate anomalies per source | all sources | Notebook + Lakehouse validation table | daily |
| OneLake shortcut registry: index of partner-hosted data exposed read-only into this workspace via shortcuts | partner feeds | OneLake shortcuts + Lakehouse catalog | on change |
| Eventstream routing config: topic → derived stream → destination wiring per sensor | S1–S6 realtime | Eventstream | on change |
| Sensor & infra catalog: canonical sensors.geojson + infrastructure.geojson as source of truth | all | Git + Lakehouse external table | on change |
| Pipeline SLO dashboard: run success %, latency p95, schema-drift count per quarter | all pipelines | Power BI | hourly |
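As a rough illustration of the two headline figures on the pipeline SLO dashboard, run success % and latency p95 can be computed from a list of run records as below. The record shape (`status`, `latency_s`) and the nearest-rank percentile method are assumptions; the actual Power BI measures may be defined differently.

```python
def slo_summary(runs):
    """Compute run success % and nearest-rank p95 latency from run records."""
    if not runs:
        raise ValueError("no pipeline runs to summarise")
    total = len(runs)
    succeeded = sum(1 for r in runs if r["status"] == "succeeded")
    latencies = sorted(r["latency_s"] for r in runs)
    # Nearest-rank percentile: rank = ceil(0.95 * n), in integer arithmetic.
    rank = (95 * total + 99) // 100
    return {
        "success_pct": 100.0 * succeeded / total,
        "latency_p95_s": latencies[rank - 1],
    }


# Illustrative data: 20 runs with latencies 1..20 s, one of them failed.
runs = [{"status": "succeeded", "latency_s": s} for s in range(1, 21)]
runs[4]["status"] = "failed"
summary = slo_summary(runs)
print(summary)
```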
Linked scenarios
infrastructure.geojson polygons being current — Juho is the one who updates them when EnergiNet publishes a new alignment.
Fabric tools used
Example Data Agent prompts
- List every sensor that has produced fewer than 10% of its expected records in the last 24 hours.
- Which silver-layer tables have had schema-validation failures this week, and which upstream source caused each?
- Show me the lineage from ais_bronze to every downstream Activator rule.
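The first prompt above amounts to comparing observed record counts with an expected volume per sensor. A minimal sketch of that check, assuming a hypothetical expected-records-per-day figure is kept alongside each catalog entry (the sensor IDs other than MAC-INK-COAST-01 and all counts are invented for the example):

```python
def underreporting_sensors(expected_per_day, observed_counts, threshold=0.10):
    """List sensors whose 24 h record count is below threshold * expected."""
    flagged = []
    for sensor_id, expected in expected_per_day.items():
        # A sensor with no records at all counts as zero observed.
        observed = observed_counts.get(sensor_id, 0)
        if expected > 0 and observed < threshold * expected:
            flagged.append((sensor_id, observed, expected))
    return sorted(flagged)


# Hypothetical expected daily volumes and last-24 h observed counts.
expected = {"MAC-INK-COAST-01": 1440, "AIS-FEED-01": 86400, "RADAR-01": 2880}
observed = {"MAC-INK-COAST-01": 100, "AIS-FEED-01": 80000}
flagged = underreporting_sensors(expected, observed)
print(flagged)
```

Treating a sensor that is absent from the observed counts as zero is a deliberate choice here: a sensor that has gone completely silent is the case this check most needs to catch.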