Phase 2 of 4: Data & Ontology
Define what your data means
before you build on it.
Ontology design, classification, governed pipelines, and access control. In that order. Raw data is not an asset until someone defines what it is, who owns it, how reliable it is, and who can see it.
The data architecture
Six layers. Each one necessary.
Data moves from raw sources through classification and curation to a governed semantic model with controlled access. Every layer has a job. None are optional.
Source Systems
Where your data lives today.
Every piece of business data originates somewhere: relational databases, REST APIs, event streams, SaaS platforms, flat files, IoT feeds, legacy systems. Before we can govern anything, we need a complete inventory of what exists, who owns it, and what it actually means. Most organizations discover they have two or three systems that each claim to be the authoritative source of truth for the same entity, and none of them agree.
- Catalog all data sources: databases, APIs, files, streams, SaaS platforms
- Assess current data quality and completeness at each source
- Identify authoritative sources vs. derived or stale copies
- Map ownership: who controls each source, who depends on it
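A source inventory is most useful when it is machine-readable, so conflicting claims of authority can be detected automatically. A minimal sketch, assuming a hypothetical `DataSource` record shape (the field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

# Hypothetical inventory entry -- field names are illustrative, not a standard.
@dataclass
class DataSource:
    name: str
    kind: str                     # "database", "api", "file", "stream", "saas"
    owner: str                    # accountable team or person
    authoritative: bool           # source of truth, or a derived/stale copy?
    entities: list = field(default_factory=list)  # business objects it holds

catalog = [
    DataSource("crm_db", "database", "sales-ops", True, ["Customer", "Order"]),
    DataSource("billing_export", "file", "finance", False, ["Customer"]),
]

def conflicting_entities(catalog):
    """Flag entities with more than one claimed authoritative source --
    the 'two or three sources of truth' problem the inventory surfaces."""
    claims = {}
    for src in catalog:
        if src.authoritative:
            for entity in src.entities:
                claims.setdefault(entity, []).append(src.name)
    return {e: names for e, names in claims.items() if len(names) > 1}
```

Adding a second source that also claims authority over `Customer` makes `conflicting_entities` return it immediately, which is the kind of disagreement the inventory exists to expose.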
Ingestion Layer
Data moves. Reliably, with full fidelity.
The ingestion layer moves data from source systems into the governed environment at the right frequency, with the right fidelity, without breaking production systems. Poorly designed ingestion is the root cause of most downstream data quality failures. Not bad data, bad movement. A record that gets dropped, duplicated, or silently truncated in ingestion corrupts every downstream consumer.
- Define ingestion frequency per source: streaming, micro-batch, or scheduled
- Build connectors for each source type: standard and custom
- Implement dead-letter queues and retry logic for failed records
- Record raw-as-received data before any transformation (full fidelity landing)
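The retry-and-dead-letter pattern above can be sketched in a few lines. This is a conceptual illustration, not a production connector: the `ingest` function and its record shapes are assumptions, and a real pipeline would persist the raw landing and the dead-letter queue durably.

```python
import json

def ingest(records, parse, max_retries=3):
    """Land records with retry logic; route failures to a dead-letter queue.

    Each landed record keeps the raw-as-received payload alongside the
    parsed form (full-fidelity landing). Nothing is dropped silently:
    records that fail every attempt go to the DLQ with error context.
    """
    landed, dead_letter = [], []
    for raw in records:
        for attempt in range(1, max_retries + 1):
            try:
                landed.append({"raw": raw, "parsed": parse(raw)})
                break
            except Exception as exc:
                if attempt == max_retries:
                    dead_letter.append(
                        {"raw": raw, "error": str(exc), "attempts": attempt}
                    )
    return landed, dead_letter
```

Feeding it one good and one malformed JSON record yields one landed record and one dead-letter entry, so the failure is visible and replayable instead of silently corrupting downstream consumers.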
Classification Layer
Data is typed, tagged, scored, and tracked.
This is the layer most organizations skip, and the absence of it is exactly why their AI fails. Classification inspects incoming data and answers the questions that governance requires: What type is it? How sensitive is it? What business object does it represent? How trustworthy is it? How did it get here? Without classification, you do not have governed data. You have a pile of bytes that looks like data.
- Schema inference: column types, formats, nullable patterns, cardinality
- PII detection: names, emails, SSNs, addresses, financial data, health records
- Sensitivity classification: Public, Internal, Confidential, Restricted
- Business object mapping: tag which ontology entity each column or record represents
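A minimal sketch of column-level classification, assuming a toy rule set: the two regex patterns and the tiering rule (any PII implies Restricted) are illustrative stand-ins for what is, in production, a much larger detection and policy library.

```python
import re

# Illustrative patterns only -- real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_column(name, sample_values):
    """Return detected PII types and a sensitivity tier for one column."""
    pii = sorted({
        kind
        for value in sample_values
        for kind, pattern in PII_PATTERNS.items()
        if pattern.search(str(value))
    })
    # Assumed tiering rule: any PII -> Restricted; otherwise Internal.
    tier = "Restricted" if pii else "Internal"
    return {"column": name, "pii": pii, "sensitivity": tier}
```

The point is the output shape: structured metadata per column that the catalog and the access layer can act on, rather than a naming convention someone has to remember.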
Curated Data Zones
Raw becomes reliable. Reliable becomes business-ready.
Curated zones apply progressive refinement: data moves from raw, to validated and cleaned, to enriched and business-ready. Each zone has explicit quality contracts. Nothing advances without passing them. This is where most of the transformation engineering lives, and where most DIY data pipelines quietly fail because the contracts were never defined.
- Bronze: raw as received, immutable, full fidelity, retained for audit and replay
- Silver: validated, deduplicated, type-cast, nulls handled, schema normalized
- Gold: joined, enriched, aggregated, business rules applied, ontology-aligned
- Quality gates at each zone boundary: data does not advance without passing defined tests
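A quality contract is just an explicit, executable list of checks at a zone boundary. A minimal sketch, assuming a hypothetical Bronze-to-Silver contract with two checks (the check names and batch shape are illustrative):

```python
# Hypothetical quality contract for the Bronze -> Silver boundary.
def no_null_ids(batch):
    return all(row.get("id") is not None for row in batch)

def unique_ids(batch):
    ids = [row.get("id") for row in batch]
    return len(ids) == len(set(ids))

SILVER_CONTRACT = [no_null_ids, unique_ids]

def advance(batch, contract):
    """Gate a batch at a zone boundary.

    Returns (advanced_batch, failed_checks). On any failure the batch
    does not advance -- the failed check names are reported instead.
    """
    failed = [check.__name__ for check in contract if not check(batch)]
    return (batch if not failed else None), failed
```

Because the contract is code, it can be versioned, tested, and reported on; a batch with duplicate IDs is held at Bronze with `unique_ids` named as the reason, rather than leaking into Gold.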
Ontology & Semantic Layer
Data becomes knowledge. Objects have relationships.
The semantic layer maps curated data to your business ontology: the formal definition of what your business objects are, how they relate, and what they mean. This is what gives AI real reasoning capability instead of pattern matching over raw column names. Palantir built a multibillion-dollar company around exactly this concept.
- Formal ontology definition: Customer, Order, Asset, Employee and all their properties
- Relationship mapping: Customer has Orders, Order contains Products, Asset belongs to Location
- Semantic API: query business objects by name, not by table join and column alias
- Knowledge graph: relationships traversable by AI agents and analytical query engines
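The difference between a semantic layer and a pile of tables is that relationships are named and traversable. A toy sketch under assumed names (`ONTOLOGY`, `GRAPH`, and the object identifiers are all illustrative): an agent asks for a Customer's Orders by relationship name, never by table join.

```python
# Minimal ontology sketch: object types and their named relationships.
ONTOLOGY = {
    "Customer": {"has": "Order"},
    "Order": {"contains": "Product"},
}

# Toy knowledge-graph instance data, keyed by (type, id).
GRAPH = {
    ("Customer", "c-1"): {"has": [("Order", "o-9")]},
    ("Order", "o-9"): {"contains": [("Product", "p-3")]},
}

def traverse(obj, relation):
    """Follow a named relationship from a business object to its targets.

    The ontology validates the relationship before the graph is consulted,
    so a caller cannot invent an edge that the business model does not define.
    """
    if ONTOLOGY[obj[0]].get(relation) is None:
        raise ValueError(f"{obj[0]} has no relationship '{relation}'")
    return GRAPH.get(obj, {}).get(relation, [])
```

Chaining traversals ("this Customer's Orders, those Orders' Products") is the graph walk that both AI agents and analytical engines perform against the semantic API.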
Service & Access Layer
Governed access. Role-aware. Fully audited.
The service layer is where governed data becomes accessible to applications, analysts, and AI. MCP servers live here. Every read, every action is mediated by access controls built on the ontology and sensitivity classifications from lower layers. AI does not reach raw data; it reaches a governed interface that knows exactly what it is allowed to see and do, and logs everything it touches.
- MCP servers: governed AI access to read data and take action in your systems
- REST APIs: application access to business objects, not raw tables or SQL
- Row-level security: data filtered by user role, team, or sensitivity classification tier
- Column-level masking: PII redacted or tokenized based on classification
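Column-level masking is mechanical once classification has done its job: the service layer only needs a role-to-tier policy and the per-column sensitivity tags. A minimal sketch, assuming a hypothetical `ROLE_TIERS` policy shape (roles, tiers, and the `"***"` mask are illustrative choices):

```python
# Assumed policy: each role maps to the sensitivity tiers it may read.
ROLE_TIERS = {
    "analyst": {"Public", "Internal"},
    "admin": {"Public", "Internal", "Confidential", "Restricted"},
}

def serve(rows, column_tiers, role):
    """Mask any column whose sensitivity tier the role cannot read.

    column_tiers comes from the classification layer; unclassified
    columns default to Internal rather than leaking as Public.
    """
    allowed = ROLE_TIERS[role]
    return [
        {
            col: (value if column_tiers.get(col, "Internal") in allowed else "***")
            for col, value in row.items()
        }
        for row in rows
    ]
```

The same tagged rows come back differently per caller: an analyst sees the Restricted email column masked, an admin sees it in the clear, and in a real system every such read would also be written to the audit log.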
The hard layer
Classification is where most data projects fail.
Most teams treat ingestion as the hard part. It is not. Getting data into a system is easy. Knowing what that data is (what it means, how sensitive it is, what business entity it represents, and whether it can be trusted) is what separates governed data from a pile of bytes.
We build the classification layer as a discrete, testable service. Not a convention, not a naming scheme, not documentation nobody reads. It runs on every record that lands, it emits structured metadata to the catalog, and it blocks bad data from advancing to curated zones.
When AI needs to reason over your data, this is the work that makes it possible. The model does not hallucinate a customer record when the classification layer has already told it what a customer is, what fields are authoritative, and what sensitivity tier applies.
The access bridge
MCP servers sit on top of the service layer.
That is not an accident.
MCP (Model Context Protocol) servers are the last-mile access layer for AI agents. They can only be built correctly when the ontology defines their schema, the classification layer tells them what is sensitive, and the service layer tells them who can see what. The data architecture is what makes governed AI access possible.
Find out where your data actually stands.
The AI Readiness Assessment scores your data architecture across all six layers. Know what is governed, what is missing, and what to fix first.