An open source AI inference SoC — 32 AI accelerator cores paired with a dedicated cluster of small POWER ISA control cores, dual PCIe Gen5 and OpenCAPI host interfaces, and card-to-card chaining. Single-slot, 75W. Sixteen cards per chassis where competing systems fit eight. The entire SoC runs a unified POWER ISA — one toolchain, one firmware stack, fully auditable from silicon to software.
The AI inference market is projected to approach $50 billion in 2026 and to reach $350 billion by 2032. Today's inference infrastructure was designed for training: large, power-hungry compute cards repurposed for a workload whose bottleneck is memory bandwidth and latency, not raw FLOP throughput. A standard inference server chassis holds eight double-wide cards drawing 350–700W each and runs at 3–5 kilowatts total. The hardware carries closed firmware that cannot be independently audited, and the economics are dictated by that hardware's supply chain.
OpenAIU is designed from first principles as an inference accelerator for the markets where auditability and supply chain sovereignty matter most: banking, healthcare, defense, telecom, and sovereign AI programs. Single-slot. 75W. A standard server chassis holds sixteen OpenAIU cards: twice the density, with each inference node drawing 75W instead of 350–700W, running open source silicon with auditable firmware from boot to inference kernel.
The SoC uses two tiers of POWER ISA cores: a dedicated POWER Control Cluster of four Microwatt-class in-order cores handling on-card firmware, host bridge management, NOC orchestration, and power/thermal control — and a 32-core AI compute array connected to those control cores via the Auxiliary Execution Unit (AXU) interface. High-bandwidth memory access is delivered through an open OMI buffer. The entire SoC — control plane and compute plane — runs a unified POWER ISA with a single toolchain and a single auditable firmware stack.
The project will use agentic AI EDA tools (IBM Bob AI orchestration and the IBM EDA Suite via the Silicon Factory) to compress what would traditionally be a five- to seven-year design cycle into a 36-month delivery.
Single-slot form factor at 75W fits 16 cards where a GPU-based inference server holds 8 double-wide cards. Same chassis, double the inference nodes, each independently managed.
At 75W versus 350–700W per GPU card, OpenAIU targets dramatically lower power per inference unit: a fully populated sixteen-card chassis draws about 1.2kW, compared with 3–5kW for an eight-card GPU chassis. Designed for sustained inference workloads, not peak GPU training bursts.
AI core RTL, control firmware, and memory interface are Apache 2.0 open source. Every numerical operation is specifiable and verifiable, which is critical for regulated AI deployment under the EU AI Act and OCC model risk guidance.
Open RTL, open firmware, open toolchain. Any OPF member can manufacture, deploy, or extend the design. No per-card royalties, no driver licensing, no ecosystem lock-in.
Performance targets are architectural goals based on design parameters. Validated benchmarks will be published following FPGA bring-up at milestone M6.
Thirty-two AI accelerator cores arranged in an 8×4 grid, connected by a bidirectional ring bus. Each core contains a 2D systolic array for matrix operations, a 1D vector unit, a local SRAM tile, and an AXU dispatch interface to the control cluster.
The POWER control cluster exposes an Auxiliary Execution Unit (AXU) interface — a standardized connection point defined in the POWER ISA that allows custom functional units to sit directly adjacent to the POWER execution pipeline. AXU operations are dispatched as POWER ISA instructions and share register state with the control core's integer and floating-point register files. This means the AI cores in OpenAIU are not a separate "device" communicating over PCIe or even an AXI bus — they are execution units from the CPU's perspective, with register-speed data transfer and zero DMA overhead for small tensors. The POWER ISA Matrix Multiply Assist (MMA) instructions are specifically designed to dispatch to AXU-class functional units of exactly this type.
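To make the dispatch model concrete, the sketch below shows a 4×4 FP32 accumulate using the POWER MMA built-ins available in GCC on existing POWER10 hosts. It illustrates how matrix work is issued as ordinary POWER ISA instructions operating on register state; it targets stock POWER10 hardware, not OpenAIU silicon, and whether OpenAIU exposes exactly these built-ins is an assumption here.

```c
/* 4x4 FP32 rank-1 update with POWER ISA MMA built-ins (GCC 10+,
 * compile with -mcpu=power10). The accumulator is architected register
 * state; no device driver, DMA, or bus transaction is involved, which is
 * the dispatch model the AXU-coupled AI cores follow. Illustrative only;
 * this runs on a stock POWER10 CPU, not OpenAIU. */
#include <altivec.h>
#include <string.h>

void mma_outer_product_4x4(const float a[4], const float b[4], float c[4][4])
{
    __vector_quad acc;                          /* 512-bit MMA accumulator (4x4 FP32)  */
    vector unsigned char va, vb;                /* VSX register images of a and b      */
    vector float rows[4];

    memcpy(&va, a, sizeof(va));                 /* load 4 floats into each VSX source  */
    memcpy(&vb, b, sizeof(vb));

    __builtin_mma_xxsetaccz(&acc);              /* zero the accumulator                */
    __builtin_mma_xvf32gerpp(&acc, va, vb);     /* acc += outer(a, b), one instruction */

    __builtin_mma_disassemble_acc(rows, &acc);  /* copy accumulator rows to memory     */
    memcpy(c, rows, sizeof(rows));
}
```

Equivalent ger built-ins exist for FP16, BF16, INT8, and INT4 operands, covering the precision formats listed in the specification table below.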
Modern AI accelerator SoCs use a two-tier architecture: small management cores handle host bridge enumeration, DDR initialization, and NOC fabric orchestration, while the compute array handles the AI workload. It is the correct structural pattern — separating control plane concerns from compute plane concerns allows each tier to be optimized independently.
OpenAIU uses Microwatt-class POWER ISA cores for the control cluster, completing a unified POWER ISA environment across the entire SoC. This delivers three compounding advantages: one toolchain for control and compute code, one auditable firmware stack from boot to inference kernel, and one ISA specification to verify across the whole chip.
The four Microwatt control cores are small, in-order, and clocked conservatively — their job is orchestration, not compute. They consume a fraction of die area and power budget. The AI compute array does the heavy lifting; the POWER control cluster makes the card a first-class, autonomously bootable, fully managed accelerator device.
OpenAIU supports two host interfaces, PCIe Gen5 ×16 and OpenCAPI, making it viable across the broadest possible deployment base.
The two interfaces are not exclusive — the card negotiates at boot time based on what the host presents. The same hardware runs in any existing data center infrastructure over PCIe, and achieves its full performance potential when paired with a POWER-native host over OpenCAPI, with the on-card POWER control cluster managing interface initialization in both cases.
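As a minimal sketch of that boot-time negotiation, assuming a firmware structure and probe functions that are purely illustrative (none of the symbols below come from a published OpenAIU interface):

```c
/* Hypothetical boot-time host-interface selection running on the Microwatt
 * control cluster. Every symbol here is an illustrative placeholder. */
#include <stdbool.h>

typedef enum { HOST_IF_NONE, HOST_IF_OPENCAPI, HOST_IF_PCIE_GEN5 } host_if_t;

/* Placeholder link probes; real firmware would poll the link-training
 * status of each host bridge PHY. */
static bool opencapi_link_trained(void) { return false; }
static bool pcie_gen5_link_trained(void) { return true; }

host_if_t select_host_interface(void)
{
    /* Prefer OpenCAPI when a POWER-native host presents it: coherent
     * attach plus card-to-card chaining. */
    if (opencapi_link_trained())
        return HOST_IF_OPENCAPI;

    /* Otherwise fall back to PCIe Gen5 for broad host compatibility. */
    if (pcie_gen5_link_trained())
        return HOST_IF_PCIE_GEN5;

    return HOST_IF_NONE;
}
```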
OpenCAPI's topology supports direct card-to-card communication without routing through the host CPU. Multiple OpenAIU cards can be chained into an inference fabric — sharing the KV cache for long-context inference, pipelining model layers across cards, or distributing batch workloads across the full card pool.
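A rough sketch of the layer-pipelining case, showing only the partitioning logic; the card count, layer count, and any notion of a runtime API here are assumptions for illustration, not project interfaces:

```c
/* Assign contiguous transformer-layer ranges to chained cards so that
 * activations cross exactly one card-to-card hop per stage boundary.
 * Card and layer counts are illustrative assumptions. */
#include <stdio.h>

#define NUM_CARDS  4    /* OpenAIU cards chained over OpenCAPI */
#define NUM_LAYERS 32   /* transformer layers in the model     */

int main(void)
{
    int per_card = (NUM_LAYERS + NUM_CARDS - 1) / NUM_CARDS;  /* ceiling division */

    for (int card = 0; card < NUM_CARDS; card++) {
        int first = card * per_card;
        int last  = first + per_card - 1;
        if (last >= NUM_LAYERS)
            last = NUM_LAYERS - 1;

        /* A real runtime would load this range of weights into the card's
         * local memory and point its output at the next card in the chain. */
        printf("card %d: layers %d..%d\n", card, first, last);
    }
    return 0;
}
```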
The Open Memory Interface (OMI) is a low-latency, high-bandwidth serial memory interface, defined as a subset of the OpenCAPI specification, that delivers HBM-class bandwidth in a fraction of the die area of traditional parallel DDR interfaces. IBM uses OMI across its POWER processor family and its Spyre AI accelerator. The problem: every existing OMI buffer chip (the component that bridges the OMI serial link to LPDDR5 or HBM memory) is proprietary. Microchip Technology's Explorer buffer chip is the only commercial option, and it is closed. OpenAIU therefore requires an open OMI buffer chip as a sub-project deliverable. The path forward uses the Universal Memory Interface (UMI) from Zero ASIC as a transaction-layer reference, combined with an open-source SerDes implementation, to build a minimal OMI-compatible buffer in an accessible process node. This is explicitly the highest-risk deliverable in the project scope and is flagged for early funding prioritization.
A traditional silicon design project of this complexity — a multi-core AI SoC at 5nm — takes 5–7 years from architecture to tape-out. The OpenAIU project explicitly targets 36 months by using agentic AI-powered EDA tooling throughout the design flow. This is not hypothetical: the tools exist and are in commercial use as of 2025-2026.
Agentic AI that automates RTL coding, test plan generation, regression orchestration, and bug fixing for the Microwatt control cluster updates and AI core RTL. Compresses front-end design from months to days for well-specified blocks.
Autonomously reaches coverage targets faster by identifying redundant tests and routing simulation resources toward uncovered functional states. Critical for verifying AI core correctness and AXU dispatch correctness across all precision formats; a bit-exact reference model for one of these formats is sketched after the tooling overview below.
Optimizes floorplanning, clock tree synthesis, and power gating across all 32 AI cores simultaneously — targeting the 75W envelope with improved power efficiency over manual closure. Engineers can drive multiple block closures in parallel via Silicon Factory.
Autonomous search across the implementation design space (synthesis strategies, placement constraints, routing options) to find minimum-power configurations at 5nm. Particularly valuable for the AI core array where repeated instances allow aggressive optimization sharing.
For the analog-intensive open OMI SerDes and memory interface circuits. AI-assisted process migration takes an existing open-source SerDes design and retargets it to the 5nm node, dramatically reducing analog design iteration time — the slowest part of any mixed-signal SoC.
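Verifying correctness across all precision formats, as noted in the coverage item above, ultimately means comparing hardware output against a bit-exact software reference. The sketch below is an assumed example of such a reference model, a generic FP32-to-BF16 round-to-nearest-even conversion, not code from the OpenAIU verification suite.

```c
/* Golden-model sketch: FP32 -> BF16 with round-to-nearest-even, the kind of
 * bit-exact reference a verification suite compares the AI-core datapath
 * against. Generic illustration, not OpenAIU project code. */
#include <stdint.h>
#include <string.h>

uint16_t fp32_to_bf16_rne(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));       /* reinterpret the IEEE-754 bits */

    /* NaN inputs: keep the result a NaN after the mantissa is truncated. */
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return (uint16_t)((bits >> 16) | 0x0040u);

    /* Round to nearest, ties to even, on the 16 mantissa bits being dropped. */
    uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7fffu + lsb;
    return (uint16_t)(bits >> 16);
}
```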
The AI core array and the Microwatt control cluster may be implemented as separate chiplets (AI cores at 5nm, control cluster at 7nm) connected via UCIe die-to-die interface. This reduces per-die yield risk and allows node-optimized manufacturing for each functional block. AI-driven EDA tools from the Silicon Factory optimize the die partition boundary.
Regulated industries deploying AI for credit decisions, fraud detection, clinical decision support, or benefits eligibility face a specific problem with current AI accelerator hardware: the inference computation is performed on opaque proprietary silicon with no auditable firmware and no independent verification of numerical precision. Regulators in the EU (the AI Act), the US financial sector (OCC guidance on model risk), and healthcare (FDA guidance on AI-based devices) are increasingly focused on infrastructure provenance for AI systems.
OpenAIU is designed with auditability as a first-class architectural requirement: open RTL for the cores that perform every numerical operation, open firmware from boot to inference kernel, and an open toolchain that deployers and regulators can inspect end to end.
OpenAIU is not a standalone device. It is designed as a component in the full OpenPOWER stack: it boots via OpenFSP, is managed via OpenHMC, and attaches coherently to POWER hosts over OpenCAPI.
| Parameter | Specification | Notes |
|---|---|---|
| AI Cores | 32 (8×4 grid) | Open RTL, 2D systolic + 1D vector per core |
| Precision | INT4 / INT8 / FP16 / BF16 | Hardware acceleration for all; INT4 for LLM inference |
| On-chip SRAM | ≥64 MB | Distributed tiles + shared L3 |
| Memory Interface | OMI (open buffer) | OpenCAPI OMI subset; LPDDR5X memory stacks |
| Memory Bandwidth | >200 GB/s | Peak to AI core array |
| Control Processor | 4× Microwatt POWER ISA | AXU-coupled to AI array; MMA instruction dispatch; full POWER ISA |
| Host Interface | PCIe Gen5 ×16 / OpenCAPI | Negotiated at boot; OpenCAPI primary on POWER hosts for full coherent performance; PCIe for broad compatibility |
| Power Envelope | 75W TDP | PCIe card form factor, passive or active cooling |
| Process Node | 5nm (TSMC N5) | 7nm fallback; chiplet option for mixed nodes |
| Architecture | Monolithic or 2.5D chiplet | UCIe die-to-die interface for chiplet variant |
| Firmware | Open source (Apache 2.0) | Boot via OpenFSP; management via OpenHMC |
Published microarchitecture specification for a single AI core: systolic array dimensions, vector unit design, SRAM tile, and AXU dispatch protocol. Month 4.
Synthesizable open source RTL for the full 32-core array including ring bus interconnect. Verification suite with INT4/INT8/FP16/BF16 test vectors. Month 14.
Open source OMI buffer chip design — the most novel deliverable. SerDes, memory controller, and OMI protocol stack. FPGA-verified. Month 18.
Microwatt control cluster + AI array + OMI + PCIe Gen5 + OpenCAPI integrated into a complete SoC RTL. Functional simulation with reference AI models (BERT, quantized LLaMA); a sketch of the quantization numerics involved follows the milestone list. Month 22.
Partial implementation on UltraScale+ — Microwatt control cluster + reduced AI core array (8 cores) — running quantized inference against reference benchmarks. Month 20.
GDSII at 5nm (or 7nm). PPA closure verified with IBM SixthSense/EINSTEIN via Silicon Factory. DRC clean. Third-party security audit of AI core arithmetic. Month 36.
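Several milestones above run quantized reference models. As an illustration of the numerics involved, the sketch below implements a generic group-wise symmetric INT4 weight quantization; the group size and scheme are assumptions for illustration, not the quantization recipe the project has committed to.

```c
/* Generic group-wise symmetric INT4 quantization reference: each group of
 * weights shares one FP32 scale and values are clamped to [-8, 7].
 * Illustrative only; link with -lm for lrintf. */
#include <math.h>
#include <stdint.h>

void quantize_int4(const float *w, int n, int group,
                   float *scales, int8_t *q)
{
    for (int g = 0; g * group < n; g++) {
        int lo = g * group;
        int hi = (lo + group < n) ? lo + group : n;

        /* Per-group scale from the largest magnitude in the group. */
        float amax = 0.0f;
        for (int i = lo; i < hi; i++)
            if (fabsf(w[i]) > amax)
                amax = fabsf(w[i]);
        float s = (amax > 0.0f) ? amax / 7.0f : 1.0f;
        scales[g] = s;

        /* Round to nearest and clamp to the signed 4-bit range; values are
         * stored one per byte here, and packing two per byte is left out. */
        for (int i = lo; i < hi; i++) {
            long v = lrintf(w[i] / s);
            if (v < -8) v = -8;
            if (v >  7) v =  7;
            q[i] = (int8_t)v;
        }
    }
}
```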
Every major regulated industry is deploying AI inference on-premises — for model risk compliance, data residency requirements, or latency-sensitive applications. They are doing it on proprietary GPU hardware they cannot audit, connected via proprietary software stacks they cannot verify, managed by cloud providers whose infrastructure they cannot inspect.
OpenAIU is the only project in the world combining an open processor ISA with IBM patent coverage, open AI accelerator cores derived from a published research architecture, auditable firmware from chip boot to inference kernel, and a hardware form factor compatible with multi-vendor manufacturing. For a bank that must explain to its regulator exactly what hardware and software computed a credit decision, OpenAIU is infrastructure that makes that explanation possible. No proprietary GPU can.