An open source AI inference SoC — 32 AI accelerator cores paired with a dedicated cluster of small POWER ISA control cores, dual PCIe Gen5 and OpenCAPI host interfaces, and card-to-card chaining. Single-slot, 75W. Sixteen cards per chassis where competing systems fit eight. The entire SoC runs a unified POWER ISA — one toolchain, one firmware stack, fully auditable from silicon to software.
The AI inference market is projected to approach $50 billion in 2026 and to reach $350 billion by 2032. Today's inference infrastructure was designed for training: large, power-hungry compute cards repurposed for a workload whose bottleneck is memory bandwidth and latency, not raw FLOP throughput. A standard inference server chassis holds eight double-wide cards drawing 350–700W each and runs at 3–5 kilowatts total. The hardware carries closed firmware that cannot be independently audited, and the economics are dictated by that hardware's supply chain.
OpenAIU is designed from first principles as an inference accelerator for the markets where auditability and supply chain sovereignty matter most: banking, healthcare, defense, telecom, and sovereign AI programs. Single-slot. 75W. A standard server chassis holds sixteen OpenAIU cards: twice the density, with each inference node drawing 75W instead of 350–700W, running open source silicon with auditable firmware from boot to inference kernel.
The SoC uses two tiers of POWER ISA cores: a dedicated POWER Control Cluster of four Microwatt-class in-order cores handling on-card firmware, host bridge management, NOC orchestration, and power/thermal control — and a 32-core AI compute array connected to those control cores via the Auxiliary Execution Unit (AXU) interface. High-bandwidth memory access is delivered through an open OMI buffer. The entire SoC — control plane and compute plane — runs a unified POWER ISA with a single toolchain and a single auditable firmware stack.
The project will use agentic AI EDA tools (IBM Bob AI orchestration and the IBM EDA Suite via the Silicon Factory) to compress what would traditionally be a five- to seven-year design cycle into a 36-month delivery.
Single-slot form factor at 75W fits 16 cards where a GPU-based inference server holds 8 double-wide cards. Same chassis, double the inference nodes, each independently managed.
At 75W versus 350–700W per GPU card, OpenAIU targets dramatically lower power per inference unit: a fully populated sixteen-card chassis draws about 1.2kW, compared with 3–5kW for an eight-card GPU chassis. Designed for sustained inference workloads, not peak GPU training bursts.
AI core RTL, control firmware, and memory interface are Apache 2.0 open source. Every numerical operation is specifiable and verifiable, which is critical for regulated AI deployment under the EU AI Act and OCC model risk guidance.
Open RTL, open firmware, open toolchain. Any OPF member can manufacture, deploy, or extend the design. No per-card royalties, no driver licensing, no ecosystem lock-in.
Performance targets are architectural goals based on design parameters. Validated benchmarks will be published following FPGA bring-up at milestone M6.
Thirty-two AI accelerator cores arranged in an 8×4 grid, connected by a bidirectional ring bus. Each core contains a 2D systolic array for matrix operations, a 1D vector unit, a local SRAM tile, and an AXU dispatch interface to the control cluster.
The POWER control cluster exposes an Auxiliary Execution Unit (AXU) interface — a standardized connection point defined in the POWER ISA that allows custom functional units to sit directly adjacent to the POWER execution pipeline. AXU operations are dispatched as POWER ISA instructions and share register state with the control core's integer and floating-point register files. This means the AI cores in OpenAIU are not a separate "device" communicating over PCIe or even an AXI bus — they are execution units from the CPU's perspective, with register-speed data transfer and zero DMA overhead for small tensors. The POWER ISA Matrix Multiply Assist (MMA) instructions are specifically designed to dispatch to AXU-class functional units of exactly this type.
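To make the dispatch model concrete, the sketch below shows a 4×4 FP32 accumulate using the POWER MMA built-ins available in GCC on existing POWER10 hosts. It illustrates how matrix work is issued as ordinary POWER ISA instructions operating on register state; it targets stock POWER10 hardware, not OpenAIU silicon, and whether OpenAIU exposes exactly these built-ins is an assumption here.

```c
/* 4x4 FP32 rank-1 update with POWER ISA MMA built-ins (GCC 10+,
 * compile with -mcpu=power10). The accumulator is architected register
 * state; no device driver, DMA, or bus transaction is involved, which is
 * the dispatch model the AXU-coupled AI cores follow. Illustrative only;
 * this runs on a stock POWER10 CPU, not OpenAIU. */
#include <altivec.h>
#include <string.h>

void mma_outer_product_4x4(const float a[4], const float b[4], float c[4][4])
{
    __vector_quad acc;                          /* 512-bit MMA accumulator (4x4 FP32)  */
    vector unsigned char va, vb;                /* VSX register images of a and b      */
    vector float rows[4];

    memcpy(&va, a, sizeof(va));                 /* load 4 floats into each VSX source  */
    memcpy(&vb, b, sizeof(vb));

    __builtin_mma_xxsetaccz(&acc);              /* zero the accumulator                */
    __builtin_mma_xvf32gerpp(&acc, va, vb);     /* acc += outer(a, b), one instruction */

    __builtin_mma_disassemble_acc(rows, &acc);  /* copy accumulator rows to memory     */
    memcpy(c, rows, sizeof(rows));
}
```

Equivalent ger built-ins exist for FP16, BF16, INT8, and INT4 operands, covering the precision formats listed in the specification table below.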
Modern AI accelerator SoCs use a two-tier architecture: small management cores handle host bridge enumeration, DDR initialization, and NOC fabric orchestration, while the compute array handles the AI workload. It is the correct structural pattern — separating control plane concerns from compute plane concerns allows each tier to be optimized independently.
OpenAIU uses Microwatt-class POWER ISA cores for the control cluster, completing a unified POWER ISA environment across the entire SoC. This delivers three compounding advantages: one toolchain for control and compute code, one auditable firmware stack from boot to inference kernel, and one ISA specification to verify across the whole chip.
The four Microwatt control cores are small, in-order, and clocked conservatively — their job is orchestration, not compute. They consume a fraction of die area and power budget. The AI compute array does the heavy lifting; the POWER control cluster makes the card a first-class, autonomously bootable, fully managed accelerator device.
OpenAIU supports two host interfaces, PCIe Gen5 ×16 and OpenCAPI, making it viable across the broadest possible deployment base.
The two interfaces are not exclusive — the card negotiates at boot time based on what the host presents. The same hardware runs in any existing data center infrastructure over PCIe, and achieves its full performance potential when paired with a POWER-native host over OpenCAPI, with the on-card POWER control cluster managing interface initialization in both cases.
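As a minimal sketch of that boot-time negotiation, assuming a firmware structure and probe functions that are purely illustrative (none of the symbols below come from a published OpenAIU interface):

```c
/* Hypothetical boot-time host-interface selection running on the Microwatt
 * control cluster. Every symbol here is an illustrative placeholder. */
#include <stdbool.h>

typedef enum { HOST_IF_NONE, HOST_IF_OPENCAPI, HOST_IF_PCIE_GEN5 } host_if_t;

/* Placeholder link probes; real firmware would poll the link-training
 * status of each host bridge PHY. */
static bool opencapi_link_trained(void) { return false; }
static bool pcie_gen5_link_trained(void) { return true; }

host_if_t select_host_interface(void)
{
    /* Prefer OpenCAPI when a POWER-native host presents it: coherent
     * attach plus card-to-card chaining. */
    if (opencapi_link_trained())
        return HOST_IF_OPENCAPI;

    /* Otherwise fall back to PCIe Gen5 for broad host compatibility. */
    if (pcie_gen5_link_trained())
        return HOST_IF_PCIE_GEN5;

    return HOST_IF_NONE;
}
```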
OpenCAPI's topology supports direct card-to-card communication without routing through the host CPU. Multiple OpenAIU cards can be chained into an inference fabric — sharing the KV cache for long-context inference, pipelining model layers across cards, or distributing batch workloads across the full card pool.
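A rough sketch of the layer-pipelining case, showing only the partitioning logic; the card count, layer count, and any notion of a runtime API here are assumptions for illustration, not project interfaces:

```c
/* Assign contiguous transformer-layer ranges to chained cards so that
 * activations cross exactly one card-to-card hop per stage boundary.
 * Card and layer counts are illustrative assumptions. */
#include <stdio.h>

#define NUM_CARDS  4    /* OpenAIU cards chained over OpenCAPI */
#define NUM_LAYERS 32   /* transformer layers in the model     */

int main(void)
{
    int per_card = (NUM_LAYERS + NUM_CARDS - 1) / NUM_CARDS;  /* ceiling division */

    for (int card = 0; card < NUM_CARDS; card++) {
        int first = card * per_card;
        int last  = first + per_card - 1;
        if (last >= NUM_LAYERS)
            last = NUM_LAYERS - 1;

        /* A real runtime would load this range of weights into the card's
         * local memory and point its output at the next card in the chain. */
        printf("card %d: layers %d..%d\n", card, first, last);
    }
    return 0;
}
```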
The Open Memory Interface (OMI) is a low-latency, high-bandwidth serial memory interface, defined as a subset of the OpenCAPI specification, that delivers HBM-class bandwidth in a fraction of the die area of traditional parallel DDR interfaces. IBM uses OMI across its POWER processor family and its Spyre AI accelerator. The problem: every existing OMI buffer chip (the component that bridges the OMI serial link to LPDDR5 or HBM memory) is proprietary. Microchip Technology's Explorer buffer chip is the only commercial option, and it is closed. OpenAIU therefore requires an open OMI buffer chip as a sub-project deliverable. The path forward uses the Universal Memory Interface (UMI) from Zero ASIC as a transaction-layer reference, combined with an open-source SerDes implementation, to build a minimal OMI-compatible buffer in an accessible process node. This is explicitly the highest-risk deliverable in the project scope and is flagged for early funding prioritization.
A traditional silicon design project of this complexity — a multi-core AI SoC at 5nm — takes 5–7 years from architecture to tape-out. The OpenAIU project explicitly targets 36 months by using agentic AI-powered EDA tooling throughout the design flow. This is not hypothetical: the tools exist and are in commercial use as of 2025-2026.
Agentic AI that automates RTL coding, test plan generation, regression orchestration, and bug fixing for the Microwatt control cluster updates and AI core RTL. Compresses front-end design from months to days for well-specified blocks.
Autonomously reaches coverage targets faster by identifying redundant tests and routing simulation resources toward uncovered functional states. Critical for verifying AI core correctness and AXU dispatch correctness across all precision formats; a bit-exact reference model for one of these formats is sketched after the tooling overview below.
Optimizes floorplanning, clock tree synthesis, and power gating across all 32 AI cores simultaneously — targeting the 75W envelope with improved power efficiency over manual closure. Engineers can drive multiple block closures in parallel via Silicon Factory.
Autonomous search across the implementation design space (synthesis strategies, placement constraints, routing options) to find minimum-power configurations at 5nm. Particularly valuable for the AI core array where repeated instances allow aggressive optimization sharing.
For the analog-intensive open OMI SerDes and memory interface circuits. AI-assisted process migration takes an existing open-source SerDes design and retargets it to the 5nm node, dramatically reducing analog design iteration time — the slowest part of any mixed-signal SoC.
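Verifying correctness across all precision formats, as noted in the coverage item above, ultimately means comparing hardware output against a bit-exact software reference. The sketch below is an assumed example of such a reference model, a generic FP32-to-BF16 round-to-nearest-even conversion, not code from the OpenAIU verification suite.

```c
/* Golden-model sketch: FP32 -> BF16 with round-to-nearest-even, the kind of
 * bit-exact reference a verification suite compares the AI-core datapath
 * against. Generic illustration, not OpenAIU project code. */
#include <stdint.h>
#include <string.h>

uint16_t fp32_to_bf16_rne(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));       /* reinterpret the IEEE-754 bits */

    /* NaN inputs: keep the result a NaN after the mantissa is truncated. */
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return (uint16_t)((bits >> 16) | 0x0040u);

    /* Round to nearest, ties to even, on the 16 mantissa bits being dropped. */
    uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7fffu + lsb;
    return (uint16_t)(bits >> 16);
}
```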
The AI core array and the Microwatt control cluster may be implemented as separate chiplets (AI cores at 5nm, control cluster at 7nm) connected via UCIe die-to-die interface. This reduces per-die yield risk and allows node-optimized manufacturing for each functional block. AI-driven EDA tools from the Silicon Factory optimize the die partition boundary.
Regulated industries deploying AI for credit decisions, fraud detection, clinical decision support, or benefits eligibility face a specific problem with current AI accelerator hardware: the inference computation is performed on opaque proprietary silicon with no auditable firmware and no independent verification of numerical precision. Regulators in the EU (the AI Act), the US financial sector (OCC guidance on model risk), and healthcare (FDA guidance on AI-based devices) are increasingly focused on infrastructure provenance for AI systems.
OpenAIU is designed with auditability as a first-class architectural requirement: open RTL for the cores that perform every numerical operation, open firmware from boot to inference kernel, and an open toolchain that deployers and regulators can inspect end to end.
OpenAIU is not a standalone device. It is designed as a component in the full OpenPOWER stack: it boots via OpenFSP, is managed via OpenHMC, and attaches coherently to POWER hosts over OpenCAPI.
| Parameter | Specification | Notes |
|---|---|---|
| AI Cores | 32 (8×4 grid) | Open RTL, 2D systolic + 1D vector per core |
| Precision | INT4 / INT8 / FP16 / BF16 | Hardware acceleration for all; INT4 for LLM inference |
| On-chip SRAM | ≥64 MB | Distributed tiles + shared L3 |
| Memory Interface | OMI (open buffer) | OpenCAPI OMI subset; LPDDR5X memory stacks |
| Memory Bandwidth | >200 GB/s | Peak to AI core array |
| Control Processor | 4× Microwatt POWER ISA | AXU-coupled to AI array; MMA instruction dispatch; full POWER ISA |
| Host Interface | PCIe Gen5 ×16 / OpenCAPI | Negotiated at boot; OpenCAPI primary on POWER hosts for full coherent performance; PCIe for broad compatibility |
| Power Envelope | 75W TDP | PCIe card form factor, passive or active cooling |
| Process Node | 5nm (TSMC N5) | 7nm fallback; chiplet option for mixed nodes |
| Architecture | Monolithic or 2.5D chiplet | UCIe die-to-die interface for chiplet variant |
| Firmware | Open source (Apache 2.0) | Boot via OpenFSP; management via OpenHMC |
Published microarchitecture specification for a single AI core: systolic array dimensions, vector unit design, SRAM tile, and AXU dispatch protocol. Month 4.
Synthesizable open source RTL for the full 32-core array including ring bus interconnect. Verification suite with INT4/INT8/FP16/BF16 test vectors. Month 14.
Open source OMI buffer chip design — the most novel deliverable. SerDes, memory controller, and OMI protocol stack. FPGA-verified. Month 18.
Microwatt control cluster + AI array + OMI + PCIe Gen5 + OpenCAPI integrated into a complete SoC RTL. Functional simulation with reference AI models (BERT, quantized LLaMA); a sketch of the quantization numerics involved follows the milestone list. Month 22.
Partial implementation on UltraScale+ — Microwatt control cluster + reduced AI core array (8 cores) — running quantized inference against reference benchmarks. Month 20.
GDSII at 5nm (or 7nm). PPA closure verified with IBM SixthSense/EINSTEIN via Silicon Factory. DRC clean. Third-party security audit of AI core arithmetic. Month 36.
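Several milestones above run quantized reference models. As an illustration of the numerics involved, the sketch below implements a generic group-wise symmetric INT4 weight quantization; the group size and scheme are assumptions for illustration, not the quantization recipe the project has committed to.

```c
/* Generic group-wise symmetric INT4 quantization reference: each group of
 * weights shares one FP32 scale and values are clamped to [-8, 7].
 * Illustrative only; link with -lm for lrintf. */
#include <math.h>
#include <stdint.h>

void quantize_int4(const float *w, int n, int group,
                   float *scales, int8_t *q)
{
    for (int g = 0; g * group < n; g++) {
        int lo = g * group;
        int hi = (lo + group < n) ? lo + group : n;

        /* Per-group scale from the largest magnitude in the group. */
        float amax = 0.0f;
        for (int i = lo; i < hi; i++)
            if (fabsf(w[i]) > amax)
                amax = fabsf(w[i]);
        float s = (amax > 0.0f) ? amax / 7.0f : 1.0f;
        scales[g] = s;

        /* Round to nearest and clamp to the signed 4-bit range; values are
         * stored one per byte here, and packing two per byte is left out. */
        for (int i = lo; i < hi; i++) {
            long v = lrintf(w[i] / s);
            if (v < -8) v = -8;
            if (v >  7) v =  7;
            q[i] = (int8_t)v;
        }
    }
}
```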
Every major regulated industry is deploying AI inference on-premises — for model risk compliance, data residency requirements, or latency-sensitive applications. They are doing it on proprietary GPU hardware they cannot audit, connected via proprietary software stacks they cannot verify, managed by cloud providers whose infrastructure they cannot inspect.
OpenAIU is the only project in the world combining an open processor ISA with IBM patent coverage, open AI accelerator cores derived from a published research architecture, auditable firmware from chip boot to inference kernel, and a hardware form factor compatible with multi-vendor manufacturing. For a bank that must explain to its regulator exactly what hardware and software computed a credit decision, OpenAIU is infrastructure that makes that explanation possible. No proprietary GPU can.