SK hynix + NVIDIA: SSDs and HBF for AI Inference - What Product Teams Should Prepare For
SK hynix is moving past HBM wins and pushing into NAND-based storage for AI. The company is co-developing a next-gen SSD with NVIDIA under "Storage Next" (NVIDIA) and "AI-N P" (SK hynix), targeting roughly 10x the performance of current SSDs. A prototype is planned for late next year, with a projection of up to 100 million IOPS by 2027.
In parallel, SK hynix and SanDisk are building a high-bandwidth flash (HBF) standard: NAND stacked and structured for AI-scale bandwidth. An alpha version is expected in late January next year, with customer prototypes in 2027. Shinyoung Securities estimates HBF could open a market of roughly $1B in 2027 and reach $12B by 2030.
Why storage is now the bottleneck
GPUs solved training throughput by pairing with HBM, which widened the data pipe and kept cores busy. But inference is a different game: latency and capacity dominate.
For reference, GPT-4 inference is said to need about 3.6 TB of memory, while a single HBM3E GPU offers roughly 192 GB. That often forces 6-7 GPUs per request, driving up service cost and complexity. Personalization lifts memory needs even further. Since HBM is volatile, it doesn't hold long-term user context. NAND does.
What SK hynix is building
- AI-N P (with NVIDIA): New SSD architecture and controllers optimized for large-scale AI inference I/O. The goal: reduce storage-compute stalls and improve perf-per-watt. PoC underway; prototype targeted for late next year; projected up to 100M IOPS in 2027.
- AI-N B / HBF (with SanDisk): Stacked NAND akin to HBM's wide data path, but for non-volatile flash. Alpha expected late January; customer prototypes in 2027. Packaging leans on SK hynix's VFO (vertical fan-out) to connect along the die edge, avoiding TSV drilling and the yield hits that are tough on complex 3D NAND.
- AI-N D: A middle-tier storage layer that targets TB-PB scale with SSD-like speed and HDD-like economics for inference-era workloads.
Key implications for product development
- Architect for memory tiers: HBM for hot tensors; SSD/HBF for large weights, caches, and user context. Design clear data residency rules and move less data across tiers (a residency-map sketch follows this list).
- Target latency, not just throughput: Prototype with realistic token budgets and batch sizes. Track tail latencies at the pipeline level (GPU + network + storage).
- Modernize the I/O path: Adopt async I/O, batched reads, and direct paths like NVIDIA GPUDirect Storage to cut CPU mediation (see the batched-read sketch after this list).
- Plan for new standards: HBF may ship through GPU vendors or storage OEMs. Expect co-validation cycles and firmware dependencies. Avoid lock-in with abstraction layers where possible.
- Budget for capacity at inference: Personalization requires persistent context. Model your per-user and per-session memory footprints up front (a KV-cache sizing sketch follows this list).
- Balance endurance and cost: Profile read/write ratios. Minimize write amplification with smarter sharding, prefetching, and compaction policies.
- Thermals and density: Higher IOPS means heat. Validate airflow, power envelopes, and slot count early, especially with dense NVMe or future HBF modules.
- Networking matters: Wide storage bandwidth is pointless if east-west traffic or PCIe lanes throttle end-to-end performance. Align PCIe Gen, NICs, and switch fabrics with your I/O targets.
- Observability: Instrument queue depth, IOPS per query, bytes per token, and GPU stall reasons. Tie these to unit economics (a per-query stats sketch follows this list).
- Roadmap alignment: Prototypes hit late next year; broader availability points to 2027. Line up vendor access, PoCs, and budget cycles now.
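To make the tiering item concrete, here is a minimal sketch of what "data residency rules in code" could look like, assuming a simple HBM / NVMe-or-HBF / object-store split. The artifact classes and byte budgets are hypothetical placeholders, not vendor guidance.

```python
from enum import Enum
from dataclasses import dataclass

class Tier(Enum):
    HBM = "hbm"        # hot: active weights, KV cache for in-flight requests
    NVME = "nvme"      # warm: weight shards, recent user context (SSD or future HBF)
    OBJECT = "object"  # cold: checkpoints, long-tail personalization data

@dataclass(frozen=True)
class ResidencyRule:
    artifact: str   # e.g. "kv_cache", "weights", "user_context"
    tier: Tier
    max_bytes: int  # per-node budget for this artifact class

# Hypothetical residency map; the point is that it lives in code and can be enforced.
RESIDENCY = {
    "kv_cache":     ResidencyRule("kv_cache", Tier.HBM, 64 * 2**30),
    "weights":      ResidencyRule("weights", Tier.NVME, 2 * 2**40),
    "user_context": ResidencyRule("user_context", Tier.NVME, 512 * 2**30),
}

def placement_for(artifact: str) -> Tier:
    """Fail loudly if someone tries to place an artifact class that was never budgeted."""
    rule = RESIDENCY.get(artifact)
    if rule is None:
        raise ValueError(f"No residency rule for {artifact!r}; add one before shipping.")
    return rule.tier
```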
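On the I/O-path item, a small sketch of batched, parallel reads using only the Python standard library. It shows the shape of the approach rather than a production path; a real deployment would move to io_uring or NVIDIA GPUDirect Storage instead of a thread pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def batched_read(path: str, extents: list[tuple[int, int]], workers: int = 16) -> list[bytes]:
    """Issue a batch of (offset, length) reads concurrently instead of one at a time.

    os.pread takes an explicit offset, so many threads can share one descriptor
    without fighting over a file cursor. This only demonstrates the batching shape;
    it still bounces data through CPU memory.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(os.pread, fd, length, offset) for offset, length in extents]
            return [f.result() for f in futures]
    finally:
        os.close(fd)
```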
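For the capacity-budgeting item, a back-of-envelope KV-cache sizing helper. The formula (keys plus values, per layer, per attention head, per position) is standard; the dimensions in the example are illustrative, roughly shaped like a 70B-class model with grouped-query attention and an FP16 cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Per-session KV-cache footprint: keys and values for every layer and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative numbers only:
per_session = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context_len=32_000)
print(f"{per_session / 2**30:.1f} GiB per 32k-token session")  # prints about 9.8 GiB
```

Multiply by concurrent sessions and the case for a persistent, cheaper tier under HBM becomes obvious.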
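For the observability item, a sketch of per-query counters that map directly to the metrics named above. Field and method names are hypothetical; the intent is that whatever serves the query emits one of these into your existing metrics pipeline.

```python
from dataclasses import dataclass, field
import time

@dataclass
class QueryIOStats:
    """Per-query counters that tie storage behavior back to unit economics."""
    started: float = field(default_factory=time.monotonic)
    read_ops: int = 0
    bytes_read: int = 0
    tokens_out: int = 0
    gpu_stall_s: float = 0.0  # time the GPU spent waiting on data, if measurable

    def record_read(self, nbytes: int) -> None:
        self.read_ops += 1
        self.bytes_read += nbytes

    def summary(self) -> dict:
        wall = time.monotonic() - self.started
        return {
            "io_ops_per_query": self.read_ops,
            "bytes_per_token": self.bytes_read / max(self.tokens_out, 1),
            "stall_fraction": self.gpu_stall_s / max(wall, 1e-9),
        }
```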
How this changes your build strategy
Training-optimized stacks won't carry you through inference at scale. You'll need non-volatile tiers that keep user context close and feed GPUs fast enough that they don't stall on I/O.
The likely end state: HBM + HBF + SSD, each with a clear role, plus software that actually takes advantage of the layout.
Immediate actions for your team
- Prototype NVMe-heavy inference nodes with direct storage paths; measure query cost vs. latency (a quick latency-percentile harness follows this list).
- Define a memory map per model (hot, warm, cold) and enforce it in code, not slides.
- Start vendor dialogues on AI-N P and HBF PoCs; document interface and firmware assumptions.
- Size your 2026-2027 infra budgets around capacity + latency, not just FLOPs.
- Upskill the team on AI infra and storage-aware inference patterns.
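For the first action item, a quick latency-percentile harness over a local file path (the path and sample counts are whatever you choose). Treat it as a smoke test, not a benchmark: the OS page cache will flatter these numbers unless you drop caches or bypass them, and fio or a trace of your real inference I/O should drive actual decisions.

```python
import os
import random
import statistics
import time

def random_read_latency(path: str, block_size: int = 4096, samples: int = 2000) -> dict:
    """Sample random block-sized reads from a file and report p50/p99 latency in microseconds."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    latencies = []
    try:
        for _ in range(samples):
            offset = random.randrange(0, max(size - block_size, 1))
            t0 = time.perf_counter()
            os.pread(fd, block_size, offset)
            latencies.append((time.perf_counter() - t0) * 1e6)
    finally:
        os.close(fd)
    latencies.sort()
    return {
        "p50_us": statistics.median(latencies),
        "p99_us": latencies[int(0.99 * (len(latencies) - 1))],
    }
```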
Bottom line: inference is pushing past HBM's comfort zone. NAND, via faster SSDs and HBF, looks set to become a first-class citizen in AI system design. If you own product outcomes, build for it now.