First Ethernet-Based AI Memory Fabric System Boosts LLM Efficiency
Enfabrica has introduced its Elastic Memory Fabric System (EMFASYS), an Ethernet-based AI memory fabric designed to cut AI inference cost per token, per user, by up to 50%. The combined hardware and software solution is built on Enfabrica’s proprietary SuperNIC network-interconnect silicon and aims to optimize AI memory management and improve the efficiency of large language model (LLM) inference.
Addressing AI-Inference Bottlenecks
LLM inference demands massive data movement between memory and processing units. Modern workloads require 10 to 100 times more compute time per query than earlier models. Without efficient memory access, CPUs, GPUs, and TPUs can sit idle for much of each inference step, waiting for data, as the back-of-envelope sketch below illustrates.
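To see why, here is a rough back-of-envelope model in Python. Every number in it is an illustrative assumption (a hypothetical 70-billion-parameter model on an H100-class accelerator), not an Enfabrica figure; the point is only that at batch size one, streaming weights from memory dominates each decoding step and leaves the compute units nearly idle.

```python
# Back-of-envelope model of memory-bound LLM decoding.
# All figures are illustrative assumptions, not vendor specifications.
PARAMS = 70e9            # hypothetical 70B-parameter model
BYTES_PER_PARAM = 2      # fp16/bf16 weights
HBM_BW = 3.35e12         # ~3.35 TB/s HBM bandwidth (H100-class, assumed)
PEAK_FLOPS = 1e15        # ~1 PFLOP/s dense fp16 throughput (assumed)

# Each autoregressive step streams the full weight set once (batch = 1).
bytes_per_token = PARAMS * BYTES_PER_PARAM
t_memory = bytes_per_token / HBM_BW       # time spent moving data
t_compute = 2 * PARAMS / PEAK_FLOPS       # ~2 FLOPs per parameter per token

print(f"memory-bound time/token:  {t_memory * 1e3:.1f} ms")   # ~41.8 ms
print(f"compute-bound time/token: {t_compute * 1e3:.2f} ms")  # ~0.14 ms
print(f"compute idle fraction:    {1 - t_compute / t_memory:.1%}")
```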
The EMFASYS rack-mount system tackles this challenge with a scalable, high-performance virtual memory system: multiple processors across different racks can share memory seamlessly, reducing delays and improving throughput.
How EMFASYS Works: A Hub-and-Spoke Model
At the core of EMFASYS is Enfabrica’s 3.2 Terabits/second Accelerated Compute Fabric SuperNIC (ACF-S). It bridges up to 18 memory channels, served by 144 CXL lanes, to 800/400 Gigabit Ethernet ports, all linked through high-speed RDMA over Ethernet.
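A quick unit check helps put those figures side by side. The sketch below converts the headline numbers to common units; the per-channel DDR5 bandwidth is our assumption, not an Enfabrica specification.

```python
# Unit sanity check on the ACF-S figures (per-channel DDR5 bandwidth assumed).
FABRIC_TBPS = 3.2                        # ACF-S aggregate, Terabits/s
fabric_gb_s = FABRIC_TBPS * 1000 / 8     # -> 400 GB/s of Ethernet bandwidth

CHANNELS = 18
DDR5_CHANNEL_GB_S = 44.8                 # DDR5-5600: 5600 MT/s * 8 B (assumed)
memory_gb_s = CHANNELS * DDR5_CHANNEL_GB_S

print(f"Ethernet side: {fabric_gb_s:.0f} GB/s")
print(f"Memory side:   {memory_gb_s:.0f} GB/s across {CHANNELS} DDR5 channels")
# The memory side exceeds the Ethernet side, so in this configuration the
# network links, not the DRAM channels, would be the binding constraint.
```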
This setup speeds memory access for complex multi-GPU AI inference tasks, cutting cost per token, per user, by up to 50% and supporting efficient infrastructure scaling. Each node can present shared memory targets of up to 18 Terabytes of CXL DDR5 DRAM, offloading data that would otherwise occupy scarce GPU HBM.
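One plausible way host software could exploit such a pool is to tier the LLM’s KV cache: hot pages stay in local HBM, cold pages spill to fabric-attached DDR5 and are fetched back over RDMA on demand. The sketch below is hypothetical; `TieredKVCache` and its methods are illustrative names, not an Enfabrica API.

```python
# Hypothetical tiered KV-cache placement: hot pages in local HBM,
# cold pages spilled to fabric-attached CXL DDR5 reached over RDMA.
# All names here are illustrative, not an Enfabrica API.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity_pages: int):
        self.hbm_capacity = hbm_capacity_pages
        self.hbm = OrderedDict()   # page_id -> bytes, kept in LRU order
        self.fabric = {}           # page_id -> bytes in remote CXL DDR5

    def put(self, page_id, data: bytes) -> None:
        self.hbm[page_id] = data
        self.hbm.move_to_end(page_id)
        while len(self.hbm) > self.hbm_capacity:
            cold_id, cold = self.hbm.popitem(last=False)
            self.fabric[cold_id] = cold      # stand-in for an RDMA write

    def get(self, page_id) -> bytes:
        if page_id in self.hbm:              # hot path: local HBM hit
            self.hbm.move_to_end(page_id)
            return self.hbm[page_id]
        data = self.fabric.pop(page_id)      # stand-in for an RDMA read
        self.put(page_id, data)              # promote back into HBM
        return data

cache = TieredKVCache(hbm_capacity_pages=2)
for i in range(4):
    cache.put(i, f"kv-page-{i}".encode())
assert cache.get(0) == b"kv-page-0"          # page 0 came back from the fabric
```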
The system adds about one kilowatt of power draw to a rack already consuming tens of kilowatts. Memory bandwidth is aggregated by striping transactions across multiple channels and Ethernet ports, and the cost of the fabric-attached memory stays below $20 per gigabyte.
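The striping idea resembles RAID 0 applied to memory traffic: consecutive chunks of one transfer are dealt round-robin across lanes, so the lanes’ bandwidths add. Below is a toy Python model of the placement logic only; in the real system the striping is done in NIC hardware.

```python
# Toy model of transaction striping across memory channels / Ethernet ports.
def stripe(payload: bytes, n_lanes: int, stripe_size: int = 4096):
    """Deal fixed-size stripes of `payload` to lanes round-robin."""
    lanes = [bytearray() for _ in range(n_lanes)]
    for i in range(0, len(payload), stripe_size):
        lanes[(i // stripe_size) % n_lanes] += payload[i:i + stripe_size]
    return lanes

payload = bytes(64 * 1024)                # a 64 KiB transfer
lanes = stripe(payload, n_lanes=4)
print([len(lane) for lane in lanes])      # 16 KiB each: 4 lanes work in parallel

# Cost check from the article's figures: at under $20/GB, an 18 TB pool
# of fabric-attached DDR5 comes to less than 18_000 * 20 = $360,000 per node.
```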
Enfabrica’s CEO, Rochan Sankar, compares EMFASYS to an airline hub-and-spoke system: large data payloads can be offloaded and distributed efficiently to different processors, similar to how passengers are transferred from jumbo jets to regional flights without congestion or delays.
Flexible, Architecture-Agnostic Memory Interaction
Current LLM accelerators, including GPUs and TPUs, share similar memory limitations because they depend on capacity-constrained integrated high-bandwidth memory (HBM). EMFASYS breaks these barriers by enabling high-performance memory access regardless of the underlying hardware.
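The gap is one of capacity rather than bandwidth. Using typical public HBM figures (assumed here, not supplied by Enfabrica), a quick comparison shows how much larger the fabric-attached pool is than a node’s local HBM.

```python
# Capacity comparison with assumed, typical HBM figures.
HBM_PER_GPU_GB = 192       # e.g. a current high-end accelerator (assumed)
GPUS_PER_NODE = 8          # a common node configuration (assumed)
FABRIC_POOL_GB = 18_000    # 18 TB of CXL DDR5 per node (from the article)

local_gb = HBM_PER_GPU_GB * GPUS_PER_NODE
print(f"local HBM per node:   {local_gb} GB")
print(f"fabric pool per node: {FABRIC_POOL_GB} GB "
      f"(~{FABRIC_POOL_GB / local_gb:.0f}x the local HBM)")
```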
According to Sankar, the system delivers a peak of 3.2 Terabits/second per accelerator while managing congestion, enabling a composable system: a mix of Nvidia GPUs, AMD GPUs, memory, and storage can be connected with consistent data flow.
EMFASYS Now Available for Sampling
Enfabrica is currently sampling EMFASYS with several customers, providing complete systems and evaluation platforms. Proof-of-concept setups hosted at the company’s data center demonstrate the system’s modularity, scalability, and room for customer expansion.
This solution presents a practical approach to improving AI inference efficiency and reducing infrastructure costs, especially as demand for large-scale AI models continues to grow.