Why AI Inference Is the New Bottleneck for Financial Services Firms

For financial services firms, AI inference now rivals training in difficulty, demanding efficient deployments everywhere from edge devices to data centers. Storage has become central to controlling costs, by preserving context and avoiding redundant computation.

Published on: Aug 01, 2025

For Financial Services Firms, AI Inference Is As Challenging As Training

In the early days of machine learning, training AI models was a costly and complex task, while inference—the process of applying these models to new data—was relatively straightforward. But for financial services firms today, especially those working with generative AI, inference has become just as challenging as training.

Financial institutions must deploy AI models that not only serve specific purposes but also run efficiently on diverse hardware. Some models run on smartphones or edge devices in bank branches to ensure low latency, while others demand heavy computing resources in data centers, introducing latency that applications must manage. This diversity means inference workloads span a variety of compute engines and require storage solutions that are no longer an afterthought.

Storage systems now play a crucial role in maintaining context to reduce unnecessary computation, effectively lowering inference costs. Lessons from financial services can benefit other industries, though these firms tend to keep their AI strategies confidential, partly due to the sensitive nature of their data and operations.

AI Inference Workloads in Financial Services

AI inference enhances a broad range of workloads within financial services. These can be grouped into three main categories:

  • Quantitative Finance: AI supports investment risk assessment, insurance actuarial evaluations, and algorithm back-testing to ensure models perform well on unseen data.
  • Underwriting and Analytics: AI processes alternative data streams like news sentiment, satellite imagery, and video feeds to gauge market sentiment and operational risks, including fraud detection and document automation.
  • Customer Experience: Conversational AI, chatbots, personalization engines, and advisory systems improve client interactions.

Additionally, many financial firms invest in AI-powered code assistants to modernize legacy banking systems, many of which still run on COBOL and mainframes. Techniques like Retrieval Augmented Generation (RAG) combine internal structured and unstructured data with pretrained models to improve inference quality, both internally and in customer-facing applications.
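
To make the RAG pattern concrete, here is a minimal sketch in Python. TF-IDF retrieval stands in for a production embedding model and vector database, and `call_llm` is a hypothetical placeholder for whatever hosted model a firm actually uses; the point is the flow, not any specific bank's stack.

```python
# Minimal RAG sketch: retrieve relevant internal documents, then
# prepend them to the prompt sent to a pretrained model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Q3 credit risk report: delinquencies rose 40 bps in auto loans.",
    "Internal policy: mortgage underwriting requires two income checks.",
    "FAQ: wire transfers over $10,000 trigger an extra review step.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the firm's hosted model endpoint."""
    return f"[model response to {len(prompt)} chars of prompt]"

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k internal documents most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)  # grounded generation using internal data

print(answer("What reviews apply to large wire transfers?"))
```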

Cautious Adoption of AI in Banking

JP Morgan, a leader in financial AI, has used traditional machine learning for risk and fraud detection for years but only launched its first generative AI tool, IndexGPT, in mid-2024. This tool leverages OpenAI’s GPT-4 to generate keywords for thematic investment indices, automating a task investors once performed manually. Though the indices are currently static, future versions may enable investors to create personalized, dynamic indices—provided inference costs decrease.

Bank of America’s Erica and Wells Fargo’s Fargo apps showcase different AI approaches in customer service. Erica, launched in 2018, relies on natural language processing and machine learning without large language models (LLMs). It serves more than 20 million active users and has handled billions of interactions, assisting with tasks like bill payment and service subscription management.

In contrast, Wells Fargo’s Fargo app, introduced in 2022, leans heavily on generative AI. It performs speech-to-text locally, uses a small on-device LLM to scrub personally identifiable information, and then queries a range of large models hosted on cloud platforms such as Google Cloud and Microsoft Azure. Fargo’s user interactions surged from 21 million in 2023 to over 245 million in 2024, highlighting the growing demand for inference.
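
A plausible sketch of that style of pipeline appears below. The function bodies, PII patterns, and routing logic are all illustrative assumptions, not Wells Fargo's actual implementation.

```python
import re

def transcribe_on_device(audio: bytes) -> str:
    """Stand-in for a small on-device speech-to-text model."""
    return "pay my visa bill, card ending 4242, from john.doe@example.com"

# Illustrative PII patterns; a real scrubber would be far more thorough.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card": re.compile(r"\b(?:card ending )?\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matches with placeholder tokens before text leaves the device."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

def route_to_cloud_model(prompt: str) -> str:
    """Hypothetical call to a large model hosted on Google Cloud or Azure."""
    return f"[cloud model response to: {prompt!r}]"

utterance = transcribe_on_device(b"<raw audio>")
safe_prompt = scrub_pii(utterance)  # "pay my visa bill, <card>, from <email>"
reply = route_to_cloud_model(safe_prompt)
```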

The Growth of AI Inference Nodes

Before generative AI, a trained model typically fit in a single GPU for inference, even if training required many GPUs. Now, large foundation models trained on tens of thousands of GPUs require inference systems with multiple GPUs (sometimes four, eight, or even sixteen) to hold and process their weights.
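
Back-of-the-envelope arithmetic shows why. The figures below are illustrative only, assuming FP16 weights, 80 GB GPUs, and headroom reserved for the KV cache and activations.

```python
# Rough estimate of how many GPUs are needed just to hold model weights.
# Real deployments also budget for KV cache, activations, and framework
# overhead, so the usable fraction per GPU is well below 100%.
import math

def gpus_needed(params_billions: float, bytes_per_param: int = 2,
                gpu_memory_gb: float = 80, usable_fraction: float = 0.7) -> int:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes ~ GB
    usable_gb = gpu_memory_gb * usable_fraction     # headroom per GPU
    return max(1, math.ceil(weights_gb / usable_gb))

for size in (7, 70, 405):
    print(f"{size}B params at FP16: ~{gpus_needed(size)} x 80 GB GPUs")
# 7B -> 1, 70B -> 3 (deployed as 4), 405B -> 15 (deployed as 16)
```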

Chain-of-thought reasoning models, which work through problems step by step and generate far more tokens per answer, demand roughly 100 times more compute but produce more accurate results. That demand points to rack-scale hardware like Nvidia’s GB200 NVL72, which links 72 GPUs to deliver exaflop-class inference compute at low precision.

Upcoming systems, such as Nvidia’s VR200 NVL144, will push performance further still, packing multiple exaflops of computing power into a single rack. These systems are suited to complex inference workflows far beyond what individual GPU cards or CPU-based inference can handle.

While many financial firms will continue using smaller multi-GPU setups, larger, rack-scale systems are becoming necessary to handle increasingly sophisticated inference workloads. Alternatives from AMD and Intel may also emerge, but adoption depends on how confidently institutions can rely on these advanced AI outputs—something highly regulated banks approach carefully.

Storage’s Crucial Role in AI Inference

Storage is no longer a secondary concern in AI inference. Efficient storage solutions enable caching of key-value (KV) vectors and context windows, so models do not recompute attention state for the entire context each time they generate a token. This lessens the burden on GPU memory and cuts inference costs.
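
A toy illustration of the mechanism in NumPy: each generation step computes keys and values only for the new token and reuses the cached entries for everything before it. Production engines implement this far more efficiently, but the principle is the same.

```python
import numpy as np

d = 64                          # head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []       # grows by one entry per generated token

def attend(x_new: np.ndarray) -> np.ndarray:
    """Process one new token, reusing cached keys/values for prior tokens."""
    k_cache.append(x_new @ Wk)  # K/V computed only for the NEW token
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x_new @ Wq
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()    # softmax over all tokens seen so far
    return weights @ V

for _ in range(5):              # five generation steps
    out = attend(rng.standard_normal(d))
# Without the cache, step t would recompute K and V for all t tokens,
# making total work quadratic in sequence length.
```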

Persistent memory can store query contexts, allowing systems to avoid redundant computations when similar tasks repeat. For example, Vast Data’s platform uses networked persistent memory to extend context length beyond GPU memory limits, improving efficiency in multi-tenant environments.
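
One way to picture that reuse is a persistent cache keyed by a hash of the shared prompt prefix, consulted before any recomputation. The sketch below is a simplified assumption of how such a tier could be addressed; the mount path and serialization are hypothetical, and Vast Data's actual mechanism is considerably more sophisticated.

```python
import hashlib
import pickle
from pathlib import Path

# Hypothetical mount point for a networked persistent-memory tier.
CACHE_DIR = Path("/mnt/persistent_kv")

def prefix_key(prompt_prefix: str) -> str:
    """Identical prefixes (e.g. a shared system prompt) hash to one entry."""
    return hashlib.sha256(prompt_prefix.encode()).hexdigest()

def load_cached_context(prompt_prefix: str):
    """Return precomputed KV state for this prefix, or None on a miss."""
    path = CACHE_DIR / prefix_key(prompt_prefix)
    if path.exists():
        return pickle.loads(path.read_bytes())  # reuse, skip recomputation
    return None

def store_context(prompt_prefix: str, kv_state) -> None:
    """Persist KV state so later, similar requests can start from it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    (CACHE_DIR / prefix_key(prompt_prefix)).write_bytes(pickle.dumps(kv_state))
```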

Proper data orchestration is also essential. Hammerspace, for instance, manages global metadata and treats local flash storage in GPU servers as a fast tier, moving data strategically to optimize inference throughput.
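
Reduced to its essentials, that kind of orchestration is a promotion policy: serve hot objects from local flash and fall back to the global tier on a miss. The sketch below is an illustrative LRU policy, not Hammerspace's implementation.

```python
from collections import OrderedDict

class FastTier:
    """LRU cache over local flash; a global object store is the slow tier."""

    def __init__(self, capacity: int, slow_tier: dict):
        self.capacity = capacity
        self.slow = slow_tier          # stand-in for the global namespace
        self.hot = OrderedDict()       # stand-in for local NVMe in the server

    def read(self, key: str) -> bytes:
        if key in self.hot:
            self.hot.move_to_end(key)  # refresh recency on a hit
            return self.hot[key]
        data = self.slow[key]          # miss: fetch from the global tier
        self.hot[key] = data           # promote into local flash
        if len(self.hot) > self.capacity:
            self.hot.popitem(last=False)  # evict least-recently-used object
        return data
```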

Financial services firms must integrate these storage strategies with their AI infrastructure to keep inference both efficient and cost-effective.

For professionals in finance looking to deepen their understanding of AI and its practical applications, exploring comprehensive AI training resources can provide valuable insights. Platforms like Complete AI Training offer courses tailored to various skill levels and job roles.

