LLM inference that moves the needle in finance
Large language models are changing how trading desks and research teams extract signals from unstructured data. With the right setup, they read filings, summarize risk exposures, and surface catalysts before they hit your P&L. To make hardware and software decisions easier, STAC introduced STAC-AI LANG6, a benchmark that isolates LLM inference performance inside a RAG pipeline so you can compare systems with apples-to-apples metrics.
Below, you'll see what performed best across batch and interactive workloads, what it means for real financial use cases, and how to run the same style of tests on your own data.
What STAC-AI LANG6 actually measures
The benchmark focuses on inference for Llama 3.1 8B Instruct and Llama 3.1 70B Instruct using two EDGAR-based tasks built from 10-K filings:
- EDGAR4: Medium-length prompts summarizing a company's relationship to financial and physical concepts (commodities, currencies, rates, real estate).
- EDGAR5: Long-context Q&A spanning a complete 10-K filing.
Both mirror real workflows such as summarizing annual reports, analyzing exposures, and drafting investment notes across thousands of public companies; the source documents are 10-K filings retrieved from the SEC's EDGAR system.
Two modes are tested. Batch (offline) measures raw throughput in requests per second (RPS) and words per second (WPS) when all requests arrive at once. Interactive (online) models live traffic with a tunable arrival rate λ and reports reaction time (RT, analogous to time-to-first-token) and words per second per user (WPS/user). Note: interactive mode does not include Llama 3.1 70B with EDGAR5. Output quality and length are checked against a control set. Unlike many benchmarks, STAC requires chat templating and tokenization at inference time, extra CPU work that real deployments may prefer to keep server-side for prompt security.
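As a rough illustration of how an interactive arrival model like this works, the sketch below (my own assumption, not STAC's actual harness) draws exponential inter-arrival gaps to generate a Poisson request stream at rate λ and checks the offered load:

```python
import random

def poisson_arrivals(lam: float, n_requests: int, seed: int = 0) -> list[float]:
    """Generate arrival timestamps for a Poisson process with rate lam (req/s).

    Inter-arrival gaps of a Poisson process are exponentially distributed
    with mean 1/lam, so we accumulate draws from random.expovariate.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n_requests):
        t += rng.expovariate(lam)  # mean gap = 1/lam seconds
        arrivals.append(t)
    return arrivals

# Example: 1,000 requests at lambda = 5 req/s should span roughly 200 s.
times = poisson_arrivals(lam=5.0, n_requests=1000)
print(f"observed rate: {len(times) / times[-1]:.2f} req/s")
```

Replaying such a schedule against a serving endpoint is what lets you measure RT and WPS/user under a controlled, tunable load rather than an all-at-once burst.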
Test platforms and software stack
Three configurations were compared:
- HPE ProLiant Compute DL384 Gen12 with NVIDIA GH200 Grace Hopper Superchip (on-prem, single-server efficiency).
- Nebius Cloud VM based on a single node of an NVIDIA GB200 NVL72 system: two Grace CPUs and four Blackwell GPUs fully connected via NVLink and NVSwitch (cloud).
- SuperMicro AS-5126GS-TNRT with two NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs (on-prem, compact Blackwell option). Each GPU has 96 GB memory.
Models were post-training quantized using NVIDIA TensorRT Model Optimizer (FP8 on Hopper, NVFP4 on Blackwell) and executed with NVIDIA TensorRT LLM (PyTorch runtime) for high-throughput, low-latency inference.
Batch-mode results (throughput)
Across every model/dataset, Blackwell posted the highest throughput. Note: GB200 NVL72 numbers below were not audited by STAC.
- Llama 3.1 8B, EDGAR4 - 2x GH200: 8,237 WPS / 51.5 RPS; 4x GB200 NVL72: 37,480 WPS / 224 RPS; 2x RTX PRO 6000: 5,500 WPS / 32.9 RPS.
- Llama 3.1 8B, EDGAR5 - 2x GH200: 304 WPS / 0.784 RPS; 4x GB200 NVL72: 1,112 WPS / 2.85 RPS; 2x RTX PRO 6000: 138 WPS / 0.345 RPS.
- Llama 3.1 70B, EDGAR4 - 2x GH200: 1,071 WPS / 6.77 RPS; 4x GB200 NVL72: 5,618 WPS / 35.9 RPS; 2x RTX PRO 6000: 831 WPS / 5.26 RPS.
- Llama 3.1 70B, EDGAR5 - 2x GH200: 41.4 WPS / 0.119 RPS; 4x GB200 NVL72: 150 WPS / 0.477 RPS; 2x RTX PRO 6000: 13 WPS / 0.04 RPS.
Single-GPU comparisons showed up to a 3.2x jump moving from a GH200 GPU to a GB200 NVL72 GPU.
Interactive-mode takeaways (user experience)
In live settings, desks care about two things: how fast the first token appears (RT) and how fast the rest streams (WPS/user). STAC reports the 95th percentile for both and also uses interword latency (IWL, the reciprocal of WPS/user) for clarity.
GB200 NVL72 sustained better interactivity at higher throughput across scenarios. Even when normalizing to each system's own maximum throughput, GB200 NVL72 typically held lower RT and IWL than GH200-meaning smoother user experience at the same relative load.
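To make those metrics concrete, here is a minimal sketch (my own illustration, not STAC code) that computes a 95th-percentile RT and the IWL implied by a per-user streaming rate:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def interword_latency(wps_per_user: float) -> float:
    """IWL is the reciprocal of per-user words-per-second."""
    return 1.0 / wps_per_user

# Hypothetical reaction times (seconds) for 20 requests.
rts = [0.21, 0.19, 0.25, 0.22, 0.30, 0.18, 0.24, 0.27, 0.20, 0.23,
       0.26, 0.21, 0.29, 0.22, 0.19, 0.25, 0.31, 0.20, 0.24, 0.28]
print(f"p95 RT: {percentile(rts, 95):.2f} s")
print(f"IWL at 40 WPS/user: {interword_latency(40):.3f} s/word")
```

Tracking the 95th percentile rather than the mean is what surfaces the tail behavior users actually feel under load.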
What this means for finance teams
If you're summarizing 10-Ks, screening sector risk, or generating trade ideas, the message is straightforward: Blackwell is the pick for peak throughput and tighter latency targets; Hopper remains a cost-efficient, reliable baseline that still scores well at high loads.
- Use quantization (FP8 on Hopper, NVFP4 on Blackwell) to fit larger contexts and drive speed without giving up accuracy.
- Keep chat templating and tokenization server-side for prompt security, but budget CPU for it; it matters under load.
- Set arrival rates to match real traffic and track the 95th percentile RT and IWL for SLAs your users will feel.
- Long-context runs (EDGAR5) are memory and bandwidth hungry; plan capacity accordingly.
- Consider single-server Blackwell (RTX PRO 6000) for on-prem pilots and the GB200 NVL72 node in cloud for scaled desks.
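One simple way to pick an arrival rate that matches real traffic (a heuristic of my own, not STAC guidance) is to estimate peak requests per second from your own request logs over a sliding window:

```python
def peak_arrival_rate(timestamps: list[float], window_s: float = 60.0) -> float:
    """Estimate peak request rate (req/s) over a sliding window of request timestamps."""
    ts = sorted(timestamps)
    best = 0
    left = 0
    for right in range(len(ts)):
        # Shrink the window until it spans at most window_s seconds.
        while ts[right] - ts[left] > window_s:
            left += 1
        best = max(best, right - left + 1)
    return best / window_s

# Hypothetical desk traffic: a burst around the open, then quiet.
log = [0.0, 1.0, 2.0, 2.5, 3.0, 120.0, 121.0]
print(f"peak rate: {peak_arrival_rate(log, window_s=10.0):.2f} req/s")
```

Feeding that peak rate in as λ, rather than an average, sizes the system for the moments when latency SLAs are hardest to hit.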
For broader context and vendor-neutral methodology, see STAC Research.
Benchmark TensorRT LLM on your own data
You can replicate this style of testing with your models and sequence lengths. Prereqs: an NVIDIA GPU sized for your model/quantization, a TensorRT LLM Docker image, and a Hugging Face token with access to Llama 3.1 8B/70B Instruct.
Quick-start commands
Step 1: Launch the container
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all \
-u $(id -u):$(id -g) \
-e USER=$(id -un) \
-e HOME=/tmp \
-e TRITON_CACHE_DIR=/tmp/.triton \
-e TORCHINDUCTOR_CACHE_DIR=/tmp/.inductor_cache \
-e HF_HOME=/workspace/model_cache \
-e HF_TOKEN=<your_huggingface_token> \
--volume "$(pwd)":/workspace \
--workdir /workspace \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2
Step 2: Clone the Model Optimizer repo
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git -b 0.37.0
Step 3: Quantize the model (NVFP4 shown)
bash TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
--model meta-llama/Llama-3.1-8B-Instruct \
--quant nvfp4
Step 4: Generate synthetic traffic
python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
--stdout \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
token-norm-dist \
--input-mean 2048 \
--output-mean 128 \
--input-stdev 0 \
--output-stdev 0 \
--num-requests 30000 \
> dataset_2048_128.json
Step 5: Run the benchmark (offline style)
cat > llm_options.yml << 'EOF'
cuda_graph_config:
  enable_padding: True
EOF
trtllm-bench \
--model meta-llama/Llama-3.1-8B-Instruct \
--model_path /workspace/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_Llama-3_1-8B-Instruct_nvfp4 \
throughput \
--dataset dataset_2048_128.json \
--backend pytorch \
--extra_llm_api_options llm_options.yml
This returns request throughput, tokens per second per GPU, and more, which is enough to map capacity to budget and user SLAs. If you have real traffic stats, vary input/output length distributions and arrival rates to mirror reality.
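As a rough way to turn those numbers into a capacity plan (a back-of-envelope helper of my own, not part of trtllm-bench), you can estimate how many GPUs a target request rate implies:

```python
import math

def gpus_needed(target_rps: float, measured_rps_per_gpu: float,
                headroom: float = 0.7) -> int:
    """GPUs required to serve target_rps, derating measured throughput by headroom.

    headroom < 1 leaves slack so p95 latency does not degrade at peak load.
    """
    usable = measured_rps_per_gpu * headroom
    return math.ceil(target_rps / usable)

# Hypothetical: the benchmark measured 25 RPS/GPU; the desk needs 60 RPS at peak.
print(gpus_needed(target_rps=60, measured_rps_per_gpu=25))
```

The headroom factor is the key judgment call: sizing to 100% of benchmarked throughput leaves no slack for traffic bursts or long-context outliers.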
Bottom line
NVIDIA GB200 NVL72 pushed STAC-AI LANG6 to new highs, delivering up to 3.2x gains over prior architectures along with better interactivity at higher loads. Hopper remains a strong, proven option for both batch and interactive inference. Choose based on your mix of long-context analysis, latency targets, and run-rate cost.
Want more workflows and playbooks? Explore AI for Finance for practical applications across investment research, trading, and risk.