AI on Demandpowered by NVIDIA

NVIDIA Nemotron 3 Super: Hybrid Mamba-Transformer architecture with a mixture-of-experts model, 120 billion parameters and a context window of up to 1 million tokens. 12 billion parameters active per query — powered by stepping stone on Swiss infrastructure.

NVIDIA Nemotron 3 Super is an open-weight language model featuring a hybrid Mamba-Transformer architecture and Mixture-of-Experts (MoE). Of its 120 billion parameters, only around 12 billion are active per query — enabling Frontier-level performance whilst making efficient use of resources.

The model processes contexts of up to 1 million tokens — enough for entire document collections, codebases or conversations lasting several hours in a single query. It supports 7 languages (including German), offers configurable reasoning and is specifically trained for agent-based workflows, tool integration and RAG scenarios. stepping stone runs Nemotron 3 Super entirely on Swiss infrastructure — your data remains in Switzerland.

Companies and development teams that require a high-performance language model for complex tasks — without relying on US cloud services. Particularly suitable for organisations dealing with long documents, multilingual requirements or agent-based workflows.

Typical use cases: analysis and summarisation of large document collections; agent-based workflows with tool integration and autonomous planning; RAG scenarios with extensive context; code generation and code reviews; and the automation of IT workflows and recurring tasks.

Open Weights (NVIDIA Nemotron Open Model Licence). Swiss data centres. No vendor lock-in.

Context window of up to 1 million tokens — one of the largest on the market. Efficient despite 120 billion parameters, thanks to MoE architecture. Configurable reasoning: can be enabled or disabled depending on the task. NVIDIA, as the developer, stands for reliability and continuous development. Personalised advice and operation provided by stepping stone in Bern.

Scope of services

AI model on demand

Access to NVIDIA Nemotron 3 Super for reasoning, agent-based workflows and complex text tasks. A context window of up to 1 million tokens for processing entire collections of documents in a single query.

GPU performance on demand

Scalable computing power on Swiss infrastructure. Particularly efficient thanks to MoE architecture: 120 billion parameters, with only 12 billion active per query.

Managed service

Deployment, monitoring, maintenance and support on Swiss infrastructure, with personalised advice. stepping stone takes care of the day-to-day running so that you can focus on the benefits.

Areas of application

Document analysis

With 1 million tokens, Nemotron 3 Super can process entire document collections, codebases or conversation histories lasting several hours in a single query.

Companies use it to analyse extensive reports, legal documents and technical documentation. Configurable reasoning can be switched on or off depending on the task — ensuring precise results without unnecessary computational overhead.

Reasoning & Analysis

Nemotron 3 Super is designed for complex, multi-stage reasoning — with autonomous task planning and tool integration.

RAG scenarios involving broad context, IT workflow automation and code generation benefit from MoE efficiency: cutting-edge performance with efficient use of resources, compatible with standard API clients, and fully hosted on Swiss infrastructure.

Benchmark

The benchmarks were measured using the vllm bench tool against the production API gateway. The standard input sizes were 1,024 tokens for input and 256 tokens for output, which corresponds to 2–3 book pages or 500–750 words.

If necessary, higher input sizes can be set.

 

Call

# Set your personal key:
STONEY_KEY=sk-...

# Make key visible for vllm bench:
export OPENAI_API_KEY=$STONEY_KEY

# Start the benchmark
vllm bench serve \
 --backend openai-chat \
 --model "NVIDIA/NVIDIA-Nemotron-3-Super-120B-A12B" \
 --base-url llm.stoney-cloud.com \
 --endpoint /v1/chat/completions \
 --dataset-name random \
 --random-input-len 1024 \
 --random-output-len 256 \
 --num-prompts 50 \
 --max-concurrency 1 \
--tokenizer "Qwen/Qwen2.5-7B-Instruct" \
 --percentile-metrics ttft

 

Result

============ Serving Benchmark Result ============
Successful requests:                     44        
Failed requests:                         6         
Maximum request concurrency:             1         
Benchmark duration (s):                  148.34    
Total input tokens:                      69315     
Total generated tokens:                  11264     
Request throughput (req/s):              0.30      
Output token throughput (tok/s):         75.94     
Peak output token throughput (tok/s):    257.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          543.22    
---------------Time to First Token----------------
Mean TTFT (ms):                          2964.56   
Median TTFT (ms):                        2963.51   
P99 TTFT (ms):                           3033.40

 

Legend

  • Successful requests: Successful prompt requests.
  • Failed requests: Unsuccessful prompts.
  • Maximum request concurrency: How many requests the model processes simultaneously.
  • Benchmark duration (s): The duration of the benchmark run in seconds.
  • Total input tokens: The total number of input tokens.
  • Total generated tokens: The total number of tokens generated by the model.
  • Request throughput (req/s): The number of requests processed per second.
  • Output token throughput (tok/s): The average number of tokens generated per second.
  • Peak output token throughput (tok/s): The maximum measured number of output tokens per second.
  • Peak concurrent requests: The maximum measured number of requests processed simultaneously.
  • Total token throughput (tok/s): The average of all tokens processed during the measurement.
  • Mean Time to First Token (TTFT) (ms): The average time elapsed between input and the first visible output.
  • Median TTFT (ms): The expected time between input and the first visible output. Also known as TTFT p50.
  • p99 TTFT (ms): The time elapsed in the “worst case” scenario until the first token is generated.
  • Tokenizer: The tokenizer is used to send queries to the evaluated model during a benchmark. These are typically small, publicly available models, such as Qwen/Qwen2.5-7B-Instruct.

Price

ModelContext lengthInput/MTokOutput/MTok
NVIDIA-Nemotron-3-Super-120B-A12B131k2.00005.0000
All prices are in CHF/MTok, excluding VAT.