AI on Demandpowered by MiniMaxAI

MiniMax-M2.5: a mixture-of-experts model with 230 billion parameters — of which only 10 billion are active per query. Trained on over 200,000 real-world development environments and more than 10 programming languages, and operated by stepping stone on Swiss infrastructure.

MiniMax-M2.5 is an open-weight language model based on a Mixture-of-Experts (MoE) architecture. Of its 230 billion parameters, only around 10 billion are active per query — this makes the model fast and cost-effective without compromising on performance.

The model has been specifically trained for coding and agent-based tasks: across over 200,000 real-world development environments and more than 10 programming languages. It plans like a software architect, breaks down complex tasks into manageable steps and works independently with tools. stepping stone runs MiniMax-M2.5 on Swiss infrastructure — no data leaves the country.

Development teams and companies wishing to use AI to support programming, automation or the creation of agent-based workflows — without relying on US providers. Particularly suitable for organisations that require cutting-edge performance whilst keeping an eye on costs and data sovereignty.

Typical use cases: AI-powered software development and code reviews; agent-based workflows with tool integration and web search; automated document generation (Word, Excel, PowerPoint); complex, multi-stage tasks with autonomous planning.

Open Weights (Modified MIT). Swiss data centres. No vendor lock-in.

Thanks to its MoE architecture, MiniMax-M2.5 delivers cutting-edge performance at a fraction of the usual cost — 10 to 20 times cheaper than comparable proprietary models. The model supports over 10 programming languages and works independently with tools, search engines and files. Personalised advice and operation provided by stepping stone in Bern.

Scope of services

AI model on demand

Access to MiniMax-M2.5 for coding, agent-based workflows and complex text tasks. Frontier-level performance on a par with the best proprietary models — at a fraction of the cost.

GPU performance on demand

Scalable computing power for Swiss infrastructure. Particularly efficient thanks to MoE architecture: 230 billion parameters, with only 10 billion active per query.

Managed service

Deployment, monitoring, maintenance and support on Swiss infrastructure, with personalised advice. stepping stone takes care of the day-to-day running so that you can focus on the benefits.

Areas of application

Software development

MiniMax-M2.5 has been specifically trained for professional software development — in more than 10 programming languages and real-world production environments.

It generates, analyses and reviews code at a cutting-edge level: 10 to 20 times cheaper than comparable proprietary models. Teams use it as a coding assistant, for automated code reviews and to speed up development cycles.

Agent-based workflows

The model plans like a software architect: it breaks down complex tasks, works independently with tools and delivers structured results.

MiniMax-M2.5 supports function calling, web searches and file operations — ideal for multi-step automation. It automatically creates Word, Excel and PowerPoint documents and is suitable for agent-based set-ups that do not rely on US cloud services.

Benchmark

The benchmarks were measured using the vllm bench tool against the production API gateway. The standard input sizes were 1,024 tokens for input and 256 tokens for output, which corresponds to 2–3 book pages or 500–750 words.

If necessary, higher input sizes can be set.

Call

# Set your personal key:
STONEY_KEY=sk-...

# Make key visible for vllm bench:
export OPENAI_API_KEY=$STONEY_KEY

# Start the benchmark
vllm bench serve \
 --backend openai-chat \
 --model "MiniMaxAI/MiniMax-M2.5" \
 --base-url llm.stoney-cloud.com \
 --endpoint /v1/chat/completions \
 --dataset-name random \
 --random-input-len 1024 \
 --random-output-len 256 \
 --num-prompts 50 \
 --max-concurrency 1 \
--tokenizer "Qwen/Qwen2.5-7B-Instruct" \
 --percentile-metrics ttft

Result

============ Serving Benchmark Result ============
Successful requests:                     48        
Failed requests:                         2         
Maximum request concurrency:             1         
Benchmark duration (s):                  187.20    
Total input tokens:                      73622     
Total generated tokens:                  12288     
Request throughput (req/s):              0.26      
Output token throughput (tok/s):         65.64     
Peak output token throughput (tok/s):    257.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          458.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          3740.68   
Median TTFT (ms):                        3756.77   
P99 TTFT (ms):                           3871.80   
==================================================

Legend

  • Successful requests: Successful prompt requests
  • Failed requests: Unsuccessful prompts
  • Maximum request concurrency: How many requests the model processes simultaneously.
  • Benchmark duration (s): The duration of the benchmark run in seconds.
  • Total input tokens: The total number of input tokens.
  • Total generated tokens: The total number of tokens generated by the model.
  • Request throughput (req/s): The number of requests processed per second.
  • Output token throughput (tok/s): The average number of tokens generated per second.
  • Peak output token throughput (tok/s): The maximum measured number of output tokens per second.
  • Peak concurrent requests: The maximum measured number of requests processed simultaneously.
  • Total token throughput (tok/s): The average of all tokens processed during the measurement.
  • Mean Time to First Token (TTFT) (ms): The average time elapsed between input and the first visible output.
  • Median TTFT (ms): The expected time between input and the first visible output. Also known as TTFT p50.
  • p99 TTFT (ms): The time elapsed in the “worst case” scenario until the first token is generated.
  • Tokenizer: The tokenizer is used to send queries to the evaluated model during a benchmark. These are typically small, publicly available models, such as Qwen/Qwen2.5-7B-Instruct.

Price

ModelContext lengthInput/MTokOutput/MTok
MiniMax-M2.5196k1.94009.7000
All prices are in CHF/MTok, excluding VAT.