AI on Demandpowered by Qwen

Qwen — Alibaba’s open-source language model family, ranging from 0.6 to 235 billion parameters. stepping stone runs them entirely on Swiss GPU infrastructure: as an API, on a pay-as-you-go basis, with no dependence on the US.

With stepping stone, businesses gain access to AI capabilities, models, GPU resources, storage, interfaces and consultancy, flexibly tailored to their needs. In other words, it is not a rigid product package, but a scalable AI environment that is operated securely in Swiss data centres and tailored to the business’s requirements.

The Qwen models run entirely on Swiss infrastructure. No data leaves the country. Access is via an OpenAI-compatible API, which can be integrated directly into existing applications and workflows.

Swiss companies wishing to deploy AI in a production environment without having to set up their own GPU infrastructure or transfer data to US providers. Particularly suitable for regulated sectors, public authorities and SMEs looking to make the transition from AI experimentation to production use.

Typical use cases: chatbots and assistant systems, automation and analysis, coding assistants, document processing, agent-based workflows.

Swiss data centres. Open-source models. Your data, your rules. You retain full control over your AI strategy, without being dependent on OpenAI, Google or Amazon.

Personalised advice from stepping stone, from model selection through to integration. And a pricing model based on usage. You only pay for what you use.

Areas of application

Assistance

Qwen is suitable for building intelligent assistance systems — from simple chatbots to complex dialogue control.

Teams use it for automated customer service, internal knowledge search and multilingual analysis workflows. Thanks to the range of models available, you can choose the right one for any purpose — from the compact 0.6B for edge applications to the 235B flagship model for demanding tasks.

Development

Qwen has proven its worth in development practice as a coding assistant and the foundation for agent-based workflows.

The model generates, analyses and reviews code in over 20 programming languages. For agent-based setups, it supports function calling, tool integration and multi-step task planning — on Swiss infrastructure, and can be integrated via an OpenAI-compatible API.

Benchmark

The benchmarks were measured using the vllm bench tool against the production API gateway. The standard input sizes were 1,024 tokens for input and 256 tokens for output, which corresponds to 2–3 book pages or 500–750 words.

If necessary, higher input sizes can be set.

Call

# Set your personal key:
STONEY_KEY=sk-...

# Make key visible for vllm bench:
export OPENAI_API_KEY=$STONEY_KEY

# Start the benchmark
vllm bench serve \
 --backend openai-chat \
 --model "Qwen/Qwen3-Coder-Next" \
 --base-url llm.stoney-cloud.com \
 --endpoint /v1/chat/completions \
 --dataset-name random \
 --random-input-len 1024 \
 --random-output-len 256 \
 --num-prompts 50 \
 --max-concurrency 1 \
--tokenizer "Qwen/Qwen2.5-7B-Instruct" \
 --percentile-metrics ttft

Result

============ Serving Benchmark Result ============
Successful requests:                     49
Failed requests:                         1
Maximum request concurrency:             1
Benchmark duration (s):                  162.16
Total input tokens:                      50568
Total generated tokens:                  12544
Request throughput (req/s):              0.30
Output token throughput (tok/s):         77.36
Peak output token throughput (tok/s):    257.00
Peak concurrent requests:                2.00
Total token throughput (tok/s):          389.20
---------------Time to First Token----------------
Mean TTFT (ms):                          3239.52
Median TTFT (ms):                        3239.48
P99 TTFT (ms):                           3365.96
==================================================

Legend

  • Successful requests: Successful prompt requests
  • Failed requests: Unsuccessful prompts
  • Maximum request concurrency: How many requests the model processes simultaneously.
  • Benchmark duration (s): The duration of the benchmark run in seconds.
  • Total input tokens: The total number of input tokens.
  • Total generated tokens: The total number of tokens generated by the model.
  • Request throughput (req/s): The number of requests processed per second.
  • Output token throughput (tok/s): The average number of tokens generated per second.
  • Peak output token throughput (tok/s): The maximum measured number of output tokens per second.
  • Peak concurrent requests: The maximum measured number of requests processed simultaneously.
  • Total token throughput (tok/s): The average of all tokens processed during the measurement.
  • Mean Time to First Token (TTFT) (ms): The average time elapsed between input and the first visible output.
  • Median TTFT (ms): The expected time between input and the first visible output. Also known as TTFT p50.
  • p99 TTFT (ms): The time elapsed in the “worst case” scenario until the first token is generated.
  • Tokenizer: The tokenizer is used to send queries to the evaluated model during a benchmark. These are typically small, publicly available models, such as Qwen/Qwen2.5-7B-Instruct.

Price

ModelContext lengthInput/MTokOutput/MTok
Qwen3.5-35B-A3B-FP8131k0.17001.0000
Qwen3-Coder-Next262k0.34001.7000
All prices are in CHF/MTok, excluding VAT.