Industry-standard LLM benchmarks in DataRobot

Every LLM deployment has a capacity ceiling, a latency curve, and a unit cost. Most teams operate blind, discovering their deployment's limits only when over-provisioning exhausts the GPU budget or when peak traffic causes a catastrophic failure.

Three numbers matter: the maximum sustained concurrency before the GPU saturates, the end-to-end response time at that concurrency, and the cost per million tokens under sustained load. These metrics emerge from how the model interacts with your hardware, serving runtime, tokenization, and traffic mix.

DataRobot 11.8 changes that with LLM Profiling Jobs: a native integration of NVIDIA AIPerf, an industry-standard generative AI benchmarking tool. A single authenticated POST benchmarks any DataRobot LLM deployment that serves an OpenAI-compatible endpoint, sweeps the range of concurrency levels and use cases you define, and returns measured inputs for quota reservations (available in DataRobot 11.9).

Why is LLM capacity difficult to predict?

LLM inference does not scale linearly. The compute and memory requirements of each request depend on the prompt length, response length, sampling parameters, and KV cache usage. A deployment serving 50 short chat turns per second can stall at 5 long-context RAG requests per second on the same hardware. Four distinct behaviors make fixed or guessed capacity estimates unreliable:

  • Latency is nonlinear in concurrency. Time to first token (TTFT) and inter-token latency stay roughly flat over a wide range of concurrency, then rise sharply once GPU memory bandwidth or compute saturates. TTFT rises when prefill compute saturates; inter-token latency rises when decode memory bandwidth saturates. Which comes first depends on the workload mix and the deployment's GPU configuration (single GPU or cluster). The saturation knee is the operating point that matters, and it cannot be inferred from a single low-load measurement.
  • Throughput and latency trade off. You can squeeze more total tokens per second out of a deployment by running it at higher concurrency, at the cost of slower responses per user. The right trade-off depends on your SLO, not on a general recommendation.
  • The use case mix matters. Two deployments running the same model on the same hardware can have very different capacities if one serves short Q&A and the other serves long-context summarization. The mix must be represented in the test; otherwise the result is wrong.
  • Caching and routing change the answer. Prefix caching (common in agentic coding with periodic compaction) and KV-aware routing can raise effective throughput dramatically. Profiles run against cold serving with random inputs represent the floor, not the ceiling.

LLM Profiling Jobs make these curves visible.

How LLM benchmarks help

  • Defend capacity and quota decisions with measured data. When Finance questions the 4×H100 footprint, or when cross-functional teams negotiate shared capacity, you can justify the architecture with empirical profiling data. The saturation knee, SLO target, and expected traffic turn GPU sizing into an evidence-based exercise. The same numbers feed directly into quota reservations.
  • Calculate cost per consumer. Total token throughput combined with the GPU instance cost yields a cost per million tokens that supports chargeback or showback. Attribute spend to consumers based on their reservations, not guesswork.
  • Compare models and hardware on equal terms. Keep the workload profile constant and change one dimension at a time: the same model on different GPU configurations (B200 node vs. B300 node, or 4×H100 vs. 8×H100), or different models on the same configuration (Qwen3.6 35B-A3B MoE vs. Qwen3.6 27B dense). Because the AIPerf benchmarks match NVIDIA's published NIM benchmarks, the numbers are also directly comparable to public benchmarks for the same model and hardware combinations. These are the right inputs for purchasing decisions and capacity sizing before you order hardware.
  • Prove that a change is safe before shipping it. Before upgrading the model, bumping vLLM, switching a driver, or migrating to a different GPU, rerun the same profile and compare it to the previous baseline. Regressions show up in metrics, not in incident reports.

What the LLM benchmarks mean

The four main AIPerf metrics map directly back to user experience and to GPU economics:

  • Time to first token (TTFT, ms). How long the user waits between sending the prompt and seeing the first token. This metric is dominated by prefill compute.
  • Inter-token latency (ITL, ms). The average time between successive output tokens once generation has begun. It sets the perceived "typing speed" of the response.
  • Request throughput (requests/sec). Completed request-response cycles per second at the tested concurrency. This is the basis for the capacity value (RPM) in quota reservations.
  • Total token throughput (tokens/sec). Total tokens (input plus output) processed per second across all concurrent requests. This is the basis for per-token cost economics.

For each metric, AIPerf reports averages and percentiles (p50, p90, p99). When GPU saturation is detected during the sweep, estimatedCapacity reports the iteration immediately before it. When saturation is not detected (the common case, since the profiler is not co-located with the deployment), estimatedCapacity reports the last iteration tested. Sweep wide enough that the curve clearly bends, or treat the result as a lower bound.
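The selection rule described above can be sketched in a few lines of Python. This is an illustration only, not AIPerf's or DataRobot's implementation: the `Iteration` record and its `saturated` flag are hypothetical stand-ins for whatever signal the profiler uses to detect saturation, which is not specified here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Iteration:
    """One sweep step. Field names are illustrative, not the API schema."""
    concurrency: int
    saturated: bool  # hypothetical flag: was GPU saturation detected here?


def pick_estimated_capacity(sweep: Sequence[Iteration]) -> Iteration:
    """Return the iteration just before the first saturated one,
    or the last iteration tested if saturation was never detected."""
    if not sweep:
        raise ValueError("sweep must contain at least one iteration")
    for index, step in enumerate(sweep):
        if step.saturated:
            # Assumption: if the very first step is already saturated,
            # there is no earlier iteration, so fall back to that step.
            return sweep[max(index - 1, 0)]
    return sweep[-1]


detected = [Iteration(1, False), Iteration(10, False),
            Iteration(50, False), Iteration(100, True)]
undetected = [Iteration(c, False) for c in (1, 10, 50, 100)]

print(pick_estimated_capacity(detected).concurrency)    # 50
print(pick_estimated_capacity(undetected).concurrency)  # 100
```

In the second case the returned value is only the last point tested, which is why an undetected-saturation result should be read as a lower bound.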

Submit a job

The profiling request takes four parameters: a deploymentId (the ID of the DataRobot LLM deployment you want to profile), a list of concurrency levels to sweep, a request count scalar (the number of requests issued by each concurrent worker), and one or more use cases. Each use case specifies the input sequence length (ISL), the output sequence length (OSL), the standard deviation of each, and a weight (prob). The weights across all use cases must sum to 100.

export DATAROBOT_ENDPOINT="https://app.datarobot.com"
export DR_API_KEY=""
export HUGGINGFACE_DR_CRED_ID=""
export DEPLOYMENT_ID=""
# JSON array of concurrency levels to sweep
export CONCURRENCIES="[1,10,50,100]"
export REQUEST_COUNT_SCALAR=2
export MODEL_TOKENIZER="openai/gpt-oss-20b"
# JSON array of use cases; the prob values must sum to 100
export USE_CASES='[{"isl":200,"islStddev":15,"osl":1000,"oslStddev":15,"prob":100}]'

curl -X POST -H "Authorization: Bearer ${DR_API_KEY}" \
     -H "Content-Type: application/json" \
     "${DATAROBOT_ENDPOINT}/api/v2/llmProfilingJobs/" \
     -d @- <

A 202 Accepted response returns the job ID, the job execution ID, and the status ID:

{
  "id": "69e09f9e25fdfdfab0d27925",
  "jobExecutionId": "69e09f9f25fdfdfab0d27926",
  "statusId": "5633f028-3f68-4f83-bddc-560d266d6bd2"
}

Monitor and retrieve LLM benchmark results

Poll the status API with the returned statusId. When the job finishes, the API returns 303 See Other, and the Location header points to the results endpoint:

curl -s -L -i \
  -H "Authorization: Bearer ${DR_API_KEY}" \
  "${DATAROBOT_ENDPOINT}/api/v2/status/${STATUS_ID}/"

Fetch the complete results with the profiling job ID:

curl -H "Authorization: Bearer ${DR_API_KEY}" \
     "${DATAROBOT_ENDPOINT}/api/v2/llmProfilingJobs/${LLM_PROFILING_JOB_ID}/profilingResults/"

Example payload (truncated):

{
  "estimatedCapacity": {
    "metrics": [
      { "name": "request_throughput",     "units": "requests/sec", "measurements": [{ "name": "avg", "value": 8.84    }] },
      { "name": "inter_token_latency",    "units": "ms",           "measurements": [{ "name": "avg", "value": 23.79   }] },
      { "name": "time_to_first_token",    "units": "ms",           "measurements": [{ "name": "avg", "value": 833.06  }] },
      { "name": "total_token_throughput", "units": "tokens/sec",   "measurements": [{ "name": "avg", "value": 4524.80 }] }
    ]
  },
  "results": [ "...per-iteration benchmark data..." ]
}

estimatedCapacity is the sustainable operating point. results contains one entry for each concurrency level tested, with the full set of metrics.
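As a rough sketch of how the estimatedCapacity block might be consumed, the snippet below reads the average values out of a payload shaped like the truncated example. The helper name `metric_stat` is hypothetical, and the `results` list is left empty because the example elides its contents.

```python
import json

# Payload copied from the truncated example; "results" is left empty here.
payload = json.loads("""
{
  "estimatedCapacity": {
    "metrics": [
      {"name": "request_throughput", "units": "requests/sec",
       "measurements": [{"name": "avg", "value": 8.84}]},
      {"name": "inter_token_latency", "units": "ms",
       "measurements": [{"name": "avg", "value": 23.79}]},
      {"name": "time_to_first_token", "units": "ms",
       "measurements": [{"name": "avg", "value": 833.06}]},
      {"name": "total_token_throughput", "units": "tokens/sec",
       "measurements": [{"name": "avg", "value": 4524.80}]}
    ]
  },
  "results": []
}
""")


def metric_stat(capacity: dict, metric: str, stat: str = "avg") -> float:
    """Look up one statistic of one metric in an estimatedCapacity block."""
    for entry in capacity["metrics"]:
        if entry["name"] != metric:
            continue
        for measurement in entry["measurements"]:
            if measurement["name"] == stat:
                return measurement["value"]
    raise KeyError(f"{metric}/{stat} not found")


capacity = payload["estimatedCapacity"]
rps = metric_stat(capacity, "request_throughput")
tps = metric_stat(capacity, "total_token_throughput")
print(f"{rps:.2f} requests/sec, {tps:.1f} tokens/sec")  # 8.84 requests/sec, 4524.8 tokens/sec
```

The same lookup works for percentile statistics if the payload includes them under other measurement names; which names appear is not shown in the truncated example.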

Reading the curve

The estimated capacity numbers tell you where the sustainable ceiling is. The per-iteration results show you how the deployment behaves as the load climbs toward that ceiling. The table below is an illustrative example.

Concurrent requests | TTFT (ms) | Total throughput (tokens/sec) | Note
1 | ~150 | ~600 | Low load, near-floor latency
10 | ~250 | ~2500 | Throughput scales approximately linearly
50 | ~800 | ~4500 | estimatedCapacity returned from this iteration
100 | ~1500 | ~4600 | Saturated: TTFT nearly doubles, throughput plateaus

When AIPerf detects GPU saturation during a sweep, it identifies the iteration before it (concurrency 50 here) and returns those metrics as estimatedCapacity. When saturation is not detected, estimatedCapacity is simply the last iteration tested, which is why the sweep should extend beyond the knee. Anything beyond this point trades user-perceived latency for marginal gains in throughput. If the product specification requires a TTFT under 1 second, the curve shows that the deployment supports up to approximately 50 concurrent requests with margin, so provision GPUs such that peak concurrent requests stay at or below that level.
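The SLO reading above can be reproduced from the illustrative table with a short Python sketch. The function names are hypothetical and the data points are the approximate values from the table, not real measurements.

```python
# (concurrency, TTFT in ms, total tokens/sec) from the illustrative table.
CURVE = [(1, 150, 600), (10, 250, 2500), (50, 800, 4500), (100, 1500, 4600)]


def max_concurrency_within_slo(curve, ttft_slo_ms):
    """Highest tested concurrency whose TTFT stays under the SLO, or None.

    Assumes TTFT does not decrease as concurrency grows, as in the table.
    """
    passing = [conc for conc, ttft, _ in curve if ttft < ttft_slo_ms]
    return max(passing) if passing else None


def marginal_gains(curve):
    """Relative throughput gain between each pair of consecutive steps."""
    return [
        (nxt[0], (nxt[2] - cur[2]) / cur[2])
        for cur, nxt in zip(curve, curve[1:])
    ]


print(max_concurrency_within_slo(CURVE, ttft_slo_ms=1000))  # 50
for concurrency, gain in marginal_gains(CURVE):
    print(f"up to {concurrency}: {gain:.1%} more throughput")
# up to 10: 316.7% more throughput
# up to 50: 80.0% more throughput
# up to 100: 2.2% more throughput
```

The last step shows the plateau: doubling concurrency from 50 to 100 buys about 2% more throughput while TTFT goes from roughly 800 ms to 1500 ms.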

From profiling results to quota reservation configuration

The bridge from running profiles to configuring quota reservations is straightforward:

Quota setting | Where it comes from | Example (from the sample above)
Capacity (rpm) | estimatedCapacity.request_throughput × 60 | 8.84 requests/s × 60 ≈ 530 rpm
Utilization threshold | Choose 70-80% of capacity so that enforcement kicks in before the saturation knee | 80% → enforcement at approximately 424 rpm
Reserved % per consumer | Size the minimum for each consumer's priority needs under contention | 30% Production Agent A, 20% Agent B, 30% Agent C, 20% unreserved pool
Refill rate | Capacity / 60 (requests per second) | 530 / 60 ≈ 8.83 requests/s

For a primer on how capacity, utilization threshold, and reserved percentage interact under load, see the DataRobot documentation on quota reservations.
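The arithmetic behind these settings is small enough to sketch in Python. The variable names are illustrative, the consumer names are taken from the example split, and applying each reserved percentage to the full capacity (rather than to the threshold) is an assumption made for this sketch.

```python
request_throughput = 8.84  # requests/sec, from estimatedCapacity in the sample

# Capacity in requests per minute, rounded as in the example (~530 rpm).
capacity_rpm = round(request_throughput * 60)

# Utilization threshold: enforce at 80% of capacity, before the knee.
threshold_pct = 80
threshold_rpm = capacity_rpm * threshold_pct / 100

# Refill rate in requests per second.
refill_rate = capacity_rpm / 60

# Reserved share per consumer (percent). Must not exceed 100 in total.
reserved_pct = {"Agent A": 30, "Agent B": 20, "Agent C": 30, "unreserved": 20}
assert sum(reserved_pct.values()) == 100

# Assumption: each percentage is applied to the full capacity.
reserved_rpm = {name: capacity_rpm * pct / 100
                for name, pct in reserved_pct.items()}

print(capacity_rpm)              # 530
print(f"{threshold_rpm:.0f}")    # 424
print(f"{refill_rate:.2f}")      # 8.83
print(reserved_rpm["Agent A"])   # 159.0
```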

Worked cost example

Take the sample result: about 4,524 total tokens per second sustained (input plus output). That is approximately 16.3 million tokens per hour from a single deployment.

If the underlying GPU instance costs $X per hour, the cost per million tokens is $X / 16.3. For example, at $4 per hour, that is about $0.25 per million tokens. At $12 per hour, it is about $0.74. To calculate the cost per million output tokens, the standard for public API comparisons, divide the cost per million total tokens by the output share of the workload. For example, with an ISL of 200 and an OSL of 1000, output represents about 83% of the total tokens. At $4 per hour, this translates to about $0.30 per million output tokens.
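The cost arithmetic can be written as two small Python functions. The function names are illustrative; the inputs are the sample throughput and the example hourly prices from the text.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Cost in dollars per million total (input plus output) tokens."""
    million_tokens_per_hour = tokens_per_second * 3600 / 1_000_000
    return gpu_cost_per_hour / million_tokens_per_hour


def cost_per_million_output_tokens(gpu_cost_per_hour: float,
                                   tokens_per_second: float,
                                   isl: int, osl: int) -> float:
    """Cost per million output tokens, given mean input/output lengths."""
    output_share = osl / (isl + osl)
    return cost_per_million_tokens(gpu_cost_per_hour,
                                   tokens_per_second) / output_share


TOKENS_PER_SECOND = 4524.80  # total_token_throughput from the sample result

print(f"{TOKENS_PER_SECOND * 3600 / 1e6:.2f}")                       # 16.29
print(f"{cost_per_million_tokens(4, TOKENS_PER_SECOND):.3f}")        # 0.246
print(f"{cost_per_million_tokens(12, TOKENS_PER_SECOND):.3f}")       # 0.737
print(f"{cost_per_million_output_tokens(4, TOKENS_PER_SECOND, 200, 1000):.3f}")  # 0.295
```

Carried to three decimals, the figures are $0.246, $0.737, and $0.295, which the text rounds to about $0.25, $0.74, and $0.30.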

Each benchmark run gives you a fresh, accurate per-token cost figure for the exact model, hardware, and quantization combination you are running. After upgrading vLLM or switching hardware, rerun the same profile and verify that your unit economics actually improved rather than trusting the vendor's claim. This is the basis for per-token cost transparency and per-agent chargeback.

Choose your inputs

A useful profile starts with two questions: What range of concurrency do you expect in production, and what does your traffic actually look like?

  • Concurrencies to sweep. Start wide ([1, 10, 50, 100]) to locate the saturation knee, then narrow the range (e.g. [40, 50, 60, 70]) to get an SLO-grade read around that point.
  • Request count scalar. Set it high enough that each iteration runs long enough to average out noise. A scalar of 2 is a reasonable starting point. Increase it if the variance looks high.
  • Use cases. Match your real traffic mix. If you serve 70% short RAG (ISL 200, OSL 300) and 30% long-context RAG (ISL 4000, OSL 800), define two use cases with prob: 70 and prob: 30. Mixed-traffic testing reveals tail-latency behavior (such as p99 spikes) that a single averaged use case hides.
  • Tokenizer. Set it explicitly. The benchmark relies on exact token counts, so the matching tokenizer is part of a correct measurement.
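As a sketch of the 70/30 mix described in the list above, the Python below builds the two use cases, checks that the weights sum to 100, and computes the blended token lengths. The field names follow the USE_CASES example earlier in the article; the standard deviation values are placeholders, since the article does not give them for this mix, and the helper names are hypothetical.

```python
import json

# Standard deviations here are placeholders, not recommended values.
use_cases = [
    {"isl": 200, "islStddev": 15, "osl": 300, "oslStddev": 15, "prob": 70},
    {"isl": 4000, "islStddev": 15, "osl": 800, "oslStddev": 15, "prob": 30},
]


def validate_mix(cases):
    """Raise ValueError unless the prob weights sum to exactly 100."""
    total = sum(case["prob"] for case in cases)
    if total != 100:
        raise ValueError(f"prob values sum to {total}, expected 100")


def blended_length(cases, key):
    """Probability-weighted mean of an integer field such as isl or osl."""
    return sum(case["prob"] * case[key] for case in cases) / 100


validate_mix(use_cases)
mean_isl = blended_length(use_cases, "isl")
mean_osl = blended_length(use_cases, "osl")

print(f"{mean_isl:.0f} {mean_osl:.0f}")  # 1340 450
# Compact JSON suitable for an environment variable such as USE_CASES.
print(json.dumps(use_cases, separators=(",", ":")))
```

The blended lengths also give the output share of the mix (450 of 1790 tokens, about 25%), which is the figure needed for a per-output-token cost on mixed traffic.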

Operational notes

  • Profiling generates synthetic load. Run jobs against a non-production LLM deployment or during a maintenance window.
  • Because the traffic is synthetic, the effects of prefix caching will not show up in the token metrics.
  • Profiling treats the deployment as a black box. Whether the deployment runs on one GPU or many, and whatever combination of tensor, pipeline, data, or expert parallelism it uses, the profile measures the externally observable result.
  • Jobs can be canceled with a DELETE on the profiling job ID. Cancellation is best effort and may not stop a run that is nearly complete.
  • Before submitting, store your Hugging Face token in DataRobot credential management as an "API Token (API Key)" credential. AIPerf uses it to fetch the model tokenizer, and the stored credential prevents rate-limit errors.

Get access

LLM Profiling Jobs are in private preview in DataRobot 11.8. To enable them for your tenant, contact your DataRobot account team. They will turn on the Enable dynamic quota capacity profiles feature flag (the internal name for LLM Profiling Jobs) and configure the profiling job image in your cluster.
