Speeding Up Stable Diffusion Turbo with ONNX Runtime and Olive

Jun 12, 2025 By Alison Perry

AI image generation has come a long way, and Stable Diffusion is leading much of that progress. Models like SD Turbo and SDXL Turbo are designed to produce high-quality outputs quickly. But as good as these models are, inference speed often becomes a bottleneck—especially if you're working with limited hardware or trying to scale your applications. That’s where ONNX Runtime and Olive come in. These tools don’t just improve performance—they optimize your pipeline in a way that makes deployment feel far more efficient.

Let's break it down so it's easier to grasp what's happening under the hood—and, more importantly, how you can put these tools to work with your own setup.

Why SD Turbo and SDXL Turbo Need a Boost

Stable Diffusion Turbo models are already fast by design, but they can still hit performance walls on devices with tight memory or compute constraints. Whether you're building a mobile app or trying to generate multiple outputs in real-time, shaving milliseconds off matters. These delays usually come from layers that haven’t been optimized or from suboptimal use of system resources.

The models themselves are large, and their operations can be complex. Memory access, repeated computations, and redundant layers all contribute to longer processing times. So, even if the model runs, it may not be running at its best. That's where inference optimization comes into play.

Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive

Step 1: Exporting the Model to ONNX

Before you can do anything with ONNX Runtime or Olive, you need to get your model into the right format. ONNX is designed to be a universal format for machine learning models, making them portable and hardware-friendly.

For SD Turbo or SDXL Turbo, this means exporting your PyTorch model into an ONNX graph. It sounds more complicated than it is. With the right script, this step usually takes minutes. The trick is to make sure the exported model preserves all dynamic shapes and handles inputs correctly. If you're dealing with batch processing or resolution changes, the exported model must be flexible enough to support that.

import torch

# model is the PyTorch module you want to export (for example, the SD Turbo or SDXL Turbo UNet),
# and dummy_input is a sample tensor shaped like the model's real input, used to trace the graph.
torch.onnx.export(
    model,
    dummy_input,
    "sdxl_turbo.onnx",
    export_params=True,        # store the trained weights inside the ONNX file
    opset_version=17,          # a recent opset that covers the operators these models need
    do_constant_folding=True,  # pre-compute constant expressions to simplify the graph
    input_names=['input'],
    output_names=['output'],
    # keep the batch dimension flexible; add height/width entries here too if resolution needs to vary
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)

Once you’ve done this, you have a version of your model that can run outside PyTorch, without needing the whole framework just to get an output.
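Before optimizing anything, it's worth a quick sanity check that the export went through cleanly. Here's a minimal sketch, assuming the onnx package is installed and the file name used above:

import onnx

# Run ONNX's structural checks against the exported file
onnx.checker.check_model("sdxl_turbo.onnx")

# Confirm the input names and dynamic axes survived the export
onnx_model = onnx.load("sdxl_turbo.onnx")
for tensor in onnx_model.graph.input:
    dims = [d.dim_param or d.dim_value for d in tensor.type.tensor_type.shape.dim]
    print(tensor.name, dims)

If the checker passes and 'batch_size' shows up as a symbolic dimension, the model is ready for the next step.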

Step 2: Optimizing with Olive

Olive is Microsoft’s optimization tool designed to improve model performance across multiple backends. It works on ONNX models and provides techniques like quantization, operator fusion, and graph pruning—all without requiring you to rewrite code.

What makes Olive useful here is that it runs a set of optimization passes tailored to your target device. You can set it up for CPU, GPU, or even specialized NPUs. Olive checks your hardware and then runs the best combination of optimizations automatically.

For example, if you’re targeting an NVIDIA GPU, Olive might apply TensorRT-specific changes. For a CPU target, it might reduce precision (from FP32 to INT8 or BF16) to speed things up without a major loss in output quality.

A basic Olive configuration might look like this:

from olive.workflows import run as olive_run

# Simplified illustration of an Olive configuration; real workflow configs are more detailed
config = {
    "model_path": "sdxl_turbo.onnx",
    "device": "cuda",
    "optimize": ["quantization", "fusion"],
    "output_path": "sdxl_turbo_optimized.onnx"
}

olive_run(config)

The result is a smaller, faster ONNX model. What you end up with isn't just quicker inference; the graph has been reshaped to better match the resources it's running on.
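To get a feel for what that precision reduction looks like, here's a rough sketch using ONNX Runtime's own quantization utilities rather than an Olive pass; it converts the exported weights from FP32 to INT8 and compares file sizes, purely as an illustration of the same idea:

from onnxruntime.quantization import quantize_dynamic, QuantType
import os

# Quantize weights to INT8; activations stay in floating point at runtime
quantize_dynamic(
    model_input="sdxl_turbo.onnx",
    model_output="sdxl_turbo_int8.onnx",
    weight_type=QuantType.QInt8,
)

# Compare file sizes to see the effect of the lower-precision weights
for path in ("sdxl_turbo.onnx", "sdxl_turbo_int8.onnx"):
    print(path, round(os.path.getsize(path) / 1e6, 1), "MB")

Whether INT8 holds up visually for your use case is something to verify with real prompts, which is exactly the kind of trade-off Olive weighs when choosing passes for a target device.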

Step 3: Running Inference with ONNX Runtime

Now that you have an optimized ONNX model, you’ll want to run inference using ONNX Runtime. This runtime environment is built specifically to make ONNX models run as fast as possible, whether you're on a laptop, a server, or even a mobile chip.

ONNX Runtime supports execution providers that match your hardware. You can choose from CPU (default), CUDA for NVIDIA GPUs, DirectML for Windows machines, or even ROCm for AMD GPUs. All of this happens with just a few lines of setup code.

import onnxruntime as ort
import numpy as np

# Create a session on the execution provider that matches your hardware (CUDA here)
session = ort.InferenceSession("sdxl_turbo_optimized.onnx", providers=['CUDAExecutionProvider'])

# input_tensor is whatever the model expects (for example, a batch of latents) as a NumPy array
inputs = {session.get_inputs()[0].name: input_tensor.astype(np.float32)}
outputs = session.run(None, inputs)

What’s different now is how fast everything runs. You’re using an optimized model, a runtime that’s tuned to your hardware, and a format designed for maximum portability.

This step often results in noticeable improvements: lower memory usage, reduced latency, and better throughput—all without changing how you call the model. It’s still just an input and an output, but what happens between those two is dramatically more efficient.
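If you're not sure which execution providers your ONNX Runtime build actually includes, you can check at runtime and fall back gracefully. A small sketch:

import onnxruntime as ort

# Prefer CUDA when this build supports it, otherwise fall back to the CPU provider
available = ort.get_available_providers()
if 'CUDAExecutionProvider' in available:
    providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
else:
    providers = ['CPUExecutionProvider']

session = ort.InferenceSession("sdxl_turbo_optimized.onnx", providers=providers)
print("Running on:", session.get_providers())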

Step 4: Benchmarking and Tuning

Optimization doesn’t end with deployment. You’ll want to test your setup across different input sizes and see how inference behaves under load. ONNX Runtime gives you tools to benchmark your model and monitor performance across sessions.
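A simple latency check is often enough to see where you stand. The sketch below warms the session up and then averages a handful of timed runs; session and inputs are the objects from the previous step:

import time

# Warm-up runs let kernels compile and caches fill before timing starts
for _ in range(3):
    session.run(None, inputs)

# Average the latency over a batch of runs
runs = 20
start = time.perf_counter()
for _ in range(runs):
    session.run(None, inputs)
elapsed = time.perf_counter() - start
print(f"Average latency: {elapsed / runs * 1000:.1f} ms")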

If you're running SDXL Turbo in a high-load environment (like serving image generations to multiple users), consider tuning session parameters. Things like session threading, memory arena settings, and prefetching can help you squeeze even more performance out of the system.

Example settings for threaded inference:

session_options = ort.SessionOptions()
session_options.intra_op_num_threads = 4  # threads used inside a single operator's kernel
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL  # let independent parts of the graph run in parallel

session = ort.InferenceSession("sdxl_turbo_optimized.onnx", sess_options=session_options, providers=['CUDAExecutionProvider'])

These tweaks let you adapt the model to your environment without retraining or modifying the core architecture. Over time, this can make a big difference—especially when inference costs start stacking up.
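ONNX Runtime can also record a per-operator profile, which is useful when you want to know where the remaining time actually goes. A sketch that reuses the session options from above:

# Enable profiling before the session is created
session_options.enable_profiling = True
session = ort.InferenceSession("sdxl_turbo_optimized.onnx", sess_options=session_options, providers=['CUDAExecutionProvider'])

session.run(None, inputs)

# Writes a JSON trace you can open in chrome://tracing or Perfetto
profile_path = session.end_profiling()
print("Profile written to:", profile_path)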

Conclusion

If you’re running SD Turbo or SDXL Turbo models and want to get the most out of your hardware, moving to ONNX Runtime and using Olive is one of the easiest ways to do that. You’re not changing how the models work—you’re simply letting them run the way they were meant to. From exporting to ONNX, optimizing with Olive, and deploying with ONNX Runtime, every step is focused on making inference faster and more reliable.

Once you go through this process, your models become easier to scale, simpler to deploy, and faster where it matters.
