Why do requests sometimes take longer on Salad compared to other cloud providers?
The inference process on Salad nodes may include downloading/uploading, pre-processing/post-processing, and GPU inference, so requests can take longer than on other cloud providers. To optimize your container deployments, consider the following steps.
- If the process takes much longer than expected, add timing code around each stage of the inference process to identify which part consumes the most time (see the timing sketch after this list).
- If I/O takes a long time, such as downloading videos or uploading generated images, use tools that support parallel downloading and uploading, which can significantly reduce transfer time (a parallel-download sketch follows this list).
- If pre-processing or post-processing takes a long time, such as data format conversion or merging results, scale the container group up with more vCPUs and memory.
- If the GPU inference time is significantly longer than expected, check your code (see the GPU and VRAM check sketch after this list):
- Ensure your code is explicitly using the GPU.
- Verify the correct GPU type is selected to run the model inference.
- Check if VRAM might run out due to certain requests during inference. For LLMs, VRAM usage could increase quickly with larger batch sizes and longer context lengths.
- Avoid multiprocessing or multithreading-based concurrent inference over a single GPU, as it might limit optimal GPU cache utilization and impact performance.
- If inference takes a long time (tens of minutes or longer), consider re-architecting your application from a push mode using the container gateway to a pull mode using a job queue. With a job queue, a container only pulls and processes a new job when the current one is done, which significantly simplifies your application (a minimal pull-worker sketch follows this list).
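As an illustration of per-stage timing, here is a minimal Python sketch. The stage functions are dummy placeholders standing in for your real download, pre-processing, inference, and upload code.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock time for a named stage of the request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Dummy stages; replace these with your real functions.
def download_input():   time.sleep(0.2)
def preprocess():       time.sleep(0.1)
def run_inference():    time.sleep(0.5)
def upload_output():    time.sleep(0.2)

with timed("download"):   download_input()
with timed("preprocess"): preprocess()
with timed("inference"):  run_inference()
with timed("upload"):     upload_output()

# Log per-stage durations so you can see where the time goes.
print({stage: f"{seconds:.2f}s" for stage, seconds in timings.items()})
```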
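For parallel I/O, one common option (an assumption here, not a Salad-specific tool) is a thread pool around plain HTTP downloads. The URLs below are placeholders for your own inputs.

```python
import concurrent.futures
import os
import requests

def download(url: str, dest_dir: str = "downloads") -> str:
    """Download a single file and return its local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(url))
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return local_path

# Hypothetical example URLs; replace with your own files.
urls = [
    "https://example.com/video1.mp4",
    "https://example.com/video2.mp4",
    "https://example.com/video3.mp4",
]

# Download files concurrently instead of one at a time.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    paths = list(pool.map(download, urls))
print(paths)
```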
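If you use PyTorch (an assumption; adapt this to your framework), a quick check that the model and inputs are explicitly on the GPU, plus a VRAM snapshot, might look like the following. The linear layer is a toy stand-in for your real model.

```python
import torch

# Confirm a GPU is visible inside the container and which one it is.
assert torch.cuda.is_available(), "No CUDA device visible; inference would fall back to CPU"
device = torch.device("cuda")
print("GPU:", torch.cuda.get_device_name(device))

# Toy model; move the model and the inputs to the GPU explicitly.
model = torch.nn.Linear(4096, 4096).to(device)
inputs = torch.randn(8, 4096, device=device)

with torch.inference_mode():
    outputs = model(inputs)

# Snapshot VRAM usage to spot requests that push memory toward the limit
# (e.g. large batch sizes or long contexts for LLMs).
allocated_gb = torch.cuda.memory_allocated(device) / 1024**3
reserved_gb = torch.cuda.memory_reserved(device) / 1024**3
total_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
print(f"VRAM allocated: {allocated_gb:.2f} GiB, reserved: {reserved_gb:.2f} GiB, total: {total_gb:.2f} GiB")
```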
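A pull-mode worker is a loop that fetches one job, processes it to completion, reports the result, and only then asks for the next one. The sketch below assumes a hypothetical HTTP job-queue API (the URLs and the 204 "queue empty" convention are placeholders); adapt it to the queue service you use.

```python
import time
import requests

# Hypothetical job-queue endpoints; substitute your queue service's real API.
QUEUE_URL = "https://queue.example.com/jobs/next"
RESULT_URL = "https://queue.example.com/jobs/{job_id}/result"

def process(job: dict) -> dict:
    """Placeholder for your long-running inference; replace with real work."""
    time.sleep(1)
    return {"status": "done", "input": job.get("payload")}

while True:
    # Pull exactly one job; poll again later if the queue is empty.
    resp = requests.get(QUEUE_URL, timeout=30)
    if resp.status_code == 204:  # assumed "queue empty" convention
        time.sleep(5)
        continue
    resp.raise_for_status()
    job = resp.json()

    # Process the job to completion before asking for the next one,
    # so a long inference never has to fit inside a gateway request timeout.
    result = process(job)
    requests.post(RESULT_URL.format(job_id=job["id"]), json=result, timeout=30)
```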