Why does my container instance fail to run?
Troubleshooting a container deployment can be complex if you're new to running applications on a distributed network. Here are some common issues and fixes that should help you get up and running in no time.
Set Yourself Up for Success
- Run and test on the Salad Network with multiple replicas. A single replica exposes you to the network variability of that one node, which can make diagnosing issues difficult.
- Understand the needs of your application and make sure to choose the correct hardware and sufficient memory to run your container instance.
Container Stuck Pending
Pending is a container group state. If the container group is stuck in Pending, the most likely cause is a long queue of containers waiting to deploy on the Salad network. The best remedy for this state is to wait.
Container Stuck Deploying while Instances are Allocating
If your container group is deploying, but all instances have been allocating for five minutes or more, you may have over-constrained your deployment. When possible, select more GPUs, lower the RAM requirement, or allow more locations.
Instance Stuck at 99% Downloading
Our network runs on variable residential internet connections. Download and create times are estimates. Most instances should start more quickly than the estimate, but a few may take longer and get “stuck” at 99%. Having more replicas lowers the chances of getting “stuck” on a lower-speed connection.
- If a container starts but fails or exits immediately, it should be recreated or reallocated. Check for container failures and make sure the container will start and run locally without the -it flag.
- Reducing the container image size is a good way to get faster startup speeds overall. How best to do this will depend on the contents of the image itself.
Repeated StartFailures
If a container fails to start, we will retry it. If there is an issue with the container configuration, it will repeatedly fail to start. Check the recent errors tab in the Portal for StartFailures or other failure exit codes.
- Common start failures include an incorrect command, adding a GPU to a container that is not set up for a GPU, or including redundant NVIDIA libraries.
- Common early exit failures are incorrect authentication to an external source, missing files, or needing OpenCL. Unfortunately, we depend on WSL, which does not support OpenCL at this time.
- When possible, check that a container will run locally (e.g. with Docker) before trying it on Salad.
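The local check suggested above can be scripted. The following is a minimal Python sketch, not an official tool: it assumes Docker is installed locally, and the function names, the container name prefix, and the 30-second grace period are illustrative choices. It runs the image without the -it flag and treats "still running after the grace period" or "exited with code 0" as a successful start.

```python
import subprocess
import uuid


def build_docker_command(image, name, env=None):
    """Build a non-interactive `docker run` command (no -i or -t flags)."""
    cmd = ["docker", "run", "--rm", "--name", name]
    for key, value in (env or {}).items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(image)
    return cmd


def check_starts_locally(image, env=None, grace_seconds=30):
    """Return (ok, detail) for a local start attempt.

    ok is True if the container is still running after grace_seconds,
    or if it exited on its own with code 0.
    """
    # Hypothetical naming scheme so the container can be cleaned up later.
    name = f"local-check-{uuid.uuid4().hex[:8]}"
    cmd = build_docker_command(image, name, env)
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=grace_seconds
        )
    except subprocess.TimeoutExpired:
        # The container outlived the grace period: remove it and report success.
        subprocess.run(["docker", "rm", "-f", name], capture_output=True)
        return True, f"still running after {grace_seconds}s"
    if result.returncode == 0:
        return True, "exited with code 0"
    # Keep only the tail of stderr so the message stays readable.
    return False, f"exit code {result.returncode}: {result.stderr.strip()[-500:]}"
```

For example, `check_starts_locally("example/image:latest", env={"API_KEY": "..."})` (a placeholder image and variable) returns a pair such as `(False, "exit code 127: ...")` when the command in the image is wrong, which is easier to debug locally than after deployment.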
Container Exit Failures
If the container exits, the exit code will be shown in the Portal. In most cases, this exit code depends on the container itself and we cannot provide definitive guidance.
- Refer to the Error Reference Guide for known exit failure codes.
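Although most exit codes are application-specific, a few follow widely used conventions: shells and Docker report 126 and 127 when a command cannot be invoked or found, and codes above 128 usually mean the process was terminated by a signal (the code minus 128). The Python sketch below decodes those general conventions only. It is not Salad-specific, the function name is illustrative, and it does not replace the Error Reference Guide.

```python
import signal


def describe_exit_code(code):
    """Return a generic, convention-based hint for a container exit code."""
    if code == 0:
        return "exited normally"
    if code == 126:
        return "command found but could not be invoked"
    if code == 127:
        return "command not found"
    if 129 <= code <= 192:
        # Convention: 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        hint = f"terminated by {name}"
        if name == "SIGKILL":
            # Assumption: SIGKILL is frequently, but not always, an out-of-memory kill.
            hint += " (often out of memory; check the memory allocation)"
        return hint
    return "application-defined error"
```

For instance, exit code 137 decodes to SIGKILL, which is worth checking against the memory selected for the container group before looking for bugs in the application.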
Slow Network Connections
Salad is built on residential computers and does not have the consistency or network speeds of a data center. Reliable measurements of download and upload speed are hard to obtain, because the speed a customer will experience depends on the size of the data being transferred, its destination, and where the node is located. Speed test results also differ between services depending on where in the world they are.
- Users concerned about reliably fast network connections should run as many replicas of their container group as possible to maximize the overall performance of the container group.
Node Performance
While we continuously evaluate the performance of our nodes, some nodes show low performance for some workloads. Possible reasons include hardware differences, driver versions, and other activity on the node.
- If performance is critical, we recommend implementing basic performance checks at startup, and potentially as a liveness probe as well.
- Performance can deteriorate over time due to memory issues within some containers. Recreating the instance can help. If that fails, we recommend reallocating the workload to another instance.
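A basic startup performance check can be as simple as timing a small, representative piece of work and refusing to continue on a slow node. The Python sketch below is one possible shape under stated assumptions: `sample_workload` is a placeholder for your own work (for example, one inference on a fixed input), the threshold is something you would calibrate on known-good nodes, and exiting with a non-zero code is just one way to surface the failure.

```python
import sys
import time


def measure_rate(workload, min_seconds=0.5):
    """Call workload() repeatedly for at least min_seconds; return calls per second."""
    calls = 0
    start = time.perf_counter()
    while True:
        workload()
        calls += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return calls / elapsed


def passes_performance_check(workload, min_rate, min_seconds=0.5):
    """True if the measured rate meets the (user-chosen) minimum rate."""
    return measure_rate(workload, min_seconds) >= min_rate


def exit_if_slow(workload, min_rate):
    """Exit with a non-zero code when the node is slower than min_rate."""
    rate = measure_rate(workload)
    if rate < min_rate:
        print(f"performance check failed: {rate:.1f} calls/s < {min_rate}",
              file=sys.stderr)
        sys.exit(1)
    return rate


def sample_workload():
    # Placeholder CPU-bound work; replace with a slice of the real workload.
    sum(i * i for i in range(10_000))
```

Calling `exit_if_slow(sample_workload, min_rate=...)` at the top of the entrypoint runs the check once at startup. The same `passes_performance_check` function could back a probe command if you also want the check to run periodically.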