Serverless GPUs

Scale your AI on our GPUs

Scale to zero & pay as you scale.
Billed by the second.
Use any language of your choice:
Python, Node.js, Rust, Go, and more.

How we scale.

Bring any REST API
Package your API in a container. We automatically pre-cache your container image across our fleet.
8Scale integrations
Create Serverless App
When you create an app, we make GPU scaling seamless.
We can scale your app from 0 to 100 GPUs within seconds, without you having to manage any infrastructure.
Send traffic to {id}.8scale.app
As you send traffic to your app, we automatically scale replicas across our fleet of GPUs.
Pre-Caching
We take your container image and distribute it across our fleet for instant scaling.
200 ms Cold Starts
We prioritize starting replicas faster by preserving GPU state.
Observability
Get access to metrics, logs, and more, with full visibility into the infrastructure we provision.
Nvidia 5090 (32 GiB VRAM)
4x CPU • 32 GiB RAM: $0.45 / hr ($0.00012 / sec)
6x CPU • 48 GiB RAM: $0.55 / hr ($0.00015 / sec)
8x CPU • 64 GiB RAM: $0.65 / hr ($0.00018 / sec)
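Because billing is per second, the cost of a short burst is easy to estimate. The sketch below is a hypothetical calculator, not an 8Scale API: it assumes the per-second rate is the hourly price divided by 3600 (the listed per-second prices look like that value, rounded) and that partial seconds round up to the next whole second.

```python
import math

# Hourly prices for the Nvidia 5090 tiers listed above, keyed by CPU count.
HOURLY_PRICE_USD = {4: 0.45, 6: 0.55, 8: 0.65}

def gpu_cost(cpus: int, seconds: float) -> float:
    """Estimated cost of one replica running for `seconds`.

    Assumptions (not confirmed by the pricing page): the per-second rate
    is the hourly price / 3600, and partial seconds are rounded up.
    """
    billed_seconds = math.ceil(seconds)
    return HOURLY_PRICE_USD[cpus] / 3600 * billed_seconds
```

For example, a 90-second burst on the 6-CPU tier comes to 90 × $0.55 / 3600 ≈ $0.0138.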
Storage
$0.08 / GiB / mo
100 GiB = $8 / mo
* Storage is charged even when you scale to zero, because your images stay pre-cached across our fleet.
* For example, an app with 5 max replicas, scale to zero, and 20 GiB per replica is charged for 5 × 20 = 100 GiB, or $8 / mo.
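The storage example above follows a simple rule: monthly cost = max replicas × GiB per replica × $0.08. A minimal sketch of that arithmetic (the function name is hypothetical):

```python
STORAGE_USD_PER_GIB_MONTH = 0.08

def monthly_storage_cost(max_replicas: int, gib_per_replica: float) -> float:
    """Monthly storage charge for one app.

    Billed on max replicas rather than running replicas, since images are
    pre-cached for every replica you might scale to, even at zero.
    """
    total_gib = max_replicas * gib_per_replica
    return total_gib * STORAGE_USD_PER_GIB_MONTH
```

With `max_replicas=5` and `gib_per_replica=20`, this reproduces the $8 / mo from the example.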
Bandwidth
FREE

What controls scaling?

Built for Scale

Scaling with Precision

We understand that sending even one extra request to a single replica can make a generative AI model respond more slowly or, in the worst case, cause the GPU to run out of memory (OOM). That's why we use Redis to track every request in real time while scaling up.
Min / Max Replicas
Configure each app's minimum and maximum number of replicas. Set min replicas to 0 to scale to zero. Scaling up does incur cold starts, and we do our best to freeze GPU state to bring cold starts down to 200 ms.
Requests Per Replica
Control precisely how many requests a single replica can handle. We use Redis to track every request in real time, and whenever your total request volume exceeds what your running replicas are configured to handle, we scale up more replicas within seconds to meet demand.
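A minimal sketch of how a requests-per-replica limit and the min/max bounds could combine into a target replica count. The function name and the ceiling-division rule are assumptions for illustration, not 8Scale's documented algorithm; `in_flight` stands in for the per-request count the page says is tracked in Redis.

```python
def desired_replicas(in_flight: int, requests_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Fewest replicas that keep every replica at or under its request limit,
    clamped to the app's [min_replicas, max_replicas] bounds."""
    if requests_per_replica <= 0:
        raise ValueError("requests_per_replica must be positive")
    # Integer ceiling division: 9 requests at 4 per replica -> 3 replicas.
    needed = -(-in_flight // requests_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

With `requests_per_replica=4` and 9 requests in flight, this asks for 3 replicas; a 10th request still fits, while a 13th would trigger a 4th replica. Setting `min_replicas=0` lets the count fall to zero when no requests are in flight.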
Scale Down Delay
Set a scale-down delay, in seconds, to keep replicas running a little longer so they can absorb more requests instead of being scaled down prematurely.
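One plausible reading of the scale-down delay is that replicas are removed only after demand has stayed below current capacity for the full delay, and any recovery in demand resets the clock. The class below sketches that rule; the name and exact behavior are assumptions, since the page does not spell out the algorithm.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScaleDownTimer:
    delay_seconds: float
    below_since: Optional[float] = None  # when demand first dropped below capacity

    def should_scale_down(self, current: int, desired: int, now: float) -> bool:
        """Return True once `desired` has stayed below `current` for the delay."""
        if desired >= current:
            self.below_since = None  # demand recovered: reset the timer
            return False
        if self.below_since is None:
            self.below_since = now  # start timing the low-demand period
        return now - self.below_since >= self.delay_seconds
```

With a 30-second delay, a brief lull of 10 seconds leaves replicas in place; only a lull lasting 30 seconds or more lets them scale down.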
Built for Teams

Teams
Invite members and create teams for different projects or environments.
Billing
Billing updates within minutes, with aggregated views.
Access Control
Assign roles to members and manage their access levels.
API Keys
Create API keys to programmatically access 8Scale APIs.

Built for Security

Encryption
We encrypt all traffic between nodes to make sure data is secure.
Secrets
All private registry credentials are encrypted to keep them secure.
P2P Nodes
8Scale operates on peer-to-peer nodes provided by our community. We vet all of our nodes to keep data secure and provide reliable performance.

All traffic between nodes and 8Scale APIs is encrypted. We keep secrets in memory rather than on disk for extra security.