Serverless GPUs

Scale your AI on our GPUs

Scale to zero & pay as you scale.

Billed by the second.

Get Started

Use any language of your choice.

Python, NodeJS, Rust, Golang, and more ...

How we scale.

Bring any REST API

Package your API in a container. We automatically pre-cache your container image across our fleet.
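As a rough illustration of the kind of REST API that could be packaged in a container, here is a minimal sketch using only the Python standard library. The `/health` and `/predict` paths, port 8080, and the `handle_predict` placeholder are assumptions made for this example, not 8Scale requirements.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_predict(payload: dict) -> dict:
    """Placeholder for real model inference (hypothetical): reports prompt length."""
    prompt = str(payload.get("prompt", ""))
    return {"prompt_chars": len(prompt)}


class Handler(BaseHTTPRequestHandler):
    def _send(self, status: int, body: dict) -> None:
        # Serialize the body as JSON and write a complete response.
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self) -> None:
        # Simple health check endpoint (assumed path).
        if self.path == "/health":
            self._send(200, {"status": "ok"})
        else:
            self._send(404, {"error": "not found"})

    def do_POST(self) -> None:
        if self.path != "/predict":
            self._send(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self._send(400, {"error": "invalid JSON"})
            return
        if not isinstance(payload, dict):
            self._send(400, {"error": "expected a JSON object"})
            return
        self._send(200, handle_predict(payload))


def serve(port: int = 8080) -> None:
    """Start the server; call this from the container's entrypoint."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Any framework or language that speaks HTTP would work the same way; the standard library is used here only to keep the sketch self-contained.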

Create Serverless App

When you create an app, we give you an endpoint at {id}.8scale.app to make GPU scaling seamless.

We can scale your app from 0 to 100 GPUs within seconds, without you having to manage any infrastructure.

Send traffic to {id}.8scale.app

As you send traffic to your app, we automatically scale replicas across our fleet of GPUs.
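Sending traffic to an app could look like the sketch below. It assumes the app is reachable over HTTPS at `{id}.8scale.app` and exposes a hypothetical `/predict` route that accepts JSON; the app ID, path, and payload shape are illustrative only.

```python
import json
import urllib.request


def build_request(app_id: str, payload: dict, path: str = "/predict") -> urllib.request.Request:
    """Build a JSON POST request for an app's endpoint (path is an assumption)."""
    url = f"https://{app_id}.8scale.app{path}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(app_id: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(app_id, payload), timeout=timeout) as resp:
        return json.loads(resp.read())
```

A call such as `send("your-app-id", {"prompt": "hello"})` would then return the decoded response from whichever replica handled the request.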

Pre-Caching

We take your container image and distribute it across our fleet for instant scale.

200ms Cold-Starts

We prioritize starting replicas faster by preserving GPU state.

Observability

Get access to metrics, logs, and more, with full visibility into the infrastructure we provision.

GPU Pricing

Nvidia 5090 • 32 GiB vRAM

$0.45 / hr • $0.00012 / sec • 4x CPU • 32 GiB RAM

$0.55 / hr • $0.00015 / sec • 6x CPU • 48 GiB RAM

$0.65 / hr • $0.00018 / sec • 8x CPU • 64 GiB RAM
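To see what per-second billing means in practice, the sketch below estimates GPU cost from an hourly rate and the seconds actually used. It assumes the per-second price is simply the hourly price divided by 3600 (the listed per-second figures look like rounded versions of that); the function name is hypothetical.

```python
def gpu_cost(hourly_rate_usd: float, seconds_used: int) -> float:
    """Estimate GPU cost for a given number of billed seconds.

    Assumes per-second price = hourly price / 3600, which is an
    approximation of the rounded per-second rates listed above.
    """
    if seconds_used < 0:
        raise ValueError("seconds_used must be non-negative")
    return hourly_rate_usd / 3600 * seconds_used


# Example: 90 seconds on the $0.45 / hr tier.
estimate = gpu_cost(0.45, 90)
```

Under this assumption, a replica that runs for 90 seconds on the smallest tier costs a little over one cent.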

Storage

$0.08 / GiB / mo • 100 GiB = $8 / mo

* Storage is charged even when you scale to zero, because your images stay pre-cached across our fleet.
* e.g. an app with 5 max replicas, scale to zero, and 20 GiB per replica is charged for 5 * 20 = 100 GiB = $8 / mo
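The storage example works out as follows. This is a small sketch of the arithmetic only; the function name and signature are hypothetical, and the $0.08 / GiB / mo price is taken from the figure quoted in this section.

```python
STORAGE_PRICE_PER_GIB_MONTH = 0.08  # USD, from the pricing above


def monthly_storage_cost(max_replicas: int, gib_per_replica: float) -> float:
    """Storage is billed on max replicas, even when scaled to zero."""
    billed_gib = max_replicas * gib_per_replica
    return round(billed_gib * STORAGE_PRICE_PER_GIB_MONTH, 2)


# 5 max replicas * 20 GiB each = 100 GiB billed
cost = monthly_storage_cost(5, 20)
```

Note that the charge depends on the configured maximum, not on how many replicas happen to be running.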

Bandwidth

FREE

What controls scaling?

Built for Scale

Scaling with Precision

We understand that sending even one extra request to a single replica can make gen AI respond slower or, in the worst case, cause the GPU to run out of memory (OOM). We use Redis to track every single request in real time while scaling up.

Min / Max Replicas

Configure the minimum and maximum number of replicas for each app. Scale to zero by setting min replicas to 0. Scale-ups do incur cold starts, and we do our best to freeze GPU state to bring cold starts down to 200ms.

Requests Per Replica

Precisely control how many requests a single replica can handle. We use Redis to track every single request in real time while scaling up. Any time your total request volume exceeds what your config allows, we add replicas within seconds to meet your demand.
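One way to picture how these settings interact is the sketch below: the replica count is the in-flight request count divided by the per-replica limit, rounded up, then clamped to the min / max bounds. This is an illustrative model under those assumptions, not 8Scale's actual scheduler, and all names are hypothetical.

```python
import math


def desired_replicas(
    in_flight_requests: int,
    requests_per_replica: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replicas needed so no replica exceeds its request limit, within bounds."""
    if requests_per_replica <= 0:
        raise ValueError("requests_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    needed = math.ceil(in_flight_requests / requests_per_replica)
    # Clamp to the configured min / max bounds.
    return max(min_replicas, min(needed, max_replicas))
```

With a limit of 10 requests per replica, 25 in-flight requests would call for 3 replicas, and zero requests with min replicas set to 0 would scale the app to zero.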

Scale Down Delay

Define a scale-down delay in seconds to give replicas extra time to handle more requests and to avoid scaling down prematurely.
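A scale-down delay can be thought of as a simple timer: replicas are only removed once load has stayed low for the full delay. The sketch below models that idea with explicit timestamps; it is an assumption about the behavior described here, and the names are hypothetical.

```python
from typing import Optional


def should_scale_down(
    now: float,
    idle_since: Optional[float],
    delay_seconds: float,
) -> bool:
    """Return True once load has been low for at least delay_seconds.

    idle_since is the time (in seconds) when load first dropped below
    current capacity, or None if the app is still busy.
    """
    if idle_since is None:
        return False  # still busy, never scale down
    return now - idle_since >= delay_seconds
```

With a 30-second delay, a replica that went idle at t=80 is kept at t=100 and becomes eligible for removal at t=110, so a brief lull in traffic does not trigger a cold start on the next request.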

Built for Teams

Teams

Invite members and create teams for different projects or environments.

Billing

Billing updates within minutes, with aggregated views.

Access Control

Assign roles to members and manage their access levels.

API Keys

Create API keys to programmatically access 8Scale APIs.

Built for Security

Encryption

We encrypt all traffic between nodes to make sure data is secure.

Secrets

All private registry credentials are encrypted to keep them secure.

P2P Nodes

8Scale operates on peer-to-peer nodes provided by our community. We vet all of our nodes to keep data secure and provide reliable performance.

All traffic between nodes and 8Scale APIs uses encryption. We keep secrets in-memory rather than on disk to provide extra security.
