Get started
Endpoint
The New Endpoint option allows users to create a Serverless service using custom images and configurations.
Configuration Parameters
- Endpoint Name
  - Custom name for your endpoint
  - Supports UTF-8 character set
  - Maximum 128 characters
  - Must be unique across your account
- Worker Configuration
  - Active Workers: Initial and minimum number of workers (default: 1)
  - Max Workers: Maximum number of workers for auto-scaling
  - GPUs / Worker: Number of GPUs per worker (range: 1-8)
- Container Settings
  - Container Image: Docker image to use for the service
  - Container Start Command: Command to execute when starting the container
    - Optional: Uses image's entrypoint if not specified
  - Shell: Specifies the shell environment for the command
    - Default: /bin/sh
    - Adjustable based on image requirements
- Network Configuration
  - Data Center: Target cluster for running the serverless service
    - Consider network environment and GPU availability
  - HTTP Port: Port for external HTTP service (see the sketch after this list)
    - Single port only
    - Requests to Endpoint URL are forwarded to this port
    - Container must listen on this port
- Environment
  - Environment Variables: Configure multiple environment variables for the container
- Advanced Settings
  - Network Volume: Option to mount persistent network storage
  - Network Volume Mount Path: Specify the mount path for persistent storage
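The HTTP Port and Environment Variables settings come together inside the container itself. Below is a minimal sketch of a workload that satisfies them, assuming the endpoint's HTTP Port is set to 8000 and a hypothetical MODEL_NAME environment variable has been configured; reading the port from a PORT variable is likewise an assumption of the sketch, not a platform convention:

```python
# Minimal container workload sketch. Assumptions: the endpoint's HTTP Port is
# 8000, and MODEL_NAME / PORT are hypothetical environment variables you set.
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

PORT = int(os.environ.get("PORT", "8000"))  # must match the endpoint's HTTP Port
MODEL_NAME = os.environ.get("MODEL_NAME", "demo-model")  # hypothetical variable

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Requests to the Endpoint URL are forwarded to this port by the platform.
        body = json.dumps({"model": MODEL_NAME, "status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listen on all interfaces so the platform's port forwarding can reach the container.
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```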
Quick Deploys
Quick Deploys lets you deploy custom Endpoints of popular models with minimal configuration.
How do I get started with Quick Deploys?
How to interact with AtlasCloud Serverless?
After creating a serverless endpoint, the platform generates a domain URL that allows you to access the service:
https://${SERVERLESS_ID}.${REGION}.atlascloud.ai/
The URL components:
- SERVERLESS_ID: Your unique endpoint identifier
- REGION: The deployment region (e.g., us-east, eu-west)
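Once deployed, the endpoint behaves like any HTTPS service. Here is a minimal sketch of calling it with the Python standard library, assuming a hypothetical endpoint ID abc123 deployed in us-east:

```python
# Calling a deployed endpoint. The SERVERLESS_ID "abc123" and the "us-east"
# region are placeholders; substitute your own deployment's values.
import urllib.request

SERVERLESS_ID = "abc123"  # hypothetical endpoint identifier
REGION = "us-east"        # deployment region

url = f"https://{SERVERLESS_ID}.{REGION}.atlascloud.ai/"
with urllib.request.urlopen(url, timeout=30) as resp:
    # The platform forwards this request to the container's HTTP Port.
    print(resp.status, resp.read().decode())
```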
Auto Scaling
Serverless deployments start with one worker by default. The platform automatically manages scaling based on concurrent requests and resource utilization.
Key Auto-scaling Features:
- Active Workers: Minimum number of workers that will always be running, regardless of load
- Max Workers: Maximum number of workers that can be created during high load periods
- GPUs / Worker: Number of GPUs allocated to each worker instance, affecting processing capacity
The auto-scaling system follows these rules:
- Scale Up:
  - Triggers when concurrent requests per worker exceed 100
  - New workers are added within 30-60 seconds
  - Scales in increments based on request load
  - Maximum scale-up rate: 200% of current capacity per 60 seconds
- Scale Down:
  - Begins when concurrent requests drop below the threshold
  - Requires 60 seconds of low utilization before scaling down
  - Scales down one worker at a time
  - Maintains the minimum Active Workers count
  - Maximum scale-down rate: 100% of current capacity per 60 seconds
- Scaling Limits:
  - Minimum: Active Workers count
  - Maximum: Max Workers setting
  - Scale to zero: Only if Active Workers is set to 0
- Cold Start:
  - New workers take 30-60 seconds to become available
  - Consider this delay when planning for traffic spikes
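As a rough mental model, the rules above can be sketched as a single decision function. The 100-requests-per-worker threshold, the 200% scale-up cap, the one-worker scale-down step, and the Active Workers floor all come from the list above; the function itself is an illustration, not the platform's actual scaler:

```python
# Illustrative sketch of the documented auto-scaling rules; not the platform's
# actual implementation. All thresholds are taken from the rules above.
def desired_workers(current: int, concurrent_requests: int,
                    active_workers: int, max_workers: int) -> int:
    per_worker = concurrent_requests / max(current, 1)

    if per_worker > 100:
        # Scale up toward ~100 requests/worker, capped at 200% of current
        # capacity per 60-second window and at the Max Workers setting.
        target = -(-concurrent_requests // 100)  # ceiling division
        return max(min(target, current * 2, max_workers), current)

    # Scale down (the platform first waits 60 s of low utilization):
    # one worker at a time, never below the Active Workers floor.
    # With active_workers = 0, this permits scale-to-zero.
    return max(current - 1, active_workers)

# Example: 12 workers handling 2,500 concurrent requests with Max Workers = 40
# scales toward 24 workers (limited by the 200%-per-minute cap), printing 24.
print(desired_workers(current=12, concurrent_requests=2500,
                      active_workers=1, max_workers=40))
```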
Use Cases
- AI Inference
  - Large Language Models (LLMs)
  - Stable Diffusion
  - Computer Vision
  - Speech Recognition
- API Services
  - RESTful APIs
  - WebSocket Support
  - Custom Endpoints