Get started
Endpoint
The New Endpoint option allows users to create a Serverless service using custom images and configurations.
Configuration Parameters
- Endpoint Name
  - Custom name for your endpoint
  - Supports UTF-8 character set
  - Maximum 128 characters
  - Must be unique across your account
- Worker Configuration
  - Active Workers: Initial and minimum number of workers (default: 1)
  - Max Workers: Maximum number of workers for auto-scaling
  - GPUs / Worker: Number of GPUs per worker (range: 1-8)
- Container Settings
  - Container Image: Docker image to use for the service
  - Container Start Command: Command to execute when starting the container
    - Optional: Uses image's entrypoint if not specified
  - Shell: Specifies the shell environment for the command
    - Default: /bin/sh
    - Adjustable based on image requirements
- Network Configuration
  - Data Center: Target cluster for running the serverless service
    - Consider network environment and GPU availability
  - HTTP Port: Port for external HTTP service (see the sketch after this list)
    - Single port only
    - Requests to Endpoint URL are forwarded to this port
    - Container must listen on this port
- Environment
  - Environment Variables: Configure multiple environment variables for the container
- Advanced Settings
  - Network Volume: Option to mount persistent network storage
  - Network Volume Mount Path: Specify the mount path for persistent storage
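The HTTP Port and Environment Variables settings come together inside the container itself. Below is a minimal sketch of a workload that satisfies them, assuming the endpoint's HTTP Port is set to 8000 and a hypothetical MODEL_NAME environment variable has been configured; reading the port from a PORT variable is likewise an assumption of the sketch, not a platform convention:

```python
# Minimal container workload sketch. Assumptions: the endpoint's HTTP Port is
# 8000, and MODEL_NAME / PORT are hypothetical environment variables you set.
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

PORT = int(os.environ.get("PORT", "8000"))  # must match the endpoint's HTTP Port
MODEL_NAME = os.environ.get("MODEL_NAME", "demo-model")  # hypothetical variable

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Requests to the Endpoint URL are forwarded to this port by the platform.
        body = json.dumps({"model": MODEL_NAME, "status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listen on all interfaces so the platform's port forwarding can reach the container.
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```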
Quick Deploys
Quick Deploys lets you deploy custom Endpoints of popular models with minimal configuration.
How do I get started with Quick Deploys?
How to interact with AtlasCloud Serverless?
After creating a serverless endpoint, the platform generates a domain URL that allows you to access the service:
https://${SERVERLESS_ID}.${REGION}.atlascloud.ai/
The URL components:
- SERVERLESS_ID: Your unique endpoint identifier
- REGION: The deployment region (e.g., us-east, eu-west)
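Once deployed, the endpoint behaves like any HTTPS service. Here is a minimal sketch of calling it with the Python standard library, assuming a hypothetical endpoint ID abc123 deployed in us-east:

```python
# Calling a deployed endpoint. The SERVERLESS_ID "abc123" and the "us-east"
# region are placeholders; substitute your own deployment's values.
import urllib.request

SERVERLESS_ID = "abc123"  # hypothetical endpoint identifier
REGION = "us-east"        # deployment region

url = f"https://{SERVERLESS_ID}.{REGION}.atlascloud.ai/"
with urllib.request.urlopen(url, timeout=30) as resp:
    # The platform forwards this request to the container's HTTP Port.
    print(resp.status, resp.read().decode())
```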
Auto Scaling
Serverless deployments start with one worker by default. The platform automatically manages scaling based on concurrent requests and resource utilization.
Key Auto-scaling Features:
- Active Workers: Minimum number of workers that will always be running, regardless of load
- Max Workers: Maximum number of workers that can be created during high load periods
- GPUs / Worker: Number of GPUs allocated to each worker instance, affecting processing capacity
The auto-scaling system follows these rules:
- Scale Up:
  - Triggers when concurrent requests per worker exceed 100
  - New workers are added within 30-60 seconds
  - Scales in increments based on request load
  - Maximum scale-up rate: 200% of current capacity per 60 seconds
- Scale Down:
  - Begins when concurrent requests drop below the threshold
  - Requires 60 seconds of low utilization before scaling down
  - Scales down one worker at a time
  - Maintains the minimum Active Workers count
  - Maximum scale-down rate: 100% of current capacity per 60 seconds
- Scaling Limits:
  - Minimum: Active Workers count
  - Maximum: Max Workers setting
  - Scale to zero: Only if Active Workers is set to 0
- Cold Start:
  - New workers take 30-60 seconds to become available
  - Consider this delay when planning for traffic spikes
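As a rough mental model, the rules above can be sketched as a single decision function. The 100-requests-per-worker threshold, the 200% scale-up cap, the one-worker scale-down step, and the Active Workers floor all come from the list above; the function itself is an illustration, not the platform's actual scaler:

```python
# Illustrative sketch of the documented auto-scaling rules; not the platform's
# actual implementation. All thresholds are taken from the rules above.
def desired_workers(current: int, concurrent_requests: int,
                    active_workers: int, max_workers: int) -> int:
    per_worker = concurrent_requests / max(current, 1)

    if per_worker > 100:
        # Scale up toward ~100 requests/worker, capped at 200% of current
        # capacity per 60-second window and at the Max Workers setting.
        target = -(-concurrent_requests // 100)  # ceiling division
        return max(min(target, current * 2, max_workers), current)

    # Scale down (the platform first waits 60 s of low utilization):
    # one worker at a time, never below the Active Workers floor.
    # With active_workers = 0, this permits scale-to-zero.
    return max(current - 1, active_workers)

# Example: 12 workers handling 2,500 concurrent requests with Max Workers = 40
# scales toward 24 workers (limited by the 200%-per-minute cap), printing 24.
print(desired_workers(current=12, concurrent_requests=2500,
                      active_workers=1, max_workers=40))
```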
Use Cases
- AI Inference
  - Large Language Models (LLMs)
  - Stable Diffusion
  - Computer Vision
  - Speech Recognition
- API Services
  - RESTful APIs
  - WebSocket Support
  - Custom Endpoints