Endpoint

The New Endpoint option lets you create a Serverless service from a custom container image with your own configuration.

Configuration Parameters

  • Endpoint Name

    • Custom name for your endpoint
    • Supports UTF-8 character set
    • Maximum 128 characters
    • Must be unique across your account
  • Worker Configuration

    • Active Workers: Initial and minimum number of workers (default: 1)
    • Max Workers: Maximum number of workers for auto-scaling
    • GPUs / Worker: Number of GPUs per worker (range: 1-8)
  • Container Settings

    • Container Image: Docker image to use for the service
    • Container Start Command: Command to execute when starting the container
      • Optional: Uses image's entrypoint if not specified
    • Shell: Specifies the shell environment for the command
      • Default: /bin/sh
      • Adjustable based on image requirements
  • Network Configuration

    • Data Center: Target cluster for running the serverless service
      • Consider network environment and GPU availability
    • HTTP Port: Port for external HTTP service
      • Single port only
      • Requests to Endpoint URL are forwarded to this port
      • Container must listen on this port (see the server sketch after this list)
  • Environment

    • Environment Variables: Configure multiple environment variables for the container
  • Advanced Settings

    • Network Volume: Option to mount persistent network storage
    • Network Volume Mount Path: Specify the mount path for persistent storage
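
As a concrete illustration of the HTTP Port and Environment Variables settings, the sketch below runs a minimal HTTP server inside the container. The PORT environment variable and the 8000 fallback are assumptions made for this example, not platform requirements; the server simply has to listen on whatever value you set as the HTTP Port.

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: a PORT environment variable is configured on the endpoint and
# matches the HTTP Port setting. Both the name and the 8000 fallback are
# illustrative.
PORT = int(os.environ.get("PORT", "8000"))

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Requests to the Endpoint URL are forwarded to this port.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n")

if __name__ == "__main__":
    # Bind to all interfaces so the platform can reach the container.
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()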

Quick Deploys

Quick Deploys lets you deploy custom Endpoints for popular models with minimal configuration.

How do I get started with Quick Deploys?

How do I interact with AtlasCloud Serverless?

After creating a serverless endpoint, the platform generates a domain URL that allows you to access the service:

https://${SERVERLESS_ID}.${REGION}.atlascloud.ai/

The URL components:

  • SERVERLESS_ID: Your unique endpoint identifier
  • REGION: The deployment region (e.g., us-east, eu-west)
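
For example, the endpoint can be called with any HTTP client. A minimal Python sketch follows; the identifier, region, request path, and timeout are placeholders for your own deployment:

import requests

# Placeholders: substitute the values from your own endpoint.
SERVERLESS_ID = "abc123"
REGION = "us-east"

url = f"https://{SERVERLESS_ID}.{REGION}.atlascloud.ai/"

# The route and payload depend entirely on the container image you deployed;
# a plain GET to the root is used here only as an illustration.
resp = requests.get(url, timeout=30)
print(resp.status_code, resp.text)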

Auto Scaling

Serverless deployments start with one worker by default. The platform automatically manages scaling based on concurrent requests and resource utilization.

Key Auto-scaling Features:

  • Active Workers: Minimum number of workers that will always be running, regardless of load
  • Max Workers: Maximum number of workers that can be created during high load periods
  • GPUs / Worker: Number of GPUs allocated to each worker instance, affecting processing capacity

The auto-scaling system follows these rules (a worked sketch follows the list):

  • Scale Up:

    • Triggers when concurrent requests per worker exceed 100
    • New workers are added within 30-60 seconds
    • Scales in increments based on request load
    • Maximum scale-up rate: 200% of current capacity per 60 seconds
  • Scale Down:

    • Begins when concurrent requests drop below the threshold
    • Requires 60 seconds of low utilization before scaling down
    • Scales down one worker at a time
    • Maintains minimum Active Workers count
    • Maximum scale-down rate: 100% of current capacity per 60 seconds
  • Scaling Limits:

    • Minimum: Active Workers count
    • Maximum: Max Workers setting
    • Scale to zero: Only if Active Workers is set to 0
  • Cold Start:

    • New workers take 30-60 seconds to become available
    • Consider this delay when planning for traffic spikes
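
To make these rules concrete, the sketch below computes a target worker count from the scale-up threshold. It is an illustrative reading of the rules above, not the platform's actual algorithm, and it omits the scale-up/scale-down rate limits and the 60-second scale-down delay:

import math

def desired_workers(concurrent_requests: int,
                    active_workers: int,
                    max_workers: int,
                    per_worker_threshold: int = 100) -> int:
    # Scale up when concurrent requests per worker exceed the threshold,
    # and stay within the [Active Workers, Max Workers] range.
    needed = math.ceil(concurrent_requests / per_worker_threshold)
    return max(active_workers, min(needed, max_workers))

# Example: 250 concurrent requests at 100 requests/worker calls for 3 workers.
print(desired_workers(250, active_workers=1, max_workers=5))  # -> 3
# With Active Workers set to 0 and no traffic, the count scales to zero.
print(desired_workers(0, active_workers=0, max_workers=5))    # -> 0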

Use Cases

  1. AI Inference
    • Large Language Models (LLMs)
    • Stable Diffusion
    • Computer Vision
    • Speech Recognition
  2. API Services
    • RESTful APIs
    • WebSocket Support
    • Custom Endpoints