
vLLM

Create Storage

First, create a persistent storage volume to store model files:

  1. Navigate to the Storage page
  2. Click "New Network Volume" button
  3. Fill in the storage details:
    • Volume Name: Give your storage a descriptive name
    • GB: Choose an appropriate size based on your model's storage requirements (a sizing sketch follows this list)
    • Data Center: Choose the same region where you'll deploy your serverless endpoint
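
To pick a sensible size, you can add up the file sizes of the model you plan to deploy before creating the volume. A minimal sketch using the huggingface_hub Python library (the model id is only an example; gated models may also need your token):

```python
from huggingface_hub import HfApi

# Example model id - replace with the model you plan to deploy.
# Gated models (like many Llama checkpoints) may require token=... here.
api = HfApi()
info = api.model_info("meta-llama/Llama-2-7b-chat-hf", files_metadata=True)

total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"Model files: {total_bytes / 1024**3:.1f} GiB")
# Add headroom (e.g. 20-30%) for tokenizer files, cache metadata, and future models.
```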


Get HuggingFace Token

To download models from HuggingFace, you'll need an access token:

  1. Visit HuggingFace website and sign in to your account
  2. Go to your profile settings
  3. Navigate to "Access Tokens" section
  4. Click "Create new token" button
  5. Configure your token:
    • Name: Give your token a descriptive name
    • Role: Select "read" for model downloading
  6. Click "Create token" button
  7. Copy and save the generated token securely; you'll need it later (you can verify it works as sketched below)
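
Before pasting the token into a deployment, you can confirm that it is valid. A quick check with the huggingface_hub library (the token value below is a placeholder):

```python
from huggingface_hub import HfApi

# Placeholder - replace with the token you just generated.
token = "hf_xxxxxxxxxxxxxxxxxxxx"

# whoami() raises an error if the token is invalid or expired.
user = HfApi(token=token).whoami()
print(f"Token is valid for account: {user['name']}")
```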


Configuration Guide

Choose Model

The platform provides a built-in vLLM framework version 0.6.2 environment. Here's what you need to configure:

  • HuggingFace Model: Enter the target model name (e.g., meta-llama/Llama-2-7b-chat-hf)
  • HuggingFace Token: Optional authentication token
    • Required for certain models and datasets
    • Automatically set as HUGGING_FACE_HUB_TOKEN environment variable in the container
    • Paste the token you generated earlier; a sketch of how the variable is used during model download follows this list
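
For reference, the HuggingFace download tooling inside the container reads this environment variable on its own, so gated models can be fetched without passing the token explicitly. A rough illustration of the same mechanism with huggingface_hub (model id and token are placeholders; newer library versions also read HF_TOKEN):

```python
import os
from huggingface_hub import snapshot_download

# The platform sets this variable for you; it is set here only to illustrate the mechanism.
os.environ["HUGGING_FACE_HUB_TOKEN"] = "hf_xxxxxxxxxxxxxxxxxxxx"  # placeholder

# snapshot_download picks up the token from the environment and caches the
# model files under ~/.cache/huggingface by default.
path = snapshot_download("meta-llama/Llama-2-7b-chat-hf")
print("Model files cached at:", path)
```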

vLLM Parameters

These are optional advanced settings for the vLLM framework. Modify with caution:

  • Tensor Parallel Degree: Number of GPUs the model is sharded across for multi-GPU inference
  • Max Total Tokens: Limits the total response length
  • Quantization: Model compression options (e.g., AWQ, GPTQ)
  • Trust Remote Code: Enable for models that require custom code

Note: Please ensure you understand these parameters before modifying them from their default values.
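
For context, these options map roughly onto vLLM's own engine arguments. A hedged sketch of the equivalent settings in vLLM's offline Python API (all values are illustrative, not recommendations):

```python
from vllm import LLM

# Illustrative values only - in practice the platform applies whatever you enter in the UI.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # example model id
    tensor_parallel_size=2,        # "Tensor Parallel Degree": shard the model across 2 GPUs
    max_model_len=4096,            # caps total tokens (prompt + response) per sequence
    quantization="awq",            # compression scheme; requires a matching quantized checkpoint
    trust_remote_code=True,        # allow custom modeling code shipped with the model repo
)
```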

Endpoint Parameters

Configure your deployment environment:

  • Endpoint Name: Auto-generated but customizable
  • GPU Configuration:
    • Select GPU type (A100, H100, L4, etc.)
    • Specify number of GPUs per worker
  • Data Center: Choose deployment region
  • Storage:
    • Strongly recommended: Mount your Network Volume at /root/.cache/huggingface
    • This enables model persistence across restarts
    • Speeds up subsequent deployments by caching model files

Tip: Persistent storage significantly improves startup time for subsequent deployments by avoiding repeated model downloads. You can check what the cache already holds, as sketched below.
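
If you want to confirm that model files are actually being reused from the mounted volume, you can inspect the HuggingFace cache from inside a running worker. A small sketch with huggingface_hub, assuming the volume is mounted at the default cache path:

```python
from huggingface_hub import scan_cache_dir

# Scans ~/.cache/huggingface/hub, i.e. the Network Volume when it is mounted
# at /root/.cache/huggingface.
cache = scan_cache_dir()
print(f"Cached repos: {len(cache.repos)}, total size: {cache.size_on_disk / 1024**3:.1f} GiB")
for repo in cache.repos:
    print(f"  {repo.repo_id}: {repo.size_on_disk / 1024**3:.1f} GiB")
```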


After deployment, your vLLM endpoint will be ready to serve inference requests. The system will automatically handle model downloading and initialization.
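
Once the endpoint is live, you can call it like any OpenAI-compatible chat API, since vLLM exposes /v1/chat/completions. A minimal sketch, assuming a hypothetical endpoint URL and API key from your dashboard (the exact base URL format depends on the platform):

```python
from openai import OpenAI

# Hypothetical values - use the endpoint URL and API key shown in your dashboard.
client = OpenAI(
    base_url="https://your-endpoint.example.com/v1",
    api_key="your-platform-api-key",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # the model you configured
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```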