How to Deploy Llama 4 to AWS, Azure & Hugging Face

This guide provides step-by-step instructions for deploying Meta's Llama 4 models (Scout and Maverick) on three major platforms: AWS, Azure, and Hugging Face. These models offer advanced capabilities, including multimodal processing, large context windows, and state-of-the-art performance.

💡 Developer Tip: Before diving into deployment, consider upgrading your API testing toolkit! Apidog offers a more intuitive, feature-rich alternative to Postman, with better support for AI model endpoints, collaborative testing, and automated API documentation. Your LLM deployment workflow will thank you for making the switch.


Prerequisites & Hardware Requirements for Llama 4 Deployment

Access to the Llama 4 models through Meta's license agreement
A Hugging Face account with a READ access token
An AWS, Azure, or Hugging Face Pro account, depending on your deployment target
A basic understanding of containerization and cloud services

AWS (via TensorFuse)

Scout: 8x H100 GPUs for a 1M-token context
Maverick: 8x H100 GPUs for a 430K-token context
Alternative: 8x A100 GPUs (reduced context window)

Azure

(The recommendations below align with general Azure ML guidance for large language models, but no Llama 4-specific documentation was found to confirm the exact requirements.)

Recommended: ND A100 v4-series (8 NVIDIA A100 GPUs)
Minimum: Standard_ND40rs_v2 or higher

Hugging Face

Recommended: A10G-Large Space hardware
Alternative: A100-Large (premium hardware option)
Free-tier hardware is insufficient for the full model

1. Deploying Llama 4 to AWS Using TensorFuse

1.1 Set Up AWS and TensorFuse

Install the TensorFuse CLI:

pip install tensorkube

Configure your AWS credentials:

aws configure

Initialize TensorFuse with your AWS account:

tensorkube init

1.2 Create the Required Secrets

Store your Hugging Face token:

tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=YOUR_HF_TOKEN --env default

Create an API authentication token:

tensorkube secret create vllm-token VLLM_API_KEY=vllm-key --env default
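The literal vllm-key above is only a placeholder. For anything beyond a quick test you would probably want a random value instead. A minimal Python sketch using only the standard library (the helper name new_api_key is made up for this example):

```python
import secrets


def new_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random string that can serve as a bearer token."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Use the printed value in place of the vllm-key placeholder.
    print(new_api_key())
```

Whatever value you store in the secret is the one clients must later send in the Authorization header.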

1.3 Create a Dockerfile for Llama 4

For the Scout model:

FROM vllm/vllm-openai:v0.8.3
ENV HF_HUB_ENABLE_HF_TRANSFER=1
EXPOSE 80
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            "--model", "meta-llama/Llama-4-Scout-17B-16E-Instruct", \
            "--dtype", "bfloat16", \
            "--trust-remote-code", \
            "--tensor-parallel-size", "8", \
            "--max-model-len", "1000000", \
            "--port", "80", \
            "--override-generation-config", "{\"attn_temperature_tuning\": true}", \
            "--limit-mm-per-prompt", "image=10", \
            "--kv-cache-dtype", "fp8", \
            "--api-key", "${VLLM_API_KEY}"]

For the Maverick model:

FROM vllm/vllm-openai:v0.8.3
ENV HF_HUB_ENABLE_HF_TRANSFER=1
EXPOSE 80
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            "--model", "meta-llama/Llama-4-Maverick-17B-128E-Instruct", \
            "--dtype", "bfloat16", \
            "--trust-remote-code", \
            "--tensor-parallel-size", "8", \
            "--max-model-len", "430000", \
            "--port", "80", \
            "--override-generation-config", "{\"attn_temperature_tuning\": true}", \
            "--limit-mm-per-prompt", "image=10", \
            "--kv-cache-dtype", "fp8", \
            "--api-key", "${VLLM_API_KEY}"]

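The two Dockerfiles differ only in the model ID and the maximum context length. As an illustration, the following Python sketch builds the same argument list from a small lookup table and prints it as a single-line exec-form ENTRYPOINT. The names MODEL_SETTINGS, vllm_server_args, and entrypoint_line are hypothetical helpers for this example, not part of vLLM or TensorFuse.

```python
import json

# Per-model settings copied from the two Dockerfiles above.
MODEL_SETTINGS = {
    "scout": ("meta-llama/Llama-4-Scout-17B-16E-Instruct", 1_000_000),
    "maverick": ("meta-llama/Llama-4-Maverick-17B-128E-Instruct", 430_000),
}


def vllm_server_args(variant: str, port: int = 80, tensor_parallel: int = 8) -> list:
    """Build the flags passed to vllm.entrypoints.openai.api_server."""
    model_id, max_len = MODEL_SETTINGS[variant]
    return [
        "--model", model_id,
        "--dtype", "bfloat16",
        "--trust-remote-code",
        "--tensor-parallel-size", str(tensor_parallel),
        "--max-model-len", str(max_len),
        "--port", str(port),
        # json.dumps produces the nested JSON string without manual escaping.
        "--override-generation-config", json.dumps({"attn_temperature_tuning": True}),
        "--limit-mm-per-prompt", "image=10",
        "--kv-cache-dtype", "fp8",
        "--api-key", "${VLLM_API_KEY}",
    ]


def entrypoint_line(variant: str) -> str:
    """Render a single-line exec-form ENTRYPOINT instruction."""
    cmd = ["python3", "-m", "vllm.entrypoints.openai.api_server"]
    return "ENTRYPOINT " + json.dumps(cmd + vllm_server_args(variant))


if __name__ == "__main__":
    print(entrypoint_line("scout"))
```

Generating the line this way avoids hand-escaping the quotes inside the --override-generation-config value, which is the easiest part of the Dockerfile to get wrong.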
1.4 Create the Deployment Configuration

Create deployment.yaml:

gpus: 8
gpu_type: h100
secret:
  - hugging-face-secret
  - vllm-token
min-scale: 1
readiness:
  httpGet:
    path: /health
    port: 80
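The readiness block above has the platform probe /health on port 80 before the service is considered ready. Once the service has been deployed, the same check can be run by hand while you wait for it to come up. A rough Python sketch, assuming the endpoint URL is reachable from your machine; the function names are illustrative and not part of TensorFuse:

```python
import time
import urllib.request
from typing import Callable


def http_health_probe(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        url = base_url.rstrip("/") + "/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 60,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A call such as wait_until_ready(lambda: http_health_probe("YOUR_APP_URL")) would then block until the server answers or the attempts run out. Loading a model of this size can take a while, so the default of 60 attempts at 10-second intervals is only a guess.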

1.5 Deploy to AWS

Deploy your service:

tensorkube deploy --config-file ./deployment.yaml

1.6 Access the Deployed Service

List your deployments to get your endpoint URL:

tensorkube deployment list

Test your deployment:

curl --request POST \
  --url YOUR_APP_URL/v1/completions \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer vllm-key' \
  --data '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "prompt": "Earth to Llama 4. What can you do?",
    "max_tokens": 1000
  }'
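The same request can be issued from Python without extra dependencies. The sketch below mirrors the curl call using the standard library; build_completion_request and send are hypothetical helper names, and YOUR_APP_URL and the API key remain placeholders.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str,
    api_key: str,
    prompt: str,
    model: str = "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    max_tokens: int = 1000,
) -> urllib.request.Request:
    """Build a POST request for the /v1/completions route, like the curl call."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a live endpoint, send(build_completion_request("YOUR_APP_URL", "vllm-key", "Earth to Llama 4. What can you do?")) should return a dictionary whose generated text sits under choices[0]["text"], assuming the server follows the OpenAI completions response format.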

2. Deploying Llama 4 to Azure

2.1 Set Up an Azure ML Workspace

Install the Azure CLI and the ML extension:

pip install azure-cli
az extension add --name ml
az login

Create an Azure ML workspace:

az ml workspace create --name llama4-workspace --resource-group your-resource-group

2.2 Create a Compute Cluster

az ml compute create --name llama4-cluster --type amlcompute --min-instances 0 \
  --max-instances 1 --size Standard_ND40rs_v2 --vnet-name your-vnet-name \
  --subnet your-subnet --resource-group your-resource-group --workspace-name llama4-workspace

2.3 Register the Llama 4 Model in Azure ML

Create model.yml:

$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: llama-4-scout
version: 1
path: .
properties:
  model_name: "meta-llama/Llama-4-Scout-17B-16E-Instruct"

Register the model:

az ml model create --file model.yml --resource-group your-resource-group --workspace-name llama4-workspace

2.4 Create the Deployment Configuration

Create deployment.yml:

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: llama4-deployment
endpoint_name: llama4-endpoint
model: azureml:llama-4-scout@latest
instance_type: Standard_ND40rs_v2
instance_count: 1
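environment_variables:
  # The ${{secrets.*}} values are placeholders: substitute real values or
  # inject them at deploy time, and confirm how your tooling resolves them.
  HUGGING_FACE_HUB_TOKEN: ${{secrets.HF_TOKEN}}
  VLLM_API_KEY: ${{secrets.VLLM_KEY}}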
environment:
  image: vllm/vllm-openai:v0.8.3
  conda_file: conda.yml

Create conda.yml:

channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
    - vllm==0.8.3
    - transformers
    - accelerate

2.5 Create the Endpoint and Deploy

az ml online-endpoint create --name llama4-endpoint \
  --resource-group your-resource-group --workspace-name llama4-workspace

az ml online-deployment create --file deployment.yml \
  --resource-group your-resource-group --workspace-name llama4-workspace

2.6 Test the Deployment

az ml online-endpoint invoke --name llama4-endpoint --request-file request.json \
  --resource-group your-resource-group --workspace-name llama4-workspace

Where request.json contains:

{
  "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  "prompt": "Earth to Llama 4. What can you do?",
  "max_tokens": 1000
}
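If you prefer to generate request.json rather than write it by hand (for example, to vary the prompt between test runs), a small Python sketch could look like this. The helper names completion_body and write_request_file are invented for this example.

```python
import json
from pathlib import Path


def completion_body(
    prompt: str,
    model: str = "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    max_tokens: int = 1000,
) -> dict:
    """Return the request body shown above as a Python dictionary."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def write_request_file(path: str, prompt: str, **kwargs) -> Path:
    """Serialize the body to a JSON file and return the file's path."""
    target = Path(path)
    target.write_text(
        json.dumps(completion_body(prompt, **kwargs), indent=2) + "\n",
        encoding="utf-8",
    )
    return target


if __name__ == "__main__":
    write_request_file("request.json", "Earth to Llama 4. What can you do?")
```

Letting json.dumps do the serialization keeps quotes and other special characters in the prompt correctly escaped.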

3. Deploying Llama 4 to Hugging Face

3.1 Set Up a Hugging Face Account

Create a Hugging Face account at https://huggingface.co/
Accept the license agreement for the Llama 4 models at https://huggingface.co/meta-llama

3.2 Deploy Using Hugging Face Spaces

Go to https://huggingface.co/spaces and click "Create new Space"

Configure your Space:

Name: llama4-deployment
License: Choose the appropriate license
SDK: Choose Gradio
Space hardware: A10G-Large (for best performance)
Visibility: Private or Public, depending on your needs

Clone the Space repository:

git clone https://huggingface.co/spaces/YOUR_USERNAME/llama4-deployment
cd llama4-deployment

3.3 Create the Application Files

Create app.py:

import os

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Read the HF token from the environment (add HUGGING_FACE_HUB_TOKEN under the
# Space's Secrets) instead of hard-coding it in the source file.
hf_token = os.environ.get("HUGGING_FACE_HUB_TOKEN")

# Load model and tokenizer with appropriate configuration
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Create the text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)


def generate_text(prompt, max_length=1000, temperature=0.7):
    # Pass chat messages so the pipeline applies the tokenizer's own chat
    # template instead of a hand-written prompt format.
    messages = [{"role": "user", "content": prompt}]

    outputs = pipe(
        messages,
        max_new_tokens=int(max_length),  # sliders may return floats
        temperature=temperature,
        do_sample=True,
    )

    # With chat input, generated_text holds the whole conversation;
    # the last message is the model's reply.
    return outputs[0]["generated_text"][-1]["content"]


# Create Gradio interface
demo = gr.Interface(
    fn=generate_text,
    inputs=[
        gr.Textbox(lines=4, placeholder="Enter your prompt here...", label="Prompt"),
        gr.Slider(minimum=100, maximum=2000, value=1000, step=100, label="Max Length"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.7, step=0.1, label="Temperature"),
    ],
    outputs="text",
    title="Llama 4 Demo",
    description="Generate text using Meta's Llama 4 model",
)

demo.launch()

Create requirements.txt:

accelerate>=0.20.3
bitsandbytes>=0.41.1
gradio>=3.50.0
torch>=2.0.1
transformers>=4.51.0

3.4 Deploy to Hugging Face

Push to your Hugging Face Space:

git add app.py requirements.txt
git commit -m "Add Llama 4 deployment"
git push

3.5 Monitor the Deployment

Visit your Space URL: https://huggingface.co/spaces/YOUR_USERNAME/llama4-deployment
The first build will take a while because it has to download and set up the model
Once deployed, you will see a Gradio interface where you can interact with the model

4. Testing and Interacting with Your Deployment

4.1 Using a Python Client for API Access (AWS & Azure)

import openai

# For AWS
client = openai.OpenAI(
    base_url="YOUR_AWS_URL/v1",  # From tensorkube deployment list
    api_key="vllm-key"  # Your configured API key
)

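# For Azure: define this client INSTEAD of the AWS one above
# (assigning both would silently overwrite the first).
# client = openai.AzureOpenAI(
#     azure_endpoint="YOUR_AZURE_ENDPOINT",
#     api_key="YOUR_API_KEY",
#     api_version="2023-05-15",
# )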

# Make a text completion request
response = client.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    prompt="Write a short poem about artificial intelligence.",
    max_tokens=200
)

print(response.choices[0].text)

# For multimodal capabilities (if supported)
import base64

# Load image as base64
with open("image.jpg", "rb") as image_file:
    image_base64 = base64.b64encode(image_file.read()).decode('utf-8')

# Create chat completion with the image
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image:"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
            ]
        }
    ],
    max_tokens=300
)

print(response.choices[0].message.content)
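The image example above hard-codes a JPEG MIME type and a single image. A slightly more general sketch is shown below: it guesses the MIME type from the file name and refuses more than ten images per message, mirroring the --limit-mm-per-prompt image=10 setting from the Dockerfiles. The helper names image_data_url and image_message are assumptions made for this example.

```python
import base64
import mimetypes
from pathlib import Path

MAX_IMAGES = 10  # matches --limit-mm-per-prompt image=10 in the Dockerfiles


def image_data_url(path: str) -> str:
    """Encode an image file as a base64 data URL with a guessed MIME type."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(text: str, image_paths: list) -> dict:
    """Build one user message containing a text part and one part per image."""
    if len(image_paths) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images per prompt")
    content = [{"type": "text", "text": text}]
    for path in image_paths:
        content.append({"type": "image_url", "image_url": {"url": image_data_url(path)}})
    return {"role": "user", "content": content}
```

The returned dictionary can be passed as the single entry of the messages list in client.chat.completions.create, for example messages=[image_message("Describe this image:", ["image.jpg"])].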

Conclusion

You now have step-by-step instructions for deploying Llama 4 models on AWS, Azure, and Hugging Face. Each platform offers different advantages:

AWS with TensorFuse: Full control, high scalability, best performance
Azure: Integration with the Microsoft ecosystem, managed ML services
Hugging Face: Simplest setup, well suited to prototyping and demos

Choose the platform that best fits your specific requirements for cost, scale, performance, and ease of management.