N Ways to Deploy Large Models Efficiently: LLama2 as an Example

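Using a LLama2 deployment as the running example, this article compares the strengths and weaknesses of several open-source LLM inference serving frameworks. It does not cover the traditional libraries for serving deep learning models, such as TorchServe, KServe, or Triton Inference Server.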

1. vLLM

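vLLM reports 14 to 24 times the throughput of HuggingFace Transformers (HF) and 2.2 times the throughput of HuggingFace Text Generation Inference (TGI). It provides continuous batching and PagedAttention, and integrates various decoding algorithms, including parallel sampling and beam search. However, it lacks support for adapters (LoRA, QLoRA, etc.).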
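To see why continuous batching helps throughput, here is a toy simulation in plain Python. It is only a sketch under simplifying assumptions (one token per request per decode step, hypothetical output lengths, no prefill cost) and does not reflect vLLM's actual scheduler.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots from the queue before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request;
        # requests that just produced their last token leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six requests.
lengths = [200, 10, 10, 10, 150, 20]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)  # prints: 360 200
```

With static batching the short requests wait for the longest one in their batch, while continuous batching keeps the second slot busy the whole time.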

Follow the official repository for later feature updates: https://github.com/vllm-project/vllm

Local inference:

# pip install vllm
from vllm import LLM, SamplingParams

prompts = [
    "Funniest joke ever:",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.95, top_p=0.95, max_tokens=200)
llm = LLM(model="huggyllama/llama-13b")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

API server:

# Start the server:
python -m vllm.entrypoints.api_server --model huggyllama/llama-13b

# Query the model in shell:
curl http://localhost:8000/generate \
    -d '{
        "prompt": "Funniest joke ever:",
        "n": 1,
        "temperature": 0.95,
        "max_tokens": 200
    }'

2. Text generation inference

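A Rust, Python, and gRPC server for text generation inference. It is used in production at HuggingFace to power the LLM API inference widgets. It has built-in Prometheus metrics for monitoring server load and performance, and it can use Flash Attention and PagedAttention. All dependencies are installed in the Docker image. It supports HuggingFace models and offers many options for managing model inference, including precision adjustment, quantization, tensor parallelism, and repetition penalty. It is a good fit for people familiar with Rust programming.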
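As an illustration of one of these options, the sketch below shows a common formulation of the repetition penalty, the one used by HuggingFace Transformers: logits of tokens that were already generated are pushed down before sampling. The vocabulary and scores are hypothetical, and it is an assumption that TGI applies the penalty in exactly this form.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of logits with already-generated tokens made less likely."""
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


logits = [2.0, -1.0, 0.5, 3.0]  # hypothetical scores for a 4-token vocabulary
generated_ids = [0, 1, 0]       # tokens 0 and 1 were already produced
penalized = apply_repetition_penalty(logits, generated_ids, penalty=2.0)
print(penalized)  # prints: [1.0, -2.0, 0.5, 3.0]
```

A penalty of 1.0 leaves the logits unchanged; larger values discourage repetition more strongly.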

Official repository: https://github.com/huggingface/text-generation-inference

Run the web server with Docker:

mkdir data
docker run --gpus all --shm-size 1g -p 8080:80 \
-v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:0.9 \
  --model-id huggyllama/llama-13b \
  --num-shard 1

Query the server:

# pip install text-generation
from text_generation import Client

client = Client("http://127.0.0.1:8080")
prompt = "Funniest joke ever:"
print(client.generate(prompt, max_new_tokens=17, temperature=0.95).generated_text)

3. CTranslate2

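CTranslate2 is a C++ and Python library for efficient inference with Transformer models. It runs fast and efficiently on both CPU and GPU, supports multiple CPU architectures, and applies several optimizations: layer fusion, padding removal, batch reordering, in-place operations, and a caching mechanism. It supports parallel and asynchronous execution. It lacks support for adapters (LoRA, QLoRA, etc.).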
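Batch reordering is easiest to understand by counting padding. The toy calculation below, with hypothetical sequence lengths, compares the number of pad tokens when sequences are batched in arrival order and when they are first sorted by length. It only illustrates the idea and is not CTranslate2's implementation.

```python
def padding_tokens(lengths, batch_size):
    """Count pad tokens when consecutive sequences are batched together."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sequence is padded up to the longest one in its batch.
        total += sum(max(batch) - n for n in batch)
    return total


lengths = [5, 120, 7, 118, 6, 119]  # hypothetical token counts
unsorted_padding = padding_tokens(lengths, batch_size=3)
sorted_padding = padding_tokens(sorted(lengths), batch_size=3)
print(unsorted_padding, sorted_padding)  # prints: 342 6
```

A real implementation would also restore the original request order after the batches have been processed.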

pip install -qqq transformers ctranslate2

# The model should be first converted into the CTranslate2 model format:
ct2-transformers-converter --model huggyllama/llama-13b --output_dir llama-13b-ct2 --force

Query the model:

import ctranslate2
import transformers

generator = ctranslate2.Generator("llama-13b-ct2", device="cuda", compute_type="float16")
tokenizer = transformers.AutoTokenizer.from_pretrained("huggyllama/llama-13b")

prompt = "Funniest joke ever:"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch(
    [tokens], 
    sampling_topk=1, 
    max_length=200, 
)
tokens = results[0].sequences_ids[0]
output = tokenizer.decode(tokens)
print(output)
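The `sampling_topk=1` argument above restricts sampling to the single best token, which is the same as greedy decoding. The following plain-Python sketch of top-k sampling, using hypothetical logits, shows why: with k=1 only one candidate survives, so the choice is deterministic.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample a token id from the k highest-scoring entries of logits."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits (shifted by the max for stability).
    peak = logits[top_ids[0]]
    weights = [math.exp(logits[i] - peak) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


logits = [0.1, 2.5, 1.0, 2.4]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
greedy_picks = {sample_top_k(logits, k=1, rng=rng) for _ in range(20)}
top2_picks = {sample_top_k(logits, k=2, rng=rng) for _ in range(200)}
print(greedy_picks, top2_picks)
```

Raising k lets lower-ranked tokens through, which trades determinism for variety.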

4. DeepSpeed-MII

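MII, powered by DeepSpeed, makes low-latency, high-throughput inference possible. It load-balances across multiple replicas, which is very useful for handling a large number of users. It offers native and Azure integration. It lacks support for adapters (LoRA, QLoRA, etc.).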
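The source does not say which load-balancing policy MII uses across replicas. As an illustration only, the sketch below assumes the simplest one, round-robin, with hypothetical replica addresses.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out replica addresses in a fixed rotating order."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self._replicas = cycle(replicas)

    def next_replica(self):
        return next(self._replicas)


# Hypothetical replica addresses.
balancer = RoundRobinBalancer(["gpu-0:50050", "gpu-1:50050", "gpu-2:50050"])
assignments = [balancer.next_replica() for _ in range(5)]
print(assignments)
```

Each incoming request goes to the next replica in turn, so load is spread evenly when requests cost roughly the same.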

# DON'T INSTALL USING pip install deepspeed-mii
# git clone https://github.com/microsoft/DeepSpeed-MII.git
# cd DeepSpeed-MII
# git reset --hard 60a85dc3da5bac3bcefa8824175f8646a0f12203
# pip install .
# pip3 install -U deepspeed

# ... and make sure that you have same CUDA versions:
# python -c "import torch;print(torch.version.cuda)" == nvcc --version
import mii

mii_configs = {
    "dtype": "fp16",
    'max_tokens': 200,
    'tensor_parallel': 1,
    "enable_load_balancing": False
}
mii.deploy(task="text-generation",
           model="huggyllama/llama-13b",
           deployment_name="llama_13b_deployment",
           mii_config=mii_configs)

Query the deployment:

import mii

generator = mii.mii_query_handle("llama_13b_deployment")
result = generator.query(  
  {"query": ["Funniest joke ever:"]}, 
  do_sample=True,
  max_new_tokens=200
)
print(result)

5. OpenLLM

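An open platform for operating large language models (LLMs) in production. OpenLLM supports quantization with bitsandbytes and GPTQ and integrates with LangChain. It lacks batching support and built-in distributed inference.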
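A quick back-of-the-envelope calculation shows why quantization matters for a 13B-parameter model. It is a rough estimate that counts the weights only and ignores the KV cache, activations, and quantization metadata such as scales.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 13e9  # a 13B-parameter model such as the one used in this article
fp16_gb = weight_memory_gb(num_params, 16)
int8_gb = weight_memory_gb(num_params, 8)
int4_gb = weight_memory_gb(num_params, 4)
print(fp16_gb, int8_gb, int4_gb)  # prints: 26.0 13.0 6.5
```

Under these assumptions, going from fp16 to 4-bit weights cuts the weight footprint from about 26 GB to about 6.5 GB.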

Official repository: https://github.com/bentoml/OpenLLM

pip install openllm scipy
openllm start llama --model-id huggyllama/llama-13b \
  --max-new-tokens 200 \
  --temperature 0.95 \
  --api-workers 1 \
  --workers-per-resource 1

Query the server:

import openllm

client = openllm.client.HTTPClient('http://localhost:3000')
print(client.query("Funniest joke ever:"))

6. Ray Serve

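Ray Serve is a scalable model-serving library for building online inference APIs. Serve is framework-agnostic, so a single toolkit can serve everything from deep learning models onward. The Ray dashboard gives a high-level overview of the Ray cluster and the status of Ray Serve applications. It supports autoscaling across multiple replicas and dynamic request batching. Ray Serve is not focused on LLMs; it is a broader framework for deploying any ML model. It fits best in enterprises where availability, scalability, and observability matter most.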
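The core idea behind autoscaling across replicas can be sketched as a small calculation: pick enough replicas to keep the load per replica at or below a target, within fixed bounds. The function and parameter names below are hypothetical and are not Ray Serve configuration keys; a real policy would also add smoothing and scale-up or scale-down delays.

```python
import math


def desired_replicas(ongoing_requests, target_per_replica, min_replicas, max_replicas):
    """Replicas needed to keep per-replica load at or below the target."""
    wanted = math.ceil(ongoing_requests / target_per_replica)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


light_load = desired_replicas(3, target_per_replica=4, min_replicas=1, max_replicas=8)
heavy_load = desired_replicas(18, target_per_replica=4, min_replicas=1, max_replicas=8)
burst_load = desired_replicas(100, target_per_replica=4, min_replicas=1, max_replicas=8)
print(light_load, heavy_load, burst_load)  # prints: 1 5 8
```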

https://github.com/ray-project

# pip install "ray[serve]" "accelerate>=0.16.0" "transformers>=4.26.0" torch starlette pandas
# ray_serve.py
import pandas as pd

import ray
from ray import serve
from starlette.requests import Request

@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def generate(self, text: str) -> pd.DataFrame:
        input_ids = self.tokenizer(text, return_tensors="pt").input_ids.to(
            self.model.device
        )
        gen_tokens = self.model.generate(
            input_ids,
            do_sample=True,  # temperature only takes effect when sampling
            temperature=0.9,
            max_length=200,
        )
        return pd.DataFrame(
            self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
        )

    async def __call__(self, http_request: Request) -> str:
        # The client posts a JSON list of {"text": ...} objects; use the first one.
        json_request = await http_request.json()
        return self.generate(json_request[0]["text"])

deployment = PredictDeployment.bind(model_id="huggyllama/llama-13b")


# then run from CLI command:
# serve run ray_serve:deployment

Query the deployment:

import requests

sample_input = {"text": "Funniest joke ever:"}
output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)

7. MLC LLM

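Machine Learning Compilation for LLM (MLC LLM) is a universal deployment solution that lets LLMs run efficiently on consumer devices with native hardware acceleration. The library focuses mainly on compiling models for different devices. It supports grouped quantization.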
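Grouped quantization stores one scale per small group of weights rather than one per tensor, so an outlier only degrades its own group. The sketch below uses symmetric absmax quantization with hypothetical weights, group sizes, and bit width. It illustrates the general technique and is not MLC LLM's actual scheme.

```python
def quantize_groups(weights, group_size, bits=4):
    """Quantize weights to signed integers, one absmax scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Fall back to a scale of 1.0 for an all-zero group.
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        quantized.extend(round(w / scale) for w in group)
    return quantized, scales


def dequantize_groups(quantized, scales, group_size):
    """Recover approximate weights by multiplying by each group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


def max_error(weights, group_size):
    """Largest absolute round-trip error for the given group size."""
    quantized, scales = quantize_groups(weights, group_size)
    restored = dequantize_groups(quantized, scales, group_size)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical weights: four small values, then a group containing an outlier.
weights = [0.01, -0.02, 0.03, -0.04, 1.0, -7.0, 3.0, 2.0]
one_group_error = max_error(weights, group_size=8)
two_group_error = max_error(weights, group_size=4)
print(one_group_error, two_group_error)
```

With a single group the outlier -7.0 sets the scale and the small weights all collapse to zero. With two groups the small weights get their own scale and keep most of their precision.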

# 1. Make sure that you have python >= 3.9
# 2. You have to run it using conda:
conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-nightly
conda activate mlc-chat-venv

# 3. Then install package:
pip install --pre --force-reinstall mlc-ai-nightly-cu118 \
  mlc-chat-nightly-cu118 \
  -f https://mlc.ai/wheels

# 4. Download the model weights from HuggingFace and binary libraries:
git lfs install && mkdir -p dist/prebuilt && \
  git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib && \
  cd dist/prebuilt && \
  git clone https://huggingface.co/huggyllama/llama-13b dist/ && \
  cd ../..
  
  
# 5. Run server:
python -m mlc_chat.rest --device-name cuda --artifact-path dist

Query the server:

import requests

payload = {
   "model": "llama-13b",  # should match the model the server is running
   "messages": [{"role": "user", "content": "Funniest joke ever:"}],
   "stream": False
}
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
print(r.json()['choices'][0]['message']['content'])

Other

dstack sets up the environment needed for LLM inference and launches it with a single command: dstack run . -f vllm/serve.dstack.yml

type: task

env:
  - MODEL=huggyllama/llama-13b
  # (Optional) Specify your Hugging Face token
  - HUGGING_FACE_HUB_TOKEN=

ports:
  - 8000

commands:
  - conda install cuda # Required since vLLM will rebuild the CUDA kernel
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
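The task above starts vLLM's OpenAI-compatible server, so it can be queried with an OpenAI-style completion request. Here is a minimal client sketch using only the standard library. The host, port, and helper names are assumptions for illustration, and `complete` needs the server to be reachable.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=200, temperature=0.95):
    """Build a POST request for an OpenAI-style /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request):
    """Send the request and return the first generated text (needs a live server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:8000", "huggyllama/llama-13b", "Funniest joke ever:"
)
print(request.full_url)
# With the server from the task above running:
# print(complete(request))
```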

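Conclusion:

1. When you need maximum speed for batched prompts, use vLLM.

2. If you need native HuggingFace support and do not plan to use multiple adapters, choose Text generation inference.

3. To run inference on CPU, choose CTranslate2.

4. To attach adapters to the core model and use HuggingFace Agents, choose OpenLLM.

5. For a stable pipeline and flexible deployment, use Ray Serve; the project is relatively mature.

6. To deploy an LLM natively on the client side (edge computing), for example on Android or iPhone, use MLC LLM.

7. If you already have experience with the DeepSpeed library and want to keep using it to deploy LLMs, use DeepSpeed-MII.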

Reference:

https://betterprogramming.pub/frameworks-for-serving-llms-60b7f7b23407

https://dstack.ai/

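This article is a contributed submission and does not represent the position of 美熙智能. If you repost it, please credit the original author and the source: https://www.icnma.com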
