Part1LMFlow
香港科技大学新开源库LMFlow[1],有四大特性:可扩展、轻量级、定制化和完全开源。
https://github.com/OptimalScale/LMFlow
我们微调大语言模型做对话机器人,需要以下几个步骤:预训练、有监督微调、指令微调和人类价值观对齐。LMFlow 库可以加载现有的开源大模型,灵活组合这套流程的各环节,有点像HuggingFace的pipline思想,见下文微调示例。
目前支持几种流水线:
Part2上手指南
LMFlow使用文档[2]:
https://optimalscale.github.io/LMFlow/
使用 conda 安装必要的依赖:
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
运行 bash 脚本下载所需的数据集:
cd data && ./download.sh && cd -
我们也可以在指定的数据集目录下提供 .json 文件的列表。自定义数据的格式:
1、text2text 格式
{
"type": "text2text",
"instances": [
{
"input": "Question: The Transformer architecture [START_REF]",
"output": "N/A"
},
...
]
}
2、text_only
{
"type": "text_only",
"instances": [
{
"text": "Defintion: 。。。. For example,。。。Input: Question: Sentence: 。。。. nQuestion: 。。。? n Output: NA nn"
},
...
]
}
Part3微调示例
通过pipeline_name传入名称即可,例如:inferencer、finetuner、evaluator等。官方demo[3]微调代码的示例:
#!/usr/bin/env python
# coding=utf-8
import sys
from transformers import HfArgumentParser
from lmflow.args import (
ModelArguments,
DatasetArguments,
AutoArguments)
from lmflow.datasets.dataset import Dataset
from lmflow.models.auto_model import AutoModel
from lmflow.pipeline.auto_pipeline import AutoPipeline
def main():
# Parses arguments
pipeline_name = "finetuner"
PipelineArguments = AutoArguments.get_pipeline_args_class(pipeline_name)
parser = HfArgumentParser((ModelArguments, DatasetArguments, PipelineArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
# 如果传入json参数文件
model_args, data_args, pipeline_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
model_args, data_args, pipeline_args = parser.parse_args_into_dataclasses()
# Initialization
finetuner = AutoPipeline.get_pipeline(
pipeline_name=pipeline_name,
model_args=model_args,
data_args=data_args,
pipeline_args=pipeline_args,
)
dataset = Dataset(data_args)
model = AutoModel.get_model(model_args)
# Tokenization and text grouping must be done in the main process
with pipeline_args.main_process_first(desc="dataset map tokenization"):
tokenized_dataset = model.tokenize(dataset)
lm_dataset = finetuner.group_text(
tokenized_dataset,
model_max_length=model.get_max_length(),
)
# Finetuning
tuned_model = finetuner.tune(model=model, lm_dataset=lm_dataset)
if __name__ == '__main__':
main()
参考资料
LMFlow: http://lmflow.com/
[2]LMFlow使用文档: https://optimalscale.github.io/LMFlow/
[3]LMFlow: https://github.com/OptimalScale/LMFlow
END
2023-03-29
2023-02-27
2023-02-02
2023-02-02
2022-12-07
2023-03-14
“点赞”是喜欢,“在看、分享”是真爱
<