当前位置：首页 > 资讯 > 系统环境

大模型技术：LoRA 详解，从理论到实践理解 LoRA

时间：2025-11-14 21:10 作者：来源：阅读：3
扫一扫，手机访问

摘要：LoRA（低秩适应）是一种用于微调大型语言模型，特别是基于transformers的语言模型，以减少计算和内存开销。LoRA 一般应用于已经在大型数据集上预先训练的模型。预训练是使用大量文本数据语料库训练模型的初始阶段。该模型通过反向传播调整其内部参数（权重）来学习。在数十亿或数万亿个标记上重复此过程，直到模型可以准确地预测或生成文本。预训练后，模型会进行微调，在较小的特定于任务的数据集上进行进

LoRA（低秩适应）是一种用于微调大型语言模型，特别是基于transformers的语言模型，以减少计算和内存开销。

LoRA 一般应用于已经在大型数据集上预先训练的模型。

预训练是使用大量文本数据语料库训练模型的初始阶段。该模型通过反向传播调整其内部参数（权重）来学习。在数十亿或数万亿个标记上重复此过程，直到模型可以准确地预测或生成文本。

预训练后，模型会进行微调，在较小的特定于任务的数据集上进行进一步训练，以使其适应特定任务，例如情感分析、翻译或问答。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Pre-training vs Fine-tuning

随着模型变得越来越大，微调对于有效地调整模型、减少所需的时间和 GPU 内存至关重大。它允许有针对性的调整，而无需重新训练整个模型，使大规模语言模型更加实用。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Model Sizes Over Time. Source

微调技术有多种类型，每种都适合不同的目标和环境。我们的目标是关注 LoRA，但让我们简要回顾一下最常见的技术。

标准（完整）微调Standard (Full) Fine-Tuning：调整所有模型参数。这种方法是资源密集型的，由于它需要更新模型的所有参数。
参数高效微调Parameter-Efficient Fine-Tuning (PEFT)：在预训练模型中引入小型可训练模块或低阶自适应，显着减少可训练参数的数量。 LoRA 就属于这一类。
最后一层微调Last-Layer Fine-Tuning（头部微调）：仅微调模型的最后一层，同时保持模型的其余参数冻结。
基于特征的微调Feature-Based Fine-Tuning：使用预先训练的模型作为特征提取器，并将这些特征输入到单独的分类器或特定于任务的模型中。
指令微调Instruct Fine-Tuning：微调模型以更准确、更安全地遵循人类指令或提示。用于对话式人工智能、客户支持机器人以及模型需要准确遵循用户指令的其他应用程序。
针对特定任务的微调Task-Specific Fine-Tuning：它涉及通过在特定于任务的数据集上进行微调来调整预训练的语言模型来执行特定任务。
多任务文件调整Multi-Task File-Tuning：同时在多个任务上微调模型，使其更加通用。
传输微调Transfer Fine-Tuning：它利用模型在不同的、一般相关的任务或领域中获得的知识来提高其在新任务上的性能。
专家混合 (MoE) 微调：它的目的是通过根据输入数据有选择地激活模型的不同部分来更有效地微调LLM。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Fine-Tune Types.

参数高效微调 (PEFT)

PEFT 背后的主要思想是使预训练模型适应新任务，同时仅更新一小部分参数。

在深入研究 LoRA 之前，让我们先分解一下 PEFT 的主要类型。

ADAPTERS是插入预训练模型各层之间的小型神经网络。在微调期间，仅训练适配器层，而模型的其余部分保持冻结。这使得模型能够在对原始模型进行最小更改的情况下适应新任务。
PREFIX TUNING为每个输入添加可学习的指令。这就像在处理实际输入之前为模型提供特定于任务的上下文。
PROMPT TUNING涉及学习一组特殊提示，当输入到模型中时，指导模型执行特定任务。这些提示经过训练，而模型的其余部分保持不变。本质上，您正在训练模型如何以特定于任务的方式解释某些输入。
BitFit 专注于仅微调模型层中的偏差项，而不是完整的权重矩阵。
LoRA 将低秩矩阵添加到模型现有的权重矩阵中。

LoRA

在LLMs中，权重矩阵是高维且密集的，这意味着它们有许多参数。

LoRA 背后的想法是，其中许多参数并不是对每项任务都至关重大。 LoRA 没有调整所有矩阵，而是引入了低秩矩阵，这些矩阵更小、更简单，但依旧能够捕获新任务所需的基本变化。

Rank & Low-Rank

秩指矩阵中线性独立的行或列的数量。本质上，它表明矩阵所包含的复杂性或独特信息量。

矩阵的秩始终为零（零矩阵除外，其秩为零）。

我们通过将矩阵转换为行梯形形式来找到独立行的数量。独立行或列的数量是通过对非零行或列进行计数来确定的。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

The rank of A is 3。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

The rank of A is 2。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Source

如果矩阵的秩等于其最小维度（行或列），则将其视为满级，这意味着它捕获了其大小可能的最大信息量。

一个 低等级另一方面，矩阵的秩小于其维度，这意味着它捕获的信息较少，并且可以视为满秩矩阵的简化或压缩版本。

LoRA 不是直接更新模型的大型满秩权重矩阵，而是引入了低秩矩阵。这些矩阵需要更少的参数来表明，并且训练的计算成本更便宜。

W has a rank of 2，这意味着它可以表明为两个较小矩阵的乘积。

import torch
import numpy as np
_ = torch.manual_seed(42)

d, k = 5, 5

W_rank = 2
W = torch.randn(d,W_rank) @ torch.randn(W_rank,k) # 5x2 @ 2x5
print(W)

"""
tensor([[ 0.9042, -1.4169,  1.4654, -1.2297,  0.6689],
        [ 1.5066,  0.2465, -0.2778, -1.1182,  1.0168],
        [-1.6714,  1.0142, -1.0348,  1.7002, -1.1763],
        [ 0.6054, -1.1338,  1.1743, -0.8894,  0.4548],
        [ 0.2302,  0.3984, -0.4187, -0.0421,  0.1419]])
"""

W_rank = np.linalg.matrix_rank(W)
print(f'Rank of W: {W_rank}')

"""
Rank of W: 2
"""

单值分解 (SVD)：矩阵W可以分解为三个矩阵 U, S，和V.

U, S, V = torch.svd(W)

U

"""
tensor([[-0.5466,  0.3814,  0.5431,  0.4039,  0.3125],
        [-0.3295, -0.7763,  0.0493,  0.4477, -0.2932],
        [ 0.6526,  0.1640,  0.0137,  0.7387, -0.0372],
        [-0.4073,  0.3556, -0.7711,  0.2863, -0.1764],
        [ 0.0301, -0.3139, -0.3284,  0.0938,  0.8854]])
"""

S

"""
tensor([4.6126e+00, 1.9884e+00, 1.0705e-07, 8.8211e-08, 3.1594e-09])
"""

V

"""
tensor([[-0.5032, -0.4806,  0.1853,  0.4932, -0.4881],
        [ 0.3965, -0.5501,  0.6400, -0.3613,  0.0108],
        [-0.4066,  0.5804,  0.6994, -0.0904,  0.0220],
        [ 0.5444,  0.1884,  0.2553,  0.7665,  0.1245],
        [-0.3576, -0.3067,  0.0421,  0.1750,  0.8635]])
"""

U：An orthogonal matrix of shape (d, d) where the columns are the left singular vectors of W.
S：A diagonal matrix (usually represented as a vector of singular values) of shape (min(d, k),), where the values represent the singular values of W.
V：An orthogonal matrix of shape (k, k) where the columns are the right singular vectors of W.

W = U x S x V^T

U_r = U[:, :W_rank]
S_r = torch.diag(S[:W_rank])
V_r = V[:, :W_rank].t()

B = U_r @ S_r
A = V_r
print(f'Shape of B: {B.shape}')
print(f'Shape of A: {A.shape}')

"""
Shape of B: torch.Size([5, 2])
Shape of A: torch.Size([2, 5])
"""

U_r = U[:, :W_rank]:该行选择U的第一个 W_rank 的列。这给出了一个矩阵 U_r 形状的(d, W_rank)，其中包含与最大奇异值相对应的最重大的左奇异向量。
S_r = torch.diag(S[:W_rank]):这创建了一个对角矩阵S_r形状的 (W_rank, W_rank) 使用第一个 W_rank 奇异值来自 S。对角矩阵 S_r 缩放相应的奇异向量。
V_r = V[:, :W_rank].t():这选择了第一个 W_rank 的列V并将它们转置得到一个矩阵 V_r 形状的(W_rank, k)。这些是最重大的右奇异向量。
B = U_r @ S_r:这乘以 U_r 和 S_r 创建矩阵 B 形状的 (d, W_rank)。矩阵B表明左奇异向量与奇异值对角矩阵的乘积。
A = V_r:这一套 A 成为 V_r，其形状为 (W_rank, k)。矩阵A表明转置的右奇异向量。

print("Total parameters of W: ", W.nelement())
print("Total parameters of B and A: ", B.nelement() + A.nelement())

"""
Total parameters of W:  25
Total parameters of B and A:  20
"""

使用 LoRA 跟踪变化

在传统的微调中，直接更新神经网络的整个权重矩阵。这意味着权重矩阵中的每个元素都是根据训练数据进行调整的。

LoRA 不是在微调时直接更新大权重矩阵，而是引入了两个较小的矩阵，一般称为一个和 B.

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Two matrices. Source

这些矩阵被设计为低秩的。例如，如果原始权重矩阵瓦是有尺寸的 n×m，低阶矩阵一个和乙可能有尺寸 n×r 和 r×m 分别是，其中 r 远小于 n 和米.

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Rank 2. Source

矩阵A和B的乘积得出矩阵C=A×B，其大小与原始权重矩阵W一样。

LoRA并不直接更新W ，而是通过调整这个乘积矩阵C来跟踪变化。具体来说，模型的输出受原始权重矩阵与低秩近似之和的影响，即W′=W+C

大模型技术：LoRA 详解，从理论到实践理解 LoRA

由于A和B比W小得多，因此在微调过程中仅调整一小部分参数。这使得该过程在内存使用和计算方面都更加高效。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Rank decomposition. Source

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Low-rank matrix decomposition. Source

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Source

在反向传播过程中，冻结的预训练权重保持不变，损失仅用于更新 LoRA 引入的 B 和 A 矩阵。

可训练参数的数量、矩阵的秩和模型精度是相互关联的。降低微调的等级会减少可训练参数，使过程更加高效，但可能会限制模型精度。平衡秩和可训练参数对于优化资源使用并保持模型精度至关重大。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Rank vs Trainable Parameters. Source

QLoRA

QLoRA （量化低秩自适应）以LoRA概念为基础，通过量化增加了一层效率。LoRA 通过使用低秩矩阵减少了可训练参数的数量，而 QLoRA 则通过量化这些矩阵更进一步。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

LoRA and QLoRA. Source

在 QLoRA 中，精度在量化微调过程结束时通过一种称为反量化的技术恢复。QLoRA 使用 4 位量化。

在微调期间，低秩矩阵以量化格式存储，这意味着数值以较低精度表明（例如，4 位或 8 位）。

当模型用于推理或最终输出生成期间，这些量化矩阵被反量化。

然后将反量化的低秩矩阵与原始模型权重（保持较高的精度）相结合以产生最终输出。尽管在微调期间使用量化矩阵，这使得模型在推理期间保持高精度和准确度。

精度和量化

准确是表明数值的详细程度或准确程度。它涉及在训练和推理过程中如何存储和处理数字（如权重、激活和梯度）。精度一般由用于表明数字的位数决定：

32 位（单精度）：最常用于训练深度学习模型。它允许高精度的大范围值，但需要更多的内存和计算资源。
16 位（半精度）：提供精度和资源使用之间的权衡。它减少了内存和计算需求，同时依旧保持许多任务的足够准确性。
8 位及更低（量化精度）：用于量化时，这会显着减少资源使用，但如果不仔细管理，也可能会降低计算的准确性。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Format of Floating points. Source

大模型技术：LoRA 详解，从理论到实践理解 LoRA

Source

量化是降低模型中使用的数字精度的过程，一般从 32 位浮点到较低位宽的表明形式，例如 16 位、8 位甚至 4 位整数。

大模型技术：LoRA 详解，从理论到实践理解 LoRA

int8 Quantization. Source

模型大小 = 数据类型大小 x 权重数量

给出了运行模型所需内存的粗略估计，也称为推理。

在训练过程中，内存要求更高，由于除了存储权重之外，还需要存储梯度和学习率。

触发器

浮点运算次数 代表 每秒浮点运算数。这是一种衡量计算机或 GPU 执行计算速度的方法，尤其是在处理数字时。

当你使用 精度较低（就像 16 位而不是 32 位），GPU 可以工作得更快，由于数字更小并且更容易处理。切换到较低的精度几乎可以使计算机训练模型的速度提高一倍，由于它可以在一样的时间内执行更多的计算。

Alpha超参数

使用 LoRA 或 QLoRA 微调模型时，权重变化由低秩矩阵表明。这些矩阵并不直接与原始权重相加；相反，它们在添加之前会按一个因子缩放。

Alpha 超参数决定了这种缩放的强度。它充当乘数，调整低秩矩阵对原始权重的影响。较高的 Alpha 会增加适应的影响，而较低的 Alpha 会减少适应的影响。

应用于权重变化的比例因子计算为Alpha / Rank。

代码

PEFT

peft是 Hugging Face 团队开发的一个库。

pip install peft

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

task_type=TaskType.SEQ_2_SEQ_LM：指定此配置用于序列到序列语言模型。
inference_mode=False：表明此配置用于训练，而不仅仅是推理。
r=8：设置 LoRA 中使用的低秩矩阵的秩。秩越低，参数越少，微调越高效。
lora_alpha=32：控制 LoRA 调整的缩放，影响低秩矩阵对原始权重的影响程度。
lora_dropout=0.1：向 LoRA 层添加 dropout，通过在训练期间随机删除一些连接，有助于防止微调期间过度拟合。

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")

model = get_peft_model(model, peft_config)

get_peft_model用 LoRA 配置包装预训练模型，有效应用 LoRA 技术。这意味着模型将在微调期间使用低秩矩阵，从而减少需要训练的参数数量。

model.print_trainable_parameters()

"""
trainable params: 2,359,296 || all params: 1,231,940,608 || trainable%: 0.1915
"""

可训练参数：2,359,296：将使用 LoRA 微调的参数数量。
所有参数：1,231,940,608：模型中的参数总数，包括冻结（未训练）的参数。
trainable%：0.1915：微调参数的比例，约占总数的0.19%。这凸显了 LoRA 的效率——只有一小部分模型参数正在接受训练，这使得过程更快、资源占用更少。

CasualLM

我们将按照此调整临时LLM.

我将使用具有 GPU 支持的 Google Colab。

!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git

import torch
torch.cuda.is_available()

# True

让我们加载模型和标记器。我们将使用“bigscience/bloom-1b7”模型和“bigscience/tokenizer”标记器。

AutoModelForCausalLM加载预先训练的因果语言模型，该模型用于文本生成等任务，其中模型预测序列中的下一个单词。

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    torch_dtype=torch.float16,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")

os.environ["CUDA_VISIBLE_DEVICES"]="0":这行设置环境变量 CUDA_VISIBLE_DEVICES 到"0"，这限制代码仅使用计算机上的第一个 GPU (GPU 0)。当在具有多个 GPU 的系统上工作并且您想要指定使用哪一个时，这超级有用。
torch_dtype=torch.float16: 这表明模型应加载 16 位浮点精度 (FP16)。

在 LoRA 中，我们针对模型内的特定权重矩阵进行低秩分解。在引用的论文中，作者选择分解与询问 (wq）和价值 (wv）Transformer 架构中的组件。

在 BLOOM模型中，用于查询、键和值操作的组件被组合成一个名为 query_key_value。因此，与其单独针对 wq 和 wv，整个 query_key_value 模块用于分解。

这些组件的结构和命名约定在不同型号之间可能有所不同。例如，在某些模型中，例如 Llama，权重矩阵可能有不同的标签或组织在不同的模块下。

print(model)

"""
BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 2048)
    (word_embeddings_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(


          (query_key_value): Linear(in_features=2048, out_features=6144, bias=True)



          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=2048, out_features=8192, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=8192, out_features=2048, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2048, out_features=250880, bias=False)
)
"""

预处理阶段...

我们迭代模型中的所有参数（权重和偏差）。

for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)

param.requires_grad = False:该行通过设置冻结参数 requires_grad 到False，这意味着它们的值在反向传播期间不会更新。当您只想微调模型的某些部分（例如适配器（例如，在 LoRA 或其他微调技术中）），同时保持模型的其余部分不变时，这种情况很常见。

if param.ndim == 1:：此条件检查参数是否为一维张量。这些参数一般包括偏差和归一化层（如 LayerNorm）中的参数。

param.data = param.data.to(torch.float32):将这些小的一维参数转换为 32 位浮点 (FP32) 精度。虽然模型的其余部分可能会使用较低的精度（如 FP16）来提高效率，但对这些参数使用 FP32 可以提高数值稳定性，特别是对于层归一化等操作，其中微小的差异可能会产生重大影响。

model.gradient_checkpointing_enable():此方法可在模型中启用梯度检查点。梯度检查点是一种用于减少训练期间内存使用的技术。该模型不会存储所有中间激活（反向传播所需的），而是在反向传播期间根据需要重新计算它们。

model.enable_input_require_grads():该方法为模型的输入嵌入启用梯度。当您希望梯度传播回输入标记或微调涉及修改嵌入时，这是必要的。这是一种确保某些层的输入跟踪其梯度的方法，当您训练适配器或对模型进行其他小修改时一般需要这种方法。

CastOutputToFloat类的定义是为了确保的输出lm_head（模型的最后一层，负责为语言建模任务生成 logits）被转换为 FP32，无论早期层使用的精度如何。

设置 LoRA 配置...

print_trainable_parameters函数计算并打印模型中相对于参数总数的可训练参数的数量。

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

"""
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
"""

r=8:LoRA 中使用的低秩矩阵的秩。较低的秩减少了 LoRA 添加的参数数量，使适应更加高效。
lora_alpha=16: 应用于 LoRA 层输出的缩放因子，用于控制添加参数对模型预测的影响。
target_modules=["query_key_value"]: 指定模型的哪些部分将使用 LoRA 进行调整。在这里，它的目标是 query_key_value 模块，它是模型中注意力机制的一部分。
lora_dropout=0.05: LoRA 层中应用了 dropout 率，以协助规范适应并防止过度拟合。
bias="none": 指定如何在适应层中处理偏差。在这种情况下，不会添加或修改任何偏差。
task_type="CAUSAL_LM": 指示模型正在微调的任务类型，在本例中是因果语言建模。

我们将使用“squad_v2”数据集。

from datasets import load_dataset

qa_dataset = load_dataset("squad_v2")

create_prompt 函数生成一个提示，将给定的上下文、问题和答案组合到单字符串模板中。然后，可以使用该提示来训练模型，使其学习在给定上下文和问题的情况下生成答案。

def create_prompt(context, question, answer):
  if len(answer["text"]) < 1:
    answer = "Cannot Find Answer"
  else:
    answer = answer["text"][0]
  prompt_template = f"### CONTEXT
{context}

### QUESTION
{question}

### ANSWER
{answer}</s>"
  return prompt_template

mapped_qa_dataset = qa_dataset.map(lambda samples: tokenizer(create_prompt(samples['context'], samples['question'], samples['answers'])))

create_prompt函数应用于每个样本qa_dataset（问答数据集），然后使用标记器根据生成的提示对每个样本进行标记，使数据集为模型训练或微调做好准备。

让我们训练……

import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=mapped_qa_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=100,
        learning_rate=1e-3,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

per_device_train_batch_size=4:设置每个设备（例如每个 GPU）的批量大小。如果您有多个 GPU，则有效批量大小将是该值乘以 GPU 数量。
gradient_accumulation_steps=4:在执行向后传递之前累积多个步骤的梯度。这有效地增加了批量大小，而不需要更多内存，这在有限硬件上使用大型模型进行训练时超级有用。
warmup_steps=100:指定训练期间的热身步骤数。在这些初始步骤中，学习率逐渐从 0 增加到指定值learning_rate。这有助于防止训练早期阶段出现较大的梯度更新，从而导致模型不稳定。
max_steps=100:将训练限制为最多 100 步。这对于快速实验或对较小数据集进行微调超级有用。
learning_rate=1e-3:设置学习率，它控制每个步骤中模型权重的更新量。值为1e-3相对较高，一般用于快速微调或使用 LoRA 等减少可训练参数数量的方法时。
fp16=True:启用混合精度训练，其中使用 16 位浮点数而不是标准 32 位执行计算。这可以减少内存使用并加快训练速度，尤其是在支持 FP16 操作的 GPU 上。
logging_steps=1:在每个步骤后记录训练指标，这对于详细监控很有用，但可能会占用大量资源。对于较长的训练运行，您可能需要增加该值。
output_dir='outputs':指定保存训练输出（例如模型检查点和日志）的目录。

保存并加载模型...

peft_model_id="results"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)

import torch
from peft import PeftModel, PeftConfig


# Load peft config for pre-trained checkpoint etc.
peft_model_id = "results"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,  load_in_8bit=True,  device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

print("Peft model loaded")

让我们尝试一下模型：

from IPython.display import display, Markdown

def make_inference(context, question):
  batch = tokenizer(f"### CONTEXT
{context}

### QUESTION
{question}

### ANSWER
", return_tensors='pt')

  with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=200)

  display(Markdown((tokenizer.decode(output_tokens[0], skip_special_tokens=True))))

context = "The Moon orbits Earth at an average distance of 384,400 km (238,900 mi), or about 30 times Earth's diameter. Its gravitational influence is the main driver of Earth's tides and very slowly lengthens Earth's day. The Moon's orbit around Earth has a sidereal period of 27.3 days. During each synodic period of 29.5 days, the amount of visible surface illuminated by the Sun varies from none up to 100%, resulting in lunar phases that form the basis for the months of a lunar calendar. The Moon is tidally locked to Earth, which means that the length of a full rotation of the Moon on its own axis causes its same side (the near side) to always face Earth, and the somewhat longer lunar day is the same as the synodic period. However, 59% of the total lunar surface can be seen from Earth through cyclical shifts in perspective known as libration."
question = "At what distance does the Moon orbit the Earth?"

make_inference(context, question)

"""
CONTEXT
The Moon orbits Earth at an average distance of 384,400 km (238,900 mi), or about 30 times Earth's diameter. Its gravitational influence is the main driver of Earth's tides and very slowly lengthens Earth's day. The Moon's orbit around Earth has a sidereal period of 27.3 days. During each synodic period of 29.5 days, the amount of visible surface illuminated by the Sun varies from none up to 100%, resulting in lunar phases that form the basis for the months of a lunar calendar. The Moon is tidally locked to Earth, which means that the length of a full rotation of the Moon on its own axis causes its same side (the near side) to always face Earth, and the somewhat longer lunar day is the same as the synodic period. However, 59% of the total lunar surface can be seen from Earth through cyclical shifts in perspective known as libration.

QUESTION
At what distance does the Moon orbit the Earth?

ANSWER
The Moon orbits the Earth at an average distance of 384,400 km (238,900 mi), or about 30 times Earth's diameter. Its gravitational influence is the main driver of Earth's tides and very slowly lengthens Earth's day. The Moon's orbit around Earth has a sidereal period of 27.3 days. During each synodic period of 29.5 days, the amount of visible surface illuminated by the Sun varies from none up to 100%, resulting in lunar phases that form the basis for the months of a lunar calendar. The Moon is tidally locked to Earth, which means that the length of a full rotation of the Moon on its own axis causes its same side (the near side) to always face Earth, and the somewhat longer lunar day is the same as the synodic period. However, 59% of the total lunar surface can be seen from Earth through cyclical shifts in perspective known as libration.

"""

Seq2SeqLM

让我们演示如何应用 LoRA 来微调 FLAN-T5 模型。

我将使用具有 GPU 支持的 Google Colab。

!pip install datasets py7zr rouge-score bitsandbytes accelerate transformers
!pip install "peft==0.2.0"

我们将使用 Samsun dataset。

SAMSum 数据集由三星波兰研发中心的语言学家创建，包含 16000 个带有第三人称摘要的信使式对话，可在非商业许可下用于研究。

from datasets import load_dataset

dataset = load_dataset("samsum", download_mode="force_redownload")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

"""
Train dataset size: 14732
Test dataset size: 819
"""

print(dataset["train"][1])

"""
{'id': '13728867',
 'dialogue': 'Olivia: Who are you voting for in this election? 
Oliver: Liberals as always.
Olivia: Me too!!
Oliver: Great',
 'summary': 'Olivia and Olivier are voting for liberals in this election. '}
"""

让我们得到分词器。我们将使用 Flan-T5-small.。

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/flan-t5-small"

tokenizer = AutoTokenizer.from_pretrained(model_id)

目前，进行一些预处理......

import numpy as np
from datasets import concatenate_datasets

tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]])
                  .map(lambda x: tokenizer(x["dialogue"], truncation=True), 
                          batched=True, remove_columns=["dialogue", "summary"])
input_lenghts = [len(x) for x in tokenized_inputs["input_ids"]]
max_source_length = int(np.percentile(input_lenghts, 85))
print(f"Max source length: {max_source_length}")

# Max source length: 255

print(tokenized_inputs[0])

"""
{'id': '13818513',
 'input_ids': [21542,
  10,
  27,
...
 61,
  1],
 'attention_mask': [1,
  1,
...
1]}
"""

concatenate_datasets用于将多个数据集合并为一个。

第一，将训练和测试数据集连接成一个数据集以一起处理它们。然后，“对话” 连接数据聚焦的字段被标记化。

max_source_length表明 85% 的标记化输入序列的长度。这是为了更好地利用的最大长度。

同样，“summary” 字段被连接起来并且 target_lenghts 被计算。

tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]])
                    .map(lambda x: tokenizer(x["summary"], truncation=True), 
                          batched=True, remove_columns=["dialogue", "summary"])
target_lenghts = [len(x) for x in tokenized_targets["input_ids"]]
max_target_length = int(np.percentile(target_lenghts, 90))
print(f"Max target length: {max_target_length}")

# Max target length: 50

print(tokenized_targets[0])

"""
{'id': '13818513', 'input_ids': [21542, 13635, 5081, 11, 56, 830, 16637, 128, 5721, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
"""

这 preprocess_function 定义为将数据集转换为适合模型训练的格式。

sample：数据聚焦的一批数据，包含对话和摘要。
padding：确定如何处理短于的序列 max_length。默认值为"max_length"，意味着序列将被填充到指定的最大长度。

这 preprocess_function 使用以下方法应用于整个数据集 map 方法，批量处理数据集。这remove_columns参数指定标记化后应删除哪些原始列（对话、摘要和 id）。

def preprocess_function(sample,padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")

"""
Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
"""

T5 模型一般需要特定于任务的前缀（例如 "summarize: "）来指导模型理解任务。
这 inputs 列表是通过添加创建的 "summarize: " 批次中每个对话的前缀。
这 inputs 使用标记化 tokenizer，和 max_length ，和padding指定的。
truncation=True确保序列长于max_source_length被截断。
model_inputs:包含标记化输入的字典，其中每个标记都由一个 ID 表明。
总结（sample["summary"]）分别使用text_target参数，特定于序列到序列任务中标记化目标序列。
labels:包含标记化摘要的字典。
应用填充时，所有出现的填充令牌 ID (tokenizer.pad_token_id) 中的标签替换为 -100。这样做是由于，在训练期间，填充令牌不应该参与损失计算，并且-100在 PyTorch 中一般用作忽略索引。
标记化标签被添加到 model_inputs 词典下的"labels"钥匙。这确保了输入序列及其相应的目标序列均可用于训练。
标记化的训练数据保存在 data/train 目录，标记化的测试数据保存在data/eval目录。

import torch
torch.cuda.is_available()
# True

让我们加载模型。

from transformers import AutoModelForSeq2SeqLM

model_id = "google/flan-t5-small"

model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

load_in_8bit=True:此选项以 8 位精度而不是标准的 16 位或 32 位精度加载模型。
device_map="auto":此选项自动将模型映射到可用设备（例如 CPU、GPU）。如果 GPU 可用，则将模型加载到 GPU 上；否则，它将回退到CPU。

让我们使用 LoRA 来实现peft.

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

lora_config = LoraConfig(
 r=16,
 lora_alpha=32,
 target_modules=["q", "v"],
 lora_dropout=0.05,
 bias="none",
 task_type=TaskType.SEQ_2_SEQ_LM
)

model = prepare_model_for_int8_training(model)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# trainable params: 688128 || all params: 77649280 || trainable%: 0.8862001038515747

r=16:秩
lora_alpha=32:比例因子阿尔法
target_modules=["q", "v"]:指定应使用 LoRA 调整模型的哪些层。在本例中，它的目标是“q”（查询）和“v”（值）矩阵，它们是 Transformer 模型中注意力机制的一部分。
lora_dropout=0.05:正则化丢失 (5%)
bias="none":指定如何处理 LoRA 层中的偏差。"none"意味着没有添加或调整偏差。
task_type=TaskType.SEQ_2_SEQ_LM:指定任务类型，在本例中为序列到序列语言建模。

prepare_model_for_int8_training准备先前加载的 8 位模型（model）进行训练。它可以通过冻结某些层并使其他层可训练来确保模型正确配置以进行微调。

数据整理器负责在训练或评估期间动态地将数据批处理在一起。

from transformers import DataCollatorForSeq2Seq

label_pad_token_id = -100

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

pad_to_multiple_of=8:此选项可确保将序列填充到 8 的倍数的长度。填充到 8 的倍数可能有利于某些硬件（例如 GPU）的性能，由于它与优化计算效率的内存访问模式保持一致。

目前，让我们设置训练循环。

Seq2SeqTrainer是用于训练序列到序列模型的专门训练器类。Seq2SeqTrainingArguments用于定义训练过程的各种超参数和配置。

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir="lora-flan-t5-base"

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
 auto_find_batch_size=True,
    learning_rate=1e-3, 
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False

output_dir=output_dir:这设置用于保存模型检查点和日志的输出目录。
auto_find_batch_size=True:这会自动调整批量大小以适应可用内存。这在处理大型模型或有限 GPU 内存时特别有用，由于它通过动态查找最佳批量大小来防止内存不足错误。
learning_rate=1e-3:学习率设置为 0.001，相对较高。由于 LoRA 减少了可训练参数的数量，因此较高的学习率可以有效地进行微调。
num_train_epochs=5:该模型将训练 5 个 epoch，这意味着训练数据将通过模型 5 次。
logging_dir=f"{output_dir}/logs":日志存储在输出目录内的一个目录中，特别是在logs子目录。
logging_strategy="steps":日志将按照训练步骤的定期间隔（而不是纪元）进行记录。
logging_steps=500:每 500 步将记录一次训练指标。
save_strategy="no":训练期间不会保存模型检查点。这对于不需要保存模型状态的快速实验可能很有用。
report_to="tensorboard":训练指标将报告给 TensorBoard，这是一种可视化工具，可协助跟踪和可视化训练过程。

model.config.use_cache = False:这会在训练期间禁用缓存。在某些模型中，缓存可以通过存储中间结果来加速推理，但在训练过程中，它可能会导致问题或导致不必要的警告。通过将其设置为False，代码会抑制这些警告，确保训练过程顺利进行。不过，一般应该重新启用缓存以进行推理以提高性能。

训练…

trainer.train()

"""
/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py:316: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
 [1320/9210 09:46 < 58:29, 2.25 it/s, Epoch 0.72/5]
Step Training Loss
500 1.928700
1000 1.930100
 [9210/9210 1:07:01, Epoch 5/5]
Step Training Loss
500 1.928700
1000 1.930100
1500 1.930600
2000 1.896400
2500 1.875400
3000 1.887600
3500 1.862400
4000 1.841600
4500 1.835700
5000 1.846200
5500 1.832900
6000 1.818700
6500 1.812300
7000 1.796200
7500 1.771600
8000 1.781500
8500 1.769900
9000 1.779300
TrainOutput(global_step=9210, training_loss=1.8422986688106509, metrics={'train_runtime': 4023.311, 'train_samples_per_second': 18.308, 'train_steps_per_second': 2.289, 'total_flos': 6924203301273600.0, 'train_loss': 1.8422986688106509, 'epoch': 5.0})
"""

# Save our LoRA model & tokenizer results
peft_model_id="results"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
# if you want to save the base model to call
# trainer.model.base_model.save_pretrained(peft_model_id)

TrainOutput:这是训练过程的总结。

global_step=9210:已完成的训练步骤总数。
training_loss=1.8422986688106509:所有步骤的平均训练损失。该值可以总体了解模型在训练过程中的表现情况。
metrics:附加指标字典：
train_runtime=4023.311:训练所需的总时间（以秒为单位）。
train_samples_per_second=18.308:每秒处理的训练样本数。
train_steps_per_second=2.289:每秒处理的训练步骤数。
total_flos=6924203301273600.0:训练期间执行的浮点运算总数 (FLOP)。这是所需计算工作量的衡量标准。

以下是如何加载刚刚保存的模型。

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc.
peft_model_id = "results"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,  load_in_8bit=True,  device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

model.eval()将模型设置为评估模式，这是执行推理时必需的。在评估模式下，某些层（例如 dropout）被禁用，以确保推理期间结果一致。

我们可以从“samsum”数据聚焦生成对话样本的文本摘要。

from datasets import load_dataset
from random import randrange

dataset = load_dataset("samsum")
sample = dataset['test'][randrange(len(dataset["test"]))]

input_ids = tokenizer(sample["dialogue"], return_tensors="pt", truncation=True).input_ids.cuda()

outputs = model.generate(input_ids=input_ids, max_new_tokens=40, do_sample=True, top_p=0.9)
print(f"input sentence: {sample['dialogue']}
{'---'* 20}")

print(f"summary:
{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")

"""
input sentence: Lincoln: Heeyyy ;* whats up
Fatima: I talked to Jenson, he’s not too happy ;p
Lincoln: the place sucks??
Fatima: No, the place is ok, I think, we can go there, it’s about Alene
Lincoln: typical, dont worry about it
Fatima: He thinks she may have a depression :[
Lincoln: nothin new, everyone has it, she needs a doctor then
Fatima: But she won’t go ;/
Lincoln: so she’s destroying her life fuck it its not your problem
Fatima: It is, they’re both my friends!
Lincoln: you better think what to do if they break up
Fatima: Ehh yes Ill have a problem ;//
Lincoln: both blaming each other and talking with you about it, perfect
Fatima: Alene is just troubled… She’d been through a lot…
Lincoln: everyone has their problems, the question is are ya doin sth about them
Fatima: She has problems facing it, don’t be surprised :[
Lincoln: then it is her problem
Fatima: You are so cruel at times… o.O
Lincoln: maybe, for me its just a common sense
Fatima: Why can’t everyone be just happy???
Lincoln: youll not understand, you had good childhood, nice parents, you have no idea
Fatima: Probably, true… Well I can be just grateful o.o
Lincoln: do that and stop worrying about others, youre way to bautful for that <3
Fatima: :*:*:*
------------------------------------------------------------
summary:
Fatima spoke to Jenson who might have a depression and does not go because Alene is having problems with her. Lincoln advises them to stop worrying about others.
"""

最后，我们来评价一下。

ROUGE 是摘要等任务的常见评估指标，其目标是将生成的文本与参考（真实情况）文本进行比较。

evaluate库提供轻松访问 NLP 中常用的评估指标，例如 ROUGE、BLEU 等。

evaluate_peft_model从数据聚焦获取单个样本并使用模型生成摘要。然后，它对生成的摘要（预测）和参考摘要（标签）进行解码，并将其返回以进行评估。

import evaluate
import numpy as np
from datasets import load_from_disk
from tqdm import tqdm

metric = evaluate.load("rouge")

def evaluate_peft_model(sample,max_target_length=50):
    # generate summary
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample['labels'] != -100, sample['labels'], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    return prediction, labels

test_dataset = load_from_disk("data/eval/").with_format("torch")

predictions, references = [] , []
for sample in tqdm(test_dataset):
    p,l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)

rogue = metric.compute(predictions=predictions, references=references, use_stemmer=True)

print(f"Rogue1: {rogue['rouge1']* 100:2f}%")
print(f"rouge2: {rogue['rouge2']* 100:2f}%")
print(f"rougeL: {rogue['rougeL']* 100:2f}%")
print(f"rougeLsum: {rogue['rougeLsum']* 100:2f}%")

"""
Rogue1: 41.264447%
rouge2: 16.052127%
rougeL: 32.351498%
rougeLsum: 32.337353%
"""

rouge['rouge1']：测量预测和参考之间的一元组（单个单词）的重叠。
rouge['rouge2']：测量二元组（单词对）的重叠。
rouge['rougeL']和 rouge['rougeLsum']：测量最长公共子序列 (LCS) 重叠，它捕获以一样顺序出目前预测和参考中的最长单词序列。rougeLsum是该指标的一个变体。
胭脂-1 (41.26%)：表明生成的摘要和参考文献之间的一元组重叠率为 41.26%。
胭脂-2 (16.05%)：表明二元组中有 16.05% 的重叠。
胭脂-L (32.35%)：表明最长公共子序列重叠率为 32.35%。
ROUGE-Lsum (32.33%)：专注于摘要任务的类似指标。

参考：

https://arxiv.org/abs/2106.09685

https://www.superannotate.com/blog/llm-fine-tuning

https://www.turing.com/resources/finetuning-large-language-models#primary-fine-tuning-approaches

https://twosigmaventures.com/blog/article/the-promise-and-perils-of-large-language-models/

https://ai.plainenglish.io/lora-explained-enhancing-ai-models-with-low-rank-adaptation-56d0bfc42deb

全部评论(0)

上一篇：Python 入门系列——22. dict 操作详解
下一篇：C语言宏定义的高级玩法，8个Linux内核常见代码案例教你怎么玩？

最新发布的资讯信息
【系统环境|】在Android中将Gradle Groovy DSL迁移到 Gradle Kotlin DSL(2025-11-14 22:49)
【系统环境|】Kotlin DSL: 在Gradle构建脚本中替代Groovy的优势(2025-11-14 22:49)
【系统环境|】在 Android 中掌握 Kotlin DSL(2025-11-14 22:48)
【系统环境|】android gradle groovy DSL vs kotlin DSL(2025-11-14 22:48)
【系统环境|】在Kotlin中实现DSL领域特定语言实例解析(2025-11-14 22:47)
【系统环境|】Kotlin 的 DSL 实践(2025-11-14 22:47)
【系统环境|】Kotlin DSL 实战：像 Compose 那样写代码(2025-11-14 22:46)
【系统环境|】当 Adapter 遇上 Kotlin DSL，无比简单的调用方式(2025-11-14 22:46)
【系统环境|】Kotlin语言特性: 实现扩展函数与DSL(2025-11-14 22:45)
【系统环境|】kotlin Gradle DSL实战——重构Gradle脚本(2025-11-14 22:45)

真快激活码

店铺

推荐商品