Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed

Sad*_*afi 8 pytorch huggingface-transformers large-language-model llama

I am trying to run Llama 2.0 on a server machine. It warns me that it will be slow because of some mistake I am not aware of. It does work, but I don't know how to optimize it.

Here is the working code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch


class LlamaChatBot:
    def __init__(self, model_name="daryl149/llama-2-7b-chat-hf"):
        torch.cuda.empty_cache()
        self.isGPU = torch.cuda.is_available()
        self.device = torch.device("cuda" if self.isGPU else "cpu")
        if self.isGPU:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            # Load the model in 4-bit and let accelerate place it on the GPU
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map='auto', load_in_4bit=True
            )
        else:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForCausalLM.from_pretrained(model_name).to(self.device)

    def generate_response(self, prompt):
        # isGPU is a bool, so it must not be called
        if self.isGPU:
            input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')
        else:
            input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        generated_ids = self.model.generate(input_ids, max_length=1024)
        generated_text = self.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
        print(generated_text)
        return generated_text

Warning:

warnings.warn('Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed.')

Hardware:

Dell Precision T7920 tower server/workstation
Dual Intel Xeon Gold processors, 18 cores @ 2.3 GHz each (36 cores / 72 virtual CPUs)
512 GB DDR4 RAM (upgradable up to 3 TB)
512 GB SSD for booting
7 TB SATA HDD for storage
24 GB GDDR6X RTX 3090 graphics card

Vic*_*mez 10

You can find the solution in the following notebook; use something like this:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Then pass it in when you load the model with the Transformers from_pretrained() method:

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
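Put together with the question's setup, the GPU path would look roughly like this. This is a minimal sketch: torch.bfloat16 is well supported on the RTX 3090, but you could swap in torch.float16 on older cards that lack bfloat16 support.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "daryl149/llama-2-7b-chat-hf"

# 4-bit NF4 quantization with double quantization; setting the compute dtype
# explicitly is what removes the float32-default warning and speeds up generation.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)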


小智 2

You might also set bnb_4bit_compute_dtype=torch.float16 in the from_pretrained() call when loading the model.
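A minimal sketch of that variant, assuming a transformers version where the 4-bit options are passed through BitsAndBytesConfig (recent versions may not accept the keyword directly on from_pretrained()):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same fix as above, but with float16 as the compute dtype to match the
# input type named in the warning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "daryl149/llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)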

