Local inference, running on a single machine: deploying a "local" ChatGPT on a Mac M1 system with the C++ version of the LLaMA large language model


In many large neural networks, every parameter is stored as a 32-bit or 64-bit floating-point number, meaning each parameter takes 4 or 8 bytes of storage. For a network with 7 billion parameters, that works out to roughly 28 GB or 56 GB respectively.
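As a quick sanity check of that estimate, here is a small back-of-the-envelope calculation in Python (the 16-bit row is included because the conversion step later in this post uses float16):

# Rough storage footprint of a 7B-parameter model at different float widths.
params = 7_000_000_000  # 7 billion parameters

for bits in (32, 64, 16):
    gb = params * (bits / 8) / 1e9  # bytes -> decimal gigabytes
    print(f"{bits}-bit floats: ~{gb:.0f} GB")

# 32-bit floats: ~28 GB
# 64-bit floats: ~56 GB
# 16-bit floats: ~14 GB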

A model of that size is already more than enough to make a single machine struggle, so for this test we use the smallest LLaMA model, the 7B variant.

LLaMA project installation and model setup

llama.cpp was optimized for Apple's M-series chips from the start, which is great news for Apple users. First, clone the C++ version of the LLaMA project:

git clone https://github.com/ggerganov/llama.cpp

Then enter the project directory:

cd llama.cpp

Inside the project, create a dedicated models folder:

mkdir models

Next, download the LLaMA 7B model files from the Hugging Face website: https://huggingface.co/nyanko7/LLaMA-7B/tree/main
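If you prefer to script the download instead of clicking through the web page, a minimal sketch using the huggingface_hub Python package could look like the following. The repo id comes from the URL above; the local directory name and the manual arranging of files afterwards are my own assumptions:

from huggingface_hub import snapshot_download

# Download every file in the repository into a local scratch directory.
local_path = snapshot_download(
    repo_id="nyanko7/LLaMA-7B",      # the repository linked above
    local_dir="llama-7b-download",   # hypothetical scratch directory
)
print("Files downloaded to:", local_path)
# The files then still need to be arranged under llama.cpp/models as described below.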

Then, inside the models directory, create a 7B subdirectory for the model:

mkdir 7B

Place tokenizer.model and tokenizer_checklist.chk next to the 7B directory, i.e. directly under models:

➜  models git:(master) ✗ ls
7B                      tokenizer.model         tokenizer_checklist.chk

Then place checklist.chk, consolidated.00.pth and params.json into the 7B directory:

➜  7B git:(master) ✗ ls
checklist.chk       consolidated.00.pth  params.json

At this point, the model files are all in place.
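For reference, the resulting layout of the models directory (as shown by the two listings above) is:

models
├── 7B
│   ├── checklist.chk
│   ├── consolidated.00.pth
│   └── params.json
├── tokenizer.model
└── tokenizer_checklist.chk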

LLaMA model conversion

The conversion is done with a Python script:

python3 convert-pth-to-ggml.py models/7B/ 1

The first argument is the directory containing the model; the second is the floating-point type to use for the conversion. Converting to float32 produces an output file roughly twice as large, while passing 1 selects the default, float16. Here we keep the default type.
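For reference, the two variants would be invoked like this (assuming the script follows the usual convention that 0 selects float32, while 1, as stated above, selects float16):

python3 convert-pth-to-ggml.py models/7B/ 1   # float16 (default, smaller output file)
python3 convert-pth-to-ggml.py models/7B/ 0   # float32 (output roughly twice as large)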

The program output:

➜  llama.cpp git:(master) ✗ python convert-pth-to-ggml.py models/7B/ 1
{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': -1}
n_parts = 1

Processing part 0

Processing variable: tok_embeddings.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: norm.weight with shape: torch.Size([4096]) and type: torch.float16
  Converting to float32
Processing variable: output.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wq.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wk.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wv.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wo.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w1.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w2.weight with shape: torch.Size([4096, 11008]) and type: torch.float16
Processing variable: layers.0.feed_forward.w3.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.attention_norm.weight with shape: torch.Size([4096]) and type: torch.float16
  Converting to float32
Processing variable: layers.0.ffn_norm.weight with shape: torch.Size([4096]) and type: torch.float16
  Converting to float32
Processing variable: layers.1.attention.wq.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.1.attention.wk.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.1.attention.wv.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.1.attention.wo.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.1.feed_forward.w1.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.1.feed_forward.w2.weight with shape: torch.Size([4096, 11008]) and type: torch.float16
Processing variable: layers.1.feed_forward.w3.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.1.attention_norm.weight with shape: torch.Size([4096]) and type: torch.float16
  Converting to float32
Processing variable: layers.1.ffn_norm.weight with shape: torch.Size([4096]) and type: torch.float16
  Converting to float32
Processing variable: layers.2.attention.wq.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.2.attention.wk.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.2.attention.wv.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.2.attention.wo.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.2.feed_forward.w1.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.2.feed_forward.w2.weight with shape: torch.Size([4096, 11008]) and type: torch.float16
Processing variable: layers.2.feed_forward.w3.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.2.attention_norm.weight with shape: torch.Size([4096]) and type: torch.float16
  Converting to float32
Processing variable: layers.2.ffn_norm.weight with shape: torch.Size([4096]) and type: torch.float16
  Converting to float32
Processing variable: layers.3.attention.wq.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.3.attention.wk.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.3.attention.wv.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.3.attention.wo.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
... (output truncated; the remaining layers follow the same pattern)
