GPT Nursing Bot - Making Nurses' Work Easier

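"Help me write a love letter; we were college classmates, and I have had a secret crush for ten years."

(Nursing domain knowledge is the main focus of the company's training for nurses.)

...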

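After seeing these questions, I started trying to train an in-house nursing bot through fine-tuning, hoping it could make the nurses' work a little easier. After many failed attempts, I simply set it aside for a while.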

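When I first read the introduction to fine-tuning, I thought: what if I could build a customized model through fine-tuning, load in the company's maternal and infant care knowledge, let it evolve through future Q&A, and turn it into an in-house expert? So that was the path I explored at first. After all, the introduction says training can be done with a small number of samples, and a task like classification may need only about 200. (In practice, a Q&A model probably needs at least a few thousand samples before it shows any effect.)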

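Of course, the documentation also contains guides and guidelines on fine-tuning. For one thing, the docs are all in English and I did not understand them deeply; for another, I was fearless in my ignorance and would not give up without trying. According to the original documentation, fine-tuning can roughly be used for scenarios such as classification (true/false judgment, sentiment (optimistic, pessimistic), email categorization) as well as expansion and summarization. The documentation also mentions a "Customer support chatbot" case study, which may be one reason people attempt this. Its demo recommends using embeddings for this, which is also the main focus of this article. But that comes later.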

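The fine-tune approach did not produce good results in the end, perhaps because there were too few samples or their quality was poor, or perhaps because I missed something along the way. I would still like to discuss it here, since fine-tuning remains very appealing. The implementation mostly follows the fine-tuned_qa demo in openai-cookbook. The rough workflow is as follows: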

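1. Collect the text data and split it into reasonable sections according to the token limit. (In my case I found an electronic copy of our internal maternal and infant care training material.)
2. Use the text-davinci-003 model to automatically generate several questions for each section, then automatically generate answers from the section and its questions.
3. Assemble all generated questions and answers into the dataset that fine-tuning requires.
4. Create the new model and use it.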

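1. Text segmentation - The material I received was a Word document with headings, so I split it by heading and then split anything over 2048 once more. The code is below (learned on the fly, so it is rather rough).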

    import docx
    import pandas as pd


    def getText(fileName):
        """Split a Word document into sections, one per 'Heading 1' title."""
        doc = docx.Document(fileName)
        TextList = []

        data = {"title": "", "content": ""}
        for paragraph in doc.paragraphs:
            if paragraph.style.name == 'Heading 1':
                print("title %s " % paragraph.text)
                # A new heading closes the previous section, if it has any content.
                if len(data['content']) > 0:
                    datax = {}
                    datax['title'] = data['title']
                    datax['content'] = data['content']

                    TextList.append(datax)
                data['title'] = paragraph.text
                data['content'] = ''
            else:
                data['content'] += paragraph.text + "\n"
        # Append the last section, which has no following heading to close it.
        TextList.append(data)
        return TextList

    ## Convert the doc to csv
    if __name__ == '__main__':
        fileName = '/Users/jijunjian/openai/test2.docx'

        articList = getText(fileName)
        count = 0
        for article in articList:
            # Print the long sections so they can be inspected.
            if len(article['content']) > 800:
                print("%s,%s,\n%s" % (article['title'], len(article['content']), article['content']))
            count += 1

        header = ['title', 'content']
        print("Total: %s articles" % count)
        pd.DataFrame(articList, columns=header).to_csv('data_oring.csv', index=False, encoding='utf-8')
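The script above splits only by heading; the second pass for sections over the 2048 limit is not shown. A minimal sketch of that pass might look like the following. It assumes the limit is counted in characters (the original may well have meant tokens) and that paragraphs are separated by newlines. The name split_long_section is hypothetical.

```python
def split_long_section(content, max_len=2048):
    """Split text into pieces of at most max_len characters, breaking on newlines."""
    pieces, current = [], ""
    for line in content.split("\n"):
        # A single line longer than max_len is hard-cut into fixed-size slices.
        while len(line) > max_len:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(line[:max_len])
            line = line[max_len:]
        candidate = line if not current else current + "\n" + line
        if len(candidate) > max_len:
            # Adding this line would overflow, so close the current piece.
            pieces.append(current)
            current = line
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Each returned piece could then be written out as its own row with the same title, so that the later question-generation step never receives an oversized context.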

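2. Generating questions and answers - The quality of what is generated this way may not be very high; in real use it is probably better to have domain experts correct the generated questions and answers.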

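The official documentation recommends that, in the generated dataset, both prompt and completion have a fixed ending that is unlikely to appear anywhere else, so here we use "\n\n###\n\n" as the end marker.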

    import pandas as pd
    import openai
    import sys
    sys.path.append("..")
    from tools.OpenaiInit import openai_config
    from transformers import GPT2TokenizerFast


    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    def count_tokens(text: str) -> int:
        """count the number of tokens in a string"""
        return len(tokenizer.encode(text))


    COMPLETION_MODEL = "text-davinci-003"
    FILE_TUNE_FILE = "search_data.jsonl"


    # Load the training data
    def get_training_data():
        file_name = "data_oring.csv"
        df = pd.read_csv(file_name)
        df['context'] = df.title + "\n\n" + df.content
        print(f"{len(df)} rows in the data.")
        return df


    # Generate questions from the content
    def get_questions(context):
        print("Generating questions")
        try:
            response = openai.Completion.create(
                engine=COMPLETION_MODEL,
                # Chinese prompt: "Generate questions based on the text below / Text: ... / Questions: 1."
                prompt=f"基于下面的文本生成问题\n\n文本: {context}\n\n问题集:\n1.",
                temperature=0,
                max_tokens=500,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0,
                stop=["\n\n"]
            )
            return response['choices'][0]['text']
        except Exception as e:
            print("Error generating questions: %s" % e)
            return ""


    # Generate answers from the questions
    def get_answers(row):
        print("Generating answers")
        try:
            response = openai.Completion.create(
                engine=COMPLETION_MODEL,
                # Chinese prompt: "Generate answers based on the text below / Text: ... / Questions: ... / Answers: 1."
                prompt=f"基于下面的文本生成答案\n\n文本: {row.context}\n\n问题集:\n{row.questions}\n\n答案集:\n1.",
                temperature=0,
                max_tokens=500,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0
            )
            return response['choices'][0]['text']
        except Exception as e:
            print(e)
            return ""


    # Build the training data /Users/jijunjian/tuningdata.xlsx
    if __name__ == '__main__':
        openai_config()
        df = get_training_data()
        df['tokens'] = df.context.apply(count_tokens)
        # questions are built from the model's response
        df['questions'] = df.context.apply(get_questions)
        df['questions'] = "1." + df.questions

        df['answers'] = df.apply(get_answers, axis=1)
        df['answers'] = "1." + df.answers
        df = df.dropna().reset_index().drop('index', axis=1)

        print("Saving data")
        df.to_csv('nursing_qa.csv', index=False)

        # Both prompt and completion end with the fixed marker "\n\n###\n\n".
        df['prompt'] = df.context + "\n\n###\n\n"
        df['completion'] = " yes\n\n###\n\n"

        df[['prompt', 'completion']].to_json(FILE_TUNE_FILE, orient='records', lines=True)

        search_file = openai.File.create(
            file=open(FILE_TUNE_FILE),
            purpose='fine-tune'
        )
        qa_search_fileid = search_file['id']
        print("File uploaded, file ID: %s" % qa_search_fileid)

        # file_id = file-Bv5gP2lAmxLL9rRtdaQXixHF
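The fixed-ending convention is easy to get subtly wrong, so a small helper that builds and checks each record may be useful. This is only a sketch: make_record and to_jsonl are hypothetical names, and the leading space in the completion follows the legacy fine-tuning data convention also visible in the " yes" completion above.

```python
import json

SEPARATOR = "\n\n###\n\n"  # the end marker used in this article


def make_record(prompt_text, completion_text):
    """Build one fine-tuning record whose prompt and completion both end with SEPARATOR."""
    marker = SEPARATOR.strip()
    # The marker should not occur anywhere else in the text.
    if marker in prompt_text or marker in completion_text:
        raise ValueError("text already contains the end marker")
    return {
        "prompt": prompt_text + SEPARATOR,
        # Completion starts with a single space and ends with the marker.
        "completion": " " + completion_text.strip() + SEPARATOR,
    }


def to_jsonl(records):
    """Serialize records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Rejecting text that already contains "###" is a deliberately strict choice; replacing the marker inside the text would be an alternative.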

3. Create a new model from the generated dataset.

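The official demo also generates a validation set and a test set, and produces similar texts paired with the same questions and answers to add some adversarial examples. Since the final result was not good, I will not go into those here. The documentation also uses the search module, which has since been taken offline; I imitated it with the prompt-completion data structure, and I do not know whether that had any effect. Creating the model with the openai tools gives some interactivity and makes it easy to see the execution results and the cost, so I used that tool for the demonstration here. After it has run for a while, the model we created can be seen with "openai.Model.list()". At the time there were roughly 1,000 questions and answers, and it cost $0.78. (I tried this on April 13; because the results were poor, it then sat untouched for more than half a month. Time really does fly.)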

    openai api fine_tunes.create -t "discriminator_train.jsonl" -v "discriminator_test.jsonl" --batch_size 16 --compute_classification_metrics --classification_positive_class yes --model ada --suffix 'discriminator'

    Uploaded file from discriminator_train.jsonl: file-5OeHx3bMDqk******
    Uploaded file from discriminator_test.jsonl: file-AnOiDwG1Oqv3Jh******
    Created fine-tune: ft-cQBMLPzqVNml1ZWqkGYQKUdO
    Streaming events until fine-tuning is complete...

    (Ctrl-C will interrupt the stream, but not cancel the fine-tune)
    [2023-04-13 23:17:05] Created fine-tune: ft-cQBMLPz********
    [2023-04-13 23:17:22] Fine-tune costs $0.78
    [2023-04-13 23:17:23] Fine-tune enqueued. Queue number: 3

In the end the results were not ideal. After some more attempts, I noticed this hint in the documentation:

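GPT is good at answering questions that appear in its training data. For less common topics, or for a company's internal corpus, the relevant information can instead be placed in the context passed to GPT, which then answers based on that context. Because each model has a token limit, and because tokens themselves cost money, only the most relevant sections can be passed in.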

Specifically, this notebook demonstrates the following procedure:

Collect: We'll download a few hundred Wikipedia articles about the 2022 Olympics
Chunk: Documents are split into short, mostly self-contained sections to be embedded
Embed: Each section is embedded with the OpenAI API
Store: Embeddings are saved (for large datasets, use a vector database)

Search (once per query)
Using the embeddings, rank the text sections by relevance to the query

Ask (once per query)
Insert the question and the most relevant sections into a message to GPT
Return GPT's answer
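The Search and Ask steps can be sketched without any API call. In the sketch below, toy_embed is a hypothetical stand-in for a real embedding model (it only counts character bigrams), so the ranking logic can run offline; in practice the vectors would come from an embedding API and the prompt would be sent to GPT. All function names and the prompt wording are assumptions for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: character-bigram counts."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sections(question, sections, top_k=2):
    """Search: return the top_k sections most similar to the question."""
    q = toy_embed(question)
    scored = sorted(sections, key=lambda s: cosine(q, toy_embed(s)), reverse=True)
    return scored[:top_k]


def build_prompt(question, sections, top_k=2):
    """Ask: put the most relevant sections and the question into one prompt."""
    context = "\n\n".join(rank_sections(question, sections, top_k))
    return f"Answer using only the text below.\n\n{context}\n\nQuestion: {question}\nAnswer:"
```

Only the top_k sections reach the prompt, which is what keeps each request within the model's token limit.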

    #!/usr/bin/env python
    # coding=utf-8
    from langchain import OpenAI
    from llama_index import SimpleDirectoryReader, LangchainEmbedding, GPTListIndex, GPTSimpleVectorIndex, PromptHelper
    from llama_index import LLMPredictor, ServiceContext
    import gradio as gr
    import sys
    import os
    os.environ["OPENAI_API_KEY"] = 'sk-fHstI********************'
    #MODEL_NAME = "text-davinci-003"
    MODEL_NAME = "ada:ft-primecare:*************"


    def construct_index(directory_path):
        """Build a vector index from the documents in directory_path and save it to disk."""
        max_input_size = 2048
        num_outputs = 512
        max_chunk_overlap = 20
        chunk_size_limit = 600
        prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
        llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name=MODEL_NAME, max_tokens=num_outputs))
        documents = SimpleDirectoryReader(directory_path).load_data()
        #index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
        service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
        index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
        index.save_to_disk('index.json')
        return index


    def chatbot(input_text):
        """Answer one question from the saved index."""
        # Note: construct_index() writes 'index.json', while the index is read from 'data/' here.
        index = GPTSimpleVectorIndex.load_from_disk('data/index.json')
        response = index.query(input_text, response_mode="compact")
        return response.response


    if __name__ == '__main__':
        # UI strings are in Chinese: label "Enter your question", title "Nursing Smart Bot".
        iface = gr.Interface(fn=chatbot, inputs=gr.inputs.Textbox(lines=7, label="输入你的问题"), outputs="text", title="护理智能机器人")
        ## Used to build the index data; put the source files in the docs folder
        ##index = construct_index("docs")
        iface.launch(share=True, server_name='0.0.0.0', server_port=8012)

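I used gradio for the demo, and the result is shown below. It can basically answer from our internal training materials. The one drawback is that a reply usually takes more than ten seconds, but it is at least a big improvement over the earlier fine-tune attempt. With that, the frustration of the past half month is finally eased. (I tried many times to change the "output" label in the figure below to custom text and never succeeded, which is a bit of a pity.)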

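At first I thought I could package it with pyinstaller and run it directly on the server. But after a long time trying pyinstaller -F and -D, it still could not bundle the dependencies; I also tried --hidden-import and a .spec file, and none of them worked. So I gave up on that.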

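After upgrading python, I got ModuleNotFoundError: No module named '_bz2'; at least the error message had changed. Roughly, the bundled original version had the _bz2 module and the newly installed 3.10 did not; the fixes I found all copy this file into the new version.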

mv _bz2.cpython-36m-x86_64-linux-gnu.so /usr/local/python/lib/python3.10/lib-dynload/_bz2.cpython-310-x86_64-linux-gnu.so
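Before moving files around, it may help to confirm whether the interpreter in use can actually load the module. A small check using only the standard library could look like this (the helper name has_module is made up for illustration):

```python
import importlib
import sys


def has_module(name):
    """Return True if the running interpreter can import the named module."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return False


# Report the interpreter version and whether the compiled _bz2 extension loads.
print(sys.version.split()[0], "has _bz2:", has_module("_bz2"))
```

Running it with the same python binary that launches the bot shows whether that interpreter can load _bz2.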

    /usr/local/python/lib/python3.10/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
      warnings.warn(
    /usr/local/python/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
      warnings.warn(value)
    /usr/local/python/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect
      warnings.warn(value)
    Running on local URL:  http://127.0.0.1:8012
    Running on public URL: https://11d5*****.gradio.live

The next day, I found that gradio's Interface.launch has a server_name parameter; setting server_name='0.0.0.0' makes the app reachable by IP. Running ss -tnlp | grep ":8012" also shows the listening address change from "127.0.0.1:8012" to "0.0.0.0:8012".

    LISTEN 0 128 0.0.0.0:8012 0.0.0.0:* users:(("python",pid=2801254,fd=7))

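Based on testing so far, each question costs about 10 US cents, which is still fairly expensive. One direction for optimization may be the chunk size: too small and a chunk cannot hold enough context, too large and the cost goes up. Looking back at the fine-tune approach, the up-front training cost should be higher and the later cost per answer lower; it is just that the training results are poor for now, and other articles report the same problem. As things stand, embeddings look like the more suitable way to put this into practice.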
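The chunk-size trade-off can be made concrete with a rough estimate. The sketch assumes every retrieved chunk is sent at full size in a single request and that prompt and completion tokens are billed at one flat rate. The chunk count, question length and price in the usage lines are placeholders rather than measured or quoted values, and the helper name is hypothetical.

```python
def estimate_query_cost(chunk_size, num_chunks, question_tokens, output_tokens, price_per_1k):
    """Rough cost of one question: all tokens sent plus tokens generated, at a flat rate."""
    total_tokens = chunk_size * num_chunks + question_tokens + output_tokens
    return total_tokens / 1000 * price_per_1k


# 512 output tokens matches num_outputs in the script above; the rest are placeholders.
for size in (300, 600, 1200):
    cost = estimate_query_cost(size, num_chunks=2, question_tokens=50,
                               output_tokens=512, price_per_1k=0.02)
    print(size, round(cost, 4))
```

Under these assumptions the cost grows linearly with chunk size, so halving the chunk size only pays off if the smaller chunks still carry enough context to answer the question.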
