Meta の言語モデルと日本語で対話できる chatBOT プログラム（chatBOT）（FlexGen, DeepL, Python を使用）（Windows 上）

【サイト内の関連ページ】

FlexGen のインストールと動作確認（大規模言語モデル，チャットボット）（Python，PyTorch を使用）（Windows 上）: 別ページ »で説明している．
DeepL API の認証キーの取得と操舵確認: 別ページ »で説明している．
DeepL API を用いて日本語を英語に翻訳，音声合成（音声合成，人工知能による翻訳）（win32api, DeepL, Python を使用）（Windows 上）別ページ »で説明している．
対話システム，chatBOT: PDFファイル, パワーポイントファイル

【関連する外部ページ】

FlexGen の GitHub のページ: https://github.com/FMInference/FlexGen
DeepL Pro のページ: https://www.deepl.com/pro#developer

前準備

Git のインストール（Windows 上）

Gitは，バージョン管理システム．ソースコードの管理や複数人での共同に役立つ．

Python のインストール（Windows 上）

Build Tools for Visual Studio 2022，NVIDIA ドライバ，NVIDIA CUDA ツールキット 11.8，NVIDIA cuDNN 8.6 のインストール（Windows 上）

PyTorch のインストール（Windows 上）

Python の deepl のインストール（Windows 上）

python -m pip install -U deepl

sudo pip3 install -U deepl

DeepL API の認証キー

DeepL API の認証キーの取得はオンラインで可能である．クレジットカード番号の登録などが必要になる．

Meta の言語モデルと日本語で対話できる chatBOT プログラム（FlexGen, DeepL, Python を使用）（Windows 上）

次のプログラムは，FlexGen に同封のプログラムを，DeepL 翻訳も行うように書き換えたもの．

使用するときは，「auth_key = "＜DeepL API の AUTH KEY＞"」には， DeepL API の認証キーを書くこと．

プログラム実行は，次のプログラムを c.pyのようなファイル名で保存し，「python c.py --model facebook/opt-6.7b」のようなコマンドで行う．「opt-6.7b」のところは，使用するOPT言語モデル名を指定．

"""Run a chatBOT with FlexGen and OPT models."""
# usage: python c.py --model facebook/opt-6.7b
import argparse
import sys
import deepl
import win32com.client
import flexgen
from flexgen.flex_opt import (Policy, OptLM, ExecutionEnv, CompressionConfig, str2bool)
from transformers import AutoTokenizer

auth_key = "＜DeepL API の AUTH KEY＞"
translator = deepl.Translator(auth_key)
speech = win32com.client.Dispatch("Sapi.SpVoice")

from transformers import AutoTokenizer
from flexgen.flex_opt import (Policy, OptLM, ExecutionEnv, CompressionConfig,
        str2bool)


def main(args):
    # Initialize environment
    env = ExecutionEnv.create(args.offload_dir)

    # Offloading policy
    policy = Policy(1, 1,
                    args.percent[0], args.percent[1],
                    args.percent[2], args.percent[3],
                    args.percent[4], args.percent[5],
                    overlap=True, sep_layer=True, pin_weight=args.pin_weight,
                    cpu_cache_compute=False, attn_sparsity=1.0,
                    compress_weight=args.compress_weight,
                    comp_weight_config=CompressionConfig(
                        num_bits=4, group_size=64,
                        group_dim=0, symmetric=False),
                    compress_cache=args.compress_cache,
                    comp_cache_config=CompressionConfig(
                        num_bits=4, group_size=64,
                        group_dim=2, symmetric=False))

    # Model
    print("Initialize...")
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", padding_side="left")
    tokenizer.add_bos_token = False
    stop = tokenizer("\n").input_ids[0]

    model = OptLM(args.model, env, args.path, policy)

    context = (
        "A chat between a curious human and a knowledgeable artificial intelligence assistant.\n"
        "Human: Hello! What can you do?\n"
        "Assistant: As an AI assistant, I can answer questions and chat with you.\n"
    )

    # Chat
    print(context, end="")
    while True:
        inp = input("Human: ")
        if not inp:
            print("exit...")
            break

        speech.Speak(inp)
        result = translator.translate_text(inp, target_lang="EN-US")
        print(result)

        context += "Human: " + result.text + "\n"
        inputs = tokenizer([context])
        output_ids = model.generate(
            inputs.input_ids,
            do_sample=True,
            temperature=0.7,
            max_new_tokens=96,
            stop=stop)
        outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
        try:
            index = outputs.index("\n", len(context))
        except ValueError:
            outputs += "\n"
            index = outputs.index("\n", len(context))
        
        outputs = outputs[:index + 1]
        print(outputs[len(context):], end="")
        translated = translator.translate_text(outputs[len(context):], target_lang="JA")
        print(translated)
        speech.Speak(translated)
        context = outputs

    # TODO: optimize the performance by reusing context cache and reducing redundant computation.

    # Shutdown
    env.close_copy_threads()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default="facebook/opt-6.7b",
        help="The model name.")
    parser.add_argument("--path", type=str, default="~/opt_weights",
        help="The path to the model weights. If there are no cached weights, "
             "FlexGen will automatically download them from HuggingFace.")
    parser.add_argument("--offload-dir", type=str, default="~/flexgen_offload_dir",
        help="The directory to offload tensors. ")
    parser.add_argument("--percent", nargs="+", type=int,
        default=[100, 0, 100, 0, 100, 0],
        help="Six numbers. They are "
         "the percentage of weight on GPU, "
         "the percentage of weight on CPU, "
         "the percentage of attention cache on GPU, "
         "the percentage of attention cache on CPU, "
         "the percentage of activations on GPU, "
         "the percentage of activations on CPU")
    parser.add_argument("--pin-weight", type=str2bool, nargs="?",
        const=True, default=True)
    parser.add_argument("--compress-weight", action="store_true",
        help="Whether to compress weight.")
    parser.add_argument("--compress-cache", action="store_true",
        help="Whether to compress cache.")
    args = parser.parse_args()

    assert len(args.percent) == 6

    main(args)

下の実行結果では，利用者からの質問により，コンピュータ，チャットボットについて説明している