Word Clustering by Semantic Similarity Using Multilingual Sentence Embeddings (Source Code and Execution Results)

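【Overview】 The program extracts words, embeds them, clusters the embeddings with K-means, and displays the clusters. Sentence Transformers/E5 provides context-aware semantic representations (for polysemous words, different meanings can be distinguished by context), and mixed Japanese-English text is handled uniformly. The program also lets you switch to other embedding models for comparison.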
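
Semantic similarity between embeddings is commonly measured with cosine similarity, and this can be illustrated without downloading a model. The sketch below uses made-up 3-dimensional vectors; the vectors and the names cosine_similarity, most_similar, and toy_embeddings are illustrative assumptions, not outputs or APIs of E5 or Sentence Transformers.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings"; real E5 vectors have 768 or 1024 dimensions.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "猫": [0.8, 0.2, 0.1],
    "economy": [0.0, 0.2, 0.9],
}


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


if __name__ == "__main__":
    print(most_similar("cat", toy_embeddings))
```

With a real multilingual model, words with the same meaning in different languages are expected to land near each other in the same way, which is what lets one clustering pass handle mixed Japanese-English text.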

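Python Development Environment and Libraries

This section covers the minimum setup. For machine learning or deep learning work, it is convenient to also install NVIDIA CUDA, Visual Studio, Cursor, and similar tools. These are explained in detail on a separate page, https://www.kkaneko.jp/cc/dev/aiassist.html; refer to it as needed.

Installing Python 3.12

Skip this step if Python is already installed.

Start Command Prompt with administrator privileges (Windows key or Start menu > type cmd > right-click > "Run as administrator") and run the following. Administrator privileges are required because winget's --scope machine option installs software system-wide.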

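REM Install Python system-wide
winget install --scope machine --id Python.Python.3.12 -e --silent
REM Add Python to PATH
set "PYTHON_PATH=C:\Program Files\Python312"
set "PYTHON_SCRIPTS_PATH=C:\Program Files\Python312\Scripts"
REM setx does not change the current session, so PATH is also updated here;
REM otherwise the second setx would overwrite the entry added by the first
echo "%PATH%" | find /i "%PYTHON_PATH%" >nul
if errorlevel 1 setx PATH "%PATH%;%PYTHON_PATH%" /M >nul & set "PATH=%PATH%;%PYTHON_PATH%"
echo "%PATH%" | find /i "%PYTHON_SCRIPTS_PATH%" >nul
if errorlevel 1 setx PATH "%PATH%;%PYTHON_SCRIPTS_PATH%" /M >nul & set "PATH=%PATH%;%PYTHON_SCRIPTS_PATH%"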

【Related external pages】

Official Python page: https://www.python.org/

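Installing the AI Editor Windsurf

An AI editor is recommended for editing and running Python programs. This section explains how to install Windsurf.

Start Command Prompt with administrator privileges (Windows key or Start menu > type cmd > right-click > "Run as administrator") and run the following to install Windsurf system-wide. Administrator privileges are required because winget's --scope machine option installs software system-wide.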

winget install --scope machine Codeium.Windsurf -e --silent

【Related external pages】

Official Windsurf page: https://windsurf.com/

Installing the Required Libraries

Run Command Prompt as administrator (Windows key or Start menu > type cmd > right-click > "Run as administrator") and run the following.


pip install sentence-transformers beautifulsoup4 scikit-learn sudachipy sudachidict-core
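
After installation, a quick check that each package can be located may save a failed run later. The following is a minimal sketch using only the standard library; the mapping of pip package names to import names and the helper name find_missing are written for this illustration.

```python
import importlib.util

# pip package name -> import module name (they differ, e.g. beautifulsoup4 -> bs4).
REQUIRED_MODULES = {
    "sentence-transformers": "sentence_transformers",
    "beautifulsoup4": "bs4",
    "scikit-learn": "sklearn",
    "sudachipy": "sudachipy",
    "sudachidict-core": "sudachidict_core",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be located."""
    return [pkg for pkg, mod in modules.items()
            if importlib.util.find_spec(mod) is None]


if __name__ == "__main__":
    missing = find_missing()
    print("missing packages:", ", ".join(missing) if missing else "none")
```

find_spec only locates the module without importing it, so the check stays fast even for large packages such as sentence-transformers.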

Document Analysis and Word Clustering Program Using Multilingual Sentence Embeddings

Overview

This program extracts words from a web page or a local file, converts each word into a dense vector with a multilingual sentence embedding model, and groups the vectors with K-means. The clusters are printed to the console together with each word's TF-IDF value and saved to result.txt.

Key Techniques

Sentence Transformers / E5
Multilingual dense sentence embedding techniques [1][2]. The default model, multilingual-e5-large, maps text to a 1024-dimensional vector and supports 94 languages. Semantic similarity between embeddings can be computed with cosine similarity.
K-means
An unsupervised clustering algorithm, used here through scikit-learn. The number of clusters is chosen automatically by combining the elbow method with silhouette analysis.

References

[1] Wang, L., et al. (2022). Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533.

[2] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP.

Source Code


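# Document analysis and word clustering program using multilingual sentence embeddings
# Core technique: Sentence Transformers / E5 (multilingual sentence embeddings)
# Sources: Wang, L., et al. (2022). Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533. / Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP.
# Core function: multilingual dense sentence embeddings. The E5 model maps text to 1024-dimensional vectors and supports 94 languages. Strong results on the MTEB benchmark. Semantic similarity can be computed with cosine similarity
# Pretrained model: multilingual-e5-large. A multilingual embedding model from Microsoft. 94 languages, 1024-dimensional dense vectors, 560M parameters. URL: https://huggingface.co/intfloat/multilingual-e5-large
# Design:
#   Related technologies: SudachiPy (Japanese morphological analyzer, split mode C = longest units), scikit-learn K-means (unsupervised clustering), Beautiful Soup 4 (HTML parser), urllib (HTTP requests), tkinter (GUI file dialog), Sentence Transformers/E5 models (sentence embeddings)
#   Input and output: Input: a URL or file, chosen from the menu "0: enter URL, 1: sample page, 2: select file". 0 prompts for a URL; 1 uses https://ja.wikipedia.org/wiki/機械学習; 2 opens a tkinter file dialog. Output: results are printed to the console, saved to result.txt at the end of the run, and the save is reported on the console
#   Processing steps: fetch content from URL/file -> parse HTML with Beautiful Soup -> extract and normalize text -> morphological analysis with SudachiPy mode C -> keep nouns, verbs, adjectives -> compute TF-IDF -> embed words with E5/Sentence Transformers -> K-means clustering -> display clusters
#   Pre- and post-processing: Pre: strip HTML tags, normalize whitespace, remove characters other than Japanese and alphanumerics. Post: sort each cluster by TF-IDF, generate a summary report
#   Optimization: automatic choice of the number of clusters via the elbow method and silhouette analysis. Selectable embedding model (E5-large/base, BGE-M3, multilingual MiniLM). Character encoding detection (UTF-8/Shift_JIS)
#   Configurable parameters: minimum word length (1+ or 2+ characters), embedding model (4 choices)
# Implementation policy: both the minimum word length and the embedding model are chosen by the user at run time
# Technical note: Japanese characters are detected with a regular expression (hiragana, katakana, kanji, CJK Unified Ideographs Extension A)
# Prerequisites:
#   pip install sentence-transformers beautifulsoup4 scikit-learn sudachipy sudachidict-core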

from collections import Counter
import urllib.request
import urllib.parse
import re
import tkinter as tk
from tkinter import filedialog
import os
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sudachipy import Dictionary, SplitMode

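# Global settings
DEFAULT_N_CLUSTERS = 3  # default number of clusters
MAX_CLUSTERS_FOR_ELBOW = 10  # maximum number of clusters tried by the elbow method
JAPANESE_CHAR_PATTERN = r'[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FAF\u3400-\u4DBF]'  # Japanese characters (hiragana, katakana, kanji, CJK Unified Ideographs Extension A)
RANDOM_STATE = 42  # random seed (for reproducibility)
ELBOW_THRESHOLD = 0.02  # relative-change threshold for the elbow method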


def normalize_text(text):
    """
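    Normalize text.

    Args:
        text (str): Input text.

    Returns:
        str: Normalized text (runs of whitespace collapsed to one space; characters that are not word characters, whitespace, or Japanese replaced by a space).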
    """
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FAF\u3400-\u4DBF]', ' ', text)
    return text.strip()


def read_file_content(file_path):
    """
    Read a file; if it is an HTML file, extract its text.

    Args:
        file_path (str): Path to the file.

    Returns:
        str: Normalized text content.
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()

        if file_path.lower().endswith('.html'):
            soup = BeautifulSoup(content, 'html.parser')
            text = soup.get_text()
        else:
            text = content

        return normalize_text(text)
    except Exception as e:
        print(f'ファイルの読み込みに失敗しました: {file_path}')
        print(f'エラー: {e}')
        exit()


def download_and_extract_text(url):
    """
    Download HTML content from a URL and extract its text.

    Args:
        url (str): Target URL.

    Returns:
        str: Extracted and normalized text.
    """
    try:
        # Percent-encode the path (supports URLs containing Japanese);
        # '%' is kept safe so an already-encoded URL is not encoded twice
        parsed_url = urllib.parse.urlparse(url)
        encoded_path = urllib.parse.quote(parsed_url.path, safe='/%')
        encoded_url = urllib.parse.urlunparse((
            parsed_url.scheme,
            parsed_url.netloc,
            encoded_path,
            parsed_url.params,
            parsed_url.query,
            parsed_url.fragment
        ))

        with urllib.request.urlopen(encoded_url) as response:
            html_content = response.read()

        # Encoding detection: try UTF-8 first, then fall back to Shift_JIS
        encoding = 'utf-8'
        try:
            html_content = html_content.decode(encoding)
        except UnicodeDecodeError:
            encoding = 'shift_jis'
            html_content = html_content.decode(encoding, errors='ignore')

        soup = BeautifulSoup(html_content, 'html.parser')
        text = soup.get_text()

        return normalize_text(text)
    except Exception as e:
        print(f'URLからのダウンロードに失敗しました: {url}')
        print(f'エラー: {e}')
        exit()


def extract_words(text):
    """
    Extract words from text and compute their TF-IDF values.

    Args:
        text (str): Input text.

    Returns:
        dict: Mapping from word to TF-IDF value.
    """
    print('SudachiPy Cモード（長単位）で日本語を分析中...')

    print()
    print('単語の文字数設定を選択してください:')
    print('1: 1文字以上の単語をすべて処理（例：法、権、国、日本、天皇、憲法）')
    print('2: 2文字以上の単語のみを処理（例：日本、天皇、憲法、国政、政府）')

    length_choice = input('選択: ')
    min_length = 1 if length_choice == '1' else 2

    # Initialize SudachiPy (mode C: longest-unit segmentation)
    tokenizer_obj = Dictionary().create()
    split_mode = SplitMode.C

    words = []

    # Extract word candidates with a regular expression
    word_cands = re.findall(r'\b\w+\b', text)

    for word in word_cands:
        if len(word) < min_length:
            continue

        # Check whether the token contains Japanese characters
        if re.search(JAPANESE_CHAR_PATTERN, word):
            # Japanese: run morphological analysis with SudachiPy
            morphemes = tokenizer_obj.tokenize(word, split_mode)
            for m in morphemes:
                pos = m.part_of_speech()[0]  # top-level part-of-speech category
                if pos in ['名詞', '動詞', '形容詞'] and len(m.surface()) >= min_length:
                    words.append(m.surface())
        else:
            # Non-Japanese token (e.g. English): add as-is
            words.append(word)

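    # Compute TF-IDF values. The input is a single document, so the values
    # reduce to L2-normalized term frequencies.
    word_counts = Counter(words)
    unique_words = list(word_counts.keys())
    word_text = ' '.join(words)

    # Split on whitespace and keep case; the default tokenizer would lowercase
    # tokens and drop one-character words, leaving their TF-IDF at 0.
    vectorizer = TfidfVectorizer(vocabulary=unique_words, lowercase=False, token_pattern=r'\S+')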
    tfidf_matrix = vectorizer.fit_transform([word_text])
    feature_names = vectorizer.get_feature_names_out()

    word_tfidf = {}
    for i, word in enumerate(feature_names):
        word_tfidf[word] = tfidf_matrix[0, i]

    return word_tfidf


def perform_clustering_with_debug(word_tfidf, n_clusters=DEFAULT_N_CLUSTERS):
    """
    Embed the words and run K-means clustering (with debug output).

    Args:
        word_tfidf (dict): Mapping from word to TF-IDF value.
        n_clusters (int): Number of clusters (default value).

    Returns:
        tuple: (cluster dict, embedding vectors, cluster labels)
    """
    words = list(word_tfidf.keys())
    if len(words) < n_clusters:
        n_clusters = max(1, len(words))

    # Select the embedding model
    print()
    print('埋め込みモデルを選択してください:')
    print('0: multilingual-e5-large (1024次元, 94言語, 560Mパラメータ) [推奨]')
    print('1: multilingual-e5-base (768次元, 94言語, 278Mパラメータ)')
    print('2: bge-m3-retromae (1024次元, 100+言語, 568Mパラメータ, 8192トークン対応)')
    print('3: paraphrase-multilingual-MiniLM-L12-v2 (384次元, 50言語, 118Mパラメータ) [軽量]')

    model_choice = input('選択: ')
    models = {
        '0': 'intfloat/multilingual-e5-large',
        '1': 'intfloat/multilingual-e5-base',
        '2': 'BAAI/bge-m3-retromae',
        '3': 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
    }
    model_name = models.get(model_choice, 'intfloat/multilingual-e5-large')

    try:
        model = SentenceTransformer(model_name)
        embeddings = model.encode(words)

        # With fewer than 3 words, skip the elbow method and silhouette analysis
        if len(words) < 3:
            n_clusters = min(DEFAULT_N_CLUSTERS, len(words))
            print(f'単語数が少ないため、デフォルト値を使用します')
            print(f'最適クラスタ数: {n_clusters}')
        else:
            # Determine the cluster count with the elbow method and silhouette analysis
            max_k = min(MAX_CLUSTERS_FOR_ELBOW, len(words))
            inertias = []
            # Start from k=2 (k=1 is not informative)
            # inertias[0] corresponds to k=2, inertias[1] to k=3
            for k in range(2, max_k + 1):
                kmeans_temp = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init='auto')
                kmeans_temp.fit(embeddings)
                inertias.append(kmeans_temp.inertia_)

            # Detect the elbow point (based on the relative change of the first differences)
            optimal_k = 2  # default
            diffs = []
            change_rates = []
            second_diffs = []

            if len(inertias) > 1:
                # First differences (decrease in inertia)
                # diffs[0] is the drop from k=2 to k=3, diffs[1] from k=3 to k=4
                diffs = [inertias[i] - inertias[i+1] for i in range(len(inertias)-1)]

                # Change rates (relative improvement)
                if len(diffs) > 0:
                    # Change rate of each difference
                    for i in range(len(diffs)):
                        if inertias[i] > 0:
                            change_rate = diffs[i] / inertias[i]
                            change_rates.append(change_rate)

                    # The first point whose change rate falls below the threshold is the elbow
                    for i, rate in enumerate(change_rates):
                        if rate < ELBOW_THRESHOLD:
                            optimal_k = i + 2  # index 0 corresponds to k=2
                            break
                    else:
                        # No point is below the threshold: use the largest second difference
                        if len(diffs) > 1:
                            # Second differences (change in the rate of decrease)
                            # second_diffs[0] corresponds to k=3 (the k=2->3 drop minus the k=3->4 drop)
                            second_diffs = [diffs[i] - diffs[i+1] for i in range(len(diffs)-1)]
                            if second_diffs:
                                # index 0 corresponds to k=3, hence +3
                                optimal_k = second_diffs.index(max(second_diffs)) + 3

                optimal_k = min(optimal_k, len(words))

            # Choose the cluster count by silhouette analysis
            # (silhouette_score accepts at most len(words) - 1 clusters)
            sil_scores = []
            for k in range(2, min(max_k, len(words) - 1) + 1):
                kmeans_temp = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init='auto')
                labels = kmeans_temp.fit_predict(embeddings)
                score = silhouette_score(embeddings, labels)
                sil_scores.append(score)

            sil_optimal = sil_scores.index(max(sil_scores)) + 2

            # Final decision (mean of the elbow and silhouette estimates, rounded down)
            n_clusters = int((optimal_k + sil_optimal) / 2)
            n_clusters = min(n_clusters, len(words))

            # Print debug information
            print()
            print('=== デバッグ情報 ===')
            print(f'- 慣性値: {inertias}')
            print(f'- 一階差分: {diffs}')
            print(f'- 変化率: {change_rates}')
            if second_diffs:
                print(f'- 二階差分: {second_diffs}')
            print(f'- シルエットスコア: {sil_scores}')
            print(f'- エルボー法閾値: {ELBOW_THRESHOLD}')
            print('==================')
            print()

            print(f'エルボー法最適値: {optimal_k}')
            print(f'シルエット分析最適値: {sil_optimal}')
            print(f'最適クラスタ数: {n_clusters}')

        # Run K-means clustering
        kmeans = KMeans(n_clusters=n_clusters, random_state=RANDOM_STATE, n_init='auto')
        cluster_labels = kmeans.fit_predict(embeddings)

        # Group words by cluster label
        clusters = {}
        for word, label in zip(words, cluster_labels):
            if label not in clusters:
                clusters[label] = []
            clusters[label].append((word, word_tfidf[word]))

        # Sort each cluster by TF-IDF value, descending
        for cluster_id in clusters:
            clusters[cluster_id].sort(key=lambda x: x[1], reverse=True)

        return clusters, embeddings, cluster_labels
    except Exception as e:
        print(f'クラスタリング処理に失敗しました: {e}')
        exit()


def display_and_save_results(clusters, total_words):
    """
    Print the clustering results to the console and save them to a file.

    Args:
        clusters (dict): Cluster dict.
        total_words (int): Total number of words.
    """
    print()
    print('=== クラスタリング結果 ===')
    print(f'総単語数: {total_words}')
    print(f'クラスタ数: {len(clusters)}')
    print('※ カッコ内はTF-IDF値')
    print()

    for cluster_id, word_list in clusters.items():
        word_with_counts = ', '.join([f'{word}({tfidf:.2f})' for word, tfidf in word_list])
        print(f'クラスタ {cluster_id} ({len(word_list)}語): {word_with_counts}')
        print('----------------------------------------------')

    print()
    print('=== サマリレポート ===')
    sorted_clusters = sorted(clusters.items(), key=lambda x: len(x[1]), reverse=True)
    for cluster_id, word_list in sorted_clusters:
        top_words = [word for word, tfidf in word_list[:10]]  # show the top 10 words
        word_summary = ', '.join(top_words)
        print(f'クラスタ{cluster_id} ({len(word_list)}語): {word_summary}')
    print()

    # Save the results to a file
    results = []
    results.append('=== 単語クラスタリング結果 ===\n')
    results.append(f'総単語数: {total_words}\n')
    results.append(f'クラスタ数: {len(clusters)}\n')
    results.append('※ カッコ内はTF-IDF値\n\n')

    for cluster_id, word_list in clusters.items():
        word_with_counts = ', '.join([f'{word}({tfidf:.2f})' for word, tfidf in word_list])
        results.append(f'クラスタ {cluster_id} ({len(word_list)}語): {word_with_counts}\n')
        results.append('----------------------------------------------\n')

    results.append('\n=== サマリレポート ===\n')
    sorted_clusters = sorted(clusters.items(), key=lambda x: len(x[1]), reverse=True)
    for cluster_id, word_list in sorted_clusters:
        top_words = [word for word, tfidf in word_list[:10]]
        word_summary = ', '.join(top_words)
        results.append(f'クラスタ{cluster_id} ({len(word_list)}語): {word_summary}\n')

    result_path = os.path.join('.', 'result.txt')
    with open(result_path, 'w', encoding='utf-8') as f:
        f.writelines(results)

    print('result.txtに保存しました')


# Main processing
print('=== URL文書分析・単語クラスタリングシステム ===')
print('HTMLページまたはファイルから単語を抽出し、意味的類似性によりクラスタリングします')
print()
print('【プログラムの概要】')
print('- Sentence Transformers/E5による多言語対応の単語埋め込み')
print('- K-meansクラスタリングによる単語の意味的分類')
print('- 結果はコンソール表示とresult.txtファイルに保存')
print()
print('【ユーザが行う操作】')
print('1. 入力方法の選択（URL手入力/サンプル/ファイル）')
print('2. 単語の最小文字数の選択（1文字以上/2文字以上）')
print('3. 埋め込みモデルの選択（4種類から選択）')
print()

print('0: URL手入力')
print('1: サンプルページ (https://ja.wikipedia.org/wiki/機械学習)')
print('2: ファイル選択')

choice = input('選択: ')

if choice == '0':
    url = input('URLを入力してください: ')
    print(f'URL処理中: {url}')
    text = download_and_extract_text(url)
elif choice == '1':
    url = 'https://ja.wikipedia.org/wiki/機械学習'
    print(f'URL処理中: {url}')
    text = download_and_extract_text(url)
elif choice == '2':
    root = tk.Tk()
    root.withdraw()
    file_path = filedialog.askopenfilename(
        title='ファイルを選択してください',
        filetypes=[('Text files', '*.txt'), ('HTML files', '*.html'), ('All files', '*.*')]
    )
    if not file_path:
        print('ファイルが選択されませんでした')
        exit()
    print(f'ファイル処理中: {file_path}')
    text = read_file_content(file_path)
else:
    print('無効な選択です')
    exit()

word_tfidf = extract_words(text)

if len(word_tfidf) == 0:
    print('有効な単語が見つかりませんでした')
    exit()

print(f'抽出された単語数: {len(word_tfidf)}')
clusters, embeddings, cluster_labels = perform_clustering_with_debug(word_tfidf)

print()
print('=== 結果について ===')
print('【得られる結果】')
print('・クラスタ番号（0, 1, 2...）とそれに所属する単語の一覧')
print('・同じクラスタの単語は意味的に類似している単語群')
print('・異なるクラスタの単語は意味的に異なる単語群')
print()
print('【対象となる単語】')
print('・記事から抽出された単語（重複除去済み）')
print('・単語の重要度（TF-IDF値）を計算し表示')
print('・助詞や短い語は除外されるため、主に名詞・動詞・形容詞が対象')
print()
print('【算出方法】')
print('1. 各単語をSentence Transformers/E5で密なベクトルに変換')
print('2. ベクトル間の距離を計算して類似性を測定')
print('3. K-meansアルゴリズムで距離の近い単語をクラスタ化')
print('4. 同じクラスタ番号の単語は意味的に関連性が高いと判定')
print()

display_and_save_results(clusters, len(word_tfidf))
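
The elbow heuristic in perform_clustering_with_debug can be tried in isolation, without loading an embedding model. The sketch below is a simplified restatement that takes a plain list of inertia values; the function name elbow_k, its parameters, and the sample inertia values are assumptions made for illustration, and the index-to-k mapping follows the comments in the listing (second differences are centered on k=3 for index 0).

```python
def elbow_k(inertias, threshold=0.02, first_k=2):
    """Pick k from K-means inertias, where inertias[i] belongs to k = first_k + i."""
    if len(inertias) < 2:
        return first_k
    # Drop in inertia when going from k to k + 1.
    diffs = [a - b for a, b in zip(inertias, inertias[1:])]
    # Return the first k whose next step improves inertia by less than `threshold`.
    for i, (drop, inertia) in enumerate(zip(diffs, inertias)):
        if inertia > 0 and drop / inertia < threshold:
            return first_k + i
    # Fallback: the largest second difference, centered on k = first_k + 1 + i.
    second = [a - b for a, b in zip(diffs, diffs[1:])]
    if not second:
        return first_k
    return first_k + 1 + second.index(max(second))


if __name__ == "__main__":
    # Made-up inertia values for k = 2, 3, 4, 5.
    print(elbow_k([100.0, 60.0, 58.9, 58.5]))
    print(elbow_k([100.0, 50.0, 40.0, 32.0]))
```

In the first sample the step from k=3 to k=4 gains less than 2%, so the threshold rule selects k=3. In the second sample no step falls below the threshold, and the fallback selects the k with the sharpest bend in the curve.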