/villagepump/ChatGPTでScrapboxTranslator v1を作る

generated at 2/16/2025, 6:36:23 PM
Building ScrapboxTranslator v1 with ChatGPT
Using ChatGPT to write "code that translates using ChatGPT"
Result: ScrapboxTranslator v1

Prompt:
I want to write Python code that takes a JSON file of Japanese documents and outputs an English-translated JSON file.
# JSON file format
```
{
  "pages": [
    {
      "title": "page1title",
      "lines": [
        "page1title",
        "line2",
        "line3"
      ]
    },
    {
      "title": "page2title",
      "lines": [
        "page2title",
        "line2",
        "line3"
      ]
    },
    ...
  ]
}
```
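# What to do
1. Receive the JSON file.
2. Extract only the titles of all pages, translate them into English, and build a dictionary that pairs the Japanese and English titles.
3. Translate the body (lines) of each page. Note: the body contains links to other pages' titles, enclosed in square brackets as [title]. Those titles must be translated exactly, word for word, according to the English translations in the dictionary.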

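# How to translate
Use the ChatGPT API. A sample input and output of the ChatGPT API are shown below. Tell the assistant to translate into English and use its output. If there are conditions or caveats for the translation, tell it those as well.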
```sample call
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)
```
```sample response
{
 'id': 'chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve',
 'object': 'chat.completion',
 'created': 1677649420,
 'model': 'gpt-3.5-turbo',
 'usage': {'prompt_tokens': 56, 'completion_tokens': 31, 'total_tokens': 87},
 'choices': [
   {
    'message': {
      'role': 'assistant',
      'content': 'The 2020 World Series was played in Arlington, Texas at the Globe Life Field, which was the new home stadium for the Texas Rangers.'},
    'finish_reason': 'stop',
    'index': 0
   }
  ]
}
```

import json
import openai

def translate(text, role="user"):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content

def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    for page in data['pages']:
        translated_title = translate(page['title'])
        title_translation_dict[page['title']] = translated_title
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        translated_lines = []
        for line in page['lines']:
            for jp_title, en_title in title_translation_dict.items():
                line = line.replace(f"[{jp_title}]", f"[{en_title}]")
            translated_line = translate(line)
            translated_lines.append(translated_line)
        page['lines'] = translated_lines

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')

Looks like it probably works
So it replaced the links before translating the body. If that works, it is certainly simpler
I'll verify later
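
A minimal sketch of the link substitution step on its own, assuming links always appear as `[title]` and that bracket contents not found in the dictionary (icons, external links) should be left untouched. `replace_links`, `LINK_PATTERN` and the sample dictionary are hypothetical names for illustration, not part of the generated code.

```python
import re

# Matches the contents of a [bracketed] link; nested brackets are not handled.
LINK_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def replace_links(line: str, title_dict: dict[str, str]) -> str:
    """Swap [Japanese title] for [English title] using an exact dictionary lookup."""

    def swap(match: re.Match) -> str:
        title = match.group(1)
        # Leave the brackets alone when the content is not a known page title.
        return f"[{title_dict.get(title, title)}]"

    return LINK_PATTERN.sub(swap, line)


# Hypothetical example data.
titles = {"翻訳": "Translation", "機械翻訳": "Machine Translation"}
print(replace_links("[機械翻訳]と[翻訳]、[blu3mo.icon]", titles))
# prints: [Machine Translation]と[Translation]、[blu3mo.icon]
```

Compared with one `str.replace` per title, this makes a single pass over each line and only touches text that is actually inside brackets.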

>Thanks. As an improvement, when the documents are updated, I want to re-translate only the updated parts. What kind of implementation would be good?

import json
import openai

# Relies on translate() from the first snippet above.

def has_changes(new_line, old_line):
    return new_line != old_line

def translate_updated_json(input_file, old_input_file, translated_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        new_data = json.load(f)

    with open(old_input_file, 'r', encoding='utf-8') as f:
        old_data = json.load(f)

    with open(translated_file, 'r', encoding='utf-8') as f:
        translated_data = json.load(f)

    for new_page, old_page, translated_page in zip(new_data['pages'], old_data['pages'], translated_data['pages']):
        if has_changes(new_page['title'], old_page['title']):
            translated_title = translate(new_page['title'])
            translated_page['title'] = translated_title

        for idx, (new_line, old_line) in enumerate(zip(new_page['lines'], old_page['lines'])):
            if has_changes(new_line, old_line):
                translated_line = translate(new_line)
                translated_page['lines'][idx] = translated_line

    with open(translated_file, 'w', encoding='utf-8') as f:
        json.dump(translated_data, f, ensure_ascii=False, indent=2)

# Replace the file paths with your desired file paths
translate_updated_json('input_new.json', 'input_old.json', 'output.json')

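>This implementation assumes that the JSON files have the same number of pages and the same structure. If the file structure changes significantly, or if pages are added or deleted, this implementation may not work correctly. To handle such scenarios, a more robust…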

>I think the file structure will stay the same, but it is quite likely that pages will be added or deleted. Please handle that.

import json
import openai

# Relies on translate() from the first snippet above.

def has_changes(new_line, old_line):
    return new_line != old_line

def translate_updated_json(input_file, old_input_file, translated_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        new_data = json.load(f)

    with open(old_input_file, 'r', encoding='utf-8') as f:
        old_data = json.load(f)

    with open(translated_file, 'r', encoding='utf-8') as f:
        translated_data = json.load(f)

    old_pages_dict = {page['title']: page for page in old_data['pages']}
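    # Key translated pages by their original (Japanese) title so the lookup below works.
    # Assumes the translated file lists pages in the same order as the old input.
    translated_pages_dict = {
        old_page['title']: translated_page
        for old_page, translated_page in zip(old_data['pages'], translated_data['pages'])
    }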

    new_translated_pages = []
    for new_page in new_data['pages']:
        new_title = new_page['title']
        if new_title not in old_pages_dict:  # Page is added
            translated_page = {
                'title': translate(new_title),
                'lines': [translate(line) for line in new_page['lines']]
            }
        else:  # Page exists in the old data
            old_page = old_pages_dict[new_title]
            translated_page = translated_pages_dict[new_title]

            if has_changes(new_page['title'], old_page['title']):
                translated_page['title'] = translate(new_page['title'])

            for idx, (new_line, old_line) in enumerate(zip(new_page['lines'], old_page['lines'])):
                if has_changes(new_line, old_line):
                    translated_page['lines'][idx] = translate(new_line)

        new_translated_pages.append(translated_page)

    # Deleted pages drop out automatically: only pages present in new_data
    # were appended to new_translated_pages.
    translated_data['pages'] = new_translated_pages

    with open(translated_file, 'w', encoding='utf-8') as f:
        json.dump(translated_data, f, ensure_ascii=False, indent=2)

# Replace the file paths with your desired file paths
translate_updated_json('input_new.json', 'input_old.json', 'output.json')

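Before handling diffs, let's first build something that simply translates
(Went back in the conversation history and edited)
>Thanks.
> The code above calls translate() for every title and line, but each call adds API cost, so I want to minimize the number of calls. However, the maximum number of characters that can be passed to ChatCompletion is about 8000.
> When translating titles, translate as many as fit within the maximum character count in a single call.
> When translating each page, translate a single page in one call. If it hits the character limit, split it as needed.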

import json
import openai

MAX_TOKENS = 8000

def translate(text, role="user"):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content

def translate_titles(title_list):
    translated_titles = []
    title_chunk = ""

    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = translate(title_chunk, role="title_translation")
            translated_titles.extend(translated_chunk.split("\n")[:-1])
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = translate(title_chunk, role="title_translation")
        translated_titles.extend(translated_chunk.split("\n")[:-1])

    return translated_titles

def translate_page(page_text):
    if len(page_text) <= MAX_TOKENS:
        return translate(page_text, role="page_translation")
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        return translate_page(first_half) + "\n" + translate_page(second_half)

def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]
    translated_titles = translate_titles(title_list)

    for original_title, translated_title in zip(title_list, translated_titles):
        title_translation_dict[original_title] = translated_title

    for page, translated_title in zip(data['pages'], translated_titles):
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        page_text = "\n".join(page['lines'])

        for jp_title, en_title in title_translation_dict.items():
            page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")

        translated_text = translate_page(page_text)
        page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')

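I tried it and found several problems
Sometimes GPT summarizes the page on its own
Maybe I should make GPT translate more mechanically
Something like passing JSON and getting JSON back
No, getting it to follow JSON exactly is actually harder
Bullet lists get broken
For this, I can probably just instruct it not to break them
Adjusted the prompt in the playground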

>Thanks. Right now I think ChatCompletion is called synchronously; please make it parallel and asynchronous.

import json
import asyncio
import aiohttp
import openai

MAX_TOKENS = 8000
INPUT_PATH = "input_json/test1.json"
OUTPUT_PATH = "output_json/test1_2.json"

PROMPT = """
You are a translator. 
# Task
You get multiple texts to translate. Translate texts to English, and return translated texts.
# Rules
Keep the number of lines and newlines. Never remove spaces at the beginning of each line. Keep the number of spaces the same. 
Brackets of [text] and [text.icon] must be kept. The content inside the bracket must never be changed.
"""

async def async_translate(session, text, role="user"):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }

    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": PROMPT},
            {"role": role, "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        print(response)
        return response["choices"][0]["message"]["content"]

async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""

    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk)
            translated_titles.extend(translated_chunk.split("\n")[:-1])
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk)
        translated_titles.extend(translated_chunk.split("\n")[:-1])

    return translated_titles

async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text)
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half)
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated

async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]

    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)

        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title

        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        for page in data['pages']:
            page_text = "\n".join(page['lines'])

            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")
            translated_text = await translate_page(session, page_text)
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
async def main():
    await translate_json_file(INPUT_PATH, OUTPUT_PATH)

asyncio.run(main())
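(Part of the API usage was wrong, so I fixed that by hand)
This is still slow
Wait, looking closely, this isn't actually running concurrently
In the end it calls await inside a for loop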
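
To make the difference concrete, here is a small sketch that needs no API access. `fake_translate` is a hypothetical stand-in for the real request; it only records when each call starts and ends, so the two event orders can be compared.

```python
import asyncio


async def fake_translate(name: str, events: list[str]) -> str:
    """Stand-in for an API call; records when it starts and finishes."""
    events.append(f"start {name}")
    await asyncio.sleep(0.01)  # simulated network wait
    events.append(f"end {name}")
    return name.upper()


async def run_sequential(names: list[str]) -> list[str]:
    events: list[str] = []
    for name in names:
        # Each await finishes this call before the loop moves to the next one.
        await fake_translate(name, events)
    return events


async def run_concurrent(names: list[str]) -> list[str]:
    events: list[str] = []
    # gather schedules every coroutine before waiting on any of them.
    await asyncio.gather(*(fake_translate(name, events) for name in names))
    return events


sequential_events = asyncio.run(run_sequential(["a", "b"]))
concurrent_events = asyncio.run(run_concurrent(["a", "b"]))
print(sequential_events)  # start a, end a, start b, end b
print(concurrent_events)  # both starts come before either end
```

With the for loop, the second call does not even start until the first one has returned, which is why the `async` version was no faster than the synchronous one.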

>Thanks. As a correction, role should always be system or user.
> Also, please run the translate_page calls concurrently.

import json
import asyncio
import aiohttp
import openai

MAX_TOKENS = 8000

async def async_translate(session, text):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }

    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": "user", "content": text}
        ]
    }

    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        return response["choices"][0]["message"]["content"]

async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""

    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk)
            translated_titles.extend(translated_chunk.split("\n")[:-1])
            title_chunk = title + "\n"

    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk)
        translated_titles.extend(translated_chunk.split("\n")[:-1])

    return translated_titles

async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text)
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half)
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated

async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]

    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)

        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title

        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        translation_tasks = []

        for page in data['pages']:
            page_text = "\n".join(page['lines'])

            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")

            translation_tasks.append(translate_page(session, page_text))

        translated_texts = await asyncio.gather(*translation_tasks)

        for page, translated_text in zip(data['pages'], translated_texts):
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
        
# Replace 'input.json' and 'output.json' with your desired file paths
async def main():
    await translate_json_file('input.json', 'output.json')
    
asyncio.run(main())



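Tried it with the new prompt
Good points
Links work properly after translation
Issues
Bullet-list indentation still breaks
Ah, looking at the contents, the indentation was preserved, but the spaces had been replaced with \t
Fixed the prompt
Line breaks are not preserved after translation
Why...
One theory is that GPT doesn't recognize line breaks
This might be it
Bing AI also dropped line breaks, so it's plausible
External links get translated too
ex: the link /villagepump/ChatGPTでScrapboxTranslatorを作る gets translated and ends up broken
This is a hassle, so I'll leave it for now
Could probably be fixed with more prompt work
In the end, I decided to replace spaces (indentation) and line breaks with \s and \n before translating
This worked ✅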
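
A minimal sketch of that encoding step, assuming only leading spaces are marked and that the source text contains no literal backslash sequences of its own. `encode_layout` and `decode_layout` are hypothetical names; the actual implementation lives in the linked logs and may differ.

```python
SPACE_MARK = "\\s"    # literal backslash + s, one per leading space
NEWLINE_MARK = "\\n"  # literal backslash + n, one per line break


def encode_layout(lines: list[str]) -> str:
    """Turn indentation and line breaks into visible markers."""
    encoded = []
    for line in lines:
        body = line.lstrip(" ")
        indent = len(line) - len(body)
        encoded.append(SPACE_MARK * indent + body)
    return NEWLINE_MARK.join(encoded)


def decode_layout(text: str) -> list[str]:
    """Restore real indentation and line breaks from the markers."""
    lines = []
    for chunk in text.split(NEWLINE_MARK):
        indent = 0
        while chunk.startswith(SPACE_MARK):
            chunk = chunk[len(SPACE_MARK):]
            indent += 1
        lines.append(" " * indent + chunk)
    return lines


page = ["title", " first bullet", "  nested bullet"]
encoded = encode_layout(page)
print(encoded)  # title\n\sfirst bullet\n\s\snested bullet
print(decode_layout(encoded) == page)  # True
```

The encoded page is a single physical line, so the model cannot silently drop a line break without also dropping a visible marker.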

Live commentary became a hassle, so I'm stopping
https://sharegpt.com/c/Tga6MNE
The log should be viewable here

Fixed a token-count bug
https://sharegpt.com/c/lmEB2ys
Humans find the bugs, the machine fixes them
What a wonderful world
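
These notes do not say what the token-count bug was. Purely as an illustration of the batching idea, here is a sketch that groups lines under a budget with a pluggable counting function. `chunk_lines` and its `count` parameter are hypothetical; the default counts characters, which is how the generated code above measured against `MAX_TOKENS`, and a real tokenizer-based counter could be passed in instead.

```python
from typing import Callable


def chunk_lines(
    lines: list[str],
    budget: int,
    count: Callable[[str], int] = len,
) -> list[list[str]]:
    """Group consecutive lines so each group's total cost stays within budget.

    `count` measures one line; the default counts characters, which is only a
    rough proxy for tokens. A single line larger than the budget still gets
    its own chunk rather than being dropped.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for line in lines:
        cost = count(line) + 1  # +1 for the newline that joins lines
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append(current)
    return chunks


print(chunk_lines(["aaaa", "bb", "cccccc", "d"], budget=8))
# prints: [['aaaa', 'bb'], ['cccccc'], ['d']]
```

Keeping the counter as a parameter means the chunking logic stays the same when the character count is swapped for a token count.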

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb
Noticed there is a rate limit, so I throttle with a Semaphore
(I didn't know about Semaphore, but ChatGPT handled it)
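
A minimal sketch of the Semaphore idea, with a simulated request instead of the real API. `MAX_CONCURRENT`, `limited_call` and the `stats` dictionary are hypothetical; the limit actually used is not recorded in these notes.

```python
import asyncio

MAX_CONCURRENT = 2  # hypothetical limit; tune to the account's rate limit


async def limited_call(semaphore: asyncio.Semaphore, i: int, stats: dict) -> int:
    """Stand-in for one API request, run only while holding the semaphore."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated request
        stats["active"] -= 1
        return i * 2


async def main() -> tuple[list[int], int]:
    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(limited_call(semaphore, i, stats) for i in range(5))
    )
    return results, stats["peak"]


results, peak = asyncio.run(main())
print(results, peak)  # [0, 2, 4, 6, 8] 2
```

Note that a semaphore caps how many requests are in flight at once, not how many are sent per minute, so it only indirectly keeps the request rate down.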

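For now I catch errors and ignore them, but errors still pop up here and there
Cases where the response is empty
It seems the token count is exceeded, but I don't know why batching isn't working
Is the token count calculation wrong?
This tends to happen when the body is in English
Cases where a mysterious text/html response comes back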
 Error occurred while making request: 0, message='Attempt to decode JSON with unexpected mimetype: text/html; charset=utf-8', url=URL('https://api.openai.com/v1/chat/completions') 
I won't know without looking at the contents
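
One way to handle both failure patterns without an exception is to check the response before decoding it. The sketch below is a pure function so it can be tried without network access. `extract_translation` is a hypothetical helper, and `content_type` and `body` are assumed to be the Content-Type header and raw text of the HTTP response. It does not explain why the HTML responses occur.

```python
import json
from typing import Optional


def extract_translation(content_type: str, body: str) -> Optional[str]:
    """Return the translated text, or None when the response is unusable."""
    if "application/json" not in content_type:
        return None  # e.g. an HTML page instead of an API response
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    choices = payload.get("choices") if isinstance(payload, dict) else None
    if not choices:
        return None  # no usable "choices" in the payload
    content = choices[0].get("message", {}).get("content")
    return content or None  # treat an empty string as a failure too


ok = json.dumps({"choices": [{"message": {"content": "Hello"}}]})
print(extract_translation("application/json", ok))  # Hello
print(extract_translation("text/html; charset=utf-8", "<html>502</html>"))  # None
```

Returning `None` lets the caller decide whether to retry or to log the raw body for inspection.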

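Spending 2 tokens each on the space marker \s and the newline marker \n is wasteful
Something like this wastes a lot
I want to find symbols that are rarely used but cost 1 token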
§
☆
Something around these

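No matter what I tried it wouldn't stabilize, so I handle it with a mechanism like "if the translated line count differs from the original line count by 3 or more, translate again"
Raise the temperature and try up to three times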
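
A sketch of that retry rule, assuming the API call is wrapped in a callable that takes the text and a temperature. `translate_with_retry`, the `translate` parameter and the temperature values are hypothetical, chosen for illustration rather than taken from the actual code.

```python
from typing import Callable

MAX_ATTEMPTS = 3
LINE_TOLERANCE = 3  # a difference of 3 or more lines triggers a retry


def translate_with_retry(
    text: str,
    translate: Callable[[str, float], str],
    base_temperature: float = 0.0,
    temperature_step: float = 0.3,
) -> str:
    """Retry until the translated line count is close to the source line count."""
    expected = text.count("\n") + 1
    result = ""
    for attempt in range(MAX_ATTEMPTS):
        # Raise the temperature a little on every retry.
        temperature = base_temperature + temperature_step * attempt
        result = translate(text, temperature)
        actual = result.count("\n") + 1
        if abs(actual - expected) < LINE_TOLERANCE:
            return result
    return result  # give up and keep the last attempt


calls: list[float] = []


def fake_translate(text: str, temperature: float) -> str:
    """Stand-in translator: the first attempt collapses the page to one line."""
    calls.append(temperature)
    return "summary" if len(calls) == 1 else text.upper()


source = "a\nb\nc\nd\ne"
print(translate_with_retry(source, fake_translate).split("\n"))
# prints: ['A', 'B', 'C', 'D', 'E']
print(len(calls))  # 2
```

The tolerance lets small line-count differences through, so only clearly broken outputs such as a summarized page cost an extra API call.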

Patching features on piece by piece with ChatGPT makes the code harder and harder to read
I wonder whether ChatGPT can refactor the code

todo:
Replace the first line of each page with the translated title
For title translation, it might be fine to loop until the line counts match
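
The first todo item could look something like the sketch below, assuming the page format from the prompt, where `lines[0]` repeats the page title. `sync_first_line` is a hypothetical helper name.

```python
def sync_first_line(page: dict) -> dict:
    """Return a copy of the page whose first line equals its (translated) title."""
    lines = list(page.get("lines", []))
    if lines:
        lines[0] = page["title"]
    else:
        lines = [page["title"]]
    return {**page, "lines": lines}


# Hypothetical page whose body translation drifted from the dictionary title.
page = {"title": "Page One", "lines": ["Page 1 (loosely translated)", " body"]}
print(sync_first_line(page)["lines"])  # ['Page One', ' body']
```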