Building ScrapboxTranslator v1 with ChatGPT
Using ChatGPT to write "code that translates using ChatGPT"
[blu3mo.icon]
prompt:
I want to write Python code that takes the Japanese documents in a JSON file and outputs a JSON file translated into English.
# JSON file format
```
{
  "pages": [
    {
      "title": "page1title",
      "lines": [
        "page1title",
        "line2",
        "line3"
      ]
    },
    {
      "title": "page2title",
      "lines": [
        "page2title",
        "line2",
        "line3"
      ]
    },
    ...
  ]
}
```
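# What to do
1. Receive the JSON file.
2. Extract only the titles of all pages, translate them into English, and build a dictionary that pairs the Japanese and English titles.
3. Translate the body (lines) of each page. Note: the body contains links to other pages' titles, enclosed in square brackets as [title]. Those titles must be translated word for word according to the English translations held in the dictionary.
# How to translate
Use the ChatGPT API. A sample input and output of the ChatGPT API are shown below. I want to tell the assistant to translate into English and use its output. If there are conditions or caveats for the translation, tell it those as well.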
```sample call
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)
```
```sample response
{
  'id': 'chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve',
  'object': 'chat.completion',
  'created': 1677649420,
  'model': 'gpt-3.5-turbo',
  'usage': {'prompt_tokens': 56, 'completion_tokens': 31, 'total_tokens': 87},
  'choices': [
    {
      'message': {
        'role': 'assistant',
        'content': 'The 2020 World Series was played in Arlington, Texas at the Globe Life Field, which was the new home stadium for the Texas Rangers.'},
      'finish_reason': 'stop',
      'index': 0
    }
  ]
}
```
[chatgpt.icon]
```python
import json
import openai

def translate(text, role="user"):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content

def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    for page in data['pages']:
        translated_title = translate(page['title'])
        title_translation_dict[page['title']] = translated_title
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        translated_lines = []
        for line in page['lines']:
            for jp_title, en_title in title_translation_dict.items():
                line = line.replace(f"[{jp_title}]", f"[{en_title}]")
            translated_line = translate(line)
            translated_lines.append(translated_line)
        page['lines'] = translated_lines

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')
```
I think this probably works.
[blu3mo.icon]
So it replaces the links before translating the body. If that works, it is indeed the simpler approach.
I'll verify it later.
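One way to check the replace-links-first idea offline is a small helper that rewrites only the bracketed links. This is a minimal sketch, not the code ChatGPT produced: `replace_links` and the sample `title_map` are hypothetical names, and it assumes links never nest. It swaps a bracket only when its whole content matches a dictionary key, so other brackets such as icons pass through unchanged.

```python
import re

def replace_links(text, title_map):
    """Swap [Japanese title] for [English title]; leave unknown brackets alone."""
    def swap(match):
        inner = match.group(1)
        return "[" + title_map.get(inner, inner) + "]"
    # [^\[\]]+ matches the text between one pair of square brackets
    return re.sub(r"\[([^\[\]]+)\]", swap, text)

# Hypothetical sample data
title_map = {"猫": "Cat", "猫カフェ": "Cat cafe"}
line = "[猫]と[猫カフェ]と[blu3mo.icon]"
print(replace_links(line, title_map))
```

Because each bracket is looked up as a whole, a title that is a prefix of another title (猫 and 猫カフェ here) cannot be replaced partially.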
> Thanks. As an improvement, when the document is updated I want to retranslate only the updated parts. What would be a good implementation? [blu3mo.icon]
[chatgpt.icon]
```python
import json
import openai

def has_changes(new_line, old_line):
    return new_line != old_line

def translate_updated_json(input_file, old_input_file, translated_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        new_data = json.load(f)
    with open(old_input_file, 'r', encoding='utf-8') as f:
        old_data = json.load(f)
    with open(translated_file, 'r', encoding='utf-8') as f:
        translated_data = json.load(f)

    for new_page, old_page, translated_page in zip(new_data['pages'], old_data['pages'], translated_data['pages']):
        if has_changes(new_page['title'], old_page['title']):
            translated_title = translate(new_page['title'])
            translated_page['title'] = translated_title
        for idx, (new_line, old_line) in enumerate(zip(new_page['lines'], old_page['lines'])):
            if has_changes(new_line, old_line):
                translated_line = translate(new_line)
                translated_page['lines'][idx] = translated_line

    with open(translated_file, 'w', encoding='utf-8') as f:
        json.dump(translated_data, f, ensure_ascii=False, indent=2)

# Replace the file paths with your desired file paths
translate_updated_json('input_new.json', 'input_old.json', 'output.json')
```
> This implementation assumes that the JSON files have the same number of pages and the same structure. If the file structure changes significantly, or if pages are added or deleted, this implementation may not work correctly. To handle such scenarios, a more robust… [chatgpt.icon]
> I think the file structure will stay the same, but pages may well be added or deleted. Please handle that. [blu3mo.icon]
[chatgpt.icon]
```python
import json
import openai

def has_changes(new_line, old_line):
    return new_line != old_line

def translate_updated_json(input_file, old_input_file, translated_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        new_data = json.load(f)
    with open(old_input_file, 'r', encoding='utf-8') as f:
        old_data = json.load(f)
    with open(translated_file, 'r', encoding='utf-8') as f:
        translated_data = json.load(f)

    old_pages_dict = {page['title']: page for page in old_data['pages']}
    translated_pages_dict = {page['title']: page for page in translated_data['pages']}
    new_translated_pages = []

    for new_page in new_data['pages']:
        new_title = new_page['title']
        if new_title not in old_pages_dict:  # Page is added
            translated_page = {
                'title': translate(new_title),
                'lines': [translate(line) for line in new_page['lines']]
            }
        else:  # Page exists in the old data
            old_page = old_pages_dict[new_title]
            translated_page = translated_pages_dict[new_title]
            if has_changes(new_page['title'], old_page['title']):
                translated_page['title'] = translate(new_page['title'])
            for idx, (new_line, old_line) in enumerate(zip(new_page['lines'], old_page['lines'])):
                if has_changes(new_line, old_line):
                    translated_page['lines'][idx] = translate(new_line)
        new_translated_pages.append(translated_page)

    # Remove deleted pages from the translated data
    translated_data['pages'] = [page for page in translated_data['pages'] if page['title'] in old_pages_dict]
    translated_data['pages'] = new_translated_pages

    with open(translated_file, 'w', encoding='utf-8') as f:
        json.dump(translated_data, f, ensure_ascii=False, indent=2)

# Replace the file paths with your desired file paths
translate_updated_json('input_new.json', 'input_old.json', 'output.json')
```
Before dealing with diffs, let me first build something that simply translates.
[blu3mo.icon]
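For whenever diff handling comes back, the page-level comparison can be separated from the translation itself. The following is a minimal sketch under the assumption that the old and new exports are both keyed by the original (Japanese) title. `diff_pages` is a hypothetical helper, not part of the code above; it only classifies pages and leaves retranslating the `added` and `changed` ones to the caller.

```python
def diff_pages(old_pages, new_pages):
    """Classify pages by comparing two exports, keyed by original title."""
    old_by_title = {page["title"]: page for page in old_pages}
    new_by_title = {page["title"]: page for page in new_pages}

    added = [t for t in new_by_title if t not in old_by_title]
    removed = [t for t in old_by_title if t not in new_by_title]
    changed = [
        t for t in new_by_title
        if t in old_by_title and new_by_title[t]["lines"] != old_by_title[t]["lines"]
    ]
    return added, removed, changed

# Hypothetical sample data
old = [{"title": "A", "lines": ["A", "x"]}, {"title": "B", "lines": ["B"]}]
new = [{"title": "A", "lines": ["A", "y"]}, {"title": "C", "lines": ["C"]}]
print(diff_pages(old, new))
```

Comparing whole `lines` lists avoids the index misalignment that a line-by-line `zip` runs into when a line is inserted in the middle of a page.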
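[blu3mo.icon]
(Went back in the chat history and edited an earlier message)
> The code above calls translate() for every single title and line, but each call adds API cost, so I want to keep the number of calls to a minimum. However, the maximum number of characters that can be passed to ChatCompletion is around 8000.
> When translating titles, translate as many as fit within the maximum length in one call.
> When translating a page, translate the whole page in one call. If it hits the length limit, split it as needed.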
[chatgpt.icon]
```python
import json
import openai

MAX_TOKENS = 8000

def translate(text, role="user"):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": role, "content": text}
        ]
    )
    return response.choices[0].message.content

def translate_titles(title_list):
    translated_titles = []
    title_chunk = ""
    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = translate(title_chunk, role="title_translation")
            translated_titles.extend(translated_chunk.split("\n")[:-1])
            title_chunk = title + "\n"
    if title_chunk:
        translated_chunk = translate(title_chunk, role="title_translation")
        translated_titles.extend(translated_chunk.split("\n")[:-1])
    return translated_titles

def translate_page(page_text):
    if len(page_text) <= MAX_TOKENS:
        return translate(page_text, role="page_translation")
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        return translate_page(first_half) + "\n" + translate_page(second_half)

def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]
    translated_titles = translate_titles(title_list)
    for original_title, translated_title in zip(title_list, translated_titles):
        title_translation_dict[original_title] = translated_title
    for page, translated_title in zip(data['pages'], translated_titles):
        page['title'] = translated_title

    # Translate lines with translated titles replaced
    for page in data['pages']:
        page_text = "\n".join(page['lines'])
        for jp_title, en_title in title_translation_dict.items():
            page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")
        translated_text = translate_page(page_text)
        page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
translate_json_file('input.json', 'output.json')
```
When I tried it, there were several problems.
[blu3mo.icon]
Sometimes GPT summarizes the page on its own.
Maybe I should make GPT translate more mechanically?
Something like passing JSON in and getting JSON back.
No, getting it to follow JSON exactly is actually harder.
[blu3mo.icon]
The bullet lists get broken.
I suppose I can just instruct it not to break them.
Tuned the prompt in the Playground.
[blu3mo.icon]
> Thanks. I think ChatCompletion is currently called synchronously; please make it parallel and asynchronous. [blu3mo.icon]
[chatgpt.icon]
```python
import json
import asyncio
import aiohttp
import openai

MAX_TOKENS = 8000
INPUT_PATH = "input_json/test1.json"
OUTPUT_PATH = "output_json/test1_2.json"
PROMPT = """
You are a translator.
# Task
You get multiple texts to translate. Translate texts to English, and return translated texts.
# Rules
Keep the number of lines and newlines. Never remove spaces at the beginning of each line. Keep the number of spaces the same.
Brackets of [text] and [text.icon] must be kept. The content inside the bracket must never be changed.
"""

async def async_translate(session, text, role="user"):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": PROMPT},
            {"role": role, "content": text}
        ]
    }
    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        print(response)
        return response["choices"][0]["message"]["content"]

async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""
    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk)
            translated_titles.extend(translated_chunk.split("\n")[:-1])
            title_chunk = title + "\n"
    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk)
        translated_titles.extend(translated_chunk.split("\n")[:-1])
    return translated_titles

async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text)
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half)
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated

async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]
    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)
        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title
        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        for page in data['pages']:
            page_text = "\n".join(page['lines'])
            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")
            translated_text = await translate_page(session, page_text)
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
async def main():
    await translate_json_file(INPUT_PATH, OUTPUT_PATH)

asyncio.run(main())
```
(Part of the API usage was wrong, so [blu3mo.icon] fixed that by hand.)
It's still slow even with this.
[blu3mo.icon]
Wait, looking at it closely, this isn't actually asynchronous.
In the end it just calls await inside a for loop.
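The difference between awaiting inside a loop and scheduling everything at once can be seen without touching the API. This is a minimal sketch in which `fake_request` is a hypothetical stand-in for one API call and `stats` simply records how many calls overlap; none of these names come from the code above.

```python
import asyncio

async def fake_request(stats):
    """Stand-in for one API call; records how many calls are in flight."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1

async def sequential(n, stats):
    for _ in range(n):
        await fake_request(stats)  # each call finishes before the next starts

async def concurrent(n, stats):
    # gather schedules every coroutine first, then waits for all of them
    await asyncio.gather(*(fake_request(stats) for _ in range(n)))

def peak_overlap(runner, n):
    stats = {"in_flight": 0, "peak": 0}
    asyncio.run(runner(n, stats))
    return stats["peak"]

print(peak_overlap(sequential, 5), peak_overlap(concurrent, 5))
```

The loop version never has more than one call in flight, which is why an `async def` alone does not make the translation faster.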
> Thanks. One correction: role should always be either system or user.
> Also, please run the individual translate_page calls concurrently.
[chatgpt.icon]
```python
import json
import asyncio
import aiohttp
import openai

MAX_TOKENS = 8000

async def async_translate(session, text):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai.api_key}"
    }
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that translates Japanese text to English."},
            {"role": "user", "content": text}
        ]
    }
    async with session.post("https://api.openai.com/v1/chat/completions", headers=headers, json=data) as resp:
        response = await resp.json()
        return response["choices"][0]["message"]["content"]

async def translate_titles(session, title_list):
    translated_titles = []
    title_chunk = ""
    for title in title_list:
        if len(title_chunk) + len(title) + 1 < MAX_TOKENS:
            title_chunk += title + "\n"
        else:
            translated_chunk = await async_translate(session, title_chunk)
            translated_titles.extend(translated_chunk.split("\n")[:-1])
            title_chunk = title + "\n"
    if title_chunk:
        translated_chunk = await async_translate(session, title_chunk)
        translated_titles.extend(translated_chunk.split("\n")[:-1])
    return translated_titles

async def translate_page(session, page_text):
    if len(page_text) <= MAX_TOKENS:
        return await async_translate(session, page_text)
    else:
        split_point = page_text.rfind("\n", 0, MAX_TOKENS)
        first_half = page_text[:split_point]
        second_half = page_text[split_point + 1:]
        first_half_translated = await async_translate(session, first_half)
        second_half_translated = await translate_page(session, second_half)
        return first_half_translated + "\n" + second_half_translated

async def translate_json_file(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    title_translation_dict = {}

    # Translate page titles
    title_list = [page['title'] for page in data['pages']]
    async with aiohttp.ClientSession() as session:
        translated_titles = await translate_titles(session, title_list)
        for original_title, translated_title in zip(title_list, translated_titles):
            title_translation_dict[original_title] = translated_title
        for page, translated_title in zip(data['pages'], translated_titles):
            page['title'] = translated_title

        # Translate lines with translated titles replaced
        translation_tasks = []
        for page in data['pages']:
            page_text = "\n".join(page['lines'])
            for jp_title, en_title in title_translation_dict.items():
                page_text = page_text.replace(f"[{jp_title}]", f"[{en_title}]")
            translation_tasks.append(translate_page(session, page_text))
        translated_texts = await asyncio.gather(*translation_tasks)
        for page, translated_text in zip(data['pages'], translated_texts):
            page['lines'] = translated_text.split("\n")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Replace 'input.json' and 'output.json' with your desired file paths
async def main():
    await translate_json_file('input.json', 'output.json')

asyncio.run(main())
```
Tried it with the new prompt.
[blu3mo.icon]
Good points
Links work properly after translation.
Issues
Bullet list indentation still breaks.
Ah, when I looked at the content, the indentation was preserved, but the spaces had been replaced with \t.
Fixed the prompt.
[blu3mo.icon]
Line breaks are not preserved after translation.
Why...
[blu3mo.icon]
One theory is that GPT does not recognize line breaks.
This might be it.
Bing AI was stripping line breaks too, so it's plausible.
External links get translated too.
This is a hassle, so I'll leave it for now.
It could probably be fixed with more prompt work.
In the end, I decided to replace spaces (indentation) and line breaks with \s and \n before having it translate.
This worked ✅
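The marker substitution could look roughly like the sketch below. It is not the code from this project: `encode_whitespace` and `decode_whitespace` are hypothetical names, and it assumes indentation is made of ASCII spaces and that the page text does not already contain the literal two-character sequences `\s` or `\n`.

```python
def encode_whitespace(text, space_mark="\\s", newline_mark="\\n"):
    """Make leading spaces and line breaks visible as literal markers."""
    encoded_lines = []
    for line in text.split("\n"):
        body = line.lstrip(" ")
        indent = len(line) - len(body)
        encoded_lines.append(space_mark * indent + body)
    return newline_mark.join(encoded_lines)

def decode_whitespace(text, space_mark="\\s", newline_mark="\\n"):
    """Turn the markers back into real spaces and line breaks."""
    decoded_lines = []
    for line in text.split(newline_mark):
        indent = 0
        while line.startswith(space_mark):
            line = line[len(space_mark):]
            indent += 1
        decoded_lines.append(" " * indent + line)
    return "\n".join(decoded_lines)

page = "title\n item\n  nested item"
encoded = encode_whitespace(page)
print(encoded)
print(decode_whitespace(encoded) == page)
```

The encoded text is sent as a single line, and the translated result is decoded back into lines afterwards. The markers are parameters, so other symbols can be tried without changing the functions.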
Live-logging got tedious, so I'll stop.
[blu3mo.icon]
The log should be viewable with this.
[blu3mo.icon]
Fixed a bug in the token counting.
What a wonderful world.
[blu3mo.icon]
I noticed there is a rate limit, so I cap concurrency with a Semaphore.
(I didn't know about Semaphore, but ChatGPT took care of it.)
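For readers who also have not met it, this is roughly how an `asyncio.Semaphore` caps the number of calls in flight. A minimal sketch, not the project's code: `limited_call`, `run_all`, and `fake_translate` are hypothetical names, and `fake_translate` stands in for the real API request.

```python
import asyncio

async def limited_call(semaphore, worker, item):
    """Run worker(item) while holding one of the semaphore's slots."""
    async with semaphore:
        return await worker(item)

async def run_all(items, worker, max_concurrency=3):
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [limited_call(semaphore, worker, item) for item in items]
    return await asyncio.gather(*tasks)  # results keep the input order

async def demo():
    running = 0
    peak = 0

    async def fake_translate(text):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # pretend to wait for the API
        running -= 1
        return text.upper()

    results = await run_all(["a", "b", "c", "d", "e"], fake_translate, max_concurrency=2)
    return results, peak

results, peak = asyncio.run(demo())
print(results, peak)
```

A semaphore limits how many requests overlap, not how many are sent per minute, so the right `max_concurrency` for a given rate limit has to be found by experiment.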
For now I catch and ignore errors, but errors still pop up here and there.
Pattern: the response is empty
It seems the token limit is being exceeded, but I don't know why the batching isn't working.
Is the token count calculation wrong?
This tends to happen when the body text is in English.
[blu3mo.icon]
Pattern: a mysterious text/html response comes back
Error occurred while making request: 0, message='Attempt to decode JSON with unexpected mimetype: text/html; charset=utf-8', url=URL('https://api.openai.com/v1/chat/completions')
I can't tell what this is without looking at the content.
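To debug the batching separately from the API, the grouping logic can be isolated so that the token estimate is swappable. This is a minimal sketch with hypothetical names (`batch_lines`, `count_tokens`); it is not the code in the repository. `count_tokens` defaults to `len`, which counts characters and is only a crude stand-in for a real tokenizer, and `budget` is assumed to already leave headroom for the system prompt and the reply.

```python
def batch_lines(lines, budget, count_tokens=len):
    """Greedily group lines so each batch's estimated size stays within budget."""
    batches, current, used = [], [], 0
    for line in lines:
        cost = count_tokens(line) + 1  # +1 for the joining newline
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        # A single line larger than the budget still gets its own batch
        current.append(line)
        used += cost
    if current:
        batches.append(current)
    return batches

lines = ["aaaa", "bbbb", "cccccccccccc", "dd"]
print(batch_lines(lines, budget=10))
```

Printing the estimated size of each batch next to the `usage` field that the API returns would show whether the estimate or the batching is the part that is off.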
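It's wasteful that \s for spaces and \n for line breaks each cost 2 tokens.
Cases like this one waste a lot.
I want to find symbols that are rarely used but cost only 1 token.
§
☆
Something like these?
No matter what I try it doesn't stabilize, so I decided to handle it with a mechanism like "if the line count after translation differs from the line count before translation by 3 or more, translate again".
[blu3mo.icon]
Try up to three times, raising the temperature each time.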
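The retry rule can be sketched independently of the API call. In this minimal sketch, `translate_with_retry` is a hypothetical name, `translate` is any callable taking `(text, temperature)`, and the starting temperature and step size are assumptions rather than the values actually used.

```python
def translate_with_retry(text, translate, max_attempts=3, tolerance=3,
                         base_temperature=0.0, step=0.3):
    """Retry until the translated line count is close to the source line count."""
    source_lines = text.count("\n") + 1
    result = None
    for attempt in range(max_attempts):
        temperature = base_temperature + step * attempt  # raise it on each retry
        result = translate(text, temperature)
        translated_lines = result.count("\n") + 1
        if abs(translated_lines - source_lines) < tolerance:
            return result, attempt + 1
    return result, max_attempts  # give up and keep the last attempt

# Fake translator: collapses everything on the first call, behaves afterwards
calls = []
def fake_translate(text, temperature):
    calls.append(temperature)
    if len(calls) == 1:
        return "everything on one line"
    return "\n".join("line" for _ in text.split("\n"))

source = "a\nb\nc\nd\ne"
result, attempts = translate_with_retry(source, fake_translate)
print(attempts, calls)
```

Returning the attempt count alongside the text makes it easy to log which pages needed retries.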
Patching features on piecemeal with ChatGPT makes the code harder and harder to read.
[blu3mo.icon]
I wonder whether ChatGPT can refactor the code.
[inajob.icon]
todo:
Replace the first line of each page with the translated title.
For title translation, it might be fine to have it loop until the line counts match.