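Multimodal
TTToken covers the full range of multimodal capabilities: vision understanding, audio input and TTS, image generation and editing, and video generation. This page gives a minimal working example for each common scenario.
Vision understanding (Vision)
OpenAI protocol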
{
  "model": "gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is this image about?"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
    ]
  }]
}
The image can also be inlined directly as base64:
"image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}
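A minimal sketch of how such a data URL could be built from raw image bytes, using only the Python standard library. The helper name `to_data_url` and the file name `cat.png` are hypothetical and not part of TTToken; the MIME type is guessed from the file extension.

```python
import base64
import mimetypes


def to_data_url(data: bytes, filename: str) -> str:
    """Wrap raw image bytes in a data URL suitable for the image_url field."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Only the 8-byte PNG signature is encoded here to keep the demo short;
# a real call would pass the full file contents.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
image_part = {
    "type": "image_url",
    "image_url": {"url": to_data_url(png_signature, "cat.png")},
}
print(image_part["image_url"]["url"])
```

The PNG signature always encodes to `iVBORw0KGgo`, which is why base64 PNG payloads such as the one above start with that prefix.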
Claude protocol
{
  "model": "claude-sonnet-4-5",
  "max_tokens": 1024,
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image", "source": {"type": "url", "url": "https://..."}},
      {"type": "text", "text": "How many cats are in the image?"}
    ]
  }]
}
Gemini protocol
{
  "contents": [{
    "role": "user",
    "parts": [
      {"inline_data": {"mime_type": "image/png", "data": "iVBO..."}},
      {"text": "Describe this image in detail"}
    ]
  }]
}
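If an image is already held as an OpenAI-style base64 data URL, the Gemini `inline_data` part can be derived from it by splitting off the MIME type and the payload. The sketch below illustrates this; the function names `data_url_to_gemini_part` and `gemini_body` are hypothetical helpers, not part of any SDK.

```python
def data_url_to_gemini_part(url: str) -> dict:
    """Convert 'data:<mime>;base64,<payload>' into a Gemini inline_data part."""
    header, sep, payload = url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return {"inline_data": {"mime_type": mime_type, "data": payload}}


def gemini_body(data_url: str, prompt: str) -> dict:
    """Build a request body in the shape shown above: image part, then text."""
    return {
        "contents": [{
            "role": "user",
            "parts": [data_url_to_gemini_part(data_url), {"text": prompt}],
        }]
    }


body = gemini_body("data:image/png;base64,iVBORw0KGgo=", "Describe this image in detail")
print(body["contents"][0]["parts"][0])
```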
Common models
gpt-4o / gpt-4.1 / gpt-5
claude-opus-4-5 / claude-sonnet-4-5
gemini-2.5-pro / gemini-2.5-flash
qwen-vl-max / glm-4v / doubao-vision
Audio input & TTS
Speech-to-text (STT)
POST /v1/audio/transcriptions
curl https://tttoken.xyz/v1/audio/transcriptions \
  -H "Authorization: Bearer $TTT_KEY" \
  -F file=@meeting.m4a \
  -F model=whisper-1
Text-to-speech (TTS)
POST /v1/audio/speech
curl https://tttoken.xyz/v1/audio/speech \
  -H "Authorization: Bearer $TTT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Hello, welcome to TTToken.",
    "voice": "alloy",
    "format": "mp3"
  }' --output hello.mp3
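The same call can be sketched in Python with the standard library's `urllib`. The URL, headers, and body fields mirror the curl command above; the function names `build_speech_request` and `synthesize` are hypothetical, and the sketch assumes the endpoint returns the raw audio bytes as the response body, as the `--output` flag suggests.

```python
import json
import urllib.request

BASE_URL = "https://tttoken.xyz"


def build_speech_request(api_key: str, text: str, voice: str = "alloy",
                         fmt: str = "mp3") -> urllib.request.Request:
    """Prepare (but do not send) the POST /v1/audio/speech request."""
    body = {"model": "gpt-4o-mini-tts", "input": text, "voice": voice, "format": fmt}
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key: str, text: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(api_key, text)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

A call such as `synthesize(os.environ["TTT_KEY"], "Hello, welcome to TTToken.", "hello.mp3")` would then correspond to the curl example.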
Audio as chat input
GPT-4o and Gemini 2.5 accept audio directly inside a chat message:
"content": [
  {"type": "input_audio", "input_audio": {"data": "UklGR...", "format": "wav"}},
  {"type": "text", "text": "Summarize this recording"}
]
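The `data` field carries the base64-encoded bytes of the audio file. As an illustration, the sketch below generates a short silent WAV clip with the standard `wave` module and wraps it in an `input_audio` part; the helper names `silent_wav` and `audio_part` are hypothetical, and a real call would read the bytes of an actual recording instead.

```python
import base64
import io
import wave


def silent_wav(seconds: float = 0.1, rate: int = 16000) -> bytes:
    """Return a mono 16-bit PCM WAV clip of silence, built in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


def audio_part(wav_bytes: bytes) -> dict:
    """Wrap WAV bytes in the input_audio content part shown above."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": encoded, "format": "wav"}}


part = audio_part(silent_wav())
print(part["input_audio"]["data"][:5])
```

Every WAV file begins with the ASCII bytes `RIFF`, which base64-encode to the `UklGR` prefix seen in the example payload.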
Image generation
POST /v1/images/generations
{
  "model": "gpt-image-1",
  "prompt": "An ink-wash landscape painting with distant mountains, nearby trees, and a lone small boat",
  "size": "1024x1024",
  "quality": "high",
  "n": 1
}
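This page does not show the response body. The sketch below assumes the gateway returns OpenAI's images response shape, in which each entry of `data` carries either a `b64_json` string or a `url`; that assumption should be checked against an actual response. The function name `decode_images` and the fake response are illustrative only.

```python
import base64


def decode_images(response: dict) -> list:
    """Return raw image bytes for every data entry that carries b64_json."""
    images = []
    for item in response.get("data", []):
        encoded = item.get("b64_json")
        if encoded is not None:
            images.append(base64.b64decode(encoded))
    return images


# Stand-in for a parsed JSON response; the payload is not a real image.
fake_response = {
    "data": [{"b64_json": base64.b64encode(b"\x89PNG demo").decode("ascii")}]
}
images = decode_images(fake_response)
print(len(images), "image(s) decoded")
```

Entries that carry only a `url` are skipped by this sketch and would need a separate download step.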
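Other available models:
| Model | Notes |
|---|---|
| gpt-image-1 | Official OpenAI model; good text rendering |
| flux-1.1-pro / flux-kontext | Strong at photorealism and editing |
| dall-e-3 | Wide range of styles |
| mj_imagine | Midjourney proxy; must go through the /mj/ endpoints |
| nano-banana | Gemini image editing; low cost |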
Image editing / inpainting
POST /v1/images/edits
curl https://tttoken.xyz/v1/images/edits \
  -H "Authorization: Bearer $TTT_KEY" \
  -F model=gpt-image-1 \
  -F image=@room.png \
  -F mask=@mask.png \
  -F prompt="Replace the sofa with a brown leather one"
Midjourney compatibility
POST /mj/submit/imagine
Compatible with the midjourney-proxy protocol:
{
  "prompt": "a cute corgi sitting on grass --ar 16:9 --v 6.1",
  "botType": "MID_JOURNEY"
}
The call returns a taskId; then poll GET /mj/task/{id}/fetch until the task finishes.
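The polling step can be sketched as a loop that calls the fetch endpoint until the task reaches a terminal state. The status values `SUCCESS` and `FAILURE` and the `status` field name are assumptions based on the midjourney-proxy task model and should be verified against a real response. The HTTP call is injected as a `fetch` callable (hypothetical), so the loop itself stays independent of any HTTP library.

```python
import time
from typing import Callable

# Assumed terminal states; other values are treated as "still running".
TERMINAL_STATES = {"SUCCESS", "FAILURE"}


def wait_for_task(task_id: str,
                  fetch: Callable[[str], dict],
                  interval: float = 3.0,
                  timeout: float = 300.0,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll fetch(task_id) until the task is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        task = fetch(task_id)           # e.g. GET /mj/task/{id}/fetch
        if task.get("status") in TERMINAL_STATES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in {timeout}s")
        sleep(interval)


# Demo with a fake fetch that finishes on the third poll (no network access).
fake_replies = iter([
    {"status": "SUBMITTED"},
    {"status": "IN_PROGRESS"},
    {"status": "SUCCESS"},
])
result = wait_for_task("demo-task", lambda _id: next(fake_replies), sleep=lambda _s: None)
print(result["status"])
```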
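Video generation
POST /v1/videos
{
  "model": "sora-2",
  "prompt": "An orange cat wearing sunglasses, surfing",
  "duration": 5,
  "size": "1280x720"
}
Models such as Sora, Veo, Kling, and Runway are supported. The request returns a task_id; query the result at /v1/videos/{task_id} and download the mp4 from /v1/videos/{task_id}/content.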
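Multimodal embeddings
Some models (such as voyage-multimodal-3 and gemini-embedding-001) can embed images and text into a single shared vector space, which is useful for image-text retrieval.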
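Once images and text live in the same vector space, retrieval reduces to ranking stored vectors by similarity to a query vector. A minimal sketch using cosine similarity follows; the three-dimensional vectors and file names are made-up placeholders, not real model output, and the embedding call itself is omitted because this page does not specify its endpoint.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: list, items: list) -> list:
    """Return item names ordered from most to least similar to the query."""
    scored = sorted(items, key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _vector in scored]


# Placeholder image embeddings and a placeholder text-query embedding.
image_vectors = [
    ("invoice.png", [0.0, 0.2, 0.9]),
    ("cat.jpg", [0.9, 0.1, 0.0]),
    ("dog.jpg", [0.7, 0.3, 0.1]),
]
text_query = [1.0, 0.0, 0.0]
print(rank(text_query, image_vectors))
```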