Multimodal

Last updated: 2026-04-24 · Estimated reading time: 10 minutes

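TTToken covers the full range of multimodal capabilities: vision understanding, audio input and TTS, image generation and editing, and video generation. This article gives minimal working examples for common scenarios.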

Vision understanding

OpenAI protocol

{
  "model": "gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is this image about?"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
    ]
  }]
}

The image can also be inlined as a base64 data URL:

"image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}

Claude protocol

{
  "model": "claude-sonnet-4-5",
  "max_tokens": 1024,
  "messages": [{
    "role": "user",
    "content": [
      {"type":"image","source":{"type":"url","url":"https://..."}},
      {"type":"text","text":"How many cats are in the picture?"}
    ]
  }]
}

Gemini protocol

{
  "contents": [{
    "role":"user",
    "parts": [
      {"inline_data":{"mime_type":"image/png","data":"iVBO..."}},
      {"text":"Describe this image in detail"}
    ]
  }]
}
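
The three protocols above differ mainly in how the image part is wrapped. The Python sketch below (standard library only) shows one way to base64-encode image bytes and build the matching part for each protocol. The helper name `image_part` is hypothetical, and the Claude base64 form (`"type": "base64"` with `media_type`) is taken from Anthropic's public Messages API rather than from this page, so treat it as an assumption about what TTToken forwards.

```python
import base64


def image_part(protocol: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as one content part for the given protocol."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    if protocol == "openai":
        # OpenAI style: a data URL inside image_url.
        return {
            "type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{b64}"},
        }
    if protocol == "claude":
        # Assumption: base64 source with media_type, as in Anthropic's Messages API.
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": mime_type, "data": b64},
        }
    if protocol == "gemini":
        # Gemini style: inline_data carrying mime_type and the bare base64 string.
        return {"inline_data": {"mime_type": mime_type, "data": b64}}
    raise ValueError(f"unknown protocol: {protocol}")
```

For a local file, something like `image_part("openai", pathlib.Path("cat.png").read_bytes())` would produce a part that can stand in for the `image_url` entry in the OpenAI example (`cat.png` is a placeholder file name).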

Common models

Audio input & TTS

Speech-to-text (STT)

POST /v1/audio/transcriptions
curl https://tttoken.xyz/v1/audio/transcriptions \
  -H "Authorization: Bearer $TTT_KEY" \
  -F file=@meeting.m4a \
  -F model=whisper-1
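
The `-F` flags in the curl command build a multipart/form-data body. For readers calling the endpoint from code without a third-party HTTP client, the following Python sketch constructs an equivalent request with the standard library. The helper names are hypothetical, and the request is only built here; sending it would be a call to `urllib.request.urlopen(request)`.

```python
import urllib.request
import uuid

API_BASE = "https://tttoken.xyz/v1"  # base URL taken from the curl example


def encode_multipart(fields: dict, files: dict) -> tuple:
    """Encode text fields and (filename, bytes, mime) files as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text.encode("utf-8"))
    for name, (filename, data, mime) in files.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f"Content-Type: {mime}\r\n\r\n"
        )
        parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(api_key: str, filename: str, audio: bytes):
    """Build (but do not send) the POST request for /audio/transcriptions."""
    body, content_type = encode_multipart(
        {"model": "whisper-1"},
        {"file": (filename, audio, "application/octet-stream")},
    )
    return urllib.request.Request(
        f"{API_BASE}/audio/transcriptions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
```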

Text-to-speech (TTS)

POST /v1/audio/speech
curl https://tttoken.xyz/v1/audio/speech \
  -H "Authorization: Bearer $TTT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Hello, welcome to TTToken.",
    "voice": "alloy",
    "format": "mp3"
  }' --output hello.mp3
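
The same call can be sketched in Python with the standard library. The payload mirrors the curl example field for field; the function names are hypothetical, and `save_speech` assumes the endpoint responds with the raw audio bytes, as the `--output hello.mp3` flag suggests.

```python
import json
import shutil
import urllib.request


def build_tts_request(api_key: str, text: str, voice: str = "alloy"):
    """Build (but do not send) the POST request for /v1/audio/speech."""
    payload = {
        "model": "gpt-4o-mini-tts",
        "input": text,
        "voice": voice,
        "format": "mp3",
    }
    return urllib.request.Request(
        "https://tttoken.xyz/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request, path: str, timeout: float = 60.0) -> None:
    """Send the request and stream the response body to a file."""
    with urllib.request.urlopen(request, timeout=timeout) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
```

A call such as `save_speech(build_tts_request(os.environ["TTT_KEY"], "Hello"), "hello.mp3")` would correspond to the curl command above.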

Audio as chat input

GPT-4o and Gemini 2.5 accept audio directly in a Chat request:

"content": [
  {"type": "input_audio", "input_audio": {"data": "UklGR...", "format": "wav"}},
  {"type": "text", "text": "Summarize this recording"}
]
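
The `data` field holds the base64-encoded audio file; the `UklGR...` prefix in the example is what the `RIFF` header of a WAV file looks like once encoded. A minimal Python sketch, using a generated silent clip as a stand-in for a real recording (both helper names are hypothetical):

```python
import base64
import io
import wave


def silent_wav(seconds: float = 0.1, rate: int = 16000) -> bytes:
    """Generate a short mono 16-bit silent WAV clip as placeholder audio."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 16-bit samples
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


def audio_message_content(wav_bytes: bytes, prompt: str) -> list:
    """Build the content array: one input_audio part followed by one text part."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return [
        {"type": "input_audio", "input_audio": {"data": encoded, "format": "wav"}},
        {"type": "text", "text": prompt},
    ]
```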

Image generation

POST /v1/images/generations
{
  "model": "gpt-image-1",
  "prompt": "An ink-wash landscape painting with distant mountains, nearby trees, and a small boat",
  "size": "1024x1024",
  "quality": "high",
  "n": 1
}
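
This page does not show the response body. Assuming it follows the OpenAI Images API shape, where each entry in `data` carries a base64 string under `b64_json`, the images could be decoded along these lines. Both the response shape and the helper name `decode_images` are assumptions, not something this page confirms.

```python
import base64


def decode_images(response: dict) -> list:
    """Return the decoded bytes of every entry that carries a b64_json field.

    Assumption: the response looks like {"data": [{"b64_json": "..."}, ...]}.
    Entries without b64_json (for example URL-only entries) are skipped.
    """
    images = []
    for item in response.get("data", []):
        if "b64_json" in item:
            images.append(base64.b64decode(item["b64_json"]))
    return images
```

Each returned `bytes` object can then be written to disk, for instance with `pathlib.Path("out.png").write_bytes(image)`.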

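Other available models:

Model                        Characteristics
gpt-image-1                  Official OpenAI model; renders text well
flux-1.1-pro / flux-kontext  Strong at photorealism and editing
dall-e-3                     Wide range of styles
mj_imagine                   Midjourney proxy; must go through the /mj/ endpoints
nano-banana                  Gemini image editing; low cost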

Image editing / inpainting

POST /v1/images/edits
curl https://tttoken.xyz/v1/images/edits \
  -H "Authorization: Bearer $TTT_KEY" \
  -F model=gpt-image-1 \
  -F image=@room.png \
  -F mask=@mask.png \
  -F prompt="Replace the sofa with a brown leather one"
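
The `mask.png` file marks the region to repaint. In the OpenAI Images API, fully transparent pixels mark the area to edit and the mask must match the image's dimensions; whether TTToken applies the same convention for every model is an assumption here. Under that assumption, a rectangular mask can be produced with the Python standard library alone (the function names are hypothetical):

```python
import struct
import zlib


def _png_chunk(kind: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_mask_png(width: int, height: int, box: tuple) -> bytes:
    """Return an RGBA PNG that is opaque black except for a transparent box.

    box is (left, top, right, bottom) in pixels, right and bottom exclusive.
    Assumption: transparent pixels mark the region the model may repaint.
    """
    left, top, right, bottom = box
    opaque = b"\x00\x00\x00\xff"
    clear = b"\x00\x00\x00\x00"
    rows = bytearray()
    for y in range(height):
        rows.append(0)  # filter type 0 (None) at the start of each scanline
        for x in range(width):
            inside = left <= x < right and top <= y < bottom
            rows += clear if inside else opaque
    # IHDR: 8-bit depth, color type 6 (RGBA), default compression/filter, no interlace
    header = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", header)
        + _png_chunk(b"IDAT", zlib.compress(bytes(rows)))
        + _png_chunk(b"IEND", b"")
    )
```

For example, `make_mask_png(1024, 1024, (300, 500, 800, 900))` would expose a rectangle in a 1024x1024 image; the coordinates are placeholders.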

Midjourney compatibility

POST /mj/submit/imagine

Compatible with the midjourney-proxy protocol:

{
  "prompt": "a cute corgi sitting on grass --ar 16:9 --v 6.1",
  "botType": "MID_JOURNEY"
}

The response contains a taskId; then poll GET /mj/task/{id}/fetch.
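
A polling loop for this flow might look like the Python sketch below. The HTTP call is passed in as `fetch_task` so the loop stays independent of any particular client. The terminal status values `SUCCESS` and `FAILURE` are assumed from the midjourney-proxy project and are not stated on this page; the function name and defaults are hypothetical.

```python
import time

# Assumption: terminal task states as used by midjourney-proxy.
TERMINAL_STATUSES = {"SUCCESS", "FAILURE"}


def poll_task(fetch_task, task_id, interval=3.0, timeout=300.0, sleep=time.sleep):
    """Call fetch_task(task_id) repeatedly until the task reaches a terminal status.

    fetch_task should perform GET /mj/task/{id}/fetch and return the parsed JSON.
    Raises TimeoutError if no terminal status is seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_task(task_id)
        if task.get("status") in TERMINAL_STATUSES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop easy to test: a test can supply a no-op instead of waiting in real time.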

Video generation

POST /v1/videos
{
  "model": "sora-2",
  "prompt": "An orange cat wearing sunglasses, surfing",
  "duration": 5,
  "size": "1280x720"
}

Supports Sora, Veo, Kling, Runway, and other models. The response contains a task_id; query the result via /v1/videos/{task_id} and download the mp4 from /v1/videos/{task_id}/content.
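
The two follow-up paths can be derived from the task_id as in the Python sketch below. It assumes the video endpoints live on the same host (`https://tttoken.xyz`) and accept the same Bearer token as the curl examples elsewhere on this page; neither is stated for videos specifically, and the function names are hypothetical.

```python
import shutil
import urllib.parse
import urllib.request

API_BASE = "https://tttoken.xyz"  # assumption: same host as the other endpoints


def video_urls(task_id: str) -> dict:
    """Return the status and download URLs for a video task."""
    safe_id = urllib.parse.quote(task_id, safe="")
    return {
        "status": f"{API_BASE}/v1/videos/{safe_id}",
        "content": f"{API_BASE}/v1/videos/{safe_id}/content",
    }


def download_video(task_id: str, api_key: str, path: str, timeout: float = 300.0) -> None:
    """Stream the finished mp4 to a local file."""
    request = urllib.request.Request(
        video_urls(task_id)["content"],
        headers={"Authorization": f"Bearer {api_key}"},  # assumption: Bearer auth
    )
    with urllib.request.urlopen(request, timeout=timeout) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
```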

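Multimodal embeddings

Some models (such as voyage-multimodal-3 and gemini-embedding-001) can embed images and text into a shared vector space, which is useful for image-text retrieval.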
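
Once text and images share a vector space, retrieval reduces to ranking by vector similarity. The Python sketch below uses cosine similarity on small made-up vectors; real embeddings would come from one of the models above and have far more dimensions. The function names and sample vectors are illustrative only.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(query_vector, image_vectors: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in image_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors, not real model output.
query = [1.0, 0.0, 0.0]
images = {"cat.jpg": [0.9, 0.1, 0.0], "car.jpg": [0.0, 1.0, 0.0]}
ranking = rank_images(query, images)
```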