Support HTTP/HTTPS image URLs in training metadata #1485

@firefighter-eric

Description

Background

The current DiffSynth training dataset pipeline appears to only support local
image paths. For example, in the Qwen-Image-Layered training entry, image and
layer_input_image are processed through:

ToAbsolutePath(args.dataset_base_path) >> LoadImage()

ToAbsolutePath directly joins the path:

os.path.join(self.base_path, data)

and LoadImage directly loads a local file:

Image.open(data)

As a result, if the metadata contains HTTP/HTTPS image URLs, such as:

{
  "prompt": "a poster",
  "image": "https://example.com/image.png",
  "layer_input_image": "https://example.com/source.png"
}

the current logic turns it into something like
data/https://example.com/image.png, then passes it to PIL.Image.open() as a
local file path, which fails.
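
The path mangling described above can be reproduced in isolation. This is a minimal sketch, assuming dataset_base_path is "data" (a placeholder value) and using posixpath.join, which is what os.path.join resolves to on Linux and macOS:

```python
import posixpath

base_path = "data"  # placeholder for args.dataset_base_path
url = "https://example.com/image.png"

# The URL does not start with "/", so join treats it as a relative path
# and prepends the base directory.
joined = posixpath.join(base_path, url)
print(joined)  # data/https://example.com/image.png
```

No file exists at that joined path, so opening it as a local file cannot succeed.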

Expected Behavior

It would be useful for training metadata to support HTTP/HTTPS image URLs:

  • Local relative paths should keep the existing behavior and be joined with
    dataset_base_path.
  • Absolute local paths should continue to work.
  • http:// and https:// values should not be joined with dataset_base_path;
    they should be downloaded and loaded as images.
  • List-valued image fields should also support URLs, such as image: [url1, url2, ...].
  • Optional local caching would be helpful to avoid repeated downloads across
    epochs or dataloader workers.
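
The path-resolution rules in this list could look roughly like the sketch below. The helper names is_url and resolve are hypothetical and are not part of DiffSynth; the actual ToAbsolutePath operator may be structured differently.

```python
import os
from urllib.parse import urlparse


def is_url(value):
    """Return True for strings with an http or https scheme."""
    return isinstance(value, str) and urlparse(value).scheme in ("http", "https")


def resolve(base_path, data):
    """Hypothetical helper: resolve one metadata value, or a list of them."""
    if isinstance(data, (list, tuple)):
        # List-valued fields such as image: [url1, url2, ...]
        return [resolve(base_path, item) for item in data]
    if is_url(data):
        # URLs are passed through untouched, never joined with base_path.
        return data
    # os.path.join returns `data` unchanged when it is an absolute path,
    # so absolute local paths keep working.
    return os.path.join(base_path, data)
```

For example, resolve("data", ["a.png", "https://example.com/b.png"]) joins the first entry with the base path and returns the second entry unchanged.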

Use Case

Many training datasets store images in object storage, CDNs, or internal image
services. URL support would make it easier to reuse existing manifests without
materializing the entire image dataset locally before training.

Suggested Implementation

One possible approach:

  1. Detect http:// / https:// in ToAbsolutePath and return URL values
    unchanged.
  2. Detect URLs in LoadImage, download bytes using requests or urllib, and
    load via Image.open(BytesIO(...)).
  3. Keep the existing Image.open(path) behavior for local files.
  4. Add optional arguments such as cache directory, timeout, retry count, and
    whether URL loading is enabled.
  5. Include the failing URL and the underlying exception in error messages when
    download or decoding fails.
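
Steps 2 to 5 might be sketched as follows. This is an illustration under stated assumptions, not DiffSynth code: fetch_image_bytes, load_image, and every parameter name (cache_dir, timeout, retries, backoff, fetch) are hypothetical, and urllib is used only to keep the sketch free of extra dependencies.

```python
import hashlib
import os
import time
import urllib.request
from io import BytesIO


def _http_get(url, timeout):
    """Default fetcher: download the response body with urllib."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_image_bytes(url, cache_dir=None, timeout=10.0, retries=3,
                      backoff=0.5, fetch=_http_get):
    """Hypothetical helper: download bytes with optional on-disk caching."""
    cache_path = None
    if cache_dir is not None:
        os.makedirs(cache_dir, exist_ok=True)
        # Hash the URL so any URL maps to a safe file name.
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        cache_path = os.path.join(cache_dir, key)
        if os.path.exists(cache_path):
            with open(cache_path, "rb") as f:
                return f.read()

    last_error = None
    for attempt in range(retries):
        try:
            payload = fetch(url, timeout)
            break
        except Exception as exc:
            last_error = exc
            time.sleep(backoff * (attempt + 1))
    else:
        # Every attempt failed: report the URL and the underlying cause.
        raise RuntimeError(
            f"Failed to download image from {url}: {last_error!r}"
        ) from last_error

    if cache_path is not None:
        # Write to a per-process temp file, then rename atomically, so
        # parallel dataloader workers never read a half-written file.
        tmp_path = f"{cache_path}.{os.getpid()}.tmp"
        with open(tmp_path, "wb") as f:
            f.write(payload)
        os.replace(tmp_path, cache_path)
    return payload


def load_image(data, **fetch_kwargs):
    """Hypothetical LoadImage replacement: URL or local path to PIL image."""
    from PIL import Image  # imported lazily; requires Pillow

    if data.startswith(("http://", "https://")):
        payload = fetch_image_bytes(data, **fetch_kwargs)
        try:
            return Image.open(BytesIO(payload))
        except Exception as exc:
            raise RuntimeError(
                f"Failed to decode image from {data}: {exc!r}"
            ) from exc
    return Image.open(data)  # existing local-file behavior
```

Keying the cache on a hash of the URL is one design choice among several; it assumes the content behind a URL does not change during training. The injectable fetch argument is only there so the download path can be exercised without network access.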
