ICDAR 2025 Competition on End-to-End Document Image Machine Translation Towards Complex Layouts



Document images such as scans and PDF renderings are important carriers of human knowledge. With the advancement of digitalization, techniques for automatically recognizing, understanding, and translating document images have become a crucial part of digital transformation. Among them, Document Image Machine Translation (DIMT) is not only necessary for real-world communication but also essential for many downstream tasks, such as cross-lingual document retrieval, summarization, and information extraction. Over the past few years, DIMT technology has achieved remarkable progress with the development of deep learning. However, existing technologies still struggle to meet the demands of practical applications, mainly for the following reasons: (1) Multi-modality and cross-linguality: due to their inherently multi-modal nature, real-world document images often involve an intricate combination of complex layouts, dense text, and visually rich elements, making comprehensive understanding and cross-lingual translation difficult. (2) Image and text noise: many factors, such as image defects or OCR errors, introduce noise into the image or text given to the model as input, posing additional challenges to a DIMT system. (3) Lack of samples and a unified benchmark: due to high annotation costs and differing annotation protocols, existing datasets often suffer from insufficient samples, inconsistent labels, and inconsistent evaluation metrics, so reported model performances are not directly comparable.

To advance the research field of DIMT, we launch the Document Image Machine Translation challenge (DIMT25@ICDAR). This challenge aims to provide adequate samples with standardized annotations and metrics to establish benchmark results with clear, replicable settings. The DIMT25@ICDAR challenge focuses on the translation of real-world complex-layout document images, covering two document domains: web documents and academic articles.

News

December 10, 2024: We established the initial website for DIMT25@ICDAR.

January 10, 2025: We released the training data and the baseline code.

Schedule

Event | Date
Competition website available | December 10, 2024
Training data and baseline code available | January 10, 2025
Test data release | February 20, 2025
Submission site opens | March 20, 2025
Deadline for competition submissions | April 10, 2025
Deadline for competition reports | April 20, 2025

DIMT2025 CHALLENGE@ICDAR

Based on the input type, we set up two tracks: 1) OCR-based DIMT, where the model input includes the image and its OCR results (words and word bounding boxes); 2) OCR-free DIMT, where the model input is the image alone. In both tracks, systems are given English document images and are required to translate them into Chinese.


Track 1. OCR-based DIMT-LLM. This sub-track evaluates the performance of LLM-based methods on the OCR-based DIMT task. Participants must use large language models (LLMs) with over 1 billion parameters. Open-source LLMs can be utilized, and participants are allowed to fine-tune these models to improve performance. The number of parameters in the model used during inference must be specified in the submitted report.

Track 1. OCR-based DIMT-Small. This sub-track evaluates the performance of small-model-based methods on the OCR-based DIMT task. Participants are only allowed to use small models with fewer than 1 billion parameters, focusing on optimizing these smaller models for accurate translation and reordering. The number of parameters in the model used during inference must be specified in the submitted report.

Track 2. OCR-free DIMT-LLM. This sub-track evaluates the performance of LLM-based methods on the OCR-free DIMT task. Participants must use large language models (LLMs) with over 1 billion parameters. Open-source LLMs can be utilized and fine-tuned to handle complex layouts and long-context translation. The number of parameters in the model used during inference must be specified in the submitted report.

Track 2. OCR-free DIMT-Small. This sub-track evaluates the performance of small-model-based methods on the OCR-free DIMT task. Participants are only permitted to use small models with fewer than 1 billion parameters, focusing on optimizing these smaller models to handle complex layouts and long contexts. The number of parameters in the model used during inference must be specified in the submitted report.
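For reference, the 1-billion-parameter boundary between the LLM and Small sub-tracks can be checked programmatically. The sketch below assumes a Hugging Face Transformers checkpoint; the checkpoint name is only a placeholder, not an official baseline.

```python
# Hedged sketch: count the parameters of a candidate model to check whether it
# falls under the "Small" (<1B) or "LLM" (>1B) sub-track.
# "your-org/your-checkpoint" is a placeholder, not an official baseline model.
from transformers import AutoModel

model = AutoModel.from_pretrained("your-org/your-checkpoint")  # placeholder
num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {num_params:,}")
print("Sub-track:", "LLM (>1B)" if num_params > 1_000_000_000 else "Small (<1B)")
```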



Evaluation Metrics. We will provide evaluation scripts for both tracks. For Track 1, the metric is document-level BLEU, computed on both the translated target-language text and the reordered source-language text. For Track 2, the metric is document-level BLEU on the translated target-language text in Markdown format.
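As an illustration of the metric, document-level BLEU can be computed by treating each document's full output as a single segment. The sketch below uses sacrebleu under that assumption; the official evaluation scripts may differ in tokenization and file format, so this is only an approximation.

```python
# Hedged sketch: document-level BLEU, where each document's full text is
# scored as one segment. The official scripts may use a different setup;
# this only illustrates the general idea.
import sacrebleu

# One entry per document (placeholder strings).
hypotheses = ["模型翻译的完整文档文本……", "……"]
references = ["人工参考的完整文档文本……", "……"]

# tokenize="zh" applies sacrebleu's Chinese tokenizer to the target side.
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")
print(f"Document-level BLEU: {bleu.score:.2f}")
```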


Dataset. The competition dataset statistics are summarized in Table 1. For Track 1, we provide a training set containing 300K images and a validation set of 1K images. All images are converted from open-source web documents available on the internet. Each image is accompanied by its OCR results (words and word bounding boxes), word-level reading order indices, and sentence-level and document-level translations. For Track 2, we provide a training set containing 124K images and a validation set of 1K images. All images are converted from PDF and LaTeX files crawled from arXiv. Each image is accompanied by the corresponding source-language text and target-language text in Markdown format. For details of the dataset, please refer to our work [1, 2].

Table 1: Statistical information of the DIMT 2025 challenge.
Track | Dataset | # of Examples (Train / Valid / Test)
Track 1 | DIMT-WebDoc-300K | 300K / 1K / 1K
Track 2 | DIMT-arXiv-124K | 124K / 1K / 1K
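To make the Track 1 annotation fields described above concrete, a single training example might be organized along the lines below. The field names and value formats are illustrative assumptions only; the exact schema is defined by the released data files.

```python
# Hedged illustration of the annotation fields described for Track 1
# (OCR words with bounding boxes, word-level reading order, and
# sentence-/document-level translations). Field names and formats are
# assumptions for illustration, not the official schema of the release.
example = {
    "image": "webdoc_000001.png",  # document image file (placeholder name)
    "ocr_words": [
        {"text": "Hello", "bbox": [120, 56, 180, 78], "reading_order": 0},
        {"text": "world", "bbox": [186, 56, 240, 78], "reading_order": 1},
    ],
    "sentence_translations": [
        {"source": "Hello world", "target": "你好，世界"},
    ],
    "document_translation": "你好，世界",
}
```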

Result submission.

CodaLab link for DIMT25@ICDAR-OCR-based(Reorder-LLM): https://codalab.lisn.upsaclay.fr/competitions/21826

CodaLab link for DIMT25@ICDAR-OCR-based(Reorder-Small): https://codalab.lisn.upsaclay.fr/competitions/21827

CodaLab link for DIMT25@ICDAR-OCR-based(Translation-LLM): https://codalab.lisn.upsaclay.fr/competitions/21829

CodaLab link for DIMT25@ICDAR-OCR-based(Translation-Small): https://codalab.lisn.upsaclay.fr/competitions/21830

CodaLab link for DIMT25@ICDAR-OCR-free(Translation-LLM): https://codalab.lisn.upsaclay.fr/competitions/21832

CodaLab link for DIMT25@ICDAR-OCR-free(Translation-Small): https://codalab.lisn.upsaclay.fr/competitions/21833

Paper submission. All participants are encouraged to submit a paper describing their solution to dimt2025.contact@gmail.com. The top-5 teams in each track MUST submit a method description.


Please download the End User License Agreement, fill it out, and send it to dimt2025.contact@gmail.com to access the data. We will review your application and get in touch as soon as possible.
Track 1 Baseline Code & Train, Test Dataset: https://huggingface.co/datasets/zhangzhiyang/DIMT2025.ICDAR.Track_1
Track 2 Baseline Code & Test Dataset: https://huggingface.co/liangyupu/DIMT2025.ICDAR.Track_2
Track 2 Train Dataset: https://huggingface.co/datasets/liangyupu/DoTA_dataset
Contact email: dimt2025.contact@gmail.com
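Once access has been granted, the released repositories listed above can be fetched with the huggingface_hub client, as in the hedged sketch below; the local directories are placeholders.

```python
# Hedged sketch: download the released repositories with huggingface_hub.
# Access is granted only after the signed EULA has been approved; local
# directories below are placeholders.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="zhangzhiyang/DIMT2025.ICDAR.Track_1",
                  repo_type="dataset", local_dir="./dimt2025_track1")
snapshot_download(repo_id="liangyupu/DIMT2025.ICDAR.Track_2",
                  local_dir="./dimt2025_track2")
snapshot_download(repo_id="liangyupu/DoTA_dataset",
                  repo_type="dataset", local_dir="./dota_dataset")
```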



[1] Z. Zhang, Y. Zhang, Y. Liang, L. Xiang, Y. Zhao, Y. Zhou, and C. Zong. LayoutDIT: Layout-aware end-to-end document image translation with multi-step conductive decoder. In Proc. of EMNLP Findings, 2023, pp. 10043–10053.
[2] Y. Liang, Y. Zhang, C. Ma, Z. Zhang, Y. Zhao, L. Xiang, C. Zong, and Y. Zhou. Document image machine translation with dynamic multi-pre-trained models assembling. In Proc. of NAACL, 2024, pp. 7084–7095.

ORGANISERS

 

Chengqing Zong

Institute of Automation, Chinese Academy of Sciences (CASIA)

 

Yaping Zhang

Institute of Automation, Chinese Academy of Sciences (CASIA)

 

Yang Zhao

Institute of Automation, Chinese Academy of Sciences (CASIA)

 

Lu Xiang

Institute of Automation, Chinese Academy of Sciences (CASIA)

 

Yu Zhou

Institute of Automation, Chinese Academy of Sciences (CASIA)

 

Zhiyang Zhang

University of Chinese Academy of Sciences

 
 

Yupu Liang

University of Chinese Academy of Sciences

 

Zhiyuan Chen

University of Chinese Academy of Sciences

 

Cite Us

@inproceedings{zhang-etal-2023-layoutdit,
                title = "{L}ayout{DIT}: Layout-Aware End-to-End Document Image Translation with Multi-Step Conductive Decoder",
                author = "Zhang, Zhiyang  and
                Zhang, Yaping  and
                Liang, Yupu  and
                Xiang, Lu  and
                Zhao, Yang  and
                Zhou, Yu  and
                Zong, Chengqing",
                editor = "Bouamor, Houda  and
                Pino, Juan  and
                Bali, Kalika",
                booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
                month = dec,
                year = "2023",
                address = "Singapore",
                publisher = "Association for Computational Linguistics",
                url = "https://aclanthology.org/2023.findings-emnlp.673",
                doi = "10.18653/v1/2023.findings-emnlp.673",
                pages = "10043--10053",
                }
              
@inproceedings{liang-etal-2024-document,
                title = "Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling",
                author = "Liang, Yupu  and
                Zhang, Yaping  and
                Ma, Cong  and
                Zhang, Zhiyang  and
                Zhao, Yang  and
                Xiang, Lu  and
                Zong, Chengqing  and
                Zhou, Yu",
                editor = "Duh, Kevin  and
                Gomez, Helena  and
                Bethard, Steven",
                booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
                month = jun,
                year = "2024",
                address = "Mexico City, Mexico",
                publisher = "Association for Computational Linguistics",
                url = "https://aclanthology.org/2024.naacl-long.392",
                doi = "10.18653/v1/2024.naacl-long.392",
                pages = "7084--7095",
                }