This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning Through Vision-to-Code Alignment

Multimodal mathematical reasoning enables machines to solve problems that involve both textual information and visual components such as diagrams and figures. This requires combining language understanding with visual interpretation to make sense of complex mathematical content. Such capability is vital in education, automated tutoring, and document analysis, where problems frequently mix text and images.

The main obstacle in this field is the lack of high-quality, precise alignment between mathematical images and their textual or symbolic representations. Most datasets used to train large multimodal models are built from image captions collected in natural settings, which often miss the detailed elements essential for mathematical accuracy. This creates problems for models that rely on such data, making them unreliable when dealing with geometric shapes, figures, or technical diagrams. A model's performance in mathematical reasoning depends heavily on its ability to correctly interpret these visual details and associate them with mathematical expressions or instructions.

Previously, some methods tried to address this problem by enhancing the visual encoder or using manually constructed datasets. However, these approaches tend to rely on hand-coded rules or template-based generation, which yields low image diversity and limits their applicability. Some efforts, such as Math-LLaVA and MAVIS, developed synthetic datasets using templates or predefined categories. Nevertheless, they could not dynamically generate a wide variety of math-related visuals. This shortfall limits what models can learn and weakens them on more complex or less structured mathematical problems.

Researchers at the Multimedia Laboratory of The Chinese University of Hong Kong and CPII under InnoHK introduced a new approach called MathCoder-VL. This method pairs a vision-to-code model called FigCodifier with a synthetic data engine. Using a model-in-the-loop strategy, they built the ImgCode-8.6M dataset, the largest image-to-code dataset to date. In addition, they developed MM-MathInstruct-3M, a multimodal instruction dataset with newly synthesized images. The MathCoder-VL model is trained in two stages: mid-training on ImgCode-8.6M to improve visual-text alignment, and fine-tuning on MM-MathInstruct-3M to strengthen reasoning capabilities.
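The two-stage recipe can be sketched in miniature as follows. This is an illustrative outline only: the `train` helper, the model dictionary, and the objective names are assumptions for exposition, not the authors' actual training code.

```python
def train(model, dataset, objective):
    """Placeholder for one training phase: record which dataset and
    objective the model was trained on, in order."""
    model["stages"].append((dataset, objective))
    return model

# Hypothetical stand-in for the base vision-language model.
model = {"name": "base-VLM", "stages": []}

# Stage 1: mid-training on image-code pairs to align visual and textual
# representations (the paper's ImgCode-8.6M dataset).
model = train(model, "ImgCode-8.6M", objective="image-to-code")

# Stage 2: instruction fine-tuning for multimodal math reasoning
# (the paper's MM-MathInstruct-3M dataset).
model = train(model, "MM-MathInstruct-3M", objective="instruction-following")

print([dataset for dataset, _ in model["stages"]])
# ['ImgCode-8.6M', 'MM-MathInstruct-3M']
```

The key design point is the ordering: alignment between images and code is established first, so the later instruction tuning builds on a model that already grounds visual details precisely.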

The FigCodifier model works by converting mathematical figures into code that can accurately recreate them. Unlike caption-based datasets, this image-code pairing ensures strict alignment and accuracy. The process starts from DaTikZ's 119K image-code pairs and is extended through iterative training on images collected from textbooks, K12 datasets, and arXiv papers. The final dataset contains 8.6 million image-code pairs covering a wide range of mathematical topics. FigCodifier also supports Python-based rendering, which adds diversity to image generation. The pipeline filters out low-quality data by checking code validity and removing redundant or uninformative visuals, yielding 4.3 million high-quality TikZ-based and 4.3 million Python-based pairs.
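The validity-filtering step can be illustrated with a minimal sketch. The function names and the `exec`-based check below are illustrative assumptions, not the authors' pipeline, which would actually render TikZ via LaTeX or Python via a plotting library and also deduplicate the resulting images:

```python
import contextlib
import io

def is_renderable(code: str) -> bool:
    """Crude stand-in for a code-validity check: return True if the
    candidate figure code executes without raising. A real pipeline
    would compile TikZ with LaTeX or run matplotlib and inspect the
    rendered image."""
    try:
        # Suppress any stdout the snippet produces while executing it
        # in a fresh, isolated namespace.
        with contextlib.redirect_stdout(io.StringIO()):
            exec(compile(code, "<candidate>", "exec"), {})
        return True
    except Exception:
        return False

def filter_pairs(pairs):
    """Keep only (image_id, code) pairs whose code is valid."""
    return [(img, code) for img, code in pairs if is_renderable(code)]

# Hypothetical candidate pairs produced by an image-to-code model.
candidates = [
    ("fig1", "points = [i**2 for i in range(5)]"),  # executes cleanly
    ("fig2", "points = ["),                          # syntax error: dropped
]
kept = filter_pairs(candidates)
print(len(kept))  # 1
```

Filtering on executability is cheap and removes the most common failure mode of generated figure code, which is why it is a natural first gate before more expensive image-level deduplication.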

Performance evaluation shows that MathCoder-VL outperforms multiple open-source models. The 8B version reached 73.6% accuracy on the MathVista Geometry Problem Solving subset, surpassing GPT-4o and Claude 3.5 Sonnet by 8.9% and 9.2%, respectively. It also scored 26.1% on MATH-Vision and 46.5% on MathVerse. On Chinese benchmarks, it achieved 51.2% on GAOKAO-MM. On the We-Math benchmark, it solved two-step problems at 58.6%, outperforming GPT-4o's 58.1%, and three-step problems at 52.1%, again surpassing GPT-4o's 43.6%. Compared with its base model, InternVL2-8B, it improved by 6.1% on MATH-Vision and 11.6% on MathVista.

This work clearly defines the problem of insufficient visual-text alignment in multimodal mathematical reasoning and provides a scalable, innovative solution. By introducing an image-to-code framework and synthetic datasets, it allows models to learn from accurate, diverse visuals paired with precise code, significantly improving their reasoning capabilities. MathCoder-VL represents a practical advance in the field, demonstrating how thoughtful model design and high-quality data can overcome long-standing limitations in mathematical AI.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
