Timon Harz

December 12, 2024

VisOnlyQA: New Dataset for Evaluating Visual Perception in Large Vision-Language Models (LVLMs)

Discover how the VisOnlyQA dataset is reshaping the evaluation of LVLMs' visual perception abilities. Explore the dataset's role in tackling challenges in geometric and numerical perception tasks.

Large Vision Language Models (LVLMs) have made notable strides in handling various complex multi-modal tasks in recent years. Their ability to interpret visual information, known as visual perception, depends on visual encoders and multimodal training. Despite these advancements, visual perception errors remain a significant issue, affecting LVLMs' understanding of crucial image details needed for task execution.

Popular datasets for evaluating LVLMs, such as MMMU and MathVista, do not prioritize visual perception. Instead, they focus on tasks requiring expert-level reasoning and domain knowledge. As a result, performance on these datasets is influenced by a range of factors, and assessing visual perception directly is challenging. Most benchmarks for multimodal reasoning in scientific figures are confined to expert-level application domains, overlooking the importance of visual perception. General visual perception datasets, which focus on tasks like OCR, depth estimation, and counting, often lack fine-grained detail due to the difficulty of crafting specific questions. While some datasets do assess LVLMs’ visual perception, they tend to focus on broader concepts, such as scene understanding, rather than finer visual details.

To address this gap, a team of researchers from Penn State University introduced VisOnlyQA, a new dataset specifically designed to evaluate the visual perception capabilities of LVLMs on geometric and numerical information in scientific figures. VisOnlyQA targets fine-grained visual details and offers a direct, objective assessment of LVLMs' visual perception. It addresses the limitations of previous datasets by generating synthetic figures from scripts, ensuring diversity and precision. Questions are either manually annotated or generated automatically from metadata and templates, eliminating the need for complex reasoning or domain expertise. The dataset includes three splits: Eval-Real, Eval-Synthetic, and Train, with balanced labels and high annotation quality, verified through human performance accuracy of 93.5% to 95%.

The study assesses the performance of 20 open-source and proprietary LVLMs on the VisOnlyQA dataset, specifically examining their visual perception abilities. The models were tested on tasks related to geometry, chemistry, and chart analysis, both with and without chain-of-thought reasoning. The results revealed that the models significantly underperformed compared to humans, achieving average accuracies of just 54.2% on real-world data and 42.4% on synthetic data—well below human performance, which was 93.5% and 95.0%, respectively.
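To make the evaluation setup concrete, below is a minimal sketch of how accuracy on multiple-choice visual perception questions might be computed, with and without chain-of-thought prompting. The model interface and field names are assumptions for illustration, not the authors' actual evaluation code.

```python
# Minimal sketch of multiple-choice evaluation for an LVLM.
# The `model.answer` API and the field names (image, question, options, answer)
# are hypothetical stand-ins, not the VisOnlyQA authors' code.

def evaluate(model, examples, use_cot=False):
    """Return the accuracy of `model` on a list of multiple-choice examples."""
    correct = 0
    for ex in examples:
        prompt = ex["question"] + "\nOptions: " + ", ".join(ex["options"])
        if use_cot:
            prompt += "\nThink step by step, then give only the final option letter."
        prediction = model.answer(image=ex["image"], prompt=prompt)  # assumed API
        correct += int(prediction.strip().lower() == ex["answer"].strip().lower())
    return correct / len(examples)

# Hypothetical usage:
# acc_direct = evaluate(lvlm, eval_real, use_cot=False)
# acc_cot = evaluate(lvlm, eval_real, use_cot=True)
```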

Many models, including Phi-3.5-Vision and LLaVA-Next, performed almost at random on tasks such as Geometry-Triangle and Charts Intersection, highlighting that current LVLMs still struggle with visual perception tasks involving geometric and numerical data. The analysis also found that chain-of-thought reasoning did not consistently improve performance and, in some cases, even hurt it. This underscores that VisOnlyQA is designed to assess visual perception directly, without relying on reasoning. Error analysis revealed that most mistakes stemmed from visual perception failures, further validating the dataset's focus on evaluating this capability.

Large Vision Language Models (LVLMs) have seen remarkable progress, particularly in their ability to handle complex, multi-modal tasks. These models combine the power of visual and language processing, enabling them to understand and generate both text and images effectively. This advancement is particularly notable in applications requiring an understanding of intricate visual contexts alongside text, such as image captioning, visual question answering, and multi-modal reasoning tasks.

Recent research has highlighted several key strides. For example, LVLMs are now better at performing tasks that require reasoning across both images and text, overcoming previous limitations where models were predominantly focused on text-based tasks or image classification. They have been evaluated on comprehensive benchmarks that test their performance across a range of multi-modal tasks, helping to better measure their general perception and reasoning capabilities. However, current models still face challenges in areas such as safety and data leakage during training, requiring more robust solutions to ensure fair evaluation and responsible deployment.

In 2024, further developments have seen LVLMs improve in multi-step reasoning and in handling tasks that require a more intricate understanding of the relationship between textual and visual inputs. As these models continue to evolve, their potential for revolutionizing various fields—ranging from education and healthcare to creative industries—becomes more apparent.

Visual perception is the process by which a system interprets and understands visual information from the environment. For Large Vision-Language Models (LVLMs), visual perception involves interpreting fine-grained details within images, which is crucial for accurate analysis and understanding. LVLMs, like those used in document processing, rely on the integration of visual and textual data to make sense of complex images and documents. However, these models often struggle with understanding detailed geometric and numerical relationships within images, especially when fine-grained visual information is required, such as recognizing object boundaries or spatial arrangements.

The importance of visual perception in LVLMs lies in their ability to accurately interpret images, not just at a surface level but also by understanding nuances that contribute to their overall meaning. For instance, in document understanding, LVLMs use contrastive learning to enhance their visual perception abilities by aligning visual features with textual content. Despite improvements in these models, challenges remain, particularly in tasks that require precise spatial or geometric reasoning.

Ultimately, the more sophisticated the visual perception in LVLMs, the better these models can handle intricate tasks like visual document understanding, which requires accurate identification and interpretation of both visual and textual elements. This capability is critical for advancing tasks such as object detection, document layout analysis, and even recognizing visual patterns in scientific figures.

Despite significant progress in vision-language models (LVLMs), persistent challenges remain, particularly with visual perception. These models, though increasingly adept at visual recognition, continue to struggle with hallucinations—instances where the model generates incorrect or irrelevant information. This issue is most noticeable when LVLMs are tasked with reasoning beyond straightforward image descriptions, such as analyzing charts, mathematical problems, or other non-real-world visuals. While advancements in model architecture and hallucination mitigation techniques have improved performance on visual recognition tasks, these models still perform poorly in complex, reasoning-based contexts.

Moreover, LVLMs face difficulties with "unanswerable" questions, which are common in visual question answering (VQA) tasks. These questions often arise from insufficient visual context or from images that lack clear answers. As such, while progress has been made in handling basic visual recognition, the integration of sophisticated reasoning and understanding across diverse visual contexts remains a significant hurdle. These persistent issues highlight the gap in LVLMs' ability to fully comprehend and interpret images in a nuanced, human-like manner.

Existing popular datasets like MMMU and MathVista play a critical role in evaluating the capabilities of Large Vision Language Models (LVLMs) in visual perception tasks, especially in contexts that require mathematical reasoning and deep visual understanding. However, they do have several limitations.

1. Scope and Complexity of Tasks:
While MathVista covers diverse challenges in mathematical and visual tasks (e.g., geometric reasoning and mathematical word problems), it still poses significant difficulty for state-of-the-art models. For instance, even the best-performing model, GPT-4V, achieves only about 50% accuracy on MathVista, far below human-level performance, indicating that current models struggle with complex visual figures and compositional reasoning (MathVista, 2023). Similarly, MMMU, which focuses on multimodal reasoning tasks, also reveals gaps in visual understanding, particularly for tasks involving deep visual interpretation and contextual reasoning, which models often fail to fully grasp.

2. Generalization across Domains:
Datasets like MathVista and MMMU primarily target specific domains like mathematics and scientific figures. While they push forward advancements in these areas, they fail to cover the full breadth of visual perception challenges found in everyday contexts. For example, VisOnlyQA, another dataset designed for geometric information, highlights that LVLMs still struggle with understanding basic geometric shapes and spatial relationships in scientific figures (Kamoi et al., 2024). This means that both MathVista and MMMU may not generalize well to real-world, everyday visual tasks where visual and contextual nuances play a crucial role.

3. Limited Representation of Real-World Data:
Another limitation is the synthetic nature of many test cases. Datasets like VisOnlyQA include both synthetic and real-world data, and model performance differs markedly between the two; on VisOnlyQA, models actually score lower on the synthetic figures than on the real ones (Kamoi et al., 2024). More broadly, benchmarks built predominantly from idealized scenarios fail to capture the variability and ambiguity found in everyday visual perception tasks, limiting their ability to assess the real-world capabilities of LVLMs.

While general visual perception datasets are useful for evaluating certain aspects of Large Vision-Language Models (LVLMs), they often lack the granularity needed to fully assess these models' true visual capabilities. For example, datasets like those commonly used for object recognition or classification provide high-level tasks, but they do not always address the subtle, detailed visual understanding that is crucial in real-world applications.

Recent research, such as the introduction of the VisOnlyQA and MVP-Bench datasets, highlights the limitations of current datasets. VisOnlyQA specifically tests LVLMs on geometric and numerical information in scientific figures, providing fine-grained insights into their visual perception abilities. However, even state-of-the-art models like GPT-4o perform poorly on such tasks, particularly when it comes to tasks that require precise visual interpretation rather than general object identification.

Furthermore, the MVP-Bench dataset expands the evaluation by assessing both low-level object recognition and high-level semantic interpretation, such as understanding behaviors or subtle changes in visual details (e.g., replacing a shopping bag with a gun to infer violent intent). This kind of evaluation has revealed significant gaps in LVLMs' ability to handle complex visual information, especially when dealing with manipulated or synthetic images.

Therefore, while current datasets provide valuable insights, they often fall short of measuring the depth and precision needed to assess LVLMs' full potential in understanding visual context and semantics at the level humans do. More specialized datasets, such as VisOnlyQA and MVP-Bench, are crucial in pushing the boundaries of what these models can truly understand in visual tasks.

VisOnlyQA: A New Approach to Evaluating Visual Perception

The VisOnlyQA dataset, introduced by researchers from Penn State University, is designed to evaluate the visual perception abilities of large vision-language models (LVLMs) on geometric and numerical information in scientific figures. This dataset aims to address the challenge of understanding fine-grained visual details in images—something that current LVLMs still struggle with.

VisOnlyQA consists of 1,200 multiple-choice questions across 12 tasks, spanning four categories of figures, and also provides a synthetic training set of 70,000 instances. The dataset has highlighted several important findings: even strong models like GPT-4o and Gemini 1.5 Pro perform poorly on its visual perception tasks, despite near-perfect human performance. Fine-tuning open-source models with the synthetic training data has shown some improvement, but results vary by model and task. Notably, models built on stronger language models demonstrated better visual perception.

The dataset is part of ongoing efforts to improve the visual perception abilities of LVLMs by refining both training data and model architectures. Researchers have made the dataset, code, and model responses available for further study and experimentation on platforms like Hugging Face and GitHub.
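As a rough illustration of how such a release is typically consumed, the snippet below loads an evaluation split with the Hugging Face `datasets` library. The repository ID, split name, and field names shown here are placeholders; the actual identifiers and schema should be taken from the official release pages.

```python
# Sketch: loading an evaluation split from the Hugging Face Hub.
# The repository ID and split name below are placeholders, not the official ones.
from datasets import load_dataset

dataset = load_dataset("path/to/VisOnlyQA", split="eval_real")  # placeholder ID

example = dataset[0]
print(example.keys())  # e.g. question, options, answer, figure category (assumed fields)
```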

The VisOnlyQA dataset focuses on testing Large Vision Language Models (LVLMs) on their ability to understand geometric and numerical details in scientific figures. Specifically, it evaluates how well these models can process fine-grained visual information, such as geometric shapes, numerical data, and other visual aspects commonly found in scientific diagrams. This dataset includes multiple-choice questions across various categories of scientific figures, providing insights into the models' weaknesses in visual perception.

The dataset consists of 1,200 evaluation questions and 70,000 synthetic training examples designed to enhance model performance on these tasks. Through experiments, the research highlighted that models like GPT-4o and Gemini 1.5 Pro struggle with these tasks, while human performance remains near-perfect. The study also showed that fine-tuning LVLMs on synthetic data can lead to some improvement, though results vary depending on the task and model architecture.


VisOnlyQA bridges the gap in visual question answering by focusing on assessing fine-grained visual details rather than relying on domain-specific knowledge or reasoning. This approach addresses the challenge of understanding intricate visual nuances that often escape traditional models. While many VQA systems rely heavily on general reasoning or pre-defined domain-specific knowledge to answer questions about images, VisOnlyQA prioritizes capturing subtle visual elements, making it suitable for tasks where detailed perception is crucial. It allows for deeper comprehension of visual inputs, especially in cases where fine-grained features such as textures, shapes, and minute spatial relations are key to answering questions accurately. This shift helps improve performance in tasks where subtle differences in visual information matter more than abstract reasoning or prior knowledge.


The VisOnlyQA dataset consists of three key splits: Eval-Real, Eval-Synthetic, and Train.

  1. Eval-Real contains multiple-choice questions built from real-world scientific figures. These questions cover the dataset's figure categories and evaluate how well models perceive geometric and numerical information.

  2. Eval-Synthetic contains evaluation questions built from synthetically generated figures. Because these figures are produced from scripts, their geometric and numerical properties are known precisely, enabling controlled testing. Together, the two evaluation splits make up the 1,200 multiple-choice questions described above.

  3. Train comprises 70,000 automatically generated synthetic instances and is used to fine-tune models in order to study whether additional training data improves their visual perception.

The annotation process involved careful labeling of geometric and numerical elements in scientific figures, with quality verified through human accuracy of 93.5% to 95% on the evaluation splits. This ensures that the dataset accurately tests a model's ability to read specific visual information from figures, focusing solely on geometric and numerical cues without requiring reasoning abilities.


Performance of LVLMs on VisOnlyQA

The VisOnlyQA dataset provides an insightful benchmark for evaluating the visual perception capabilities of LVLMs. Unlike previous datasets that focus on general reasoning tasks or broader scene understanding, VisOnlyQA is tailored to measure the models' ability to process and understand geometric and numerical details within scientific figures. By testing LVLMs with tasks that involve precise visual perception, such as identifying geometric shapes, calculating angles, and interpreting data from charts, the dataset challenges the models to demonstrate a higher level of fine-grained visual understanding.

The study evaluating these models' performance on the VisOnlyQA dataset revealed that, despite their large size and advanced capabilities, LVLMs still face significant challenges in tasks requiring detailed visual analysis. Even state-of-the-art models like Phi-3.5-Vision and LLaVA-Next struggled with basic tasks, such as recognizing geometric relationships and interpreting chart intersections. This suggests that while these models have made substantial progress in multimodal reasoning, their ability to accurately perceive and interpret detailed visual elements remains an area for improvement. Furthermore, the analysis indicated that chain-of-thought reasoning, which typically improves performance on complex tasks, did not consistently enhance the models' visual perception accuracy. In fact, in some cases, it worsened their performance.

These findings emphasize the importance of datasets like VisOnlyQA in identifying and addressing the specific gaps in LVLMs' visual perception capabilities. The dataset's focus on geometric and numerical information offers a clearer, more direct assessment of how well these models can understand visual content without the additional complexity of reasoning tasks. As the field continues to evolve, datasets like VisOnlyQA will play a crucial role in driving further advancements in visual language understanding.



The models mentioned in the studies were primarily tested on tasks across multiple domains such as geometry, chemistry, chart analysis, and visual question answering (VQA). These domains required different reasoning capabilities, including mathematical problem-solving, interpreting visual data, and logical inference based on textual and graphical inputs.

The chain-of-thought (CoT) reasoning method was crucial in these tests, particularly in tasks involving complex reasoning steps. CoT reasoning allows the model to break down a problem into a sequence of logical steps, rather than jumping directly to an answer. This approach was particularly beneficial in tasks like MathVista (involving geometry and mathematical reasoning) and ChartQA (focusing on interpreting data from charts and graphs), where each step of the reasoning process helps clarify the logic behind the answer.
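In practice, the difference between direct prompting and CoT prompting often comes down to the instruction appended to the question. A minimal sketch, with wording chosen purely for illustration:

```python
# Sketch: a direct prompt versus a chain-of-thought (CoT) prompt for a chart question.
# The wording is illustrative and not taken from any specific benchmark.

question = "At approximately which x value do the two lines in the chart intersect?"
options = ["(a) 2", "(b) 4", "(c) 6", "(d) 8"]

direct_prompt = (
    f"{question}\nOptions: {', '.join(options)}\n"
    "Answer with the option letter only."
)

cot_prompt = (
    f"{question}\nOptions: {', '.join(options)}\n"
    "Reason step by step about what the chart shows, then end with 'Answer: <letter>'."
)
```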

In some tests, models like LLaVA were fine-tuned with specific CoT data, incorporating visual and textual input to enhance reasoning. For instance, LLaVA models were trained on both direct answers and CoT-based answers, which improved their performance across diverse reasoning tasks such as AI2D (interpreting science diagrams) and DocVQA (answering questions about document images).

Moreover, reinforcement learning (RL) methods were applied to further improve reasoning quality by better aligning the model's reasoning chains with ground-truth answers. Such training and fine-tuning led to performance improvements on both direct-answer and CoT-style tasks.


Based on the reported results, LVLM accuracy on VisOnlyQA varies considerably with the type of figure. On real-world figures, the evaluated models achieve an average accuracy of 54.2%, compared with a human accuracy of 93.5%. On synthetic figures, model accuracy drops further to 42.4%, while human accuracy remains at 95.0%. Notably, the models fare worse on the synthetic split even though its figures are generated from scripts and are structurally simpler than typical real-world figures, underscoring how far current LVLMs remain from reliable fine-grained visual perception.
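For reference, the human-model gaps implied by these accuracy numbers are straightforward to compute:

```python
# Human-model accuracy gaps implied by the reported VisOnlyQA results (percentage points).
human = {"real": 93.5, "synthetic": 95.0}
model = {"real": 54.2, "synthetic": 42.4}

for split in ("real", "synthetic"):
    print(f"{split}: {human[split] - model[split]:.1f} point gap")
# real: 39.3 point gap
# synthetic: 52.6 point gap
```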

Efforts to close this gap are ongoing, focusing on higher-quality training data, refined visual encoders, and improved model architectures.


Challenges in Visual Perception for LVLMs

Despite advancements in large vision-language models (LVLMs) like Phi-3.5-Vision and LLaVA-Next, they continue to struggle with visual perception tasks, primarily due to issues such as object hallucinations and weak visual grounding.

Object hallucinations occur when the model generates descriptions or captions that reference objects that do not exist in the image. This is a particularly prominent issue in LVLMs, where synthetic visual instructions, often generated by unimodal large language models, contain extraneous or fabricated details that mislead the system. These hallucinations are exacerbated by the length and complexity of the instructions used to fine-tune the model, making the grounding of visual objects less accurate.

Moreover, the gap between unstructured text input and structured visual perception (such as bounding boxes or semantic object classifications) poses a significant challenge. LVLMs tend to rely on pretrained language capabilities, which often lack the precise visual grounding needed to interpret images factually and accurately. Techniques to improve this grounding include reinforcement learning methods that fine-tune models on human feedback, aiming to enhance accuracy by penalizing hallucinated objects.

To address these limitations, researchers are exploring methods like integrating open-set object detectors, such as Grounding DINO, to verify the objects identified in the generated text against the visual data, improving the alignment between language and vision. Despite these improvements, the models still face fundamental difficulties in achieving the kind of nuanced, error-free visual perception needed for tasks like detailed image captioning or visual question answering.
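One way to implement this kind of verification is to check every object mentioned in the generated text against an open-set detector and flag mentions the detector cannot confirm. The sketch below uses a hypothetical `detect` function as a stand-in for a detector such as Grounding DINO; it illustrates the idea rather than reproducing any specific pipeline.

```python
# Sketch: flagging potentially hallucinated objects in a generated caption by
# checking each mentioned object against an open-set detector.
# `detect` is a hypothetical stand-in for a detector such as Grounding DINO.

def find_hallucinations(image, mentioned_objects, detect, threshold=0.35):
    """Return the mentioned objects that the detector cannot confirm in the image."""
    unconfirmed = []
    for obj in mentioned_objects:
        detections = detect(image, text_query=obj)  # assumed interface
        best_score = max((d["score"] for d in detections), default=0.0)
        if best_score < threshold:
            unconfirmed.append(obj)
    return unconfirmed

# Hypothetical usage: the caption mentions a dog, a frisbee, and a bicycle.
# flagged = find_hallucinations(img, ["dog", "frisbee", "bicycle"], detect=my_detector)
```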


The near-random performance observed in tasks like Geometry-Triangle and Charts Intersection may partly reflect how demanding fine-grained geometric judgments are, such as deciding whether and where shapes intersect under non-ideal conditions. Several factors contribute:

  1. Complexity of Geometry Intersections: Even for explicit algorithms, computing intersections between geometric objects such as triangles or chart lines is delicate. Checking whether two triangles intersect requires handling every combination of edge crossings and containment cases, and in practice these calculations rely on floating-point arithmetic, where small errors can compound near degenerate configurations (see the sketch after this list). Judging the same relationships purely from pixels, as an LVLM must, is at least as demanding.


  2. Dependence on Input Data: The performance can vary drastically depending on the specific configuration of input data. For example, when working with randomly generated graphs or triangles, the shape and position of the triangles might cause the algorithm to fail under certain conditions. This leads to variations in how well the intersection can be detected, especially if the triangles are degenerate or aligned in non-standard ways.


  3. Numerical Instability: When affine transformations or other coordinate changes are applied, they must be stable and well-conditioned. However, if the transformation matrix becomes poorly conditioned due to the geometry of the objects (e.g., triangles with very thin shapes), this can lead to significant inaccuracies in the results. This instability further contributes to the near-random behavior of these geometric operations, particularly when the triangles or charts are close to being parallel or share similar features.
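To make the numerical-precision point concrete, here is a minimal sketch of a 2D segment intersection test based on orientation (cross-product) signs with an explicit tolerance. When a cross product falls near zero, the segments are close to collinear and the outcome becomes sensitive to tiny coordinate errors, which is exactly where floating-point issues bite.

```python
# Sketch: orientation-based 2D segment intersection with an explicit tolerance.
# Near-degenerate cases (cross products close to zero) are where floating-point
# rounding makes the result unstable.

EPS = 1e-9

def cross(o, a, b):
    """Signed area of triangle (o, a, b); the sign gives the turn direction."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def segments_intersect(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    # Proper crossing: the endpoints of each segment lie on opposite sides of the other.
    if ((d1 > EPS) != (d2 > EPS)) and ((d3 > EPS) != (d4 > EPS)):
        return True
    # Values within EPS of zero signal near-collinearity, where the result is
    # numerically fragile and needs special-case handling.
    return False

print(segments_intersect((0, 0), (4, 4), (0, 4), (4, 0)))  # True
```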


Chain-of-thought (CoT) reasoning, which encourages step-by-step reasoning to solve tasks, doesn't always yield consistent improvements in performance. In some cases, it can even degrade the results. This happens because CoT reasoning is not inherently suited for all tasks, especially when it involves information processing that doesn't benefit from a detailed step-by-step approach.

Research has shown that when invalid or incoherent reasoning steps are introduced into a CoT framework, performance can significantly worsen. This is because the model begins to follow an incorrect chain of thought, which can lead to faulty conclusions and lower accuracy compared to simpler, direct approaches (such as standard prompting without reasoning).

In the context of visual tasks like VisOnlyQA, which is designed to assess visual perception rather than reasoning, CoT reasoning may not always align with the task's objectives. The task requires the model to process visual data and recognize patterns, not necessarily to reason through complex steps. When forced to use CoT reasoning for tasks that don't demand logical steps, the model may overcomplicate the process, leading to errors or worse performance. This reinforces the idea that VisOnlyQA tests visual understanding, not reasoning capabilities.

Therefore, while CoT can be helpful for tasks that involve logical reasoning or textual understanding, it's not a universal solution, especially when the task is more about direct perception and pattern recognition.


Error Analysis: A Deeper Look into the Mistakes

The error analysis in the study of Large Vision-Language Models (LVLMs) shows that visual perception errors remain a significant source of mistakes. These issues are particularly evident when the models are tasked with interpreting fine-grained visual information, such as geometric and numerical data in scientific figures. The analysis reveals that LVLMs, including state-of-the-art models like GPT-4o and Gemini 1.5 Pro, perform poorly on visual perception tasks compared to human accuracy, which remains nearly perfect.

One primary area of difficulty is in low-level visual tasks, such as recognizing objects or subtle visual details, which often influence higher-level interpretations. For instance, models struggle to correctly interpret manipulated or synthetic images, which can result in misjudgments of the context or intent, such as confusing objects or misinterpreting spatial relationships. This aligns with the finding that LVLMs fail to generalize as effectively as humans when faced with visually manipulated content, particularly in scenarios that require high-level perception and understanding of visual semantics.

While synthetic training data has shown potential in improving model performance, the improvements are limited to specific tasks and models, emphasizing the need for further advancements in both model architectures and training datasets.

The mistakes made by Large Vision-Language Models (LVLMs) in visual perception highlight the critical need for datasets like VisOnlyQA. The dataset was created specifically to address the gaps in assessing the visual perception abilities of LVLMs, especially when it comes to understanding geometric and numerical details in scientific figures. Through experiments with VisOnlyQA, researchers found that even top models like GPT-4o and Gemini 1.5 Pro struggle with tasks that involve fine-grained visual understanding. This struggle, set against near-perfect human performance, underlines the challenges LVLMs face in visual perception tasks.

The data provided by VisOnlyQA allows for targeted evaluation, independent of other cognitive tasks such as reasoning, and emphasizes that current models still fall short when it comes to interpreting complex visual elements. The low performance of these models validates the necessity for specialized datasets that push for improvements in LVLMs' visual processing capabilities, especially in areas where precision and fine details matter.

Conclusion

The significance of visual perception in Large Vision-Language Models (LVLMs) lies in their ability to understand and reason about visual inputs, particularly images and graphics. As these models become more advanced, their potential to assist in a wide variety of tasks, such as document interpretation and scientific figure analysis, grows. One area where LVLMs have shown promise—and yet face challenges—is fine-grained visual document understanding. The recent development of datasets like MMDocBench and VisOnlyQA is pushing the boundaries of evaluation in this domain.

MMDocBench is designed to assess LVLMs' capabilities in OCR-free understanding of document images that combine rich visual and textual information. The dataset includes various tasks that challenge LVLMs to extract and reason about data from complex documents, such as research papers and infographics. It emphasizes the need for models that can link visual elements to their corresponding textual information, providing a more holistic evaluation of their visual reasoning abilities.

On the other hand, the VisOnlyQA dataset specifically targets the geometric and numerical perception abilities of LVLMs. It includes 1,200 multiple-choice questions focused on fine-grained visual understanding, drawn from scientific figures spanning geometry, chemistry, and charts. This dataset highlights how LVLMs still struggle with interpreting geometric relationships, offering a critical area for improvement. It demonstrates the need for more specialized tools that can handle not just abstract visual elements but also intricate details of visual structures.

Together, these datasets are helping to advance the development of LVLMs by providing structured and challenging tests that push the limits of what these models can understand about visual content. As they continue to evolve, these models will likely become more adept at integrating and reasoning across multimodal data, ultimately improving their effectiveness in a range of applications from document processing to scientific research.


The findings from the VisOnlyQA dataset provide valuable insights into the current limitations of large vision-language models (LVLMs) when it comes to understanding and processing complex visual information, especially related to geometric and scientific figures. Based on the dataset, which evaluates LVLMs on a range of visual perception tasks, the results highlight significant areas for improvement in visual reasoning. LVLMs struggle particularly with tasks that require understanding detailed geometric concepts such as angles, areas, and relationships between various shapes. These challenges suggest that there is ample opportunity for future advancements, particularly in enhancing models' ability to parse and reason about geometric information, a key aspect of visual perception.

Improving LVLMs' performance on tasks that involve complex geometric and numerical data could pave the way for their application in fields that demand precise visual reasoning, such as in scientific research, architecture, or medical imaging. This could involve developing more sophisticated models that integrate spatial reasoning capabilities or incorporating specialized datasets tailored to visual geometry tasks. Further advancements might also include augmenting models with the ability to synthesize and interpret visual information in the context of broader scientific or technical knowledge, thus improving their overall utility in complex visual tasks.

The VisOnlyQA dataset itself plays a pivotal role in shaping the future of LVLMs, as it provides a benchmark for evaluating models on tasks they traditionally struggle with. This can guide the development of better models that excel at visual perception, especially those handling intricate geometric and numerical data. The dataset's inclusion of both real and synthetic figures for training and evaluation could also contribute to a more comprehensive understanding of how LVLMs interact with different types of visual data, fostering improvements in their robustness and versatility across diverse applications.

Press contact

Timon Harz

oneboardhq@outlook.com
