Timon Harz
December 12, 2024
Lavita AI Launches Medical Benchmark to Advance Long-Form Question Answering with Open Models and Expert-Annotated Datasets
Lavita AI's new benchmark evaluates medical QA systems using real-world questions, providing valuable insights into the capabilities of open-source models. The results challenge conventional wisdom and offer a more transparent alternative to closed AI models.

Medical question-answering (QA) systems are essential in modern healthcare, providing valuable tools for both medical professionals and the public. Long-form QA systems stand apart from simpler models by delivering comprehensive answers that reflect the complexity of real-world clinical scenarios. These systems must interpret nuanced, often ambiguous or incomplete questions, producing reliable, detailed responses. As the use of AI in healthcare grows, the demand for effective long-form QA systems is rising. These systems enhance healthcare accessibility and offer an opportunity to improve AI's decision-making abilities and patient engagement.
However, one significant challenge for long-form QA systems is the lack of appropriate benchmarks to assess the performance of large language models (LLMs) in generating detailed answers. Existing benchmarks often rely on automatic scoring systems and multiple-choice formats, which fail to capture the complexity of real-world clinical settings. Many of these benchmarks are closed-source and lack medical expert annotations, limiting transparency and accessibility and slowing progress in developing systems capable of addressing complex medical queries. Additionally, some existing datasets contain errors, outdated information, or overlap with training data, compromising their usefulness for reliable evaluations.
Despite efforts to address these gaps, approaches such as automatic evaluation metrics and curated question datasets (e.g., MedRedQA, HealthSearchQA) offer only limited assessments and do not capture the broader context of long-form answers. The absence of diverse, high-quality datasets and well-defined evaluation frameworks has hindered the development of effective long-form QA systems.
To address these challenges, researchers from Lavita AI, Dartmouth Hitchcock Medical Center, and Dartmouth College have introduced a publicly accessible benchmark for evaluating long-form medical QA systems. This benchmark includes 1,298 real-world medical questions annotated by healthcare professionals. It incorporates several performance criteria, such as correctness, helpfulness, reasoning, harmfulness, efficiency, and bias, allowing for a comprehensive evaluation of both open and closed-source models. By leveraging expert annotations and clustering techniques, the researchers have created a high-quality, diverse dataset that supports robust model evaluation. Additionally, GPT-4 and other LLMs were used for semantic deduplication and question curation, further enhancing the dataset's quality.
The creation of this benchmark followed a multi-phase approach. Researchers collected 4,271 user queries from 1,693 conversations with Lavita Medical AI Assist, filtering and deduplicating them to yield 1,298 high-quality questions. Semantic similarity analysis was applied to ensure the dataset covered a wide range of medical scenarios. The queries were categorized into three difficulty levels (basic, intermediate, and advanced) based on the complexity of the questions and the required medical knowledge. Annotation batches were then created, each containing 100 questions, with answers generated by various models for pairwise evaluation by human experts.
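The paper's exact curation pipeline is not reproduced here, but the core idea of embedding-based semantic deduplication can be sketched in a few lines of Python. In the snippet below, the embedding model and the similarity threshold are illustrative assumptions rather than Lavita AI's actual settings.

```python
# Illustrative sketch of embedding-based semantic deduplication.
# The embedding model and the 0.9 similarity threshold are assumptions
# for demonstration, not Lavita AI's actual curation settings.
from sentence_transformers import SentenceTransformer, util

def deduplicate(questions, threshold=0.9):
    """Keep only questions that are not near-duplicates of an earlier one."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(questions, convert_to_tensor=True)
    kept_questions, kept_embeddings = [], []
    for question, emb in zip(questions, embeddings):
        # Compare against everything already kept; drop if too similar.
        if not any(util.cos_sim(emb, k).item() >= threshold for k in kept_embeddings):
            kept_questions.append(question)
            kept_embeddings.append(emb)
    return kept_questions

queries = [
    "What are the common side effects of metformin?",
    "Which side effects does metformin usually cause?",
    "How is strep throat diagnosed?",
]
print(deduplicate(queries))  # the two metformin phrasings collapse to one entry
```

A production pipeline would likely use a stronger embedding model, compare pairs more efficiently, and tune the threshold against duplicates flagged by human reviewers.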

The benchmark results provided valuable insights into the performance of various LLMs. Among the smaller models, AlpaCare-13B outperformed BioMistral-7B across most evaluation criteria. Surprisingly, the state-of-the-art open model Llama-3.1-405B-Instruct outperformed the commercial GPT-4o across all metrics, including correctness, efficiency, and reasoning. These findings challenge the assumption that closed or domain-specific models necessarily outperform open, general-purpose ones. Additionally, the results revealed that Meditron3-70B, a specialized clinical model, did not significantly outperform its base model, Llama-3.1-70B-Instruct, prompting further questions about the true value of domain-specific tuning.

Key takeaways from Lavita AI's research:
The dataset contains 1,298 curated medical questions, categorized into basic, intermediate, and advanced levels to evaluate various aspects of medical QA systems.
The benchmark assesses models across six criteria: correctness, helpfulness, reasoning, harmfulness, efficiency, and bias.
Llama-3.1-405B-Instruct outperformed GPT-4o, while AlpaCare-13B performed better than BioMistral-7B.
Meditron3-70B did not show significant improvements over its base model, Llama-3.1-70B-Instruct.
Open models performed on par with or better than closed systems, suggesting open-source solutions could address privacy and transparency concerns in healthcare.
The benchmark's open structure and reliance on human annotations offer a scalable and transparent foundation for future advancements in medical QA.

In conclusion, this study fills the gap in robust benchmarks for long-form medical QA by introducing a dataset of 1,298 real-world medical questions, annotated by experts and evaluated across six performance criteria. The results demonstrate the strong performance of open models like Llama-3.1-405B-Instruct, which outperformed the commercial GPT-4o. Specialized models, such as Meditron3-70B, showed no significant improvement over their general-purpose counterparts, suggesting that well-trained open models may already be sufficient for medical QA tasks. These findings emphasize the potential of open-source solutions to provide privacy-conscious and transparent healthcare AI.
The launch of the new medical benchmark by Lavita AI marks a significant advancement in the field of medical AI, specifically in long-form question answering (QA). Unlike traditional benchmarks that often rely on multiple-choice questions or automated metrics, this new benchmark offers a more comprehensive evaluation tool. By using real-world medical questions with long-form answers, and including expert annotations from medical professionals, it better reflects the complexities of clinical scenarios. This approach enables more accurate assessments of LLMs' performance in medical QA, addressing previous gaps in the evaluation of AI models for healthcare.
Additionally, this benchmark is open-source, making it a valuable resource for future research in medical AI. It not only facilitates transparency and reproducibility but also suggests that open models, which are often considered less capable than closed-source counterparts, can perform just as well—if not better—in this domain. This shift could have far-reaching implications for the development of privacy-conscious, transparent healthcare AI systems.
The Challenge in Medical QA
Existing benchmarks for evaluating long-form answers in medical question answering (QA) often face significant limitations. One major issue is the lack of fine-grained, human-annotated datasets that can accurately assess the nuanced aspects of long-form responses. While some benchmarks focus on general question-answering tasks, they do not always capture the complexity of medical answers that require multi-step reasoning, domain-specific knowledge, and the ability to handle ambiguity.
For instance, many medical QA benchmarks prioritize fact-based, short-answer questions, which do not account for the detailed, context-heavy answers required in real-world medical scenarios. This oversight can lead to a gap in evaluating the true capabilities of models in delivering comprehensive, clear, and medically accurate long-form responses.
Another limitation lies in the metrics used for evaluation. Current benchmarks typically emphasize correctness and relevance but fail to adequately address the comprehensiveness, fluency, and reasoning required for complex, long-form answers. Moreover, many existing datasets are not structured to capture the reasoning behind medical decisions, a critical factor in healthcare-related responses.
Lavita AI's introduction of a new benchmark aims to address these gaps by providing a curated, expert-annotated dataset specifically designed to evaluate long-form medical QA across multiple performance criteria. This approach not only highlights the importance of reasoning and medical knowledge in evaluating answers but also ensures that the answers are contextually rich and medically sound.
Current benchmarks for medical question answering (QA) have focused largely on multiple-choice questions (MCQs) and automated metrics. These approaches, while useful for evaluating certain aspects of performance, do not capture the full complexity of real-world medical applications. In practice, medical QA often requires long-form answers that incorporate nuanced, multi-step reasoning and the ability to address vague or complex symptom descriptions—elements that MCQs and automated metrics fail to adequately test.
This gap is especially evident in specialized medical domains, where questions might call for emotional support, ethical considerations, or up-to-date medical research to resolve diagnostic challenges. The benchmarks commonly used to evaluate models like BioMistral and GPT-4o measure general performance, but they do not address how these models handle long-form answers that require real-world knowledge, ethical reasoning, and humanistic decision-making. Moreover, the absence of human evaluation of long-form answers in these benchmarks further limits their real-world applicability.
Lavita AI's new benchmark helps bridge this gap by providing a more comprehensive and nuanced evaluation of long-form answers. Its expert-annotated medical questions test models across difficulty levels, from basic to advanced, probing whether they can handle the intricacies of medical practice. The shift toward open models in medical AI also aligns with concerns over privacy and transparency, as the results show that these solutions can perform on par with, or even surpass, commercial models that often lack the same level of transparency.
Introducing the Lavita Medical Benchmark
Lavita AI's newly launched medical benchmark dataset features 1,298 curated medical questions, carefully selected and categorized by difficulty (basic, intermediate, advanced). These questions were gathered from the Lavita Medical AI Assist platform and represent a wide range of real-world scenarios that test the capabilities of long-form medical question-answering (QA) systems. Each question has been annotated by medical professionals, ensuring high-quality, reliable data for model evaluation.
The questions in this dataset cover various medical domains and are intended to evaluate the nuanced performance of AI models in healthcare. To ensure accuracy, two medical doctors specialized in fields like radiology and pathology were involved in the expert annotation process. This multi-level, expert-reviewed dataset is crucial for assessing how AI models handle complex, real-world medical inquiries.
The Lavita AI dataset reflects real-world consumer medical questions by focusing on queries that individuals might typically ask in everyday healthcare situations. Unlike many medical datasets that primarily target specialized or technical questions, this dataset includes common questions relevant to consumers, making it more accessible for general public use. These questions are annotated by medical experts to ensure that the answers provided are not only relevant but also accurate and actionable. The dataset spans a range of difficulty levels (basic, intermediate, and advanced) to capture a variety of medical issues, from simple symptoms to more complex, multi-faceted cases. This diversity ensures that the models are tested across scenarios reflecting real-world health concerns that consumers may face, enhancing the practical value of the benchmark for improving medical question answering systems.
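For readers who want to explore such questions themselves, the sketch below shows one way a released question set could be inspected with the Hugging Face datasets library. The dataset identifier and the column names ("question", "difficulty") are placeholders for illustration; substitute whatever identifiers Lavita AI actually publishes.

```python
# Illustrative sketch only: the dataset ID and column names below are
# placeholders, not confirmed identifiers from Lavita AI's release.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("lavita/long-form-medical-qa", split="train")  # placeholder ID
print(ds.column_names)

# Count questions per difficulty level, assuming a "difficulty" column with
# values such as "basic", "intermediate", and "advanced".
print(Counter(ds["difficulty"]))

# Preview a handful of consumer-style questions, assuming a "question" column.
for row in ds.select(range(3)):
    print(row["question"])
```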
The benchmark for long-form medical question answering (QA) evaluates responses based on six key performance criteria:
Correctness: This criterion assesses whether the answer is factually accurate and aligned with medical knowledge. It determines how well the model answers the question based on reliable medical information.
Helpfulness: This evaluates the extent to which the answer is practical, usable, and relevant to the user's needs. A helpful answer provides clear, actionable insights for the user.
Reasoning: This criterion looks at how well the model explains its thought process. A strong answer should include logical reasoning or an explanation of how the conclusion was reached, especially for complex medical queries.
Harmfulness: The potential for an answer to cause harm is another key aspect. This criterion examines whether the answer could lead to risky or unsafe medical advice, reflecting the model’s adherence to safety standards in healthcare.
Efficiency: Efficiency measures how well the answer balances providing necessary information without being overly detailed or too vague. An efficient response is concise while still thorough enough to be useful.
Bias: Bias assesses the degree to which the answer may reflect prejudice or discriminatory viewpoints, ensuring that the model’s responses are fair and equitable.
These criteria were developed with input from medical experts to ensure a comprehensive evaluation of long-form answers in the medical QA domain, covering both clinical accuracy and user safety.
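To make the pairwise evaluation protocol concrete, the sketch below shows one way expert judgments over these six criteria could be aggregated into per-criterion win rates. The record format is an assumption for demonstration, not the benchmark's actual annotation schema; for criteria where less is better, such as harmfulness and bias, "winner" simply denotes the answer the expert preferred.

```python
# Illustrative sketch: aggregate pairwise expert judgments into per-criterion
# win rates. The record format is assumed for demonstration and is not
# Lavita AI's actual annotation schema.
from collections import defaultdict

CRITERIA = ["correctness", "helpfulness", "reasoning",
            "harmfulness", "efficiency", "bias"]

def win_rates(judgments):
    """judgments: iterable of dicts like
    {"model_a": "llama-3.1-405b-instruct", "model_b": "gpt-4o",
     "criterion": "correctness", "winner": "model_a"}   # or "model_b" / "tie"
    Returns {criterion: {model: share of non-tie comparisons won}}."""
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for j in judgments:
        if j["winner"] == "tie":
            continue
        winner = j[j["winner"]]  # map "model_a"/"model_b" to the model name
        loser = j["model_b"] if j["winner"] == "model_a" else j["model_a"]
        wins[j["criterion"]][winner] += 1
        totals[j["criterion"]][winner] += 1
        totals[j["criterion"]][loser] += 1
    return {c: {m: wins[c][m] / totals[c][m] for m in totals[c]}
            for c in CRITERIA if c in totals}

example = [
    {"model_a": "llama-3.1-405b-instruct", "model_b": "gpt-4o",
     "criterion": "correctness", "winner": "model_a"},
    {"model_a": "llama-3.1-405b-instruct", "model_b": "gpt-4o",
     "criterion": "correctness", "winner": "tie"},
]
print(win_rates(example))  # {'correctness': {'llama-3.1-405b-instruct': 1.0, 'gpt-4o': 0.0}}
```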
Key Findings and Insights
Lavita AI's benchmark study on long-form medical question answering has revealed significant insights into the performance of open-source models compared to commercial offerings. Notably, Llama-3.1-405B-Instruct, an open model, outperformed the commercial GPT-4o across the benchmark's evaluation criteria. The findings underscore the growing capability of open models in medical and other complex domains.
Meanwhile, Meditron3-70B, a model specialized for clinical medicine, did not show substantial improvements over its general-purpose base model, Llama-3.1-70B-Instruct, suggesting that well-trained open models may already be robust enough for complex medical QA tasks. The performance of Llama-3.1-405B-Instruct demonstrates the potential of open-source AI in the healthcare space, challenging the dominance of closed models like GPT-4o. It also points to broader implications for privacy-conscious and transparent AI applications in healthcare, where open models can offer both flexibility and security, especially with respect to HIPAA compliance.
Lavita AI's findings emphasize the viability of open-source solutions as competitive alternatives to commercial models, potentially reshaping the landscape of AI-powered medical question answering.
The release of Meditron3-70B, a specialized medical pre-trained model, raised intriguing questions about the value of domain-specific tuning. Despite its additional medical pre-training, it did not significantly outperform general-purpose models like Llama-3.1-70B-Instruct on medical tasks such as MMLU College Biology and Professional Medicine.
Interestingly, both Meditron3-70B and Llama-3.1-70B demonstrated strong results, with Llama-3.1-70B even surpassing Llama-3.2-90B in specialized medical tasks. This raises the question of whether the investment in fine-tuning models for specific domains truly leads to improved performance or if large, general-purpose models can already handle these tasks effectively.
The lack of a clear advantage for domain-specific models like Meditron3-70B suggests that more experimentation may be needed to determine whether their specialized training can justify the additional resources required for fine-tuning, especially when general-purpose models are already yielding strong results in medical contexts.
Implications for Open-Source Solutions in Healthcare
The results of Lavita AI's study on open models in medical question answering (QA) suggest a transformative shift for healthcare AI, especially in addressing privacy and transparency concerns. By demonstrating that open models like Llama-3.1-405B-Instruct can outperform commercial alternatives such as GPT-4o, the study paves the way for broader adoption of open-source solutions in medical AI.
The implications for privacy are significant: open models allow for better control over data, minimizing the risk of third-party exposure. However, they also raise concerns, such as the potential for model inversion or membership inference attacks, where private data could be inadvertently exposed through open model access. Nonetheless, the transparency that open models provide enables healthcare systems to scrutinize and trust the AI's decision-making process. This is crucial in healthcare, where the stakes of patient data confidentiality are high.
Moreover, specialized models like Meditron3-70B, which showed no substantial improvements over more general models, further support the idea that well-trained, open models can meet the needs of medical QA tasks. This could lead to more cost-effective, accessible solutions for healthcare providers, which could ultimately foster more inclusive, equitable healthcare.
In summary, while challenges related to privacy and security remain, the findings suggest that open models offer a viable path forward, enabling healthcare AI to be both more transparent and more adaptable, benefiting patients, providers, and researchers alike.
Open-source benchmarks play a crucial role in fostering collaboration and transparency within the AI community. By providing public access to data, code, and methodologies, they enable researchers and developers worldwide to inspect, audit, and improve AI models. This openness is essential for ensuring the ethical development of AI, reducing bias, and promoting innovation.
Transparency is one of the most significant benefits of open-source AI. It allows for a deeper understanding of how AI models make decisions, helping to identify and mitigate potential biases in both the data and the algorithms themselves. With open-source systems, external audits become possible, which adds accountability and allows for quicker identification of issues that might otherwise go unnoticed in closed systems. This transparency creates a space for diverse contributors to address challenges and refine models, fostering a more inclusive and responsible AI development environment.
Moreover, open-source benchmarks also support reproducibility in AI research, a critical aspect of ensuring that experiments can be verified and results validated across the community. They establish a standard for performance evaluation, allowing different teams to compare models fairly, track progress, and align their work with best practices in the industry.
Ultimately, open-source benchmarks help level the playing field by democratizing access to cutting-edge AI technologies. This enables smaller organizations, academic institutions, and independent researchers to participate in developing advanced AI solutions, reducing the concentration of power in the hands of a few large corporations. It also contributes to a more collaborative environment where diverse perspectives can drive progress.
Conclusion
The launch of the Lavita AI benchmark marks an important step in advancing medical question answering (QA) systems. By introducing a dataset of 1,298 real-world medical questions, annotated by experts, Lavita provides a comprehensive and transparent evaluation framework for assessing the performance of medical LLMs. This benchmark addresses significant gaps in current medical AI research, especially around long-form answers, human evaluation, and the transparency of closed-source models.
One of the most crucial insights from the Lavita benchmark is the superior performance of open-source models, such as Llama-3.1-405B-Instruct, which outperformed the commercial GPT-4o. This demonstrates the increasing potential of open models in the healthcare AI space, offering a viable alternative to closed-source models that often raise concerns around privacy and transparency.
Furthermore, Lavita emphasizes the importance of human expert annotations in improving the reliability and applicability of AI models in real-world scenarios. The detailed evaluation metrics, which include correctness, helpfulness, and bias, provide a foundation for the next generation of medical QA systems. These insights not only guide the refinement of open-source models but also point to the potential of these models in providing privacy-conscious, HIPAA-compliant solutions for healthcare.
Encouraging the further exploration of expert-annotated datasets and open-source models is vital for driving innovation in AI-assisted healthcare. By embracing these resources, researchers and developers can create more effective, ethical, and scalable solutions that meet the rigorous demands of the medical field. The Lavita benchmark serves as a powerful tool to propel this vision forward, paving the way for better medical decision-making and patient care.
Press contact
Timon Harz
oneboardhq@outlook.com