BioLLMBench: A Comprehensive Benchmarking of Large Language Models in Bioinformatics

dc.contributor.author MANGUL, Serghei
dc.contributor.author MUNTEANU, Viorel
dc.contributor.author SUHODOLSCHI, Timur
dc.contributor.author CIORBA, Dumitru
dc.contributor.author WANG, Wei
dc.date.accessioned 2024-06-12T08:33:41Z
dc.date.available 2024-06-12T08:33:41Z
dc.date.issued 2024
dc.identifier.citation MANGUL, Serghei et al. BioLLMBench: A Comprehensive Benchmarking of Large Language Models in Bioinformatics: Preprint. In: Research Square, 2024, 49 p. en_US
dc.identifier.uri https://doi.org/10.21203/rs.3.rs-3780193/v1
dc.identifier.uri http://repository.utm.md/handle/5014/27366
dc.description.abstract Large Language Models (LLMs) have shown great promise in their knowledge integration and problem-solving capabilities, but their ability to assist in bioinformatics research has not been systematically evaluated. To bridge this gap, we present BioLLMBench, a novel benchmarking framework coupled with a scoring metric scheme for comprehensively evaluating LLMs in solving bioinformatics tasks. Through BioLLMBench, we conducted a thorough evaluation of 2,160 experimental runs of the three most widely used models, GPT-4, Bard and LLaMA, focusing on 36 distinct tasks within the field of bioinformatics. The tasks come from six key areas of emphasis within bioinformatics that directly relate to the daily challenges and tasks faced by individuals within the field. These areas are domain expertise, mathematical problem-solving, coding proficiency, data visualization, summarizing research papers, and developing machine learning models. The tasks also span varying levels of complexity, ranging from fundamental concepts to expert-level challenges. Each key area was evaluated using seven specifically designed task metrics, which were then used to conduct an overall evaluation of the LLM's response. To enhance our understanding of model responses under varying conditions, we implemented a Contextual Response Variability Analysis. Our results reveal a diverse spectrum of model performance, with GPT-4 leading in all tasks except mathematical problem-solving. GPT-4 achieved an overall proficiency score of 91.3% in domain knowledge tasks, while Bard excelled in mathematical problem-solving with a 97.5% success rate. While GPT-4 outperformed in machine learning model development tasks with an average accuracy of 65.32%, both Bard and LLaMA were unable to generate executable end-to-end code.
All models faced considerable challenges in research paper summarization, with none of them exceeding a 40% score in our evaluation using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, highlighting a significant area for future improvement. We observed an increase in model performance variance when using a new chatting window compared to using the same chat, although the average scores between the two contextual environments remained similar. Lastly, we discuss various limitations of these models and acknowledge the risks associated with their potential misuse. en_US
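As an illustration of the summarization metric named in the abstract (this is a generic sketch of ROUGE-1 recall, not the paper's actual scoring pipeline or tokenization), the core unigram-overlap computation can be written as:

```python
from collections import Counter

def rouge_1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams also found in the candidate.

    Uses simple lowercase whitespace tokenization; real implementations
    typically add stemming and report precision/recall/F1.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each reference token counts at most as often
    # as it appears in the candidate.
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# Toy example: 5 of the 6 reference unigrams appear in the candidate.
score = rouge_1_recall("the cat sat on the mat", "the cat on the mat")
print(round(score, 3))  # 0.833
```

A score below 0.4, as reported for all three models, would mean fewer than 40% of the reference summary's unigrams are recovered.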
dc.language.iso en en_US
dc.publisher Research Square, Preprint Platform en_US
dc.relation.ispartofseries Research Square;Preprint
dc.rights Attribution-NonCommercial-NoDerivs 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/us/ *
dc.subject large language models (LLM) en_US
dc.subject bioinformatics en_US
dc.title BioLLMBench: A Comprehensive Benchmarking of Large Language Models in Bioinformatics en_US
dc.type Article en_US



