Abstract:
Recent advances in high-throughput sequencing technologies have enabled the collection and sharing of vast amounts of omics data, along with its associated metadata. Improving the availability of this metadata is crucial for ensuring the reusability and reproducibility of raw data and for facilitating novel biomedical discoveries through efficient data reuse. In this study, we performed a comprehensive assessment of metadata completeness by analyzing over 26,000,000 experiments shared in the Sequence Read Archive (SRA) from 2008 to 2023. Our results show that the countries of Central Europe, the USA, and China dominate the generation of sequencing data, accounting for 45%, 16%, and 8% of the total data in the SRA repository, respectively, and that the most frequently used platform is ILLUMINA (90%). We also identified gaps in metadata completeness: the absence of temporary identifiers (5.2%), the lack of an assigned TaxonomyID (5%), and the absence of a library strategy (8%). Our results highlight the urgent need for improved metadata sharing practices and the standardization of reporting.