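Preliminary construction of a technical framework for extracting numerical values from figures in soil science papers based on deep learning
DOI:
Authors:
Affiliation:

State Key Laboratory of Soil and Sustainable Agriculture, Institute of Soil Science, Chinese Academy of Sciences, Nanjing

Author biography:

Corresponding author:

Chinese Library Classification (CLC) number:

S15; TP399

Funding:

National Key Research and Development Program of China (2020YFC1807401), the National Science and Technology Basic Resources Survey Program (2021FY100703), and the Chinese Academy of Sciences Network Security and Informatization Application Demonstration Project (CAS-WX2022SF-0201)
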
    Abstract:

    Research papers are the main vehicle for expressing, disseminating and exchanging results in soil science, and figures are an important means of conveying patterns and trends in soil data. However, the true values of key elements in a figure, such as points and bars, are difficult to obtain, which limits further use of these data. An automated, high-precision method for extracting numerical values from figures is therefore urgently needed. In this study, we propose a deep-learning-based technical framework for extracting numerical values from figures in soil science papers. First, the common figure elements and the symbols used to represent them were catalogued, and relevant figures were collected and manually labelled to form a training dataset. Second, starting from the YOLO v8 base model, which uses global image information to detect multiple targets in a single pass, a detection model suited to figure elements in soil science papers was obtained through several rounds of iterative optimization. Third, an algorithm for converting image coordinates to data coordinates was developed to automatically extract numerical values from two-dimensional scatter plots and bar charts. In tests on independent figures that were not used in training, the model effectively identified figure elements, and the computed values agreed closely with manually extracted values (coefficient of determination R² of the linear regression greater than 0.99). The proposed framework is therefore feasible and offers a new approach to the efficient use of figure data in soil science papers.
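
The conversion from image coordinates to data values described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm. It assumes linear axes, and it assumes that the pixel positions of axis tick marks and the numbers printed beside them have already been obtained in an earlier step. The names `fit_axis`, `pixel_to_value` and `r_squared` are hypothetical, and the tick positions are made-up illustration values.

```python
from typing import Sequence, Tuple

# (slope, intercept) such that value = slope * pixel + intercept
Axis = Tuple[float, float]


def fit_axis(pixels: Sequence[float], values: Sequence[float]) -> Axis:
    """Least-squares fit of a linear pixel-to-value mapping for one axis.

    Assumption: the axis is linear (not logarithmic) and at least two
    tick marks with known numeric labels are available.
    """
    n = len(pixels)
    if n < 2 or n != len(values):
        raise ValueError("need at least two matching (pixel, value) tick pairs")
    mean_p = sum(pixels) / n
    mean_v = sum(values) / n
    var_p = sum((p - mean_p) ** 2 for p in pixels)
    if var_p == 0:
        raise ValueError("tick pixel positions must differ")
    cov = sum((p - mean_p) * (v - mean_v) for p, v in zip(pixels, values))
    slope = cov / var_p
    return slope, mean_v - slope * mean_p


def pixel_to_value(pixel: float, axis: Axis) -> float:
    """Apply the fitted linear mapping to one pixel coordinate."""
    slope, intercept = axis
    return slope * pixel + intercept


def r_squared(extracted: Sequence[float], manual: Sequence[float]) -> float:
    """R^2 of a simple linear regression between two value lists
    (equal to the squared Pearson correlation)."""
    n = len(extracted)
    if n < 2 or n != len(manual):
        raise ValueError("need two lists of equal length (>= 2)")
    mean_e = sum(extracted) / n
    mean_m = sum(manual) / n
    sxy = sum((e - mean_e) * (m - mean_m) for e, m in zip(extracted, manual))
    sxx = sum((e - mean_e) ** 2 for e in extracted)
    syy = sum((m - mean_m) ** 2 for m in manual)
    if sxx == 0 or syy == 0:
        raise ValueError("values must not all be identical")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical tick positions: x ticks at pixels 100/300/500 labelled 0/10/20,
# y ticks at pixels 400/250/100 labelled 0/5/10 (image y grows downward).
x_axis = fit_axis([100, 300, 500], [0, 10, 20])
y_axis = fit_axis([400, 250, 100], [0, 5, 10])

# A detected scatter point centred at pixel (200, 325).
point = (pixel_to_value(200, x_axis), pixel_to_value(325, y_axis))

# A detected bar whose top edge lies at pixel row 175.
bar_height = pixel_to_value(175, y_axis)
```

With these illustrative ticks, the scatter point maps to approximately (5.0, 2.5) and the bar height to approximately 7.5. Fitting each axis by least squares over all available ticks, rather than using only two of them, reduces the effect of a small localisation error in any single tick. `r_squared` is one way an agreement check against manually extracted values, such as the one reported above, could be computed.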

History
  • Received: 2023-12-25
  • Last revised: 2024-05-12
  • Accepted: 2024-05-14
  • Published online:
  • Publication date: