Research on SSR Molecular Markers for Traceability of Tea Origins Based on Deep Neural Network

GONG Hao 1; ZHANG Lili 1; CHEN Furong 2; LIN Lixia 1; CHEN Yijun 1; ZHANG Le 1; SUN Chunlian 2; SUN Jian 1

Article Abstract

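GONG Hao 1, ZHANG Lili 1, CHEN Furong 2, LIN Lixia 1, CHEN Yijun 1, ZHANG Le 1, SUN Chunlian 2, SUN Jian 1. Research on SSR Molecular Markers for Traceability of Tea Origins Based on Deep Neural Network[J]. Guangdong Agricultural Sciences, 2023, 50(9): 108-116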


Research on SSR Molecular Markers for Traceability of Tea Origins Based on Deep Neural Network

DOI: 10.16768/j.issn.1004-874X.2023.09.011

Keywords: tea; SSR; PCA; deep neural network; traceability; molecular marker

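Funding: Guangdong Province Special Fund for Science and Technology Innovation Strategy (pdjh2023b0500); Huizhou University Start-up Project for Professors and Doctors (2021JB017)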

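Authors	Affiliations
GONG Hao 1, ZHANG Lili 1, CHEN Furong 2, LIN Lixia 1, CHEN Yijun 1, ZHANG Le 1, SUN Chunlian 2, SUN Jian 1	1. School of Life Sciences, Huizhou University, Huizhou 516007, Guangdong; 2. School of Economics and Management, Huizhou University, Huizhou 516007, Guangdong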


Abstract:

【Objective】 To distinguish tea varieties and trace their geographic origin, and to provide a reference for the classification of other plants.

【Method】 Using simple sequence repeat (SSR) markers and bioinformatics methods, the provincial origin of 313 tea samples from Hunan, Yunnan, Fujian and Zhejiang Provinces, together with their relationship to 10 outgroups, was investigated. First, 54 high-quality SSR loci were screened, and the degree of difference among tea samples from the four provinces was analyzed by principal component analysis (PCA) and by constructing a phylogenetic tree. Second, the classification accuracy of a linear regression model, a random forest model and a deep neural network (DNN) model was compared, and the DNN model, which had the highest accuracy, was selected for building and optimizing the traceability model.

【Result】 Samples from each of the four provinces clustered together, and the Yunnan samples differed most from those of the other provinces. The samples from Fujian, Zhejiang and Hunan formed separate clusters, indicating significant differences among the teas of these three provinces. There was nevertheless a small amount of overlap: the three groups share some genetic structure and are closely related. Models built on the matrix of 54 SSR markers gave preliminary accuracies of 81% for the linear regression model, 77% for the random forest model and 86% for the DNN model, so the DNN model classified tea best. A prediction model was then constructed from the 54 SSR markers and 323 samples (the 313 tea samples plus the 10 outgroups), and four parameters were optimized: the number of samples per training pass (batch size), the number of training iterations (step size), the number of hidden layers and the number of nodes in each layer. Accuracy on the validation and test sets was highest, at approximately 95%, with a batch size of 150, a step size of 20 000 and 2 hidden layers; that is, a network with 2 hidden layers performed best for the analysis of tea.

【Conclusion】 DNN-based analysis of SSR molecular markers supports research on tea classification, origin traceability and tea breeding. The constructed classification model can also be used to identify the geographic origin of other species from resequencing data.
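The workflow summarized in the abstract (PCA on a marker matrix, then a small network tuned by batch size, number of training steps and number of hidden layers) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the function names are hypothetical, the network is written in plain NumPy rather than whatever framework the authors used, and the node counts (32 and 16) are placeholders because the abstract does not report them. The batch size and step count are deliberately smaller than the reported 150 and 20 000 so the example runs quickly.

```python
import numpy as np

def make_synthetic_ssr(n_per_class=60, n_markers=54, n_classes=4, seed=0):
    """Hypothetical stand-in for an SSR matrix: rows are samples, columns are markers."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(0.0, 1.0, size=(n_classes, n_markers))
    X = np.vstack([c + rng.normal(0.0, 1.0, size=(n_per_class, n_markers))
                   for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)  # 0..3 play the role of provinces
    return X, y

def pca_scores(X, n_components=2):
    """PCA via SVD of the column-centered matrix; returns scores and variance ratios."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    return Xc @ Vt[:n_components].T, explained[:n_components]

def train_dnn(X, y, hidden=(32, 16), batch_size=32, steps=600, lr=0.02, seed=0):
    """Mini-batch gradient descent for a ReLU network with a softmax output."""
    rng = np.random.default_rng(seed)
    sizes = [X.shape[1], *hidden, int(y.max()) + 1]
    W = [rng.normal(0.0, np.sqrt(2.0 / a), size=(a, b))
         for a, b in zip(sizes[:-1], sizes[1:])]
    b = [np.zeros(k) for k in sizes[1:]]
    for _ in range(steps):
        idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
        acts = [X[idx]]
        for i in range(len(W)):                      # forward pass
            z = acts[-1] @ W[i] + b[i]
            acts.append(np.maximum(z, 0.0) if i < len(W) - 1 else z)
        p = np.exp(acts[-1] - acts[-1].max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)            # softmax probabilities
        grad = p
        grad[np.arange(len(idx)), y[idx]] -= 1.0     # d(cross-entropy)/d(logits)
        grad /= len(idx)
        for i in reversed(range(len(W))):            # backward pass
            gW, gb = acts[i].T @ grad, grad.sum(axis=0)
            if i > 0:
                grad = (grad @ W[i].T) * (acts[i] > 0)
            W[i] -= lr * gW
            b[i] -= lr * gb
    return W, b

def predict(W, b, X):
    a = X
    for i in range(len(W)):
        a = a @ W[i] + b[i]
        if i < len(W) - 1:
            a = np.maximum(a, 0.0)
    return a.argmax(axis=1)

def run_demo(seed=0):
    """Train on 70% of the synthetic samples and return accuracy on the other 30%."""
    X, y = make_synthetic_ssr(seed=seed)
    order = np.random.default_rng(seed).permutation(len(X))
    n_train = int(0.7 * len(X))
    train, test = order[:n_train], order[n_train:]
    mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
    Xtr, Xte = (X[train] - mu) / sd, (X[test] - mu) / sd
    W, b = train_dnn(Xtr, y[train])
    return float((predict(W, b, Xte) == y[test]).mean())
```

Calling `run_demo()` returns the held-out accuracy on the synthetic data only; it says nothing about real tea samples. Varying `hidden`, `batch_size` and `steps` in `train_dnn` mirrors the kind of parameter search the abstract describes, and `pca_scores(X, 2)` gives the two-dimensional projection one would plot to inspect clustering by province.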
