Title Page
Contents
Abstract 11
1. Introduction 13
2. Materials and Methods 17
2.1. eDNA sampling 17
2.2. DNA extraction and polymerase chain reaction 20
2.3. Library preparation and sequencing 23
2.4. Data processing and taxonomic assignment 23
2.5. Biodiversity analysis 24
2.6. Comparison of eDNA metabarcoding and previous survey 25
2.7. Dataset construction for the machine learning classifier 25
2.8. Feature extraction 26
2.9. Model training and evaluation 28
2.10. Model application to eDNA metabarcoding data 29
3. Results and Discussion 30
3.1. Sequencing and biodiversity analysis 30
3.2. Comparative analysis of fish relative abundance 34
3.3. Comparison of the species list of eDNA metabarcoding and previous survey database 45
3.4. Construction of a single descriptor-based machine learning classifier 52
3.5. Evaluation of the machine learning classifier presenting the highest performance 56
3.6. Assignment of eDNA metabarcoding sequences with the machine learning classifier 60
4. Conclusion 63
5. References 65
Abstract in Korean 95
Table 1. Environmental variables for each of the three freshwater sampling sites in the Han River. 19
Table 2. PCR primer sets used to amplify two metabarcoding markers, COI and 12S rRNA. 22
Table 3. Summary of descriptors used to construct the machine learning classifier. 27
Table 4. Summary of reads that remained at each stage of the denoising process. 31
Table 5. Comparison of alpha diversity of eDNA samples from the Han River according to the genetic marker. 44
Table 6. Comparison of species identified in the eDNA metabarcoding and previous survey data. 49
Table 7. The list of 12S rRNA gene sequences used for constructing the machine learning classifier. 53
Table 8. Misclassification rate of the XGBoost model trained using the PseKNC descriptor based on 12S rRNA sequences. 59
Table 9. List of species assigned by consensus-BLAST and the machine learning classifier using 12S rRNA gene. 61
Appendix Table 1. List of amplicon sequence variants (ASVs) classified as Actinopteri by consensus-BLAST assignment at a 98% similarity threshold, based on the COI marker. 74
Appendix Table 2. List of amplicon sequence variants (ASVs) classified as Actinopteri by consensus-BLAST assignment at a 98% similarity threshold, based on the 12S rRNA marker. 78
Appendix Table 3. List of fish species identified using COI and 12S rRNA markers. 90
Appendix Table 4. List of fish species detected in April and June 2022 using the COI marker. 91
Appendix Table 5. List of fish species detected at the three collection sites using the COI marker. 92
Appendix Table 6. List of fish species detected in April and June 2022 using the 12S rRNA marker. 93
Appendix Table 7. List of fish species detected at the three collection sites using the 12S rRNA marker. 94
Figure 1. Three freshwater sampling sites (PL, JB, and HB) in the Han River, Korea; PL: Paldang Lake; JB: Jamsil Bridge; HB: Haengju Bridge. 18
Figure 2. Rarefaction curves for six samples using COI gene. The curves show predicted total diversity by sequencing depth. The x axis represents the sequencing depth of samples and the y axis represents a measure of the species richness, estimated with the... 32
Figure 3. Rarefaction curves for six samples using 12S rRNA gene. The curves show predicted total diversity by sequencing depth. The x axis represents the sequencing depth of samples and the y axis represents a measure of the species richness, estimated... 33
Figure 4. The species level relative abundance of fish communities organized by sampling sites and months using COI marker. The list at the right of figure is sorted from the highest to the lowest relative abundance of taxa; Paldang... 36
Figure 5. Venn diagram showing the number of detected species according to the analysis of eDNA metabarcoding using COI marker (n=28); (a) collection months; (b) collection sites; Paldang Lake (PL); Jamsil Bridge (JB); Haengju Bridge (HB); Red:... 37
Figure 6. The species level relative abundance of fish communities organized by sampling sites and months using 12S rRNA marker. The list at the right of the figure is sorted from the highest to the lowest relative abundance of taxa; Paldang... 39
Figure 7. Venn diagram showing the number of detected species according to the analysis of eDNA metabarcoding using 12S rRNA marker (n=48); (a) collection months; (b) collection sites; Paldang Lake (PL); Jamsil Bridge (JB); Haengju Bridge (HB);... 40
Figure 8. Venn diagram showing the number of detected species according to genetic markers of eDNA metabarcoding. Blue: number of detected species using COI marker. Red: number of detected species using 12S rRNA marker. The... 42
Figure 9. Performance of models constructed with 12S rRNA gene sequences of fish; (a) Sensitivity of each descriptor-based model; (b) Specificity of each descriptor-based model; (c) Precision of each descriptor-based model; (d) Accuracy of each descriptor-based... 55
Figure 10. Confusion matrix of the XGBoost model using the PseKNC descriptor based on 12S rRNA sequences. The rows contain the actual species, and the columns contain the species predicted by the model. The model prediction results are... 58