Data Augmentation for Small-Scale Raman Spectroscopy Dataset in Breast Cancer Cell Classification
DOI:
https://doi.org/10.66535/x4vg6236
Keywords:
Raman spectroscopy, data augmentation, GAN, machine learning, classification, small-scale dataset, breast cell
Abstract
The application of machine learning to Raman spectral analysis is limited by the scarcity of labeled data, particularly in single-cell studies where large datasets are difficult to obtain. Under such small-sample conditions, the reliability of different data augmentation strategies remains unclear. This study systematically evaluates four data augmentation methods for small-scale Raman spectral classification of breast cells: localized blurring, Gaussian noise addition, random amplitude scaling, and generative adversarial network (GAN)-based synthesis. The evaluation focuses on a training set containing 10 samples per class. Distributional similarity between original and augmented data was assessed using Fréchet Inception Distance (FID) and t-distributed stochastic neighbor embedding (t-SNE), and classification performance was evaluated using a one-dimensional ResNet model. The results show that augmentation effectiveness depends on both the augmentation strategy and the number of synthetic samples. Gaussian noise augmentation achieved the highest distributional similarity and improved classification accuracy from 92.45% to 95.35%, while localized blurring also yielded consistent improvements, with accuracies exceeding 94.30%. GAN-based augmentation enhanced performance at suitable augmentation sizes, reaching peak accuracies above 94.10%, but showed greater sensitivity to parameter selection. In contrast, random amplitude scaling provided little improvement across most settings. Additional experiments reduced the size of the original training set to investigate extremely data-scarce scenarios, including 1 or 2 spectra per class. In these cases, data augmentation alleviated severe overfitting, with Gaussian noise and blurring providing the most stable gains, whereas GAN-based augmentation showed variable effectiveness and scaling remained ineffective. These results offer practical, quantitative guidance for selecting appropriate data augmentation strategies in small-scale Raman spectral analysis.
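As an illustration only, the three non-generative augmentations named in the abstract could be sketched in NumPy roughly as follows. This is not the authors' implementation: the function names, the noise level (`sigma_frac`), the scaling range (`low`, `high`), and the blur settings (`window`, `kernel_size`) are all assumed values chosen for the example, and the GAN-based method is omitted because the abstract gives no architectural details.

```python
import numpy as np


def add_gaussian_noise(spectrum, rng, sigma_frac=0.01):
    """Add zero-mean Gaussian noise to every point.

    sigma_frac (assumed value) sets the noise standard deviation as a
    fraction of the spectrum's intensity range.
    """
    sigma = sigma_frac * (spectrum.max() - spectrum.min())
    return spectrum + rng.normal(0.0, sigma, size=spectrum.shape)


def random_amplitude_scale(spectrum, rng, low=0.9, high=1.1):
    """Multiply the whole spectrum by one random factor in [low, high)."""
    return spectrum * rng.uniform(low, high)


def localized_blur(spectrum, rng, window=50, kernel_size=5):
    """Smooth one randomly placed segment with a moving average.

    kernel_size must be odd so the smoothed signal keeps the input length;
    window must not exceed the spectrum length.
    """
    n = spectrum.shape[0]
    half = kernel_size // 2
    kernel = np.ones(kernel_size) / kernel_size
    padded = np.pad(spectrum, half, mode="edge")
    smoothed = np.convolve(padded, kernel, mode="valid")  # length n
    start = rng.integers(0, n - window + 1)
    out = spectrum.copy()
    out[start:start + window] = smoothed[start:start + window]
    return out


def augment(spectra, method, n_synthetic, seed=0):
    """Draw n_synthetic source spectra at random and apply one method to each."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(spectra), size=n_synthetic)
    return np.stack([method(spectra[i], rng) for i in idx])


# Toy stand-in for one class with 10 training spectra of 600 points each.
axis = np.linspace(0.0, 1.0, 600)
base = np.exp(-((axis - 0.4) ** 2) / 0.001) + 0.5 * np.exp(-((axis - 0.7) ** 2) / 0.002)
toy_class = np.stack([base * (1.0 + 0.02 * k) for k in range(10)])

synthetic = augment(toy_class, add_gaussian_noise, n_synthetic=50)  # shape (50, 600)
```

The `n_synthetic` argument corresponds to the number of synthetic samples, which the abstract reports as a factor that affects augmentation effectiveness alongside the choice of method.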