Comparison of Traditional and Modern Topic Model Algorithms in Terms of Topic Determination in Official Documents

Keywords: LDA, LSA, NMF, Topic Modeling, Standard File Plan Codes with Retention Period

Abstract

The rapid growth in the number of documents in digital environments makes their analysis and control difficult. To manage this complexity, separating and classifying digital documents according to defined criteria is becoming increasingly important. Effective document classification draws on techniques such as machine learning, deep learning, and topic modeling. In this study, the LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis), and NMF (Non-Negative Matrix Factorization) algorithms, which are widely used in the literature, were applied to official documents, and their performance in topic determination was compared. The NMF algorithm gave the most successful results, with a c_umass coherence score of -5.217 and a correct classification rate of 88.1%.
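The comparison described above can be sketched in miniature. The snippet below is an illustrative assumption, not the authors' implementation: it factorizes a tiny hypothetical term-document matrix with NMF (one of the three compared algorithms) via multiplicative updates, extracts top words per topic, and scores each topic with the UMass coherence measure referenced in the abstract. The toy corpus, topic count `k`, and iteration count are all made up for demonstration.

```python
import numpy as np

# Tiny hypothetical corpus of official-document snippets (illustrative only)
docs = [
    "leave request personnel annual leave",
    "budget report finance annual budget",
    "personnel appointment decision staff",
    "finance payment invoice budget",
]

# Build vocabulary and a term-document count matrix V (terms x documents)
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}
V = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        V[idx[w], j] += 1

# NMF by multiplicative updates: V ~ W H with all factors non-negative
rng = np.random.default_rng(0)
k = 2  # assumed number of topics for this toy example
W = rng.random((V.shape[0], k))
H = rng.random((k, V.shape[1]))
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

# Top 3 words per topic, ranked by topic-term weight
top = [[vocab[i] for i in np.argsort(W[:, t])[::-1][:3]] for t in range(k)]

def doc_freq(*words):
    """Number of documents containing all of the given words."""
    return sum(all(w in d.split() for w in words) for d in docs)

def umass(words):
    """UMass coherence: sum of log((D(wi, wj) + 1) / D(wj)) over ordered pairs."""
    s = 0.0
    for i in range(1, len(words)):
        for j in range(i):
            s += np.log((doc_freq(words[i], words[j]) + 1) / doc_freq(words[j]))
    return s

for t, words in enumerate(top):
    print(f"topic {t}: {words}, c_umass = {umass(words):.3f}")
```

In the study's setting, LDA and LSA would be fitted on the same document-term representation and the three algorithms compared on their coherence scores; higher (less negative) UMass values indicate more coherent topics.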

Author Biographies

Zeynep Bozdogan, Duzce University

Graduate School

Duzce, Turkey

Resul Kara, Duzce University

Computer Engineering

Duzce, Turkey

Published
2024-06-30
How to Cite
Bozdogan, Z., & Kara, R. (2024). Comparison of Traditional and Modern Topic Model Algorithms in Terms of Topic Determination in Official Documents. Journal of Engineering Research and Applied Science, 13(1), 2490-2499. Retrieved from http://journaleras.com/index.php/jeras/article/view/328