Term Frequency–Inverse Document Frequency (TFIDF) is a vital first step in text analytics for information retrieval and machine learning applications. It is a memory-intensive and complex task because it requires creating and processing a large sparse matrix of term frequencies, with documents as rows and terms as columns, populated with the frequency of each word in each document.
The standard method of storing the sparse matrix is the Compressed Sparse Row (CSR) format, which stores the matrix as three one-dimensional arrays holding the row ids, column ids, and term frequencies. We propose an alternative representation: a list of lists (LIL), in which each document is represented as its own list of tuples, each tuple storing a column id and a term frequency value. We implemented both techniques to compare their memory efficiency and speed. The new LIL representation increases memory capacity by 52% and is only 12% slower in processing time, enabling researchers with limited processing power to work on larger text-analysis datasets.
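The two layouts can be sketched as follows. This is a minimal illustration of the representations described above, not the paper's implementation; the toy corpus, the variable names, and the use of Python's `Counter` are assumptions made for the example.

```python
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["cat", "sat", "cat"],
    ["dog", "sat"],
]

# Map each vocabulary term to a column id.
vocab = {term: j for j, term in enumerate(sorted({t for d in docs for t in d}))}
# vocab == {"cat": 0, "dog": 1, "sat": 2}

# CSR-style layout as described above: three parallel one-dimensional
# arrays for row ids, column ids, and term frequencies.
rows, cols, vals = [], [], []
for i, doc in enumerate(docs):
    for term, tf in Counter(doc).items():
        rows.append(i)
        cols.append(vocab[term])
        vals.append(tf)

# LIL-style layout: one list per document, each entry a
# (column id, term frequency) tuple for that document.
lil = [[(vocab[term], tf) for term, tf in Counter(doc).items()]
       for doc in docs]
```

The LIL form drops the per-entry row id, since the row is implied by a tuple's position in the outer list, which is one intuition for the memory savings the paper reports.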
Senbel, S. (2021). Fast and memory-efficient TFIDF calculation for text analysis of large datasets. In H. Fujita, A. Selamat, J. C.-W. Lin, & M. Ali (Eds.), Advances and trends in artificial intelligence: Artificial intelligence practices (pp. 557–563). Springer. https://doi.org/10.1007/978-3-030-79457-6_47