Document Type

Book Chapter

Publication Date

2021

Abstract

Term Frequency-Inverse Document Frequency (TFIDF) is a vital first step in text analytics for information retrieval and machine learning applications. It is a memory-intensive and complex task because it requires creating and processing a large sparse matrix, with documents as rows and terms as columns, and populating it with the frequency of each term in each document.

The standard method of storing the sparse array is “Compressed Sparse Row” (CSR), which stores it as three one-dimensional arrays holding the row ids, column ids, and term frequencies. We propose an alternative representation to CSR: a list of lists (LIL) in which each document is represented as its own list of tuples, with each tuple storing a column id and the corresponding term frequency value. We implemented both techniques to compare their memory efficiency and speed. The new LIL representation increases memory capacity by 52% and is only 12% slower in processing time. This enables researchers with limited processing power to work on larger text analysis datasets.
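
To make the two layouts concrete, the following is a minimal pure-Python sketch, not the chapter's implementation. It builds the same term-frequency data in both forms: three parallel one-dimensional arrays, as the abstract describes the CSR storage, and a list of lists of (column id, term frequency) tuples. The helper names `tokenize`, `build_vocabulary`, `to_three_arrays`, and `to_lil`, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(docs):
    # Assign each distinct term a column id in order of first appearance.
    vocab = {}
    for doc in docs:
        for term in tokenize(doc):
            vocab.setdefault(term, len(vocab))
    return vocab


def to_three_arrays(docs, vocab):
    # Three parallel 1-D arrays (row id, column id, term frequency),
    # following the abstract's description of the array-based storage.
    rows, cols, freqs = [], [], []
    for row_id, doc in enumerate(docs):
        counts = Counter(vocab[term] for term in tokenize(doc))
        for col_id in sorted(counts):
            rows.append(row_id)
            cols.append(col_id)
            freqs.append(counts[col_id])
    return rows, cols, freqs


def to_lil(docs, vocab):
    # List of lists: one list per document, each holding
    # (column id, term frequency) tuples sorted by column id.
    lil = []
    for doc in docs:
        counts = Counter(vocab[term] for term in tokenize(doc))
        lil.append(sorted(counts.items()))
    return lil


docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)
rows, cols, freqs = to_three_arrays(docs, vocab)
lil = to_lil(docs, vocab)

print(rows, cols, freqs)
print(lil)
```

In the list-of-lists form the row id is implied by a document's position in the outer list, so it is not stored per entry. The sketch does not attempt to reproduce the memory or timing figures reported above.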

Comments

Part of the Lecture Notes in Computer Science book series (LNCS, volume 12798)

Accepted version is posted.

DOI

10.1007/978-3-030-79457-6_47

Recommended Citation

Senbel, S. (2021). Fast and memory-efficient TFIDF calculation for text analysis of large datasets. In H. Fujita, A. Selamat, J. CW. Lin, & M. Ali (Eds.), Advances and trends in artificial intelligence: Artificial intelligence practices (pp. 557-563). Springer. https://doi.org/10.1007/978-3-030-79457-6_47

Included in

Computer Sciences Commons, Data Science Commons

School of Computer Science & Engineering Faculty Publications

Fast and Memory-Efficient TFIDF Calculation for Text Analysis of Large Datasets
