
WEClustering: word embeddings based text clustering technique for large datasets.


Abstract

A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, books, and so on. Text clustering is a fundamental data mining technique used for categorization, topic extraction, and information retrieval. Textual datasets, especially those containing a large number of documents, are sparse and high-dimensional; hence, traditional clustering techniques such as K-means, Agglomerative clustering, and DBSCAN cannot perform well on them. In this paper, a clustering technique especially suited to large text datasets is proposed that overcomes these limitations. The proposed technique is based on word embeddings derived from a recent deep learning model named Bidirectional Encoder Representations from Transformers (BERT). The proposed technique is named WEClustering. It deals with the problem of high dimensionality in an effective manner, and hence more accurate clusters are formed. The technique is validated on several datasets of varying sizes, and its performance is compared with that of other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed clustering technique gives a significant improvement over the other techniques, as measured by metrics such as Purity and the Adjusted Rand Index.

© The Author(s) 2021.
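To illustrate the general idea described in the abstract, here is a minimal sketch (not the authors' exact WEClustering pipeline): documents are embedded with a pretrained BERT model, the embeddings are clustered with K-means, and the result is scored with Purity and the Adjusted Rand Index. The model name ("bert-base-uncased"), mean pooling, the toy documents, and the number of clusters are illustrative assumptions.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def embed_documents(docs, model_name="bert-base-uncased"):
    """Return one mean-pooled BERT vector per document (assumed pooling strategy)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    vectors = []
    with torch.no_grad():
        for doc in docs:
            inputs = tokenizer(doc, truncation=True, max_length=512,
                               return_tensors="pt")
            outputs = model(**inputs)
            # Mean-pool the token embeddings of the last hidden layer.
            vectors.append(outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy())
    return np.vstack(vectors)

def purity(true_labels, pred_labels):
    """Fraction of documents assigned to the majority true class of their cluster."""
    cm = contingency_matrix(true_labels, pred_labels)
    return cm.max(axis=0).sum() / cm.sum()

# Toy corpus with two topics (finance vs. sports) as hypothetical ground truth.
docs = ["stock markets rallied on earnings news",
        "the team won the championship final",
        "central bank raises interest rates",
        "injury forces the striker out of the match"]
true_labels = [0, 1, 0, 1]

embeddings = embed_documents(docs)
pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

print("Purity:", purity(true_labels, pred_labels))
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_labels))
```

Both metrics range up to 1.0 for a perfect clustering; the Adjusted Rand Index additionally corrects for chance agreement, which is why the paper reports both.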
