BoostXML: Gradient Boosting for Extreme Multilabel Text Classification With Tail Labels.

Abstract

Multilabel learning with hundreds of thousands or even millions of labels is referred to as extreme multilabel learning (XML). In XML, the labels often follow a power-law distribution, and the majority of them, known as tail labels, occur in very few data points. Recent years have seen intensive use of deep-learning methods for high-performance XML, but these methods are typically optimized for head labels with abundant training instances and pay less attention to performance on tail labels, which, like needles in a haystack, are often the focus of attention in real-life applications. In light of this, we present BoostXML, a deep learning-based method for extreme multilabel text classification that is greatly enhanced by gradient boosting. In each Boosting Step, BoostXML pays more attention to tail labels by optimizing the residual, which comes mostly from unfitted training instances with tail labels. A Corrective Step is further proposed to avoid a mismatch between the text encoder and the weak learners during optimization, which reduces the risk of falling into local optima and improves model performance. A Pretraining Step is also introduced in the initial stage of BoostXML to avoid excessive bias toward tail labels. Extensive experiments on five benchmark datasets against state-of-the-art baselines demonstrate the advantage of BoostXML in tail-label prediction.
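The idea of fitting each boosting step to the residual can be illustrated with a minimal sketch of generic gradient boosting for multilabel logistic loss. This is not the BoostXML implementation: the text encoder is replaced by fixed feature vectors, the weak learners are assumed to be ridge-regularized linear maps, and the Pretraining Step and Corrective Step are omitted. The names `make_toy_data`, `boost_multilabel`, and `predict_scores` are hypothetical, as are all parameter values.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(Y, F):
    """Mean binary cross-entropy of scores F against 0/1 labels Y."""
    P = np.clip(sigmoid(F), 1e-12, 1 - 1e-12)
    return float(-np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P)))


def make_toy_data(n=300, d=8, n_labels=12, seed=0):
    """Synthetic stand-in data (assumption): later labels are increasingly rare."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    W_true = rng.normal(size=(d, n_labels))
    thresholds = np.linspace(0.0, 2.0, n_labels) * np.sqrt(d)
    Y = (X @ W_true > thresholds).astype(float)
    return X, Y


def boost_multilabel(X, Y, n_steps=20, lr=0.5, ridge=1e-2):
    """Each step fits a linear weak learner to the current residual."""
    d = X.shape[1]
    F = np.zeros(Y.shape)                 # ensemble scores, start at zero
    A = X.T @ X + ridge * np.eye(d)       # shared ridge normal matrix
    learners, losses = [], [bce(Y, F)]
    for _ in range(n_steps):
        R = Y - sigmoid(F)                # negative gradient of the logistic loss
        W = np.linalg.solve(A, X.T @ R)   # least-squares fit to the residual
        F = F + lr * (X @ W)              # shrunken additive update
        learners.append(W)
        losses.append(bce(Y, F))
    return learners, losses


def predict_scores(X, learners, lr=0.5):
    """Sum the shrunken outputs of all weak learners."""
    return sum(lr * (X @ W) for W in learners)


X, Y = make_toy_data()
learners, losses = boost_multilabel(X, Y)
print(f"loss before boosting: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

Because the residual `Y - sigmoid(F)` is largest for instances the ensemble has not yet fit, each new weak learner concentrates on them. In this sketch that holds for any poorly fit label, whereas the abstract states that in BoostXML the residual comes mostly from instances with tail labels.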
