Gaussian Process Classification and Active Learning with Multiple Annotators

Rodrigues, F. and Pereira, F.C. and Ribeiro, B.
In Proceedings of the International Conference on Machine Learning (ICML), 2014

Abstract: Learning from multiple annotators took a valuable step towards modeling data that does not fit the usual single annotator setting, since multiple annotators sometimes offer varying degrees of expertise. When disagreements occur, the establishment of the correct label through trivial solutions such as majority voting may not be adequate, since without considering heterogeneity in the annotators, we risk generating a flawed model.
In this paper, we generalize GP classification in order to account for multiple annotators with different levels of expertise. By explicitly handling uncertainty, Gaussian processes (GPs) provide a natural framework for building proper multiple-annotator models. We empirically show that our model significantly outperforms other commonly used approaches, such as majority voting, without a significant increase in the computational cost of approximate Bayesian inference. Furthermore, an active learning methodology is proposed, which is able to reduce annotation cost even further.

Publications

Sequence labeling with multiple annotators

Rodrigues, F. and Pereira, F.C. and Ribeiro, B.
Machine Learning, Springer, 2013

Abstract: The increasingly popular use of crowdsourcing as a resource to obtain labeled data has contributed to wide awareness within the machine learning community of the problem of supervised learning from multiple annotators. Several approaches have been proposed to deal with this issue, but they disregard sequence labeling problems. However, these are very common, for example, among the Natural Language Processing and Bioinformatics communities. In this paper, we present a probabilistic approach for sequence labeling using Conditional Random Fields (CRF) for situations where label sequences from multiple annotators are available but there is no actual ground truth. The approach uses the Expectation-Maximization algorithm to jointly learn the CRF model parameters, the reliability of the annotators and the estimated ground truth. When it comes to performance, the proposed method (CRF-MA) significantly outperforms typical approaches such as majority voting.

Learning from Multiple Annotators: Distinguishing Good from Random Labelers

Rodrigues, F. and Pereira, F.C. and Ribeiro, B.
Pattern Recognition Letters, Elsevier, 2013

Abstract: With the increasing popularity of online crowdsourcing platforms such as Amazon Mechanical Turk (AMT), building supervised learning models for datasets with multiple annotators is receiving increasing attention from researchers. These platforms provide an inexpensive and accessible resource that can be used to obtain labeled data, and in many situations the quality of the labels competes directly with that of experts. For such reasons, much attention has recently been given to annotator-aware models. In this paper, we propose a new probabilistic model for supervised learning with multiple annotators where the reliability of the different annotators is treated as a latent variable. We empirically show that this model is able to achieve state-of-the-art performance while reducing the number of model parameters, thus avoiding potential overfitting. Furthermore, the proposed model is easier to implement and extend to other classes of learning problems such as sequence labeling tasks.

Text analysis in incident duration prediction

Pereira, F.C. and Rodrigues, F. and Ben-Akiva, M.
Transportation Research Part C, Elsevier, 2013

Abstract: Due to their heterogeneous case-by-case nature, plenty of relevant information about traffic incidents is communicated in free flow text fields instead of constrained value fields. As a result, such text components enclose considerable richness that is invaluable for incident analysis, modeling and prediction. However, the difficulty of formally interpreting such data has led to minimal consideration in previous work.

This paper proposes the use of topic modeling, a text analysis technique, in the problem of incident duration prediction. We analyze a dataset of 2 years of accident cases and develop a duration prediction model that considers both textual and non-textual features. To demonstrate the value of the approach, we compare predictions with and without text analysis using several different prediction models.
