Bilingual Distributed Word Representations from Document-Aligned Comparable Data

Ivan Vulić; Marie-Francine Moens

doi:10.1613/jair.4986

PDF PS

Published: Apr 12, 2016

DOI: https://doi.org/10.1613/jair.4986

Ivan Vulić

Marie-Francine Moens

Abstract

We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs which heavily relied on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, the article reveals that BWEs may be learned solely on the basis of document-aligned comparable data without any additional lexical resources nor syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and acquire the best reported results on both tasks for all three tested language pairs.

Issue

Vol. 55 (2016)

Section

Articles

Article Sidebar

Main Article Content

Abstract

Article Details