Language identification using the Out Of Place method for N-Gram distributions

Recently, I have been looking for methods of language identification for a large collection of heterogeneous documents. After experimenting with different models over the weekend, I decided to share one that is simple to implement and can be used as a baseline. It does not require expert knowledge of NLP and can also serve as a foundation for more advanced models.

The method we will use today is N-Gram classification, introduced in the paper “N-Gram-Based Text Categorization” by W. Cavnar and J. Trenkle (1994). The idea is to create a language profile based on an N-Gram frequency distribution, with N ranging from 1 to 5. According to the authors, the top 300 N-Grams are generally sufficient to identify a language and construct a text category profile. The profile is used to calculate the distance between a document and a language by comparing N-Gram positions in their profiles with the Out Of Place method (Figure 1). If you are interested in a more detailed explanation, check the original paper in the references section at the end of the post.

Figure 1: Out Of Place method
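
To make the distance calculation concrete, here is a toy sketch of the Out Of Place measure between two ranked profiles (most frequent N-Gram first). The two profiles and the helper function are invented for this illustration only; the same logic appears in the _compute_distance method of the implementation below.

# Toy illustration of the Out Of Place distance between two ranked
# N-Gram profiles. The profiles below are made up for this example.
def out_of_place_distance(lang_profile, doc_profile):
    max_rank = len(doc_profile)  # rank assigned to N-Grams missing from the language profile
    distance = 0
    for doc_rank, ngram in enumerate(doc_profile):
        lang_rank = lang_profile.index(ngram) if ngram in lang_profile else max_rank
        distance += abs(lang_rank - doc_rank)
    return distance

english_profile = ["th", "he", "in", "er", "an"]
document_profile = ["he", "th", "an", "zz", "er"]

# "he": |1-0|=1, "th": |0-1|=1, "an": |4-2|=2,
# "zz": missing -> |5-3|=2, "er": |3-4|=1, total = 7
print (out_of_place_distance(english_profile, document_profile))  # 7

The smaller the total, the closer the document is to the language profile; the language with the minimal distance is chosen.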

Software implementation

The code is implemented in Python and uses only one external library. I followed the principle of self-documenting code and aimed to make the names of methods and variables reflect their purpose. Therefore, I commented only the few sections of code that could benefit from it.

Requirements:

  • Python 2.7+
  • NLTK 3.0+

Code:

# coding:utf-8
__author__ = "Walter Volodenkov"
import operator
import os
import pickle
import string

from nltk.util import ngrams

# According to the paper, the top 300 N-Grams tend to correlate highly with the language
MAX_NGRAMS = 300
MODEL_FILENAME = "model.bin"

class LanguageIdentifier:
    def __init__(self):
        self._lang_ngrams_freqs = {}
        self._lang_profiles = None

    def _tokenize(self, text):
        # Strip punctuation and digits, then split on whitespace
        text = text.translate(string.maketrans("", ""), string.punctuation + string.digits)
        return text.split()

    def _update_lang_freq_distr(self, ngrams, label):
        if label not in self._lang_ngrams_freqs:
            self._lang_ngrams_freqs[label] = {}
        freqs = self._lang_ngrams_freqs[label]
        for ngram in ngrams:
            new_value = freqs.get(ngram, 0) + 1
            freqs[ngram] = new_value

    def _get_text_ngram_freq(self, ngrams):
        freq = {}
        for ngram in ngrams:
            new_value = freq.get(ngram, 0) + 1
            freq[ngram] = new_value
        return freq

    def _create_ngrams(self, tokens):
        _ngrams = []
        for token in tokens:
            for n in range(1, 6):
                # Pad with spaces so N-Grams at word boundaries are captured
                ngram_tuples = ngrams(token, n, pad_left=True, pad_right=True,
                                      left_pad_symbol=' ', right_pad_symbol=' ')
                for ngram_tuple in ngram_tuples:
                    _ngrams.append(''.join(ngram_tuple))
        return _ngrams

    def _get_document_ngrams(self, text):
        tokens = self._tokenize(text)
        ngrams = self._create_ngrams(tokens)
        return ngrams

    def _add_document_to_lang_freq_distr(self, filepath, label):
        file = open(filepath, mode='r')
        text = file.read()
        file.close()
        ngrams = self._get_document_ngrams(text)
        self._update_lang_freq_distr(ngrams, label)

    # Sort N-Grams by frequency and keep the top MAX_NGRAMS with their counts
    def _get_top_ngrams_freq(self, freq):
        ngrams_sorted_freq = sorted(freq.iteritems(), key=operator.itemgetter(1), reverse=True)
        top_ngrams_freq = ngrams_sorted_freq[0:MAX_NGRAMS]
        return top_ngrams_freq

    def _create_language_profiles(self):
        profiles = {}
        for lang, ngram_freq in self._lang_ngrams_freqs.iteritems():
            top_ngrams_freq = self._get_top_ngrams_freq(ngram_freq)
            top_ngrams = [ng[0] for ng in top_ngrams_freq]
            profiles[lang] = top_ngrams
        return profiles

    def _save_language_profiles(self, profiles):
        # Binary mode keeps the pickled model portable across platforms
        file = open(MODEL_FILENAME, "wb")
        pickle.dump(profiles, file)
        file.close()

    def _load_language_profiles(self):
        file = open(MODEL_FILENAME, "rb")
        profiles = pickle.load(file)
        file.close()
        return profiles

    # Compute distance with the Out Of Place (OOP) method
    def _compute_distance(self, lang_ngrams, doc_ngrams):
        doc_distance = 0
        # N-Grams missing from the language profile get the maximum rank
        max_oop_val = len(doc_ngrams)
        for doc_index, ngram in enumerate(doc_ngrams):
            lang_index = max_oop_val
            if ngram in lang_ngrams:
                lang_index = lang_ngrams.index(ngram)
            doc_distance += abs(lang_index - doc_index)
        return doc_distance

    def _predict_language(self, text):
        distances = {}
        doc_ngrams = self._get_document_ngrams(text)
        doc_ngrams_freq = self._get_text_ngram_freq(doc_ngrams)
        doc_top_ngrams_freq = self._get_top_ngrams_freq(doc_ngrams_freq)
        doc_top_ngrams = [ng[0] for ng in doc_top_ngrams_freq]
        for lang, lang_top_ngrams in self._lang_profiles.iteritems():
            distance = self._compute_distance(lang_top_ngrams, doc_top_ngrams)
            distances[lang] = distance
        nearest_lang = min(distances, key=distances.get)
        return nearest_lang

    def predict_language(self, text):
        self._lang_profiles = self._load_language_profiles()
        return self._predict_language(text)

    def evaluate(self, data_path):
        correct = 0
        total = 0
        self._lang_profiles = self._load_language_profiles()
        for filename in os.listdir(data_path):
            if filename.startswith("."):
                continue
            if "_" not in filename:
                print ("Incorrect data file name: " + filename)
                continue
            label = filename.split("_")[0]
            if label not in self._lang_profiles:
                print ("Model was not trained to recognize language " + label)
                continue
            filepath = os.path.join(data_path, filename)
            file = open(filepath, "r")
            text = file.read()
            file.close()
            predicted_lang = self._predict_language(text)
            if predicted_lang == label:
                correct += 1
            total += 1
            if total % 100 == 0:
                print ("Evaluated %d files") % (total)
                print ("Precision %.3f") % (correct/ float(total))
        print ("Evaluation completed")
        precision = correct/float(total)
        return precision

    def train(self, data_path):
        counter = 0
        for filename in os.listdir(data_path):
            if filename.startswith("."):
                continue
            if "_" not in filename:
                print ("Incorrect data file name: " + filename)
                continue
            label = filename.split("_")[0]
            filepath = os.path.join(data_path, filename)
            self._add_document_to_lang_freq_distr(filepath, label)
            counter += 1
            if counter % 100 == 0:
                print ("Processed %d files") % (counter)
        print ("Processing completed")
        profiles = self._create_language_profiles()
        self._save_language_profiles(profiles)

def train():
    lang_identifier = LanguageIdentifier()
    path = "path to training dataset directory"
    lang_identifier.train(path)

def predict_document_language():
    path = "path to document"
    file = open(path, "r")
    text = file.read()
    file.close()
    lang_identifier = LanguageIdentifier()
    lang = lang_identifier.predict_language(text)
    print (lang)

def evaluate():
    lang_identifier = LanguageIdentifier()
    path = "path to test dataset directory"
    precision = lang_identifier.evaluate(path)
    print ("Precision %.3f") % (precision)

def main():
    train()
    #predict_document_language()
    #evaluate()

if __name__=='__main__':
    main()

Dataset

After examining open datasets with multiple languages, I decided to use the datasets of Wikipedia articles collected by Timothy Baldwin and Marco Lui, which are available on T. Baldwin's personal page. For training I used the Wikipedia language evaluation dataset from Marco Lui and Timothy Baldwin (2011), with 10000 samples from 68 languages. For evaluation I used the Wikipedia dataset from Timothy Baldwin and Marco Lui (2010), with 4962 samples from 67 languages. More information about the datasets is available in the archived files on T. Baldwin's page mentioned above. Unfortunately, the data was rather noisy, and some documents included HTML/wiki markup, which could affect the training outcome. Nevertheless, the application achieved 0.783 precision on the test set, which seems adequate.
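
If you want to reduce the impact of that noise, a simple pre-processing pass could strip the most obvious HTML-style tags before training. The helper below is only a rough sketch of such a pass; it was not applied to the results reported above.

# Rough sketch of a cleanup step that strips HTML/XML-style tags.
# Not used for the reported results; shown only as a possible improvement.
import re

def strip_markup(text):
    # Drop anything that looks like a tag, e.g. <ref> or </div>
    return re.sub(r"<[^>]+>", " ", text)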

Operation

There are three modes of operation in the app: training on a dataset, predicting the language of a single document, and evaluating the method on a whole set of documents. Documents are stored in a directory, and their names start with an ISO language code, separated from the rest of the name by an underscore. Here are a few example file names (a short usage sketch follows them):

  • en_1234
  • de_3545.txt
  • fr_401
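
For reference, a minimal usage sketch looks roughly like this; the directory paths are placeholders, and train() must have produced model.bin before prediction or evaluation is run.

# Minimal usage sketch; the paths below are placeholders.
identifier = LanguageIdentifier()
identifier.train("path to training dataset directory")   # builds profiles and saves model.bin
lang = identifier.predict_language("The quick brown fox jumps over the lazy dog")
print (lang)                                              # an ISO code such as "en"
precision = identifier.evaluate("path to test dataset directory")
print ("Precision %.3f" % precision)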

Conclusion

The method is mature and offers reasonable precision considering its simplicity and low computational cost. Even though it is not a state-of-the-art classifier, it provides a foundation that can be built upon. This makes it a good candidate for implementing a proof of concept.

References:

  • W. B. Cavnar and J. M. Trenkle. 1994. N-Gram-Based Text Categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR-94).

  • Timothy Baldwin and Marco Lui. 2010. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010).

  • Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011).