Natural Language Processing

Fabian Suchanek (fabian@LASTNAME.name)

10th and 15th of June 2011

The purpose of this lab session is to get some hands-on experience with Part-of-Speech (POS) tagging.

1. A POS-tagged Corpus

We have produced a POS-tagged corpus at /infres/ic2/suchanek/nlp/wikipedia_pos.txt.
  1. Guess the meanings of the tags NNP, VB, DT, JJ, NN, IN. Explain your guesses.
There are many other tags about which we will not worry too much.
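
For orientation: the corpus files are not reproduced here, but in the common word/TAG convention (which the tag names above suggest), a tagged sentence might look like this hypothetical example:

    April/NNP is/VBZ the/DT fourth/JJ month/NN of/IN the/DT year/NN ./.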

2. Building a Hidden Markov Model

The goal is to build a Hidden Markov Model (HMM) from the POS-tagged corpus. The model contains
  1. the emission probabilities tag → word
  2. the transition probabilities tag → tag
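
Both kinds of probabilities are estimated from corpus counts; this is what the normalize() step below will compute:

    P(word | tag)      = count(tag emits word) / count(tag)
    P(toTag | fromTag) = count(fromTag followed by toTag) / count(fromTag)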

You may work in any programming language (as long as it produces the desired output). In case you opt for Java, you can proceed as follows:

  1. Make a class HiddenMarkovModel, which contains two fields:
      protected Map<String, Map<String,Double>> transitionProb;
      protected Map<String, Map<String,Double>> emissionProb;
    
    The first maps each tag to a map from successor tags to the number of times that transition was seen. The second maps each tag to a map from words to the number of times that emission was seen. (Despite their names, both fields hold raw counts at first; normalize() in step 4 turns these counts into probabilities.)
  2. Write a toString() method that outputs both maps in a readable format.
  3. Write the methods
    public void foundTransition(String fromTag, String toTag);
    public void foundEmission(String tag, String word);
    
    ...which increase the respective counter by 1.

    For example, if the transitionProb map is empty and we call

    foundTransition("NNP", "VB");
    foundTransition("NNP", "VB");
    foundTransition("NNP", "ADJ");
    foundTransition("VB", "ADJ");
    
    then the map should be NNP -> {VB -> 2, ADJ -> 1}, VB -> {ADJ -> 1}.
  4. Write a method normalize(), which normalizes the counts in transitionProb and emissionProb to probabilities.

    For example, the above map should become NNP -> {VB -> 0.667, ADJ -> 0.333}, VB -> {ADJ -> 1}.

  5. Write a constructor that takes a file as argument, parses it, and builds the Hidden Markov Model.

    Hint: one possible constructor is part of the sketch after this list (which has to guess the exact file format).

  6. Build a model on the sample file /infres/ic2/suchanek/nlp/sample_pos.txt. Explain the result of toString() of the model.
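
The following is a minimal sketch of how such a class might look in Java. It has to guess the exact corpus format (assumed here: one sentence per line, whitespace-separated word/TAG tokens, as in the hypothetical example in Section 1); it also already includes Serializable and the two probability lookup methods that Section 3 will ask for.

    import java.io.*;
    import java.util.*;

    // Sketch only: the corpus format (word/TAG tokens, one sentence per line)
    // is an assumption. Serializable anticipates Section 3.
    public class HiddenMarkovModel implements Serializable {

      protected Map<String, Map<String, Double>> transitionProb = new HashMap<>();
      protected Map<String, Map<String, Double>> emissionProb = new HashMap<>();

      // Increases the counter map[key1][key2] by 1
      private static void count(Map<String, Map<String, Double>> map,
                                String key1, String key2) {
        map.computeIfAbsent(key1, k -> new HashMap<>()).merge(key2, 1.0, Double::sum);
      }

      public void foundTransition(String fromTag, String toTag) {
        count(transitionProb, fromTag, toTag);
      }

      public void foundEmission(String tag, String word) {
        count(emissionProb, tag, word);
      }

      // Divides each count by the total count for its tag, turning the raw
      // counts into probabilities
      public void normalize() {
        for (Map<String, Map<String, Double>> map : Arrays.asList(transitionProb, emissionProb)) {
          for (Map<String, Double> inner : map.values()) {
            double sum = 0;
            for (double v : inner.values()) sum += v;
            final double total = sum;
            inner.replaceAll((key, value) -> value / total);
          }
        }
      }

      // Returns 0 for unseen input, as Section 3 requires
      public double emissionProbability(String tag, String word) {
        return emissionProb.getOrDefault(tag, Collections.emptyMap()).getOrDefault(word, 0.0);
      }

      public double transitionProbability(String fromTag, String toTag) {
        return transitionProb.getOrDefault(fromTag, Collections.emptyMap()).getOrDefault(toTag, 0.0);
      }

      // Parses the corpus file and builds the model
      public HiddenMarkovModel(File file) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
          String line;
          while ((line = in.readLine()) != null) {
            String previousTag = null;
            for (String token : line.trim().split("\\s+")) {
              int slash = token.lastIndexOf('/');
              if (slash <= 0) continue; // skip malformed tokens
              String word = token.substring(0, slash);
              String tag = token.substring(slash + 1);
              foundEmission(tag, word);
              if (previousTag != null) foundTransition(previousTag, tag);
              previousTag = tag;
            }
          }
        }
        normalize();
      }

      @Override
      public String toString() {
        return "Transitions: " + transitionProb + "\nEmissions: " + emissionProb;
      }
    }

With this, item 6 amounts to printing new HiddenMarkovModel(new File("/infres/ic2/suchanek/nlp/sample_pos.txt")) (modulo the guessed file format).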

3. Viterbi

The goal is to implement a POS-Tagger by the Viterbi algorithm. You may work in any programming language (as long as it produces the desired output). In case you opt for Java, you can proceed as follows:
  1. Do not be afraid
  2. Make your Hidden Markov Model ready for action: Have the loop in the constructor break after 10000 words and have the class implement Serializable. Write the methods
    public double emissionProbability(String tag, String word);
    public double transitionProbability(String fromTag, String toTag);
    
    which return the respective probability, or 0 if the tag or word was never seen.
  3. Build a model for the file /infres/ic2/suchanek/nlp/sound_check.txt and store it on disk (using ObjectOutputStream).
  4. Make a class Viterbi, which has a field protected HiddenMarkovModel model and a constructor that takes as argument an HMM stored on disk (use ObjectInputStream).
  5. Write a method public List<String> parse(String sentence), which, given a sentence, returns the list of most likely tags.

    Hint: a sketch of one possible implementation is given after this list.

  6. Parse the sentence "Sound sounds" and output the tags with non-zero probability for each word. After the first iteration, which tag does the algorithm "think" is most likely for the word "Sound"? Why does the algorithm "change its mind"? Save and explain the tags with non-zero probability for each word in each iteration, as well as the final tag assignment.
  7. Build a model for the Wikipedia POS-tagged corpus. (Have the constructor loop break after fewer words in case you run into trouble.) Parse the sentence "April is the fourth month of the year". Check whether you can reproduce the original tags. Save and explain the tags with non-zero probability for each word in each iteration, as well as the final tag assignment.
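
As a reminder, the Viterbi algorithm computes, for each position i in the sentence and each tag t, the probability best(i, t) of the most likely tag sequence for the first i words that ends in tag t:

    best(i, t) = max over t' of best(i-1, t') * P(t | t') * P(word_i | t)

The maximizing predecessor t' is remembered, so that the best tag sequence can be read off backwards from the most probable final tag. A minimal sketch of the Viterbi class along these lines is given below. It assumes the HiddenMarkovModel sketch from Section 2 in the same package (so that the protected emissionProb field is visible) and, since the handout does not specify an initial-state distribution, scores the first word by its emission probability alone.

    import java.io.*;
    import java.util.*;

    // Sketch only: assumes the HiddenMarkovModel sketched in Section 2,
    // serialized to disk with an ObjectOutputStream.
    public class Viterbi {

      protected HiddenMarkovModel model;

      // Loads a previously serialized model from disk
      public Viterbi(File modelFile) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(modelFile))) {
          model = (HiddenMarkovModel) in.readObject();
        }
      }

      public List<String> parse(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        // best.get(i).get(t) = probability of the best tag sequence for
        // words[0..i] ending in tag t; back stores the predecessor tag
        List<Map<String, Double>> best = new ArrayList<>();
        List<Map<String, String>> back = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
          Map<String, Double> column = new HashMap<>();
          Map<String, String> pointer = new HashMap<>();
          // iterate over all tags seen in training (same-package field access)
          for (String tag : model.emissionProb.keySet()) {
            double emission = model.emissionProbability(tag, words[i]);
            if (emission == 0) continue;
            if (i == 0) {
              // no initial-state distribution given; use the emission alone
              column.put(tag, emission);
            } else {
              for (Map.Entry<String, Double> prev : best.get(i - 1).entrySet()) {
                double p = prev.getValue()
                    * model.transitionProbability(prev.getKey(), tag) * emission;
                if (p > column.getOrDefault(tag, 0.0)) {
                  column.put(tag, p);
                  pointer.put(tag, prev.getKey());
                }
              }
            }
          }
          best.add(column);
          back.add(pointer);
        }
        // pick the most probable final tag and follow the back-pointers
        LinkedList<String> tags = new LinkedList<>();
        String tag = best.get(words.length - 1).entrySet().stream()
            .max(Map.Entry.comparingByValue()).map(Map.Entry::getKey).orElse(null);
        for (int i = words.length - 1; i >= 0 && tag != null; i--) {
          tags.addFirst(tag);
          tag = back.get(i).get(tag);
        }
        return tags;
      }
    }

Item 6 then amounts to something like new Viterbi(new File("sound_check.model")).parse("Sound sounds"), where the model filename is whatever you chose in item 3; printing the best map of each iteration gives the non-zero tags asked for above.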