NLP Seminar

Rob Speer

Contents

The state of NLP

computational_linguists.png

NLP chain in Open Mind

An example of cheap tricks that get something done.

  • Start with English text: "Something you might do while waiting in line is talking".
  • Combination of chunking and pattern matching: HasSubevent(waiting in line, talking)
  • Remove stopwords: HasSubevent(waiting line, talking)
  • Stemming: HasSubevent(wait line, talk)
  • Canonical word order: HasSubevent(line wait, talk)
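
The last three steps of the chain can be sketched in a few lines of Python. This is only an illustration, not the Open Mind code: the stopword set and the suffix-stripping toy_stem function are hypothetical stand-ins for a real stopword list and a real stemmer, and "canonical word order" is assumed here to mean alphabetical order, which is consistent with the example above.

```python
# Toy stand-ins (assumptions for illustration): a real system would use a
# full stopword list and a real stemmer such as the Porter stemmer.
STOPWORDS = {"a", "an", "the", "in", "of", "on", "to", "is"}
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one common suffix, keeping at least three letters of the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(phrase):
    """Lowercase, drop stopwords, stem, then sort the words alphabetically."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(toy_stem(w) for w in words))


def normalize_assertion(relation, left, right):
    """Format a relation whose two arguments have been normalized."""
    return "%s(%s, %s)" % (relation, normalize(left), normalize(right))


# prints HasSubevent(line wait, talk)
print(normalize_assertion("HasSubevent", "waiting in line", "talking"))
```

Sorting the words means "waiting in line" and "line waiting" end up as the same string, which is the point of the canonical order.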

Natural Language ToolKit

Stopwords

the bill of rights => bill rights
to be or not to be, that is the question => question
  • Remove "unimportant" words such as articles and prepositions
  • Stopword lists are easy to find (NLTK includes one)

Stemming

happiness => happi
happier => happi
happy => happi
computational linguistics => comput linguist

Lemmatizing

the squawking ducks flew away
  => [the] [squawk +PRES] [duck +PL] [fly +PAST] [away]

Tagging

You know who 's had it too easy ? Computational linguists .
  => You/PRP know/VBP who/WP 's/VBZ had/VBD it/PRP too/RB easy/JJ ?/.
     Computational/JJ linguists/NNS ./.

Other minor operations

  • Sentence splitting
    • Just write some regular expressions
  • Tokenization (who's => who 's)
    • Use regular expressions, or a sed script that comes with MXPOST
  • Unicode normalization (naïve => naive)
    • Use your Unicode library's NFKD normalization (which splits accented letters into base letter plus combining mark) and throw away the combining marks
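
A rough sketch of all three operations using only Python's standard library. The function names and the regular expressions are assumptions made for this example, not taken from MXPOST or NLTK. The diacritic stripper uses a decomposing normalization form (NFKD), because the accents have to become separate combining characters before they can be dropped.

```python
import re
import unicodedata


def split_sentences(text):
    """Split at whitespace that follows ., ! or ? and precedes a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def tokenize(sentence):
    """Split off punctuation and clitics (who's => who 's, don't => do n't)."""
    return re.findall(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]", sentence)


def strip_diacritics(text):
    """Decompose accented characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


text = "You know who's had it too easy? Computational linguists."
for sentence in split_sentences(text):
    print(" ".join(tokenize(sentence)))

print(strip_diacritics("na\u00efve"))
```

The sentence splitter is the weakest of the three: any abbreviation followed by a capitalized word ("Dr. Jekyll") will fool it, so real text needs a list of exceptions.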

Chunking

A/DT piece/NN of/IN paper/NN is/VBZ for/IN writing/VBG things/NNS down/RB
  => [NP a piece of paper] is for [VP writing things down]
  • Specific to the application
  • Sometimes regular expressions over tags are sufficient
  • Open Mind uses a custom nondeterministic chunker written in NLTK
  • Stanford Named Entity Recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml
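
To show what "regular expressions over tags" can look like, here is a minimal sketch in plain Python. It is not the Open Mind chunker: the two chunk patterns are hypothetical and written only to cover the example sentence. The trick is to turn the tag sequence into a string of <TAG> symbols, match against that string, and count the "<" characters to map each match back to word positions.

```python
import re

# Hypothetical chunk grammar over a string of <TAG> symbols.
CHUNK_RE = re.compile(
    # VP: a gerund, its noun objects, and an optional adverb or particle
    r"(?P<VP><VBG>(?:<NNS?>)+(?:<R[BP]>)?)"
    # NP: optional determiner, adjectives, nouns, then any "of ..." attachments
    r"|(?P<NP>(?:<DT>)?(?:<JJ>)*(?:<NNS?>)+(?:<IN>(?:<DT>)?(?:<NNS?>)+)*)"
)


def chunk(tagged_text):
    """Bracket the chunks found in a string of word/TAG tokens."""
    pairs = [token.rsplit("/", 1) for token in tagged_text.split()]
    words = [word.lower() for word, _ in pairs]
    tag_string = "".join("<%s>" % tag for _, tag in pairs)

    output, position = [], 0
    for match in CHUNK_RE.finditer(tag_string):
        # Each "<" before the match corresponds to one earlier token.
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        output.extend(words[position:start])
        output.append("[%s %s]" % (match.lastgroup, " ".join(words[start:end])))
        position = end
    output.extend(words[position:])
    return " ".join(output)


tagged = ("A/DT piece/NN of/IN paper/NN is/VBZ for/IN "
          "writing/VBG things/NNS down/RB")
# prints [NP a piece of paper] is for [VP writing things down]
print(chunk(tagged))
```

A single deterministic pass like this always takes the first pattern that matches, which is one reason an application may want a nondeterministic chunker instead.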

Parsing

Tree structure

A/DT piece/NN of/IN paper/NN is/VBZ for/IN writing/VBG
things/NNS down/RP ./.  =>

(S (NP (NP (DT A) (NN piece))
       (PP (IN of) (NP (NN paper))))
   (VP (VBZ is)
       (PP (IN for)
           (S (VP (VBG writing) 
                  (NP (NNS things))
                  (PRT (RP down))))))
   (. .))
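
Bracketed trees like this one are easy to read into a program. The sketch below uses only the standard library and hypothetical function names (parse_tree, leaves, phrases); it assumes the brackets are balanced and that every node has the form (LABEL child ...).

```python
import re


def parse_tree(text):
    """Read a bracketed parse into nested lists: [label, child, child, ...]."""
    stack = [[]]
    for token in re.findall(r"\(|\)|[^\s()]+", text):
        if token == "(":
            stack.append([])          # start a new node
        elif token == ")":
            node = stack.pop()        # finish the node, attach it to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(token)   # a label or a word
    return stack[0][0]


def leaves(tree):
    """Return the words at the bottom of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:            # tree[0] is the node label
        words.extend(leaves(child))
    return words


def phrases(tree, label):
    """Collect the text of every subtree with the given label, top down."""
    found = []
    if isinstance(tree, list):
        if tree[0] == label:
            found.append(" ".join(leaves(tree)))
        for child in tree[1:]:
            found.extend(phrases(child, label))
    return found


tree = parse_tree("""
(S (NP (NP (DT A) (NN piece))
       (PP (IN of) (NP (NN paper))))
   (VP (VBZ is)
       (PP (IN for)
           (S (VP (VBG writing)
                  (NP (NNS things))
                  (PRT (RP down))))))
   (. .))
""")
# prints ['A piece of paper', 'A piece', 'paper', 'things']
print(phrases(tree, "NP"))
```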

Link structure

a piece of paper is for writing things down =>

     +--------Ss-------+            +--------K--------+
+-Ds-+--Mp-+-Jp-+      +-Pp-+--Mgp--+----Op---+       |
|    |     |    |      |    |       |         |       |
a piece.n of paper.n is.v for.p writing.v things.n down.e 

Decisions

  • Hand-written rules or statistical methods?
  • Trees or links?

A few good parsers

 How many lemons did Dr Jekyll eat?
   =>
 [Image: candc.png]