NLP Seminar
Rob Speer
The state of NLP
NLP chain in Open Mind
An example of cheap tricks that get something done.
- Start with English text: "Something you might do while waiting in line is talking".
- Combination of chunking and pattern matching: HasSubevent(waiting in line, talking)
- Remove stopwords: HasSubevent(waiting line, talking)
- Stemming: HasSubevent(wait line, talk)
- Canonical word order: HasSubevent(line wait, talk)
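The chain above can be sketched end to end in a few lines. This is only an illustration, not the Open Mind code: the single regular expression, the tiny stopword set, the `crude_stem` helper, and the choice of alphabetical sorting as the "canonical" order are all assumptions made for the example.

```python
import re

# Hypothetical pattern for one sentence frame; a real system would need many.
PATTERN = re.compile(
    r"^Something you might do while (?P<event>.+) is (?P<sub>.+?)\.?$"
)

# Tiny illustrative stopword set (assumption).
STOPWORDS = {"a", "an", "the", "in", "of", "to"}


def crude_stem(word):
    """Strip a trailing 'ing'; a stand-in for a real stemmer."""
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


def normalize(phrase):
    """Remove stopwords, stem, then sort the words alphabetically."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(crude_stem(w) for w in words))


def extract(sentence):
    """Return (relation, event, subevent), or None if the frame does not match."""
    match = PATTERN.match(sentence)
    if match is None:
        return None
    return ("HasSubevent", normalize(match["event"]), normalize(match["sub"]))


print(extract("Something you might do while waiting in line is talking."))
# → ('HasSubevent', 'line wait', 'talk')
```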
Natural Language ToolKit
- Python NLP tools
- http://nltk.sourceforge.net
- Gives you the Legos, not the castle
- nltk.data: Loads of corpora from different sources
Stopwords
the bill of rights => bill rights
to be or not to be, that is the question => question
- Remove "unimportant" words such as articles and prepositions
- Stopword lists are all over the place (including NLTK)
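A minimal sketch of stopword removal, using only the standard library. The `STOPWORDS` set here is a hand-picked assumption that happens to cover the two examples; a real list (such as the one shipped with NLTK) is much longer.

```python
import re

# Small illustrative stopword set (assumption; real lists are much longer).
STOPWORDS = {
    "a", "an", "the", "of", "to", "be", "or", "not", "that", "is", "in", "on",
}


def remove_stopwords(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


print(remove_stopwords("the bill of rights"))
# → ['bill', 'rights']
print(remove_stopwords("to be or not to be, that is the question"))
# → ['question']
```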
Stemming
happiness => happi
happier => happi
happy => happi
computational linguistics => comput linguist
- Remove letters from the ends of words, until all forms of the same root word are represented by the same string
- Basically everything is based on the Porter stemmer
- Snowball: http://snowball.tartarus.org/
- PyStemmer: http://sourceforge.net/projects/pystemmer/
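To make the idea of suffix stripping concrete, here is a toy stemmer. It is not the Porter algorithm: the rule table and the `MIN_STEM` threshold are invented for this example and only cover the words shown above. For real work, use one of the stemmers listed.

```python
# Ordered (suffix, replacement) rules. A toy table, NOT the Porter algorithm.
SUFFIX_RULES = [
    ("ational", ""),
    ("ness", ""),
    ("ics", ""),
    ("ing", ""),
    ("er", ""),
    ("y", "i"),
]
MIN_STEM = 3  # refuse to strip if fewer than this many letters would remain


def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            return stem + replacement
    return word


for w in ["happiness", "happier", "happy", "computational", "linguistics"]:
    print(w, "=>", toy_stem(w))
```

All three forms of "happy" collapse to the same string, which is the whole point: the stem does not have to be a real word, only a consistent key.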
Lemmatizing
the squawking ducks flew away => [the] [squawk +PRES] [duck +PL] [fly +PAST] [away]
- More principled than stemming
- PC-KIMMO and ENGLEX: http://www.sil.org/pckimmo/v2/doc/englex.html
  - C code that thinks it's running on MS-DOS
- Roll your own from the UPenn flat wordlist: http://xbean.cs.ccu.edu.tw/~dan/XTag/morph-1.5/data/morph_english.flat
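A rough sketch of the "roll your own" route: load a flat wordlist into a dictionary and look tokens up. The file layout used here (surface form, a tab, then a lemma followed by feature tags, with `;` starting a comment) is an assumption for illustration, and the sample entries and feature names are made up; check the real file before relying on any of it.

```python
# Hypothetical sample in an ASSUMED format: "surface<TAB>lemma FEATURE ..."
SAMPLE = """\
; made-up entries for illustration only
squawking\tsquawk V PROG
ducks\tduck N PL
flew\tfly V PAST
"""


def load_lexicon(lines):
    """Map each surface form to a list of (lemma, features) analyses."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blanks and comments
        surface, analysis = line.split("\t", 1)
        lemma, *features = analysis.split()
        lexicon.setdefault(surface, []).append((lemma, features))
    return lexicon


def lemmatize(tokens, lexicon):
    """Take the first analysis for each token; unknown words map to themselves."""
    return [lexicon.get(tok, [(tok, [])])[0] for tok in tokens]


lexicon = load_lexicon(SAMPLE.splitlines())
for lemma, features in lemmatize("the squawking ducks flew away".split(), lexicon):
    print(lemma, features)
```

Taking the first analysis is the laziest possible disambiguation; picking the right one generally needs the part-of-speech tag.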
Tagging
You know who 's had it too easy ? Computational linguists .
=> You/PRP know/VBP who/WP 's/VBZ had/VBD it/PRP too/RB easy/JJ ?/.
Computational/JJ linguists/NNS ./.
- Brill tagger: trainable in NLTK
- "The" Brill tagger: http://research.microsoft.com/users/brill/
  - BADLY-DOCUMENTED C CODE WHOSE MESSAGES ARE IN ALL CAPS
- MXPOST, a modern tagger: http://www.inf.ed.ac.uk/resources/nlp/local_doc/MXPOST.html
Other minor operations
- Sentence splitting
  - Just write some regular expressions
- Tokenization (who's => who 's)
  - Use regular expressions, or a sed script that comes with MXPOST
- Unicode normalization (naïve => naive)
  - Use your Unicode library's NFKD normalization (the decomposed form, so accents become separate combining characters) and throw away the diacritics
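All three operations fit in a few lines of standard-library Python. The regular expressions below are deliberately naive assumptions (they will mis-split on abbreviations such as "Dr."), and the function names are invented for this sketch. Diacritic stripping uses the decomposed form (NFKD) so that accents become separate combining characters that can be filtered out.

```python
import re
import unicodedata


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def tokenize(sentence):
    """Split into words, clitics such as 's, and single punctuation marks."""
    return re.findall(r"'\w+|\w+|[^\w\s]", sentence)


def strip_diacritics(text):
    """Decompose with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


text = "You know who's had it too easy? Computational linguists."
for sentence in split_sentences(text):
    print(tokenize(sentence))
print(strip_diacritics("na\u00efve"))
# → naive
```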
Chunking
A/DT piece/NN of/IN paper/NN is/VBZ for/IN writing/VBG things/NNS down/RB => [NP a piece of paper] is for [VP writing things down]
- Specific to the application
- Sometimes regular expressions over tags are sufficient
- Open Mind uses a custom nondeterministic chunker written in NLTK
- Stanford Named Entity Recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml
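As an illustration of "regular expressions over tags", the sketch below joins the tag sequence into one string, runs ordinary regexes over it, and maps the matches back to words. The two rules are assumptions tuned to the example sentence; this is not the Open Mind chunker, and it is deterministic (first matching rule wins).

```python
import re

# Ordered chunk rules over a string of <TAG> items (assumed, example-specific).
RULES = [
    ("VP", re.compile(r"<VBG>(?:<NNS?>)*(?:<R[BP]>)?")),
    ("NP", re.compile(
        r"(?:<DT>)?(?:<JJ>)*(?:<NNS?>)+(?:<IN>(?:<DT>)?(?:<NNS?>)+)*")),
]


def parse_tagged(text):
    """Turn 'A/DT piece/NN' into [('A', 'DT'), ('piece', 'NN')]."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def chunk(tagged):
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Character offset where each token's tag starts inside tag_string.
    starts, offset = [], 0
    for _, tag in tagged:
        starts.append(offset)
        offset += len(tag) + 2
    offset_to_index = {start: i for i, start in enumerate(starts)}
    offset_to_index[offset] = len(tagged)

    out, i = [], 0
    while i < len(tagged):
        for label, pattern in RULES:
            m = pattern.match(tag_string, starts[i])
            if m and m.end() > m.start():
                j = offset_to_index[m.end()]
                words = " ".join(w.lower() for w, _ in tagged[i:j])
                out.append(f"[{label} {words}]")
                i = j
                break
        else:  # no rule matched at this token: emit it unchunked
            out.append(tagged[i][0].lower())
            i += 1
    return " ".join(out)


sentence = ("A/DT piece/NN of/IN paper/NN is/VBZ for/IN "
            "writing/VBG things/NNS down/RB")
print(chunk(parse_tagged(sentence)))
# → [NP a piece of paper] is for [VP writing things down]
```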
Parsing
Tree structure
A/DT piece/NN of/IN paper/NN is/VBZ for/IN writing/VBG
things/NNS down/RP ./. =>
(S (NP (NP (DT A) (NN piece))
       (PP (IN of) (NP (NN paper))))
   (VP (VBZ is)
       (PP (IN for)
           (S (VP (VBG writing)
                  (NP (NNS things))
                  (PRT (RP down))))))
   (. .))
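Bracketed trees like the one above are easy to load into nested data structures. The following is a minimal sketch of a reader, representing each node as a `(label, children)` tuple and each word as a plain string; the function names are made up for this example.

```python
import re

TREE = """
(S (NP (NP (DT A) (NN piece))
       (PP (IN of) (NP (NN paper))))
   (VP (VBZ is)
       (PP (IN for)
           (S (VP (VBG writing)
                  (NP (NNS things))
                  (PRT (RP down))))))
   (. .))
"""


def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a leaf (word)
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1  # consume ")"
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [leaf for child in children for leaf in leaves(child)]


tree = parse_tree(TREE)
print(tree[0])       # root label
print(leaves(tree))  # the original tokens, in order
```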
Link structure
a piece of paper is for writing things down =>
     +--------Ss-------+            +--------K--------+
+-Ds-+--Mp-+-Jp-+      +-Pp-+--Mgp--+----Op---+       |
|    |     |    |      |    |       |         |       |
a piece.n of paper.n is.v for.p writing.v things.n down.e
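A link structure is just a set of labeled edges between word positions, so it can be stored as a flat list. The sketch below transcribes the links from the diagram above (the index pairs are my reading of it) and checks the property that link grammars require: no two links cross when drawn above the sentence. The helper names are invented for this example.

```python
WORDS = "a piece.n of paper.n is.v for.p writing.v things.n down.e".split()

# (label, left word index, right word index), read off the diagram.
LINKS = [
    ("Ds", 0, 1), ("Mp", 1, 2), ("Jp", 2, 3), ("Ss", 1, 4),
    ("Pp", 4, 5), ("Mgp", 5, 6), ("Op", 6, 7), ("K", 6, 8),
]


def crosses(a, b):
    """True if two links cross when drawn above the sentence."""
    (_, l1, r1), (_, l2, r2) = a, b
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1


def is_planar(links):
    """True if no pair of links crosses."""
    return not any(
        crosses(a, b) for i, a in enumerate(links) for b in links[i + 1:]
    )


def neighbours(index, links):
    """Words directly linked to the word at `index`, with the link label."""
    result = []
    for label, left, right in links:
        if left == index:
            result.append((label, WORDS[right]))
        elif right == index:
            result.append((label, WORDS[left]))
    return result


print(is_planar(LINKS))
print(neighbours(WORDS.index("writing.v"), LINKS))
```

Unlike a tree, there are no phrase nodes here: everything you want to know about "writing" is in the links that touch it.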
Decisions
- Hand-written rules or statistical methods?
- Trees or links?
A few good parsers
- The Bikel parser: robust, statistical tree parser based on Michael Collins' work
- The Link Grammar Parser: rule-based link parser
  - http://www.abisource.org/projects/link-grammar/
  - Lots of free software builds on it
- The Stanford parser: statistical parser that outputs trees or links
- C&C parser: cutting-edge, even makes an attempt at semantics
  - ...which is great if you can figure out what the boxes mean
  - http://svn.ask.it.usyd.edu.au/trac/candc
How many lemons did Dr Jekyll eat? => [image of the parser's output; not preserved]

