trevmex's tumblings

JavaScripter, Rubyist, Functional Programmer, Agile Practitioner.

Natural Language Processing

@phillylambda January Meeting

Andrew Larkin (@ALarkinDesign), Comcast Interactive Media

The basics of NLP

NLP is a subset of artificial intelligence. We are trying to give the appearance of a machine that thinks.

NLP also mixes in cognitive science and linguistics to understand grammars.

NLP also needs statistics and probability to figure out what a piece of text most likely means.

Language does not always mean what we think it means. Human language has a lot of ambiguity, and NLP has to do its best to limit that ambiguity.

NLP is a key component for better human-computer interaction. We tend to forget that WIMP (windows, icons, menus, pointer) is not a “natural” or intuitive way of interacting with the world.

There are different levels of interpreting text when dealing with natural language processing:

  • Phonology - The analysis of sounds, useful in speech recognition.
  • Morphology - The study of how words break down into their base parts to learn words (e.g. preregistration = pre + registra + tion)
  • Lexical - The understanding of an ambiguous word as it is used in a lexical context. Building a class structure for a word.
  • Syntactic - A look at the grammatical structure of a sentence. This allows us to parse sentences and understand grammars.
  • Semantic - A way to understand words by the structure of the sentence around them (like working out what the word ‘It’ refers to in ‘I like programming. It makes me happy.’)
  • Discourse - A way to tag words as they relate to the entire system (or paragraph, for example).
  • Pragmatic - It is important to give your NLP system knowledge of the subject matter of your topic. That helps to narrow and define your system.
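
To make the morphology level concrete, here is a deliberately naive sketch that strips one known prefix and one known suffix from a word. The affix lists and the split_morphemes() helper are made up for illustration; real morphological analyzers are far more careful than this.

```python
# Hypothetical, tiny affix lists; a real analyzer would use a proper lexicon.
PREFIXES = ("pre", "un", "re")
SUFFIXES = ("tion", "ing", "ed")

def split_morphemes(word):
    """Split a word into (prefix, root, suffix) by naive string matching."""
    # Take the first listed prefix the word starts with, if any.
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    # Take the first listed suffix that leaves a non-empty root behind.
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), ""
    )
    root = rest[:len(rest) - len(suffix)]
    return prefix, root, suffix

print(split_morphemes("preregistration"))
print(split_morphemes("walked"))
```

Plain string matching like this will happily mis-split words such as "red", which is part of why the later levels (lexical, syntactic, semantic) are needed.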

The NLP pipeline goes: Phonology -> Morphology -> Syntax -> Semantics -> Reasoning.

NLP Toolkit - NLTK

NLTK is a Python library that you can use to interpret text!

NLTK has a huge collection of text corpora that have been organized and cut up for your use. Everything from presidential addresses to classical books. Amazing!

The Brown corpus is VERY diverse. Check it out!

import nltk
from nltk.corpus import brown # Needs the corpus data installed first: nltk.download('brown')
brown.categories() # List the categories (genres) of pre-processed text in the Brown corpus
brown.words(categories='news') # A list of ALL the words in the news category
brown.sents(categories='news') # A list of all the sentences in the news category
genre_words = [(genre, word)
               for genre in ['news', 'romance']
               for word in brown.words(categories=genre)
]
cfd = nltk.ConditionalFreqDist(genre_words) # Word frequencies, counted separately for the news and romance categories.
cfd.tabulate(samples=['Monday', 'Tuesday', 'Wednesday']) # A table of how often Monday, Tuesday, and Wednesday occur in the news and romance categories.
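
If you want to see what a conditional frequency distribution is doing under the hood, here is a rough standard-library sketch. It is not NLTK's implementation; the toy (genre, word) pairs and the tabulate() helper are made up so it runs without downloading any corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for (genre, word) pairs from a corpus.
genre_words = [
    ("news", "Monday"), ("news", "Monday"), ("news", "Tuesday"),
    ("romance", "Monday"), ("romance", "Wednesday"),
]

# Map each condition (genre) to a Counter of the words seen under it.
cfd = defaultdict(Counter)
for genre, word in genre_words:
    cfd[genre][word] += 1

def tabulate(cfd, samples):
    """Return one row of counts per condition, with conditions sorted by name."""
    return {cond: [cfd[cond][s] for s in samples] for cond in sorted(cfd)}

table = tabulate(cfd, ["Monday", "Tuesday", "Wednesday"])
for cond, counts in table.items():
    print(cond, counts)
```

The "condition" is just the first item of each pair, so the same idea works for any grouping you like (author, year, tag) and not only genre.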

WordNet

WordNet is like a thesaurus. It has a list of synsets (synonym sets).

Synsets are arranged in a tree of specificity (e.g. a more specific term for “motor vehicle” is “motorcar,” and a less specific term is “artifact”).

A more specific term is called a hyponym (e.g. motorcar).

A less specific term is called a hypernym (e.g. artifact).

Using the lowest_common_hypernyms() method, you can compare synsets to see how closely related two words are to each other: the more specific their shared hypernym, the closer they are.
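
Here is a small sketch of the idea behind finding the lowest common hypernym, using a made-up toy taxonomy instead of the real WordNet data. The HYPERNYM table and both helper functions are hypothetical. Real WordNet synsets can have more than one hypernym, which this sketch ignores.

```python
# Hypothetical toy taxonomy: each term maps to its single hypernym (parent).
HYPERNYM = {
    "motorcar": "motor vehicle",
    "motorcycle": "motor vehicle",
    "motor vehicle": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "artifact",
    "artifact": None,  # root of this toy tree
}

def hypernym_path(term):
    """Return the chain from term up to the root, most specific first."""
    path = []
    while term is not None:
        path.append(term)
        term = HYPERNYM[term]
    return path

def lowest_common_hypernym(a, b):
    """Return the most specific term that is on both hypernym paths."""
    on_path_of_a = set(hypernym_path(a))
    for term in hypernym_path(b):
        if term in on_path_of_a:
            return term
    return None

print(lowest_common_hypernym("motorcar", "motorcycle"))
print(lowest_common_hypernym("motorcar", "bicycle"))
```

The shared ancestor for motorcar and motorcycle sits lower in the tree than the one for motorcar and bicycle, which is the sense in which the first pair is more closely related.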

Working with raw text

  1. HTML - NLTK can parse HTML and read the natural language in it.
  2. ASCII - The HTML will be converted into ASCII text.
  3. Text - The ASCII text will then be tokenized into words, sentences, etc.
  4. Vocab - The text will then be built up to give you a vocabulary to use and understand.
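
The four steps above can be sketched with just the Python standard library. This is not how NLTK does it; the TextExtractor class, the helper names, and the very simple regex tokenizer are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Steps 1-2: strip the markup and normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

def tokenize(text):
    """Step 3: a crude word tokenizer (letters and apostrophes only)."""
    return re.findall(r"[A-Za-z']+", text)

def vocabulary(tokens):
    """Step 4: the sorted set of distinct lowercased words."""
    return sorted({t.lower() for t in tokens})

html = "<html><body><h1>NLP</h1><p>I like programming. It makes me happy.</p></body></html>"
text = html_to_text(html)
tokens = tokenize(text)
vocab = vocabulary(tokens)
print(text)
print(tokens)
print(vocab)
```

A real tokenizer also has to deal with punctuation, numbers, and sentence boundaries, which is where a library earns its keep.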

Identifying parts of speech

import nltk
from nltk.corpus import brown
tagged_sents = brown.tagged_sents(categories='news') # A list of tagged sentences.
size = int(len(tagged_sents) * 0.9)
train_sents = tagged_sents[:size] # Use the first 90% of the sentences as a training set.
test_sents = tagged_sents[size:] # Make the rest a test set.
unigram_tagger = nltk.UnigramTagger(train_sents) # A unigram tagger looks at a single word and assigns the tag it saw most often for that word; it does not look at the words around it.
unigram_tagger.evaluate(test_sents) # This tells us how accurate our tagger is (newer NLTK releases call this accuracy()).
t0 = nltk.DefaultTagger('NN') # Default to nouns.
t1 = nltk.UnigramTagger(train_sents, backoff=t0) # Tell the tagger that if it doesn't recognize a word, guess that it is a noun.
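
To show what a unigram tagger with a backoff is doing, here is a stripped-down sketch in plain Python. It is not NLTK's code; the classes, the tag_word() method, the accuracy() helper, and the tiny hand-tagged sentences are all invented for this example.

```python
from collections import Counter, defaultdict

class DefaultTagger:
    """Assign the same tag to every word."""

    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag

class UnigramTagger:
    """Tag each word with the tag it carried most often in training."""

    def __init__(self, train_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.best:
            return self.best[word]
        # Unknown word: defer to the backoff tagger if there is one.
        return self.backoff.tag_word(word) if self.backoff else None

def accuracy(tagger, test_sents):
    """Fraction of test words whose predicted tag matches the gold tag."""
    pairs = [(w, t) for sent in test_sents for w, t in sent]
    return sum(tagger.tag_word(w) == t for w, t in pairs) / len(pairs)

# Hypothetical hand-tagged data using Brown-style tags.
train = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
test = [[("the", "AT"), ("bird", "NN"), ("runs", "VBZ")]]

plain = UnigramTagger(train)
with_backoff = UnigramTagger(train, backoff=DefaultTagger("NN"))
print(accuracy(plain, test))
print(accuracy(with_backoff, test))
```

The word "bird" never appears in the training data, so the plain tagger has no answer for it, while the backoff version guesses noun and happens to be right.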

Chunking

There are noun phrases and verb phrases that you can chunk text into.

Grammars and parsers take sentences and turn them into noun parts and verb parts.
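
As a rough illustration of noun phrase chunking, here is a hand-rolled sketch that groups tagged words matching the pattern "optional determiner, any adjectives, one or more nouns". The noun_phrase_chunks() function is hypothetical, and the DT/JJ/NN tags are an assumption (Penn Treebank style rather than the Brown tags used above).

```python
def noun_phrase_chunks(tagged):
    """Return the noun phrases in a tagged sentence as lists of words.

    A chunk is: an optional determiner (DT), then any adjectives (JJ),
    then one or more nouns (any tag starting with NN).
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":
            j += 1
        while j < n and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            # At least one noun was found, so tagged[i:k] is a noun phrase.
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            # No noun here; move on to the next word.
            i += 1
    return chunks

sentence = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
    ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]
print(noun_phrase_chunks(sentence))
```

Verb phrases could be chunked the same way with a different tag pattern.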

NLP is a way to interpret language logically!

Check out the Stanford Online Course on Natural Language Processing. You can take it for free, and it starts January 23rd!

Thank you to Andrew for the great talk!
