trevmex's tumblings

JavaScripter, Rubyist, Functional Programmer, Agile Practitioner.
Rails’ controllers are like waiters in a restaurant. A customer orders a steak dinner from a waiter. The waiter takes the request and tells the kitchen that he needs a steak dinner. When the steak dinner is ready, the waiter delivers it to the customer for her enjoyment. Craig Demyanovich, from “The RSpec Book” by David Chelimsky, pg. 341
People are never going to stop wanting to be with their families, and parents are never going to stop wanting to feed their children. Rob Keithan on immigration, “Who We Are and Who We Can Be,” Unitarian Society of Germantown, January 22, 2012

@mdb, @ALarkinDesign, @hexinteractive, and me reinventing ping pong after hours at #CIM

Innovating Ping Pong (by JohnRiv)


reblogged from aanniimmee
aanniimmee:

- From “Metal Skin Panic Madox-01,” directed by Shinji Aramaki (1987)


Natural Language Processing

@phillylambda January Meeting

Andrew Larkin (@ALarkinDesign), Comcast Interactive Media

The basics of NLP

NLP is a subset of artificial intelligence. We are trying to give an appearance of a machine that thinks.

NLP also mixes in cognitive science and linguistics to understand grammars.

NLP also needs statistics and probability to figure out the most likely meaning of a piece of text.

Language does not always mean what we think it means. Human language has a lot of ambiguity, and NLP has to do its best to limit that ambiguity.

NLP is a key component for better human-computer interaction. We tend to forget that WIMP is not a “natural” or intuitive way of interacting with the world.

There are different levels of interpreting text when dealing with natural language processing:

  • Phonology - The analysis of sounds, useful in speech recognition.
  • Morphology - The study of how words break down into their base parts to learn words (e.g. preregistration = pre + registra + tion)
  • Lexical - The understanding of an ambiguous word as it is used in a lexical context. Building a class structure for a word.
  • Syntactic - A look at the grammatical structure of a sentence. This allows us to parse sentences and understand grammars.
  • Semantic - A way to understand words by the structure of the sentences around them (like the word ‘It’ in ‘I like programming. It makes me happy.’)
  • Discourse - A way to tag words as they relate to the entire system (or paragraph, for example).
  • Pragmatic - It is important to give your NLP system knowledge of the subject matter it deals with. That helps to narrow and define your system.
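
To make the morphology level a bit more concrete, here is a minimal sketch in plain Python (not NLTK) that peels one prefix and one suffix off a word. The affix lists and the split_morphemes name are made up for illustration; a real morphological analyzer is much more involved.

```python
# Hypothetical affix lists; a real analyzer would use a far larger lexicon.
PREFIXES = ("pre", "un", "re")
SUFFIXES = ("tion", "ing", "ed")

def split_morphemes(word):
    """Peel at most one known prefix and one known suffix off a word."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    # Only strip a suffix if something is left over to act as the stem.
    suffix = next((s for s in SUFFIXES
                   if rest.endswith(s) and len(rest) > len(s)), "")
    stem = rest[:len(rest) - len(suffix)]
    return [part for part in (prefix, stem, suffix) if part]

parts = split_morphemes("preregistration")
print(parts)
```

The example word from the list above comes apart into its prefix, stem, and suffix.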

The NLP pipeline goes: Phonology -> Morphology -> Syntax -> Semantics -> Reasoning.

NLP Toolkit - NLTK

NLTK is a Python library that you can use to interpret text!

NLTK has a huge collection of text corpora that have been organized and cut up for your use. Everything from presidential addresses to classical books. Amazing!

The Brown corpus is VERY diverse. Check it out!

import nltk
from nltk.corpus import brown
brown.categories() # List out categories of pre-processed text in the brown corpus
brown.words(categories='news') # A list-like view of ALL the words in the news category
brown.sents(categories='news') # A list-like view of all the sentences in the news category
genre_words = [(genre, word)
               for genre in ['news', 'romance']
               for word in brown.words(categories=genre)]
cfd = nltk.ConditionalFreqDist(genre_words) # Word frequencies, counted separately for the news and romance categories.
cfd.tabulate(samples=['Monday', 'Tuesday', 'Wednesday']) # A table of how often Monday, Tuesday, and Wednesday occur in the news and romance categories.
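
If you want to see what a conditional frequency distribution boils down to, here is a rough stand-in written with only the standard library. The (genre, word) pairs are invented toy data, not the Brown corpus, and the tabulate helper is my own name, not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Toy (genre, word) pairs standing in for the Brown corpus; invented for illustration.
genre_words = [
    ("news", "Monday"), ("news", "Monday"), ("news", "Tuesday"),
    ("romance", "Monday"), ("romance", "Wednesday"),
]

# One Counter per condition (genre), counting each word seen under it.
cfd = defaultdict(Counter)
for genre, word in genre_words:
    cfd[genre][word] += 1

def tabulate(cfd, samples):
    """Return a dict mapping each condition to its counts for the given samples."""
    return {cond: [cfd[cond][s] for s in samples] for cond in sorted(cfd)}

table = tabulate(cfd, ["Monday", "Tuesday", "Wednesday"])
print(table)
```

Each condition (here, a genre) gets its own counter, which is why you can compare how often the same word shows up in different categories.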

WordNet

WordNet is like a thesaurus. It has a list of SynSets (Synonym Sets).

A SynSet provides a tree of specificity for words (e.g. a more specific term for “motor vehicle” is “motorcar,” and a less specific term is “artifact”).

A more specific term is called a hyponym (e.g. motorcar).

A less specific term is called a hypernym (e.g. artifact).

Using the lowest_common_hypernyms() method, you can compare SynSets to see how closely related two words are to each other.
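
The idea of comparing two words through a shared ancestor can be sketched without WordNet at all. The tiny tree below is hand-made (it is not WordNet data), and the function names are my own. The sketch walks each word up to the root and returns the first term the two paths share.

```python
# Hypothetical child -> parent links; a tiny hand-made tree, not WordNet data.
PARENT = {
    "motorcar": "motor vehicle",
    "truck": "motor vehicle",
    "motor vehicle": "artifact",
    "boat": "artifact",
}

def path_to_root(word):
    """List the word followed by each increasingly general term above it."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lowest_common_ancestor(a, b):
    """Return the most specific term that sits above (or equals) both words."""
    above_b = set(path_to_root(b))
    for term in path_to_root(a):
        if term in above_b:
            return term
    return None

print(lowest_common_ancestor("motorcar", "truck"))
print(lowest_common_ancestor("motorcar", "boat"))
```

The deeper in the tree the shared term sits, the more closely related the two words are.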

Working with raw text

  1. HTML - NLTK can parse HTML and read the NL in it.
  2. ASCII - The HTML will be converted into ASCII text.
  3. Text - The ASCII text will then be tokenized into words, sentences, etc.
  4. Vocab - The text will then be built up to give you a vocabulary to use and understand.
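
The four steps above can be sketched with just the standard library. This is not how NLTK does it; the TextExtractor class and the html_to_vocab function are names I made up, and the tokenizer is a deliberately naive regular expression.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_vocab(html):
    parser = TextExtractor()
    parser.feed(html)                               # step 1: read the HTML
    parser.close()
    text = " ".join(parser.parts)                   # step 2: plain text
    tokens = re.findall(r"[a-z']+", text.lower())   # step 3: word tokens
    vocab = sorted(set(tokens))                     # step 4: vocabulary
    return tokens, vocab

tokens, vocab = html_to_vocab("<p>I like <b>programming</b>. It makes me happy.</p>")
print(tokens)
print(vocab)
```

A real pipeline would also want to skip script and style content and split sentences, which this sketch does not attempt.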

Identifying parts of speech

tagged_sents = brown.tagged_sents(categories='news') # A list of tagged sentences.
size = int(len(tagged_sents) * 0.9)
train_sents = tagged_sents[:size] # Make a sample set of training sentences.
test_sents = tagged_sents[size:] # Make the rest a test set.
unigram_tagger = nltk.UnigramTagger(train_sents) # A unigram tagger looks at a single word and assigns it a tag; it does not look at the words around it.
unigram_tagger.evaluate(test_sents) # This tells us how accurate our tagger is (recent NLTK versions name this method accuracy()).
t0 = nltk.DefaultTagger('NN') # Default to nouns.
t1 = nltk.UnigramTagger(train_sents, backoff=t0) # Tell the tagger that if it doesn't understand something, guess that it is a noun.
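
To see what a unigram tagger with a noun backoff is doing, here is a toy version in plain Python. It is only a sketch of the idea, not NLTK's implementation; the training and test sentences are invented, and the function names are mine.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="NN"):
    """Tag each word from the model, backing off to the default tag when unseen."""
    return [(w, model.get(w, default)) for w in words]

def accuracy(model, tagged_sents, default="NN"):
    """Fraction of test tokens whose predicted tag matches the gold tag."""
    pairs = [(w, t) for sent in tagged_sents for w, t in sent]
    hits = sum(model.get(w, default) == t for w, t in pairs)
    return hits / len(pairs)

# Invented training and test sentences, using Brown-style tags.
train = [[("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
         [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")]]
test = [[("the", "AT"), ("bird", "NN"), ("sings", "VBZ")]]

model = train_unigram(train)
tagged = tag(["the", "bird", "runs"], model)
score = accuracy(model, test)
print(tagged, score)
```

The unseen word "bird" falls through to the default noun tag, which happens to be right, while the unseen verb "sings" is also guessed as a noun, which is wrong. That is the trade-off of a noun backoff.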

Chunking

There are noun phrases and verb phrases that you can chunk text into.

Grammars and parsers take sentences and turn them into noun parts and verb parts.
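
Here is a small sketch of noun-phrase chunking over already-tagged words, in plain Python. It assumes a very simple pattern (an optional determiner, any number of adjectives, then one or more nouns) and Penn Treebank-style tags (DT, JJ, NN), which differ from the Brown tags. The function name is my own.

```python
def chunk_noun_phrases(tagged):
    """Group runs matching determiner? adjective* noun+ into noun-phrase chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                         # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # any adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NN":  # one or more nouns
            k += 1
        if k > j:   # at least one noun: emit the whole span as a chunk
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:       # no noun followed, so no chunk starts at this position
            i += 1
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
noun_phrases = chunk_noun_phrases(sentence)
print(noun_phrases)
```

Everything that does not fit the pattern (the verb and the preposition here) is simply left outside the chunks.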

NLP is a way to interpret language logically!

Check out the Stanford Online Course on Natural Language Processing. You can take it for free, and it starts January 23rd!

Thank you to Andrew for the great talk!

#phillyrb January 2012 Meeting

Awesome talk by Dustin about continuations in Ruby. Check out the mailing list for his slides, and don’t miss his talk at RedSnake Philly next month on Feb. 21st!

MailCatcher is a private SMTP server/client you can use to test your email without spamming people! Check it out if you are making custom email applications.

Ruby on Big Data

What is “Big Data?” It isn’t always about size. Sometimes it is about CPU-bound work that needs to be processed, like Natural Language Processing.

NoSQL storage is all about BASE:

  • Basically Available
  • Soft-state
  • Eventual consistency

Cassandra

Cassandra took ideas from Dynamo (Amazon’s distributed key-value store) and Google’s BigTable, and mixed them together. Facebook then released it as open source.

Cassandra’s Data Model

  • Keyspaces
    • Column Families
      • Rows (Sorted by KEY!)
        • Columns {Key: Value}

This is a sparsely populated data model. That means that you are able to add keys at will.
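
One way to picture this model is as nested dictionaries. The sketch below is only a mental model in plain Python, not a Cassandra client; the "users" column family, the row keys, and the insert helper are all made up for illustration.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {"users": {}}  # one hypothetical column family named "users"

def insert(column_family, row_key, columns):
    """Add or overwrite columns on a row; rows need not share the same columns."""
    keyspace[column_family].setdefault(row_key, {}).update(columns)

insert("users", "bob", {"email": "bob@example.com", "twitter": "@bob"})
insert("users", "alice", {"email": "alice@example.com"})
insert("users", "alice", {"city": "Philadelphia"})  # new column, no schema change

# Sorted here to mirror the "sorted by key" note in the outline above.
row_keys = sorted(keyspace["users"])
print(row_keys)
print(keyspace["users"]["alice"])
```

Notice that alice and bob end up with different sets of columns, which is what "sparsely populated" means in practice.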

Cassandra’s hash ring uses consistent hashing, the same model as Dynamo. This allows you to distribute keys to various nodes in the hash ring, to solve for data replication and fast lookups.
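
Here is a toy hash ring in plain Python to show how keys can be spread over nodes. It is a sketch under simplifying assumptions (one position per node, SHA-256 as the hash, made-up node names) and not how Cassandra itself is implemented.

```python
import bisect
import hashlib

class HashRing:
    """Toy hash ring: each key is owned by the next node clockwise from its hash."""
    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def replicas_for(self, key, count=1):
        """Return up to `count` nodes, walking clockwise from the key's position."""
        start = bisect.bisect_right(self.points, self._hash(key))
        count = min(count, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

ring = HashRing(["node-a", "node-b", "node-c"])  # hypothetical node names
owners = ring.replicas_for("user:42", count=2)
print(owners)
```

Because a key's position depends only on its hash, any client can work out which nodes hold it without asking a central coordinator, and asking for more than one node gives you the replicas.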

You can have multiple consistency levels: one, quorum, and all.

  • one: This returns as soon as one replica has written your data; the other replicas catch up later.
  • quorum: This waits until n/2+1 replicas have written your data, where n is the number of replicas (the replication factor).
  • all: This waits until ALL replicas have written the data. This is the slowest, but the most consistent.
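
The arithmetic behind the three levels is small enough to write down. This sketch takes n to be the number of copies of the data and uses integer division for n/2+1; the acks_required name is mine, not part of any Cassandra client.

```python
def acks_required(level, replicas):
    """Number of replica acknowledgements to wait for at a consistency level."""
    if level == "one":
        return 1
    if level == "quorum":
        return replicas // 2 + 1
    if level == "all":
        return replicas
    raise ValueError(f"unknown consistency level: {level}")

required = {level: acks_required(level, 3) for level in ("one", "quorum", "all")}
print(required)
```

With three copies, a quorum is two of them, so a quorum write and a quorum read always overlap on at least one copy.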

You can store anything you want in your column values. That is nice, because you can define your own schemas there without major constraints.

Hadoop

Hadoop is the Apache implementation of Google’s MapReduce (HBase is the part modeled on BigTable). To get info out of it, you have to write map and reduce functions.
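
If you have never written a map and a reduce function, here is the classic word count as a single-process sketch in plain Python. It only shows the shape of the two functions; it does not use Hadoop, and the function names are my own.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    """Reduce step: combine all the values emitted for one key."""
    return word, sum(counts)

def map_reduce(lines):
    grouped = defaultdict(list)  # shuffle: group emitted values by key
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)
    return dict(reduce_counts(w, c) for w, c in grouped.items())

word_counts = map_reduce(["big data is big", "ruby is fun"])
print(word_counts)
```

In a real cluster the map calls run on many machines at once and the framework does the grouping for you, but the two functions you write look much like these.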

Solandra is a library that combines the Solr search library with Cassandra, so that your indexes are in Cassandra.

Why use Ruby for Big Data?

Because we LOVE Ruby!

Ruby is simple enough that you can give it to clients to write map/reduce jobs. This is NON-TRIVIAL in Java. A map/reduce job in Java is about 500 lines of code; in Ruby, it is 22 lines.

Virgil

Virgil is a REST client for Cassandra! Virgil lets you create Cassandra models with HTTP PUT calls.

Virgil also has a GUI to allow you to look into your Cassandra DB with about 200 lines of ExtJS code.

With Virgil you get both CRUD functions and Map/Reduce in Cassandra for the first time.

“Use real-time systems for batch processing.”

Typhoeus is a concurrent HTTP client that runs really fast. This is a great gem to use for massive HTTP calls, like adding info to Cassandra through Virgil.

Bridging the gap between Java and Ruby

Redbridge is the JRuby implementation of JSR 223, which is what bridges Ruby to Java. You can use that to hook into Java through JRuby.

Super Columns

WTF is a super column?

It is an old way to add meta-data to Cassandra, and it is now deprecated. Don’t use it!

Storm

Storm is a way to do real-time processing with streams of data. Twitter uses this to push out all their data.

Thank you to Brian and the other speakers for the great info!


