GoRuCo talk: Classifying documents using Ruby

Presented by Paul Dix

UPDATE: Paul posted a zip file of the code he referenced in his talk.

What are classifiers?

  • Most common application is spam vs. ham (usually naive bayes)
  • Supervised machine learning

Useful For…

  • Detecting language (French, English, etc.)
  • Categorizing news and blogs
  • Spam detection (email, trackback, wiki, comment, etc.)
  • Sentiment detection (for example, determine positive vs. negative movie reviews)

Things You Do

  • Get training data
    • Set of documents that are labeled
    • The more the better
  • Document preprocessing—convert it to a form that machine learning programs can deal with
  • Feature selection (optional)—improves accuracy
  • Train the classifier
  • Test and update

Text classification represents documents as a vector of features

Basic representation

  • Count of terms
  • Steps:
    • Clean text of punctuation
    • Lowercase
    • Split on spaces
    • Stem all words
    • Return an array of counts

Document by term (or feature) matrix

  • Count of term frequency by document

Additional features

  • You can create complicated preprocessing
  • Examples:
    • Include link structure
    • Include metadata
    • Include social data

High dimensionality

  • Noisy features (i.e. count of the word “the”)
  • Non-discrimating features (i.e. a word only once)
  • Words that are too similar

Stemming

  • Most popular: Porter Stemmer – Dr. Martin Porter ‘79
  • "cats".stem # => "cat"
  • "international".stem # => "intern"
  • "finalize".stem # => "final"
  • "ruby".stem # => "rubi"

Advantages of features selection

  • Limits size of vocabulary and matrix
  • Increase classification accuracy
  • Avoids noisy features and over-fitting

Methods for feature selection

  • Mutual information
  • Chi-squared (statistical independence of two variables)
  • Frequency based

The process of feature selection

  • Add each document from training set
  • Calculate features
  • Select best features

Feature extraction

  • Selected features from each document

Linear classifiers

  • Finds hyperplane
  • Simple decision boundary
  • Binary classification (i.e. spam vs. not spam)

Naive Bayes

  • Classifier gem
    • Doesn’t give you much control of how everything works
  • Generative model using probabilities
  • Bayes formula

Why is it called “naive”?

  • Positional independence (words location in document does not matter)
  • Conditional independence (one word does not impact other words)
  • Also known as “Idiot’s Bayes”

Naive Bayes advantages

  • Simple
  • Very fast (runs in linear time)
  • Very effective for text classification

Other considerations

  • Binomial (occurs/does not occur) vs. multinomal (how many occurrences?) models
  • Probabilities of classes based on training set
    • Overall probability between classes (is it 50/50, or 90%/10%?)

Support vector machines (SVM)

  • State of the art
  • Hard to use
  • Ruby has bindings for SVM

Kernels

  • a.k.a. “the kernel trick”
  • Method that transforms the kernel space
  • Non-linear classification
  • Four main ones: linear, polynomial, radial basis function, sigmoid
  • Use existing code… you don’t write your own

Perceptron

  • Neural network
  • Linear classifier
  • Converges through iterations
  • Not guaranteed to converge

k nearest neighbors

  • Non-linear
  • Arbitrarily complex decision boundary
  • No traditional training
  • Can be slow for large training sets

Others

  • Decision trees
  • Rocchio
  • Compression
    • Squish by Bob Aman
    • “Mind meltingly slow”
  • Latent semantic analyxix
  • Weka—Java-based processing library
    • Could be used with JRuby

Multiple classes

  • “Any of” classification – Not mutually exclusive
    • Train classifier for each classification (for example, “about Ruby” and “not about Ruby”, “about CSS” and “not about CSS”)
  • “One of” classification
    • i.e. language identification
    • Becomes hard as number of classifications increases

Bias-variance tradeoff

  • Linear classifiers are high bias
  • Non-linear classifiers are high variance

Testing classification

  • Each classification task is different
    • Hard to predict
    • Need to test and tune
  • Constant testing

Basic methods

  • Training and test sets
  • Accuracy – # correct/# total
  • Cross validation
    • 10 fold is common

Closing observations

  • Some tasks are easier than others
    • Can expect >90% on easy tasks
  • Best for triage
  • Naive Bayes is simplest to work with

Q & A

  • Q: How much training data?
  • A: The more the better. 50 is not going to give you good results.
  • Q: What about using temporal proximity as a feature?
  • A: You could do that by having features like “close in time to X”, “close in time to Y”, etc.

0 Responses to “GoRuCo talk: Classifying documents using Ruby”