April 21, 2007
GoRuCo talk: Classifying documents using Ruby
Presented by Paul Dix
UPDATE: Paul posted a zip file of the code he referenced in his talk.
What are classifiers?
- Most common application is spam vs. ham (usually naive bayes)
- Supervised machine learning
Useful For…
- Detecting language (French, English, etc.)
- Categorizing news and blogs
- Spam detection (email, trackback, wiki, comment, etc.)
- Sentiment detection (for example, determine positive vs. negative movie reviews)
Things You Do
- Get training data
- Set of documents that are labeled
- The more the better
- Document preprocessing—convert it to a form that machine learning programs can deal with
- Feature selection (optional)—improves accuracy
- Train the classifier
- Test and update
Text classification represents documents as a vector of features
Basic representation
- Count of terms
- Steps:
- Clean text of punctuation
- Lowercase
- Split on spaces
- Stem all words
- Return an array of counts
Document by term (or feature) matrix
- Count of term frequency by document
Additional features
- You can create complicated preprocessing
- Examples:
- Include link structure
- Include metadata
- Include social data
High dimensionality
- Noisy features (i.e. count of the word “the”)
- Non-discrimating features (i.e. a word only once)
- Words that are too similar
Stemming
- Most popular: Porter Stemmer – Dr. Martin Porter ‘79
"cats".stem # => "cat""international".stem # => "intern""finalize".stem # => "final""ruby".stem # => "rubi"
Advantages of features selection
- Limits size of vocabulary and matrix
- Increase classification accuracy
- Avoids noisy features and over-fitting
Methods for feature selection
- Mutual information
- Chi-squared (statistical independence of two variables)
- Frequency based
The process of feature selection
- Add each document from training set
- Calculate features
- Select best features
Feature extraction
- Selected features from each document
Linear classifiers
- Finds hyperplane
- Simple decision boundary
- Binary classification (i.e. spam vs. not spam)
Naive Bayes
- Classifier gem
- Doesn’t give you much control of how everything works
- Generative model using probabilities
- Bayes formula
Why is it called “naive”?
- Positional independence (words location in document does not matter)
- Conditional independence (one word does not impact other words)
- Also known as “Idiot’s Bayes”
Naive Bayes advantages
- Simple
- Very fast (runs in linear time)
- Very effective for text classification
Other considerations
- Binomial (occurs/does not occur) vs. multinomal (how many occurrences?) models
- Probabilities of classes based on training set
- Overall probability between classes (is it 50/50, or 90%/10%?)
Support vector machines (SVM)
- State of the art
- Hard to use
- Ruby has bindings for SVM
Kernels
- a.k.a. “the kernel trick”
- Method that transforms the kernel space
- Non-linear classification
- Four main ones: linear, polynomial, radial basis function, sigmoid
- Use existing code… you don’t write your own
Perceptron
- Neural network
- Linear classifier
- Converges through iterations
- Not guaranteed to converge
k nearest neighbors
- Non-linear
- Arbitrarily complex decision boundary
- No traditional training
- Can be slow for large training sets
Others
- Decision trees
- Rocchio
- Compression
- Squish by Bob Aman
- “Mind meltingly slow”
- Latent semantic analyxix
- Weka—Java-based processing library
- Could be used with JRuby
Multiple classes
- “Any of” classification – Not mutually exclusive
- Train classifier for each classification (for example, “about Ruby” and “not about Ruby”, “about CSS” and “not about CSS”)
- “One of” classification
- i.e. language identification
- Becomes hard as number of classifications increases
Bias-variance tradeoff
- Linear classifiers are high bias
- Non-linear classifiers are high variance
Testing classification
- Each classification task is different
- Hard to predict
- Need to test and tune
- Constant testing
Basic methods
- Training and test sets
- Accuracy – # correct/# total
- Cross validation
- 10 fold is common
Closing observations
- Some tasks are easier than others
- Can expect >90% on easy tasks
- Best for triage
- Naive Bayes is simplest to work with
Q & A
- Q: How much training data?
- A: The more the better. 50 is not going to give you good results.
- Q: What about using temporal proximity as a feature?
- A: You could do that by having features like “close in time to X”, “close in time to Y”, etc.