Classifier Validation
Classifier Reborn provides various methods to evaluate, validate, and generate statistics for classifiers. The repository contains some sample validation Rake tasks that can be run in the development environment using the following commands.
$ redis-server --daemonize yes
$ rake validate
In this document we will cover the following topics.
- Running k-fold cross-validation
- Writing a custom validation with reporting
- Evaluating a well-populated existing classifier model
To illustrate the usage, let's walk through some examples.
For this walkthrough we will use the SMSSpamCollection.tsv data file that is included in the project repository.
It is a TSV file in which the first column is the class (ham or spam) and the second column is the corresponding SMS text.
Entries of this file look like the following.
ham Yeah that's the impression I got
ham I was slept that time.you there?
spam Win a £1000 cash prize or a prize worth £5000
ham Hope you are not scared!
...
However, the validator does not read data from files, so we need to transform the data into the required format. Also, we will only select the first 5,000 records from the file for this illustration.
# encoding: utf-8
require 'classifier-reborn'
include ClassifierReborn::ClassifierValidator
tsv_file_path = "test/data/corpus/SMSSpamCollection.tsv"
data = File.read(tsv_file_path).force_encoding("utf-8").split("\n")
sample_data = data.take(5000).collect{|line| line.strip.split("\t")}
Loading classifier-reborn is not strictly needed just yet, but we will need it as we extend this script incrementally.
Additionally, we have included the ClassifierReborn::ClassifierValidator module to make its methods available locally without repeating the namespace.
The sample_data is now in the required format. Here is what it looks like.
pp sample_data.sample(4)
#=> [["ham", "Yeah that's the impression I got"],
# ["ham", "I was slept that time.you there?"],
# ["spam", "Win a £1000 cash prize or a prize worth £5000"],
# ["ham", "Hope you are not scared!"]]
K-fold Cross-validation
Let’s begin with standard k-fold cross-validation.
We pass the name of the classifier to validate (Bayes in this example), the sample data (the sample_data we created in the last step), and the number of folds (5 in this case) to the cross_validate method.
The default value of k (the number of folds) is 10 if not specified.
Classifier initialization options, if any, can be supplied as a hash in the last argument.
cross_validate("Bayes", sample_data, 5)
Alternatively, a classifier instance can be created with custom arguments and supplied in place of the name of the classifier as illustrated below.
classifier = ClassifierReborn::Bayes.new("Ham", "Spam", stopwords: "/path/to/custom/stopwords/file")
cross_validate(classifier, sample_data, 5)
Once the validation runs are completed, the following report will be generated.
--------------- Run Report ----------------
Run Total Correct Incorrect Accuracy
-------------------------------------------
1 1000 972 28 0.97200
2 1000 973 27 0.97300
3 1000 981 19 0.98100
4 1000 981 19 0.98100
5 1000 967 33 0.96700
-------------------------------------------
All 5000 4874 126 0.97480
----------------------- Confusion Matrix -----------------------
Predicted -> Ham Spam Total Recall
----------------------------------------------------------------
Ham 4225 102 4327 0.97643
Spam 24 649 673 0.96434
----------------------------------------------------------------
Total 4249 751 5000
Precision 0.99435 0.86418 Accuracy -> 0.97480
# Positive class: Ham
Total population : 5000
Condition positive : 4327
Condition negative : 673
True positive : 4225
True negative : 649
False positive : 24
False negative : 102
Prevalence : 0.8654
Specificity : 0.9643387815750372
Recall : 0.9764270857406979
Precision : 0.9943516121440339
Accuracy : 0.9748
F1 score : 0.9853078358208955
# Positive class: Spam
Total population : 5000
Condition positive : 673
Condition negative : 4327
True positive : 649
True negative : 4225
False positive : 102
False negative : 24
Prevalence : 0.1346
Specificity : 0.9764270857406979
Recall : 0.9643387815750372
Precision : 0.8641810918774967
Accuracy : 0.9748
F1 score : 0.9115168539325843
The first table in the above report is a summary of each individual run (one per fold), followed by an overall accumulated summary line at the end.
The second table is a standard multi-class confusion matrix. Along with the cross-matching counts of actual and predicted classes, it also shows a per-class recall column and a per-class precision row at the edges.
At the end, various derived statistical measures are listed for each class taken as the positive class one at a time (in a one-vs.-rest manner).
Custom Validation
While k-fold cross-validation is a robust and commonly used validation method, there may be cases where one wants to implement custom logic for splitting the sample data and for performing one or more runs whose results are then combined.
Classifier Reborn provides a validate method that accepts a classifier (an instance or a name), training data, testing data, and an optional hash used to instantiate the classifier when a name is supplied instead of an instance.
This method returns an associative confusion matrix hash that can then be supplied to the stats calculation or report generation methods.
To illustrate a simple custom validation, let's split the sample_data into test and training sets in a one-to-four ratio.
test_data, training_data = sample_data.partition.with_index{|_, i| i % 5 == 0}
Now, using these data sets, get the confusion matrix with the validate method.
conf_mat = validate("Bayes", training_data, test_data)
# Alternatively, an instance of a custom classifier can be created and supplied instead of the name
The returned confusion matrix looks like this.
pp conf_mat
#=> {"Ham"=>{"Ham"=>828, "Spam"=>27},
# "Spam"=>{"Ham"=>7, "Spam"=>138}}
The primary-level keys of this nested hash represent the actual classes, while the secondary-level keys are the predicted classes.
There can be more than two classes, but this hash will remain only two levels deep, as the number of classes does not affect the organization of this data structure.
This means conf_mat["Ham"]["Spam"] tells us that there were 27 records that were actually Ham but were predicted as Spam.
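A couple of lookups against the hash above make this orientation concrete (the values come from the output shown earlier):
conf_mat["Ham"]["Spam"]  #=> 27   actual Ham, predicted Spam
conf_mat["Spam"]["Ham"]  #=> 7    actual Spam, predicted Ham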
We can now generate a report from this data structure.
generate_report(conf_mat)
This will yield the following report.
--------------- Run Report ----------------
Run Total Correct Incorrect Accuracy
-------------------------------------------
All 1000 966 34 0.96600
----------------------- Confusion Matrix -----------------------
Predicted -> Ham Spam Total Recall
----------------------------------------------------------------
Ham 828 27 855 0.96842
Spam 7 138 145 0.95172
----------------------------------------------------------------
Total 835 165 1000
Precision 0.99162 0.83636 Accuracy -> 0.96600
# Positive class: Ham
Total population : 1000
Condition positive : 855
Condition negative : 145
True positive : 828
True negative : 138
False positive : 7
False negative : 27
Prevalence : 0.855
Specificity : 0.9517241379310345
Recall : 0.968421052631579
Precision : 0.9916167664670659
Accuracy : 0.966
F1 score : 0.9798816568047337
# Positive class: Spam
Total population : 1000
Condition positive : 145
Condition negative : 855
True positive : 138
True negative : 828
False positive : 27
False negative : 7
Prevalence : 0.145
Specificity : 0.968421052631579
Recall : 0.9517241379310345
Precision : 0.8363636363636363
Accuracy : 0.966
F1 score : 0.8903225806451613
This report is similar to the one produced by k-fold cross-validation, except that it does not have multiple run rows in the first table.
However, the generate_report method can take more than one conf_mat hash, either in an array or as separate arguments.
In that case, each conf_mat hash is treated as an individual run result, and the corresponding individual and accumulated reports are generated.
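For instance, assuming conf_mat_1 and conf_mat_2 are two such hashes returned by separate validate calls (hypothetical names used only for illustration), either form should be accepted:
# Sketch: combine two run results into one report with individual and accumulated rows.
generate_report([conf_mat_1, conf_mat_2])   # as an array
generate_report(conf_mat_1, conf_mat_2)     # as separate arguments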
Suppose we only want to generate the run reports, but no multi-class confusion matrix or other derived statistics.
run_report = build_run_report(conf_mat)
pp run_report
#=> {:total=>1000, :correct=>966, :incorrect=>34, :accuracy=>0.966}
This data can be used to print the report in a custom manner, or it can be passed to the corresponding provided print method.
print_run_report(run_report, "Custom", true)
This will print the following report; the last argument is set to true to print the header.
Run Total Correct Incorrect Accuracy
Custom 1000 966 34 0.96600
Now, suppose we only want to generate the multi-class confusion matrix report, but no run reports or other derived statistics.
print_conf_mat(conf_mat)
This will print only the confusion matrix.
----------------------- Confusion Matrix -----------------------
Predicted -> Ham Spam Total Recall
----------------------------------------------------------------
Ham 828 27 855 0.96842
Spam 7 138 145 0.95172
----------------------------------------------------------------
Total 835 165 1000
Precision 0.99162 0.83636 Accuracy -> 0.96600
We can convert this multi-class confusion matrix conf_mat to a corresponding confusion table.
Although in the information retrieval world a confusion matrix and a confusion table are the same thing, here we draw a distinction: the confusion table has only binary classes (positive and negative).
It divides the records into true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
A multi-class confusion matrix can be converted to a corresponding confusion table by treating one class as positive and every other class as negative.
If the same process is repeated for each class taken as positive one at a time, we get N confusion tables for a classifier with N classes.
We have a method, conf_mat_to_tab, to perform this conversion.
conf_tab = conf_mat_to_tab(conf_mat)
pp conf_tab
#=> {"Ham"=>{:p=>{:t=>828, :f=>7}, :n=>{:t=>138, :f=>27}},
# "Spam"=>{:p=>{:t=>138, :f=>27}, :n=>{:t=>828, :f=>7}}}
This means conf_tab["Ham"][:p][:t] tells us that, taking Ham as the positive class, there were 828 records that were predicted as positive and the prediction was true (also known as true positives, or TP).
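A couple of lookups against the hash above make the layout clearer (the values come from the output shown earlier):
conf_tab["Ham"][:p][:t]  #=> 828  true positives when Ham is the positive class
conf_tab["Ham"][:n][:f]  #=> 27   false negatives (actual Ham, predicted Spam)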
We can pass this conf_tab hash to the print_conf_tab method to print various derived statistical values for each class.
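For example, the per-class statistics can be printed directly from this table:
print_conf_tab(conf_tab)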
However, if we are interested in treating only one class as the positive class (e.g., Ham), then we can extract the derived values for that class alone.
derivations = conf_tab_derivations(conf_tab["Ham"])
pp derivations
#=> {:total_population=>1000,
# :condition_positive=>855,
# :condition_negative=>145,
# :true_positive=>828,
# :true_negative=>138,
# :false_positive=>7,
# :false_negative=>27,
# :prevalence=>0.855,
# :specificity=>0.9517241379310345,
# :recall=>0.968421052631579,
# :precision=>0.9916167664670659,
# :accuracy=>0.966,
# :f1_score=>0.9798816568047337}
These derivations can then be printed in a more human-readable format.
print_derivations(derivations)
This will print a properly capitalized and aligned report.
Total population : 1000
Condition positive : 855
Condition negative : 145
True positive : 828
True negative : 138
False positive : 7
False negative : 27
Prevalence : 0.855
Specificity : 0.9517241379310345
Recall : 0.968421052631579
Precision : 0.9916167664670659
Accuracy : 0.966
F1 score : 0.9798816568047337
Note: When dealing with real data, there may be cases where derived values (such as precision or recall) come out as zero, which can be a side effect of the denominator in the division being zero.
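If you compute such measures yourself from the confusion table values, it is worth guarding the division; a minimal sketch (the helper name is illustrative and not part of the library):
# Sketch: avoid a NaN from 0.0 / 0 when there are no positive predictions.
def safe_precision(tp, fp)
  denominator = tp + fp
  denominator.zero? ? 0.0 : tp.to_f / denominator
end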
Evaluation
So far we have seen how to validate a classifier implementation against a sample dataset. This can help in selecting the most suitable classifier for a specific application based on its data. However, there are times when we want to evaluate how a well-trained, running classifier is performing as new data comes in for classification. Such an evaluation helps in deciding whether more training is needed to maintain the desired accuracy (or other factors such as precision or recall).
In such cases we cannot use the validation methods, as they would destroy the existing trained model and re-populate the classifier with new data.
Classifier Reborn provides an evaluate method for such cases.
It accepts an instance of a classifier that is already trained and evaluates it against the supplied test data.
Let’s build a classifier, train it, and persist the trained model in a file.
classifier = ClassifierReborn::Bayes.new
training_data.each do |rec|
classifier.train(rec.first, rec.last)
end
model = Marshal.dump(classifier)
File.open("classifier-model.dat", "wb") {|f| f.write(model) }
Now, let’s load that saved classifier and evaluate it.
trained_classifier = Marshal.load(File.binread("classifier-model.dat"))
conf_mat = evaluate(trained_classifier, test_data)
With this conf_mat in hand, we can generate all the reports that were explained in the Custom Validation section above.
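For example, the full report can be regenerated from it:
generate_report(conf_mat)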
Saving the model to a file is not the only way to persist a classifier. Classifier Reborn also supports a Redis backend, which lends itself to a more practical use case of incremental training and evaluation. The process is not much different, except that multiple classifier instances can be connected to a single shared storage backend, and the evaluation instance can be one of them.
redis_backend = ClassifierReborn::BayesRedisBackend.new
classifier = ClassifierReborn::Bayes.new backend: redis_backend
training_data.each do |rec|
classifier.train(rec.first, rec.last)
end
Then we can create an evaluation instance in a separate process (even on a separate host).
evaluation_classifier = ClassifierReborn::Bayes.new backend: redis_backend
conf_mat = evaluate(evaluation_classifier, test_data)
Again, various reports can be generated the same way as explained in the previous section. Here is one such example.
print_conf_mat(conf_mat)
This will print the corresponding confusion matrix.
----------------------- Confusion Matrix -----------------------
Predicted -> Ham Spam Total Recall
----------------------------------------------------------------
Ham 828 27 855 0.96842
Spam 7 138 145 0.95172
----------------------------------------------------------------
Total 835 165 1000
Precision 0.99162 0.83636 Accuracy -> 0.96600