A guide to stratified cross-validation and parameter tuning in Python

2. Gather Data

The initial raw data is a text file (topic_detection_train.v1.0.txt). The file consists of lines, each containing one (label, text) pair, as described below. The label and the text are separated by the space character ' ', and each label is marked with the '__label__' prefix. The text data contains a lot of noise (icons, extraneous characters).
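
For illustration, here is a minimal sketch of how such a file could be read into (label, text) pairs. It assumes only the format described above (a '__label__...' token, a single space, then the raw text); the helper name load_labeled_lines is purely illustrative.

# Sketch: read (label, text) pairs from a fastText-style label file.
# Assumes each line starts with a '__label__...' token, followed by one
# space and then the raw text, as described above.
def load_labeled_lines(path):
    pairs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            label, _, text = line.partition(' ')  # split on the first space only
            pairs.append((label.replace('__label__', '', 1), text))
    return pairs

samples = load_labeled_lines('topic_detection_train.v1.0.txt')
print(samples[:3])  # inspect the first few (label, text) pairs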

scikit-learn: Model selection: choosing estimators and their parameters

scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator during the construction and exposes an estimator API:


>>> # Setup assumed by this excerpt: the digits data and a linear SVC
>>> import numpy as np
>>> from sklearn import datasets, svm
>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> digits = datasets.load_digits()
>>> X_digits, y_digits = digits.data, digits.target
>>> svc = svm.SVC(kernel='linear')
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...

>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...

By default, the GridSearchCV uses a 3-fold cross-validation. However, if it detects that a classifier is passed, rather than a regressor, it uses a stratified 3-fold.
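
If a different splitting strategy is needed, the cv parameter of GridSearchCV accepts either a number of folds or a CV splitter object. A small sketch, reusing the svc, Cs, X_digits and y_digits from the excerpt above and explicitly requesting 5 stratified folds:

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Pass an explicit splitter instead of relying on the default
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), cv=cv, n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])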

-------------------------------------------------------------------------------------------------------------------

In machine learning, two tasks are commonly done at the same time in data pipelines: cross validation and (hyper)parameter tuning. Cross validation is the process of training learners using one set of data and testing them using a different set. Parameter tuning is the process of selecting the values for a model's parameters that maximize the accuracy of the model.

In this tutorial we work through an example which combines cross validation and parameter tuning using scikit-learn.

Note: This tutorial is based on examples given in the scikit-learn documentation. I have combined a few examples in the documentation, simplified the code, and added extensive explanations/code comments.

Preliminaries

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn import datasets, svm
import matplotlib.pyplot as plt

Create Two Datasets

In the code below, we load the digits dataset, which contains 64 feature variables. Each feature denotes the darkness of a pixel in an 8 by 8 image of a handwritten digit. We can see these features for the first observation:

# Load the digit data
digits = datasets.load_digits()

# View the features of the first observation
digits.data[0:1]

array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
         15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
          8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
          5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
          1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
          0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])

The target data is a vector containing the image's true digit. For example, the first observation is a handwritten digit for '0'.

# View the target of the first observation
digits.target[0:1]

To demonstrate cross validation and parameter tuning, first we are going to divide the digit data into two datasets called data1 and data2. data1 contains the first 1000 rows of the digits data, while data2 contains the remaining ~800 rows. Note that this split is separate from the cross validation we will conduct and is done purely to demonstrate something at the end of the tutorial. In other words, don't worry about data2 for now; we will come back to it.

# Create dataset 1
data1_features = digits.data[:1000]
data1_target = digits.target[:1000]

# Create dataset 2
data2_features = digits.data[1000:]
data2_target = digits.target[1000:]

Create Parameter Candidates

Before looking for which combination of parameter values produces the most accurate model, we must specify the different candidate values we want to try. In the code below we have a number of candidate parameter values, including four different values for C (1, 10, 100, 1000), two values for gamma (0.001, 0.0001), and two kernels (linear, rbf). The grid search will try all combinations of parameter values within each candidate dictionary and select the combination that produces the most accurate model.

parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

Conduct Grid Search To Find Parameters Producing Highest Score

Now we are ready to conduct the grid search using scikit-learn's GridSearchCV, which stands for grid search cross validation. By default, GridSearchCV's cross validation uses 3-fold StratifiedKFold for classifiers and plain KFold otherwise.

# Create a classifier object with the classifier and parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

# Train the classifier on data1's feature and target data
clf.fit(data1_features, data1_target)   

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}, {'kernel': ['rbf'], 'gamma': [0.001, 0.0001], 'C': [1, 10, 100, 1000]}],
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

Success! We have our results! First, let's look at the best score found on data1, which is the mean cross-validation accuracy of the best parameter combination.

# View the accuracy score
print('Best score for data1:', clf.best_score_)

Best score for data1: 0.942

Which parameters are the best? We can tell scikit-learn to display them:

# View the best parameters for the model found using grid search
print('Best C:',clf.best_estimator_.C)
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)

Best C: 10
Best Kernel: rbf
Best Gamma: 0.001

This tells us that the most accurate model uses C=10, the rbf kernel, and gamma=0.001.
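
Equivalently, the winning combination can be read in one go from the best_params_ attribute, which returns it as a single dictionary:

# View all best parameters at once
print('Best parameters:', clf.best_params_)
# e.g. {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}, matching the values above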

Sanity Check Using Second Dataset

Remember the second dataset we created? Now we will use it to prove that those parameters are actually used by the model. First, we apply the classifier we just trained to the second dataset. Then we will train a new support vector classifier from scratch using the parameters found using the grid search. We should get the same results for both models.

# Apply the classifier trained using data1 to data2, and view the accuracy score
clf.score(data2_features, data2_target)  

# Train a new classifier using the best parameters found by the grid search
svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(data1_features, data1_target).score(data2_features, data2_target)
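
To make the "same results" claim explicit, the two scores can also be compared programmatically; a small sketch along the lines of the code above:

# Compare the grid-searched classifier with a fresh SVC trained on the
# best parameters found above; both are scored on data2
import numpy as np
from sklearn import svm

score_grid = clf.score(data2_features, data2_target)
score_manual = svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(
    data1_features, data1_target).score(data2_features, data2_target)
print(np.isclose(score_grid, score_manual))  # expected: True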

Success!

-------------------------------------------------------------------------------------------------------------

K-Fold Cross Validation and GridSearchCV in Scikit-Learn

Python is one of the most popular open-source languages for data analysis (along with R), and for good reason. With well-supported open source libraries such as NumPy and SciPy, Python is powerful enough for mining large and complex datasets, and yet versatile enough as a general-purpose programming language to integrate smoothly with web applications, databases, and other things.

Today, we’ll be taking a quick look at the basics of K-Fold Cross Validation and GridSearchCV in the popular machine learning library Scikit-Learn. Although this won’t be comprehensive, we will dig into a few of the nuances of using these. In using these two tools, we are seeking to address two main problems in data analysis.

  1. How do I generate more data for testing?
  2. How do I find optimal parameter values for my models?

Let’s start by exploring K-Fold Cross Validation, which is slightly simpler than GridSearchCV. We’ll call it KFCV for short. First, load up the canonical UCI digits dataset conveniently built into Scikit-Learn. Here, we’ll just use the first 1000 samples (out of 1797 total). Note that our data has 64 features, corresponding to an 8×8 grid of pixels which represent the image. The labels are 0 through 9.

from sklearn import datasets, linear_model, cross_validation, grid_search
import numpy as np
digits = datasets.load_digits()
x = digits.data[:1000]
y = digits.target[:1000]

Printing the shapes of these two matrices x and y yields (1000, 64) and (1000,), respectively. K-Fold Cross Validation is used to validate your model by generating different train/test combinations of the data you already have. For example, if you have 100 samples, you can train your model on the first 90 and test on the last 10. Then you could train on samples 1-80 and 91-100, and test on samples 81-90. Then repeat. This way, you get different combinations of train/test data, essentially giving you ‘more’ data for validation from your original data. The number of times you ‘switch around’ the train/test data is the number of folds. Therefore, 3-Fold Cross Validation will yield 3 sets of train/test data, 5-Fold Cross Validation will yield 5 sets, and so forth. Here’s how we set it up:

kf_total = cross_validation.KFold(len(x), n_folds=10, indices=True, shuffle=True, random_state=4)
for train, test in kf_total:
    print train, '\n', test, '\n\n'

A few notes about using the above method for KFCV:

  • The first argument above is the # of samples we want to deal with. Since we’re doing KFCV over our entire 1000-sample digits dataset, we set it to the length of x.
  • The second argument is the # of folds; 10 is often used.
  • The third argument, indices, indicates whether to return the indices of the original data rather than the elements themselves. Useful if you need to keep working with the original data.
  • The fourth argument, shuffle, means that KFCV will mix up the data (as you’ll see below), so the generated test indices won’t necessarily be 0, 1, 2, 3, etc. Although shuffle lets you throw in some randomness, you don’t want to set it to True for every dataset. For example, if you’re working with the well-known iris dataset, where samples 1-50 are from one kind of flower and the samples from 50 on are from another kind, you don’t necessarily want to mix up samples 1-50 and 50+.
  • The fifth argument, random_state, seeds the shuffling so that the splits are reproducible.

Running the above code yields ten sets of train/test data (adding the ellipsis for brevity):

[ 0 1 2 3 4 5 6 ...]
[11 13 17 33 34 35 36 ...]

[ 0 1 2 4 5 6 7 ...]
[ 3 8 14 15 18 24 25 ...]

[ 0 2 3 5 7 8 9 ...]
[ 1 4 6 12 16 19 20 ...]
...

It’s hard to tell here, but if you print out the above train/test data fully you’ll see that each training set has more elements than each corresponding test set.
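
As an aside, in scikit-learn 0.18 and later the cross_validation module was folded into sklearn.model_selection and the KFold signature changed: the number of samples is no longer passed to the constructor, indices are always returned, and the splits come from a .split() method. A roughly equivalent setup under those assumptions:

from sklearn.model_selection import KFold

# 10-fold CV with shuffling, seeded for reproducibility
kf = KFold(n_splits=10, shuffle=True, random_state=4)
for train, test in kf.split(x):
    print(train, '\n', test, '\n\n')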

Now, if we create a Scikit-Learn model as usual, we can use the returned train/test indices to see how well our model performs against KFCV’s 10 generated datasets:

lr = linear_model.LogisticRegression()
[lr.fit(x[train_indices], y[train_indices]).score(x[test_indices],y[test_indices])
for train_indices, test_indices in kf_total]

This is well-documented in the official tutorial page on estimator validation with KFCV. Running the above code gives a plain Python list of 10 floats, i.e. the prediction accuracy on each of our 10 train/test splits:

[0.95999999999999996,
0.95999999999999996,
0.98999999999999999,
0.96999999999999997,
0.97999999999999998,
0.96999999999999997,
0.93999999999999995,
0.94999999999999996,
0.94999999999999996,
0.96999999999999997]

Alternately, you could also run the process using Scikit-Learn’s pre-implemented tool for scoring and validating a model, cross_val_score:

cross_validation.cross_val_score(lr, x, y, cv=kf_total, n_jobs = 1)

This gives the same results as above, although (at least in an IPython notebook) NumPy's array display rounds the printed values, so they appear truncated to 2 decimal places.

Moving on to the second problem mentioned at the beginning of this post, we’ll now check out GridSearchCV. This allows us to create a special model that will find its optimal parameter values. For example, one of the parameters for linear_model.LogisticRegression is C, the inverse of regularization strength (similar to the cost variable for SVMs). The lower the value of C, the stronger the regularization.

It’s relatively easy to get started with GridSearchCV. Let’s check out some of the example code (slightly modified) from the official tutorial:

c_range = np.logspace(0, 4, 10)
lrgs = grid_search.GridSearchCV(estimator=lr, param_grid=dict(C=c_range), n_jobs=1)

The first line sets up a range of candidate values for the parameter C. The function numpy.logspace, in this line, returns 10 values whose exponents are evenly spaced between 0 and 4 (inclusive), i.e. candidate values of C ranging from 10^0 up to 10^4. It’s unlikely the optimal C will be on the order of 10^4, of course, but that’s another story.
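
To see what that grid actually contains, note that np.logspace(0, 4, 10) is simply 10 raised to ten evenly spaced exponents between 0 and 4:

import numpy as np

c_range = np.logspace(0, 4, 10)   # equivalent to 10 ** np.linspace(0, 4, 10)
print(c_range[0], c_range[-1])    # 1.0 10000.0
print(len(c_range))               # 10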

The second line builds our classifier. Here’s a rundown of each argument, as described in the docs:

  • estimator: perhaps obvious, but the estimator you want to use. You can either create the estimator right there (i.e. estimator=linear_model.LogisticRegression()), or pass in a classifier you’ve already created.
  • param_grid: the parameters you want to optimize. This can be a dictionary, as above, or a list of dictionaries. Important: make sure that each key of the dictionary is the real name of the estimator’s corresponding parameter. For example, you must pass in C for LogisticRegression() as above if you want to optimize the regularization constant.
  • n_jobs: the # of CPUs you want to use. Obviously, using more (with n_jobs=-1) should usually be faster, but if you want to run cross_validation.cross_val_score with the new optimized estimator, you must set n_jobs=1.
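
As with KFold above, on scikit-learn 0.18+ the grid_search module no longer exists; the same estimator would be built from sklearn.model_selection with an otherwise identical interface. A sketch under that assumption:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Same candidate grid for C, same single-process setting
c_range = np.logspace(0, 4, 10)
lrgs = GridSearchCV(estimator=LogisticRegression(),
                    param_grid=dict(C=c_range),
                    n_jobs=1)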

Now, we can run cross-validation techniques on this new GridSearchCV estimator as before:

[lrgs.fit(x[train],y[train]).score(x[test],y[test]) for train, test in kf_total]
----
[0.95999999999999996,
0.95999999999999996,
0.98999999999999999,
0.96999999999999997,
0.97999999999999998,
0.96999999999999997,
0.93999999999999995,
0.94999999999999996,
0.94999999999999996,
0.96999999999999997]

In addition to the scores for the 10 datasets, we find a couple more attributes on our optimized model: the best score from the internal cross-validation and the best value of C found in the grid. Recall that in Scikit-Learn, attributes learned during fitting are expressed with a trailing underscore. For example, a Logistic Regression estimator lr will have an intercept and coefficients, which can be accessed with lr.intercept_ and lr.coef_.

print lrgs.best_score_
print lrgs.best_estimator_.C
----
0.9555555555555556
1.0

Finally, as before, we can run cross_validation.cross_val_score on our newly optimized estimator, which gives the same results as directly calling lrgs.fit() in a list comprehension a couple paragraphs back.

cross_validation.cross_val_score(lrgs, x, y, cv=kf_total, n_jobs=1)
----
array([ 0.96, 0.96, 0.99, 0.97, 0.98, 0.97, 0.94, 0.95, 0.95, 0.97])