May 2016 – かものはしの分析ブログ

XGBoost Parameter Tuning in Practice with Python

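In an earlier post I introduced a procedure for tuning XGBoost parameters. Here I run that procedure against a real dataset. The program itself is published at Analytics_Vidhya/Articles/Parameter_Tuning_XGBoost_with_Example/XGBoost models.ipynb, but no dataset comes with it. I therefore use the red wine dataset obtained in the previous post (不均衡なデータの分類問題について with Python, on classification with imbalanced data). Parts of the original code did not run as written, possibly because of typos, so I have corrected them.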

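The post follows the steps below. It is fairly long, so please work through it patiently.
・Loading libraries
・Loading the data
・Preprocessing
・Creating the training and test data
・Defining a function that returns the AUC from XGBoost's predictions
・Running the model
・Tuning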

Loading libraries

import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
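# Note: sklearn.cross_validation and sklearn.grid_search are the module names in use when this
# post was written. scikit-learn 0.20 removed them in favor of sklearn.model_selection.
from sklearn import cross_validation, metrics
from sklearn.grid_search import GridSearchCV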

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

Loading the data

#importing the red wine data
wine_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")

Preprocessing

# Assign a unique ID to each row.
wine_df['ID'] = range(1, len(wine_df) + 1)

# Convert the wine quality to a 0/1 label: 0 if quality is below 6, otherwise 1.
Y = wine_df.quality.values
wine_df.quality = np.asarray([1 if i >= 6 else 0 for i in Y])
wine_df.head(10)
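
The thresholding rule can be checked on a few values without loading the dataset. The following is a minimal pure-Python sketch; the helper name `binarize_quality` and the sample scores are made up for this illustration and are not part of the original notebook.

```python
def binarize_quality(scores, threshold=6):
    """Map each quality score to 1 if it is >= threshold, otherwise 0."""
    return [1 if s >= threshold else 0 for s in scores]


sample_scores = [3, 5, 6, 7, 8]  # hypothetical quality values
labels = binarize_quality(sample_scores)
print(labels)  # [0, 0, 1, 1, 1]
```

In pandas the same rule can also be written in vectorized form as `(wine_df['quality'] >= 6).astype(int)`.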

Creating the training and test data

# Create the training and test data
msk = np.random.rand(len(wine_df)) < 0.8  # draw one uniform random number per row; rows below 0.8 go to training
train = wine_df[msk]
test = wine_df[~msk]

train.shape, test.shape
((1236, 13), (363, 13))

target='quality'
IDcol = 'ID'

# Check the distribution of the target variable in the training data
train[target].value_counts()
1    659
0    577
Name: quality, dtype: int64
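
Because the mask is drawn at random without a seed, the split sizes (1236 and 363 here) change from run to run and are only approximately 80/20. The sketch below shows the same masking idea with a fixed seed using only the standard library; `random_mask_split` and the seed value are assumptions made for this illustration, not part of the original code.

```python
import random


def random_mask_split(n_rows, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx) using one uniform draw per row."""
    rng = random.Random(seed)  # fixed seed, so the split is identical on every run
    mask = [rng.random() < train_fraction for _ in range(n_rows)]
    train_idx = [i for i, keep in enumerate(mask) if keep]
    test_idx = [i for i, keep in enumerate(mask) if not keep]
    return train_idx, test_idx


# 1599 rows = 1236 + 363, the total row count shown above
train_idx, test_idx = random_mask_split(1599, train_fraction=0.8, seed=42)
print(len(train_idx), len(test_idx))  # roughly 80% / 20%, not exactly
```

With NumPy, calling `np.random.seed(...)` before `np.random.rand(...)` plays the same role.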

Defining a function that returns the AUC from XGBoost's predictions

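The function below fits the model, prints the accuracy and AUC on the training data and the AUC on the test data, and plots the importance of each feature.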

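# Create a data frame to hold the test results
test_results = pd.DataFrame(data=test.ID)

# Define the function. It targets Python 2 and the xgboost API of the time;
# later xgboost releases renamed show_progress to verbose_eval and booster() to get_booster().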
def modelfit(alg, dtrain, dtest, predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        xgtest = xgb.DMatrix(dtest[predictors].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics=['auc'], early_stopping_rounds=early_stopping_rounds, show_progress=False)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain[target],eval_metric=['auc'])
        
    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
        
    #Print model report:
    print "\nModel Report"
    print "Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions)
    print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob)
    
    # Predict on testing data:
    dtest['predprob'] = alg.predict_proba(dtest[predictors])[:,1]
    #results = test_results.merge(dtest[['ID','predprob']], on='ID')
    print 'AUC Score (Test): %f' % metrics.roc_auc_score(dtest[target], dtest['predprob'])
                
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
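
For intuition about the AUC values that `modelfit` prints: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The following is a minimal pure-Python sketch of that definition. The name `pairwise_auc` and the toy data are made up for this illustration, and the quadratic loop is meant for reading, not for large datasets.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75.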

Running the model

predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
        learning_rate =0.1,
        n_estimators=1000,
        max_depth=5,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective= 'binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27)

modelfit(xgb1, train, test, predictors)

Will train until cv error hasn't decreased in 50 rounds.
Stopping. Best iteration: 237

Model Report
Accuracy : 1
AUC Score (Train): 1.000000
AUC Score (Test): 0.875199
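
The "Best iteration" line comes from early stopping: cross-validation stops once the AUC has not improved for 50 rounds, and `modelfit` then sets `n_estimators` to the number of rounds kept. The sketch below is a simplified model of that rule, not xgboost's actual implementation; `best_round_with_patience` and the AUC sequence are invented for this illustration.

```python
def best_round_with_patience(scores, patience):
    """Return the 1-based round with the best score, stopping after `patience` rounds without improvement."""
    best_score = float("-inf")
    best_round = 0
    for round_no, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            break  # no improvement for `patience` rounds: stop scanning
    return best_round


cv_auc = [0.70, 0.74, 0.76, 0.77, 0.768, 0.769, 0.765, 0.78]  # hypothetical per-round CV AUC
print(best_round_with_patience(cv_auc, patience=3))  # 4
print(best_round_with_patience(cv_auc, patience=5))  # 8
```

With a patience of 3 the scan stops at round 7 and never sees the later improvement at round 8, which is why a generous value such as 50 is used in the post.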

Tuning

The following program tunes max_depth and min_child_weight.

# Grid search on max_depth and min_child_weight
# predictors holds every column except the target and ID columns
param_test1 = {
    'max_depth':range(3,10,2),
    'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=1000, max_depth=5,
                                        min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                        objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
                       param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1.fit(train[predictors],train[target])

gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

([mean: 0.76728, std: 0.03045, params: {'max_depth': 3, 'min_child_weight': 1},
  mean: 0.76649, std: 0.03378, params: {'max_depth': 3, 'min_child_weight': 3},
  mean: 0.76540, std: 0.03620, params: {'max_depth': 3, 'min_child_weight': 5},
  mean: 0.76509, std: 0.03183, params: {'max_depth': 5, 'min_child_weight': 1},
  mean: 0.76430, std: 0.02988, params: {'max_depth': 5, 'min_child_weight': 3},
  mean: 0.76221, std: 0.03336, params: {'max_depth': 5, 'min_child_weight': 5},
  mean: 0.77162, std: 0.03335, params: {'max_depth': 7, 'min_child_weight': 1},
  mean: 0.76575, std: 0.03585, params: {'max_depth': 7, 'min_child_weight': 3},
  mean: 0.76277, std: 0.03511, params: {'max_depth': 7, 'min_child_weight': 5},
  mean: 0.77235, std: 0.03283, params: {'max_depth': 9, 'min_child_weight': 1},
  mean: 0.76452, std: 0.03414, params: {'max_depth': 9, 'min_child_weight': 3},
  mean: 0.76114, std: 0.03561, params: {'max_depth': 9, 'min_child_weight': 5}],
 {'max_depth': 9, 'min_child_weight': 1},
 0.77235073909956886)
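
The twelve rows in this output are the Cartesian product of the two parameter lists. The sketch below enumerates the same grid with the standard library to show how quickly the number of fits grows. It illustrates the idea only and is not how GridSearchCV is implemented internally.

```python
from itertools import product

param_grid = {
    "max_depth": list(range(3, 10, 2)),        # [3, 5, 7, 9]
    "min_child_weight": list(range(1, 6, 2)),  # [1, 3, 5]
}
cv_folds = 5

names = sorted(param_grid)
combos = [dict(zip(names, values))
          for values in product(*(param_grid[n] for n in names))]

print(len(combos))             # 12 parameter combinations
print(len(combos) * cv_folds)  # 60 cross-validation fits
print(combos[0], combos[-1])
```

Each of those fits trains up to 1,000 trees, which is why the search takes a while even on a small dataset.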

Next, search again for the best parameters on a finer grid.

# Grid search on max_depth and min_child_weight (finer grid)
# predictors holds every column except the target and ID columns
param_test2 = {
    'max_depth':[4,5,6,7,8,9],
    'min_child_weight':[1,2,3,4,5,6]
}
gsearch2 = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=1000, max_depth=5,
                                        min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                        objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
                       param_grid = param_test2, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2.fit(train[predictors],train[target])

gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

([mean: 0.76820, std: 0.03336, params: {'max_depth': 4, 'min_child_weight': 1},
  mean: 0.76431, std: 0.02792, params: {'max_depth': 4, 'min_child_weight': 2},
  mean: 0.76171, std: 0.03356, params: {'max_depth': 4, 'min_child_weight': 3},
  mean: 0.76257, std: 0.03277, params: {'max_depth': 4, 'min_child_weight': 4},
  mean: 0.76128, std: 0.03661, params: {'max_depth': 4, 'min_child_weight': 5},
  mean: 0.75902, std: 0.03280, params: {'max_depth': 4, 'min_child_weight': 6},
  mean: 0.76509, std: 0.03183, params: {'max_depth': 5, 'min_child_weight': 1},
  mean: 0.76426, std: 0.02974, params: {'max_depth': 5, 'min_child_weight': 2},
  mean: 0.76430, std: 0.02988, params: {'max_depth': 5, 'min_child_weight': 3},
  mean: 0.76262, std: 0.02992, params: {'max_depth': 5, 'min_child_weight': 4},
  mean: 0.76221, std: 0.03336, params: {'max_depth': 5, 'min_child_weight': 5},
  mean: 0.76655, std: 0.03397, params: {'max_depth': 5, 'min_child_weight': 6},
  mean: 0.77066, std: 0.02936, params: {'max_depth': 6, 'min_child_weight': 1},
  mean: 0.76422, std: 0.03038, params: {'max_depth': 6, 'min_child_weight': 2},
  mean: 0.76126, std: 0.03021, params: {'max_depth': 6, 'min_child_weight': 3},
  mean: 0.76334, std: 0.03176, params: {'max_depth': 6, 'min_child_weight': 4},
  mean: 0.76347, std: 0.03245, params: {'max_depth': 6, 'min_child_weight': 5},
  mean: 0.76437, std: 0.03546, params: {'max_depth': 6, 'min_child_weight': 6},
  mean: 0.77162, std: 0.03335, params: {'max_depth': 7, 'min_child_weight': 1},
  mean: 0.76140, std: 0.03245, params: {'max_depth': 7, 'min_child_weight': 2},
  mean: 0.76575, std: 0.03585, params: {'max_depth': 7, 'min_child_weight': 3},
  mean: 0.76345, std: 0.03518, params: {'max_depth': 7, 'min_child_weight': 4},
  mean: 0.76277, std: 0.03511, params: {'max_depth': 7, 'min_child_weight': 5},
  mean: 0.75858, std: 0.03375, params: {'max_depth': 7, 'min_child_weight': 6},
  mean: 0.77487, std: 0.03377, params: {'max_depth': 8, 'min_child_weight': 1},
  mean: 0.76740, std: 0.03472, params: {'max_depth': 8, 'min_child_weight': 2},
  mean: 0.76048, std: 0.03267, params: {'max_depth': 8, 'min_child_weight': 3},
  mean: 0.76288, std: 0.03773, params: {'max_depth': 8, 'min_child_weight': 4},
  mean: 0.76138, std: 0.03045, params: {'max_depth': 8, 'min_child_weight': 5},
  mean: 0.76233, std: 0.03652, params: {'max_depth': 8, 'min_child_weight': 6},
  mean: 0.77235, std: 0.03283, params: {'max_depth': 9, 'min_child_weight': 1},
  mean: 0.76929, std: 0.03267, params: {'max_depth': 9, 'min_child_weight': 2},
  mean: 0.76452, std: 0.03414, params: {'max_depth': 9, 'min_child_weight': 3},
  mean: 0.76152, std: 0.03731, params: {'max_depth': 9, 'min_child_weight': 4},
  mean: 0.76114, std: 0.03561, params: {'max_depth': 9, 'min_child_weight': 5},
  mean: 0.76551, std: 0.03394, params: {'max_depth': 9, 'min_child_weight': 6}],
 {'max_depth': 8, 'min_child_weight': 1},
 0.77486987248915451)
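
When reading these scores, it helps to compare the gaps between means with the fold-to-fold standard deviation. The sketch below does that for the min_child_weight=1 rows, with the numbers copied from the output above. The "within one standard deviation" check is a rule of thumb added here, not part of the original analysis.

```python
# (mean AUC, std, max_depth) for min_child_weight=1, copied from the grid search output
rows = [
    (0.76820, 0.03336, 4),
    (0.76509, 0.03183, 5),
    (0.77066, 0.02936, 6),
    (0.77162, 0.03335, 7),
    (0.77487, 0.03377, 8),
    (0.77235, 0.03283, 9),
]

best_mean, best_std, best_depth = max(rows)  # tuples compare by mean AUC first
within_one_std = [depth for mean, std, depth in rows if mean >= best_mean - best_std]

print(best_depth)      # 8
print(within_one_std)  # [4, 5, 6, 7, 8, 9]
```

Every depth falls within one standard deviation of the best score, so the choice of max_depth=8 is a weak preference rather than a clear winner.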

With max_depth fixed at 8 and min_child_weight at 1, move on to the other parameters.
Next, tune gamma.

# Grid search on gamma
# predictors holds every column except the target and ID columns
param_test3 = {
    'gamma':[i/10.0 for i in range(0,5)]
}
gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=1000, max_depth=8,
                                        min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                        objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
                       param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch3.fit(train[predictors],train[target])

gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

([mean: 0.77487, std: 0.03377, params: {'gamma': 0.0},
  mean: 0.77689, std: 0.03298, params: {'gamma': 0.1},
  mean: 0.77735, std: 0.03117, params: {'gamma': 0.2},
  mean: 0.78163, std: 0.03076, params: {'gamma': 0.3},
  mean: 0.78790, std: 0.03328, params: {'gamma': 0.4}],
 {'gamma': 0.4},
 0.78789976715320331)
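
The mean AUC rises at every step of this grid and the best value, 0.4, is the largest one tried, so the optimum may lie beyond the searched range. The sketch below is a small check for that situation. `is_on_grid_edge` is a hypothetical helper introduced for this illustration.

```python
def is_on_grid_edge(best_value, grid):
    """True when the best value is the smallest or largest value that was tried."""
    ordered = sorted(grid)
    return best_value == ordered[0] or best_value == ordered[-1]


gamma_grid = [i / 10.0 for i in range(0, 5)]  # [0.0, 0.1, 0.2, 0.3, 0.4]
best_gamma = 0.4  # taken from best_params_ in the output

print(is_on_grid_edge(best_gamma, gamma_grid))  # True
```

When the check returns True, one option is to rerun the search with the range extended in that direction (for gamma, values above 0.4) before fixing the parameter.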

Set gamma to 0.4.
Now refit the model with the parameters tuned so far. The test AUC comes out higher than the earlier 0.875.

predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb2 = XGBClassifier(
        learning_rate =0.1,
        n_estimators=1000,
        max_depth=8,
        min_child_weight=1,
        gamma=0.4,
        subsample=0.8,
        colsample_bytree=0.8,
        objective= 'binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27)
modelfit(xgb2, train, test, predictors)


Will train until cv error hasn't decreased in 50 rounds.
Stopping. Best iteration: 120

Model Report
Accuracy : 1
AUC Score (Train): 1.000000
AUC Score (Test): 0.884028

Next, tune subsample and colsample_bytree.

# Grid search on subsample and colsample_bytree
# predictors holds every column except the target and ID columns
param_test4 = {
    'subsample':[i/10.0 for i in range(6,10)],
    'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=8,
                                        min_child_weight=1, gamma=0.4, subsample=0.8, colsample_bytree=0.8,
                                        objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
                       param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch4.fit(train[predictors],train[target])

gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

([mean: 0.78994, std: 0.02779, params: {'subsample': 0.6, 'colsample_bytree': 0.6},
  mean: 0.78900, std: 0.03519, params: {'subsample': 0.7, 'colsample_bytree': 0.6},
  mean: 0.78509, std: 0.03202, params: {'subsample': 0.8, 'colsample_bytree': 0.6},
  mean: 0.78706, std: 0.02848, params: {'subsample': 0.9, 'colsample_bytree': 0.6},
  mean: 0.78511, std: 0.03140, params: {'subsample': 0.6, 'colsample_bytree': 0.7},
  mean: 0.78343, std: 0.03336, params: {'subsample': 0.7, 'colsample_bytree': 0.7},
  mean: 0.78939, std: 0.03203, params: {'subsample': 0.8, 'colsample_bytree': 0.7},
  mean: 0.78646, std: 0.04090, params: {'subsample': 0.9, 'colsample_bytree': 0.7},
  mean: 0.77809, std: 0.03452, params: {'subsample': 0.6, 'colsample_bytree': 0.8},
  mean: 0.78994, std: 0.03483, params: {'subsample': 0.7, 'colsample_bytree': 0.8},
  mean: 0.79369, std: 0.03232, params: {'subsample': 0.8, 'colsample_bytree': 0.8},
  mean: 0.79207, std: 0.03057, params: {'subsample': 0.9, 'colsample_bytree': 0.8},
  mean: 0.78466, std: 0.02672, params: {'subsample': 0.6, 'colsample_bytree': 0.9},
  mean: 0.78863, std: 0.03289, params: {'subsample': 0.7, 'colsample_bytree': 0.9},
  mean: 0.78905, std: 0.02660, params: {'subsample': 0.8, 'colsample_bytree': 0.9},
  mean: 0.78501, std: 0.03666, params: {'subsample': 0.9, 'colsample_bytree': 0.9}],
 {'colsample_bytree': 0.8, 'subsample': 0.8},
 0.79369231068019075)

Tune these parameters again over a narrower range.

# Grid search on subsample and colsample_bytree (narrower range)
# predictors holds every column except the target and ID columns
param_test5 = {
    'subsample':[i/100.0 for i in range(75,90,5)],
    'colsample_bytree':[i/100.0 for i in range(75,90,5)]
}
gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=8,
                                        min_child_weight=1, gamma=0.4, subsample=0.8, colsample_bytree=0.8,
                                        objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
                       param_grid = param_test5, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch5.fit(train[predictors],train[target])

gsearch5.grid_scores_, gsearch5.best_params_, gsearch5.best_score_

([mean: 0.78890, std: 0.03171, params: {'subsample': 0.75, 'colsample_bytree': 0.75},
  mean: 0.79369, std: 0.03232, params: {'subsample': 0.8, 'colsample_bytree': 0.75},
  mean: 0.79374, std: 0.03061, params: {'subsample': 0.85, 'colsample_bytree': 0.75},
  mean: 0.78890, std: 0.03171, params: {'subsample': 0.75, 'colsample_bytree': 0.8},
  mean: 0.79369, std: 0.03232, params: {'subsample': 0.8, 'colsample_bytree': 0.8},
  mean: 0.79374, std: 0.03061, params: {'subsample': 0.85, 'colsample_bytree': 0.8},
  mean: 0.78418, std: 0.03232, params: {'subsample': 0.75, 'colsample_bytree': 0.85},
  mean: 0.78905, std: 0.02660, params: {'subsample': 0.8, 'colsample_bytree': 0.85},
  mean: 0.78367, std: 0.03582, params: {'subsample': 0.85, 'colsample_bytree': 0.85}],
 {'colsample_bytree': 0.75, 'subsample': 0.85},
 0.79374219292158221)
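
In this output the rows for colsample_bytree=0.75 and colsample_bytree=0.8 are identical for every subsample value. One plausible explanation, stated here as an assumption about XGBoost's internals rather than something the original notebook confirms, is that the number of sampled columns is truncated to an integer. With only 11 predictors, both fractions would then select the same number of columns. The sketch below shows that arithmetic; `columns_per_tree` is a hypothetical helper.

```python
def columns_per_tree(n_features, colsample_bytree):
    """Assumed rule: truncate the product to an integer, but keep at least one column."""
    return max(1, int(colsample_bytree * n_features))


n_features = 11  # 13 columns in train minus the target (quality) and ID
for frac in [0.6, 0.7, 0.75, 0.8, 0.85, 0.9]:
    print(frac, columns_per_tree(n_features, frac))
```

Under this assumption 0.75 and 0.8 both give 8 columns, and 0.85 and 0.9 both give 9. That agrees with the scores for subsample=0.8, where colsample_bytree=0.85 here and colsample_bytree=0.9 in the previous search both yield 0.78905.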

Next, tune reg_alpha.

# Grid search on reg_alpha
# predictors holds every column except the target and ID columns
param_test6 = {
    'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=8,
                                        min_child_weight=1, gamma=0.4, subsample=0.85, colsample_bytree=0.75,
                                        objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
                       param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch6.fit(train[predictors],train[target])

gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

([mean: 0.79377, std: 0.03058, params: {'reg_alpha': 1e-05},
  mean: 0.79068, std: 0.02953, params: {'reg_alpha': 0.01},
  mean: 0.79298, std: 0.03268, params: {'reg_alpha': 0.1},
  mean: 0.78731, std: 0.03270, params: {'reg_alpha': 1},
  mean: 0.72370, std: 0.03333, params: {'reg_alpha': 100}],
 {'reg_alpha': 1e-05},
 0.79376831622356758)
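
The sharp drop at reg_alpha=100 is consistent with reg_alpha acting as an L1 penalty on the leaf weights: once the penalty exceeds the summed gradient in a leaf, that leaf's weight is pushed to zero. The sketch below illustrates this with a soft-thresholding formula, assuming the usual regularized objective with an L2 term reg_lambda. The function names and the gradient statistics are made up for this illustration.

```python
def threshold_l1(g, alpha):
    """Soft-threshold: shrink g toward zero by alpha, clamping at zero."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(sum_grad, sum_hess, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf weight under an L1 penalty (reg_alpha) and an L2 penalty (reg_lambda)."""
    return -threshold_l1(sum_grad, reg_alpha) / (sum_hess + reg_lambda)


sum_grad, sum_hess = -5.0, 10.0  # hypothetical gradient statistics for one leaf
for alpha in [1e-5, 1e-2, 0.1, 1, 100]:
    print(alpha, leaf_weight(sum_grad, sum_hess, reg_alpha=alpha))
```

For small alpha the weight barely changes, while alpha=100 zeroes the leaf entirely, so values that large leave the model with very little to learn from.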

That grid was coarse, so tune reg_alpha again over finer values.

# Grid search on reg_alpha (finer grid)
# predictors holds every column except the target and ID columns
param_test7 = {
    'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=8,
                                        min_child_weight=1, gamma=0.4, subsample=0.85, colsample_bytree=0.75,
                                        objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
                       param_grid = param_test7, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch7.fit(train[predictors],train[target])

gsearch7.grid_scores_, gsearch7.best_params_, gsearch7.best_score_

([mean: 0.79374, std: 0.03061, params: {'reg_alpha': 0},
  mean: 0.79433, std: 0.03076, params: {'reg_alpha': 0.001},
  mean: 0.79099, std: 0.02989, params: {'reg_alpha': 0.005},
  mean: 0.79068, std: 0.02953, params: {'reg_alpha': 0.01},
  mean: 0.79160, std: 0.02950, params: {'reg_alpha': 0.05}],
 {'reg_alpha': 0.001},
 0.79432567460197734)

Refit the model with all the parameters tuned so far.

xgb3 = XGBClassifier(
        learning_rate =0.1,
        n_estimators=1000,
        max_depth=8,
        min_child_weight=1,
        gamma=0.4,
        subsample=0.85,
        colsample_bytree=0.75,
        reg_alpha=0.001,
        objective= 'binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27)
modelfit(xgb3, train, test, predictors)

Will train until cv error hasn't decreased in 50 rounds.
Stopping. Best iteration: 153

Model Report
Accuracy : 1
AUC Score (Train): 1.000000
AUC Score (Test): 0.880331
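The log line "Will train until cv error hasn't decreased in 50 rounds" describes early stopping. Here is a minimal sketch of that idea, assuming the monitored value is an error to be minimized; the function name `best_iteration` and the error curve are made up for illustration and are not part of XGBoost.

```python
def best_iteration(cv_errors, patience=50):
    """Return the index of the best round, stopping once the error has not
    improved for `patience` consecutive rounds."""
    best_idx = 0
    for i, err in enumerate(cv_errors):
        if err < cv_errors[best_idx]:
            best_idx = i          # new best round
        elif i - best_idx >= patience:
            break                 # no improvement for `patience` rounds
    return best_idx

# Hypothetical error curve: improves for 5 rounds, then flattens out.
errors = [0.30, 0.25, 0.22, 0.21, 0.20] + [0.205] * 10
print(best_iteration(errors, patience=3))
```

This is why n_estimators=1000 acts only as an upper bound: training stops well before that once the cross-validated metric stalls.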

As in the original blog post, raise the number of boosting rounds from 1,000 to 5,000 and lower the learning rate from 0.1 to 0.01.

xgb4 = XGBClassifier(
        learning_rate =0.01,
        n_estimators=5000,
        max_depth=8,
        min_child_weight=1,
        gamma=0.4,
        subsample=0.85,
        colsample_bytree=0.75,
        reg_alpha=0.001,
        objective= 'binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27)
modelfit(xgb4, train, test, predictors)

Will train until cv error hasn't decreased in 50 rounds.
Stopping. Best iteration: 604

Model Report
Accuracy : 0.9951
AUC Score (Train): 0.999955
AUC Score (Test): 0.888000

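The test AUC rose to 88.8%. Even after all that tuning, the gain is under one percentage point (0.880 to 0.888).

In any case, this code covers the whole cycle of running XGBoost in Python and tuning its parameters, so I plan to reuse it from now on.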

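Classifying Imbalanced Data with Python

KDnuggets, a popular site in data mining circles, featured "Dealing with Unbalanced Classes, SVMs, Random Forests, and Decision Trees in Python", but unfortunately its code was published as images, so I typed it out by hand. Since I went to the trouble, I am sharing it here. The article covers remedies for imbalanced data, and the code is written in Python. Once the libraries are installed you can run it right away, so please give it a try.

First, load the various libraries.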

%matplotlib inline
import numpy as np
import scipy as sp
import pandas as pd
import sklearn
import seaborn as sns
from matplotlib import  pyplot as plt

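# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
import sklearn.metrics
import sklearn.model_selection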

Fetch the CSV dataset from the web. It consists of wine quality ratings and features describing each wine.

wine_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
wine_df.head()

Create the target variable for classification.

# Quality scores of the wines
Y = wine_df.quality.values
# Drop the quality column from the features.
wine_df = wine_df.drop('quality', axis =1)
# Label 0 if quality is below 7, otherwise 1.
Y = np.asarray([1 if  i>=7 else 0 for i in Y])
wine_df.head()

# as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
X = wine_df.values

Run the random forest.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

scores =[]

# Run random forests with 1 to 40 trees.
for val in range(1,41):
    clf = RandomForestClassifier(n_estimators  =val)
    validated = cross_val_score(clf, X, Y, cv =10)
    scores.append(validated)

To see what comes back, run it once with just two trees. The output holds one score per fold of the 10-fold cross-validation, and each score is the accuracy.

# Return the result of a random forest with 2 trees
clf1 = RandomForestClassifier(n_estimators = 2)
validated = cross_val_score(clf1, X, Y, cv=10)
validated

array([ 0.8757764 ,  0.86956522,  0.8625    ,  0.84375   ,  0.8875    ,
        0.85      ,  0.8375    ,  0.85534591,  0.85534591,  0.88679245])

Visualize accuracy against the number of trees.

sns.boxplot(data=scores)
plt.xlabel('number of trees')
plt.ylabel('Classification scores')
plt.title('Classification score for number of trees')
plt.show()

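Accuracy appears to increase as trees are added.

However, accuracy is an easily misread metric. On imbalanced data, accuracy goes up even if the model only ever gets the majority class right. A classifier that can predict one class but never the other has limited use. So let's draw a horizontal line at the proportion of bad wines.

len_y = len(Y)
# Count the bad wines (label 0)
temp = [i for i in Y if i ==0]
temp_1 = temp.count(0)

# Proportion of bad wines, i.e. the accuracy of always predicting "bad"
percentage = float(temp_1)/float(len_y)

print(percentage*100)

sns.boxplot(data=scores)
plt.axhline(y = percentage, ls = '--')
plt.xlabel('number of trees')
plt.ylabel('Classification Scores')
plt.title('Classification scores for number of trees')
plt.show()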

Because bad wines make up most of the data to begin with, labeling nearly everything as bad still yields high accuracy.
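To make the point concrete, here is a small sketch with made-up labels (an assumed 86:14 split, not the actual wine data): a "classifier" that always predicts the majority class scores well on accuracy while never finding a single good wine.

```python
import numpy as np

# Hypothetical labels: 0 = bad wine (majority), 1 = good wine (minority).
y_true = np.array([0] * 86 + [1] * 14)

# A "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Fraction of all predictions that are correct.
accuracy = (y_pred == y_true).mean()
# Fraction of the good wines that were actually found (recall for class 1).
recall_good = (y_pred[y_true == 1] == 1).mean()
print(accuracy, recall_good)
```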

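So we switch to the F-measure (F1 score), a standard metric for evaluating predictive performance in machine learning.
F1 does not appear to improve as the number of trees increases.

scores = []

for val in range(1, 41):
    # Evaluate a fresh forest with val trees on each pass
    clf = RandomForestClassifier(n_estimators = val)
    validated = cross_val_score(clf, X, Y, cv=10, scoring = 'f1')
    scores.append(validated)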

sns.boxplot( data=scores)
plt.xlabel('number of trees')
plt.ylabel('F1 Scores')
plt.title('F1 scores as a function of the number of trees')
plt.show()
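For reference, the F1 score plotted here is the harmonic mean of precision and recall. Below is a minimal sketch that computes it from raw counts; the helper name `f1_from_counts` and the counts are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # share of predicted positives that are right
    recall = tp / (tp + fn)      # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for the minority (good wine) class.
print(f1_from_counts(tp=6, fp=4, fn=8))
```

Unlike accuracy, true negatives do not enter the formula at all, so correctly labeling the many bad wines cannot inflate the score.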

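Here the predicted probabilities are thresholded: a sample is labeled 1 when its predicted probability of class 1 exceeds a cutoff (0.5 in the first snippet below). The code that follows then searches for the cutoff value that works best.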

clf = RandomForestClassifier(n_estimators= 15)
clf.fit(X, Y)

(clf.predict_proba(X)[:,1] > 0.5).astype(int)

def cutoff_predict(clf, X,  cutoff):
    return (clf.predict_proba(X)[:,1] > cutoff).astype(int)

scores = []

def custom_f1(cutoff):
    def f1_cutoff(clf, X, Y):
        ypred = cutoff_predict(clf, X,  cutoff)
        return sklearn.metrics.f1_score(Y, ypred)
    
    return f1_cutoff

for cutoff in np.arange(0.1,  0.9, 0.1):
    clf = RandomForestClassifier(n_estimators=15)
    validated = cross_val_score(clf,  X, Y, cv=10, scoring=custom_f1(cutoff)) 
    scores.append(validated)

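sns.boxplot(data=scores)
# Label each box with its cutoff (recent seaborn versions do not accept names=)
plt.xticks(range(len(scores)), np.round(np.arange(0.1, 0.9, 0.1), 1))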
plt.xlabel('each cut off value')
plt.ylabel('F1 Scores')
plt.title('custom F scores')
plt.show()

It looks like box positions 2 to 4, that is, cutoff values of 0.3 to 0.5, give the best F1 scores.
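The cutoff search can also be sketched without any model, using fixed probabilities. The labels and probabilities below are invented for illustration, and `f1_at_cutoff` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def f1_at_cutoff(y_true, proba, cutoff):
    """Threshold class-1 probabilities at `cutoff` and return the F1 score."""
    y_pred = (proba > cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Hypothetical labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
proba = np.array([0.05, 0.10, 0.20, 0.35, 0.15, 0.30, 0.45, 0.40, 0.80, 0.55])

cutoffs = [0.1, 0.3, 0.5, 0.7]
scores = {c: f1_at_cutoff(y_true, proba, c) for c in cutoffs}
best_cutoff = max(scores, key=scores.get)
print(scores, best_cutoff)
```

In this toy data a low cutoff lets in too many false positives and a high one misses most positives, so the best F1 sits in between.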

Next, visualize the decision boundary. A two-dimensional plot requires choosing just two of the available features, and feature importance is used to pick them. A random forest can compute feature importance.

clf = RandomForestClassifier(n_estimators=15)
clf.fit(X, Y)

imp = clf.feature_importances_
names = wine_df.columns

imp, names = zip(*sorted(zip(imp, names)))

plt.barh(range(len(names)), imp, align='center')
plt.yticks(range(len(names)), names)

plt.xlabel('Importance of features')
plt.ylabel('Features')
plt.title('Importance of each feature')
plt.show()

from sklearn.tree import DecisionTreeClassifier
import sklearn.linear_model
import sklearn.svm

def plot_decision_surface(clf, X_train, Y_train):
    plot_step=0.1
    
    if X_train.shape[1] != 2:
        raise ValueError("X_train should have exactly 2 columns!")
        
    x_min, x_max = X_train[:, 0].min() - plot_step, X_train[:, 0].max() + plot_step
    y_min, y_max = X_train[:, 1].min() - plot_step, X_train[:, 1].max() + plot_step
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                        np.arange(y_min, y_max, plot_step))
    
    clf.fit(X_train, Y_train)
    if hasattr(clf, 'predict_proba'):
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:,1]
    else:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap = plt.cm.Reds)
    plt.scatter(X_train[:,0], X_train[:,1], c=Y_train, cmap=plt.cm.Paired)
    plt.show()

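Now visualize the decision boundary using only the two most important features. Besides the random forest, an SVM and a decision tree are also fitted.

# Column indices of the two most important features. Use the fitted model's
# importances directly, because imp was re-sorted for the bar chart above.
imp_fe = np.argsort(clf.feature_importances_)[::-1][0:2]
X_imp = X[:, imp_fe]

algorithms = [DecisionTreeClassifier(),
             RandomForestClassifier(),
             sklearn.svm.SVC(C = 100.0, gamma = 1)]

title = ['Decision Tree Classifier', 'Random Forest Classifier',
        'Support Vector Machine']

for i in range(3):
    plt.title(title[i])
    plt.xlabel('Feature1')
    plt.ylabel('Feature2')
    plot_decision_surface(algorithms[i], X_imp, Y)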
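A small sketch of how `np.argsort` maps importances back to column indices; the importance values are made up. The indices only refer to columns of `X` if the importances are still in column order, which the second half of the example illustrates.

```python
import numpy as np

# Hypothetical importances, in the same order as the columns of X.
importances = np.array([0.05, 0.30, 0.10, 0.40, 0.15])

# argsort is ascending, so reverse it and keep the first two column indices.
top2 = np.argsort(importances)[::-1][:2]
print(top2)

# Sorting the importances first loses the link to the original columns:
# argsort then just returns positions within the sorted array.
wrong = np.argsort(np.sort(importances))[::-1][:2]
print(wrong)
```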

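scikit-learn's SVM does not weight classes by default, but it can apply class weights automatically. The example below draws the decision boundary with and without class weighting, using C=1 and gamma=1. With weighting, the minority class (red) is identified comparatively well, but many blue points are misclassified. Further improvement would require parameter tuning.

# class_weight='auto' was removed from scikit-learn; 'balanced' is its replacement
svm = [sklearn.svm.SVC(C = 1.0, gamma = 1.0, class_weight=None),
      sklearn.svm.SVC(C = 1.0, gamma = 1.0, class_weight='balanced')]

title = ['Svm without class weight', 'Svm with class weight']

for i in range(2):
    plt.title(title[i])
    plt.xlabel('Feature1')
    plt.ylabel('Feature2')

    plot_decision_surface(svm[i], X_imp, Y)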
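For intuition about automatic class weighting, the sketch below computes weights the way scikit-learn documents its `class_weight='balanced'` mode, n_samples / (n_classes * count of each class). The helper name and the 9:1 labels are assumptions for illustration.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight for class k: n_samples / (n_classes * count_k)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Hypothetical labels with a 9:1 imbalance.
y = np.array([0] * 90 + [1] * 10)
print(balanced_class_weights(y))
```

The rare class gets a weight nine times larger, so each misclassified minority sample costs the SVM as much as nine majority samples.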

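This was a good opportunity to learn about approaches to imbalanced data and about machine learning in Python. KDnuggets is a great resource.

Packages I Learned About at Tokyo.R #53 and Trying Them Out

Here are some packages that caught my attention at the 53rd Tokyo.R meetup, with examples of running them. For the slides and the talks themselves, the blog post「第53回R勉強会@東京で発表してきた」covers everything in detail.

[Contents]
・ggradar package
・proxy package
・Causal inference (CBPS package)
・MXNet package
・missForest package
・RFinanceYJ package

ggradar package

A package for drawing radar charts easily. I followed this blog post as a reference.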

install.packages("devtools")
devtools::install_github("ricardo-bion/ggradar")

I borrowed data on four companies from a certain review site about workplace environments.

> CompanyVoiceData
  company growth stability salary rewarding idea difficulty welfare education
1  google    5.0       5.0    4.9       5.0  4.3        5.0     5.0       4.6
2   yahoo    3.9       5.0    3.2       3.8  3.7        3.9     3.1       3.3
3 recruit    4.4       4.8    5.0       5.0  5.0        5.0     4.0       5.0
4  amazon    5.0       5.0    4.2       4.0  4.2        5.0     3.6       3.3

Using ggradar as-is fails with a complaint that the Circular Air Light font is required, so, as the reference blog describes, on OS X install the font by double-clicking it and then restart.

Running the following code on the data above produced a radar chart very easily.

library("ggradar")
CompanyVoiceData <- data.frame(read.csv(file ="company_voice.csv",header = TRUE))

ggradar(CompanyVoiceData, 
        grid.max = max(CompanyVoiceData[, 2:ncol(CompanyVoiceData)]),
        background.circle.colour = "#ffdd99", # background colour
        background.circle.transparency = 1, # background transparency
        group.line.width = 2, # line width
        group.point.size = 6, # point (symbol) size
        axis.label.size = 5, # axis label size
        gridline.min.colour = "#4b61ba", # line colour of the innermost circle
        gridline.mid.colour = "#a87963", # line colour of the middle circle
        gridline.max.colour = "#e1e6ea", # line colour of the outermost circle
        grid.line.width = 1.5, # line width of each circle
        gridline.min.linetype = "longdash", # line type
        gridline.mid.linetype = "longdash", # line type
        gridline.max.linetype = "longdash") # line type

proxy package

A package for computing distances and similarities.
Let's compute similarity and distance on the data above.

library(proxy)
> simil(CompanyVoiceData[,-1])
          1         2         3
2 0.2286639                    
3 0.6373648 0.1339713          
4 0.6499133 0.5787506 0.4188571
> dist(CompanyVoiceData[,-1])
         1        2        3
2 3.522783                  
3 1.435270 3.401470         
4 2.269361 1.989975 2.393742

As shown, similarities and distances are easy to compute.
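The `dist` values above agree with plain Euclidean distance between the rows. As a cross-check, here is a NumPy sketch (in Python, like the examples in the earlier posts) that recomputes them from the ratings table; the array is copied by hand from the data shown earlier.

```python
import numpy as np

# Ratings copied from the table above (rows: google, yahoo, recruit, amazon).
ratings = np.array([
    [5.0, 5.0, 4.9, 5.0, 4.3, 5.0, 5.0, 4.6],
    [3.9, 5.0, 3.2, 3.8, 3.7, 3.9, 3.1, 3.3],
    [4.4, 4.8, 5.0, 5.0, 5.0, 5.0, 4.0, 5.0],
    [5.0, 5.0, 4.2, 4.0, 4.2, 5.0, 3.6, 3.3],
])

# Pairwise Euclidean distances via broadcasting: dist[i, j] compares rows i and j.
diff = ratings[:, None, :] - ratings[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 6))
```

The lower triangle of the printed matrix corresponds to the R output, for example the google and yahoo pair in row 2, column 1.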

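Causal inference

This one is less about a package; it can apparently be done with existing functions.
The blog post「調査観察データにおける因果推論(3) – Rによる傾向スコア，IPW推定量，二重にロバストな推定量の算出」explains it in detail.
・Compute propensity scores with the glm function
・Run a regression with the lm function using the propensity score as a covariate
・Compute the IPW estimator by writing the code out directly
・Write a function to obtain the standard error of the expected value
・Write a function to compute the DR (doubly robust) estimator
With these steps, the estimation itself appears to be feasible.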
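As a rough sketch of the IPW step in the list above, the snippet below computes a normalized inverse-probability-weighted mean for each group. It assumes the propensity scores are already known; the data, the helper name `ipw_mean`, and the choice of the normalized form are all assumptions for illustration, not taken from the referenced blog post.

```python
import numpy as np

def ipw_mean(y, z, e):
    """Weighted mean of outcomes among units with z == 1, using weights 1 / e."""
    w = z / e
    return np.sum(w * y) / np.sum(w)

# Hypothetical data: outcome y, treatment indicator z, propensity score e.
y = np.array([10.0, 12.0, 8.0, 9.0])
z = np.array([1, 1, 0, 0])
e = np.array([0.5, 0.25, 0.5, 0.2])

treated_mean = ipw_mean(y, z, e)           # estimate of E[Y(1)]
control_mean = ipw_mean(y, 1 - z, 1 - e)   # estimate of E[Y(0)]
print(treated_mean - control_mean)         # estimated average treatment effect
```

Units that were unlikely to receive the treatment they actually got are weighted up, which is what corrects for the imbalance in covariates between the groups.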

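That said, there is a package called CBPS (Covariate Balancing Propensity Score), which can apparently handle the causal inference calculations.

Package ‘CBPS’
It included the example code below, so I ran it, but the results took so long to come back that I got worried. I think it took more than 10 minutes to finish.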

library(CBPS)
data(Blackwell)

form1<-"d.gone.neg ~ d.gone.neg.l1 + d.gone.neg.l2 + d.neg.frac.l3 + camp.length + camp.length +
deminc + base.poll + year.2002 + year.2004 + year.2006 + base.und + office"

##Fitting the models in Imai and Ratkovic (2014)
##Warning: may take a few minutes; setting time.vary to FALSE
##Results in a quicker fit but with poorer balance
fit1 <- CBMSM(formula = form1, time=Blackwell$time,id=Blackwell$demName,data=Blackwell, type="MSM",
            iterations = NULL, twostep = TRUE, msm.variance = "full", time.vary = TRUE)
fit2 <- CBMSM(formula = form1, time=Blackwell$time,id=Blackwell$demName,data=Blackwell, type="MSM",
            iterations = NULL, twostep = TRUE, msm.variance = "approx", time.vary = TRUE)

##Assessing balance
bal1 <- balance(fit1)
bal2 <- balance(fit2)

##Effect estimation: Replicating Effect Estimates in
##Table 3 of Imai and Ratkovic (2014)
lm1 <- lm(demprcnt[time==1]~fit1$treat.hist,data=Blackwell,weights=fit1$glm.weights)
lm2 <- lm(demprcnt[time==1]~fit1$treat.hist,data=Blackwell,weights=fit1$weights)
lm3 <- lm(demprcnt[time==1]~fit1$treat.hist,data=Blackwell,weights=fit2$weights)
lm4 <- lm(demprcnt[time==1]~fit1$treat.cum,data=Blackwell,weights=fit1$glm.weights)
lm5 <- lm(demprcnt[time==1]~fit1$treat.cum,data=Blackwell,weights=fit1$weights)
lm6 <- lm(demprcnt[time==1]~fit1$treat.cum,data=Blackwell,weights=fit2$weights)

MXNet

A package from the team behind XGBoost that lets you run deep learning.

Installation instructions are given here.
Deep Learning for R

install.packages("drat", repos="https://cran.rstudio.com")
drat:::addRepo("dmlc")
install.packages("mxnet")

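Hmm, on OS X it returns an error and the library will not load. There is a blog post, "Installing mxnet for R on Yosemite", that seems written just for me, so I will try it when I find the time.

For regression with deep learning, Neural Network with MXNet in Five Minutes has the code, so it is worth trying.

It is all on the linked page, but I will reproduce the code below for reference.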

library(mxnet)

data(BostonHousing, package="mlbench")

train.ind = seq(1, 506, 3)
train.x = data.matrix(BostonHousing[train.ind, -14])
train.y = BostonHousing[train.ind, 14]
test.x = data.matrix(BostonHousing[-train.ind, -14])
test.y = BostonHousing[-train.ind, 14]

# Define the input data
data <- mx.symbol.Variable("data")
# A fully connected hidden layer
# data: input source
# num_hidden: number of neurons in this hidden layer
fc1 <- mx.symbol.FullyConnected(data, num_hidden=1)

# Use linear regression for the output layer
lro <- mx.symbol.LinearRegressionOutput(fc1)

# Custom evaluation metric: mean absolute error
demo.metric.mae <- mx.metric.custom("mae", function(label, pred) {
  res <- mean(abs(label-pred))
  return(res)
})

# Train the model first so that predict() has a model to use
mx.set.seed(0)
model <- mx.model.FeedForward.create(lro, X=train.x, y=train.y,
                                     ctx=mx.cpu(), num.round=50, array.batch.size=20,
                                     learning.rate=2e-6, momentum=0.9, eval.metric=demo.metric.mae)

# Predict on the test set and compute the RMSE
preds = predict(model, test.x)

## Auto detect layout of input matrix, use rowmajor..
sqrt(mean((preds-test.y)^2))

missForest

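A package for imputing missing values with random forests. It can apparently be applied even when the target variable has missing values.
The slides explain it better, but the program below ran for me. The slides are here: "Imputation of Missing Values using Random Forest"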

library(missForest)
library(dplyr)

# Load the dataset from ggplot2
data(diamonds, package = "ggplot2")
dia.sample <- sample_n(diamonds, size=2000)
dia.sample <- as.data.frame(dia.sample)

# Introduce 5% missing values into the dataset
dia.mis <- prodNA(dia.sample, noNA=0.05)

# Run the imputation
dia.imp <- missForest(dia.mis, verbose=TRUE)
dia.imp %>% str(max.level=1)

# Estimate imputation accuracy (out-of-bag error), then rerun per variable
dia.imp$OOBerror
dia.imp <- missForest(dia.mis, verbose=TRUE, variablewise=TRUE)

# Validate imputation accuracy against the true values
mixError(ximp = dia.imp$ximp,
         xmis = dia.mis,
         xtrue = dia.sample)

RFinanceYJ

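A package by Yohei Sato, Nobuaki Oshiro, and Shinichi Takayanagi for retrieving stock price data from Yahoo! Finance. It has apparently been around for quite a while, but this was the first time I saw someone actually use it for analysis. It seems the code has to be rewritten whenever Yahoo! Finance changes its page layout. The blog post「2015-01-20 Rでチャートを書いてみる(9)」has a working program, and I reproduce that working code below.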

library(RFinanceYJ)

#API
quoteStockTsData <- function(x, since=NULL,start.num=0,date.end=NULL,time.interval='daily')
{
  time.interval <- substr(time.interval,1,1)
  function.stock <- function(quote.table.item){
    if( xmlSize(quote.table.item) < 5) return(NULL) 
    d <- convertToDate(xmlValue(quote.table.item[[1]]),time.interval)
    o <- as.number(xmlValue(quote.table.item[[2]]))
    h <- as.number(xmlValue(quote.table.item[[3]]))
    l <- as.number(xmlValue(quote.table.item[[4]]))
    c <- as.number(xmlValue(quote.table.item[[5]]))
    v <- ifelse(xmlSize(quote.table.item) >= 6,as.number(xmlValue(quote.table.item[[6]])),0)
    a <- ifelse(xmlSize(quote.table.item) >= 7,as.number(xmlValue(quote.table.item[[7]])),0)
    return(data.frame(date=d,open=o,high=h,low=l,close=c,volume=v, adj_close=a))
  }
  return(quoteTsData(x,function.stock,since,start.num,date.end,time.interval,type="stock"))
}
quoteFundTsData <- function(x, since=NULL,start.num=0,date.end=NULL,time.interval='daily')
{
  time.interval <- substr(time.interval,1,1)
  function.fund <- function(quote.table.item){
    d <- convertToDate(xmlValue(quote.table.item[[1]]),time.interval)
    if(time.interval=='m'){
      d <- endOfMonth(d)
    }
    c <- as.number(xmlValue(quote.table.item[[2]]))
    v <- as.number(xmlValue(quote.table.item[[3]]))
    return(data.frame(date=d,constant.value=c,NAV=v))
  }
  return(quoteTsData(x,function.fund,since,start.num,date.end,time.interval,type="fund"))
}
quoteFXTsData <- function(x, since=NULL,start.num=0,date.end=NULL,time.interval='daily')
{
  time.interval <- substr(time.interval,1,1)
  function.fx <- function(quote.table.item){
    d <- convertToDate(xmlValue(quote.table.item[[1]]),time.interval)
    o <- as.number(xmlValue(quote.table.item[[2]]))
    h <- as.number(xmlValue(quote.table.item[[3]]))
    l <- as.number(xmlValue(quote.table.item[[4]]))
    c <- as.number(xmlValue(quote.table.item[[5]]))
    return(data.frame(date=d,open=o,high=h,low=l,close=c))
  }
  return(quoteTsData(x,function.fx,since,start.num,date.end,time.interval,type="fx"))
}
######  private functions  #####
#get time series data from Yahoo! Finance.
quoteTsData <- function(x,function.financialproduct,since,start.num,date.end,time.interval,type="stock"){
  r <- NULL
  result.num <- 51
  financial.data <- data.frame(NULL)
  #start <- (gsub("([0-9]{4,4})-([0-9]{2,2})-([0-9]{2,2})","&c=\\1&a=\\2&b=\\3",since))
  #end   <- (gsub("([0-9]{4,4})-([0-9]{2,2})-([0-9]{2,2})","&f=\\1&d=\\2&e=\\3",date.end))
  start <- (gsub("([0-9]{4,4})-([0-9]{2,2})-([0-9]{2,2})","&sy=\\1&sm=\\2&sd=\\3",since))
  end   <- (gsub("([0-9]{4,4})-([0-9]{2,2})-([0-9]{2,2})","&ey=\\1&em=\\2&ed=\\3",date.end))
  
  if(!any(time.interval==c('d','w','m'))) stop("Invalid time.interval value")
  
  extractQuoteTable <- function(r,type){
    if(type %in% c("fund","fx")){
      tbl <- r[[2]][[2]][[7]][[3]][[3]][[9]][[2]]
    }
    else{
      tbl <- r[[2]][[2]][[7]][[3]][[3]][[10]][[2]]
    }
    return(tbl)
  }
  
  #while( result.num >= 51 ){
  while(1){
    start.num <- start.num + 1
    quote.table <- NULL
    quote.url <- paste('http://info.finance.yahoo.co.jp/history/?code=',x,start,end,'&p=',start.num,'&tm=',substr(time.interval,1,1),sep="")
    #cat(quote.url)
    #try( r <- xmlRoot(htmlTreeParse(quote.url,error=xmlErrorCumulator(immediate=F))), TRUE)  # this raised an error when fetching
    try(r<-htmlParse(quote.url))
    if( is.null(r) ) stop(paste("Can not access :", quote.url))
    
    #try( quote.table <- r[[2]][[1]][[1]][[16]][[1]][[1]][[1]][[4]][[1]][[1]][[1]], TRUE )
    #try( quote.table <- extractQuoteTable(r,type), TRUE )
    try( quote.table <- xpathApply(r,"//table")[[2]], TRUE )
    
    quote.size<-xmlSize(quote.table)
    #cat(paste("size:",quote.size))
    if(xmlSize(quote.table)<=1){
      return (financial.data)
    }
    if( is.null(quote.table) ){
      if( is.null(financial.data) ){
        stop(paste("Can not quote :", x))
      }else{
        financial.data <- financial.data[order(financial.data$date),]
        return(financial.data)
      }
    }
    
    size <- xmlSize(quote.table)
    for(i in 2:size){
      financial.data <- rbind(financial.data,function.financialproduct(quote.table[[i]]))
    }
    
    #result.num <- xmlSize(quote.table)
    Sys.sleep(1)
  }
  financial.data <- financial.data[order(financial.data$date),]
  return(financial.data)  
}
#convert string format date to POSIXct object
convertToDate <- function(date.string,time.interval)
{
  #data format is different between monthly and daily or weekly
  if(any(time.interval==c('d','w'))){
    result <- gsub("^([0-9]{4})([^0-9]+)([0-9]{1,2})([^0-9]+)([0-9]{1,2})([^0-9]+)","\\1-\\3-\\5",date.string)
  }else if(time.interval=='m'){
    result <- gsub("^([0-9]{4})([^0-9]+)([0-9]{1,2})([^0-9]+)","\\1-\\3-01",date.string)
  }
  return(as.POSIXct(result))
}
#convert string to number.
as.number <- function(string)
{
  return(as.double(as.character(gsub("[^0-9.]", "",string))))
}
#return end of month date.
endOfMonth <- function(date.obj)
{
  startOfMonth     <- as.Date(format(date.obj,"%Y%m01"),"%Y%m%d")
  startOfNextMonth <- as.Date(format(startOfMonth+31,"%Y%m01"),"%Y%m%d")
  return(startOfNextMonth-1)
}

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

library(RFinanceYJ)

#API

quoteStockTsData <- function(x, since=NULL,start.num=0,date.end=NULL,time.interval='daily')

{

time.interval <- substr(time.interval,1,1)

function.stock <- function(quote.table.item){

if( xmlSize(quote.table.item) < 5) return(NULL)

d <- convertToDate(xmlValue(quote.table.item[[1]]),time.interval)

o <- as.number(xmlValue(quote.table.item[[2]]))

h <- as.number(xmlValue(quote.table.item[[3]]))

l <- as.number(xmlValue(quote.table.item[[4]]))

c <- as.number(xmlValue(quote.table.item[[5]]))

v <- ifelse(xmlSize(quote.table.item) >= 6,as.number(xmlValue(quote.table.item[[6]])),0)

a <- ifelse(xmlSize(quote.table.item) >= 7,as.number(xmlValue(quote.table.item[[7]])),0)

return(data.frame(date=d,open=o,high=h,low=l,close=c,volume=v, adj_close=a))

}

return(quoteTsData(x,function.stock,since,start.num,date.end,time.interval,type="stock"))

}

quoteFundTsData <- function(x, since=NULL,start.num=0,date.end=NULL,time.interval='daily')

{

time.interval <- substr(time.interval,1,1)

function.fund <- function(quote.table.item){

d <- convertToDate(xmlValue(quote.table.item[[1]]),time.interval)

if(time.interval=='monthly'){

d <- endOfMonth(d)

}

c <- as.number(xmlValue(quote.table.item[[2]]))

v <- as.number(xmlValue(quote.table.item[[3]]))

return(data.frame(date=d,constant.value=c,NAV=v))

}

return(quoteTsData(x,function.fund,since,start.num,date.end,time.interval,type="fund"))

}

quoteFXTsData <- function(x, since=NULL,start.num=0,date.end=NULL,time.interval='daily')

{

time.interval <- substr(time.interval,1,1)

function.fx <- function(quote.table.item){

d <- convertToDate(xmlValue(quote.table.item[[1]]),time.interval)

o <- as.number(xmlValue(quote.table.item[[2]]))

h <- as.number(xmlValue(quote.table.item[[3]]))

l <- as.number(xmlValue(quote.table.item[[4]]))

c <- as.number(xmlValue(quote.table.item[[5]]))

return(data.frame(date=d,open=o,high=h,low=l,close=c))

}

return(quoteTsData(x,function.fx,since,start.num,date.end,time.interval,type="fx"))

}

###### private functions #####

#get time series data from Yahoo! Finance.

quoteTsData <- function(x,function.financialproduct,since,start.num,date.end,time.interval,type="stock"){

r <- NULL

result.num <- 51

financial.data <- data.frame(NULL)

#start <- (gsub("([0-9]{4,4})-([0-9]{2,2})-([0-9]{2,2})","&c=\\1&a=\\2&b=\\3",since))

#end <- (gsub("([0-9]{4,4})-([0-9]{2,2})-([0-9]{2,2})","&f=\\1&d=\\2&e=\\3",date.end))

start <- (gsub("([0-9]{4,4})-([0-9]{2,2})-([0-9]{2,2})","&sy=\\1&sm=\\2&sd=\\3",since))

end <- (gsub("([0-9]{4,4})-([0-9]{2,2})-([0-9]{2,2})","&ey=\\1&em=\\2&ed=\\3",date.end))

if(!any(time.interval==c('d','w','m'))) stop("Invalid time.interval value")

extractQuoteTable <- function(r,type){

if(type %in% c("fund","fx")){

tbl <- r[[2]][[2]][[7]][[3]][[3]][[9]][[2]]

}

else{

tbl <- r[[2]][[2]][[7]][[3]][[3]][[10]][[2]]

}

return(tbl)

}

#while( result.num >= 51 ){

while(1){

start.num <- start.num + 1

quote.table <- NULL

quote.url <- paste('http://info.finance.yahoo.co.jp/history/?code=',x,start,end,'&p=',start.num,'&tm=',substr(time.interval,1,1),sep="")

#cat(quote.url)

#try( r <- xmlRoot(htmlTreeParse(quote.url,error=xmlErrorCumulator(immediate=F))), TRUE) # これだと取得時にエラーが出た。。

try(r<-htmlParse(quote.url))

if( is.null(r) ) stop(paste("Can not access :", quote.url))

#try( quote.table <- r[[2]][[1]][[1]][[16]][[1]][[1]][[1]][[4]][[1]][[1]][[1]], TRUE )

#try( quote.table <- extractQuoteTable(r,type), TRUE )

try( quote.table <- xpathApply(r,"//table")[[2]], TRUE )

quote.size<-xmlSize(quote.table)

#cat(paste("size:",quote.size))

if(xmlSize(quote.table)<=1){

return (financial.data)

}

if( is.null(quote.table) ){

if( is.null(financial.data) ){

stop(paste("Can not quote :", x))

}else{

financial.data <- financial.data[order(financial.data$date),]

return(financial.data)

}

}

size <- xmlSize(quote.table)

for(i in 2:size){

financial.data <- rbind(financial.data,function.financialproduct(quote.table[[i]]))

}

#result.num <- xmlSize(quote.table)

Sys.sleep(1)

}

financial.data <- financial.data[order(financial.data$date),]

return(financial.data)

}

#convert string format date to POSIXct object

convertToDate <- function(date.string,time.interval)

{

#date format differs between monthly data and daily or weekly data

if(any(time.interval==c('d','w'))){

result <- gsub("^([0-9]{4})([^0-9]+)([0-9]{1,2})([^0-9]+)([0-9]{1,2})([^0-9]+)","\\1-\\3-\\5",date.string)

}else if(time.interval=='m'){

result <- gsub("^([0-9]{4})([^0-9]+)([0-9]{1,2})([^0-9]+)","\\1-\\3-01",date.string)

}

return(as.POSIXct(result))

}

#convert string to number.

as.number <- function(string)

{

return(as.double(as.character(gsub("[^0-9.]", "",string))))

}

#return end of month date.

endOfMonth <- function(date.obj)

{

startOfMonth <- as.Date(format(date.obj,"%Y%m01"),"%Y%m%d")

startOfNextMonth <- as.Date(format(startOfMonth+31,"%Y%m01"),"%Y%m%d")

return(startOfNextMonth-1)

}
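The three helper functions at the end of the listing can be checked without any network access. The following is a minimal sketch that copies their definitions from the listing above (only indentation is added) so that it runs on its own. The sample input strings are assumptions about what the page cells look like, not data captured from Yahoo! Finance.

```r
# Helper definitions copied from the listing above so this snippet is self-contained.
convertToDate <- function(date.string, time.interval) {
  if (any(time.interval == c('d', 'w'))) {
    # daily / weekly: year, month and day, each followed by a non-digit separator
    result <- gsub("^([0-9]{4})([^0-9]+)([0-9]{1,2})([^0-9]+)([0-9]{1,2})([^0-9]+)",
                   "\\1-\\3-\\5", date.string)
  } else if (time.interval == 'm') {
    # monthly: year and month only; the day is fixed to 01
    result <- gsub("^([0-9]{4})([^0-9]+)([0-9]{1,2})([^0-9]+)",
                   "\\1-\\3-01", date.string)
  }
  return(as.POSIXct(result))
}

as.number <- function(string) {
  return(as.double(as.character(gsub("[^0-9.]", "", string))))
}

endOfMonth <- function(date.obj) {
  startOfMonth <- as.Date(format(date.obj, "%Y%m01"), "%Y%m%d")
  startOfNextMonth <- as.Date(format(startOfMonth + 31, "%Y%m01"), "%Y%m%d")
  return(startOfNextMonth - 1)
}

# Hypothetical sample inputs (assumed cell formats, not real scraped values).
daily.date   <- convertToDate("2016年5月2日", "d")
monthly.date <- convertToDate("2016年5月", "m")
volume       <- as.number("18,498,100")
month.end    <- endOfMonth(as.Date("2016-02-10"))

print(format(daily.date, "%Y-%m-%d"))    # "2016-05-02"
print(format(monthly.date, "%Y-%m-%d"))  # "2016-05-01"
print(volume)                            # 18498100
print(month.end)                         # "2016-02-29"
```

One thing this makes visible: `as.number` keeps only digits and the decimal point, so a minus sign in the input would be dropped along with the thousands separators.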

Let's use this code to look at the stock price of Yahoo! Japan, whose ticker code is 4689. The data appears to be retrieved correctly.

> quoteStockTsData("4689.t",since="2016-01-01")
         date open high low close   volume adj_close
1  2016-05-02  476  483 475   478 18498100       478
2  2016-04-28  504  508 493   496 11966300       496
3  2016-04-27  505  511 495   497 12973800       497
4  2016-04-26  507  508 495   500  7712600       500
5  2016-04-25  513  515 506   509  7350600       509
6  2016-04-22  515  517 509   514  8908900       514
7  2016-04-21  512  517 506   514 13249900       514
8  2016-04-20  511  515 493   506 14455700       506
9  2016-04-19  516  523 511   516 13345800       516
10 2016-04-18  503  509 499   503 10275900       503
11 2016-04-15  504  519 504   513 16962900       513
