Ensemble Learning (Introduction)

If you aggregate the predictions of a group of predictors (such as classifiers or regressors), the aggregated prediction is often better than that of the best individual predictor. Such a group of predictors is called an ensemble; the technique is called ensemble learning, and an ensemble learning algorithm is called an ensemble method.

  • Voting
  • Bagging
    • Random Forest
  • Boosting
    • AdaBoost
    • Boosting trees (GBDT)
    • XGBoost
  • Stacking

Voting

1. Hard voting classifier

(Figure: hard voting)

Each classifier casts a single vote for one class (similar to a hard label), and the ensemble predicts the majority class.

2. Soft voting classifier
If every classifier can estimate class probabilities, average the probabilities across classifiers and predict the class with the highest average probability (similar to a soft label). This is called soft voting. It usually performs better than hard voting because it gives more weight to highly confident votes.
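
To make the two schemes concrete, here is a minimal hand-rolled sketch (illustrative data and classifiers; this is not how VotingClassifier is implemented internally):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
clfs = [LogisticRegression(), DecisionTreeClassifier(), SVC(probability=True)]
for clf in clfs:
    clf.fit(X, y)

# hard voting by hand: majority vote over the predicted labels
votes = np.stack([clf.predict(X) for clf in clfs])   # shape (n_classifiers, n_samples)
y_hard = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

# soft voting by hand: average the class-probability matrices, then take the argmax
proba = np.mean([clf.predict_proba(X) for clf in clfs], axis=0)
y_soft = proba.argmax(axis=1)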

Hard voting (a simple scikit-learn implementation):

from sklearn.ensemble import VotingClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Use the moons dataset from sklearn
X,y = make_moons(n_samples=7000,noise=0.1)
plt.scatter(X[:,0],X[:,1])

(Output: scatter plot of the moons dataset)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.25, random_state=42)
# Define three base classifiers:
# - logistic regression
# - decision tree
# - SVM
lr = LogisticRegression()
dt = DecisionTreeClassifier()
svm = SVC()
# Define the voting classifier
voting = VotingClassifier(
    estimators=[('lr',lr),('dt',dt),('svm',svm)],
    voting='hard'
)
# Print the accuracy of each classifier
for clf in (lr,dt,svm,voting):
    clf.fit(X_train,y_train)
    y_hat = clf.predict(X_test)
    print(clf.__class__.__name__,'=',accuracy_score(y_test,y_hat))

Output:

LogisticRegression = 0.8862857142857142
DecisionTreeClassifier = 0.996
SVC = 0.9988571428571429
VotingClassifier = 0.9977142857142857

Soft voting (only slightly different from the code above):

# Loading the data is the same as above
# Define the three base classifiers
lr = LogisticRegression()
dt = DecisionTreeClassifier()
svm = SVC(probability=True) # probability=True so that predicted probabilities are passed to the VotingClassifier
# Define the voting classifier
voting = VotingClassifier(
    estimators=[('lr',lr),('rf',dt),('svc',svm)],
    voting='soft' # use soft voting
)
# Print the accuracy of each classifier
for clf in (lr,dt,svm,voting):
    clf.fit(X_train,y_train)
    y_hat = clf.predict(X_test)
    print(clf.__class__.__name__,'=',accuracy_score(y_test,y_hat))

Output:

LogisticRegression = 0.88
DecisionTreeClassifier = 0.9994285714285714
SVC = 0.9994285714285714
VotingClassifier = 0.9994285714285714

Bagging (parallel)

Bagging and pasting
Every predictor uses the same training algorithm but is trained on a different random subset of the training set. When the sampling is performed with replacement, the method is called bagging (short for bootstrap aggregating); when the sampling is performed without replacement, it is called pasting.
(Bagging is more common in practice, so we focus on bagging here.)
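
The only difference between the two lies in how the random subsets are drawn; a tiny sketch with illustrative numbers:

import numpy as np

rng = np.random.default_rng(0)
n = 10  # pretend the training set has 10 samples

bagging_idx = rng.choice(n, size=n, replace=True)   # bagging: sample with replacement (duplicates possible)
pasting_idx = rng.choice(n, size=6, replace=False)  # pasting: sample without replacement (no duplicates)

print("bagging indices:", np.sort(bagging_idx))
print("pasting indices:", np.sort(pasting_idx))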

Random Forest

A random forest is an ensemble of decision trees, usually trained via bagging (sometimes pasting). Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can simply use the RandomForestClassifier class.

Roughly 63.2% of the samples are drawn (with replacement) and used to train each predictor; the remaining ~36.8% that are never drawn are the out-of-bag (oob) samples and can serve as a validation set.
Proof: with N draws from N samples, the probability that a given sample is never drawn is $(1-\frac{1}{N})^N=[(1+\frac{1}{-N})^{-N}]^{-1}$, which tends to $e^{-1}\approx 36.8\%$ as N goes to infinity; hence the probability of being drawn at least once tends to $1-e^{-1}\approx 63.2\%$.
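
A quick numerical check of the 63.2% figure (a sketch; the sample size is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
# draw N bootstrap indices from N samples and count how many distinct samples are hit
frac_drawn = len(np.unique(rng.integers(0, N, size=N))) / N
print("fraction drawn at least once:", frac_drawn)      # ~0.632
print("theoretical limit 1 - 1/e   :", 1 - np.exp(-1))  # 0.6321...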

Code:


from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
# bootstrap=True gives bagging; bootstrap=False gives pasting
# max_samples as an integer is the number of samples to draw; as a float it means max_samples * X.shape[0]
# oob_score=True evaluates the model on the out-of-bag samples
# n_estimators is the number of base classifiers
bag_clf = BaggingClassifier(
    SVC(),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1 
    ,oob_score=True
)
bag_clf.fit(X,y)
# When oob_score is left at its default (False), evaluate like this instead:
# y_hat = bag_clf.predict(X)
# print(bag_clf.__class__.__name__,'=',accuracy_score(y,y_hat))
# BaggingClassifier = 0.9733333333333334
print(bag_clf.oob_score_)

Output:

0.9666666666666667

If the base classifier is a decision tree, the resulting bagging ensemble is essentially a random forest:

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1
)
bag_clf.fit(X,y)
y_hat = bag_clf.predict(X)
print(bag_clf.__class__.__name__,'=',accuracy_score(y,y_hat))

Output:

BaggingClassifier = 1.0

Scikit-learn also provides an API that implements random forests directly:

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(X, y)
y_hat = rnd_clf.predict(X)
print(rnd_clf.__class__.__name__, '=', accuracy_score(y, y_hat))

Output:

RandomForestClassifier = 1.0

Boosting (sequential)

Boosting refers to any ensemble method that combines several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each one trying to correct its predecessor.

  • AdaBoost
  • GBDT
  • XGBoost

AdaBoost: the basic idea

  • AdaBoost starts from a weak learning algorithm, repeatedly learns a sequence of weak classifiers (also called base classifiers), and then combines them into a strong classifier.

  • Two questions have to be answered:

    1. How to change the weights (or the probability distribution) of the training data in each round: increase the weights of the samples misclassified by the previous round and decrease the weights of the correctly classified ones.
    2. How to combine the weak classifiers into a strong classifier: give larger weights to weak classifiers with a small classification error and smaller weights to those with a large error.

AdaBoost: the algorithm

(1) Initialize the weight distribution over the training data: $D_1=(w_{11},\ldots,w_{1i},\ldots,w_{1m}),\quad w_{1i}=\frac{1}{m}$
(2) For each round $t=1,2,\ldots,T$:
  (a) Train a base classifier $G_t(x)$ on the training data weighted by $D_t$.
  (b) Compute the classification error rate of $G_t(x)$ on the training set: $e_t=\sum_{i=1}^{m} w_{ti}\,I\big(G_t(x^{(i)})\neq y^{(i)}\big)$
  (c) Compute the weight coefficient of $G_t(x)$: $\alpha_t = \frac{1}{2}\log{\frac{1-e_t}{e_t}}$
  (d) Update the weight distribution $D_{t+1}=(w_{t+1,1},\ldots,w_{t+1,i},\ldots,w_{t+1,m})$, where $w_{t+1,i}=\frac{w_{ti}}{Z_t}\exp\big(-\alpha_t y^{(i)} G_t(x^{(i)})\big)$ and $Z_t=\sum_{i=1}^{m}w_{ti}\exp\big(-\alpha_t y^{(i)} G_t(x^{(i)})\big)$ is a normalization factor.
(3) Combine the base classifiers linearly: $f(x)=\sum_{t=1}^{T} \alpha_t G_t(x)$, and the final classifier is $G(x)=\operatorname{sign}(f(x))=\operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t G_t(x)\big)$
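
To connect the formulas to code, here is a minimal from-scratch sketch of the update rules above, using decision stumps as base classifiers and assuming labels in {-1, +1} (illustrative only; in practice use sklearn's AdaBoostClassifier as below):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

X, y01 = make_moons(n_samples=500, noise=0.2, random_state=0)
y = 2 * y01 - 1                      # convert {0,1} labels to {-1,+1}
m = len(y)

w = np.full(m, 1.0 / m)              # (1) initial weight distribution D_1
stumps, alphas = [], []
for t in range(50):                  # (2) T rounds
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)                    # (a) train G_t under D_t
    pred = stump.predict(X)
    err = np.sum(w * (pred != y))                       # (b) weighted error e_t
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # (c) coefficient alpha_t
    w = w * np.exp(-alpha * y * pred)                   # (d) reweight the samples
    w /= w.sum()                                        #     normalize by Z_t
    stumps.append(stump)
    alphas.append(alpha)

# (3) final classifier: sign of the weighted sum of base classifiers
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))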

AdaBoost code (scikit-learn implementation)

Code (for parameter details see the scikit-learn AdaBoostClassifier documentation):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
data = np.loadtxt('data/wine.data',delimiter=',')
X = data[:,1:]
y = data[:,0:1]

# Normalization makes little difference for decision trees, but helps models such as logistic regression:
# from sklearn.preprocessing import StandardScaler
# X = StandardScaler().fit_transform(X)
# from sklearn.linear_model import LogisticRegression
# rf = LogisticRegression()

X_train,X_test,y_train,y_test = train_test_split(X,y.ravel(),train_size=0.8,random_state=0)
# Define the weak (base) classifier
rf = DecisionTreeClassifier()
# Define the AdaBoost classifier
model = AdaBoostClassifier(base_estimator=rf,n_estimators=50,algorithm="SAMME.R", learning_rate=0.5)
model.fit(X_train,y_train)
#AdaBoostClassifier(algorithm='SAMME.R',
#                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
#                                                         class_weight=None,
#                                                         criterion='gini',
#                                                         max_depth=None,
#                                                         max_features=None,
#                                                         max_leaf_nodes=None,
#                                                         min_impurity_decrease=0.0,
#                                                         min_impurity_split=None,
#                                                         min_samples_leaf=1,
#                                                         min_samples_split=2,
#                                                         min_weight_fraction_leaf=0.0,
#                                                         presort='deprecated',
#                                                         random_state=None,
#                                                         splitter='best'),
#                   learning_rate=0.5, n_estimators=50, random_state=None)
y_train_hat = model.predict(X_train)
print("train accuarcy:",accuracy_score(y_train,y_train_hat))
y_test_hat = model.predict(X_test)
print("test accuarcy:", accuracy_score(y_test, y_test_hat))

Output:

train accuracy: 1.0
test accuracy: 0.9722222222222222

Gradient Boosting (boosting trees)

Like AdaBoost, gradient boosting adds predictors to the ensemble one by one, each correcting its predecessor. The difference is that instead of tweaking the instance weights at every iteration as AdaBoost does, it fits each new predictor to the residual errors made by its predecessor.

In a boosting tree, each new tree fits the residual of the accumulated predictions of all previous trees; the residual is the difference between the true value and the current prediction. For example, suppose A's true age is 18 and the first tree predicts 12, leaving a residual of 6. The second tree then treats 6 as A's target; if it puts A in a leaf predicting 6, the sum of the two trees recovers A's true age. If it predicts 5 instead, a residual of 1 remains, so the third tree treats 1 as A's target, and so on. The final prediction is the sum of the outputs of all trees.
Boosting tree algorithm
(1) Initialize $f_0(x)=0$
(2) For $m=1,2,\ldots,M$:
  (a) Compute the residuals $r_{mi}=y_i-f_{m-1}(x_i)$
  (b) Fit a regression tree $T(x;\Theta_m)$ to the residuals $r_{mi}$
  (c) Update $f_m(x)=f_{m-1}(x)+T(x;\Theta_m)$
(3) The resulting boosting tree is $f_M(x)=\sum_{m=1}^{M}T(x;\Theta_m)$

Code (regression on a one-dimensional dataset):

from sklearn.tree import DecisionTreeRegressor
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
import matplotlib.pyplot as plt
def loaddata():
    data = np.loadtxt('data/data.txt',delimiter=',')
    n = data.shape[1]-1 # number of features
    X = data[:,0:n]
    y = data[:,-1].reshape(-1,1)
    return X,y
X,y = loaddata()
plt.scatter(X,y)

(Output: scatter plot of the data)

1) Manual implementation following the algorithm above (using DecisionTreeRegressor)

# 1. Define the first tree (max depth 5) and train it
tree_reg1 = DecisionTreeRegressor(max_depth=5)
tree_reg1.fit(X, y)
#DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
#                      max_leaf_nodes=None, min_impurity_decrease=0.0,
#                      min_impurity_split=None, min_samples_leaf=1,
#                      min_samples_split=2, min_weight_fraction_leaf=0.0,
#                      presort=False, random_state=None, splitter='best')
# 2. Compute the residuals and use them as the target to train a second tree (max depth 5)
y2 = y - tree_reg1.predict(X).reshape(-1,1)
tree_reg2 = DecisionTreeRegressor(max_depth=5)
tree_reg2.fit(X, y2)
# 3. Compute the residuals again and use them as the target to train a third tree (max depth 5)
y3 = y2 - tree_reg2.predict(X).reshape(-1,1)
tree_reg3 = DecisionTreeRegressor(max_depth=5)
tree_reg3.fit(X, y3)
# 4. Test
# Take the first 5 training samples and predict on them
X_new = X[0:5,]
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
print(y_pred)

Output:

[17.61560196  9.15380196 12.831       4.57199973  6.68971688]

2) Using sklearn's GradientBoostingRegressor directly

gbrt = GradientBoostingRegressor(max_depth=5, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)
print(gbrt.predict(X_new))

Output:

[17.61560196  9.15380196 12.831       4.57199973  6.68971688]

Gradient Boosting Decision Tree (GBDT)

The example above used the squared loss $L(y, f_t(x))=\frac{1}{2}(y-f_t(x))^2$,
whose negative gradient is $-\frac{\partial L(y,f_t(x))}{\partial f_t(x)}=y-f_t(x)$, i.e. exactly the residual.

  • With squared loss, each new tree fits the residuals.
  • With other losses, each new tree fits the negative gradient (the pseudo-residual).

GBDT: the key idea is to use the negative gradient of the loss function as the value that the next tree in the boosting procedure is fitted to.
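
A small sketch of the distinction (toy numbers): for the squared loss the negative gradient is exactly the residual, while for the absolute loss it is only the sign of the residual, and that is what the next tree would be fitted to:

import numpy as np

y = np.array([18.0, 25.0, 30.0])     # true targets (toy values)
f = np.array([12.0, 27.0, 30.0])     # current ensemble predictions

# squared loss L = 1/2 (y - f)^2  ->  negative gradient = y - f (the residual)
neg_grad_squared = y - f

# absolute loss L = |y - f|       ->  negative gradient = sign(y - f)
neg_grad_absolute = np.sign(y - f)

print(neg_grad_squared)   # [ 6. -2.  0.]
print(neg_grad_absolute)  # [ 1. -1.  0.]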

XGBoost

q denotes the tree structure: it maps each sample x to a leaf node.
T is the number of leaf nodes.
w is the vector of leaf weights (i.e. the values predicted at the leaves).
$f_k$ denotes the k-th tree, whose structure is given by q and whose outputs are given by w, so that $f_k(x)=w_{q(x)}$.
The prediction of the ensemble is $\hat{y}^{(i)}=\sum_{k=1}^{K} f_k(x^{(i)})$,
or, written additively round by round, $\hat{y}_t^{(i)}=\hat{y}_{t-1}^{(i)}+f_t(x^{(i)})$.

Loss function of XGBoost

If the loss is the squared loss, the initial prediction can be taken as the mean of all target values.

The objective at round t can therefore be written as
$Obj^{(t)}=\sum_{i=1}^{m} l\big(y^{(i)},\,\hat{y}_{t-1}^{(i)}+f_t(x^{(i)})\big)+\Omega(f_t)+C,\qquad \Omega(f_t)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2$

Here $\Omega(f_t)$ is the regularization term and C collects the constant terms.

Example:

For a tree with $T=3$ leaves and leaf weights $w=(2,\ 0.1,\ -1)$, the regularization term is $\Omega=\gamma \times 3 + \frac{1}{2} \lambda (4+0.01 + 1)$.
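
A one-line check of that number with illustrative values of the hyperparameters (γ and λ both set to 1 here):

import numpy as np

gamma, lam = 1.0, 1.0                 # illustrative regularization hyperparameters
w = np.array([2.0, 0.1, -1.0])        # leaf weights of the example tree (T = 3 leaves)

omega = gamma * len(w) + 0.5 * lam * np.sum(w ** 2)
print(omega)                          # gamma*3 + 0.5*lambda*(4 + 0.01 + 1) = 5.505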

XGBoost solution

The objective at round t is
$J(f_t)=\sum_{i=1}^{m} l\big(y^{(i)},\,\hat{y}_{t-1}^{(i)}+f_t(x^{(i)})\big)+\Omega(f_t)+C$

By Taylor's formula, $f(x+\Delta x)\approx f(x)+f'(x)\,\Delta x+\frac{1}{2}f''(x)\,\Delta x^2$.

Let
$g_i=\frac{\partial\, l\big(y^{(i)},\hat{y}_{t-1}^{(i)}\big)}{\partial\, \hat{y}_{t-1}^{(i)}},\qquad h_i=\frac{\partial^2 l\big(y^{(i)},\hat{y}_{t-1}^{(i)}\big)}{\partial\, \big(\hat{y}_{t-1}^{(i)}\big)^2}$

which gives
$J(f_t)\approx\sum_{i=1}^{m}\Big[l\big(y^{(i)},\hat{y}_{t-1}^{(i)}\big)+g_i f_t(x^{(i)})+\frac{1}{2}h_i f_t^2(x^{(i)})\Big]+\Omega(f_t)+C$

Dropping the terms that do not depend on $f_t$, writing $f_t(x)=w_{q(x)}$, and regrouping the sum over samples into a sum over leaves (with $I_j$ the set of samples that fall into leaf j):
$J(f_t)=\sum_{j=1}^{T}\Big[\big(\textstyle\sum_{i\in I_j}g_i\big)w_j+\frac{1}{2}\big(\sum_{i\in I_j}h_i+\lambda\big)w_j^2\Big]+\gamma T$

Let
$G_j=\sum_{i\in I_j}g_i,\qquad H_j=\sum_{i\in I_j}h_i$

so that
$J(f_t)=\sum_{j=1}^{T}\Big[G_j w_j+\frac{1}{2}(H_j+\lambda)w_j^2\Big]+\gamma T$

Setting the partial derivative with respect to each $w_j$ to zero gives the optimal leaf weight
$w_j^{*}=-\frac{G_j}{H_j+\lambda}$

Substituting $w_j^{*}$ back into the objective $J(f_t)$ gives the structure score
$J(f_t)=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$
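
A tiny sketch that plugs made-up per-leaf statistics $G_j$, $H_j$ into these two formulas (illustrative values only):

import numpy as np

lam, gamma = 1.0, 1.0
# toy first/second-order gradient statistics per leaf (G_j, H_j); values are made up
G = np.array([5.0, -3.0, 1.5])
H = np.array([4.0, 2.0, 1.0])

w_opt = -G / (H + lam)                                        # optimal leaf weights w_j*
score = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(G)    # J(f_t) at the optimum
print(w_opt, score)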

Growing the tree structure in XGBoost

  • Building the tree structure: enumerate the candidate split points and choose the split with the largest gain, $\mathrm{Gain}=\frac{1}{2}\Big[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\Big]-\gamma$ (see the sketch below).

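A sketch of evaluating one candidate split with the gain formula above (the gradient statistics are made up):

lam, gamma = 1.0, 1.0

def structure_score(G, H):
    return G ** 2 / (H + lam)

# made-up gradient statistics for the samples falling into the left/right child
G_L, H_L = 3.0, 2.5
G_R, H_R = -4.0, 3.0

gain = 0.5 * (structure_score(G_L, H_L)
              + structure_score(G_R, H_R)
              - structure_score(G_L + G_R, H_L + H_R)) - gamma
print(gain)   # split only if the gain is positive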

Code 1

xgboost is not bundled with scikit-learn; install it first with the command: pip install xgboost

Reading the data

XGBoost can read data in libsvm format, which is optimized for sparse features. An example:
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 2:1.2 1212:21 7777:2
Each line is one sample: the leading 0/1 is the label, followed by feature_index:value pairs; any feature not listed is 0.
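
A quick sketch of what such lines become when loaded, using sklearn's load_svmlight_file on an in-memory copy of the example above:

from io import BytesIO
from sklearn.datasets import load_svmlight_file

libsvm_lines = b"""1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 2:1.2 1212:21 7777:2
"""
X, y = load_svmlight_file(BytesIO(libsvm_lines))
print(X.shape)   # a sparse matrix with 3 rows; every feature not listed is 0
print(y)         # [1. 0. 0.]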

As a running example we predict whether a mushroom is poisonous. The dataset comes from http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/ ; each mushroom has 22 attributes, which after preprocessing are expanded into 126 features and saved in libsvm format, with the label indicating whether the mushroom is poisonous.

import xgboost as xgb
data_train = xgb.DMatrix('data/agaricus.txt.train')
data_test = xgb.DMatrix('data/agaricus.txt.test')
Setting the parameters
  • eta: essentially the learning rate (learning_rate); typical values are 0.01-0.2.

  • gamma: a node is split only if the split reduces the loss by at least gamma. The larger gamma is, the more conservative the algorithm and the less likely it is to overfit, though possibly at some cost in accuracy, so it needs to be balanced.

  • objective

    • reg:linear: linear regression
    • reg:logistic: logistic regression
    • binary:logistic: logistic regression for binary classification, returns the predicted probability
    • binary:logitraw: logistic regression for binary classification, returns the raw score before the logistic transformation
    • multi:softmax: multi-class classification with softmax; num_class (the number of classes) must be set
    • multi:softprob: like softmax, but returns the probability of each class for every sample
    • rank:pairwise: ranking with XGBoost by minimizing the pairwise loss (a learning-to-rank approach)
  • max_depth: maximum depth of each tree

  • verbosity: 0 (silent), 1 (warning), 2 (info), 3 (debug)

    More parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

param = {'max_depth': 3, 'eta': 0.3, 'objective': 'binary:logistic'}
watchlist = [(data_test, 'eval'), (data_train, 'train')]
n_round = 6
model = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist)
Computing the accuracy
y_hat = model.predict(data_test)
y_pred = y_hat.copy()
y_pred[y_hat>=0.5]=1
y_pred[y_hat<0.5]=0
y = data_test.get_label()
from sklearn.metrics import accuracy_score
print('accuracy_score=',accuracy_score(y,y_pred))

Output:

accuracy_score= 1.0
Viewing trees [0-5]
from matplotlib import pyplot
import graphviz
xgb.to_graphviz(model, num_trees=1) # the second tree, for example (num_trees is zero-based)

Code 2

(the same task in a different style, using the sklearn-compatible API)

import xgboost as xgb
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Read the data (convert the libsvm format into the usual 2-D array form)
from sklearn.datasets import load_svmlight_file
#data_train = xgb.DMatrix('data/agaricus.txt.train')
#data_test = xgb.DMatrix('data/agaricus.txt.test')
X_train,y_train = load_svmlight_file('data/agaricus.txt.train')
X_test,y_test = load_svmlight_file('data/agaricus.txt.test')
# Convert the sparse matrix into a dense array
X_train.toarray().shape
# (6513, 126)
# Set the parameters
model =xgb.XGBClassifier(max_depth=2, learning_rate=1, n_estimators=6, objective='binary:logistic')

model.fit(X_train, y_train)
# Compute the accuracy
# Accuracy on the training set
train_preds = model.predict(X_train)
train_predictions = [round(value) for value in train_preds]

train_accuracy = accuracy_score(y_train, train_predictions)
print ("Train Accuary: %.2f%%" % (train_accuracy * 100.0))
# Accuracy on the test set
# make prediction
preds = model.predict(X_test)
predictions = [round(value) for value in preds]

test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

Output:

Train Accuracy: 99.88%
Test Accuracy: 100.00%
Searching for the best parameters with GridSearchCV
from sklearn.model_selection import GridSearchCV
model = xgb.XGBClassifier(learning_rate=0.1, objective='binary:logistic')
param_grid = {
 'n_estimators': range(1, 51, 1),
 'max_depth':range(1,10,1)
}
clf = GridSearchCV(model, param_grid, scoring="accuracy", cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)

Output:

{'max_depth': 2, 'n_estimators': 30} 0.9841860859908541
Early stopping

Hold out a validation set; if the error on the validation set stops improving during the boosting iterations, training is stopped early.

from sklearn.model_selection import train_test_split
X_train_part, X_validate, y_train_part, y_validate = train_test_split(X_train, y_train, test_size=0.3,random_state=0)
# Number of boosting rounds
num_round = 100

bst =xgb.XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round, objective='binary:logistic')

eval_set =[(X_validate, y_validate)]
# early_stopping_rounds: stop if the validation error has not improved for that many consecutive rounds.
bst.fit(X_train_part, y_train_part, early_stopping_rounds=10, eval_metric="error", eval_set=eval_set, verbose=True)

Output:

[0]	validation_0-error:0.04862
Will train until validation_0-error hasn't improved in 10 rounds.
[1]	validation_0-error:0.04299
[2]	validation_0-error:0.04862
[3]	validation_0-error:0.04299
[4]	validation_0-error:0.04862
[5]	validation_0-error:0.04862
[6]	validation_0-error:0.04299
[7]	validation_0-error:0.04299
[8]	validation_0-error:0.04299
[9]	validation_0-error:0.04299
[10]	validation_0-error:0.04299
[11]	validation_0-error:0.02405
[12]	validation_0-error:0.02968
[13]	validation_0-error:0.01945
[14]	validation_0-error:0.01945
[15]	validation_0-error:0.01945
[16]	validation_0-error:0.01945
[17]	validation_0-error:0.01945
[18]	validation_0-error:0.01945
[19]	validation_0-error:0.01945
[20]	validation_0-error:0.01945
[21]	validation_0-error:0.02661
[22]	validation_0-error:0.02354
[23]	validation_0-error:0.02354
Stopping. Best iteration:
[13]	validation_0-error:0.01945

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

Visualize the error rate for a more intuitive picture:

results = bst.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot classification error
plt.plot(x_axis, results['validation_0']['error'], label='Test')

plt.ylabel('Error')
plt.xlabel('Round')
plt.title('XGBoost Early Stop')
plt.show()

(Output: plot of the validation error per boosting round)

Visualizing the learning curves

Training

# Number of boosting rounds
num_round = 100

# no early stopping this time
bst =xgb.XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round, silent=True, objective='binary:logistic')
 
eval_set = [(X_train_part, y_train_part), (X_validate, y_validate)]
bst.fit(X_train_part, y_train_part, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)

Output:

[0]	validation_0-error:0.04562	validation_0-logloss:0.61466	validation_1-error:0.04862	validation_1-logloss:0.61509
[1]	validation_0-error:0.04102	validation_0-logloss:0.55002	validation_1-error:0.04299	validation_1-logloss:0.55021
[2]	validation_0-error:0.04562	validation_0-logloss:0.49553	validation_1-error:0.04862	validation_1-logloss:0.49615
[3]	validation_0-error:0.04102	validation_0-logloss:0.44924	validation_1-error:0.04299	validation_1-logloss:0.44970
[4]	validation_0-error:0.04562	validation_0-logloss:0.40965	validation_1-error:0.04862	validation_1-logloss:0.41051
[5]	validation_0-error:0.04562	validation_0-logloss:0.37502	validation_1-error:0.04862	validation_1-logloss:0.37527
[6]	validation_0-error:0.04102	validation_0-logloss:0.34374	validation_1-error:0.04299	validation_1-logloss:0.34409
[7]	validation_0-error:0.04102	validation_0-logloss:0.31648	validation_1-error:0.04299	validation_1-logloss:0.31680
[8]	validation_0-error:0.04102	validation_0-logloss:0.29229	validation_1-error:0.04299	validation_1-logloss:0.29217
[9]	validation_0-error:0.04102	validation_0-logloss:0.27028	validation_1-error:0.04299	validation_1-logloss:0.27031
[10]	validation_0-error:0.04102	validation_0-logloss:0.25084	validation_1-error:0.04299	validation_1-logloss:0.25075
[11]	validation_0-error:0.02303	validation_0-logloss:0.23349	validation_1-error:0.02405	validation_1-logloss:0.23322
[12]	validation_0-error:0.02895	validation_0-logloss:0.21413	validation_1-error:0.02968	validation_1-logloss:0.21411
[13]	validation_0-error:0.01645	validation_0-logloss:0.19726	validation_1-error:0.01945	validation_1-logloss:0.19747
[14]	validation_0-error:0.01645	validation_0-logloss:0.18254	validation_1-error:0.01945	validation_1-logloss:0.18296
[15]	validation_0-error:0.01645	validation_0-logloss:0.16969	validation_1-error:0.01945	validation_1-logloss:0.17029
[16]	validation_0-error:0.01645	validation_0-logloss:0.15845	validation_1-error:0.01945	validation_1-logloss:0.15923
[17]	validation_0-error:0.01645	validation_0-logloss:0.14999	validation_1-error:0.01945	validation_1-logloss:0.15106
[18]	validation_0-error:0.01645	validation_0-logloss:0.14108	validation_1-error:0.01945	validation_1-logloss:0.14227
[19]	validation_0-error:0.01645	validation_0-logloss:0.13417	validation_1-error:0.01945	validation_1-logloss:0.13544
[20]	validation_0-error:0.01645	validation_0-logloss:0.12691	validation_1-error:0.01945	validation_1-logloss:0.12827
[21]	validation_0-error:0.02500	validation_0-logloss:0.12055	validation_1-error:0.02661	validation_1-logloss:0.12199
[22]	validation_0-error:0.02062	validation_0-logloss:0.11535	validation_1-error:0.02354	validation_1-logloss:0.11683
[23]	validation_0-error:0.02062	validation_0-logloss:0.11015	validation_1-error:0.02354	validation_1-logloss:0.11169
[24]	validation_0-error:0.02062	validation_0-logloss:0.10586	validation_1-error:0.02354	validation_1-logloss:0.10725
[25]	validation_0-error:0.02062	validation_0-logloss:0.10196	validation_1-error:0.02354	validation_1-logloss:0.10337
[26]	validation_0-error:0.02062	validation_0-logloss:0.09799	validation_1-error:0.02354	validation_1-logloss:0.09945
[27]	validation_0-error:0.02062	validation_0-logloss:0.09453	validation_1-error:0.02354	validation_1-logloss:0.09620
[28]	validation_0-error:0.02062	validation_0-logloss:0.09112	validation_1-error:0.02354	validation_1-logloss:0.09283
[29]	validation_0-error:0.02062	validation_0-logloss:0.08809	validation_1-error:0.02354	validation_1-logloss:0.08991
[30]	validation_0-error:0.02062	validation_0-logloss:0.08521	validation_1-error:0.02354	validation_1-logloss:0.08693
[31]	validation_0-error:0.02062	validation_0-logloss:0.08188	validation_1-error:0.02354	validation_1-logloss:0.08324
[32]	validation_0-error:0.02062	validation_0-logloss:0.07883	validation_1-error:0.02354	validation_1-logloss:0.08027
[33]	validation_0-error:0.02062	validation_0-logloss:0.07606	validation_1-error:0.02354	validation_1-logloss:0.07722
[34]	validation_0-error:0.02062	validation_0-logloss:0.07349	validation_1-error:0.02354	validation_1-logloss:0.07491
[35]	validation_0-error:0.02062	validation_0-logloss:0.07078	validation_1-error:0.02354	validation_1-logloss:0.07213
[36]	validation_0-error:0.01470	validation_0-logloss:0.06844	validation_1-error:0.01791	validation_1-logloss:0.06985
[37]	validation_0-error:0.01009	validation_0-logloss:0.06620	validation_1-error:0.01023	validation_1-logloss:0.06742
[38]	validation_0-error:0.01470	validation_0-logloss:0.06400	validation_1-error:0.01791	validation_1-logloss:0.06544
[39]	validation_0-error:0.00153	validation_0-logloss:0.06188	validation_1-error:0.00307	validation_1-logloss:0.06349
[40]	validation_0-error:0.00153	validation_0-logloss:0.05988	validation_1-error:0.00307	validation_1-logloss:0.06132
[41]	validation_0-error:0.00153	validation_0-logloss:0.05798	validation_1-error:0.00307	validation_1-logloss:0.05936
[42]	validation_0-error:0.00153	validation_0-logloss:0.05621	validation_1-error:0.00307	validation_1-logloss:0.05774
[43]	validation_0-error:0.00153	validation_0-logloss:0.05441	validation_1-error:0.00307	validation_1-logloss:0.05600
[44]	validation_0-error:0.00153	validation_0-logloss:0.05274	validation_1-error:0.00307	validation_1-logloss:0.05426
[45]	validation_0-error:0.00153	validation_0-logloss:0.05121	validation_1-error:0.00307	validation_1-logloss:0.05266
[46]	validation_0-error:0.00153	validation_0-logloss:0.04972	validation_1-error:0.00307	validation_1-logloss:0.05131
[47]	validation_0-error:0.00153	validation_0-logloss:0.04820	validation_1-error:0.00307	validation_1-logloss:0.04985
[48]	validation_0-error:0.00153	validation_0-logloss:0.04680	validation_1-error:0.00307	validation_1-logloss:0.04838
[49]	validation_0-error:0.00153	validation_0-logloss:0.04552	validation_1-error:0.00307	validation_1-logloss:0.04715
[50]	validation_0-error:0.00153	validation_0-logloss:0.04423	validation_1-error:0.00307	validation_1-logloss:0.04583
[51]	validation_0-error:0.00153	validation_0-logloss:0.04302	validation_1-error:0.00307	validation_1-logloss:0.04456
[52]	validation_0-error:0.00153	validation_0-logloss:0.04190	validation_1-error:0.00307	validation_1-logloss:0.04358
[53]	validation_0-error:0.00153	validation_0-logloss:0.04072	validation_1-error:0.00307	validation_1-logloss:0.04234
[54]	validation_0-error:0.00153	validation_0-logloss:0.03967	validation_1-error:0.00307	validation_1-logloss:0.04134
[55]	validation_0-error:0.00153	validation_0-logloss:0.03866	validation_1-error:0.00307	validation_1-logloss:0.04027
[56]	validation_0-error:0.00153	validation_0-logloss:0.03766	validation_1-error:0.00307	validation_1-logloss:0.03940
[57]	validation_0-error:0.00153	validation_0-logloss:0.03663	validation_1-error:0.00307	validation_1-logloss:0.03849
[58]	validation_0-error:0.00153	validation_0-logloss:0.03569	validation_1-error:0.00307	validation_1-logloss:0.03751
[59]	validation_0-error:0.00153	validation_0-logloss:0.03482	validation_1-error:0.00307	validation_1-logloss:0.03658
[60]	validation_0-error:0.00153	validation_0-logloss:0.03393	validation_1-error:0.00307	validation_1-logloss:0.03581
[61]	validation_0-error:0.00153	validation_0-logloss:0.03272	validation_1-error:0.00307	validation_1-logloss:0.03450
[62]	validation_0-error:0.00153	validation_0-logloss:0.03192	validation_1-error:0.00307	validation_1-logloss:0.03367
[63]	validation_0-error:0.00153	validation_0-logloss:0.03116	validation_1-error:0.00307	validation_1-logloss:0.03295
[64]	validation_0-error:0.00153	validation_0-logloss:0.03041	validation_1-error:0.00307	validation_1-logloss:0.03216
[65]	validation_0-error:0.00153	validation_0-logloss:0.02968	validation_1-error:0.00307	validation_1-logloss:0.03140
[66]	validation_0-error:0.00153	validation_0-logloss:0.02901	validation_1-error:0.00307	validation_1-logloss:0.03079
[67]	validation_0-error:0.00153	validation_0-logloss:0.02834	validation_1-error:0.00307	validation_1-logloss:0.03017
[68]	validation_0-error:0.00153	validation_0-logloss:0.02771	validation_1-error:0.00307	validation_1-logloss:0.02948
[69]	validation_0-error:0.00153	validation_0-logloss:0.02708	validation_1-error:0.00307	validation_1-logloss:0.02896
[70]	validation_0-error:0.00153	validation_0-logloss:0.02646	validation_1-error:0.00307	validation_1-logloss:0.02834
[71]	validation_0-error:0.00153	validation_0-logloss:0.02589	validation_1-error:0.00307	validation_1-logloss:0.02773
[72]	validation_0-error:0.00153	validation_0-logloss:0.02533	validation_1-error:0.00307	validation_1-logloss:0.02712
[73]	validation_0-error:0.00153	validation_0-logloss:0.02479	validation_1-error:0.00307	validation_1-logloss:0.02663
[74]	validation_0-error:0.00153	validation_0-logloss:0.02427	validation_1-error:0.00307	validation_1-logloss:0.02600
[75]	validation_0-error:0.00153	validation_0-logloss:0.02376	validation_1-error:0.00307	validation_1-logloss:0.02553
[76]	validation_0-error:0.00153	validation_0-logloss:0.02327	validation_1-error:0.00307	validation_1-logloss:0.02512
[77]	validation_0-error:0.00153	validation_0-logloss:0.02278	validation_1-error:0.00307	validation_1-logloss:0.02461
[78]	validation_0-error:0.00153	validation_0-logloss:0.02233	validation_1-error:0.00307	validation_1-logloss:0.02421
[79]	validation_0-error:0.00153	validation_0-logloss:0.02186	validation_1-error:0.00307	validation_1-logloss:0.02378
[80]	validation_0-error:0.00153	validation_0-logloss:0.02143	validation_1-error:0.00307	validation_1-logloss:0.02330
[81]	validation_0-error:0.00153	validation_0-logloss:0.02090	validation_1-error:0.00307	validation_1-logloss:0.02271
[82]	validation_0-error:0.00153	validation_0-logloss:0.02050	validation_1-error:0.00307	validation_1-logloss:0.02239
[83]	validation_0-error:0.00153	validation_0-logloss:0.02009	validation_1-error:0.00307	validation_1-logloss:0.02206
[84]	validation_0-error:0.00153	validation_0-logloss:0.01969	validation_1-error:0.00307	validation_1-logloss:0.02159
[85]	validation_0-error:0.00153	validation_0-logloss:0.01931	validation_1-error:0.00307	validation_1-logloss:0.02126
[86]	validation_0-error:0.00153	validation_0-logloss:0.01895	validation_1-error:0.00307	validation_1-logloss:0.02088
[87]	validation_0-error:0.00153	validation_0-logloss:0.01837	validation_1-error:0.00307	validation_1-logloss:0.02021
[88]	validation_0-error:0.00153	validation_0-logloss:0.01804	validation_1-error:0.00307	validation_1-logloss:0.01992
[89]	validation_0-error:0.00153	validation_0-logloss:0.01772	validation_1-error:0.00307	validation_1-logloss:0.01956
[90]	validation_0-error:0.00153	validation_0-logloss:0.01741	validation_1-error:0.00307	validation_1-logloss:0.01929
[91]	validation_0-error:0.00153	validation_0-logloss:0.01710	validation_1-error:0.00307	validation_1-logloss:0.01897
[92]	validation_0-error:0.00153	validation_0-logloss:0.01660	validation_1-error:0.00307	validation_1-logloss:0.01839
[93]	validation_0-error:0.00153	validation_0-logloss:0.01602	validation_1-error:0.00307	validation_1-logloss:0.01748
[94]	validation_0-error:0.00153	validation_0-logloss:0.01576	validation_1-error:0.00307	validation_1-logloss:0.01721
[95]	validation_0-error:0.00153	validation_0-logloss:0.01547	validation_1-error:0.00307	validation_1-logloss:0.01692
[96]	validation_0-error:0.00153	validation_0-logloss:0.01520	validation_1-error:0.00307	validation_1-logloss:0.01672
[97]	validation_0-error:0.00153	validation_0-logloss:0.01492	validation_1-error:0.00307	validation_1-logloss:0.01640
[98]	validation_0-error:0.00153	validation_0-logloss:0.01444	validation_1-error:0.00307	validation_1-logloss:0.01570
[99]	validation_0-error:0.00153	validation_0-logloss:0.01422	validation_1-error:0.00307	validation_1-logloss:0.01543
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, silent=True, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

Visualization

# retrieve performance metrics
results = bst.evals_result()
#print(results)


epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot log loss
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()

# plot classification error
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend() # show the legend
plt.ylabel('Classification Error')
plt.title('XGBoost Classification Error')
plt.show()

Stacking

How stacking works

Stacking trains several base learners on the training data, then trains a meta-learner (blender) whose inputs are the base learners' predictions and whose output is the final prediction.
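
Before the library-based code below, here is a minimal hand-rolled sketch of the idea: out-of-fold predictions of the base learners become the input features of a meta-learner (the dataset and model choices are illustrative; mlxtend's StackingClassifier, used below, wraps a similar procedure):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_learners = [KNeighborsClassifier(), RandomForestClassifier(random_state=1), GaussianNB()]

# level 0: out-of-fold class probabilities of each base learner become the meta-features
meta_train = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_learners
])

# level 1: the meta-learner (blender) is trained on those meta-features
blender = LogisticRegression(max_iter=1000)
blender.fit(meta_train, y_train)

# at prediction time, the base learners are refit on the full training set
meta_test = np.hstack([m.fit(X_train, y_train).predict_proba(X_test) for m in base_learners])
print("stacked test accuracy:", blender.score(meta_test, y_test))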

Stacking code

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
data = np.loadtxt('data/wine.data', delimiter=',')
X = data[:, 1:]
y = data[:, 0:1]
X_train, X_test, y_train, y_test = train_test_split(X, y.ravel(), train_size=0.8, random_state=0)
# Define the base classifiers
clf1 = KNeighborsClassifier(n_neighbors=5)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
# Define the logistic regression used as the final (meta) classifier
lr = LogisticRegression()
# Build the stacking classifier
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr,use_probas=True)
# Evaluate each model individually
for model in [clf1,clf2,clf3,lr,sclf]:
    model.fit(X_train,y_train)
    y_test_hat = model.predict(X_test)
    print(model.__class__.__name__,', test accuracy:',accuracy_score(y_test,y_test_hat))

Output:

KNeighborsClassifier , test accuracy: 0.8055555555555556
RandomForestClassifier , test accuracy: 0.9444444444444444
GaussianNB , test accuracy: 0.9166666666666666
LogisticRegression , test accuracy: 0.9444444444444444
StackingClassifier , test accuracy: 0.9722222222222222
