Using dummy estimators to compare results

到不了的都叫做远方

Modified on 2020-05-07 14:12:44


Included in the column: Translating scikit-learn Cookbook

This recipe is about creating fake (dummy) estimators. It isn't the pretty or exciting stuff, but it is worthwhile to have a reference point for the model you'll eventually build.

Getting ready

In this recipe, we'll perform the following tasks:

1. Create some random data.

2. Fit the various dummy estimators.

We'll perform these two steps for regression data and classification data.

How to do it...

First, we'll create the random data and fit a dummy regressor to it:

from sklearn import dummy
from sklearn.datasets import make_regression, make_classification

# make_classification is used later in the recipe
X, y = make_regression()

# the default strategy is 'mean'
dumdum = dummy.DummyRegressor()
dumdum.fit(X, y)
DummyRegressor(constant=None, quantile=None, strategy='mean')

By default, the estimator computes the mean of the training targets and returns that single value for every sample:

dumdum.predict(X)[:5]
array([10.39941733, 10.39941733, 10.39941733, 10.39941733, 10.39941733])
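As a quick sanity check, the sketch below confirms that the default prediction really is the mean of the training targets. The fixed random_state and the variable names are assumptions added here for reproducibility; they are not part of the original recipe.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor

# random_state is fixed only so the sketch is reproducible (an assumption)
X, y = make_regression(random_state=0)

baseline = DummyRegressor(strategy="mean").fit(X, y)
preds = baseline.predict(X)

# every prediction should equal the mean of the training targets
print(np.allclose(preds, y.mean()))
# the number of distinct predicted values
print(np.unique(preds).size)
```

The features in X play no role in the prediction; only y is used when fitting.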

There are two other strategies we can try. We can predict a supplied constant (refer to constant=None in the preceding output). We can also predict the median value.

The constant argument is only used when strategy is "constant".

Let's have a look:

# each entry pairs a strategy with the constant it needs (None if unused)
predictors = [("mean", None), ("median", None), ("constant", 10)]
for strategy, constant in predictors:
    dumdum = dummy.DummyRegressor(strategy=strategy, constant=constant)
    dumdum.fit(X, y)
    print("strategy: {}".format(strategy), ",".join(map(str, dumdum.predict(X)[:5])))
strategy: mean 10.399417325576314,10.399417325576314,10.399417325576314,10.399417325576314,10.399417325576314
strategy: median 7.262245710574845,7.262245710574845,7.262245710574845,7.262245710574845,7.262245710574845
strategy: constant 10,10,10,10,10
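To see what these baselines look like as scores rather than raw predictions, here is a minimal sketch. It assumes R^2 (what score() returns for regressors) and mean absolute error as the metrics of interest; the original recipe does not score the regressors, so treat the metric choice as illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# random_state is fixed only so the sketch is reproducible (an assumption)
X, y = make_regression(random_state=0)

mean_model = DummyRegressor(strategy="mean").fit(X, y)
median_model = DummyRegressor(strategy="median").fit(X, y)

# score() returns R^2; predicting the training mean gives (numerically) zero
# on the training data, which is the level a real model has to beat
r2_mean = mean_model.score(X, y)

# the median minimizes absolute error, so its MAE on the training data
# is never worse than the MAE of the mean
mae_mean = mean_absolute_error(y, mean_model.predict(X))
mae_median = mean_absolute_error(y, median_model.predict(X))

print("R^2 (mean):", r2_mean)
print("MAE (mean):", mae_mean)
print("MAE (median):", mae_median)
```

Which baseline to compare against depends on the metric: the mean is the natural reference for squared-error metrics, the median for absolute-error metrics.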

For classifiers, we'll try four strategies. They are similar to the continuous case, just slanted toward classification problems:

predictors = [("constant", 0),("stratified", None),("uniform", None),("most_frequent", None)]

We'll also need to create some classification data:

X, y = make_classification()
for strategy, constant in predictors:
    dumdum = dummy.DummyClassifier(strategy=strategy, constant=constant)
    dumdum.fit(X, y)
    print("strategy: {}".format(strategy), ",".join(map(str, dumdum.predict(X)[:5])))

strategy: constant 0,0,0,0,0
strategy: stratified 0,0,1,1,1
strategy: uniform 0,1,0,1,0
strategy: most_frequent 0,0,0,0,0
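Five predictions are too few to tell the random strategies apart. The sketch below compares the share of predicted positives over a larger, deliberately imbalanced sample. The sample size, the 80/20 class weights, and the fixed random_state are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

# roughly 80% class 0 and 20% class 1 (illustrative values, not from the recipe)
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)

shares = {}
for strategy in ("most_frequent", "stratified", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    # fraction of samples predicted as class 1
    shares[strategy] = clf.predict(X).mean()
    print(strategy, round(shares[strategy], 3))

print("actual share of class 1:", round(y.mean(), 3))
```

most_frequent never predicts the minority class, stratified draws labels at roughly the training class frequencies, and uniform picks each class about half of the time regardless of the data.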

How it works...

It's always good to test your models against the simplest models, and that's exactly what the dummy estimators give you. For example, imagine a fraud model in which only 5 percent of the data set is fraud. We can get a seemingly good accuracy just by never guessing any fraud.

We can create this model by using the most_frequent strategy, as in the following command. It also gives a good example of why class imbalance causes problems:

# about 95% of the samples fall in class 0 and 5% in class 1 (the "fraud" class)
X, y = make_classification(n_samples=20000, weights=[.95, .05])
dumdum = dummy.DummyClassifier(strategy='most_frequent')
dumdum.fit(X, y)
DummyClassifier(constant=None, random_state=None, strategy='most_frequent')
from sklearn.metrics import accuracy_score
print(accuracy_score(y, dumdum.predict(X)))

0.94585

The dummy classifier is correct about 95 percent of the time, yet it never flags a single fraud case. That's not the point, though. The point is that this is our baseline. If we cannot create a fraud model that is more accurate than this, then it isn't worth our time.
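One way to put the baseline to work is to score it next to a candidate model on held-out data. The sketch below is only an illustration: LogisticRegression, the train/test split, and the use of recall alongside accuracy are assumptions made here, not part of the original recipe.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# same imbalance as above; random_state is fixed only for reproducibility
X, y = make_classification(n_samples=20000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

estimators = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic": LogisticRegression(max_iter=1000),  # hypothetical candidate model
}

results = {}
for name, est in estimators.items():
    pred = est.fit(X_train, y_train).predict(X_test)
    # accuracy alone hides the fact that the baseline finds no positives
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    }
    print(name, results[name])
```

The baseline's recall on the minority class is zero by construction, so a candidate model should be judged on whether it improves on both numbers, not on accuracy alone.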
