
[Translation] Machine Learning with Scikit-Learn

红色贝雷帽 2019-02-10 14:49:47

Original article: http://blog.scottlogic.com/2018/02/15/scikit-machine-learning.html


Machine learning (ML) was one of the first fields of Computer Science to grab my attention when I was studying. The concept is remarkably straightforward, but its applications are incredibly powerful. Google, AirBnB, and Uber are amongst many big names to apply ML to their products. When I first attempted to apply ML to my own work, there was one library which stood out from them all as a great starting point: Scikit Learn (http://scikit-learn.org/stable/).

Developed for Python, Scikit Learn allows developers to easily integrate ML into their own projects. I'm keen to walk through a simple application of Scikit Learn with the latest version of Python 3 (v3.4.6 at the time of writing). If Python's new to you, have no fear! I'll explain the purpose of the code as we go along.

Installation

Critical to this walkthrough is the installation of Scikit Learn. Prior to doing this, make sure to download Python 3 (https://www.python.org/downloads/). With your terminal open, make sure that you have both NumPy and SciPy installed with pip:

pip install numpy

pip install scipy

The rest of the installation process is satisfyingly simple. One command performs the magic:

pip install scikit-learn

After a short time you should see a message confirming that scikit-learn was successfully installed. In order to read CSV files for this walkthrough, we'll also require Pandas (https://pandas.pydata.org/). As before, a single command does the job:

pip install pandas

There we have it - installation complete!
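
As an optional sanity check (my own addition, not part of the original post), you can confirm that every package imports cleanly and print the installed versions from a Python prompt:

import numpy
import scipy
import sklearn
import pandas

# If these imports succeed, the installation is in good shape
print(numpy.__version__, scipy.__version__, sklearn.__version__, pandas.__version__)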

Setup

Scikit Learn provides an abundance of example use cases on its own website (http://scikit-learn.org/stable/auto_examples/index.html#examples-based-on-real-world-datasets), which I found particularly useful when I first started playing with the library. For the purpose of demonstrating Scikit Learn here, I'm going to implement a classifier to categorise handwritten digits from a UCI database (https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits) consisting of ~11000 images. This dataset arose from 44 writers, each asked to write 250 digits, and each image (which we'll call a sample) in this database corresponds to one handwritten digit between 0 and 9.

Each sample is represented as one feature vector holding values between 0 and 100. These values represent the intensity of individual pixels in the sample. Given that each sample was written inside a 500 x 500 tablet pixel resolution box, this approach would leave us with exceptionally long vectors to process. To resolve this, the images were resampled to reduce the number of pixels under consideration, resulting in feature vectors of length 16.

The set of digits 0-9 will be the set of categories for our classifier to consider in the classification process. The classifier will take samples from 30 writers (~7500 samples) to learn about each category. The remaining samples will be reserved to test the classifier post-training. Each sample is already classified by hand - meaning we have the correct categories for each sample in the test set. This allows us to determine how successful the classifier was by comparing its own predictions to classifications we already hold.

Both the training and test datasets are provided online as CSV files by UCI. Importing these files into Python is made a simple process thanks to Pandas:

import pandas as pd


def retrieveData():
    # Read each CSV into a pandas DataFrame, then convert it to a NumPy array.
    # as_matrix was the pandas API at the time of writing; newer pandas versions
    # use to_numpy() instead.
    trainingData = pd.read_csv("training-data.csv").as_matrix()
    testData = pd.read_csv("test-data.csv").as_matrix()

    return trainingData, testData

Each file is read with read_csv to produce a Pandas DataFrame, which is converted into a NumPy array using as_matrix for later convenience (as_matrix has since been removed from pandas in favour of to_numpy). Each line of these files corresponds to one digit sample - consisting of one feature vector of length 16, followed by its corresponding category. Separating feature vectors and categories will prove useful later for Scikit Learn.

def separateFeaturesAndCategories(trainingData, testData):
    # The first 16 columns hold the feature vector; the final column holds the category
    trainingFeatures = trainingData[:, :-1]
    trainingCategories = trainingData[:, -1:]
    testFeatures = testData[:, :-1]
    testCategories = testData[:, -1:]

    return trainingFeatures, trainingCategories, testFeatures, testCategories

Pre-processing

The majority of classifiers offered by Scikit Learn are sensitive to feature scaling. Each feature vector holds values between 0 and 100, with no consistent mean or variance. Rescaling these vectors to have a mean of zero and a variance of one helps the classifier, both during training and classification, to recognise samples from any of the digit categories. It's an optional step in the ML pipeline, but one highly recommended if you're looking to improve classifier performance. Using StandardScaler, provided by Scikit Learn's preprocessing package, this proves relatively straightforward to achieve. By first allowing the scaler to fit to the training data - learning what the unscaled features are like - the scaler can then transform features in both the training and testing sets to hold a mean of zero and a variance of one:

from sklearn.preprocessing import StandardScaler


def scaleData(trainingFeatures, testFeatures):
    # Fit the scaler on the training features only, then apply the same
    # transformation to both the training and test sets
    scaler = StandardScaler()
    scaler.fit(trainingFeatures)

    scaledTrainingFeatures = scaler.transform(trainingFeatures)
    scaledTestFeatures = scaler.transform(testFeatures)

    return scaledTrainingFeatures, scaledTestFeatures
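
As a quick, optional check (my own addition, not from the original walkthrough), you can verify the effect of the scaler: every column of the scaled training features should end up with a mean of roughly zero and a standard deviation of roughly one.

import numpy as np

scaledTrainingFeatures, scaledTestFeatures = scaleData(trainingFeatures, testFeatures)

# Means should be ~0 and standard deviations ~1 for each of the 16 feature columns
print(np.round(np.mean(scaledTrainingFeatures, axis=0), 3))
print(np.round(np.std(scaledTrainingFeatures, axis=0), 3))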

Classification

Scikit Learn provides a range of classifiers (http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) which would suit our needs. I've decided to implement a Stochastic Gradient Descent (SGD) classifier (http://scikit-learn.org/stable/modules/sgd.html#classification), since I've found myself using this one a couple of times in the past. First we need to fit the classifier to the training data (i.e. train the classifier). Then we're ready to set the classifier free to predict categories for unseen test samples. With Scikit Learn, all of this is achieved in a few lines of code:

from sklearn.linear_model import SGDClassifier


def classifyTestSamples(trainingFeatures, trainingCategories, testFeatures):
    clf = SGDClassifier()

    # Train the classifier on the training set, then predict categories for the
    # unseen test samples (ravel flattens the category column into a 1-D array)
    clf.fit(trainingFeatures, trainingCategories.ravel())
    predictedCategories = clf.predict(testFeatures)

    return predictedCategories

Results

We have our predictions! Now to compare them to the categories provided on file. Several questions arise here: How successful was the classifier? How do we measure its success? Given a particular measure, where do we set the threshold to distinguish a bad result from a good result? To answer the first two questions, I turn to Scikit Learn's classification metrics package. I've picked four metrics to implement for this example: accuracy, precision, recall, and the F1 score.

Accuracy: Percentage of samples categorised correctly

Precision: Number of samples correctly assigned category x over the total number of samples assigned x

Recall: Number of samples correctly assigned category x over the total number of samples which actually belong to x (i.e. the samples correctly assigned x plus the samples of x assigned to another category)

F1 Score: A weighted average of precision (P) and recall (R), defined in Scikit Learn as 2 * (P * R) / (P + R)
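
To make these definitions concrete, here is a small worked example (with hypothetical counts, not taken from the dataset) for a single category x:

# Hypothetical counts for one category x
true_positives = 80     # samples of x correctly assigned x
false_positives = 20    # samples that are not x, but were assigned x
false_negatives = 10    # samples of x that were assigned another category

precision = true_positives / (true_positives + false_positives)  # 0.80
recall = true_positives / (true_positives + false_negatives)     # ~0.89
f1_score = 2 * (precision * recall) / (precision + recall)       # ~0.84

print(precision, recall, f1_score)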

Scikit Learn's accuracy_score function covers accuracy for our classifier. The remaining three metrics are covered by classification_report, which prints a breakdown of precision, recall, and F1 scores for each category, as well as providing average figures.

from sklearn.metrics import accuracy_score, classification_report


def gatherClassificationMetrics(testCategories, predictedCategories):
    # Overall accuracy, plus a per-category breakdown of precision, recall, and F1
    accuracy = accuracy_score(testCategories, predictedCategories)
    metrics_report = classification_report(testCategories, predictedCategories)

    print("Accuracy rate: " + str(round(accuracy, 2)) + "\n")
    print(metrics_report)
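
The original post keeps its full script on GitHub; purely to illustrate how the functions above fit together, a minimal driver might look like this (my own sketch, using only the functions defined in this walkthrough):

def main():
    trainingData, testData = retrieveData()
    trainingFeatures, trainingCategories, testFeatures, testCategories = \
        separateFeaturesAndCategories(trainingData, testData)
    scaledTrainingFeatures, scaledTestFeatures = scaleData(trainingFeatures, testFeatures)
    predictedCategories = classifyTestSamples(
        scaledTrainingFeatures, trainingCategories, scaledTestFeatures)
    gatherClassificationMetrics(testCategories, predictedCategories)


if __name__ == "__main__":
    main()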

There will always be a small variation in the metrics between runs of the classifier. There will be cases where the classifier categorises a test sample with a high degree of certainty, and whilst these predictions are likely to be consistent, there will also be cases where the classifier lacks confidence in its work. Where this applies, the classifier is likely to make different predictions during each run. This may well be down to the training set holding an insufficient number of samples for particular digits, or to the classifier encountering handwriting which differs significantly from the training set. Taking into consideration these variations, here's one set of results from the SGD classifier:

Accuracy rate: 0.84

             precision    recall  f1-score   support

          0       0.98      0.84      0.90       363
          1       0.58      0.84      0.69       364
          2       0.97      0.81      0.88       364
          3       0.98      0.90      0.94       336
          4       0.95      0.93      0.94       364
          5       0.62      0.94      0.75       335
          6       1.00      0.96      0.98       336
          7       0.88      0.84      0.86       364
          8       0.85      0.76      0.80       336
          9       0.93      0.58      0.72       336

avg / total       0.87      0.84      0.85      3498

For a first attempt, 84% accuracy is pretty good! This leaves me with my third and final question: where do we set the threshold to distinguish a bad result from a good result? That's a tricky one to answer. It depends entirely on the purpose of the classifier, what each individual considers to be good or bad, and any previous attempts to apply ML to the same field. Could we alter our classifier to consistently perform better than the results observed here?

Can we do better?

The likely answer: yes. There are plenty of options to consider. First, this walkthrough has applied only elementary pre-processing; more sophisticated rescaling approaches may reduce the sensitivity of the classifier even further and improve the metrics. Second, this example implemented a basic SGD classifier without any adjustments to the default parameter values provided by Scikit Learn. We could alter the number of iterations over the training data (called epochs), prevent the classifier from shuffling the training data after each epoch, or run the classifier multiple times with its warm start property enabled, so that the classifier builds on previous fits. A sketch of these options follows below.
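
For illustration only (the values here are untuned guesses, not recommendations from the original post), each of those adjustments maps onto a constructor parameter of SGDClassifier:

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(
    max_iter=1000,    # number of passes (epochs) over the training data
    shuffle=False,    # keep the training data in its original order between epochs
    warm_start=True   # reuse the previous solution when fit is called again
)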

It's also worth considering that we've only implemented one of the many classifiers Scikit Learn has to offer. Whilst the SGD classifier suffices for this example, we could also consider LinearSVC or Multinomial Naive Bayes, amongst others (see the sketch below). This is where the fun lies with ML: there are so many variables to consider which may improve or worsen our results. Finding optimal solutions to any ML problem proves a difficult task.
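
Because Scikit Learn classifiers share the same fit/predict interface, trying an alternative is a small change. Here is a minimal sketch using LinearSVC (my own illustration, not from the original post); note that Multinomial Naive Bayes requires non-negative feature values, so it would need something like MinMaxScaler in place of the StandardScaler used above:

from sklearn.svm import LinearSVC


def classifyWithLinearSVC(trainingFeatures, trainingCategories, testFeatures):
    # Same pattern as the SGD classifier: fit on the training set, predict on the test set
    clf = LinearSVC()
    clf.fit(trainingFeatures, trainingCategories.ravel())
    return clf.predict(testFeatures)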

Conclusion

This wraps up our walkthrough of Scikit Learn! We've only covered the basics here. Scikit Learn has much more to offer, which I find is well documented on its own website (http://scikit-learn.org/stable/documentation.html). For anyone wishing to view the code in full, or try it out for themselves, I've made the CSV files and Python code available on GitHub (https://github.com/rrhodes/scikit-learn-example). Enjoy!

