网络机器人吧社区

Learn Machine Learning: From Beginner to Expert (1/25)

GHBD 2018-12-05 15:59:29




This is GHBD's 18th article


GHBD | Promoting the development of healthcare big data and artificial intelligence

"Let's connect with the world"




What is Machine Learning? Why Machine Learning?


Compiled by Aaron | Source: Commonlounge



Motivation behind Machine Learning



Sometimes we encounter problems that are really hard to solve by writing a computer program. For example, let's say we wanted to program a computer to recognize hand-written digits.




Source: MNIST handwritten digit database


You could imagine trying to devise a set of rules to distinguish each individual digit. Zeros, for instance, are basically one closed loop. But what if the person didn't perfectly close the loop? Or what if the right top of the loop closes below where the left top of the loop starts?


A zero that is hard to distinguish from a six


In this case, we have difficulty differentiating zeros from sixes. We could establish some sort of cutoff, but how would we decide on the cutoff in the first place? As you can see, it quickly becomes quite complicated to compile a list of heuristics (i.e., rules and guesses) that accurately classifies handwritten digits.


Many more classes of problems fall into this category: recognizing objects, understanding concepts, comprehending speech. We don't know what program to write because we still don't know how our own brains do it. And even if we did have a good idea about how to do it, the program might be horrendously complicated.


So instead of trying to write a program, we try to develop an algorithm that a computer can use to look at hundreds or thousands of examples (and the correct answers), and then the computer uses that experience to solve the same problem in new situations. Essentially, our goal is to teach the computer to solve problems by example, much as we might teach a young child to distinguish a cat from a dog.



What is Machine Learning? - Definition



The field itself: ML is a field of study that harnesses principles of computer science and statistics to create statistical models. These models are generally used to do two things:


  1. Prediction: make predictions about the future based on data about the past

  2. Inference: discover patterns in data


Difference between ML and AI: There is no universally agreed-upon distinction between ML and artificial intelligence (AI). AI usually concentrates on programming computers to make decisions (based on ML models and sets of logical rules), whereas ML focuses more on making predictions about the future. The two fields are highly interconnected, and for most non-technical purposes they can be treated as the same.



What's a statistical model?



Models: Teaching a computer to make predictions involves feeding data into machine learning models, which are representations of how the world supposedly works. If I tell a statistical model that the world works a certain way (say, for example, that taller people make more money than shorter people), then this model can tell me who it thinks will make more money, between Cathy, who is 5'2'', and Jill, who is 5'9''.


What does a model actually look like? The concept of a model makes sense in the abstract, but knowing this is just half the battle. You should also know how it's represented inside a computer, or what it would look like if you wrote it down on paper.


A model is just a mathematical function, which, as you probably already know, is a relationship between a set of inputs and a set of outputs. Here's an example:


f(x) = x^2


This is a function that takes a number as input and returns that number squared. So, f(1) = 1, f(2) = 4, f(3) = 9.


Let's briefly return to the example of the model that predicts income from height. I may believe, based on what I've seen in the corporate world, that a person's annual income is, on average, equal to their height (in inches) times $1,000. So, if you're 60 inches tall (5 feet), I'll guess that you make about $60,000 a year. If you're a foot taller (72 inches), I'll guess $72,000 a year.


This model can be represented mathematically as follows:


Income = Height (in inches) × $1,000


In other words, income is a function of height.
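As a minimal sketch, the hand-written model above can be expressed as a small Python function. The function names and the feet-to-inches helper are assumptions made for illustration; only the rule "income = height in inches × $1,000" and the two heights (Cathy at 5'2'', Jill at 5'9'') come from the text.

```python
def feet_inches_to_inches(feet: int, inches: int) -> int:
    """Convert a height such as 5'9'' into total inches (1 foot = 12 inches)."""
    return feet * 12 + inches


def predict_income(height_inches: float) -> float:
    """Hand-written model from the text: income = height (inches) x $1,000."""
    return height_inches * 1000.0


cathy = feet_inches_to_inches(5, 2)  # 62 inches
jill = feet_inches_to_inches(5, 9)   # 69 inches

print(predict_income(60))     # 60000.0
print(predict_income(72))     # 72000.0
print(predict_income(cathy))  # 62000.0
print(predict_income(jill))   # 69000.0
```

Because this model says income only grows with height, it predicts that Jill earns more than Cathy. Nothing has been learned here yet: the multiplier of 1,000 was simply written down by hand.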


Here's the main point: machine learning refers to a set of techniques for estimating functions (like the one relating income to height) from datasets (such as pairs of heights and their associated incomes). These functions, which are called models, can then be used to make predictions about new data.
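To make "estimating a function from a dataset" concrete, here is a small sketch that estimates the multiplier k in the model income = k × height from example pairs, using a least-squares fit through the origin. The data points are invented for illustration, and least squares is just one possible estimation technique; the text does not say which one to use.

```python
# Hypothetical (height in inches, annual income in dollars) pairs,
# invented purely for this sketch.
data = [(60, 61000), (64, 63500), (66, 67000), (69, 68000), (72, 73500)]


def fit_slope(pairs):
    """Least-squares estimate of k for the model income = k * height.

    Minimizing sum((y - k*h)^2) over k gives k = sum(h*y) / sum(h*h).
    """
    numerator = sum(h * y for h, y in pairs)
    denominator = sum(h * h for h, _ in pairs)
    return numerator / denominator


k = fit_slope(data)


def model(height_inches):
    """The learned model: predicts income for a height not seen before."""
    return k * height_inches


print(round(k, 2))        # estimated dollars per inch
print(round(model(65)))   # prediction for a new height of 65 inches
```

The difference from a hand-written rule is that the multiplier is no longer guessed; it is computed from the examples, and it changes if the examples change.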



Algorithms: These functions are estimated using algorithms. In this context, an algorithm is a predefined set of steps that takes a set of data as input and transforms it through mathematical operations. You can think of an algorithm as a recipe: first do this, then do that, then do this. Done.


Machine learning of all types uses models and algorithms as its building blocks to make predictions and inferences about the world.



What exactly is being learnt?



To explain what is being learnt in machine learning, let's start with an example application: spam classification. One approach to writing a computer program that separates spam emails from non-spam emails is to split each email into individual words and maintain a list of words that appear more frequently in spam emails. Examples of such words might be 'loan', '$', 'credit', 'discount', 'offer', 'password', 'viagra', and so on. Then, if an email contains a substantial number of these words, it is classified as spam.
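The hand-maintained approach described above might look like the following sketch. The word list comes from the text; the threshold of 3, the whitespace tokenizer, and the function names are assumptions chosen for illustration.

```python
# Word list taken from the text; in this approach a person maintains it by hand.
SPAM_WORDS = {"loan", "$", "credit", "discount", "offer", "password", "viagra"}

# Assumed cutoff: how many listed words make an email "spam".
THRESHOLD = 3


def tokenize(text):
    """Lowercase, split on whitespace, and trim common trailing punctuation."""
    return [word.strip(".,!?:;") for word in text.lower().split()]


def is_spam(text, spam_words=SPAM_WORDS, threshold=THRESHOLD):
    """Classify as spam when enough words from the list appear in the email."""
    hits = sum(1 for word in tokenize(text) if word in spam_words)
    return hits >= threshold


print(is_spam("Get a loan today, special discount offer on credit!"))  # True
print(is_spam("Lunch at noon tomorrow?"))                              # False
```

Both the contents of `SPAM_WORDS` and the value of `THRESHOLD` are hand-picked here, which is exactly the weakness this section goes on to discuss.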


Although the strategy above might give fairly good results (say, detecting spam with 80% accuracy), the accuracy depends in large part on the list of words we maintain and on the precise threshold we choose for classifying an email as spam.


In machine learning, the strategy is to learn the list of words and the threshold from examples. In fact, in addition to learning which words are bad, we could also learn how bad each word is. (This example is quite realistic; many spam classification algorithms work this way.)


So in this case, the thing being learnt is a notion of how bad each word is. Note that this is not the only way to frame the problem. We framed it this way because we noticed a pattern (spam emails often contain specific words), and then we came up with a strategy that treats every word as a possible suspect. This strategy might give inaccurate results for other tasks, or be too inefficient.
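One way to learn "how bad each word is" from labeled examples is sketched below. It scores each word by a smoothed log-ratio of how often it appears in spam versus non-spam, in the style of a naive Bayes classifier. This is one common choice rather than the specific method the text has in mind, and the training emails, function names, and decision cutoff of 0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def learn_word_scores(examples):
    """Learn a badness score per word from (text, is_spam) examples.

    score(word) = log P(word | spam) - log P(word | not spam),
    with add-one smoothing so unseen counts never produce log(0).
    Positive scores lean spam; negative scores lean non-spam.
    """
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in examples:
        (spam_counts if is_spam else ham_counts).update(tokenize(text))
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    return {
        word: math.log((spam_counts[word] + 1) / spam_total)
        - math.log((ham_counts[word] + 1) / ham_total)
        for word in vocab
    }


def spam_score(text, scores):
    """Sum the learned scores; words never seen in training contribute 0."""
    return sum(scores.get(word, 0.0) for word in tokenize(text))


# Tiny hypothetical training set, invented for this sketch.
training = [
    ("cheap loan offer", True),
    ("discount loan credit offer", True),
    ("meeting notes attached", False),
    ("lunch meeting tomorrow", False),
]

scores = learn_word_scores(training)
print(spam_score("loan offer today", scores) > 0)   # True: leans spam
print(spam_score("meeting tomorrow", scores) > 0)   # False: leans non-spam
```

Here the word list and the per-word weights both come out of the data. In a fuller version the cutoff would also be chosen from examples instead of being fixed at 0.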



Desirable properties of machine learning



You might notice that using machine learning to learn how bad each word is has several advantages over maintaining this list manually.


  • It reduces the amount of manual work involved in creating the list. Think about how long this list could get if you tried to build it manually. Also, if you maintain the list manually, how would you deal with the hundreds of languages across the world? This task can easily become infeasible without machine learning.


  • The same strategy works for other similar tasks. Say we wanted to classify whether a movie review speaks positively or negatively about a movie. If we were creating lists of words manually, we would have to build a whole new list by hand. But if we learn the list, the same algorithm works, provided we already have some data (say, ratings and reviews left by users on IMDb).


  • It updates automatically. Let's say tomorrow the spammers become more advanced and start writing the word 'password' with a disguised spelling. Or they might try to sell you insurance, something we haven't yet encountered. We can simply set the machine learning algorithm to be retrained daily, and it will use the newly available data and keep adapting over time to changing behavior.






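Learn Machine Learning: From Beginner to Expert series


This series contains 25 machine learning tutorials.

You can think of the series as a "free online library".

You will learn core machine learning concepts, algorithms, and applications.

Everything is 100% free.

Comments and discussion are welcome.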











Global Healthcare Big Data

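We build on the second Global Healthcare Big Data Symposium (2017), the first International Symposium on Cloud, Mobile and Big Data (2015), and three Global Healthcare Big Data working meetings held at Stanford University Medical Center (2016), the University of Hong Kong (2016), and the Peking University Big Data Center (2017). Our goal is to build a lasting international platform for industry experts in China and abroad and to form a distinctive professional community, bringing together government agencies, healthcare practitioners, technology researchers, scholars, and other IT professionals from around the world to exchange important ideas and results on the future of hospital IT.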



Copyright © 网络机器人吧社区@2017