What is the best neural network library for Python?

Source: https://www.quora.com/What-is-the-best-neural-network-library-for-Python

DeepMind Member and Google Veteran: Breakthroughs and Developments in Neural Network Sequence Learning

2016-05-02 新智元

Article source: the O'Reilly report The Future of Machine Intelligence

Author: David Beyer

Title: Oriol Vinyals: Sequence-to-Sequence Machine Learning

Download: future-of-machine-intelligence

[新智元 Editor's Note] Google's CEO wrote in a letter to investors that Google Search will become more context-aware, and the key technology behind that is, of course, deep learning. In this article, Oriol Vinyals, a veteran Google employee and member of DeepMind, analyzes the strengths, bottlenecks, and remedies of neural-network sequence learning. He argues that machine translation is, at its core, a sequence-based deep learning problem, explains that his team hopes to replace heuristics with machine learning, and predicts that machines will learn to read and understand text within the next few years.

The interviewee, Oriol Vinyals, is a research scientist at Google on the DeepMind team, and previously worked on the Google Brain team. He received his PhD in EECS from the University of California, Berkeley, and his master's degree from the University of California, San Diego.

Key Points
Sequence-to-sequence learning with neural networks delivers state-of-the-art performance in several domains, such as machine translation.

Powerful as it is, sequence-to-sequence learning is constrained by several factors, including computational power. Long short-term memory (LSTM) networks have contributed greatly to pushing the field forward.

Beyond image and text understanding, deep learning models can learn to "write" solutions to some famous algorithmic challenges, including the Traveling Salesman Problem.

Machine Translation Is a Sequence-Based Deep Learning Problem

[O'Reilly] Let's start with your background.

[Oriol Vinyals] I'm from Barcelona, Spain, where I completed undergraduate degrees in mathematics and telecommunications engineering. I knew early on that I wanted to study AI in the United States. I spent nine months at Carnegie Mellon, where I finished my undergraduate thesis. I then earned a master's degree at UC San Diego before moving to Berkeley in 2009 for my PhD.

During my PhD, while interning at Google, I met and worked with Geoffrey Hinton; that experience catalyzed my interest in deep learning. Together with enjoyable internships at Microsoft and Google, it convinced me to work in industry, and I joined Google full time in 2013. I started out with a strong interest in speech recognition and in optimization (with an emphasis on natural language processing and understanding), and later moved toward using deep learning to solve these and other problems, including recent work on getting algorithms to learn automatically from data.

[O'Reilly] Can you talk about how your focus has shifted, now that you have left speech recognition? Which areas excite you most these days?

[Oriol Vinyals] My background in speech recognition sparked my interest in sequences. Recently, Ilya Sutskever, Quoc Le, and I published a paper on sequence-to-sequence mapping, in which a recurrent neural network performs machine translation from French to English.

For background, supervised learning has been successful when inputs and outputs are vectors. Feed one of these classic models an image, and it outputs the corresponding class label. Until recently, however, we could not take an image as input and produce a sequence of words describing it. The rapid progress now under way owes much to the availability of high-quality datasets of captioned images (MSCOCO) and, in parallel, to the resurgence of recurrent neural networks.

Our work recast machine translation as a sequence-based deep learning problem. The results showed that deep learning can map a sequence of English words to a sequence of Spanish words. Thanks to deep learning's astonishing capacity, we reached state-of-the-art performance rather quickly. Results like these suggest new applications in their own right, for example, automatically distilling a video into four descriptive sentences.

Sequence-to-Sequence Bottlenecks and Remedies

[O'Reilly] Where does the sequence-to-sequence approach fall short?

[Oriol Vinyals] Suppose you want to translate an English sentence into French. You might use a huge corpus of political speeches and debates as training data. Properly applied, this lets you convert political speech into any other language. Problems arise, however, when you try to translate, say, Shakespearean English into French. This sort of domain shift strains deep learning approaches, whereas traditional machine translation systems are rule-based in a way that lets them adapt to the shift.

There are further difficulties. Beyond a certain sequence length, we lack the computational power to keep up. Current models can match sequences of length 200 to corresponding sequences of the same length, and as sequences grow longer, running time grows with them. Although we are confined to relatively short documents for now, I expect that limit to keep loosening over time: just as GPUs compressed the running time of large, complex models, improvements in memory and compute will make ever longer sequences tractable.

Beyond the computational bottleneck, longer sequences pose interesting mathematical problems. Years ago, Hochreiter introduced the concept of the vanishing gradient. As you read through thousands of words, you easily forget information from three thousand words back; if you cannot recall the key plot twist from chapter three, the novel's ending loses its meaning. The challenge, in effect, is one of memory. A recurrent neural network can generally remember 10 to 15 words. But if you multiply a matrix by itself 15 times, the output shrinks toward zero. In other words, the gradient vanishes, and learning becomes impossible.

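The shrinking effect Vinyals describes is easy to reproduce. Below is a minimal numpy sketch (my own illustration, not from the interview; the matrix size and weight scale are arbitrary) showing a signal collapsing under repeated multiplication by a small-weight matrix:

import numpy as np

np.random.seed(0)
W = np.random.randn(50, 50) * 0.05  # small weights, spectral norm well below 1
h = np.random.randn(50)             # initial hidden state

# Repeatedly applying W mimics a recurrent net unrolled over time steps.
for step in range(1, 16):
    h = W @ h
    print(step, np.linalg.norm(h))  # the norm decays toward 0: the signal "vanishes"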

One important solution to this problem relies on the long short-term memory (LSTM), a clever modification of the recurrent neural network that lets it remember far beyond its normal limits. I have seen LSTMs remember 300 to 400 words. Long as that is, such growth is only the beginning; future networks will handle text on the scale of everyday life.

Stepping back, recent years have seen the emergence of several models that tackle the memory problem. I have personally experimented with adding this kind of memory to neural networks: rather than cramming everything into the recurrent net's hidden state, a memory lets you recall previously seen words and so helps with the optimization task at hand. Despite rapid progress over the years, the deeper challenge of what knowledge representation really means remains, and it is still an open problem. Even so, I believe we will see major advances along these lines next.

Replacing Heuristics with Machine Learning

[O'Reilly] Let's change topic and talk about your work on algorithm generation. Can you describe the history and motivation behind these efforts?

[Oriol Vinyals] A classic exercise demonstrating the power of supervised learning involves separating a set of given points into classes: this is class A, this is class B, and so on. The XOR ("exclusive or" logical connective) problem is particularly instructive. The goal is to learn the exclusive-or operation, that is, given two input bits, to learn the correct output. Precisely speaking, two bits means four examples: 00, 01, 10, 11, whose outputs are 0, 1, 1, 0. This problem cannot be solved by a linear model, but deep learning can solve it. Even so, current limits on computational power rule out much more complex problems.

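Since XOR is the canonical example here, a minimal sketch of a small network learning it may help (my own illustration using scikit-learn; the layer size, solver, and seed are arbitrary choices):

from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # the four possible input bit pairs
y = [0, 1, 1, 0]                      # XOR outputs: not linearly separable

# A hidden layer gives the model the nonlinearity that XOR requires.
model = MLPClassifier(hidden_layer_sizes=(8,), solver='lbfgs',
                      max_iter=1000, random_state=1)
model.fit(X, y)
print(model.predict(X))  # ideally recovers [0, 1, 1, 0]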

Recently, Wojciech Zaremba (an intern in our group) published a paper titled "Learning to Execute," describing a recurrent-network mapping from Python programs to the results of executing those programs. The model could predict the output of a Python program just by reading its source code. Simple as the problem sounds, it provided a good starting point, so I turned my attention to an NP-hard problem.

We took on a highly complex, resource-hungry problem: finding the shortest route through a set of points, the famous Traveling Salesman Problem. Ever since it was posed, the problem has attracted a wealth of solutions; people have invented all sorts of heuristics that trade efficiency against accuracy. In our case, we investigated whether a deep learning system could infer heuristics comparable to those in the literature from training data alone.

For efficiency, we considered only 10 cities rather than the usual 10,000 or 100,000. Our training set had city locations as input and shortest routes as output. That's all. We didn't want the network to be given any other assumptions about the problem.

A successful neural network should reproduce the behavior of visiting every point while minimizing total distance traveled. In fact, in what I can only call a miraculous moment, we found that it could.

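To make the learning target concrete, here is a brute-force sketch of the ground truth such a training set would contain (a toy illustration of mine, feasible only because a handful of cities gives a tiny search space; it is not the team's method):

import itertools, math, random

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(7)]  # 7 keeps brute force instant

def tour_length(order):
    # Total distance of visiting cities in the given order and returning to the start.
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))

# Exhaustively score every permutation; the best one is the training label.
best = min(itertools.permutations(range(len(cities))), key=tour_length)
print(best, round(tour_length(best), 3))
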
I should add that the output may not be optimal, since it is probabilistic after all, but it is a good start. We hope to apply this approach to new problems. The goal is not to replace existing, hand-coded solutions; rather, we want to replace heuristics with machine learning.

[O'Reilly] Will this ultimately make us better programmers?

[Oriol Vinyals] Take programming competitions. They start with a problem statement in plain English: "In this program, you must find A, B, and C, subject to X, Y, and Z." You code your solution and test it on a server. Now imagine, instead, a neural network that reads such a problem statement in natural language and learns an algorithm that gives at least an approximate solution, perhaps even an exact one. The scenario may sound far-fetched. But remember: just a few years ago, reading a Python program and outputting its result also sounded quite implausible.

Machines Will Read and Understand Text Within a Few Years

[O'Reilly] How do you see your work progressing over the next five years? What are the biggest unsolved problems?

[Oriol Vinyals] Five years may be a bit tight, but having a machine read and understand a book is not far off. Similarly, we can expect machines to answer questions by learning from data rather than from a given set of rules. Today, if I ask you a question, you open Google and start searching; after a few tries you may arrive at the answer. Like you, a machine should be able to return an answer in response to a question. We already have models along these lines built on compact datasets. The challenges beyond that run deep: how do you distinguish correct answers from incorrect ones? How do you quantify correctness and incorrectness? These and other important questions will set the course of future research.

How Does Google's Search Algorithm Rank Medical Ads?

2016-05-02 新智元

[新智元 Editor's Note] The tragic death of the young man Wei Zexi has stirred heated public debate in China over deceptive online medical advertising on search engines. Any mention of search engines brings Google to mind, so how does Google handle medical ads? The answer is RankBrain, an algorithm that uses machine learning.

The tragic death of Wei Zexi has sparked heated public debate in China over deceptive medical advertising on search engines. According to an article published today on the Business Value (《商业价值》) WeChat account, 《谷歌也曾涉足医疗广告,美国司法是如何监管的呢?》 ("Google has also dabbled in medical advertising; how does the US justice system regulate it?"), searching Google for synovial sarcoma also turns up medical ads, but all of them carry a clear "Ad" label. And unlike with Baidu, Google's paid ads do not affect its search ranking.

Google's search ads for synovial sarcoma treatments carry clear ad labels. Source: Business Value

The Business Value article also notes that, under Google's search advertising policy, running pharmaceutical ads requires certification from the FDA and the National Association of Boards of Pharmacy (NABP). In other words, only government-approved online pharmacies, drugs, and treatments may run pharmaceutical ads on the site. Google's automatic ad-filtering machinery also goes a long way toward keeping deceptive medical ads out. According to a report Google published, in 2015 it preemptively blocked a total of 780 million policy-violating ads and banned 214,000 advertisers, including 12.5 million violating medical and pharmaceutical ads, whether for unapproved drugs or for false and misleading claims.

How Google Ranks with Algorithms

据统计,每天向 Google 提交的查询中有约 15% 是其未曾见过的。公司的资深研究科学家 Greg Corrado 透露,为了更好回答这些问题,Google 利用了 RankBrain 来将海量的书面语嵌入到计算机可以理解的向量里面。

When RankBrain sees an unfamiliar word or phrase, it guesses at words with similar meanings and filters the results accordingly, letting it handle search queries it has never seen before. For example, RankBrain can usefully answer a question like "What's the title of the consumer at the highest level of a food chain?"

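The underlying idea, words as vectors compared by distance, can be sketched in a few lines (a toy illustration of mine with made-up three-dimensional vectors; real embeddings are learned and have hundreds of dimensions):

import numpy as np

# Hypothetical embeddings: similar meanings get nearby vectors.
vectors = {
    "predator": np.array([0.9, 0.1, 0.3]),
    "apex":     np.array([0.7, 0.3, 0.2]),
    "shoe":     np.array([0.1, 0.9, 0.7]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# An unfamiliar query term is matched to its nearest known neighbor.
query = np.array([0.85, 0.15, 0.35])
print(max(vectors, key=lambda w: cosine(vectors[w], query)))  # -> "predator"
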
Within Google's search machinery, RankBrain is just one of the hundreds of signals that feed the search algorithm, but it differs from the others in that it can learn, whereas the rest encode what people have discovered and understood about information retrieval. In an internal test, Google had the engineers who build the algorithm guess which page the search algorithm would rank first; they were right 70% of the time. RankBrain then did the same task and reached 80%, beating the engineers' average.

Over time, RankBrain may come to handle more and more of the signals that are currently analyzed with hand-written code to improve Google's algorithm, and Google's products will keep getting smarter. Machine learning will be woven into Google's search engine in all sorts of meaningful ways, and all of these moves should keep its search engine in the lead.

How RankBrain Works

RankBrain is one part of Google's Hummingbird search algorithm. Hummingbird is the overall search algorithm, much as a car has an overall engine. The engine itself consists of many parts, such as the oil filter, fuel pump, and radiator. Likewise, Hummingbird is built from many components, and RankBrain is one of them.

Hummingbird also contains other parts whose names are familiar to anyone in the SEO world: Panda, Penguin, and Payday for fighting spam, Pigeon for improving local results, Top Heavy for demoting ad-heavy pages, Mobile Friendly for rewarding mobile-friendly pages, and Pirate for combating copyright infringement.

What Are the "Signals" Google Uses for Ranking?

Google uses signals to decide how to rank web pages. For example, it reads the words on a page, so words are a signal. If certain words are in bold, that is another signal worth noting. Computations feed into PageRank, giving a page a PageRank score that serves as a signal. If a page is detected to be mobile-friendly, that becomes yet another signal. All of these signals are processed by the various parts of the Hummingbird algorithm, which ultimately decides which pages to return for a given search.

How Many Signals Are There?

Google says there are roughly 200 major ranking signals it evaluates, which in turn may have tens of thousands of variants or sub-signals. For a more intuitive guide to ranking signals, have a look at Google's Periodic Table of SEO Success Factors:

What Exactly Does RankBrain Do?

From email exchanges with Google, RankBrain is mainly used to interpret searches for which people may not know the exact words to type.

Google long ago found ways to match pages beyond the exact terms typed. For instance, many years ago, if you typed "shoe," Google might not find pages that said "shoes," since technically those are two different words. "Stemming" made Google smarter, letting the engine understand that shoes has the stem shoe, just as "running" has the stem "run." Google also learned synonyms, so that if you search for "sneakers," it may know you want "running shoes." It even acquired conceptual knowledge, such as which pages are about Apple the company and which are about the fruit.

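Stemming itself is a standard, widely available technique; for instance, the Porter stemmer in the NLTK library reduces inflected forms to a shared stem (a small sketch of mine, not anything specific to RankBrain):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["shoes", "shoe", "running", "run"]:
    # Inflected forms collapse onto the same stem, so "shoes" matches "shoe".
    print(word, "->", stemmer.stem(word))
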
References:

http://mp.weixin.qq.com/s?__biz=MTA2MTMwNjYwMQ==&mid=2650693625&idx=1&sn=8ab532faa66e69cc447e250f58807dda&scene=1&srcid=0502LFwayyLBIMhASaZX4zrt#rd

Essentials of 10 Machine Learning Algorithms

2015-10-24 伯乐在线 程序员的那些事

Foreword

Google chairman Eric Schmidt once said that although Google's self-driving cars and robots get a lot of media attention, the company's real future lies in machine learning, a technology that makes computers smarter and more personalized.

We may be living at the most pivotal moment in human history: computing has moved from mainframes to personal computers and now to the cloud. What matters is not what has happened, but what is coming.

The democratization of tools and techniques makes people like me excited about this era, as does the boom in computing. Today, as a data scientist, I can earn several dollars an hour building data-crunching machines with complex algorithms. But getting to this point was not easy: I went through countless dark days and nights.

Who Benefits Most from This Guide?

What I'm giving you today may be the most valuable guide I have ever written.

This guide aims to simplify the learning journey for aspiring data scientists and machine learning enthusiasts. It will get you hands-on with machine learning problems and let you learn from practice. I provide a high-level understanding of several machine learning algorithms, along with R and Python code for running them, which should be enough for you to try them yourself.

I have deliberately skipped the statistics behind these techniques, because you don't need to understand them at the start. If you want to understand these algorithms at that level, look elsewhere. But if you want to prepare before starting a machine learning project, you will like this article.

Broadly Speaking, There Are Three Types of Machine Learning Algorithms

1. Supervised Learning

How it works: This class of algorithms has a target/outcome variable (the dependent variable) that is predicted from a given set of predictor variables (independent variables). Using these variables, we generate a function that maps inputs to the desired outputs. The training process continues until the model achieves the desired accuracy on the training data. Examples of supervised learning include regression, decision trees, random forests, k-nearest neighbors, and logistic regression.

2. Unsupervised Learning

How it works: Here there is no target or outcome variable to predict or estimate. These algorithms are used for clustering a population into different groups, an analysis widely used for segmenting customers into groups for specific interventions. Examples of unsupervised learning include association rule learning (Apriori) and k-means.

3. Reinforcement Learning

How it works: This class of algorithms trains a machine to make decisions. It works like this: the machine is placed in an environment where it trains itself through trial and error, learning from past experience and trying to apply its best knowledge to make accurate decisions. An example of reinforcement learning is the Markov decision process.

List of Common Machine Learning Algorithms

Here is a list of commonly used machine learning algorithms. They can be applied to almost any data problem:

  • Linear regression
  • Logistic regression
  • Decision tree
  • SVM
  • Naive Bayes
  • K-nearest neighbors (KNN)
  • K-means
  • Random forest
  • Dimensionality reduction algorithms
  • Gradient boosting and AdaBoost

1. Linear Regression

Linear regression is used to estimate real-valued outcomes (house prices, number of calls, total sales, and so on) from continuous variables. We establish the relationship between the independent and dependent variables by fitting a best-fit line, known as the regression line and represented by the linear equation Y = a*X + b.

The best way to understand linear regression is to relive a childhood experience. Suppose you ask a fifth grader to line up the students in her class by weight, from lightest to heaviest, without asking anyone's weight. What do you think she would do? She would likely eyeball people's height and build and arrange them by combining these visible parameters. This is linear regression in real life: the child has figured out that height and build relate to weight through a relationship that looks much like the equation above.

In this equation:

  • Y: dependent variable
  • a: slope
  • X: independent variable
  • b: intercept

The coefficients a and b are derived by minimizing the sum of squared differences between the data points and the regression line (least squares).

See the example below. We have found the best-fit line y = 0.2811x + 13.9. Given a person's height, we can use this equation to estimate their weight.

The two main types of linear regression are simple linear regression and multiple linear regression. Simple linear regression has a single independent variable, while multiple linear regression, as the name suggests, has several. When finding the best-fit line, you can also fit polynomial or curvilinear regressions, known accordingly as polynomial or curvilinear regression.

Python Code

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric and numpy arrays
x_train=input_variables_values_training_datasets
y_train=target_variables_values_training_datasets
x_test=input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted= linear.predict(x_test)

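The template above uses placeholder variable names. For a self-contained run, here is a minimal sketch with made-up height/weight numbers (my own toy data, not from the article):

import numpy as np
from sklearn import linear_model

# Toy data: heights (cm) as the feature, weights (kg) as the response.
x_train = np.array([[150], [160], [170], [180]])
y_train = np.array([50, 56, 61, 67])

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
print(linear.coef_, linear.intercept_)    # slope a and intercept b
print(linear.predict(np.array([[175]])))  # estimated weight for a new height
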
R Code

#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train,y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted= predict(linear,x_test)

2. Logistic Regression

Don't be confused by its name! It is a classification algorithm, not a regression algorithm. It estimates discrete values (binary values like 0/1, yes/no, true/false) from a given set of independent variables. Simply put, it predicts the probability of an event occurring by fitting the data to a logit function, hence the name logit regression. Since it predicts probabilities, its output values lie between 0 and 1 (as expected).

Again, let's understand this through a simple example.

Suppose your friend gives you a puzzle to solve. There are only two outcomes: either you solve it or you don't. Now imagine being given a wide range of puzzles to figure out which subjects you are good at. The outcome of that study would look something like this: given a tenth-grade trigonometry problem, you are 70% likely to solve it; given a fifth-grade history question, you are only 30% likely to answer correctly. That is what logistic regression gives you.

Mathematically, the log odds of the outcome are modeled as a linear combination of the predictor variables:

odds = p/(1-p) = probability of event occurrence / probability of event not occurring
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0 + b1X1 + b2X2 + b3X3 .... + bkXk

Above, p is the probability of the characteristic of interest being present. The parameters are chosen to maximize the likelihood of the observed sample values, rather than to minimize the sum of squared errors (as in ordinary regression).

Now you may ask: why take the log? In short, it is one of the best ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this guide.

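To see how the logit maps a linear combination into a probability between 0 and 1, here is a minimal sketch (my own illustration; the coefficients are arbitrary):

import numpy as np

b0, b1 = -4.0, 2.0                       # arbitrary intercept and slope

def probability(x):
    # Inverting logit(p) = b0 + b1*x gives the sigmoid: p = 1 / (1 + e^-(b0 + b1*x))
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

for x in [0, 1, 2, 3, 4]:
    print(x, round(probability(x), 3))   # climbs smoothly from ~0 to ~1, like a soft step
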
Python Code

#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted= model.predict(x_test)

R Code

x <- cbind(x_train,y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x,family='binomial')
summary(logistic)
#Predict Output
predicted= predict(logistic,x_test)

Going further:

You can try many different methods to improve this model:

  • adding interaction terms
  • trimming model features
  • using regularization techniques
  • using a nonlinear model

3. Decision Tree

This is one of my favorite and most frequently used algorithms. It is a supervised learning algorithm mostly used for classification problems and, surprisingly, it works for both categorical and continuous dependent variables. The algorithm splits the population into two or more homogeneous sets, based on the most significant attributes or independent variables, so as to make the groups as distinct as possible. For more detail, see: Decision Tree Simplified.

Source: statsexchange

In the image above, you can see that the population is divided into four different groups based on multiple attributes, to identify "whether they will play or not." Splitting the population into groups uses a variety of techniques, such as Gini, information gain, chi-square, and entropy.

The best way to understand how a decision tree works is to play Jezzball, a classic Microsoft game (see below). The goal of the game is to carve out, in a room with moving walls, as large a ball-free space as possible by building walls.

So every time you split the room with a wall, you are trying to create two different populations within the same room. Decision trees work in a very similar fashion, dividing a population into groups that are as different as possible.

For more information, see: Simplified Version of Decision Tree Algorithms

Python Code

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini') # for classification; criterion can be 'gini' (default) or 'entropy' (information gain)
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)

R Code

library(rpart)
x <- cbind(x_train,y_train)
# grow tree
fit <- rpart(y_train ~ ., data = x,method="class")
summary(fit)
#Predict Output
predicted= predict(fit,x_test)

4. SVM (Support Vector Machine)

This is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the total number of features you have), with the value of each feature being the value of a particular coordinate.

For example, if we had only two features, height and hair length, we would plot these two variables in two-dimensional space, where each point has two coordinates (these coordinates are known as support vectors).

Now we find a line that splits the data between the two groups, placed so that the distances from it to the closest point in each group are jointly maximized.

In the example above, the black line splits the data into two optimally classified groups, since the closest points in each group (points A and B) are as far from it as possible. This line is our classifier. Then, depending on which side of the line new test data lands, that is the class we assign to it.

More on this: Simplified Version of Support Vector Machine

Think of this algorithm as playing JezzBall in n-dimensional space, with a few tweaks to the game:

  • You can draw lines/planes at any angle, rather than only horizontally or vertically as in the classic game.
  • The objective of the game becomes to segregate balls of different colors into different rooms.
  • The balls do not move.

Python Code

#Import Library
from sklearn import svm
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create SVM classification object
model = svm.SVC() # there are various options associated with it; this is a simple setup for classification
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)

R Code

library(e1071)
x <- cbind(x_train,y_train)
# Fitting model
fit <-svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)

5. Naive Bayes

This is a classification technique based on Bayes' theorem, with an assumption of independence between predictors. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of other features, a naive Bayes classifier would treat all of them as independently contributing to the probability that this fruit is an apple.

Naive Bayes models are easy to build and particularly useful for very large data sets. Despite its simplicity, naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). Look at the equation below:

P(c|x) = P(x|c) * P(c) / P(x)

Here,

  • P(c|x) is the posterior probability of the class (target) given the predictor (attribute)
  • P(c) is the prior probability of the class
  • P(x|c) is the likelihood, i.e., the probability of the predictor given the class
  • P(x) is the prior probability of the predictor

Example: Let's understand it using an example. Below I have a training data set of weather and the corresponding target variable "Play". We need to classify whether players will play or not based on the weather conditions. Let's follow the steps below.

Step 1: Convert the data set into a frequency table.

Step 2: Create a likelihood table by finding probabilities like "the probability of Overcast is 0.29" and "the probability of playing is 0.64."

Step 3: Now use the naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.

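The arithmetic above is easy to check in a couple of lines (a quick verification of my own):

p_sunny_given_yes = 3 / 9   # P(Sunny | Yes)
p_sunny = 5 / 14            # P(Sunny)
p_yes = 9 / 14              # P(Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
print(p_sunny_given_yes * p_yes / p_sunny)  # ~0.6, so "will play" is the likelier class
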
Naive Bayes uses a similar method to predict the probability of each class based on the various attributes. The algorithm is mostly used in text classification and in problems involving multiple classes.

Python Code

#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create naive Bayes classification object (there are other distributions for multinomial classes, such as Bernoulli Naive Bayes)
model = GaussianNB()
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)

R Code

library(e1071)
x <- cbind(x_train,y_train)
# Fitting model
fit <-naiveBayes(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)

6. KNN (K-Nearest Neighbors)

KNN can be used for both classification and regression problems, though in industry it is more widely applied to classification. It is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors: according to a distance function, a new case is assigned to the class most common among its k nearest neighbors.

These distance functions can be Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables, while the fourth (Hamming) is used for categorical variables. If k = 1, the new case is simply assigned to the class of its single nearest case. At times, choosing k turns out to be a challenge when doing KNN modeling.

More information: Introduction to k-Nearest Neighbors: Simplified

KNN maps easily onto real life. If you want to learn about a complete stranger, you might find out about his close friends and the circles he moves in to gain access to that information.

Things to consider before selecting KNN:

  • KNN is computationally expensive.
  • Variables should be normalized first, or else higher-range variables can bias the results.
  • Invest in preprocessing, such as outlier and noise removal, before using KNN.

Python Code

#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)

R Code

library(class)
# knn() from the class package fits and predicts in one step
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)

7. K-Means

K-means is an unsupervised learning algorithm that solves clustering problems. Its procedure is a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous to points in other clusters.

Remember figuring out shapes from ink blots? K-means is somewhat similar to that activity: you look at the shapes and spread to decipher how many different clusters or populations are present.

How k-means forms clusters:

  • K-means picks k points, one for each cluster, known as centroids.
  • Each data point forms a cluster with the closest centroid, i.e., k clusters.
  • The centroid of each cluster is recomputed from its current members, giving new centroids.
  • With the new centroids, repeat steps 2 and 3: find the closest centroid for each data point and associate it with the new k clusters. Repeat this process until convergence, i.e., until the centroids stop changing.

How to determine the value of k:

In k-means, we have clusters, and each cluster has its own centroid. The sum of squared differences between a cluster's centroid and its data points constitutes the within-cluster sum of squares, and adding this up over all clusters gives the total within-cluster sum of squares for the cluster solution.

We know that as the number of clusters increases, this value keeps decreasing. But if you plot it, you will see that the sum of squared distances decreases sharply up to some value of k, and much more slowly after that. Here, we can find the optimal number of clusters.

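This "elbow" heuristic is straightforward to sketch with scikit-learn, whose KMeans objects expose the total within-cluster sum of squares as inertia_ (a small illustration of mine using random blob data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit k-means for a range of k and watch the within-cluster sum of squares fall.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # the drop flattens past the true k (the "elbow")
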
Python Code

#Import Library
from sklearn.cluster import KMeans
#Assumed you have, X (attributes) for training data set and x_test(attributes) of test_dataset
# Create KMeans object
k_means = KMeans(n_clusters=3, random_state=0)
# Train the model using the training set
k_means.fit(X)
#Predict Output
predicted= k_means.predict(x_test)

8. Random Forest

Random forest is a term for an ensemble of decision trees. In a random forest, we have a collection of decision trees (hence "forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is planted and grown as follows:

  1. If the number of cases in the training set is N, a sample of N cases is taken at random with replacement. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m << M is specified such that, at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
  3. Each tree is grown to the largest extent possible. There is no pruning.

For more details on this algorithm, on comparing it with decision trees, and on tuning model parameters, I suggest you read these articles:

  1. Introduction to Random Forest – Simplified
  2. Comparing a CART Model to Random Forest (Part 1)
  3. Comparing a Random Forest to a CART Model (Part 2)
  4. Tuning the Parameters of Your Random Forest Model

Python Code

#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create Random Forest object
model= RandomForestClassifier()
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)

R Code

library(randomForest)
x <- cbind(x_train,y_train)
# Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree=500)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)

9. Dimensionality Reduction Algorithms

In the last 4–5 years, data capture has grown exponentially at every possible stage. Corporations, government agencies, and research organizations are not only tapping new sources but also capturing data in great detail.

For example, e-commerce companies are capturing ever more details about customers: demographics, web browsing history, likes and dislikes, purchase history, feedback, and much more, giving them more personalized attention than your neighborhood grocer can.

As data scientists, the data we are offered contains many features. This sounds like great material for building a model that can stand up to scrutiny, but there is a challenge: how do you identify the most significant variables out of 1000 or 2000? In such cases, dimensionality reduction, together with various other techniques such as decision trees, random forests, PCA, and factor analysis, helps us find the significant variables based on the correlation matrix, the proportion of missing values, and other factors.

To know more about these algorithms, you can read the Beginner's Guide to Dimensionality Reduction Techniques.

Python Code

#Import Library
from sklearn import decomposition
#Assumed you have training and test data set as train and test
# Create PCA object; default value of n_components is min(n_samples, n_features)
pca = decomposition.PCA(n_components=k)
# For Factor analysis
#fa= decomposition.FactorAnalysis()
# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)
#Reduce the dimension of the test dataset
test_reduced = pca.transform(test)

R Code

library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca,train)
test_reduced <- predict(pca,test)

10. Gradient Boosting and AdaBoost

GBM and AdaBoost are boosting algorithms used when we deal with lots of data and want a prediction with high predictive power. Boosting is an ensemble learning technique that combines the predictions of several base estimators in order to improve robustness over a single estimator. These boosting algorithms always work well in data science competitions like Kaggle, AV Hackathon, and CrowdAnalytix.

More: Learn Gradient Boosting and AdaBoost in Detail

Python Code

#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create Gradient Boosting Classifier object
model= GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)

R Code

library(caret)
x <- cbind(x_train,y_train)
# Fitting model
fitControl <- trainControl( method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
predicted= predict(fit,x_test,type= "prob")[,2]

Closing Remarks

GradientBoostingClassifier and Random Forest are two different ensemble tree classifiers (boosting and bagging, respectively); people often ask about the difference between these two algorithms.

By now, I am sure you have a general idea of the commonly used machine learning algorithms. My sole intention in writing this article and providing code in R and Python was to get you started right away.

If you are keen to master machine learning, start right away: take up problems, develop a feel for the process, apply these codes, and have fun!

Machine Learning: An In-Depth, Non-Technical Guide – Part 5

By Alex Castrounis

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide-part-5/

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

Introduction

Welcome to the fifth and final chapter in a five-part series about machine learning.

In this final chapter, we will revisit unsupervised learning in greater depth, briefly discuss other fields related to machine learning, and finish the series with some examples of real-world machine learning applications.

Unsupervised Learning

Recall that unsupervised learning involves learning from data, but without the goal of prediction. This is because the data is either not given with a target response variable (label), or one chooses not to designate a response. It can also be used as a pre-processing step for supervised learning.

In the unsupervised case, the goal is to discover patterns, deep insights, understand variation, find unknown subgroups (amongst the variables or observations), and so on in the data. Unsupervised learning can be quite subjective compared to supervised learning.

The two most commonly used techniques in unsupervised learning are principal component analysis (PCA) and clustering. PCA is one approach to learning what is called a latent variable model, and is a particular version of a blind signal separation technique. Other notable latent variable modeling approaches include the expectation-maximization algorithm (EM) and the method of moments [3].

PCA

PCA produces a low-dimensional representation of a dataset by finding a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated [8]. Another way to describe PCA is that it is a transformation of possibly correlated variables into a set of linearly uncorrelated variables known as principal components [13].

Each of the components is mathematically determined and ordered by the amount of variability or variance that it is able to explain from the data. Given that, the first principal component accounts for the largest amount of variance, the second principal component the next largest, and so on.

Each component is also orthogonal to all others, which is just a fancy way of saying that they're perpendicular to each other. Think of the X and Y axes in a two-dimensional plot: both axes are perpendicular to each other, and are therefore orthogonal. While not as easy to visualize, having many principal components means having many axes that are all perpendicular to each other.

While much of the above description of principal component analysis may be a bit technical sounding, it is actually a relatively simple concept from a high level. Think of having a bunch of data in any amount of dimensions, although you may want to picture two or three dimensions for ease of understanding.

Each principal component can be thought of as an axis of an ellipse that is being built (think cloud) to contain the data (aka fit to the data), like a net catching butterflies. The first few principal components should be able to explain (capture) most of the data, with the addition of more principal components eventually leading to diminishing returns.

One of the tricks of PCA is knowing how many components are needed to summarize the data, which involves estimating when most of the variance is explained by a given number of components. Another consideration is that PCA is sensitive to feature scaling, which was discussed earlier in this series.

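scikit-learn exposes exactly this quantity as explained_variance_ratio_, so the "how many components" question can be eyeballed in a few lines (a brief sketch of mine on a standard toy dataset):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scaling

pca = PCA().fit(X)
# Fraction of total variance captured by each component, in descending order;
# the running total shows how few components already summarize most of the data.
print(pca.explained_variance_ratio_.cumsum())
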
PCA is also used for exploratory data analysis and data visualization. Exploratory data analysis involves summarizing a dataset through specific types of analysis, including data visualization, and is often an initial step in analytics that leads to predictive modeling, data mining, and so on.

Further discussion of PCA and similar techniques is out of scope of this series, but the reader is encouraged to refer to external sources for more information.

Clustering

Clustering refers to a set of techniques and algorithms used to find clusters (subgroups) in a dataset, and involves partitioning the data into groups of similar observations. The concept of ‘similar observations’ is a bit relative and subjective, but it essentially means that the data points in a given group are more similar to each other than they are to data points in a different group.

Similarity between observations is a domain specific problem and must be addressed accordingly. A clustering example involving the NFL’s Chicago Bears (go Bears!) was given in chapter 1 of this series.

Clustering is not a technique limited only to machine learning. It is a widely used technique in data mining, statistical analysis, pattern recognition, image analysis, and so on. Given the subjective and unsupervised nature of clustering, often data preprocessing, model/algorithm selection, and model tuning are the best tools to use to achieve the desired results and/or solution to a problem.

There are many types of clustering algorithms and models, which all use their own technique of dividing the data into a certain number of groups of similar data. Due to the significant difference in these approaches, the results can be largely affected, and therefore one must understand these different algorithms to some extent to choose the most applicable approach to use.

K-means and hierarchical clustering are two widely used unsupervised clustering techniques. The difference is that for k-means, a predetermined number of clusters (k) is used to partition the observations, whereas the number of clusters in hierarchical clustering is not known in advance.

Hierarchical clustering helps address the potential disadvantage of having to know or pre-determine k in the case of k-means. There are two primary types of hierarchical clustering: bottom-up (agglomerative) and top-down (divisive) [8].

Here is a visualization, courtesy of Wikipedia, of the results of running the k-means clustering algorithm on a set of data with k equal to three. Note the lines, which represent the boundaries between the groups of data.

https://commons.wikimedia.org/wiki/File:KMeans-Gaussian-data.svg

There are two types of clustering, which define the degree of grouping or containment of data. The first is called hard clustering, where every data point belongs to only one cluster and not the others. Soft clustering, or fuzzy clustering on the other hand refers to the case where a data point belongs to a cluster to a certain degree, or is assigned a likelihood (probability) of belonging to a certain cluster.

Method comparison and general considerations

What is the difference, then, between PCA and clustering? As mentioned, PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance, while clustering looks for homogeneous subgroups among the observations [8].

An interesting point to note is that in the absence of a target response, there is no way to evaluate solution performance or errors as one does in the supervised case. In other words, there is no objective way to determine if you’ve found a solution. This is a significant differentiator between supervised and unsupervised learning methods.

Predictive Analytics, Artificial Intelligence, and Data Mining, Oh My!

Machine learning is often interchanged with terms like predictive analytics, artificial intelligence, data mining, and so on. While machine learning is certainly related to these fields, there are some notable differences.

Predictive analytics is a subcategory of a broader field known as analytics in general. Analytics is usually broken into three sub-categories: descriptive, predictive, and prescriptive.

Descriptive analytics involves analytics applied to understanding and describing data. Predictive analytics deals with modeling, and making predictions or assigning classifications from data observations. Prescriptive analytics deals with making data-driven, actionable recommendations or decisions.

Artificial intelligence (AI) is a super exciting field, and machine learning is essentially a sub-field of AI due to the automated nature of the learning algorithms involved. According to Wikipedia, AI has been defined as the science and engineering of making intelligent machines, but also as the study and design of intelligent agents, where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.

Statistical learning is becoming popularized due to Stanford’s related online course and its associated books: An Introduction to Statistical Learning, and The Elements of Statistical Learning.

Machine learning arose as a subfield of artificial intelligence, while statistical learning arose as a subfield of statistics. The two fields are very similar, overlap in many ways, and the distinction is becoming less clear over time. They differ in that machine learning places greater emphasis on prediction accuracy and large-scale applications, whereas statistical learning emphasizes models and their interpretability, precision, and uncertainty [8].

Lastly, data mining is a field that’s also often confused with machine learning. Data mining leverages machine learning algorithms and techniques, but also spans many other fields such as data science, AI, statistics, and so on.

The overall goal of the data mining process is to extract patterns and knowledge from a data set, and transform it into an understandable structure for further use [26]. Data mining often deals with large amounts of data, or big data.

Machine Learning in Practice

As discussed throughout this series, machine learning can be used to create predictive models, assign classifications, make recommendations, and find patterns and insights in an unlabeled dataset. All of these tasks can be done without requiring explicit programming.

Machine learning has been successfully used in the following non-exhaustive example applications [1]:

  • Spam filtering
  • Optical character recognition (OCR)
  • Search engines
  • Computer vision
  • Recommendation engines, such as those used by Netflix and Amazon
  • Classifying DNA sequences
  • Detecting fraud, e.g., credit card and internet
  • Medical diagnosis
  • Natural language processing
  • Speech and handwriting recognition
  • Economics and finance
  • Virtually anything else you can think of that involves data

In order to apply machine learning to solve a given problem, the following steps (or a variation of them) should be taken, making use of the machine learning elements discussed throughout this series.

  1. Define the problem to be solved and the project’s objective. Ask lots of questions along the way!
  2. Determine the type of problem and type of solution required.
  3. Collect and prepare the data.
  4. Create, validate, tune, test, assess, and improve your model and/or solution. This process should be driven by a combination of technical (stats, math, programming), domain, and business expertise.
  5. Discover any other insights and patterns as applicable.
  6. Deploy your solution for real-world use.
  7. Report on and/or present results.

If you encounter a situation where you or your company can benefit from a machine learning-based solution, simply approach it using these steps and see what you come up with. You may very well wind up with a super powerful and scalable solution!

Summary

Congratulations to those that have read all five chapters in full! I would like to thank you very much for spending your precious time joining me on this machine learning adventure.

This series took me a significant amount of time to write, so I hope that this time has been translated into something useful for as many people as possible.

At this point, we have covered virtually all major aspects of the entire machine learning process at a high level, and at times even went a little deeper.

If you were able to understand and retain the content in this series, then you should have absolutely no problem participating in any conversation involving machine learning and its applications. You may even have some very good opinions and suggestions about different applications, methods, and so on.

Despite all of the information covered in this series, and the details that were out of scope, machine learning and its related fields are in practice also somewhat of an art. There are many decisions to be made along the way, customized techniques to employ, and creative strategies to devise in order to best solve a given problem.

A high-quality practitioner should also have strong business acumen and expert-level domain knowledge. Problems involving machine learning are just as much about asking questions as they are about finding solutions. If the question is wrong, then the solution will be as well.

Thank you again, and happy learning (with machines)!

About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.


References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. 3 Ways to Test the Accuracy of Your Predictive Models
  6. Practical Machine Learning Online Course – Johns Hopkins University
  7. Machine Learning Online Course – Stanford University
  8. Statistical Learning Online Course – Stanford University
  9. Latent variable model
  10. Wikipedia: Cluster analysis
  11. Wikipedia: Expectation maximization algorithm
  12. Wikipedia: Method of moments
  13. Wikipedia: Principal component analysis
  14. Wikipedia: Exploratory data analysis

Machine Learning: An In-Depth, Non-Technical Guide – Part 4

By Alex Castrounis

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide-part-4/

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

Introduction

Welcome to the fourth chapter in a five-part series about machine learning.

In this chapter, we will take a deeper dive into model evaluation and performance metrics, and potential prediction-related errors that one may encounter.

Residuals and Classification Results

Before digging deeper into model performance and error types, we must first discuss the concept of residuals and errors for regression, positive and negative classifications for classification problems, and in-sample versus out-of-sample measurements.

Any reference to models, metrics, or errors computed with respect to the data used to train, validate, or tune a predictive model (i.e., data you have) is called in-sample. Conversely, reference to test data metrics and errors, or new data in general is called out-of-sample (i.e., data you don’t have).

Recall that regression involves predicting a continuous valued output (response) based on some set of input variables (features/predictors). The difference between the model’s predicted response value and the actual observed response value from the in-sample data is called the residual for each point, and residuals refers collectively to all of the differences between all predicted and actual values. Each out-of-sample (new/test data) difference is called a prediction error instead of residual.

For the classification case, and for simplicity, we will only discuss binary classification (two classes). Prior to performing classification on data observations, one must define what is a positive classification and what is a negative classification. In the case of spam or ham (i.e., not spam), spam may be the positive designation and ham is the negative.

If a model predicts an incoming email as being spam, and it really is spam, then that’s considered a true positive. Positive since the model predicted spam (the positive class), and true because the actual class matched the prediction. Conversely, if an incoming email is labeled spam when it’s actually not spam, it is considered a false positive.

Given this, we can see that the results of a classification model on new data can fall into four potential buckets. These include: true positives, false positives (type 1 error), true negatives, and false negatives (type 2 error). In all four cases, true or false refers to whether the actual class matched the predicted class, and positive or negative refers to which classification was assigned to an observation by the model.

Note that false is synonymous with error in this case since the model failed to predict correctly.

Model Performance Overview

Now that we’ve covered residuals and classification result types, we will begin the discussion of model performance metrics that are based on these concepts.

Here is a non-exhaustive list of model evaluation methods, visualizations, and performance metrics that are used in machine learning and predictive analytics. They are categorized by their most common use case, but some may apply to more than one category (e.g., accuracy).

In addition to model evaluation, many of these can also be used for model comparison, selection, and tuning. Many of these are very powerful when combined with the cross-validation technique described earlier in this series.

  • Regression performance
    • R2 and adjusted R2 (aka explained variance)
    • Mean squared error (MSE), or root mean squared error (RMSE)
    • Mean error, or mean absolute error
    • Median error, or median absolute error
  • Classification performance
    • Confusion matrix
    • Precision
    • Recall (aka sensitivity)
    • Specificity
    • Accuracy
    • Lift
    • Area under the ROC curve (AUC)
    • F-score
    • Log-loss
    • Average precision
    • Precision/recall break-even point
    • Root mean squared error (RMSE)
    • Mean cross entropy
    • Probability calibration
  • Bias variance tradeoff and model complexity
    • Validation curve
    • Learning curve
    • Residual sum of squares
    • Goodness-of-fit metrics
  • Model validation and selection
    • Mallow’s Cp
    • Akaike information criterion (AIC)
    • Bayesian information criterion (BIC)

Performance metrics should be chosen based on the problem domain, project goals, and the business objectives. Unfortunately there isn’t a one-size-fits-all approach, and often there are tradeoffs to consider.

While a discussion of all of these methods and metrics is out of scope for this series, we will cover a few key ones next.

Model Performance Evaluation Metrics

Regression

There are many metrics for determining model performance for regression problems, but the most commonly used is the mean squared error (MSE), or its variation the root mean squared error (RMSE), which is calculated by taking the square root of the mean squared error. The root mean squared error is typically preferred, since taking the square root puts the error measurement in the same units as the response variable.

The error in this case is the difference in value between a given model prediction and its actual value for an out-of-sample observation. The mean squared error is therefore the average of all of the squared errors across all new observations, which is the same as adding all of the squared errors (sum of squares) and dividing by the number of observations.

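Written out, that is MSE = (1/n) * Σ(actual − predicted)² and RMSE = √MSE, which takes two lines to compute (a quick sketch of mine with made-up numbers):

import numpy as np

actual    = np.array([3.0, 5.0, 7.5, 10.0])   # observed out-of-sample responses
predicted = np.array([2.5, 5.5, 7.0, 11.0])   # model predictions for the same points

mse = np.mean((actual - predicted) ** 2)  # average of the squared errors
rmse = np.sqrt(mse)                       # back in the units of the response
print(mse, rmse)
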
In addition to being used as a stand-alone performance metric, mean squared error (or RMSE) can also be used for model selection, controlling model complexity, and model tuning. Often many models are created and evaluated (e.g., cross-validation), and then MSE (or similar metric) is plotted on the y-axis, with the tuning or validation parameter given on the x-axis.

The tuning or validation parameter is changed in each model creation and evaluation step, and the plot described above can help determine the ideal tuning parameter value. The number of predictors is a great example of a potential tuning parameter in this case.

Before moving on to classification, it is worth mentioning R2 briefly. R2 is often thought of as a measure of model performance, but it's actually not. R2 is a measure of the amount of variance explained by the model, and is given as a number between 0 and 1. A value of 1 means the model explains all of the data perfectly, but when computed on training data this is more an indication of potential overfitting than of high predictive performance.

As discussed earlier, the more complex the model, the more it tends to fit the data better and potentially overfit, contributing additional model variance. Given this, adjusted R2 is a more robust and reliable metric in that it adjusts for increases in model complexity (e.g., adding more predictors), so that one can better gauge underlying model improvement in light of the increased complexity.

Classification

Recall the different results from a binary classifier, which are true positives, true negatives, false positives, and false negatives. These are often shown in a confusion matrix. Here is a very generalized and comprehensive example of one from Wikipedia, and note that the graphic is shown with concepts and metrics, and not actual data.

And here is an example from Wikipedia with the values filled in [30] for different classifier models evaluated against 200 observations. Note the calculation and variation of the metrics across the different models.

A confusion matrix is conceptually the basis of many classification performance metrics as shown. We will discuss a few of the more popular ones associated with machine learning here.

Accuracy is a key measure of performance, and is more specifically the rate at which the model is able to predict the correct value (classification or regression) for a given data point or observation. In other words, accuracy is the proportion of correct predictions out of all predictions made.

The other two metrics from the confusion matrix worth discussing are precision and recall. Precision (positive predictive value) is the ratio of true positives to the total amount of positive predictions made (i.e., true or false). Said another way, precision measures the proportion of accurate positive predictions out of all positive predictions made.

Recall on the other hand, or true positive rate, is the ratio of true positives to the total amount of actual positives, whether predicted correctly or not. So in other words, recall measures the proportion of accurate positive predictions out of all actual positive observations.

A metric that is associated with precision and recall is called the F-score (also called F1 score), which combines them mathematically, and somewhat like a weighted average, in order to produce a single measure of performance based on the simultaneous values of both. Its values range from 0 (worst) to 1 (best).

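These definitions reduce to simple ratios over the four confusion-matrix buckets; scikit-learn also computes them directly (a compact sketch of mine with toy labels):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual classes (1 = positive, e.g., spam)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

print(accuracy_score(y_true, y_pred))   # correct predictions / all predictions
print(precision_score(y_true, y_pred))  # true positives / all predicted positives
print(recall_score(y_true, y_pred))     # true positives / all actual positives
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
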
Another important concept to know about is the receiver operating characteristic, which when plotted, results in what’s known as an ROC curve (shown below, image courtesy of BOR at the English language Wikipedia).

An ROC curve is a two-dimensional plot of sensitivity (recall, or true positive rate) vs. 1 − specificity (the false positive rate). The area under the curve is referred to as the AUC, and is a numeric metric used to represent the quality and performance of the classifier (model).

By BOR at the English language Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=10714489

An AUC of 0.5 is essentially the same as random guessing without a model, whereas an AUC of 1.0 is considered a perfect classifier. Generally, the higher the AUC value the better, and an AUC above 0.8 is considered quite good.

The higher the AUC value, the closer the curve gets to the upper left corner of the plot. One can easily see from the ROC curves then that the goal is to find and tune a model that maximizes the true positive rate, while simultaneously minimizing the false positive rate. Said another way, the goal as shown by the ROC curve is to correctly predict as many of the actual positives as possible, while also predicting as many of the actual negatives as possible, and therefore minimize errors (incorrect classifications) for both.

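Given predicted probabilities rather than hard labels, the AUC is a one-liner in scikit-learn (a small sketch of mine; the scores are invented):

from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probability of the positive class

# 1.0 means positives are perfectly ranked above negatives; 0.5 is random guessing.
print(roc_auc_score(y_true, y_score))
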
As mentioned previously in this series, model performance can be measured in many ways, and the method used should be chosen based on project goals, business domain considerations, and so on.

It is also worth noting that according to many experts, different performance metrics are thought to be biased for varying reasons. Given the breadth and complexity of this topic, the reader is encouraged to refer to external resources for further information on performance evaluation and the tradeoffs involved.

Error Analysis and Tradeoffs

There are multiple types of errors associated with machine learning and predictive analytics. The primary types are in-sample and out-of-sample errors. In-sample errors (aka resubstitution errors) are the error rate found from the training data, i.e., the data used to build predictive models.

Out-of-sample errors (aka generalization errors) are the error rates found on a new data set, and are the most important since they represent the potential performance of a given predictive model on new and unseen data.

In-sample error rates may be very low and seem to be indicative of a high-performing model, but one must be careful, as this may be due to overfitting as mentioned, which would result in a model that is unable to generalize well to new data.

Training and validation data is used to build, validate, and tune a model, but test data is used to evaluate model performance and generalization capability. One very important point to note is that prediction performance and error analysis should only be done on test data, when evaluating a model for use on non-training or new data (out-of-sample).

Generally speaking, model performance on training data tends to be optimistic, and therefore data errors will be less than those involving test data. There are tradeoffs between the types of errors that a machine learning practitioner must consider and often choose to accept.

For binary classification problems, there are two primary types of errors. Type 1 errors (false positives) and Type 2 errors (false negatives). It’s often possible through model selection and tuning to increase one while decreasing the other, and often one must choose which error type is more acceptable. This can be a major tradeoff consideration depending on the situation.

A typical example of this tradeoff dilemma involves cancer diagnosis, where the positive diagnosis of having cancer is based on some test. In this case, a false positive means that someone is told that they have cancer when they do not. Conversely, the false negative case is when someone is told that they do not have cancer when they actually do.

If no model is perfect, then in the example above, which is the more acceptable error type? In other words, which one can we tolerate to a greater degree?

Telling someone they have cancer when they don’t can result in tremendous emotional distress, stress, additional tests and medical costs, and so on. On the other hand, failing to detect cancer in someone that actually has it can mean the difference between life and death.

In the spam or ham case, neither error type is nearly as serious as the cancer case, but typically email vendors err slightly more on the side of letting some spam get into your inbox as opposed to you missing a very important email because the spam classifier is too aggressive.

Summary

In this chapter, we have discussed many concepts and metrics associated with model evaluation, performance, and error analysis.

The fifth and final chapter of this series will revisit unsupervised learning in greater detail, followed by an overview of similar and highly related fields to machine learning. This series will conclude with an overview of machine learning as used in real world applications.

Stay tuned!

About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.


References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. 3 Ways to Test the Accuracy of Your Predictive Models
  6. Practical Machine Learning Online Course – Johns Hopkins University
  7. Machine Learning Online Course – Stanford University
  8. Statistical Learning Online Course – Stanford University
  9. Wikipedia: Type I and type II errors
  10. Wikipedia: Accuracy Paradox
  11. Wikipedia: Errors and Residuals
  12. Wikipedia: Information Retrieval
  13. Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria
  14. Wikipedia: Sensitivity and Specificity
  15. Wikipedia: Accuracy and precision
  16. Wikipedia: Precision and recall
  17. Wikipedia: F1 score
  18. Wikipedia: Residual sum of squares
  19. Wikipedia: Cohen’s kappa
  20. Wikipedia: Learning Curve
  21. Wikipedia: Coefficient of determination, aka R2
  22. Wikipedia: Mallows’s Cp
  23. Wikipedia: Bayesian information criterion
  24. Wikipedia: Akaike information criterion
  25. Wikipedia: Root-mean-square deviation
  26. Wikipedia: Knowledge Extraction
  27. Wikipedia: Data Mining
  28. Wikipedia: Confusion Matrix
  29. Simple guide to confusion matrix terminology
  30. Wikipedia: Receiver operating characteristic

Machine Learning: An In-Depth, Non-Technical Guide – Part 3

By Alex Castrounis

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide-part-3/

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

Introduction

Welcome to the third chapter in a five-part series about machine learning.

In this chapter, we’ll continue our machine learning discussion, and focus on problems associated with overfitting data, as well as controlling model complexity, a model evaluation and errors introduction, model validation and tuning, and improving model performance.

Overfitting

Overfitting is one of the greatest concerns in predictive analytics and machine learning. Overfitting refers to a situation where the model chosen to fit the training data fits too well, and essentially captures all of the noise, outliers, and so on.

The consequence of this is that the model will fit the training data very well, but will not accurately predict cases not represented by the training data, and therefore will not generalize well to unseen data. This means that the model performance will be better with the training data than with the test data.

A model is said to have high variance when it leans more towards overfitting, and conversely has high bias when it doesn’t fit the data well enough. A high variance model will tend to be quite flexible and overly complex, while a high bias model will tend to be very opinionated and overly simplified. A good example of a high bias model is fitting a straight line to very nonlinear data.

In both cases, the model will not make very accurate predictions on new data. The ideal situation is to find a model that is not overly biased, nor does it have a high variance. Finding this balance is one of the key skills of a data scientist.

Overfitting can occur for many reasons. A common one is that the training data consists of many features relative to the number of observations or data points. In this case, the data is relatively wide as compared to long.

To address this problem, reducing the number of features can help, or finding more data if possible. The downside to reducing features is that you lose potentially valuable information.

Another option is to use a technique called regularization, which will be discussed later in this series.

Controlling Model Complexity

Model complexity can be characterized by many things, and is a bit subjective. In machine learning, model complexity often refers to the number of features or terms included in a given predictive model, as well as whether the chosen model is linear, nonlinear, and so on. It can also refer to the algorithmic learning complexity or computational complexity.

Overly complex models are less easily interpreted, at greater risk of overfitting, and will likely be more computationally expensive.

There are some really sophisticated and automated methods by which to control, and ultimately reduce model complexity, as well as help prevent overfitting. Some of them are able to help with feature and model selection as well.

These methods include linear model and subset selection, shrinkage methods (including regularization), and dimensionality reduction.

Regularization essentially keeps all features, but reduces (or penalizes) the effect of some features on the model’s predicted values. The reduced effect comes from shrinking the magnitude, and therefore the effect, of some of the model’s term’s coefficients.

The two most popular regularization methods are ridge regression and lasso. Both methods involve adding a tuning parameter (Greek lambda) to the model, which is designed to impose a penalty on each term’s coefficient based on its size, or effect on the model.

The larger the term’s coefficient size, the larger the penalty, which basically means the more the tuning parameter forces the coefficient to be closer to zero. Choosing the value to use for the tuning parameter is critical and can be done using a technique such as cross-validation.

The lasso technique works in a very similar way to ridge regression, but can also be used for feature selection as well. This is due to the fact that the penalty term for each predictor is calculated slightly differently, and can result in certain terms becoming zero since their coefficients can become zero. This essentially removes those terms from the model, and is therefore a form of automatic feature selection.

Ridge regression or lasso techniques may work better for a given situation. Often the lasso works better for data where the response is best modeled as a function of a small number of the predictors, but this isn’t guaranteed. Cross-validation is a great technique for evaluating one technique versus the other.

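Both methods are available in scikit-learn, where the tuning parameter appears as alpha (a minimal sketch of mine; the data and alpha values are arbitrary, and in practice alpha would be chosen by cross-validation):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)  # only 2 of 10 features matter

ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X, y)  # can set some coefficients exactly to zero

print(ridge.coef_.round(2))  # all terms kept, just smaller
print(lasso.coef_.round(2))  # irrelevant terms dropped: automatic feature selection
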
Given a certain number of predictors (features), there is a calculable number of possible models that can be created with only a subset of the total predictors. An example is when you have 10 predictors, but want to find all possible models using only 2 of the 10 predictors.

Doing this, and then selecting one of the models based on the smallest test error, is known as subset selection, or sometimes as best subset selection. Note that a very useful plot for subset selection is when plotting the residual sum of squares (discussed later) for each model against the number of predictors.

When the number of predictors gets large enough, best subset selection becomes unable to deal with the huge number of possible model combinations for a given subset of predictors. In this case, another method known as stepwise selection can be used. There are two primary versions, forward and backward stepwise selection.

In forward stepwise selection, predictors are added to the model one at a time starting at zero predictors, until all of the predictors are included. Backwards stepwise selection is the opposite, and involves starting with a model including all predictors, and then removing a single predictor at each step.

The model performance is evaluated at each step in both cases. In both subset selection and stepwise selection, the test error is used to determine the best model. There are many ways to estimate test errors, which will be discussed later in this series.

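Forward and backward stepwise selection of this kind can be sketched with scikit-learn's SequentialFeatureSelector, which scores candidate models by cross-validation at each step (an illustrative sketch of mine, not the only way to do subset selection):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Starting from zero predictors, greedily add the one that most improves
# cross-validated performance, stopping at the requested model size.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the chosen predictors
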
There is a concept that deals with highly dimensional data (i.e., large number of features) known as the curse of dimensionality. The curse of dimensionality refers to the fact that the computational speed and memory required increases exponentially as the number of data dimensions (features) increases.

This can manifest itself as a problem where a machine learning algorithm does not scale well to higher dimensional data [11]. One way to deal with this issue is to choose a different algorithm that can scale better with the data. The other is a technique known as dimensionality reduction.

Dimensionality Reduction

Dimensionality reduction is a technique used to reduce the number of features included in the machine learning process. It can help reduce complexity and computational cost, and increase the speed of machine learning algorithms. It can be thought of as a technique that transforms the original predictors into a new, smaller set of predictors, which are then used to fit a model.

Principal component analysis (PCA) was discussed previously in the context of feature selection, but is also a widely used dimensionality reduction technique. It helps reduce the number of features (i.e., dimensions) by transforming the original features into a new set of components, sorted in descending order by the amount of variance in the data that each explains. Cross-validation is a great way to determine the number of principal components to include in the model.

An example of this would be a dataset where each observation is described by ten features, but just three components capture the majority of the data's variance, and are therefore adequate for building a model and generating accurate predictions.

Note that people sometimes use PCA to prevent overfitting since fewer features implies that the model is less likely to overfit. While PCA may work in this context, it is not a good approach and is therefore not recommended. Regularization should be used to address overfitting concerns instead8.
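
Here is a minimal PCA sketch along the lines of the ten-feature example above; the synthetic dataset and the choice of three components are illustrative assumptions.

    # A minimal sketch of PCA-based dimensionality reduction.
    from sklearn.datasets import make_regression
    from sklearn.decomposition import PCA

    X, _ = make_regression(n_samples=200, n_features=10, random_state=0)

    # Project the ten original features onto the three components that
    # explain the most variance, in descending order.
    pca = PCA(n_components=3)
    X_reduced = pca.fit_transform(X)

    print("reduced shape:", X_reduced.shape)  # (200, 3)
    print("variance explained per component:", pca.explained_variance_ratio_)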

Model Evaluation and Performance

Assuming you are working with high quality, unbiased, and representative data, the next most important aspects of predictive analytics and machine learning are measuring model performance, improving it if needed, and understanding the potential errors that are often encountered.

We will have an introductory discussion here about model performance, improvement, and errors, but will continue with much greater detail on these topics in the next chapter.

Model performance typically describes how well a model is able to make predictions on unseen data (e.g., test, but NOT training data), and there are multiple methods and metrics used to assess and gauge it. A key measure of model performance is the estimate of the model's test error.

The test error can be estimated either indirectly or directly. It can be estimated and adjusted indirectly by making changes that affect the training error, since the training error gives some indication of the model's bias and variance, and therefore of how much it is under- or overfitting.

Recall that the more the model overfits the data (high variance), the less well the model will generalize to unseen data. Given that, the assumption is that reducing variance should improve the test error as well.

The test error can also be estimated directly by testing the model on the held-out test data; this approach usually works best in conjunction with a resampling method such as cross-validation, which we'll discuss later.

Estimating a model’s test error not only helps determine a model’s performance and accuracy, but is also a very powerful way to select a model too.

Improving Model Performance and Ensemble Learning

There are many ways to improve a model's performance. The quality and quantity of the data used have a huge, if not the biggest, impact on model performance, but sometimes these two can't easily be changed.

Other major influencers on model performance include algorithm tuning, feature engineering, cross-validation, and ensemble methods.

Algorithm tuning refers to the process of tweaking certain values (often called hyperparameters) that effectively initialize and control how a machine learning algorithm learns and generates predictive models. Tuning is typically performed against the separate validation dataset, with final performance then measured on the test dataset.

Since most algorithm tuning parameters are algorithm-specific and sometimes very complex, a detailed discussion is out of scope for this article, but note that the lambda parameter described for regularization is one such tuning parameter.
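
As a sketch of what tuning can look like in practice, the code below runs a cross-validated grid search over the regularization parameter (lambda, called alpha in scikit-learn). The dataset and candidate values are illustrative assumptions.

    # A minimal sketch of tuning a regularization parameter by grid search.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    X, y = make_regression(n_samples=300, n_features=10, noise=10.0,
                           random_state=0)

    # Try several candidate lambda (alpha) values and keep the best one
    # according to cross-validated mean squared error.
    search = GridSearchCV(Ridge(),
                          param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
                          cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    print("best alpha:", search.best_params_["alpha"])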

Ensemble learning, as mentioned in an earlier chapter, deals with combining the results from multiple learning models in order to improve predictive performance, either by averaging them (regression) or, in the classification case, through a voting process where the majority vote wins.

Two of the most common ensemble methods are bagging (aka bootstrap aggregating) and boosting. Both are helpful with improving model performance and in reducing variance (overfitting) and bias (underfitting).

Bagging is a technique in which the training data is sampled with replacement multiple times. Each time, a new training dataset is created and a model is fitted to that sample. The models are then combined to produce the overall model output, which can be used to measure model performance.

Boosting is a technique designed to transform a set of so-called weak learners into a single strong learner. In plain English, think of a weak learner as a model that predicts only slightly better than random guessing, and a strong learner as a model that predicts with a degree of accuracy well beyond random guessing.

Although the details can get complicated, boosting basically works by iteratively creating weak models and adding them to the single strong learner. As this happens, model accuracy is tested and weightings are applied so that future learners focus on the cases that were previously predicted poorly.

Another very popular ensemble method is known as random forests. Random forests are essentially the combination of decision trees and bagging.
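
Here is a minimal random forest sketch; the synthetic dataset and settings are illustrative assumptions.

    # A minimal sketch of a random forest: bagged decision trees with
    # random feature subsets, combined by majority vote.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Each of the 100 trees is trained on a bootstrap sample of the
    # training data.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    print("test accuracy:", forest.score(X_test, y_test))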

Kaggle is arguably the world's most prestigious data science competition platform, and features competitions created and sponsored by many of the most notable Silicon Valley tech companies, as well as by other very well-known corporations. Ensemble methods such as random forests and boosting have enjoyed very high success rates in winning these competitions.

Model Validation and Resampling Methods

Model validation is a very important part of the machine learning process. Validation methods consist of creating models and testing them on a validation dataset.

The resulting validation-set error provides an estimate of the test error, and is typically assessed using mean squared error (MSE) in the case of a quantitative response, and the misclassification rate in the case of a qualitative (discrete) response.

Many validation techniques are categorized as resampling methods, which involve refitting models to different samples formed from a set of training data.

Probably the most popular and noteworthy technique is called cross-validation. The key idea of cross-validation is that the model’s accuracy on the training set is optimistic, and that a better estimate comes from the model’s accuracy on the test set. The idea then is to estimate the test set accuracy while in the model training stage.

The process involves repeatedly splitting the data into different training and test sets, building the model on the training set, evaluating it on the test set, and then averaging the estimated errors across the repetitions.

In addition to model validation and helping to prevent overfitting, cross-validation can be used for feature selection, model selection, model parameter tuning, and comparing different predictors.

A popular special case of cross-validation is known as k-fold cross-validation. This technique involves selecting a number k, which represents the number of equally sized partitions that the original data is divided into. A single partition is then designated as the validation dataset (i.e., for testing the model), while the remaining k-1 partitions are used as training data; this is repeated k times so that each partition serves as the validation set exactly once, and the resulting error estimates are averaged.

Note that typically the larger the chosen k, the less bias, but more variance, and vice versa. In the case of cross-validation, random sampling is done without replacement.
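
A minimal k-fold sketch, assuming a synthetic dataset and k=5; setting the number of folds equal to the number of observations would give the leave-one-out case discussed below.

    # A minimal sketch of 5-fold cross-validation.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=300, n_features=10, noise=10.0,
                           random_state=0)

    # Each fold serves as the validation set once; errors are averaged.
    scores = cross_val_score(LinearRegression(), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print("estimated test MSE:", -scores.mean())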

There is another technique that involves random sampling with replacement that is known as the bootstrap. The bootstrap technique tends to underestimate the error more than cross-validation.
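
For comparison, here is a minimal sketch of drawing a single bootstrap sample; the tiny dataset is an illustrative assumption.

    # A minimal sketch of one bootstrap resample (sampling with replacement).
    import numpy as np
    from sklearn.utils import resample

    X = np.arange(10).reshape(-1, 1)
    y = np.arange(10)

    # A bootstrap sample is the same size as the original data, so some
    # observations appear multiple times and others not at all.
    X_boot, y_boot = resample(X, y, replace=True, random_state=0)
    print("bootstrap sample of y:", y_boot)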

Another special case is when k=n, i.e., when k equals the number of observations. In this case, the technique is known as leave-one-out cross-validation (LOOCV).

Summary

In this chapter, we have discussed many concepts and techniques associated with model evaluation, validation, complexity, and improvement.

Chapter four of this series will provide a much deeper dive into concepts and metrics related to model performance evaluation and error analysis.

Stay tuned!


About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.

References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. Wikipedia: Feature Selection
  6. Wikipedia: Cross-validation
  7. Practical Machine Learning Online Course – Johns Hopkins University
  8. Machine Learning Online Course – Stanford University
  9. Statistical Learning Online Course – Stanford University
  10. Wikipedia: Regularization
  11. Wikipedia: Curse of dimensionality
  12. Wikipedia: Bagging, aka Bootstrap Aggregating
  13. Wikipedia: Boosting

Machine Learning: An In-Depth, Non-Technical Guide – Part 2

By Alex Castrounis

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide-part-2/

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

Introduction

Welcome to the second chapter in a five-part series about machine learning.

In this chapter, we will briefly introduce model performance concepts, and then focus on the following parts of the machine learning process: data selection, preprocessing, feature selection, model selection, and model tradeoff considerations.

Model Performance Introduction

Model performance can be defined in many ways, but in general, it refers to how effectively the model is able to achieve the solution goals for a given problem (e.g., prediction, classification, anomaly detection, recommendation).

Since the goals can differ for each problem, the measure of performance can differ as well. Some common performance measures include accuracy, precision, recall, the receiver operating characteristic (ROC), and so on. These will be discussed in much greater detail throughout the rest of this series.

Data Selection and Preprocessing

Some say that garbage in equals garbage out, and this is definitely the case. It doesn't matter how carefully you build a predictive model if the data used to build it is non-representative, low quality, error ridden, and so on. The quality, amount, preparation, and selection of data are critical to the success of a machine learning solution.

The first step to ensure success is to avoid selection bias. Selection bias occurs when the samples used to produce the model are not fully representative of cases that the model may be used for in the future, particularly with new and unseen data.

Data is typically messy and often contains missing values, useless values (e.g., NA), outliers, and so on. Prior to modeling and analysis, raw data needs to be parsed, cleaned, transformed, and pre-processed. This is typically referred to as data munging or data wrangling.

Missing data is often imputed, a technique used to fill in, or substitute for, missing values, and one that is conceptually very similar to interpolation.
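
A minimal imputation sketch, assuming a tiny dataset with missing values and mean substitution as the strategy.

    # A minimal sketch of imputing missing values with the feature mean.
    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan]])

    # Each NaN is substituted with the mean of its column.
    imputer = SimpleImputer(strategy="mean")
    print(imputer.fit_transform(X))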

In addition, sometimes feature values are scaled (feature scaling) and/or standardized (sometimes loosely referred to as normalization). The most typical method of standardizing feature data is to subtract the mean of a given feature's values from each individual observation value, and then divide by the standard deviation of that feature's values.

Feature scaling is used to bring the different features' value ranges into similarity, both to help prevent certain features from dominating models and predictions, and to prevent computational problems when running machine learning optimization algorithms (speed, convergence, etc.).
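
A minimal standardization sketch, assuming a tiny dataset with features on very different scales.

    # A minimal sketch of z-score standardization: subtract the mean,
    # divide by the standard deviation.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

    # Each standardized feature ends up with mean 0 and unit variance.
    print(StandardScaler().fit_transform(X))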

Another preprocessing technique is to create dummy variables, which basically means converting qualitative variables into quantitative ones. An example is taking a color feature (e.g., green, red, and blue) and transforming it into three binary indicator (0/1) variables, one per color, with exactly one of the three set to 1 for each observation. (Simply mapping the colors to the values 1, 2, and 3 would instead impose an artificial ordering on them.) This makes it possible to perform regression with qualitative features.
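
A minimal dummy-variable sketch using pandas; the color column is an illustrative assumption.

    # A minimal sketch of creating dummy (indicator) variables.
    import pandas as pd

    df = pd.DataFrame({"color": ["green", "red", "blue", "red"]})

    # One binary column per color; each row has a 1 in exactly one of them.
    print(pd.get_dummies(df, columns=["color"]))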

Data Splitting

Recall from chapter 1 that the data used for machine learning should be split into training and test datasets, as well as an optional third validation dataset for model validation and tuning.

Choosing the size of each data set can be somewhat subjective and dependent on the overall sample size, and a full discussion is out of scope for this series. As an example however, given a training and test dataset only, some people may split the data into 80% training and 20% testing.

In general, more training data results in a better model and potential performance, and more testing data results in a greater evaluation of model performance and overall generalization capability.

Feature Selection and Feature Engineering

Once you have a representative, unbiased, cleaned, and fully prepared dataset, typical next steps include feature selection and feature engineering of the training data. Note that although discussed here, both of these techniques can also be used later in the process for improving model performance.

Feature selection is the process of selecting a subset of features from which to build a predictive regression model or classifier. This is usually done for model simplification and increased interpretability, for reduced training time and computational cost, and to help reduce the risk of overfitting and thus improve model generalization.

Basic techniques for feature selection, particularly for regression problems, involve estimates of model parameters (i.e., model coefficients) and their significance, and correlation estimates amongst features. This will be discussed further in a section about parametric models.

Some advanced techniques used for feature selection are principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA).

Principal component analysis is a statistical technique that determines which directions in the data (its principal components), in order, account for the most to the least variance. Singular value decomposition is a lower-level linear algebra algorithm that is used by PCA.

Linear discriminant analysis is closely related to PCA in that they’re both linear transformation techniques. PCA however is more general and is not concerned with class labels (unsupervised), whereas LDA is more specific and is concerned with class labels (supervised).

Feature engineering includes feature selection as a sub-category, but also involves other aspects such as creating new features, transforming raw data into domain-specific and interpretable features, and so on.

Parametric Models and Feature Selection

Many machine learning models are a type of parametric model. A good example is the equation describing a line (i.e., a linear model)9, y = α + βx + ε, which includes the slope (β), the intercept coefficient (α), and an error term (ε).

With parametric models, the coefficients of the terms are called the parameters, and are usually designated by the Greek letter beta and a subscript (e.g., β1 … βn). In regression problems, the parameters are called regression coefficients.

Many models also include an error term, indicated by the Greek letter epsilon. Simply stated, this error term is meant to account for the difference between the model’s predicted value and the actual observed value for a given set of input values.

Understanding the concept of model parameters is very important for supervised learning, because machine learning differs from other techniques in that it learns model parameters automatically. It does this by estimating the optimal set of model parameters that best explains the relationship between the response variable and the independent feature variables, using optimization techniques as discussed in chapter one.

In regression problems, a p-value is assigned to each of the estimated model parameters (regression coefficients), and this value indicates the statistical significance of each coefficient, i.e., the strength of the evidence that the coefficient has a genuine influence on the response.

Coefficients with a p-value greater than some chosen threshold, typically 0.05 or 0.10, are often not included in the model since they will most likely not help explain (predict) the response. This is one key way to perform feature selection with parametric models.
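
A minimal sketch of this kind of p-value screening with statsmodels; the synthetic data and the 0.05 threshold are illustrative assumptions.

    # A minimal sketch of p-value-based feature screening.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)  # third feature is pure noise

    model = sm.OLS(y, sm.add_constant(X)).fit()
    # Coefficients with p-values above the chosen threshold (e.g., 0.05)
    # are candidates for removal from the model.
    print(model.pvalues)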

Another technique involves estimating the correlation of the features with respect to the response, and removing redundant and highly correlated features. The idea is that including only one of a pair of correlated features (the most significant) should be enough to explain the impact of both of the correlated features on the response.
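
A minimal correlation-filter sketch with pandas; the data and the 0.9 cutoff are illustrative assumptions.

    # A minimal sketch of dropping one of each pair of highly correlated
    # features.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    a = rng.normal(size=200)
    df = pd.DataFrame({"a": a,
                       "b": a + 0.01 * rng.normal(size=200),  # nearly duplicates "a"
                       "c": rng.normal(size=200)})

    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    print("dropping:", to_drop)  # expected: ['b']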

Model Selection

While the algorithm or model that you choose may not matter as much as other things discussed in this series (e.g., amount of data, feature selection, etc.), here is a list of things to take into account when choosing a model.

  • Interpretability
  • Simplicity (aka parsimony)
  • Accuracy
  • Speed (training, testing, and real-time processing)
  • Scalability

A good approach is to start with simple models and increase model complexity only when necessary. Generally, simplicity should be preferred unless you can achieve major accuracy gains through model selection.

Relatively simple models include simple and multiple linear regression for regression problems, and logistic and multinomial regression for classification problems.

A basic early model selection choice for supervised learning is whether to use a linear or a nonlinear model. Nonlinear models are best when the effects on the response from certain feature values, and their combinations, are nonlinear. In practice, real-world relationships are rarely perfectly linear.

Beyond basic linear models, variation in the response can also be due to interaction effects, which means that the response depends not only on certain individual features (main effects), but also on combinations of certain features (interaction effects). Such a combination is represented in a model by multiplying the feature values for each interaction term together with a term coefficient (e.g., βx1x2).

Once interaction terms are included, the significance of the interactions in explaining the response, and whether to include them, can be determined through the usual methods such as p-value estimation. Note that there is a concept known as the hierarchy principle, which basically says that if an interaction is included in a model, the associated main effects should also be included.
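
A minimal interaction-term sketch; the synthetic data with a true x1·x2 effect is an illustrative assumption.

    # A minimal sketch of adding an interaction term (x1*x2) to a linear model.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = 1.0 * X[:, 0] + 2.0 * X[:, 1] + 3.0 * X[:, 0] * X[:, 1]

    # interaction_only=True adds the x1*x2 column without squared terms;
    # include_bias=False leaves the intercept to LinearRegression.
    X_int = PolynomialFeatures(degree=2, interaction_only=True,
                               include_bias=False).fit_transform(X)
    model = LinearRegression().fit(X_int, y)
    print("coefficients (x1, x2, x1*x2):", np.round(model.coef_, 2))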

While linear assumptions are often good enough and can produce adequate results, most real life feature/response relationships are nonlinear, and sometimes nonlinear models are required to get an acceptable level of accuracy. In this case, there are a wide variety of models to choose from.

Nonlinear models can include different degree polynomials, step functions, piecewise polynomials, splines, local regression (aka LOESS models), and generalized additive models (GAM). Due to the technical nature of nonlinear modeling, familiarity with the above model approaches by name should suffice for the purpose of this series.

Other notable model choices include decision trees, support vector machines (SVM), and artificial neural networks (modeled after biological neural networks, an interconnected system of neurons). Decision trees can be highly interpretable, while the latter two are complex, black-box methods. Decision trees involve creating a series of splits based on logical decisions, starting from the most important top-level node, and visually resemble an upside-down tree.

Here is an example of a decision tree created by Stephen Milborrow, which shows the survival of passengers on board the Titanic. The term ‘sibsp’ is the number of spouses or siblings aboard, and the numbers under each leaf refer to the probability of survival and the percentage of the total observations (i.e., people on board) that the leaf contains. So the upper right leaf indicates that females had a 73% chance of survival and represented 36% of those on board.

By Stephen Milborrow (Own work) CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html), via Wikimedia Commons

The final model selection decision discussed here is whether to leverage ensemble methods for additional performance gains. These methods combine models to produce a single consensus prediction or classification, and do so through averaging or voting techniques.

Some very common ensemble methods are bagging, boosting, and random forests. Random forests are essentially bagging applied to decision trees, with the additional element of random feature subset selection. Further discussion of these methods is out of scope of this series.

Model Tradeoffs

Model accuracy is determined in many ways, and will be discussed in detail later in this series. The primary measure of model accuracy comes from estimating the test error for a given model. The accuracy improvement goal of model selection is therefore to reduce the estimated test error.

It is important to note that the goal isn't to find the absolute minimal error, but rather to find the simplest model that performs well enough. There are usually diminishing returns in trying to squeeze out the very last bit of performance. Given this, your choice of modeling approach won't always be based on the one that results in the greatest accuracy. Sometimes other important factors must be taken into account as well, including interpretability, simplicity, speed, and scalability.

Often, it’s a tradeoff choosing whether prediction accuracy or model interpretability is more important for a given application. Artificial neural networks, support vector machines, and some ensemble methods can be used to create very accurate predictive models, but are very much of a black box except to highly specialized and technical individuals.

Black box algorithms may be preferred when predictive performance is the most important goal, and it’s not necessary to explain how the model works and makes predictions. In some cases however, model interpretability is preferred, and sometimes legally mandatory.

Here is an interpretability-driven example often seen in the financial industry. Suppose a machine learning algorithm is used to accept or reject an individual’s credit card application. If the applicant is rejected and decides to file a complaint or take legal action, the financial institution will need to explain how that decision was made. While that can be nearly impossible for a neural network or SVM system, it’s relatively straightforward for decision tree-based algorithms.

In terms of training, testing, processing, and prediction speed, some algorithms and model types take more time, and require greater computing power and memory than others. In some applications, speed and scalability are critical factors, particularly in any widely used, near real-time application (e.g., eCommerce site) where a model needs to be updated fairly regularly, and that performs predictions and/or classifications at scale on the fly.

Lastly, and as previously mentioned, model simplicity (or parsimony) should always be preferred unless there is a significant and justifiable gain in performance accuracy. Simplicity usually results in quicker, more scalable, and easier to interpret models and results.

Summary

We’ve now had a solid overview of the machine learning process from selecting data and features, through selecting appropriate models for a given problem type.

Chapter three of this series will continue with the machine learning process, and in particular will focus on model evaluation, performance, improvement, complexity, validation, and more.

Stay tuned!


About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.

References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. Wikipedia: Feature Selection
  6. Practical Machine Learning Online Course – Johns Hopkins University
  7. Machine Learning Online Course – Stanford University
  8. Statistical Learning Online Course – Stanford University
  9. Wikipedia: Simple Linear Regression
  10. Stephen Milborrow (Own work)

Machine Learning: An In-Depth, Non-Technical Guide – Part 1

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide/

By Alex Castrounis

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

Introduction

Welcome! This is the first chapter of a five-part series about machine learning.

Machine learning is a very hot topic for many key reasons, not least because it provides the ability to automatically obtain deep insights, recognize unknown patterns, and create high-performing predictive models from data, all without requiring explicit programming instructions.

Despite the popularity of the subject, machine learning’s true purpose and details are not well understood, except by very technical folks and/or data scientists.

This series is intended to be a comprehensive, in-depth, and non-technical guide to machine learning, and should be useful to everyone from business executives to machine learning practitioners. It covers virtually all aspects of machine learning (and many related fields) at a high level, and should serve as a sufficient introduction or reference to the terminology, concepts, tools, considerations, and techniques of the field.

This high-level understanding is critical if you are ever involved in a decision-making process surrounding the usage of machine learning, how it can help achieve business and project goals, which machine learning techniques to use, potential pitfalls, and how to interpret the results.

Note that most of the topics discussed in this series are also directly applicable to fields such as predictive analytics, data mining, statistical learning, artificial intelligence, and so on.

Machine Learning Defined

The oft quoted and widely accepted formal definition of machine learning as stated by field pioneer Tom M. Mitchell is:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

The following is my less formal way to describe machine learning.

Machine learning is a subfield of computer science, but is often also referred to as predictive analytics, or predictive modeling1. Its goal and usage is to build new and/or leverage existing algorithms to learn from data, in order to build generalizable models that give accurate predictions, or to find patterns, particularly with new and unseen similar data.

Machine Learning Process Overview

Imagine a dataset as a table, where each row is an observation (aka measurement, data point, etc.), and the columns represent the features of that observation and their values.

At the outset of a machine learning project, a dataset is usually split into two or three subsets. The minimum subsets are the training and test datasets, and often an optional third validation dataset is created as well.

Once these data subsets are created from the primary dataset, a predictive model or classifier is trained using the training data, and then the model’s predictive accuracy is determined using the test data.

As mentioned, machine learning leverages algorithms to automatically model and find patterns in data, usually with the goal of predicting some target output or response. These algorithms are heavily based on statistics and mathematical optimization.

Optimization is the process of finding the smallest or largest value (minimum or maximum) of a function, often referred to as a loss or cost function in the minimization case10. One of the most popular optimization algorithms used in machine learning is called gradient descent, and another is known as the normal equation.
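
To make gradient descent concrete, here is a minimal sketch that fits a one-feature linear model by repeatedly stepping against the gradient of the mean squared error cost function. The data, learning rate, and iteration count are illustrative assumptions.

    # A minimal sketch of gradient descent for simple linear regression.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=100)  # true slope 3, intercept 1

    slope, intercept = 0.0, 0.0
    learning_rate = 0.1
    for _ in range(200):
        error = (slope * x + intercept) - y
        # Step each parameter against the gradient of the MSE cost.
        slope -= learning_rate * 2 * np.mean(error * x)
        intercept -= learning_rate * 2 * np.mean(error)

    print("learned slope and intercept:", round(slope, 2), round(intercept, 2))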

In a nutshell, machine learning is all about automatically learning a highly accurate predictive or classifier model, or finding unknown patterns in data, by leveraging learning algorithms and optimization techniques.

Types of Learning

The primary categories of machine learning are supervised, unsupervised, and semi-supervised learning. We will focus on the first two in this article.

In supervised learning, the data contains the response variable (label) being modeled, and the goal is to predict the value or class of new, unseen data. Unsupervised learning involves learning from a dataset that has no label or response variable, and is therefore more about finding patterns than about prediction.

As i’m a huge NFL and Chicago Bears fan, my team will help exemplify these types of learning! Suppose you have a ton of Chicago Bears data and stats dating from when the team became a chartered member of the NFL (1920) until the present (2016).

Imagine that each row of the data is essentially a team snapshot (or observation) of relevant statistics for every game since 1920. The columns in this case, and the data contained in each, represent the features (values) of the data, and may include feature data such as game date, game opponent, season wins, season losses, season ending divisional position, post-season berth (Y/N), post-season stats, and perhaps stats specific to the three phases of the game: offense, defense, and special teams.

In the supervised case, your goal may be to use this data to predict if the Bears will win or lose against a certain team during a given game, and at a given field (home or away). Keep in mind that anything can happen in football in terms of pre-game and game-time injuries, weather conditions, bad referee calls, and so on, so take this simply as an example of an application of supervised learning with a yes or no response (prediction), as opposed to determining the probability or likelihood of ‘Da Bears’ getting the win.

Since you have historic data of wins and losses (the response) against certain teams at certain football fields, you can leverage supervised learning to create a model to make that prediction.

Now suppose that your goal is to find patterns in the historic data and learn something that you don’t already know, or group the team in certain ways throughout history. To do so, you run an unsupervised machine learning algorithm that clusters (groups) the data automatically, and then analyze the clustering results.

With a bit of analysis, one may find that these automatically generated clusters seemingly group the team into the following example categories over time:

  • Strong defense, weak running offense, strong passing offense, weak special teams, playoff berth
  • Strong defense, strong running offense, weak passing offense, average special teams, playoff berth
  • Weak defense, strong all-around offense, strong special teams, missed the playoffs
  • and so on

An example of unsupervised cluster analysis would be to look for a potential reason why the team missed the playoffs in the third cluster above. Perhaps due to the weak defense? The Bears have traditionally been a strong defensive team, and some say that defense wins championships. Just saying…

In either case, each of the above classifications may be found to relate to a certain time frame, which one would expect. Perhaps the team was characterized by one of these groupings more than once throughout their history, and for differing periods of time.

To characterize the team in this way without machine learning techniques, one would have to pore through all of the historic data and stats, manually find the patterns and assign the classifications (clusters) for every year taking all data into account, and compile the information. That would definitely not be a quick and easy task.

Alternatively, you could write an explicitly coded program to pore through the data, one that has to know what team stats to consider, what thresholds to take into account for each stat, and so forth. It would take a substantial amount of time to write the code, and different programs would need to be written for every problem needing an answer.

Or… you can employ a machine learning algorithm to do all of this automatically for you in a few seconds.

Machine Learning Goals and Outputs

Machine learning algorithms are used primarily for the following types of output:

  • Clustering (Unsupervised)
  • Two-class and multi-class classification (Supervised)
  • Regression: Univariate, Multivariate, etc. (Supervised)
  • Anomaly detection (Unsupervised and Supervised)
  • Recommendation systems (aka recommendation engine)

Specific algorithms that are used for each output type are discussed in the next section, but first, let's give a general overview of each of the above output, or problem, types.

As discussed, clustering is an unsupervised technique for discovering the composition and structure of a given set of data. It is a process of clumping data into clusters to see what groupings emerge, if any. Each cluster is characterized by a contained set of data points, and a cluster centroid. The cluster centroid is basically the mean (average) of all of the data points that the cluster contains, across all features.
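
A minimal clustering sketch with k-means, whose centroids are exactly the per-cluster feature means described above; the toy data and the choice of two clusters are illustrative assumptions.

    # A minimal sketch of k-means clustering.
    import numpy as np
    from sklearn.cluster import KMeans

    # Two obvious clumps of two-dimensional points.
    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster labels:", kmeans.labels_)
    print("centroids (per-cluster means):", kmeans.cluster_centers_)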

Classification problems involve placing a data point (aka observation) into a pre-defined class or category. Sometimes classification problems simply assign a class to an observation, and in other cases the goal is to estimate the probabilities that an observation belongs to each of the given classes.

A great example of a two-class classification is assigning the class of Spam or Ham to an incoming email, where ham just means ‘not spam’. Multi-class classification just means more than two possible classes. So in the spam example, perhaps a third class would be ‘Unknown’.

Regression is just a fancy word for saying that a model will assign a continuous value (response) to a data observation, as opposed to a discrete class. A great example of this would be predicting the closing price of the Dow Jones Industrial Average on any given day. This value could be any number, and would therefore be a perfect candidate for regression.

Note that sometimes the word regression is used in the name of an algorithm that is actually used for classification problems, or to predict a discrete categorical response (e.g., spam or ham). A good example is logistic regression, which predicts probabilities of a given discrete value.
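
A minimal logistic regression sketch showing the estimated class probabilities alongside the discrete prediction; the synthetic dataset is an illustrative assumption.

    # A minimal sketch of logistic regression producing class probabilities.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=4, random_state=0)

    clf = LogisticRegression().fit(X, y)
    # Probabilities for each class, plus the discrete label they imply.
    print("P(class 0), P(class 1):", clf.predict_proba(X[:1]))
    print("predicted class:", clf.predict(X[:1]))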

Another problem type is anomaly detection. While we’d love to think that data is well behaved and sensible, unfortunately this is often not the case. Sometimes there are erroneous data points due to malfunctions or errors in measurement, or sometimes due to fraud. Other times it could be that anomalous measurements are indicative of a failing piece of hardware or electronics.

Sometimes anomalies are indicative of a real problem and are not easily explained, such as a manufacturing defect, and in this case, detecting anomalies provides a measure of quality control, as well as insight into whether steps taken to reduce defects have worked or not. In either case, there are times where it is beneficial to find these anomalous values, and certain machine learning algorithms can be used to do just that.
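
A minimal anomaly detection sketch using a one-class SVM (one of the algorithms listed later in this chapter); the data and the nu setting are illustrative assumptions.

    # A minimal sketch of anomaly detection with a one-class SVM.
    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    X_all = np.vstack([X, [[8.0, 8.0]]])  # one obviously anomalous point

    # nu roughly caps the fraction of training points treated as outliers.
    detector = OneClassSVM(nu=0.05).fit(X_all)
    # predict() returns +1 for inliers and -1 for anomalies; the far-away
    # point is expected to be flagged as -1.
    print("last point flagged as:", detector.predict([[8.0, 8.0]])[0])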

The final type of problem is addressed with a recommendation system, also called a recommendation engine. Recommendation systems are a type of information filtering system, and are intended to make recommendations in many applications, including movies, music, books, restaurants, articles, products, and so on. The two most common approaches are content-based and collaborative filtering.

Two great examples of popular recommendation engines are those offered by Netflix and Amazon. Netflix makes recommendations in order to keep viewers engaged and supplied with plenty of content to watch. In other words, to keep people using Netflix. They do this with their “Because you watched …”, “Top Picks for Alex”, and “Suggestions for you” recommendations.

Amazon does a similar thing in order to increase sales through up-selling, maintain sales through user engagement, and so on. They do this through their “Customers Who Bought This Item Also Bought”, “Recommendations for You, Alex”, “Related to Items You Viewed”, and “More Items to Consider” recommendations.

Machine Learning Algorithms

We’ve now covered the machine learning problem types and desired outputs. Now we will give a high level overview of relevant machine learning algorithms.

Here is a list of algorithms, both supervised and unsupervised, that are very popular and worth knowing about at a high level. Note that some of these algorithms will be discussed in greater depth later in this series.

Supervised Regression

  • Simple and multiple linear regression
  • Decision tree or forest regression
  • Artificial neural networks
  • Ordinal regression
  • Poisson regression
  • Nearest neighbor methods (e.g., k-NN or k-Nearest Neighbors)

Supervised Two-class & Multi-class Classification

  • Logistic regression and multinomial regression
  • Artificial neural networks
  • Decision trees, forests, and jungles
  • SVM (support vector machine)
  • Perceptron methods
  • Bayesian classifiers (e.g., Naive Bayes)
  • Nearest neighbor methods (e.g., k-NN or k-Nearest Neighbors)
  • One versus all multiclass

Unsupervised

  • K-means clustering
  • Hierarchical clustering

Anomaly Detection

  • Support vector machine (one class)
  • PCA (principal component analysis)

Note that a technique that’s often used to improve model performance is to combine the results of multiple models. This approach leverages what’s known as ensemble methods, and random forests are a great example (discussed later).

If nothing else, it’s a good idea to at least familiarize yourself with the names of these popular algorithms, and have a basic idea as to the type of machine learning problem and output that they may be well suited for.

Summary

Machine learning, predictive analytics, and other related topics are very exciting and powerful fields.

While these topics can be very technical, many of the concepts involved are relatively simple to understand at a high level. In many cases, a simple understanding is all that’s required to have discussions based on machine learning problems, projects, techniques, and so on.

Chapter two of this series will provide an introduction to model performance, cover the machine learning process, and discuss model selection and associated tradeoffs in detail.

Stay tuned!


About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.


References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. A Tour of Machine Learning Algorithms – Machine Learning Mastery
  6. Common Machine Learning Algorithms – Analytics Vidhya
  7. A Tour of Machine Learning Algorithms – Data Science Central
  8. How to choose algorithms for Microsoft Azure Machine Learning
  9. Wikipedia: Gradient Descent
  10. Wikipedia: Loss Function
  11. Wikipedia: Recommender System

Four steps to master machine learning with python