Machine Learning: An In-Depth, Non-Technical Guide – Part 5

By Alex Castrounis

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide-part-5/

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

Introduction

Welcome to the fifth and final chapter in a five-part series about machine learning.

In this final chapter, we will revisit unsupervised learning in greater depth, briefly discuss other fields related to machine learning, and finish the series with some examples of real-world machine learning applications.

Unsupervised Learning

Recall that unsupervised learning involves learning from data, but without the goal of prediction. This is because the data is either not given with a target response variable (label), or one chooses not to designate a response. It can also be used as a pre-processing step for supervised learning.

In the unsupervised case, the goal is to discover patterns, deep insights, understand variation, find unknown subgroups (amongst the variables or observations), and so on in the data. Unsupervised learning can be quite subjective compared to supervised learning.

The two most commonly used techniques in unsupervised learning are principal component analysis (PCA) and clustering. PCA is one approach to learning what is called a latent variable model, and is a particular version of a blind signal separation technique. Other notable latent variable modeling approaches include the expectation-maximization (EM) algorithm and the method of moments [3].

PCA

PCA produces a low-dimensional representation of a dataset by finding a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated [8]. Another way to describe PCA is that it is a transformation of possibly correlated variables into a set of linearly uncorrelated variables known as principal components [13].

Each of the components is mathematically determined and ordered by the amount of variability, or variance, that it is able to explain in the data. Given that, the first principal component accounts for the largest amount of variance, the second principal component the next largest, and so on.

Each component is also orthogonal to all of the others, which is just a fancy way of saying that they’re perpendicular to each other. Think of the X and Y axes in a two-dimensional plot. Both axes are perpendicular to each other, and are therefore orthogonal. While not easy to visualize, think of having many principal components as having many axes that are all perpendicular to each other.

While much of the above description of principal component analysis may sound a bit technical, it is actually a relatively simple concept at a high level. Think of having a bunch of data in any number of dimensions, although you may want to picture two or three dimensions for ease of understanding.

Each principal component can be thought of as an axis of an ellipse that is being built (think cloud) to contain the data (aka fit to the data), like a net catching butterflies. The first few principal components should be able to explain (capture) most of the variance in the data, with the addition of more principal components eventually leading to diminishing returns.

One of the tricks of PCA is knowing how many components are needed to summarize the data, which involves estimating when most of the variance is explained by a given number of components. Another consideration is that PCA is sensitive to feature scaling, which was discussed earlier in this series.
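
To make this a bit more concrete, here is a minimal sketch of that process using Python’s scikit-learn library (a tool choice of mine, not something prescribed by this series). It standardizes the features first, since PCA is sensitive to feature scaling, and then prints the cumulative proportion of variance explained as components are added, which is exactly the quantity used to decide how many components to keep. The Iris dataset is used purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                              # 150 observations, 4 features
X_scaled = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scaling

pca = PCA().fit(X_scaled)                         # keep all components for inspection

# Cumulative proportion of variance explained by the first 1, 2, 3, ... components;
# the point of diminishing returns suggests how many components to keep
print(np.cumsum(pca.explained_variance_ratio_).round(3))
```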

PCA is also used for exploratory data analysis and data visualization. Exploratory data analysis involves summarizing a dataset through specific types of analysis, including data visualization, and is often an initial step in analytics that leads to predictive modeling, data mining, and so on.

Further discussion of PCA and similar techniques is out of scope of this series, but the reader is encouraged to refer to external sources for more information.

Clustering

Clustering refers to a set of techniques and algorithms used to find clusters (subgroups) in a dataset, and involves partitioning the data into groups of similar observations. The concept of ‘similar observations’ is a bit relative and subjective, but it essentially means that the data points in a given group are more similar to each other than they are to data points in a different group.

Similarity between observations is a domain specific problem and must be addressed accordingly. A clustering example involving the NFL’s Chicago Bears (go Bears!) was given in chapter 1 of this series.

Clustering is not a technique limited only to machine learning. It is a widely used technique in data mining, statistical analysis, pattern recognition, image analysis, and so on. Given the subjective and unsupervised nature of clustering, often data preprocessing, model/algorithm selection, and model tuning are the best tools to use to achieve the desired results and/or solution to a problem.

There are many types of clustering algorithms and models, each using its own technique of dividing the data into a certain number of groups of similar data. Because these approaches can differ significantly, the results can differ as well, and therefore one must understand the different algorithms to some extent in order to choose the most applicable approach for the problem at hand.

K-means and hierarchical clustering are two widely used unsupervised clustering techniques. The difference is that for k-means, a predetermined number of clusters (k) is used to partition the observations, whereas the number of clusters in hierarchical clustering is not known in advance.

Hierarchical clustering helps address the potential disadvantage of having to know or pre-determine k in the case of k-means. There are two primary types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down) [8].
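
Here is a minimal sketch of both techniques using scikit-learn (again, my tool choice for illustration). The synthetic data, the choice of three clusters, and the agglomerative variant of hierarchical clustering are all assumptions made just for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

# Synthetic data containing three loose groups of observations
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means: the number of clusters (k) must be chosen up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Agglomerative (bottom-up) hierarchical clustering; the resulting tree can be
# cut into any number of clusters, here three for comparison
hierarchical_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(kmeans_labels[:10], hierarchical_labels[:10])
```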

Here is a visualization, courtesy of Wikipedia, of the results of running the k-means clustering algorithm on a set of data with k equal to three. Note the lines, which represent the boundaries between the groups of data.

https://commons.wikimedia.org/wiki/File:KMeans-Gaussian-data.svg

There are two types of clustering, which define the degree of grouping or containment of data. The first is called hard clustering, where every data point belongs to only one cluster and not the others. Soft clustering, or fuzzy clustering, on the other hand, refers to the case where a data point belongs to a cluster to a certain degree, or is assigned a likelihood (probability) of belonging to a certain cluster.
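
A Gaussian mixture model is one common way to get soft assignments. The short sketch below (my own example, not part of the original series) shows the same data point receiving both a hard cluster label and a set of cluster probabilities.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
hard_labels = gmm.predict(X)             # hard assignment: one cluster per data point
soft_labels = gmm.predict_proba(X)       # soft assignment: a probability for each cluster

print("hard assignment of first point:", hard_labels[0])
print("soft assignment of first point:", soft_labels[0].round(3))
```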

Method comparison and general considerations

What is the difference then between PCA and clustering? As mentioned, PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance, while clustering looks for homogeneous subgroups among the observations [8].

An interesting point to note is that in the absence of a target response, there is no way to evaluate solution performance or errors as one does in the supervised case. In other words, there is no objective way to determine if you’ve found a solution. This is a significant differentiator between supervised and unsupervised learning methods.

Predictive Analytics, Artificial Intelligence, and Data Mining, Oh My!

Machine learning is often interchanged with terms like predictive analytics, artificial intelligence, data mining, and so on. While machine learning is certainly related to these fields, there are some notable differences.

Predictive analytics is a subcategory of the broader field of analytics. Analytics is usually broken into three sub-categories: descriptive, predictive, and prescriptive.

Descriptive analytics involves analytics applied to understanding and describing data. Predictive analytics deals with modeling, and making predictions or assigning classifications from data observations. Prescriptive analytics deals with making data-driven, actionable recommendations or decisions.

Artificial intelligence (AI) is a super exciting field, and machine learning is essentially a sub-field of AI due to the automated nature of the learning algorithms involved. According to Wikipedia, AI has been defined as the science and engineering of making intelligent machines, but also as the study and design of intelligent agents, where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.

Statistical learning has been popularized by Stanford’s related online course and its associated books: An Introduction to Statistical Learning and The Elements of Statistical Learning.

Machine learning arose as a subfield of artificial intelligence, while statistical learning arose as a subfield of statistics. The two fields are very similar, overlap in many ways, and the distinction between them is becoming less clear over time. They differ in that machine learning places a greater emphasis on prediction accuracy and large-scale applications, whereas statistical learning emphasizes models and their related interpretability, precision, and uncertainty [8].

Lastly, data mining is a field that’s also often confused with machine learning. Data mining leverages machine learning algorithms and techniques, but also spans many other fields such as data science, AI, statistics, and so on.

The overall goal of the data mining process is to extract patterns and knowledge from a data set, and transform it into an understandable structure for further use [26]. Data mining often deals with large amounts of data, or big data.

Machine Learning in Practice

As discussed throughout this series, machine learning can be used to create predictive models, assign classifications, make recommendations, and find patterns and insights in an unlabeled dataset. All of these tasks can be done without requiring explicit programming.

Machine learning has been successfully used in the following non-exhaustive example applications [1]:

  • Spam filtering
  • Optical character recognition (OCR)
  • Search engines
  • Computer vision
  • Recommendation engines, such as those used by Netflix and Amazon
  • Classifying DNA sequences
  • Detecting fraud, e.g., credit card and internet
  • Medical diagnosis
  • Natural language processing
  • Speech and handwriting recognition
  • Economics and finance
  • Virtually anything else you can think of that involves data

In order to apply machine learning to solve a given problem, the following steps (or a variation of them) should be taken, drawing on the machine learning elements discussed throughout this series. A minimal end-to-end sketch follows the list below.

  1. Define the problem to be solved and the project’s objective. Ask lots of questions along the way!
  2. Determine the type of problem and type of solution required.
  3. Collect and prepare the data.
  4. Create, validate, tune, test, assess, and improve your model and/or solution. This process should be driven by a combination of technical (stats, math, programming), domain, and business expertise.
  5. Discover any other insights and patterns as applicable.
  6. Deploy your solution for real-world use.
  7. Report on and/or present results.
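
Here is that sketch in Python with scikit-learn, walking through steps 2 through 4 on a built-in dataset. Everything about it (the library, the dataset, the logistic regression model, the 80/20 split) is an illustrative assumption rather than a prescription.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Steps 2-3: a binary classification problem, with the data already collected and clean
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 4: create and validate (cross-validation) a model, then test it on held-out data
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(f"Cross-validated accuracy: {cross_val_score(model, X_train, y_train, cv=5).mean():.3f}")
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```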

If you encounter a situation where you or your company can benefit from a machine learning-based solution, simply approach it using these steps and see what you come up with. You may very well wind up with a super powerful and scalable solution!

Summary

Congratulations to those that have read all five chapters in full! I would like to thank you very much for spending your precious time joining me on this machine learning adventure.

This series took me a significant amount of time to write, so I hope that this time has been translated into something useful for as many people as possible.

At this point, we have covered virtually all major aspects of the entire machine learning process at a high level, and at times even went a little deeper.

If you were able to understand and retain the content in this series, then you should have absolutely no problem participating in any conversation involving machine learning and its applications. You may even have some very good opinions and suggestions about different applications, methods, and so on.

Despite all of the information covered in this series, and the details that were out of scope, machine learning and its related fields are, in practice, also somewhat of an art. There are many decisions to be made along the way, customized techniques to employ, and creative strategies to devise in order to best solve a given problem.

A high-quality practitioner should also have strong business acumen and expert-level domain knowledge. Problems involving machine learning are just as much about asking questions as they are about finding solutions. If the question is wrong, then the solution will be as well.

Thank you again, and happy learning (with machines)!


About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.


References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. 3 Ways to Test the Accuracy of Your Predictive Models
  6. Practical Machine Learning Online Course – Johns Hopkins University
  7. Machine Learning Online Course – Stanford University
  8. Statistical Learning Online Course – Stanford University
  9. Latent variable model
  10. Wikipedia: Cluster analysis
  11. Wikipedia: Expectation maximization algorithm
  12. Wikipedia: Method of moments
  13. Wikipedia: Principal component analysis
  14. Wikipedia: Exploratory data analysis

Machine Learning: An In-Depth, Non-Technical Guide – Part 4

By Alex Castrounis

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide-part-4/

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

Introduction

Welcome to the fourth chapter in a five-part series about machine learning.

In this chapter, we will take a deeper dive into model evaluation and performance metrics, and potential prediction-related errors that one may encounter.

Residuals and Classification Results

Before digging deeper into model performance and error types, we must first discuss the concept of residuals and errors for regression, positive and negative classifications for classification problems, and in-sample versus out-of-sample measurements.

Any reference to models, metrics, or errors computed with respect to the data used to train, validate, or tune a predictive model (i.e., data you have) is called in-sample. Conversely, reference to test data metrics and errors, or new data in general is called out-of-sample (i.e., data you don’t have).

Recall that regression involves predicting a continuous valued output (response) based on some set of input variables (features/predictors). The difference between the model’s predicted response value and the actual observed response value from the in-sample data is called the residual for each point, and the residuals refer collectively to all of the differences between the predicted and actual values. Each out-of-sample (new/test data) difference is called a prediction error instead of a residual.

For the classification case, and for simplicity, we will only discuss binary classification (two classes). Prior to performing classification on data observations, one must define what is a positive classification and what is a negative classification. In the case of spam or ham (i.e., not spam), spam may be the positive designation and ham is the negative.

If a model predicts an incoming email as being spam, and it really is spam, then that’s considered a true positive. Positive since the model predicted spam (the positive class), and true because the actual class matched the prediction. Conversely, if an incoming email is labeled spam when it’s actually not spam, it is considered a false positive.

Given this, we can see that the results of a classification model on new data can fall into four potential buckets. These include: true positives, false positives (type 1 error), true negatives, and false negatives (type 2 error). In all four cases, true or false refers to whether the actual class matched the predicted class, and positive or negative refers to which classification was assigned to an observation by the model.

Note that false is synonymous with error in this case since the model failed to predict correctly.
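
To make the four buckets concrete, here is a tiny sketch using Python’s scikit-learn library (my tooling choice; the spam/ham labels are made up), which counts true/false positives and negatives from a handful of actual and predicted labels.

```python
from sklearn.metrics import confusion_matrix

# 1 = spam (the positive class), 0 = ham (the negative class)
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels the matrix flattens to: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")
```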

Model Performance Overview

Now that we’ve covered residuals and classification result types, we will begin the discussion of model performance metrics that are based on these concepts.

Here is a non-exhaustive list of model evaluation methods, visualizations, and performance metrics that are used in machine learning and predictive analytics. They are categorized by their most common use case, but some may apply to more than one category (e.g., accuracy).

In addition to model evaluation, many of these can also be used for model comparison, selection, and tuning. Many of these are very powerful when combined with the cross-validation technique described earlier in this series.

  • Regression performance
    • R2 and adjusted R2 (aka explained variance)
    • Mean squared error (MSE), or root mean squared error (RMSE)
    • Mean error, or mean absolute error
    • Median error, or median absolute error
  • Classification performance
    • Confusion matrix
    • Precision
    • Recall (aka sensitivity)
    • Specificity
    • Accuracy
    • Lift
    • Area under the ROC curve (AUC)
    • F-score
    • Log-loss
    • Average precision
    • Precision/recall break-even point
    • Root mean squared error (RMSE)
    • Mean cross entropy
    • Probability calibration
  • Bias variance tradeoff and model complexity
    • Validation curve
    • Learning curve
    • Residual sum of squares
    • Goodness-of-fit metrics
  • Model validation and selection
    • Mallow’s Cp
    • Akaike information criterion (AIC)
    • Bayesian information criterion (BIC)

Performance metrics should be chosen based on the problem domain, project goals, and the business objectives. Unfortunately there isn’t a one-size-fits-all approach, and often there are tradeoffs to consider.

While a discussion of all of these methods and metrics is out of scope for this series, we will cover a few key ones next.

Model Performance Evaluation Metrics

Regression

There are many metrics for determining model performance for regression problems, but the most commonly used metric is known as the mean squared error (MSE), or a variation called the root mean squared error (RMSE), which is calculated by taking the square root of the MSE. The root mean squared error is typically preferred since taking the square root puts the error measurement back in the same units as the response variable.

The error in this case is the difference in value between a given model prediction and its actual value for an out-of-sample observation. The mean squared error is therefore the average of all of the squared errors across all new observations, which is the same as adding all of the squared errors (sum of squares) and dividing by the number of observations.
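
Here is that calculation as a short sketch in Python with NumPy; the five actual and predicted values are made up purely to show the arithmetic.

```python
import numpy as np

actual    = np.array([10.0, 12.0,  9.5, 14.0, 11.0])
predicted = np.array([11.0, 11.5, 10.0, 12.5, 11.5])

errors = predicted - actual
mse = np.mean(errors ** 2)      # average of the squared errors
rmse = np.sqrt(mse)             # back in the same units as the response
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}")
```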

In addition to being used as a stand-alone performance metric, mean squared error (or RMSE) can also be used for model selection, controlling model complexity, and model tuning. Often many models are created and evaluated (e.g., cross-validation), and then MSE (or similar metric) is plotted on the y-axis, with the tuning or validation parameter given on the x-axis.

The tuning or validation parameter is changed in each model creation and evaluation step, and the plot described above can help determine the ideal tuning parameter value. The number of predictors is a great example of a potential tuning parameter in this case.

Before moving on to classification, it is worth mentioning R2 briefly. R2 is often thought of as a measure of model performance, but it’s actually not. R2 is a measure of the amount of variance explained by the model, and is given as a number between 0 and 1. A value of 1 means the model explains all of the variance in the data perfectly, but when computed on training data this is more of an indication of potential overfitting than of high predictive performance.

As discussed earlier, the more complex the model, the more the model tends to fit the data better and potentially overfit, or contribute to additional model variance. Given this, adjusted R2 is a more robust and reliable metric in that it adjusts for any increases in model complexity (e.g., adding more predictors), so that one can better gauge underlying model improvement rather than improvement due solely to the increased complexity.
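
As a sketch of that adjustment (using the standard adjusted-R2 formula, with scikit-learn supplying plain R2), the small helper below penalizes R2 for the number of predictors. The function name and the toy values are mine, for illustration only.

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_predictors):
    # Penalizes plain R^2 for the number of predictors (p) given n observations:
    # adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

y_actual    = [10.0, 12.0, 9.5, 14.0, 11.0, 13.0]
y_predicted = [10.5, 11.5, 9.0, 13.0, 11.5, 12.5]
print(r2_score(y_actual, y_predicted), adjusted_r2(y_actual, y_predicted, n_predictors=2))
```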

Classification

Recall the different results from a binary classifier, which are true positives, true negatives, false positives, and false negatives. These are often shown in a confusion matrix. Here is a very generalized and comprehensive example of one from Wikipedia, and note that the graphic is shown with concepts and metrics, and not actual data.

And here is an example from Wikipedia with the values filled in [30] for different classifier models evaluated against 200 observations. Note the calculation and variation of the metrics across the different models.

A confusion matrix is conceptually the basis of many classification performance metrics as shown. We will discuss a few of the more popular ones associated with machine learning here.

Accuracy is a key measure of performance, and is more specifically the rate at which the model is able to predict the correct value (classification or regression) for a given data point or observation. In other words, accuracy is the proportion of correct predictions out of all predictions made.

The other two metrics from the confusion matrix worth discussing are precision and recall. Precision (positive predictive value) is the ratio of true positives to the total number of positive predictions made (i.e., true or false). Said another way, precision measures the proportion of accurate positive predictions out of all positive predictions made.

Recall on the other hand, or true positive rate, is the ratio of true positives to the total number of actual positives, whether predicted correctly or not. So in other words, recall measures the proportion of accurate positive predictions out of all actual positive observations.

A metric that is associated with precision and recall is called the F-score (also called F1 score), which combines them mathematically, and somewhat like a weighted average, in order to produce a single measure of performance based on the simultaneous values of both. Its values range from 0 (worst) to 1 (best).
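
Continuing with the made-up spam/ham labels from the earlier confusion matrix sketch, here is how those three metrics can be computed with scikit-learn (again my tool choice, not the series’).

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = spam (positive), 0 = ham (negative); same made-up labels as before
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print("precision:", precision_score(actual, predicted))   # TP / (TP + FP)
print("recall:   ", recall_score(actual, predicted))       # TP / (TP + FN)
print("F1 score: ", f1_score(actual, predicted))           # harmonic mean of precision and recall
```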

Another important concept to know about is the receiver operating characteristic, which when plotted, results in what’s known as an ROC curve (shown below, image courtesy of BOR at the English language Wikipedia).

An ROC curve is a two-dimensional plot of the true positive rate (sensitivity, or recall) versus the false positive rate (which is one minus the specificity). The area under the curve is referred to as the AUC, and is a numeric metric used to represent the quality and performance of the classifier (model).

By BOR at the English language Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=10714489

An AUC of 0.5 is essentially the same as random guessing without a model, whereas an AUC of 1.0 is considered a perfect classifier. Generally, the higher the AUC value the better, and an AUC above 0.8 is considered quite good.

The higher the AUC value, the closer the curve gets to the upper left corner of the plot. One can easily see from the ROC curves then that the goal is to find and tune a model that maximizes the true positive rate, while simultaneously minimizing the false positive rate. Said another way, the goal as shown by the ROC curve is to correctly predict as many of the actual positives as possible, while also predicting as many of the actual negatives as possible, and therefore minimize errors (incorrect classifications) for both.
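
Here is a minimal sketch of computing the ROC curve points and the AUC with scikit-learn; the dataset and the logistic regression model are stand-ins chosen only to have predicted probabilities to score.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # the points that trace out the ROC curve
print(f"AUC: {roc_auc_score(y_test, scores):.3f}")
```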

As mentioned previously in this series, model performance can be measured in many ways, and the method used should be chosen based on project goals, business domain considerations, and so on.

It is also worth noting that according to many experts, different performance metrics are thought to be biased for varying reasons. Given the breadth and complexity of this topic, the reader is encouraged to refer to external resources for further information on performance evaluation and the tradeoffs involved.

Error Analysis and Tradeoffs

There are multiple types of errors associated with machine learning and predictive analytics. The primary types are in-sample and out-of-sample errors. In-sample errors (aka resubstitution errors) are the error rates found on the training data, i.e., the data used to build predictive models.

Out-of-sample errors (aka generalization errors) are the error rates found on a new data set, and are the most important since they represent the potential performance of a given predictive model on new and unseen data.

In-sample error rates may be very low and seem to be indicative of a high-performing model, but one must be careful, as this may be due to overfitting as mentioned, which would result in a model that is unable to generalize well to new data.

Training and validation data is used to build, validate, and tune a model, but test data is used to evaluate model performance and generalization capability. One very important point to note is that prediction performance and error analysis should only be done on test data, when evaluating a model for use on non-training or new data (out-of-sample).

Generally speaking, model performance on training data tends to be optimistic, and therefore errors measured on training data will be lower than errors measured on test data. There are tradeoffs between the types of errors that a machine learning practitioner must consider and often choose to accept.

For binary classification problems, there are two primary types of errors: Type 1 errors (false positives) and Type 2 errors (false negatives). It’s often possible through model selection and tuning to decrease one while increasing the other, and often one must choose which error type is more acceptable. This can be a major tradeoff consideration depending on the situation.

A typical example of this tradeoff dilemma involves cancer diagnosis, where the positive diagnosis of having cancer is based on some test. In this case, a false positive means that someone is told that they have cancer when they do not. Conversely, the false negative case is when someone is told that they do not have cancer when they actually do.

If no model is perfect, then in the example above, which is the more acceptable error type? In other words, which one can we accept to a greater degree?

Telling someone they have cancer when they don’t can result in tremendous emotional distress, stress, additional tests and medical costs, and so on. On the other hand, failing to detect cancer in someone that actually has it can mean the difference between life and death.

In the spam or ham case, neither error type is nearly as serious as the cancer case, but typically email vendors err slightly more on the side of letting some spam get into your inbox as opposed to you missing a very important email because the spam classifier is too aggressive.
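
One common way this tradeoff plays out in practice is by moving the classification probability threshold. The sketch below (my own illustration with scikit-learn and a stand-in dataset) shows how lowering or raising the threshold shifts errors between false positives and false negatives.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# A lower threshold catches more positives (fewer false negatives) at the cost
# of more false positives; a higher threshold does the opposite
for threshold in (0.2, 0.5, 0.8):
    predictions = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```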

Summary

In this chapter, we have discussed many concepts and metrics associated with model evaluation, performance, and error analysis.

The fifth and final chapter of this series will revisit unsupervised learning in greater detail, followed by an overview of similar and highly related fields to machine learning. This series will conclude with an overview of machine learning as used in real world applications.

Stay tuned!


About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.


References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. 3 Ways to Test the Accuracy of Your Predictive Models
  6. Practical Machine Learning Online Course – Johns Hopkins University
  7. Machine Learning Online Course – Stanford University
  8. Statistical Learning Online Course – Stanford University
  9. Wikipedia: Type I and type II errors
  10. Wikipedia: Accuracy Paradox
  11. Wikipedia: Errors and Residuals
  12. Wikipedia: Information Retrieval
  13. Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria
  14. Wikipedia: Sensitivity and Specificity
  15. Wikipedia: Accuracy and precision
  16. Wikipedia: Precision and recall
  17. Wikipedia: F1 score
  18. Wikipedia: Residual sum of squares
  19. Wikipedia: Cohen’s kappa
  20. Wikipedia: Learning Curve
  21. Wikipedia: Coefficient of determination, aka R2
  22. Wikipedia: Mallows’s Cp
  23. Wikipedia: Bayesian information criterion
  24. Wikipedia: Akaike information criterion
  25. Wikipedia: Root-mean-square deviation
  26. Wikipedia: Knowledge Extraction
  27. Wikipedia: Data Mining
  28. Wikipedia: Confusion Matrix
  29. Simple guide to confusion matrix terminology
  30. Wikipedia: Receiver operating characteristic

Machine Learning: An In-Depth, Non-Technical Guide – Part 3

By Alex Castrounis

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide-part-3/

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

 

Introduction

Welcome to the third chapter in a five-part series about machine learning.

In this chapter, we’ll continue our machine learning discussion, and focus on problems associated with overfitting data, controlling model complexity, an introduction to model evaluation and errors, model validation and tuning, and improving model performance.

Overfitting

Overfitting is one of the greatest concerns in predictive analytics and machine learning. Overfitting refers to a situation where the model chosen to fit the training data fits too well, and essentially captures all of the noise, outliers, and so on.

The consequence of this is that the model will fit the training data very well, but will not accurately predict cases not represented by the training data, and therefore will not generalize well to unseen data. This means that the model performance will be better with the training data than with the test data.

A model is said to have high variance when it leans more towards overfitting, and conversely has high bias when it doesn’t fit the data well enough. A high variance model will tend to be quite flexible and overly complex, while a high bias model will tend to be very opinionated and overly simplified. A good example of a high bias model is fitting a straight line to very nonlinear data.

In both cases, the model will not make very accurate predictions on new data. The ideal situation is to find a model that is not overly biased, nor does it have a high variance. Finding this balance is one of the key skills of a data scientist.
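
To see this balance numerically, here is a small sketch in Python (entirely my own construction) that fits polynomials of increasing degree to noisy nonlinear data: the straight line (high bias) underfits, while the very high-degree polynomial (high variance) fits the training data well but does worse on new data.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(n):                       # nonlinear ground truth plus random noise
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, y_train = noisy_sine(30)
x_test, y_test = noisy_sine(30)

for degree in (1, 3, 15):                # high bias, reasonable, high variance
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:>2}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```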

Overfitting can occur for many reasons. A common one is that the training data consists of many features relative to the number of observations or data points. In this case, the data is relatively wide as compared to long.

To address this problem, reducing the number of features can help, or finding more data if possible. The downside to reducing features is that you lose potentially valuable information.

Another option is to use a technique called regularization, which will be discussed later in this series.

Controlling Model Complexity

Model complexity can be characterized by many things, and is a bit subjective. In machine learning, model complexity often refers to the number of features or terms included in a given predictive model, as well as whether the chosen model is linear, nonlinear, and so on. It can also refer to the algorithmic learning complexity or computational complexity.

Overly complex models are less easily interpreted, at greater risk of overfitting, and will likely be more computationally expensive.

There are some really sophisticated and automated methods by which to control, and ultimately reduce model complexity, as well as help prevent overfitting. Some of them are able to help with feature and model selection as well.

These methods include linear model and subset selection, shrinkage methods (including regularization), and dimensionality reduction.

Regularization essentially keeps all features, but reduces (or penalizes) the effect of some features on the model’s predicted values. The reduced effect comes from shrinking the magnitude, and therefore the effect, of some of the coefficients of the model’s terms.

The two most popular regularization methods are ridge regression and lasso. Both methods involve adding a tuning parameter (Greek lambda) to the model, which is designed to impose a penalty on each term’s coefficient based on its size, or effect on the model.

The larger the term’s coefficient size, the larger the penalty, which basically means the more the tuning parameter forces the coefficient to be closer to zero. Choosing the value to use for the tuning parameter is critical and can be done using a technique such as cross-validation.

The lasso technique works in a very similar way to ridge regression, but can also be used for feature selection. This is because the penalty term for each predictor is calculated slightly differently, and can shrink certain coefficients all the way to zero. This essentially removes those terms from the model, and is therefore a form of automatic feature selection.

Either ridge regression or the lasso may work better for a given situation. Often the lasso works better for data where the response is best modeled as a function of a small number of the predictors, but this isn’t guaranteed. Cross-validation is a great technique for evaluating one technique versus the other.
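
Here is a minimal sketch of both methods with scikit-learn, where the tuning parameter (called lambda in this series, alpha in scikit-learn) is chosen by cross-validation. The synthetic dataset, in which only a few of the features actually drive the response, is an assumption made to show the lasso zeroing out coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# 100 observations, 20 features, only 5 of which actually influence the response
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

# The tuning parameter (lambda here, alpha in scikit-learn) is chosen by cross-validation
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0.0)))   # shrunk, but rarely zero
print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0.0)))   # many dropped entirely
```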

Given a certain number of predictors (features), there is a calculable number of possible models that can be created with only a subset of the total predictors. An example is when you have 10 predictors, but want to find all possible models using only 2 of the 10 predictors.

Doing this, and then selecting one of the models based on the smallest test error, is known as subset selection, or sometimes as best subset selection. Note that a very useful plot for subset selection is the residual sum of squares (discussed later) of each model plotted against the number of predictors.

When the number of predictors gets large enough, best subset selection becomes unable to deal with the huge number of possible model combinations for a given subset of predictors. In this case, another method known as stepwise selection can be used. There are two primary versions, forward and backward stepwise selection.

In forward stepwise selection, predictors are added to the model one at a time starting from zero predictors, until all of the predictors are included. Backward stepwise selection is the opposite, and involves starting with a model including all predictors, and then removing a single predictor at each step.

The model performance is evaluated at each step in both cases. In both subset selection and stepwise selection, the test error is used to determine the best model. There are many ways to estimate test errors, which will be discussed later in this series.
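
As one concrete (and entirely optional) way to do this, recent versions of scikit-learn ship a greedy sequential selector that behaves like forward or backward stepwise selection, scoring each candidate subset with cross-validation. The dataset, model, and the choice of keeping five features below are all illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Greedy forward selection: add one predictor at a time, keeping whichever
# addition scores best under 5-fold cross-validation, until 5 features remain
selector = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="forward", cv=5)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```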

There is a concept that deals with highly dimensional data (i.e., large number of features) known as the curse of dimensionality. The curse of dimensionality refers to the fact that the computational speed and memory required increases exponentially as the number of data dimensions (features) increases.

This can manifest itself as a problem where a machine learning algorithm does not scale well to higher dimensional data [11]. One way to deal with this issue is to choose a different algorithm that can scale better with the data. The other is a technique known as dimensionality reduction.

Dimensionality Reduction

Dimensionality reduction is a technique used to reduce the number of features included in the machine learning process. It can help reduce complexity, reduce computational cost, and increase machine learning algorithm computational speed. It can be thought of as a technique that transforms the original predictors to a new, smaller set of predictors, which are then used to fit a model.

Principal component analysis (PCA) was discussed previously in the context of feature selection, but is also a widely-used dimensionality reduction technique as well. It helps reduce the number of features (i.e., dimensions) by finding, separating out, and sorting the features that explain the most variance in the data in descending order. Cross-validation is a great way to determine the number of principal components to include in the model.

An example of this would be a dataset where each observation is described by ten features, but only three of the features can describe the majority of the data’s variance, and therefore are adequate enough for creating a model with, and generating accurate predictions.

Note that people sometimes use PCA to prevent overfitting since fewer features implies that the model is less likely to overfit. While PCA may work in this context, it is not a good approach and is therefore not recommended. Regularization should be used to address overfitting concerns instead [8].

Model Evaluation and Performance

Assuming you are working with high quality, unbiased, and representative data, the next most important aspects of predictive analytics and machine learning are measuring model performance, possibly improving it if needed, and understanding potential errors that are often encountered.

We will have an introductory discussion here about model performance, improvement, and errors, but will continue with much greater detail on these topics in the next chapter.

Model performance is typically used to describe how well a model is able to make predictions on unseen data (e.g., test, but NOT training data), and there are multiple methods and metrics used to assess and gauge model performance. A key measure of model performance is to estimate the model’s test error.

The test error can be estimated either indirectly or directly. It can be estimated and adjusted indirectly by making changes that affect the training error, since the training error is a measure of overfitting (bias and/or variance) to some extent.

Recall that the more the model overfits the data (high variance), the less well the model will generalize to unseen data. Given that, the assumption is that reducing variance should improve the test error as well.

The test error can also be estimated directly by testing the model with the held out test data, and usually works best in conjunction with a resampling method such as cross-validation, which we’ll discuss later.

Estimating a model’s test error not only helps determine a model’s performance and accuracy, but is also a very powerful way to select a model too.

Improving Model Performance and Ensemble Learning

There are many ways to improve a model’s performance. The quality and quantity of data used has a huge, if not the biggest impact on model performance, but sometimes these two can’t easily be changed.

Other major influencers on model performance include algorithm tuning, feature engineering, cross-validation, and ensemble methods.

Algorithm tuning refers to the process of tweaking certain values that effectively initialize and control how a machine learning algorithm learns and generates predictive models. This tuning can be used to improve performance using a separate validation dataset, with performance later tested on the test dataset.

Since most algorithm tuning parameters are algorithm-specific and sometimes very complex, a detailed discussion is out of scope for this article, but note that the lambda parameter described for regularization is one such tuning parameter.

Ensemble learning, as mentioned in an earlier post, deals with combining or averaging (regression) the results from multiple learning models in order to improve predictive performance. In some cases (classification), ensemble methods can be thought of as a voting process where the majority vote wins.

Two of the most common ensemble methods are bagging (aka bootstrap aggregating) and boosting. Both are helpful with improving model performance and in reducing variance (overfitting) and bias (underfitting).

Bagging is a technique by which the training data is sampled with replacement multiple times. Each time a new training data set is created and a model is fitted to the sample data. The models are then combined to produce the overall model output, which can be used to measure model performance.

Boosting is a technique designed to transform a set of so-called weak learners into a single strong learner. In plain English, think of a weak learner as a model that predicts only slightly better than random guessing, and a strong learner as a model that predicts to a certain degree of accuracy better than random guessing.

While complicated, boosting basically works by iteratively creating weak models and adding them to the single strong learner. While this process happens, model accuracy is tested and then weightings are applied so that future learners focus on improving model performance for cases that were previously not well predicted.

Another very popular ensemble method is known as random forests. Random forests are essentially the combination of decision trees and bagging.
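
Here is a minimal side-by-side sketch with scikit-learn comparing a single decision tree to bagged trees, boosting, and a random forest under cross-validation. The dataset and the particular ensemble implementations are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single decision tree": DecisionTreeClassifier(random_state=0),
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "boosted trees": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Cross-validated accuracy for each approach; the ensembles typically beat the single tree
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```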

Kaggle is arguably the world’s most prestigious data science competition platform, and features competitions that are created and sponsored by most of the notable Silicon Valley tech companies, as well as by other very well-known corporations. Ensemble methods such as random forests and boosting have enjoyed very high success rates in winning these competitions.

Model Validation and Resampling Methods

Model validation is a very important part of the machine learning process. Validation methods consist of creating models and testing them on a validation dataset.

The resulting validation-set error provides an estimate of the test error, and is typically assessed using mean squared error (MSE) in the case of a quantitative response, and the misclassification rate in the case of a qualitative (discrete) response.

Many validation techniques are categorized as resampling methods, which involve refitting models to different samples formed from a set of training data.

Probably the most popular and noteworthy technique is called cross-validation. The key idea of cross-validation is that the model’s accuracy on the training set is optimistic, and that a better estimate comes from the model’s accuracy on the test set. The idea then is to estimate the test set accuracy while in the model training stage.

The process involves repeated splitting of the data into different training and test sets, building the model on the training set, and then evaluating it on the test set, and finally repeating and averaging the estimated errors.

In addition to model validation and helping to prevent overfitting, cross-validation can be used for feature selection, model selection, model parameter tuning, and comparing different predictors.

A popular special case of cross-validation is known as k-fold cross-validation. This technique involves selecting a number k, which represents the number of partitions of equal size that the original data is divided into. Once divided, a single partition is designated as a validation dataset (i.e., for testing the model), and the remaining k-1 data partitions are used as training data.

Note that typically the larger the chosen k, the less bias, but more variance, and vice versa. In the case of cross-validation, random sampling is done without replacement.

There is another technique that involves random sampling with replacement that is known as the bootstrap. The bootstrap technique tends to underestimate the error more than cross-validation.

Another special case is when k=n, i.e., when k equals the number of observations. In this case, the technique is known as leave-one-out cross-validation (LOOCV).
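
Here is a short sketch of both k-fold cross-validation and leave-one-out cross-validation using scikit-learn. The dataset, the logistic regression model, and the choice of ten folds are illustrative assumptions (and note that LOOCV fits one model per observation, so it can be slow).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# k-fold: the data is split into k partitions, each used once as the validation fold
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
print(f"10-fold CV accuracy: {cross_val_score(model, X, y, cv=kfold).mean():.3f}")

# Leave-one-out: k equals the number of observations, so one model is fit per point
print(f"LOOCV accuracy: {cross_val_score(model, X, y, cv=LeaveOneOut()).mean():.3f}")
```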

Summary

In this chapter, we have discussed many concepts and techniques associated with model evaluation, validation, complexity, and improvement.

Chapter four of this series will provide a much deeper dive into concepts and metrics related to model performance evaluation and error analysis.

Stay tuned!


About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.

References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. Wikipedia: Feature Selection
  6. Wikipedia: Cross-validation
  7. Practical Machine Learning Online Course – Johns Hopkins University
  8. Machine Learning Online Course – Stanford University
  9. Statistical Learning Online Course – Stanford University
  10. Wikipedia: Regularization
  11. Wikipedia: Curse of dimensionality
  12. Wikipedia: Bagging, aka Bootstrap Aggregating
  13. Wikipedia: Boosting

Machine Learning: An In-Depth, Non-Technical Guide – Part 2

By Alex Castrounis

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide-part-2/

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

Introduction

Welcome to the second chapter in a five-part series about machine learning.

In this chapter, we will briefly introduce model performance concepts, and then focus on the following parts of the machine learning process: data selection, preprocessing, feature selection, model selection, and model tradeoff considerations.

Model Performance Introduction

Model performance can be defined in many ways, but in general, it refers to how effectively the model is able to achieve the solution goals for a given problem (e.g., prediction, classification, anomaly detection, recommendation).

Since the goals can differ for each problem, the measure of performance can differ as well. Some common performance measures include accuracy, precision, recall, receiver operator characteristic (ROC), and so on. These will be discussed in much greater detail throughout the rest of this series.

Data Selection and Preprocessing

Some say that garbage in equals garbage out, and this is definitely the case. This basically means that you may have built a predictive model, but it doesn’t matter if the data used to build the model is non-representative, low quality, error ridden, and so on. The quality, amount, preparation, and selection of data is critical to the success of a machine learning solution.

The first step to ensure success is to avoid selection bias. Selection bias occurs when the samples used to produce the model are not fully representative of cases that the model may be used for in the future, particularly with new and unseen data.

Data is typically messy and often consists of missing values, useless values (e.g., NA), outliers, and so on. Prior to modeling and analysis, raw data needs to be parsed, cleaned, transformed, and pre-processed. This is typically referred to as data munging or data wrangling.

For missing data, values are often imputed, which is a technique used to fill in, or substitute for, missing values, and is very similar conceptually to interpolation.

In addition, sometimes feature values are scaled (feature scaling) and/or standardized (normalized). The most typical method of standardizing feature data is to subtract the mean across a given feature’s values from each individual observation value, and then divide by the standard deviation of that feature’s values.

Feature scaling is used to bring the different features’ value ranges onto a similar scale, in order to help prevent certain features from dominating models and predictions, but also to prevent computational problems when running machine learning optimization algorithms (speed, convergence, etc.).
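
Here is the standardization described above as a two-line calculation in Python with NumPy, alongside the equivalent scikit-learn helper; the tiny two-feature dataset is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g., age (years) and income (dollars)
X = np.array([[63.0, 150000.0],
              [71.0, 245000.0],
              [58.0,  90000.0]])

# Subtract each feature's mean and divide by its standard deviation
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# The equivalent using scikit-learn's helper
X_scaled = StandardScaler().fit_transform(X)
print(np.allclose(X_standardized, X_scaled))   # True
```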

Another preprocessing technique is to create dummy variables, which basically means converting qualitative variables into quantitative ones. An example is taking a color feature (e.g., green, red, and blue) and creating a separate 0/1 indicator (dummy) variable for each color. This makes it possible to perform regression with qualitative features.
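
Here is a minimal sketch of that conversion using the pandas library (my tool choice; the toy color/size table is made up).

```python
import pandas as pd

df = pd.DataFrame({"color": ["green", "red", "blue", "red"],
                   "size": [10, 12, 9, 11]})

# One 0/1 indicator (dummy) column per color value
print(pd.get_dummies(df, columns=["color"], dtype=int))
```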

Data Splitting

Recall from chapter 1 that the data used for machine learning should be split into training and test datasets, as well as an optional third validation dataset for model validation and tuning.

Choosing the size of each data set can be somewhat subjective and dependent on the overall sample size, and a full discussion is out of scope for this series. As an example however, given a training and test dataset only, some people may split the data into 80% training and 20% testing.

In general, more training data results in a better model and potential performance, and more testing data results in a greater evaluation of model performance and overall generalization capability.

Feature Selection and Feature Engineering

Once you have a representative, unbiased, cleaned, and fully prepared dataset, typical next steps include feature selection and feature engineering of the training data. Note that although discussed here, both of these techniques can also be used later in the process for improving model performance.

Feature selection is the process of selecting a subset of features from which to build a predictive regression model or classifier. This is usually done for model simplification and increased interpretability, reducing training times and computational cost, and to help reduce the risk of overfitting, and thus improve model generalization.

Basic techniques for feature selection, particularly for regression problems, involve estimates of model parameters (i.e., model coefficients) and their significance, and correlation estimates amongst features. This will be discussed further in a section about parametric models.

Some advanced techniques used for feature selection are principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA).

Principal component analysis is a statistical technique that determines which combinations of features (components), in order, explain the most to least variance in the data. Singular value decomposition is a lower-level linear algebra algorithm that is used by PCA.

Linear discriminant analysis is closely related to PCA in that they’re both linear transformation techniques. PCA however is more general and is not concerned with class labels (unsupervised), whereas LDA is more specific and is concerned with class labels (supervised).

Feature engineering includes feature selection as a sub-category, but also involves other aspects such as creating new features, transforming raw data into domain-specific and interpretable features, and so on.

Parametric Models and Feature Selection

Many machine learning models are a type of parametric model. A good example is the equation describing a line (i.e., a linear model), y = α + βx + ε, which includes the intercept coefficient (α), the slope (β), and an error term (ε) [9].

With parametric models, the coefficients of the terms are called the parameters, and are usually designated by the Greek letter beta and a subscript (e.g., β1 … βn). In regression problems, the parameters are called regression coefficients.

Many models also include an error term, indicated by the Greek letter epsilon. Simply stated, this error term is meant to account for the difference between the model’s predicted value and the actual observed value for a given set of input values.

Understanding the concept of model parameters is very important for supervised learning because machine learning differs from other techniques, in that it learns model parameters automatically. It does this by estimating the optimal set of model parameters that best explains the relationship between the response variable and the independent feature variables through optimization techniques, as discussed in chapter one.

In regression problems, a p-value is assigned to each of the estimated model parameters (regression coefficients), and this value is used to indicate the potential predictive influence that each coefficient has on the response.

Coefficients with a p-value greater than some chosen threshold, typically 0.05 or 0.10, are often not included in the model since they will most likely not help explain (predict) the response. This is one key way to perform feature selection with parametric models.

Another technique involves estimating the correlation of the features with respect to the response, and removing redundant and highly correlated features. The idea is that including only one of a pair of correlated features (the most significant) should be enough to explain the impact of both of the correlated features on the response.
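
Here is a small sketch of both ideas, using the statsmodels library for the regression p-values and pandas for the correlation matrix. The synthetic features, in which one predictor drives the response and another is nearly a copy of it, are assumptions made to keep the example readable.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
features = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
features["x3"] = features["x1"] * 0.95 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1
response = 3 * features["x1"] + rng.normal(size=100)                      # only x1 drives the response

# Pairwise correlations: x3 is redundant with x1, so only one of the pair would be kept
print(features.corr().round(2))

# p-values for each regression coefficient on the reduced feature set;
# x1 should come out significant, x2 should not
fit = sm.OLS(response, sm.add_constant(features[["x1", "x2"]])).fit()
print(fit.pvalues.round(4))
```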

Model Selection

While the algorithm or model that you choose may not matter as much as other things discussed in this series (e.g., amount of data, feature selection, etc.), here is a list of things to take into account when choosing a model.

  • Interpretability
  • Simplicity (aka parsimony)
  • Accuracy
  • Speed (training, testing, and real-time processing)
  • Scalability

A good approach is to start with simple models and then increase model complexity as needed, and only when necessary. Generally, simplicity should be preferred unless you can achieve major accuracy gains through model selection.

Relatively simple models include simple and multiple linear regression for regression problems, and logistic and multinomial regression for classification problems.

A basic early model selection choice for supervised learning is whether to use a linear or nonlinear model. Nonlinear models best describe and predict situations where the effects on the response from certain feature values, and their combinations, are nonlinear. In practice, however, relationships are rarely perfectly linear.

Beyond basic linear models, variations in the response variable can also be due to interaction effects, which means that the response is dependent not only on certain individual features (main effects), but also on the combination of certain features (interaction effects). This combination of features in a model is represented by multiplying the feature values for each interaction term in the model (e.g., βx1x2) with a term coefficient.

Once interaction terms are included, the significance of the interactions in explaining the response, and whether to include them, can be determined through the usual methods such as p-value estimation. Note that there is a concept known as the hierarchy principle, which basically says that if an interaction is included in a model, the associated main effects should also be included.
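
Here is a tiny sketch of how an interaction term can be generated mechanically with scikit-learn’s PolynomialFeatures helper (an assumption on my part; any tool that multiplies the relevant columns together would do).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])     # two observations with features x1 and x2

# Adds an x1*x2 column alongside the main effects (no squared terms)
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(interactions.fit_transform(X))    # columns: x1, x2, x1*x2
```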

While linear assumptions are often good enough and can produce adequate results, most real life feature/response relationships are nonlinear, and sometimes nonlinear models are required to get an acceptable level of accuracy. In this case, there are a wide variety of models to choose from.

Nonlinear models can include different degree polynomials, step functions, piecewise polynomials, splines, local regression (aka LOESS models), and generalized additive models (GAM). Due to the technical nature of nonlinear modeling, familiarity with the above model approaches by name should suffice for the purpose of this series.

Other notable model choices include decision trees, support vector machines (SVM), and artificial neural networks (modeled after biological neural networks, an interconnected system of neurons). Decision trees can be highly interpretable, while the latter two are black box and very complex technical methods. Decision trees involve creating a series of splits based on logical decisions, starting from the most important top-level node. Decision trees visually look like an upside down tree.

Here is an example of a decision tree created by Stephen Milborrow, which shows survival of passengers on board the Titanic. The term ‘sibsp’ is the number of spouses or siblings aboard, and the numbers under each leaf refer to the probability of survival and the percentage of the total observations (i.e., people on board). So the upper right leaf indicates that females had a 73% chance of survival and represented 36% of those on board.

By Stephen Milborrow (Own work) CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html), via Wikimedia Commons
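
To make the idea concrete, here is a minimal sketch of training a small decision tree and printing its splits as text. It uses scikit-learn's bundled iris data rather than the Titanic data shown above, purely as an assumed stand-in.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

    # Prints the series of logical splits, starting from the most important top-level node
    print(export_text(tree, feature_names=list(iris.feature_names)))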

The final model selection decision discussed here is whether to leverage ensemble methods for additional performance gains. These methods combine models to produce a single consensus prediction or classification, and do so through averaging or voting techniques.

Some very common ensemble methods are bagging, boosting, and random forests. Random forests are essentially bagging applied to decision trees, with the additional element of random feature subset selection. Further discussion of these methods is out of scope of this series.
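
Although a full treatment is out of scope, here is a minimal sketch contrasting a single decision tree with a random forest (bagged trees plus random feature-subset selection). The dataset and the scikit-learn API are assumptions for illustration.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    models = [("single tree", DecisionTreeClassifier(random_state=0)),
              ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]
    for name, model in models:
        # Average accuracy across 5 cross-validation folds
        print(name, cross_val_score(model, X, y, cv=5).mean())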

Model Tradeoffs

Model accuracy is determined in many ways, and will be discussed in detail later in this series. The primary measure of model accuracy comes from estimating the test error for a given model. The accuracy improvement goal of model selection is therefore to reduce the estimated test error.

It is important to note that the goal isn’t to find the absolute minimal error, but rather to find the simplest model that performs well enough. There are usually diminishing returns in trying to squeeze out the very last bit of performance. Given this, your choice of modeling approach won’t always be based on the one that results in the greatest degree of accuracy. Sometimes there are other important factors that must be taken into account as well, including interpretability, simplicity, speed, and scalability.

Often, it’s a tradeoff choosing whether prediction accuracy or model interpretability is more important for a given application. Artificial neural networks, support vector machines, and some ensemble methods can be used to create very accurate predictive models, but are very much of a black box except to highly specialized and technical individuals.

Black box algorithms may be preferred when predictive performance is the most important goal, and it’s not necessary to explain how the model works and makes predictions. In some cases however, model interpretability is preferred, and sometimes legally mandatory.

Here is an interpretability-driven example often seen in the financial industry. Suppose a machine learning algorithm is used to accept or reject an individual’s credit card application. If the applicant is rejected and decides to file a complaint or take legal action, the financial institution will need to explain how that decision was made. While that can be nearly impossible for a neural network or SVM system, it’s relatively straightforward for decision tree-based algorithms.

In terms of training, testing, processing, and prediction speed, some algorithms and model types take more time, and require greater computing power and memory, than others. In some applications, speed and scalability are critical factors, particularly in any widely used, near real-time application (e.g., an eCommerce site) where a model needs to be updated fairly regularly and must perform predictions and/or classifications at scale on the fly.

Lastly, and as previously mentioned, model simplicity (or parsimony) should always be preferred unless there is a significant and justifiable gain in performance accuracy. Simplicity usually results in quicker, more scalable, and easier to interpret models and results.

Summary

We’ve now had a solid overview of the machine learning process from selecting data and features, through selecting appropriate models for a given problem type.

Chapter three of this series will continue with the machine learning process, and in particular will focus on model evaluation, performance, improvement, complexity, validation, and more.

Stay tuned!


About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.

References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. Wikipedia: Feature Selection
  6. Practical Machine Learning Online Course – Johns Hopkins University
  7. Machine Learning Online Course – Stanford University
  8. Statistical Learning Online Course – Stanford University
  9. Wikipedia: Simple Linear Regression
  10. Stephen Milborrow (Own work)

Machine Learning: An In-Depth, Non-Technical Guide – Part 1

Source: http://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide/

By Alex Castrounis

Chapters

  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice

Introduction

Welcome! This is the first chapter of a five-part series about machine learning.

Machine learning is a very hot topic for many key reasons, and because it provides the ability to automatically obtain deep insights, recognize unknown patterns, and create high performing predictive models from data, all without requiring explicit programming instructions.

Despite the popularity of the subject, machine learning’s true purpose and details are not well understood, except by very technical folks and/or data scientists.

This series is intended to be a comprehensive, in-depth, and non-technical guide to machine learning, and should be useful to everyone from business executives to machine learning practitioners. It covers virtually all aspects of machine learning (and many related fields) at a high level, and should serve as a sufficient introduction or reference to the terminology, concepts, tools, considerations, and techniques of the field.

This high level understanding is critical if ever involved in a decision-making process surrounding the usage of machine learning, how it can help achieve business and project goals, which machine learning techniques to use, potential pitfalls, and how to interpret the results.

Note that most of the topics discussed in this series are also directly applicable to fields such as predictive analytics, data mining, statistical learning, artificial intelligence, and so on.

Machine Learning Defined

The oft quoted and widely accepted formal definition of machine learning as stated by field pioneer Tom M. Mitchell is:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E

The following is my less formal way to describe machine learning.

Machine learning is a subfield of computer science, but is often also referred to as predictive analytics, or predictive modeling1. Its goal and usage is to build new and/or leverage existing algorithms to learn from data, in order to build generalizable models that give accurate predictions, or to find patterns, particularly with new and unseen similar data.

Machine Learning Process Overview

Imagine a dataset as a table, where the rows are each observation (aka measurement, data point, etc), and the columns for each observation represent the features of that observation and their values.

At the outset of a machine learning project, a dataset is usually split into two or three subsets. The minimum subsets are the training and test datasets, and often an optional third validation dataset is created as well.

Once these data subsets are created from the primary dataset, a predictive model or classifier is trained using the training data, and then the model’s predictive accuracy is determined using the test data.

As mentioned, machine learning leverages algorithms to automatically model and find patterns in data, usually with the goal of predicting some target output or response. These algorithms are heavily based on statistics and mathematical optimization.

Optimization is the process of finding the smallest or largest value (minima or maxima) of a function, often referred to as a loss, or cost function in the minimization case10. One of the most popular optimization algorithms used in machine learning is called gradient descent, and another is known as the normal equation.
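
As a minimal sketch of the idea (not the algorithm used by any particular library), here is gradient descent minimizing a squared-error cost for simple linear regression; the learning rate and iteration count are arbitrary assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)
    y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)

    w, b, lr = 0.0, 0.0, 0.01
    for _ in range(2000):
        error = w * x + b - y
        # Gradients of the mean squared error with respect to w and b
        w -= lr * 2 * (error * x).mean()
        b -= lr * 2 * error.mean()

    print(w, b)  # should approach the true slope (3) and intercept (2)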

In a nutshell, machine learning is all about automatically learning a highly accurate predictive or classifier model, or finding unknown patterns in data, by leveraging learning algorithms and optimization techniques.

Types of Learning

The primary categories of machine learning are supervised, unsupervised, and semi-supervised learning. We will focus on the first two in this article.

In supervised learning, the data contains the response variable (label) being modeled, and with the goal being that you would like to predict the value or class of the unseen data. Unsupervised learning involves learning from a dataset that has no label or response variable, and is therefore more about finding patterns than prediction.

As I’m a huge NFL and Chicago Bears fan, my team will help exemplify these types of learning! Suppose you have a ton of Chicago Bears data and stats dating from when the team became a chartered member of the NFL (1920) until the present (2016).

Imagine that each row of the data is essentially a team snapshot (or observation) of relevant statistics for every game since 1920. The columns in this case, and the data contained in each, represent the features (values) of the data, and may include feature data such as game date, game opponent, season wins, season losses, season ending divisional position, post-season berth (Y/N), post-season stats, and perhaps stats specific to the three phases of the game: offense, defense, and special teams.

In the supervised case, your goal may be to use this data to predict if the Bears will win or lose against a certain team during a given game, and at a given field (home or away). Keep in mind that anything can happen in football in terms of pre-game and game-time injuries, weather conditions, bad referee calls, and so on, so take this simply as an example of an application of supervised learning with a yes or no response (prediction), as opposed to determining the probability or likelihood of ‘Da Bears’ getting the win.

Since you have historic data of wins and losses (the response) against certain teams at certain football fields, you can leverage supervised learning to create a model to make that prediction.

Now suppose that your goal is to find patterns in the historic data and learn something that you don’t already know, or group the team in certain ways throughout history. To do so, you run an unsupervised machine learning algorithm that clusters (groups) the data automatically, and then analyze the clustering results.

With a bit of analysis, one may find that these automatically generated clusters seemingly group the team into the following example categories over time:

  • Strong defense, weak running offense, strong passing offense, weak special teams, playoff berth
  • Strong defense, strong running offense, weak passing offense, average special teams, playoff berth
  • Weak defense, strong all-around offense, strong special teams, missed the playoffs
  • and so on

An example of unsupervised cluster analysis would be to find a potential reason why they missed the playoffs in the third cluster above. Perhaps due to the weak defense? Bears have traditionally been a strong defensive team, and some say that defense wins championships. Just saying…

In either case, each of the above classifications may be found to relate to a certain time frame, which one would expect. Perhaps the team was characterized by one of these groupings more than once throughout their history, and for differing periods of time.

To characterize the team in this way without machine learning techniques, one would have to pore over all historic data and stats, manually find the patterns and assign the classifications (clusters) for every year taking all data into account, and compile the information. That would definitely not be a quick and easy task.

Alternatively, you could write an explicitly coded program to pore over the data, one that has to know what team stats to consider, what thresholds to take into account for each stat, and so forth. It would take a substantial amount of time to write the code, and different programs would need to be written for every problem needing an answer.

Or… you can employ a machine learning algorithm to do all of this automatically for you in a few seconds.

Machine Learning Goals and Outputs

Machine learning algorithms are used primarily for the following types of output:

  • Clustering (Unsupervised)
  • Two-class and multi-class classification (Supervised)
  • Regression: Univariate, Multivariate, etc. (Supervised)
  • Anomaly detection (Unsupervised and Supervised)
  • Recommendation systems (aka recommendation engine)

Specific algorithms that are used for each output type are discussed in the next section, but first, let’s give a general overview of each of the above output, or problem types.

As discussed, clustering is an unsupervised technique for discovering the composition and structure of a given set of data. It is a process of clumping data into clusters to see what groupings emerge, if any. Each cluster is characterized by a contained set of data points, and a cluster centroid. The cluster centroid is basically the mean (average) of all of the data points that the cluster contains, across all features.
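
Here is a minimal sketch of clustering with k-means and inspecting the cluster centroids; the synthetic data, the choice of three clusters, and the scikit-learn API are assumptions for illustration.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(kmeans.cluster_centers_)  # each centroid is the mean of its cluster's points
    print(kmeans.labels_[:10])      # cluster assigned to the first ten observations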

Classification problems involve placing a data point (aka observation) into a pre-defined class or category. Sometimes classification problems simply assign a class to an observation, and in other cases the goal is to estimate the probabilities that an observation belongs to each of the given classes.

A great example of a two-class classification is assigning the class of Spam or Ham to an incoming email, where ham just means ‘not spam’. Multi-class classification just means more than two possible classes. So in the spam example, perhaps a third class would be ‘Unknown’.

Regression is just a fancy word for saying that a model will assign a continuous value (response) to a data observation, as opposed to a discrete class. A great example of this would be predicting the closing price of the Dow Jones Industrial Average on any given day. This value could be any number, and would therefore be a perfect candidate for regression.

Note that sometimes the word regression is used in the name of an algorithm that is actually used for classification problems, or to predict a discrete categorical response (e.g., spam or ham). A good example is logistic regression, which predicts probabilities of a given discrete value.

Another problem type is anomaly detection. While we’d love to think that data is well behaved and sensible, unfortunately this is often not the case. Sometimes there are erroneous data points due to malfunctions or errors in measurement, or sometimes due to fraud. Other times it could be that anomalous measurements are indicative of a failing piece of hardware or electronics.

Sometimes anomalies are indicative of a real problem and are not easily explained, such as a manufacturing defect, and in this case, detecting anomalies provides a measure of quality control, as well as insight into whether steps taken to reduce defects have worked or not. In either case, there are times where it is beneficial to find these anomalous values, and certain machine learning algorithms can be used to do just that.

The final type of problem is addressed with a recommendation system, or also called recommendation engine. Recommendation systems are a type of information filtering system, and are intended to make recommendations in many applications, including movies, music, books, restaurants, articles, products, and so on. The two most common approaches are content-based and collaborative filtering.

Two great examples of popular recommendation engines are those offered by Netflix and Amazon. Netflix makes recommendations in order to keep viewers engaged and supplied with plenty of content to watch. In other words, to keep people using Netflix. They do this with their “Because you watched …”, “Top Picks for Alex”, and “Suggestions for you” recommendations.

Amazon does a similar thing in order to increase sales through up-selling, maintain sales through user engagement, and so on. They do this through their “Customers Who Bought This Item Also Bought”, “Recommendations for You, Alex”, “Related to Items You Viewed”, and “More Items to Consider” recommendations.

Machine Learning Algorithms

We’ve now covered the machine learning problem types and desired outputs. Now we will give a high level overview of relevant machine learning algorithms.

Here is a list of algorithms, both supervised and unsupervised, that are very popular and worth knowing about at a high level. Note that some of these algorithms will be discussed in greater depth later in this series.

Supervised Regression

  • Simple and multiple linear regression
  • Decision tree or forest regression
  • Artificial Neural networks
  • Ordinal regression
  • Poisson regression
  • Nearest neighbor methods (e.g., k-NN or k-Nearest Neighbors)

Supervised Two-class & Multi-class Classification

  • Logistic regression and multinomial regression
  • Artificial Neural networks
  • Decision trees, forests, and jungles
  • SVM (support vector machine)
  • Perceptron methods
  • Bayesian classifiers (e.g., Naive Bayes)
  • Nearest neighbor methods (e.g., k-NN or k-Nearest Neighbors)
  • One versus all multiclass

Unsupervised

  • K-means clustering
  • Hierarchical clustering

Anomaly Detection

  • Support vector machine (one class)
  • PCA (Principal component analysis)

Note that a technique that’s often used to improve model performance is to combine the results of multiple models. This approach leverages what’s known as ensemble methods, and random forests are a great example (discussed later).

If nothing else, it’s a good idea to at least familiarize yourself with the names of these popular algorithms, and have a basic idea as to the type of machine learning problem and output that they may be well suited for.

Summary

Machine learning, predictive analytics, and other related topics are very exciting and powerful fields.

While these topics can be very technical, many of the concepts involved are relatively simple to understand at a high level. In many cases, a simple understanding is all that’s required to have discussions based on machine learning problems, projects, techniques, and so on.

Chapter two of this series will provide an introduction to model performance, cover the machine learning process, and discuss model selection and associated tradeoffs in detail.

Stay tuned!


About the Author: Alex Castrounis founded InnoArchiTech. Sign up for the InnoArchiTech newsletter and follow InnoArchiTech on Twitter at @innoarchitech for the latest content updates.


References

  1. Wikipedia: Machine Learning
  2. Wikipedia: Supervised Learning
  3. Wikipedia: Unsupervised Learning
  4. Wikipedia: List of machine learning concepts
  5. A Tour of Machine Learning Algorithms – Machine Learning Mastery
  6. Common Machine Learning Algorithms – Analytics Vidhya
  7. A Tour of Machine Learning Algorithms – Data Science Central
  8. How to choose algorithms for Microsoft Azure Machine Learning
  9. Wikipedia: Gradient Descent
  10. Wikipedia: Loss Function
  11. Wikipedia: Recommender System

Scrape Google Scholar

Source: http://lernpython.de/scrape-google-scholar

Google Scholar is a useful application. It links every publication to its authors and makes it easy to access the scientific output of every researcher. Two important key indicators are the number of citations and the H-index. This short Python script shows how to extract/scrape these two parameters in Python.


To scrape Google Scholar, we first load the important libraries for this task and define a function that can scrape the H-index from a Google Scholar profile, as long as we feed the function the link to that profile. Given such a link, the function returns the H-index.
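
The original script is not reproduced in this copy of the post, so the following is only a rough sketch of the kind of function described, using requests and BeautifulSoup. The CSS class gsc_rsb_std and the position of the h-index in the statistics table are assumptions about Google Scholar's markup, which changes over time, and Scholar may block automated requests.

    import requests
    from bs4 import BeautifulSoup

    def get_h_index(profile_url):
        html = requests.get(profile_url, headers={"User-Agent": "Mozilla/5.0"}).text
        soup = BeautifulSoup(html, "html.parser")
        # The profile's statistics table lists citations, h-index, and i10-index
        stats = [cell.text for cell in soup.select("td.gsc_rsb_std")]
        return int(stats[2]) if len(stats) > 2 else None  # index 2 assumed to be the all-time h-index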

Use Scholarly to scrape Google Scholar

In the next step we use the Python module scholarly. It has several features; the most important is that it can search the Google Scholar database by name and return the number of citations or the direct link to the Google Scholar profile. Hence, we give this function a list of scientists in the field of nanopores and use it to get the number of citations and the link to each Google Scholar profile. That link is then fed to the previously defined function to return the H-index.

We save the H-index, number of citations, and researcher name into one list and plot the two integer parameters.
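
Again as a rough sketch rather than the post's original code: the scholarly package's interface has changed across versions, so the search_author/fill calls and the 'citedby'/'hindex' fields below follow the current interface and are assumptions, as are the placeholder researcher names; matplotlib draws the citations-versus-H-index plot.

    import matplotlib.pyplot as plt
    from scholarly import scholarly

    names = ["Cees Dekker", "Hagan Bayley"]  # placeholder list of nanopore researchers
    citations, h_indices = [], []
    for name in names:
        author = scholarly.fill(next(scholarly.search_author(name)))
        citations.append(author["citedby"])
        h_indices.append(author["hindex"])

    plt.scatter(citations, h_indices)
    plt.xlabel("Number of citations")
    plt.ylabel("H-index")
    plt.show()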

The result is a plot with the number of citations on the X-axis and the H-index on the Y-axis. From this we can deduce that the H-index grows as the number of citations increases. Publications analysing citation behavior in more detail can be found here.

[Figure: H-index vs. number of citations scraped from Google Scholar]


Machine Learning: Understanding Logistic Regression with Elementary Mathematics

2015-11-03 龙心尘, 寒小阳

Source: http://my.csdn.net/longxinchen_ml

To lower the barrier to understanding, this article tries to explain logistic regression using only elementary mathematics, with few formulas and many figures that intuitively convey the real-world meaning behind the derivations, in the hope that readers come away with a more intuitive understanding of logistic regression.

A plain geometric description of the logistic regression problem

Logistic regression deals with classification problems. We can restate it in plain geometric language:
there are two groups of points in space, one group of circles “〇” and one group of crosses “X”. We want to choose a separating boundary in this space that splits the two groups apart.

Note: the dimension of the separating boundary depends on the dimension of the space. In a two-dimensional plane, the boundary is a line (one dimension). In three-dimensional space, it is a surface in that space (two dimensions). On a one-dimensional line, it is a single point on the line. Spaces of different dimensions are discussed in more detail later.

To simplify the treatment and notation, we make the following four assumptions:

  1. We first consider the two-dimensional case.
  2. We assume the two classes are linearly separable: an optimal straight line can be found that separates the two groups of points.
  3. We use a discrete variable y for a point's class; y takes only two values, y=1 for a cross “X” and y=0 for a circle “〇”.
  4. A point's horizontal and vertical coordinates are denoted (x1, x2).

So the problem becomes: how do we use the coordinates and labels (y) of the existing points to find the equation of the separating line?

How do we use analytic geometry to find the separating line for the logistic regression problem?
  1. We reason backwards:
    Assume we have already found the line, and ask what properties this line has. From those properties, we then work back to the equation of the line.
  2. What properties does the line have?
    First, it separates the two groups of points. (Fine, that is stating the obvious.)
    Second, the projections of the two groups of points onto the line's normal vector p have opposite signs: the projections of one group are all positive, and those of the other group are all negative.

    • First of all, this property is very useful: it can be used to tell the two classes of points apart.
    • Moreover, we constrain the normal vector: we only consider the normal vector p whose extension passes through the origin. Then, as long as the normal vector p is found, the separating line is uniquely determined and the classification problem is solved.


  3. Is there a way to work with the normal vector p more conveniently?
    Computing each point's projection onto p requires knowing where p starts, and pinning down that starting point is a hassle, so we simply translate the normal vector so that it starts at the origin of the coordinate system, giving a new vector p'. The projection of every point onto p' then changes only by a constant.

    Call this constant θ0, and let the horizontal and vertical components of p' be (θ1, θ2). The projection of any point onto p' is then θ1x1 + θ2x2, and adding the constant from before gives: z = θ0 + θ1x1 + θ2x2.

Does the expression above look familiar? It is exactly the part inside the parentheses of the logistic regression function!

We can then judge the class of a point x from the sign of z.

Understanding the meaning of z from a probability perspective

From the steps above, we obtained a new feature z from the coordinates of the point x. So:

What is the real-world meaning of z?

First, we know that z can be positive, negative, or zero, and that its range extends all the way to positive and negative infinity.

If z is greater than 0, the point x belongs to class y=1. Moreover, the larger z is, the farther the point is from the separating boundary, and the more likely it is to belong to class y=1.

Can we then interpret z as the probability P(y=1|x) (written P below) that the point x belongs to class y=1? Clearly that is not ideal, because probabilities range from 0 to 1.

But we can transform the probability P slightly: let Q = P/(1-P), hoping to use Q as the real-world meaning of z. We find that as P varies over [0,1], Q increases monotonically over [0,+∞). The graph is as follows (you can quickly plot it by searching for “x/(1-x)”):

But Q varying over [0,+∞) is still not enough: we want something that varies over (-∞,+∞), and that is exactly 0 when P = 1/2. Only then does it have enough explanatory power.

Note: P = 1/2 means the point is equally likely to belong to either class, i.e., it lies right on the separating boundary, so its projection onto the normal vector is naturally 0.

But at P = 1/2 we have Q = 1, which is still some distance from 0. How can we transform it so that it equals 0 there? There is a natural function, log, that meets this requirement exactly.
So we apply the transformation R = log(Q) = log(P/(1-P)), hoping to use R as the real-world meaning of z. Its graph looks like this:

For P in the interval [0,1], this function can be positive, negative, or zero, varies monotonically over (-∞,+∞), and P = 1/2 is exactly its unique zero! It essentially satisfies our requirements perfectly.
Returning to the question we asked at the start of this section,

“We obtained a new feature z from the coordinates of the point x; what exactly does z mean?”

We can now understand z as the value obtained by applying this transformation to the probability P that x belongs to class y=1. That is, z = log(P/(1-P)); inverting it gives P = 1/(1+e^(-z)). The graph looks like this:

Do these two functions, log(P/(1-P)) and 1/(1+e^(-z)), look familiar?

They are the legendary logit function and sigmoid function!

A small addendum:

  • In probability theory, Q = P/(1-P) is called the odds; anyone who has bet on World Cup matches knows it. The odds are the ratio of the probability that an event happens to the probability that it does not.
  • And z = log(P/(1-P)) is the log-odds.

So we have not only found the real-world meaning of z, but also the fitted equation that maps z to the probability P:

With the probability P in hand, we can use the fitted equation P = 1/(1+e^(-z)) to decide which class the point x belongs to:

When P >= 1/2, we classify x as y=1; when P < 1/2, we classify x as y=0.
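
A minimal sketch of this decision rule in Python; the parameter names are just the θ introduced above, and the code is illustrative rather than taken from the original article.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(theta, x):
        z = theta[0] + theta[1] * x[0] + theta[2] * x[1]  # z = theta0 + theta1*x1 + theta2*x2
        p = sigmoid(z)                                    # P(y=1 | x)
        return 1 if p >= 0.5 else 0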

Constructing a cost function to solve for the parameter values

So far we have two ways to decide a point's class: check whether z is greater than 0, or check whether g(z) is greater than 1/2.
However, none of this is of much use yet,

because all of the analysis above rests on the premise that “we have already found the line”, while we still have no effective way to compute the three key parameters.

Are there any other properties we can exploit to solve for the parameter values?

  • We have overlooked one crucial property: these sample points have already been labeled y=0 or y=1!
  • On the one hand, we can predict a point's class from whether z is greater than 0 (or whether g(z) is greater than 1/2); on the other hand, we can judge how good our predictions are from the difference between the labeled classes and the classes we predict.
  • A function that measures the gap between what we predict under some set of parameters and the actual results is the legendary cost function.
  • When the cost function is minimized, the corresponding parameters are the optimal solution we are after.

Clearly, designing a good cost function is the key to handling the classification problem well. Different cost functions may lead to different results, so we need the cost function to be interpretable and well matched to the real problem.

To measure “the gap between the predicted result and the actual result”, we first have to pin down what the “predicted result” and the “actual result” are.

  • The “actual result” is easy: it is simply y=0 or y=1.
  • For the “predicted result” there are two candidates; from the analysis above we could use either z or g(z). But g(z) is clearly better, because g(z) means the probability P, which lies in [0,1] and is directly comparable to the actual result {0,1}, whereas z means the log-odds, whose range is the entire real line (-∞,+∞) and is hard to compare with y ∈ {0,1}.

Next we measure the “gap” between the two results.

  • The first thing that comes to mind is y - hθ(x).
    • But that works well only when y=1. If y=0, then y - hθ(x) = -hθ(x) is negative, which is awkward to compare, so we take its absolute value hθ(x) instead. Combining the two cases gives the expression below:
    • This function has a problem, though: it is not convenient to differentiate, which in turn makes gradient descent inconvenient.
    • Since gradient descent goes beyond elementary mathematics, we will not explain it here.
  • So the cost function above is given a simple reworking to make it easy to differentiate. The result is as follows:

Once the cost function is determined, what remains is mechanical computation. The usual method is gradient descent. With that, our linearly separable problem in the plane can be considered solved.
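
A minimal sketch of that whole pipeline, the cross-entropy cost minimized by plain gradient descent on a linearly separable toy dataset; the data, learning rate, and iteration count are illustrative assumptions, not from the original article.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    Xb = np.hstack([np.ones((100, 1)), X])  # prepend a column of ones for theta0

    theta = np.zeros(3)
    lr = 0.1
    for _ in range(1000):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))  # sigmoid(z) for every sample
        grad = Xb.T @ (p - y) / len(y)         # gradient of the cross-entropy cost
        theta -= lr * grad

    print(theta)  # the separating line is theta0 + theta1*x1 + theta2*x2 = 0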

Re-examining the reasoning above from the perspective of geometric transformations

Looking back at our reasoning, we have really been applying one coordinate transformation to the points after another.

  • Step 1: map the points spread across the two-dimensional plane onto a one-dimensional line via a linear projection, giving the points x(z).
  • Step 2: map the points x(z) on that one-dimensional line into the one-dimensional segment [0,1] via the sigmoid function, giving the points x(g(z)).
  • Step 3: combine the coordinates of all these points into a single value via the cost function; if that value is the minimum, the corresponding parameters are the ideal values we need.

Simple problems that are not linearly separable
  1. From the analysis above, the key is step 1; the reason we could do that mapping is that we assumed the point set is linearly separable. But what if the separating boundary is a circle? Consider the following situation.
  2. Again we reason backwards:
    • By inspection, a circle seems a reasonable separating boundary.
    • Assume we have already found this circle, ask what its properties are, and from those properties work back to the circle's equation.
  3. We can use this property:
    • Points inside the circle are closer to the center than the radius; points outside are farther than the radius.
    • Suppose the circle has radius r; the distance from any point to the origin is then sqrt(x1^2 + x2^2).
    • Setting z = x1^2 + x2^2 - r^2, we can judge the class of the point x from the sign of z.
    • Then, passing z through the sigmoid as before, we can keep relying on our earlier logistic regression machinery to handle and interpret the problem.
  4. Re-examining this reasoning from the perspective of geometric transformations:
    • Step 1: map the points spread across the two-dimensional plane onto a one-dimensional line by some function, giving the points x(z).
    • Step 2: map the points x(z) on that one-dimensional ray into the one-dimensional segment [0,1] via the sigmoid function, giving the points x(g(z)).
    • Step 3: combine the coordinates of all these points into a single value v via the cost function; if that is the minimum, the corresponding parameters are the ideal values we need.

 

Re-examining the analysis above from the perspective of feature processing

In fact, the process of data mining can also be understood as a process of feature processing. Our standard data mining algorithms are essentially mature feature-processing pipelines that have been frozen into fixed form.
For the classification problem that logistic regression handles, the features we start with are the points' coordinates, and our goal is to decide whether each point belongs to class y=0 or y=1. The ideal approach is to apply some function to the coordinates to obtain one (or several) new feature(s) z, and decide the sample's class based on whether z is greater than 0.

Abstracting further the reasoning for the non-linearly-separable problem in the previous section, our approach is essentially:

  • Step 1: apply some function to the point's coordinates to obtain a new, log-odds-like feature z.
  • Step 2: pass the feature z through the sigmoid function to obtain a new feature q.
  • Step 3: combine the features q of all the points into a single value via the cost function; if that is the minimum, the corresponding parameters (such as r) are the ideal values we need.

Complex problems that are not linearly separable

From the analysis above, the key is step 1: how to design the transformation function. We now consider the case where the separating boundary is an extremely irregular curve.

Again we reason backwards:

  • Based on observation and other prior knowledge (or a wild guess without observing at all), we may assume the separating boundary is some sixth-degree curve (the assumed curve equation can be made as complex as needed, covering all sorts of cases).
  • Step 1: apply some function to the point's coordinates to obtain a new feature z, and assume z is a kind of log-odds, deciding the sample's class by whether it is greater than 0.
  • Step 2: map the feature z through the sigmoid function to a new feature q.
  • Step 3: combine the features q of all the samples into a single value via the logistic regression cost function; if that is the minimum, the corresponding parameters are the ideal values we need. Correspondingly, the separating boundary is simply the equation z = 0, i.e., the set of points where the log-odds is 0.

Logistic regression in higher dimensions

All of the problems considered so far involve classification in a two-dimensional plane. Classification in higher-dimensional spaces is similar.

Samples in a high-dimensional space simply have more feature coordinates; for example, a point x in four-dimensional space has coordinates (x1, x2, x3, x4). Applying the feature-processing view above directly, we just apply a function with more parameters to the coordinates to obtain a new feature z, assume z is a kind of log-odds, and decide the sample's class by whether it is greater than 0.

Moreover, if the data are linearly separable in the high-dimensional space, there is an even more direct intuition.

  • In three-dimensional space, the separating boundary is a two-dimensional plane in that space. The projections of the two classes of points onto that plane's normal vector p have opposite signs: one class projects to all positive values, the other to all negative values.
  • In a higher-dimensional space, the separating boundary is a hyperplane in that space. The projections of the two classes of points onto the hyperplane's normal vector p likewise have opposite signs: one class projects to all positive values, the other to all negative values.
  • As a special case, in a one-dimensional line space, the separating boundary is a single point p on the line. One class of points lies in the positive direction from p and the other in the negative direction. The points' coordinates on the line can naturally be understood as something like log-odds. So the one-dimensional classification problem is what every higher-dimensional problem reduces to after projecting onto the normal vector; it is the foundation of all logistic regression problems.

Multi-class logistic regression

All of the problems above are binary classification, essentially true/false questions. How do we handle multi-class problems, i.e., multiple-choice questions, with logistic regression?

The basic idea is still binary classification, answering true/false questions.

For example, to choose one of three options A, B, and C, first find the separating boundary between A and B∪C (“∪” denotes the union). Then find the boundary between B and A∪C, and between C and A∪B.

This gives the probabilities of belonging to classes A, B, and C respectively; comparing them, the class with the highest probability is the answer.

Summary

The analytical approach of this article: reasoning backwards

Plot the data, look at it, spot (or guess) a pattern, assume the pattern holds, express the pattern mathematically, and then solve for the corresponding mathematical expression.
This line of attack is typical and common in data mining.

Two perspectives: geometric transformation and feature processing.

  1. Recap:
    • Geometric-transformation view: map the high-dimensional space to a one-dimensional space → map the one-dimensional space to the interval [0,1] → map [0,1] to a single value, and solve the optimization.
    • Feature-processing view: a feature function produces a single feature value z → the sigmoid function produces a probability → the cost function produces a cost value, and solve the optimization.
  2. First, note that in logistic regression these two views run in parallel; neither contains the other. They are two facets of the same mathematical process.
    • For example, when we handled the complex non-linearly-separable problem later on, we appeared to use only the feature-processing approach. In fact, a complex nonlinear separating boundary can also be mapped into a higher-dimensional space where the problem becomes linearly separable. In SVMs, the mappings performed by some kernel functions are very similar; this will be explained in more detail in our upcoming SVM series.
  3. In a concrete analysis, either view can be used, and each has its advantages.
    • For example, the authors personally lean toward the geometric-transformation view, which makes the core of logistic regression easy to remember: a few pictures are enough, with all the relevant information condensed into the images, exceptionally clear.
    • At the same time, the feature-processing view helps you think about what features you actually have and how to process them. This is really the core perspective of data mining. As theoretical knowledge and work experience accumulate, you will increasingly find that once you have an unbiased, non-skewed dataset and have cleaned the data, feature processing is the core of the whole data mining process: how to collect features, how to recognize them, which to keep, which to discard, how to evaluate different features... These steps are the exquisitely subtle part with a decisive impact on your algorithm's results. This is a vast topic, feature engineering, with an enormous amount of content, which we will discuss in dedicated articles later in the series.
    • Overall, the geometric-transformation view is more intuitive and concrete, while the feature-processing view is more abstract and high-level. In practice, mastering the relationship between the two views and how to move between them, and analyzing with both, will give you a much richer and deeper understanding of the whole data mining process.
    • To contrast the two views more directly, we have prepared the table below for readers' reference.

Original article: http://blog.csdn.net/longxinchen_ml/article/details/49284391

Cover image source: www.taopic.com

About the authors:

龙心尘 and 寒小阳: they work on machine learning and data mining applications, and love machine learning and data mining.

“We are a group of friends who love machine learning and enjoy exchanging and sharing. We hope to use the ‘ML credit plan’ to discuss machine learning and meet more friends. Everyone is welcome to join our discussion group for resources, exchange, and sharing.”

Contact:

龙心尘 johnnygong.ml@gmail.com

寒小阳 hanxiaoyang.ml@gmail.com

 

Logistic Regression: From Beginner to Expert

This original article by 柳超, founder of 天眼查, was first published on Tencent.

Foreword

Compared with random forests, support vector machines, neural networks, and all the fancy permutations and combinations of algorithms, logistic regression strikes most people as an overly traditional statistical method. At the end of 2014, when I headed for Silicon Valley with dreams of saving the world, I thought so too.

But in the course of my work I gradually discovered that no matter how fancy or glamorous a project sounds, the data analysis heavyweights in Silicon Valley mostly reach for logistic regression first. And the fancy algorithms I had thought would let me save the world are really just transformations and generalizations of logistic regression, with only slight differences in principle.

Later, working on projects in other areas such as search and ad serving, I came to appreciate the importance of logistic regression even more. So, as a data scientist trained in statistics, I strongly recommend the following article to readers who dislike reading textbooks. I don't know how to describe how I felt when I first read it; it was like suddenly being handed the answer sheet during the college entrance exam (the analogy isn't perfect, but it really was that feeling, and I'm sure you can relate).

As for how to see through all the fanciness and understand it all, please read the master's article! (By 纪思亮)

◆ ◆ ◆

Abstract

Logistic Regression (LR) is arguably the most widely used automatic classification algorithm on the internet: from spam filters running on a single machine to online advertising systems backed by hundreds or thousands of machines, the backbone of the algorithm is LR. Because of its ubiquity and importance, everyone discusses LR at work to some degree, but the author has found that many colleagues' understanding of LR could be taken further and deepened. So the author has prepared this article, logistic regression from beginner to expert, to explore the topic together. The goal is not to introduce LR broadly and superficially the way Wikipedia does; rather, the emphasis is on understanding LR and mastering the optimization algorithms behind it, so that readers can implement the large-scale LR models they need with more confidence and keep improving them for their actual problems. In addition, since fitting LR is an optimization problem with very nice properties, this article takes the opportunity to introduce, fairly systematically, the lineage of numerical optimization algorithms from steepest gradient descent, to Newton's method, to the quasi-Newton methods (including DFP, BFGS, and L-BFGS); consider it a supplement to the Chinese-language tutorials on numerical optimization. Finally, leaders, experts, and the researchers and engineers on the front lines are all invited to offer corrections, trade ideas, and improve together!

◆ ◆ ◆

1. Motivation and Intended Readers

In everyday work and study we constantly face decision problems: is this email spam, is this user interested in this product, should I buy this house, and so on. Those familiar with machine learning (ML) know that the most common way to make such decisions is to build a program called a classifier. Its input is a set of features of the problem to be decided, and its output is the program's verdict. Take spam classification as an example: each email is a decision problem, and the features typically used are pieces of information extracted from the email itself that we believe may be relevant, such as the sender, the length of the email, the time, keywords in the message, punctuation, whether there are multiple recipients, and so on. Given these features, our spam classifier can decide whether the email is spam. As for how the spam classifier program is obtained, the usual approach is some machine learning algorithm. It is called “learning” because these algorithms usually require some already labeled samples (for example, 100 emails, each clearly marked as spam or not), from which the algorithm automatically produces a classifier program for the problem. The logistic regression (LR) covered in this article is the most commonly used machine learning classification algorithm.

Many readers may know that machine learning offers dozens of classifiers, so why single out LR? Three reasons:

  1. The LR model is simple in principle, and there is a ready-made tool library, LIBLINEAR, that is easy to pick up and works well.
  2. LR is arguably the most widely used and most influential classification algorithm on the internet. LR is the basic algorithm behind the click-through rate (CTR) prediction models in almost all advertising and recommendation systems.
  3. LR is also the basic building block of today's red-hot “deep learning”, so a solid grasp of LR will help you learn deep learning well.

But this is not a popular-science article about LR. If you want a general overview of LR, the best approach is Wikipedia or a decent machine learning textbook. The goal of this article, instead, is for you not merely to know what, but to know why, truly going from beginner to expert, so that you can tackle new problems in LR practice with more confidence. We can roughly divide the path from beginner to expert into three levels.

  • Knowing LR: understand the LR model, L1 and L2 regularization, why L1 regularization is more inclined to produce sparse models, and the advantages of sparse models.
  • Understanding LR: understand LR's learning algorithm, be able to independently derive L1- and L2-regularized LR based on L-BFGS, and implement it in parallel on an MPI platform.
  • Improving LR: be able to apply LR freely in practice and keep improving it to solve problems you have never encountered before. For example, how should LR be adjusted when positive samples are far scarcer than negative samples (as in ad click-through rate prediction)? How should the algorithm be adjusted when a sizable portion of the data is missing?

Since fitting LR is a well-behaved unconstrained optimization problem, this article, while introducing LR, also gives a fairly systematic introduction to several commonly used numerical methods for unconstrained optimization, including steepest gradient descent, Newton's method, and the quasi-Newton methods DFP, BFGS, and L-BFGS. The hope is that readers not only know of these algorithms but truly understand their principles and use cases, for example why the two-loop recursion in L-BFGS approximates the Newton direction and can easily be parallelized. These are general-purpose algorithms for unconstrained problems with very broad application, yet the author has not found a relatively systematic and easy-to-understand Chinese tutorial on them, so this article is also an attempt to fill that gap. In short, the hope is that you learn the optimization algorithms along with LR, to the benefit of both.
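
As a minimal sketch of the kind of solver discussed here (not the article's own implementation), the snippet below fits an L2-regularized logistic regression with L-BFGS via scipy.optimize.minimize; the toy data, the regularization strength, and the choice of scipy are assumptions for illustration.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
    true_w = np.array([0.5, 2.0, -3.0])
    y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)
    lam = 0.1  # L2 regularization strength

    def cost_and_grad(w):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        # Negative log-likelihood plus an L2 penalty (the intercept is not penalized)
        cost = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)) + lam * np.sum(w[1:] ** 2)
        grad = X.T @ (p - y) / len(y)
        grad[1:] += 2 * lam * w[1:]
        return cost, grad

    result = minimize(cost_and_grad, np.zeros(3), jac=True, method="L-BFGS-B")
    print(result.x)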

So the expected readers of this article fall roughly into the following categories:

  1. Advanced machine learning practitioners: five or more years of machine learning experience after a master's degree. Treat this as a review of LR and unconstrained optimization algorithms, and please correct any errors or omissions.
  2. Intermediate machine learning practitioners: three to five years of experience after a master's degree. Use this as study material to tie together what you have already learned, fill in the gaps, and truly understand LR and the related optimization algorithms, so that you can give sound guidance in engineering practice.
  3. Machine learning beginners: fewer than three years of experience after a master's degree. Please take out pen and paper and derive every formula here yourself, so that when your leader tells you something, you know how to get started.
  4. Everyone outside machine learning: you only need to know that LR is a handy automatic classification algorithm and can ask the researchers and engineers to build it. You can also use this article to tease the machine learning crowd about working themselves to exhaustion for so little money.

All in all, whatever your level of machine learning mastery, I hope this article brings you something. The author warmly welcomes inquiries, exchanges, and discussions about this article and any question related to machine learning or data mining. Enough small talk; on to the main content.

◆ ◆ ◆

Overview of the Main Text

*Given the length of the article, only the section headings are listed here.

2. A First Look at Logistic Regression

3. L1 vs. L2 Regularization

4. Solving L2-Regularized Logistic Regression

5. OWL-QN: Solving L1-Regularized Logistic Regression with L-BFGS

6. Related Work

7. Research Frontiers

◆ ◆ ◆

About the Author

Dr. 柳超 is the founder, chairman, and general manager of 天眼查 (Tianyancha). He is an expert of the national youth “Thousand Talents Plan”, a specially appointed expert of Beijing, a specially appointed “big data” professor at Beihang University, a member of the China Big Data Expert Committee, and an expert of the national next-generation internet industry technology innovation alliance. Before founding 天眼查, Dr. 柳超 was chief scientist at Sogou, a research manager at Microsoft Research headquarters in the United States, and a reviewer for the data mining program of the US National Science Foundation.

Dr. 柳超 received his Ph.D. in computer science from the University of Illinois in 2007, along with the university's outstanding dissertation award, and went on to produce excellent research in data mining, cloud computing, big data analytics, and software engineering. From 2008 to 2012 he worked at Microsoft Research in the United States, leading the data intelligence team and making notable contributions in information retrieval, data mining, machine learning, and many other big-data-related areas, publishing 3 English monographs, 5 international journal articles, and more than 30 papers at top international conferences, with over 1300 independent citations.

While in the United States, Dr. 柳超 served as a reviewer for the data mining program of the US National Science Foundation and was repeatedly invited to give keynote talks at well-known international conferences. His work has drawn international attention and was featured as an invited cover article in IEEE Computer magazine.

In 2012, Dr. 柳超 returned to China and joined Tencent Technology (Beijing) Co., Ltd., leading the data mining and machine learning work for Tencent Search. In 2014, when Tencent's search business merged strategically with Sogou, he joined Sogou as chief scientist, built the Sogou Data Science Institute from scratch, and took overall charge of cutting-edge research in data mining and machine learning for Sogou's internet business.
天眼查, the company Dr. 柳超 founded, is China's first relationship-discovery platform. With the mission of “letting everyone see this world fairly”, its products not only visualize complex business relationships but also mine and analyze the underlying data in depth, flag risks, and more. 天眼查 has built big data solutions for media, finance, government, law, and many other sectors. For more, visit: www.tianyancha.com.

Download the original article: LR_intro

The 20 Books Internet Entrepreneurs Must Read in 2016

2016-02-10 方军 (published by 做書)

Entrepreneurship has become a craze, and internet entrepreneurship even more so. Starting from 2016, I am tentatively drawing up this list of “twenty books for internet entrepreneurs”, to “learn together” with friends who are starting companies, or perhaps to “get through it together”.

In the process of starting or creating something, we run into a great deal of confusion and doubt, and one important way to work through it is to read: to converse with great minds and to test and reflect on our own thinking against theirs.

The list is divided into several parts: on mindset, on method, and on ways of thinking.

The list contains essentially no business biographies; that is an important genre from which founders draw inspiration and motivation, but precisely because it is so obvious it needn't be included. Nor does it include books on business models, which are likewise a natural reading choice.

The list has quite a few new books but also many old ones. You don't need to read it in one go; the books and the points mentioned are only prompts, so that when you hit a problem or have some spare time, you can find the right one.

On Mindset

 01 《网络经济的十种策略》 (New Rules for the New Economy)

Kevin Kelly (KK), Guangzhou Publishing House, 2000

“Bees matter more than lions; exponents matter more than addition; ubiquity matters more than scarcity; free matters more than profit; the network matters more than the company; building mountains matters more than climbing them; space matters more than place; flow matters more than equilibrium; relationships matter more than capacity; opportunity matters more than efficiency.”

Although the list is in no particular order, putting Kevin Kelly's New Rules for the New Economy first is still a deliberate choice:

Hearing him speak in 2013 from slides made more than twenty years earlier, I realized that what makes these thinkers formidable is that they foresaw everything we are now living through.

Just like the science fiction novel Snow Crash (Neal Stephenson) that I have recommended repeatedly over the past few years: in a genre where everyone imagined a future controlled by whoever controls the operating system, it depicted a world run by a “deliveryman”, and that is the world we live in.

 02 《创业维艰:如何完成比难更难的事》 (The Hard Thing About Hard Things)

Ben Horowitz, CITIC Press, 2015

“A startup CEO should not calculate the odds of success. When building a company, you must believe there is a solution to every problem. And your job is to find that solution, whether the odds are nine in ten or one in a thousand; your job never changes.”

I have read the English and Chinese editions of this book many times, because Ben Horowitz, who has since become a successful venture capitalist, doesn't dress things up: he describes the situations he faced, how he chose, and how he thought.

The book is full of quotable lines, so besides the one above I will add another:

“The real difficulty is not drawing an org chart, but getting people to communicate within the organization you have just designed.”

 03 《麦哲伦传》 (Zweig's biography of Magellan)

Stefan Zweig, Haiyan Publishing House, 2001

“This stern soldier, who never showed his feelings in front of anyone, was suddenly overcome by a surge of warmth from deep within. His eyes blurred, and hot tears of emotion welled up and rolled down into his unkempt black beard.”

Yes, this is the biography of that Magellan, the first to complete a voyage around the globe, and yes, by that writer, Zweig.

Ever since 申音 recommended this book somewhere, I have read it again and again. The reason is not how Magellan proved in his era that the earth is round; for us today, circling the globe is just a plane ride.

I read it to experience, over and over, the process he went through: having a bold plan, holding a conviction that was later proven right, managing to raise the money and assemble a large crew, setting out with wrong assumptions and a wrong plan of action, running into hardship and repeated mutinies, yet completing a voyage no one had made before.

 04 《一代新机器的灵魂》 (The Soul of a New Machine)

Tracy Kidder, China Machine Press, 1990

“As for the machine's real inventors, the engineers, they seemed a bit out of place at the event, perhaps because they had rarely experienced such occasions. ... Somehow, I blurted out: ‘It's just a computer, you know. It's really a small thing in the world.’”

This book tells the story of how a computer was created, back in the minicomputer era.

If not for the smartphone manufacturing and smart hardware boom of the past couple of years, few of us would ever get to feel what it is like for a “machine” to be created.

Most of what we create are websites, apps, and business systems, but the process we go through is the same. This book actually competed for a spot on the list with Michael Lewis's chronicle of early internet entrepreneurship, The New New Thing (《将世界甩在背后》), and I finally chose The Soul of a New Machine, because reading it stirred up many feelings while The New New Thing did not.

 05 《黑客与画家:硅谷创业之父Paul Graham文集》 (Hackers & Painters)

Paul Graham, Posts & Telecom Press, 2011

“The way to create beautiful things is often not to start from scratch, but to make small adjustments to existing results, or to combine existing ideas in relatively new ways.”

Paul Graham, founder of the YC startup accelerator, has become a kind of symbol. Recommending this book is really not just recommending the book itself, since the book is finished and fixed, while he keeps writing long essays laying out his thinking, which are worth following.

For example, one of his recent essays discusses “Life is short”. His answer to the question he raises is hidden in the title: look back from the end of the problem, and cultivate an impatient habit of being unable to wait to do what you want to do.

On Method

 06 《精益创业:新创企业的成长思维》 (The Lean Startup)

Eric Ries, 2012

“A startup is a human institution designed to create a new product or service under conditions of extreme uncertainty.”

Lean Startup has passed through a wave of hype and is probably fading quietly, but it is still worth revisiting now and then, because it uses a logical model to describe the learning process we inevitably go through: the process by which a company (or product) goes from nothing to something.

 07 《创业必经的那些事》

Two volumes, Michael Gerber, October 2010

“If you are running a small company, or want to own one, this book was written for you.”

Gerber's book is not written for founders who want to build exponential companies. Before the startup craze, it was published under the plainer Chinese title 《突破瓶颈》 (“Breaking Through the Bottleneck”). It is a management story that packs in the experience and knowledge a small business owner has to acquire.

But then, who doesn't start out as a small business owner?

I vaguely remember one practical piece of wisdom I got from the book at the time: even if there is only one person, draw the management structure chart.

That one person is the chairman, the CEO, the VP of marketing, the VP of operations, the VP of finance, the sales manager, the admin manager... The chairman assigns tasks to the CEO and the management team, the CEO assigns tasks to the self in charge of marketing, the self in charge of operations, the self in charge of finance, and so on down the line.

 08 《启示录:打造用户喜欢的产品》 (Inspired: How to Create Products Customers Love)

Marty Cagan, Huazhong University of Science and Technology Press, 2011

“This book is written for the members of software product teams (both enterprise and consumer products, especially internet software product teams) who are responsible for defining the product; they are usually called product managers. The role is often filled by the company's founder, a senior executive, the lead programmer, or the designer.”

This is a down-to-earth handbook covering the basics of software or internet product (and technology) teams: which people are needed and what their roles are, how to define a product, plus some discussion of development methods. Plain, yet comprehensive.

 09 《四步创业法》 (The Four Steps to the Epiphany)

Steven Blank, Huazhong University of Science and Technology Press, 2012

“The customer development method was proposed to solve the ten big problems of the product development approach. It divides a startup's early customer-related activities by goal into four easy-to-understand stages: customer discovery, customer validation, customer creation, and company building.”

Anyone who has read The Lean Startup carefully knows this book, The Four Steps to the Epiphany; it is the source of inspiration for The Lean Startup, and Eric Ries says he gave away many boxes of it.

The book still deserves its own spot on the list because the customer development method tells the story from the customer's perspective; unlike lean's emphasis on learning, here the customer is everything.

 10 《大决策:九个不朽的领导力传奇故事》

Michael Useem, China Machine Press, 2007

“This expedition needed a strong leader, not a dictator.”

Wharton professor Michael Useem recounts a series of leadership moments; I even went and found his book about leading a team up a snow-capped mountain to read.

While reading, these mostly non-business stories kept pushing me to ask: if it were me, what would I do?

 11 《丰田汽车案例:精益制造的14项管理原则》 (The Toyota Way: 14 Management Principles)

Jeffrey Liker, China Financial & Economic Publishing House, 2004

“The Toyota Way can be briefly summarized as two pillars: one is ‘continuous improvement’, the other ‘respect for people’.

Continuous improvement is generally called kaizen in Japanese; it challenges everything. Its essence is not merely the actual improvements individuals contribute, but, more importantly, creating a spirit of continuous learning and an environment that embraces and sustains change.”

This is not the most classic work on lean; the classics are The Machine That Changed the World and Lean Thinking by Womack and colleagues.

But Jeffrey Liker is also a well-known scholar of Toyota and lean, and, more importantly, this book is simple, clear, and relatively easy to put into practice.

Lean startup thinking was once all the rage, but lean production thinking only becomes useful once you reach a certain scale.

Still, it doesn't hurt to learn it early; what if you suddenly hit high-speed growth? The value of the Toyota example also lies in the fact that it is essentially a complete system rather than a single over-emphasized idea or tool; the whole of it, from culture to organization to product development to production, can be adopted.

 12 《跨越鸿沟:颠覆性产品营销圣经》 (Crossing the Chasm)

Geoffrey Moore, China Machine Press, 2009

“In developing a high-tech product market, the most dangerous and critical point is the transition from an early market dominated by a few visionaries to a mainstream market occupied by a large body of pragmatist customers.”

This is one of the earliest works by the Geoffrey Moore who proposed the technology adoption life-cycle curve and pointed out the “chasm” in it.

To understand that curve you can read several of his books; it is actually not as obvious as it commonly appears, and has many subtleties.

 13 《创新者的窘境》 (The Innovator's Dilemma)

Clayton Christensen, CITIC Press, 2010

“The word ‘technology’ here means the process by which an organization transforms labor, capital, raw materials, and technology into products and services of greater value.”

This is the original text behind today's buzzword “disruptive innovation”. When I first read it, tucked into a mass-market library of translated classics, I took a great many notes and was especially interested in Christensen's views on theory versus reality. When I got to ask him questions in person I asked a great many, but now I remember nothing of that hour-plus except vaguely discussing Linux, Google Docs, and so on.

Why has this book drawn entrepreneurs' attention over the past thirty years? Probably because Clayton Christensen demonstrated, with extremely rigorous cases and logic, that the new will eventually defeat the old, and exactly under which circumstances it is most likely to win.

 14 《企业参谋》 (The Corporate Strategist)

Kenichi Ohmae (大前研一), CITIC Press, 2007

“My first book, 《企业参谋》, is the set of notes I made between the ages of 30 and 31. I joined McKinsey at 29 knowing nothing about management, so I learned while I worked and left these notes. I was a newcomer and never imagined they would one day be published; I simply wrote down what I learned, out of long habit, nothing more.”

There is no better handbook for thinking about corporate strategy. Decades on, the book still works; it offers few conclusions, only methods and process.

 15 《创业之初你不可不知的融资知识》

桂曙光, China Machine Press, 2012

“VCs actually operate much like ordinary manufacturers: they first buy raw material cheaply from good founders, namely shares in their startups, then process the material by providing value-added services, or simply wait for the founders' own efforts, so that these share ‘raw materials’ become more standardized products whose value has risen.”

Read it and you will know basically everything you need to know about venture capital. Read it when you need it; reading it ahead of time is of little use.

On Ways of Thinking

 16 《错不在我?》 (Mistakes Were Made (But Not by Me))
Carol Tavris and Elliot Aronson, CITIC Press, 2013

“Between deliberately lying to deceive others and unconsciously deceiving ourselves through self-justification lies a gray zone governed by unreliable, self-serving memory. Memory is routinely revised and reshaped by a self-enhancing bias, blurring what happened, softening blame, and distorting the truth.”

A founder probably rarely gets the chance to say “it wasn't my fault”, but this book is still worth reading, because it is not about dodging responsibility; it is about our habits of mind.

All of us have a strong tendency toward after-the-fact rationalization. Only by becoming aware of this habit of mind and looking at it deliberately can we see how it affects us.

 17 《创业无畏:指数级成长路线图》 (Bold)

Peter Diamandis and Steven Kotler, Zhejiang People's Publishing House, 2015

“To help everyone grasp the characteristics of exponential technologies more easily, I built a ‘6D’ framework: digitalization, deception, disruption, demonetization, dematerialization, and democratization. The 6Ds are the chain reaction of exponential technology development, a roadmap of rapid development that causes enormous upheaval and brings rare opportunity.”

For me, repeatedly reading the books that have come out of Singularity University always manages to spark techno-optimism, including Exponential Organizations, Abundance, and Bold (《创业无畏:指数级成长路线图》).

We also see Musk, through one person's efforts, breaking through in four areas: electric cars, private spaceflight, solar energy, and the hyperloop. He is practically the real-world commercial spokesman for techno-optimism.

What distinguishes the Singularity University researchers from earlier techno-optimists is that they do not operate in futurist mode; they are technology doers who mix theory and practice together.

Sometimes that adds to their appeal, and sometimes it makes them seem insufficiently futuristic, because being too practical invites nitpicking, and being practical demands fast iteration and correction. But that is their true state.

It may also be the state those of us working in practice should be in: maintain a pragmatic techno-optimism, and work to realize it.

 18 《九败一胜:美团创始人王兴创业十年》

李志刚, Beijing United Publishing, 2014

“We are an e-commerce company; transaction volume is generated by the B side and the C side. How do we grow from seventy or eighty million users to 200 million, 400 million, 500 million? We need to expand B and expand C; with enough B and enough C there will be enough transaction volume. We will try new opportunities, including dining, hotels, movies, and leisure travel, and there will be new attempts outside the company as well. Overall, to face this war we must strengthen the team and raise per-person efficiency through improvements on every front; there are still far too many places where we need to improve.”

The original Meituan has since merged with Dianping to become the new Meituan-Dianping (whose English name is said to be China Internet Plus), but it is still fighting.

The title borrows from Uniqlo's One Win, Nine Losses (《一胜九败》). I put this book under ways of thinking because it is about how we view and live through failure. Uniqlo's book is the plainer one: one win and nine losses, yet still going. 王兴 has in fact had nine losses and one win, but has not finally won yet either. 王兴's way of thinking may be the most formidable among China's internet entrepreneurs.

 19 《深度生存:生还是死难?》 (Deep Survival: Who Lives, Who Dies, and Why)

Laurence Gonzales, China Translation & Publishing Corporation, 2006

“In general, I advise staying away from people with too much of the hero about them, Rambo-style tough guys, and also from those who constantly complain or whine. Trust people with a sense of humor, especially self-deprecating humor, and a clear-eyed view of themselves. Those who make the most of their surroundings, accept reality, get to know the environment, and care for others tend to have the best chance of coming back alive. In the end, whatever the situation, survival is nothing more than adaptation to the environment.”

This is the outdoor-adventure edition of The Hard Thing About Hard Things, except here it is literally life and death. Chapter 9, on “bending the map”, is the part I recommend most.

 20 《卓有成效的管理者》 (The Effective Executive)

Peter Drucker, Shanghai Translation Publishing House, 1999

“‘Know thyself’, that ancient maxim full of wisdom, is almost impossibly hard for modern mortals to live up to. But if you want your work to be effective and to contribute to others, you can still follow the rule of mastering your own time.”

This is an unremarkable book that states ordinary truths: effectiveness can be learned, control your own time, ask what contribution I can make, build on your strengths, put first things first, and so on.

In the same vein, Intel's Andy Grove also has an unremarkable book, High Output Management (《格鲁夫给经理人的第一课》); he focuses relatively more on the efficiency of the system, and it is also worth recommending.

But if you read only one of these twenty books on so-called internet entrepreneurship, then without question, read Drucker's The Effective Executive.