Common Machine Learning Algorithms

Introduction

Although Google's self-driving cars and robots receive a lot of attention, the company's real future lies in machine learning, the technology that makes computers smarter and more individualized.

– Google Chairman Eric Schmidt

We are probably living through the most pivotal time in human history: the period in which computing moved from large mainframes to PCs to the cloud. However, what makes it unique is not what has already occurred, but what lies ahead for us in the years to come.

For someone like me, the democratization of various tools and methods that followed the rise in computing is what makes this period exciting and captivating. The field of data science is waiting for you!

For a few dollars an hour, a data scientist can build machines that crunch data using complex algorithms. However, getting here was difficult! I had my bad nights and days.

Who can benefit the most from this guide?

The goal of this blog is to make the process of becoming a data scientist interested in machine learning easier for people all over the world. I will make it possible for you to work on machine learning issues and gain experience through this guide. I'm providing various machine learning algorithms and a high-level understanding of them. You should be able to get your hands dirty with these.

You don't need to understand the statistics behind these methods at the outset, so I've deliberately left them out. Therefore, you should look elsewhere for a statistical understanding of these algorithms. However, you are in for a treat if you want to prepare yourself to begin building a machine learning project.

Types of Machine Learning

Supervised Learning

How it operates: This type of algorithm involves a target/outcome variable (also known as the dependent variable) that is to be predicted from a given set of predictors (also known as independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. Training continues until the model reaches the desired level of accuracy on the training data. Supervised Learning Examples: KNN, Logistic Regression, Decision Tree, Random Forest, and others.

Unsupervised Learning

How it operates: In this type of algorithm, we do not have any target or outcome variable to predict or estimate. It is used to cluster a population into distinct groups, and is widely used for segmenting customers into groups for specific interventions. Unsupervised Learning Examples: K-means, the Apriori algorithm.

Reinforcement Learning

How it operates: Using this algorithm, the machine is trained to make specific decisions. The machine is exposed to an environment in which it trains itself continually through trial and error. It learns from past experience and tries to capture the best possible knowledge in order to make accurate business decisions. Reinforcement Learning Example: Markov Decision Process.

List of Common Machine Learning Algorithms

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

Linear Regression
Logistic Regression
Decision Tree
SVM
Naive Bayes
kNN
K-Means
Random Forest
Dimensionality Reduction Algorithms
Gradient Boosting algorithms

GBM
XGBoost
LightGBM
CatBoost

Linear Regression

It is used to estimate real values (such as the cost of houses, the number of calls, total sales, etc.) based on one or more continuous variables. Here, we establish the relationship between the independent and dependent variables by fitting the best line. This best fit line is known as the regression line and is represented by the linear equation Y = a * X + b. The best way to understand linear regression is to relive a childhood experience.

Let's say you ask a child in the fifth grade to put people in his class in order of weight without asking them how much they weigh! What do you anticipate the child doing? He or she probably would visually analyze people's height and build and arrange them according to a mix of these visible parameters. In real life, this is linear regression! The child has actually realized that there is a relationship between weight and height and build, which looks like the equation above.

In this equation:

Y – Dependent Variable

a – Slope

X – Independent variable

b – Intercept

These coefficients a and b are derived based on minimizing the sum of the squared difference of distance between data points and the regression line.

For example, suppose we have identified the best fit line as y = 0.2811x + 13.9, where x is a person's height and y is their weight. Using this equation, we can find the weight, knowing the height of a person.
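To make the least-squares idea above concrete, here is a minimal sketch in plain Python. The function name fit_line and the four data points are my own illustrative assumptions, not data from this article; the formulas are the standard closed-form solution for a single independent variable.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line always passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
predicted = a * 5.0 + b  # estimate Y for a new X
```

With real, noisy data the points will not sit exactly on the line, and a and b are simply the values that make the sum of squared vertical distances as small as possible.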

Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable, while Multiple Linear Regression (as the name suggests) is characterized by multiple (more than 1) independent variables. While finding the best fit, you can also fit a polynomial or curvilinear relationship; these are known as polynomial or curvilinear regression.

Logistic Regression

Don't get confused by its name! It is a classification algorithm rather than a regression one. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variables. Simply put, it predicts the probability of an event occurring by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1.

Again, let's try to understand this by using a straightforward example.

Let's say a friend gives you a puzzle to solve. There are only two possible outcomes: either you solve it or you don't. Now imagine that you are given a wide range of puzzles and quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a tenth-grade trigonometry problem, you have a 70% chance of solving it. On the other hand, if it is a fifth-grade history question, the probability of getting the answer is only 30%. This is what Logistic Regression provides you.

In terms of mathematics, the log odds of the outcome is modeled as a linear combination of the predictor variables.

odds = p/(1-p) = probability of event occurring / probability of event not occurring
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk

Above, p is the probability that the characteristic of interest is present. Logistic regression chooses the parameters that maximize the likelihood of observing the sample values, rather than those that minimize the sum of squared errors as in ordinary regression.
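The relationship between the log odds and the probability can be sketched in a few lines of Python. The coefficient values, the intercept, and the function names below are hypothetical and chosen only for illustration; in practice the coefficients come out of the fitting step, which this sketch does not perform.

```python
import math


def logit(p):
    """Log odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))


def sigmoid(z):
    """Inverse of logit: maps any real number to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))


def predict_probability(coefficients, intercept, features):
    """p such that logit(p) = b0 + b1*X1 + ... + bk*Xk."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical fitted values: b0 = 0.1, b1 = 0.8, b2 = -0.5.
p = predict_probability([0.8, -0.5], 0.1, [2.0, 1.0])
```

Because sigmoid undoes logit, the linear combination can take any value while the prediction always stays between 0 and 1.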

Now, you might ask, why take a log? For the sake of simplicity, let's just say that this is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this article.

Decision Tree

This is one of my favorite algorithms and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes/ independent variables to make as distinct groups as possible. For more details, you can read Decision Tree Simplified.

image-source: statsexchange

To determine "if they will play or not," the population is divided into four distinct groups based on multiple attributes, as shown in the image above. To split the population into groups that are as different from each other as possible, it uses various techniques such as Gini, information gain, and entropy.
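To give a feel for the splitting measures mentioned above, here is a small sketch of Gini impurity, entropy, and information gain. The labels and the two candidate splits are made up for illustration; a real decision tree would evaluate many candidate splits on actual attributes and keep the best one.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical "play or not" labels and two candidate splits.
parent = ["yes", "yes", "no", "no"]
pure_split = [["yes", "yes"], ["no", "no"]]    # perfectly separates the classes
mixed_split = [["yes", "no"], ["yes", "no"]]   # separates nothing

gain_pure = information_gain(parent, pure_split)
gain_mixed = information_gain(parent, mixed_split)
```

The split that produces purer groups gets the higher information gain, which is why the tree would prefer it.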

The best way to understand how a decision tree works is to play Jezzball, a classic game from Microsoft (see image below). In essence, you have a room with moving balls, and you need to build walls so that as much of the room as possible is cleared of balls.

So, every time you split the room with a wall, you are trying to create 2 different populations within the same room. Decision trees work in a very similar fashion by dividing a population into as different groups as possible.

SVM (Support Vector Machine)

It is a method of classification. We plot each data point as a point in n-dimensional space using this algorithm, where n is the number of features you have and the value of each feature is a specific coordinate.

For instance, if a person only had two characteristics, such as height and hair length, we would first plot these two variables in two-dimensional space, where each point has two coordinates. (The points that end up closest to the dividing line are known as support vectors.)

Now, we will find a line that splits the data into two differently classified groups. This will be the line for which the distance to the closest point in each of the two groups is as large as possible.

In the example shown above, the line that splits the data into two differently classified groups is the black line, since the closest points in the two groups are as far from the line as they can be. This line is our classifier. Then, depending on which side of the line the testing data lands, that is the class we assign to the new data.
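The idea of preferring the line with the largest gap to the closest points can be illustrated with a short sketch. Note that this is not how an SVM is actually trained (a real implementation solves an optimization problem); it only compares two hand-picked candidate lines, and the points, the lines, and the class names "A" and "B" are all hypothetical.

```python
import math


def signed_distance(w, b, point):
    """Signed distance from a 2-D point to the line w[0]*x + w[1]*y + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])


def margin(w, b, points):
    """Distance from the line to the closest point."""
    return min(abs(signed_distance(w, b, p)) for p in points)


def classify(w, b, point):
    """Assign a class depending on which side of the line the point lands."""
    return "A" if signed_distance(w, b, point) >= 0 else "B"


# Hypothetical two-feature data.
group_a = [(3.0, 3.0), (4.0, 3.0)]
group_b = [(1.0, 1.0), (0.0, 1.0)]
points = group_a + group_b

# Two candidate lines that both separate the groups.
line_1 = ((1.0, 1.0), -4.0)   # x + y - 4 = 0
line_2 = ((0.0, 1.0), -1.5)   # y - 1.5 = 0

margin_1 = margin(*line_1, points)
margin_2 = margin(*line_2, points)
label = classify(*line_1, (3.5, 2.5))
```

Both lines classify the training points correctly, but line_1 stays farther away from the closest points, so it is the one this reasoning would pick.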

Imagine playing JezzBall in n-dimensional space with this algorithm. The game's tweaks are as follows:

You can draw lines or planes at any angle, rather than only horizontally or vertically as in the classic game.
The goal of the game is to separate balls of different colors into different rooms.
The balls do not move.

Naive Bayes

It is a classification method based on Bayes' theorem with an assumption of independence between predictors. Put simply, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For instance, a fruit may be considered an apple if it is red, round, and approximately 3 inches in diameter. Even if these features depend on each other or on the existence of other features, a Naive Bayes classifier would consider each of these properties to contribute independently to the probability that the fruit is an apple.

The Naive Bayesian model is simple to build and is especially useful for very large data sets. Along with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). Consider the following equation:

P(c|x) = P(x|c) * P(c) / P(x)

Here,

P(c|x) is the posterior probability of class (target) given predictor (attribute).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of the predictor.

Example: Let's use an example to make sense of it. A weather training data set and the target variable "Play" are provided below. Now, based on the weather, we need to classify whether players will play or not. Let's carry it out by following the steps below.

Step 1: Convert the data set into a frequency table.

Step 2: Create a likelihood table by finding the probabilities, such as the Overcast probability of 0.29 and the Playing probability of 0.64.

Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the posterior probability method discussed above.

P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here, P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability (compared with 0.40 for not playing), so the prediction is that players will play.
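The arithmetic above can be checked with a few lines of Python. I use exact fractions here so that rounding does not get in the way; the counts (3/9, 9/14, 5/14) are the ones quoted in the problem, while the function and variable names are my own.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence


p_sunny_given_yes = Fraction(3, 9)   # P(Sunny | Yes)
p_yes = Fraction(9, 14)              # P(Yes)
p_sunny = Fraction(5, 14)            # P(Sunny)

p_yes_given_sunny = posterior(p_sunny_given_yes, p_yes, p_sunny)
p_no_given_sunny = 1 - p_yes_given_sunny

# The class with the highest posterior probability is the prediction.
prediction = "Yes" if p_yes_given_sunny > p_no_given_sunny else "No"
```

Working with the exact fractions gives 3/5, which matches the rounded result of 0.60.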

Similar methods are used by Naive Bayes to predict the probability of various classes based on various attributes. Most of the time, this algorithm is used to classify text and solve problems with multiple classes.

kNN (k-Nearest Neighbors)

It can be used for both classification and regression problems. However, it is more widely used in classification problems in the industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k neighbors. A new case is assigned to the class that is most common among its K nearest neighbors, as measured by a distance function.

These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distances. The first three are used for continuous variables and the fourth (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge while performing kNN modeling.

KNN can easily be mapped to our real lives. If you want to learn about a person about whom you have no information, you might find out about their close friends and the circles they move in, and learn about the person that way!

Things to consider before selecting kNN:

KNN is computationally expensive
Variables should be normalized, or else higher-range variables can bias it
Do more work at the pre-processing stage, such as outlier and noise removal, before going for kNN
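Here is a minimal sketch of the majority-vote idea described in this section, using Euclidean distance. The training points, their "red" and "blue" labels, and the function names are hypothetical. Notice that every prediction scans all stored cases, which is where the computational cost mentioned above comes from.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points with continuous coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(training, query, k):
    """training is a list of (features, label) pairs; returns the majority label."""
    neighbors = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical, already-normalized training cases.
training = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"),
    ((7.0, 6.5), "blue"),
    ((6.5, 7.0), "blue"),
]

near_red = knn_predict(training, (1.2, 1.5), k=3)
near_blue = knn_predict(training, (6.4, 6.4), k=3)
```

With k = 1 the function simply returns the label of the single nearest neighbor.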

K-Means

It is a type of unsupervised algorithm that solves the clustering problem. Its procedure follows a simple way to classify a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to peer groups.

Do you remember figuring out shapes from ink blots? K-means is somewhat similar to this activity. You look at the shape and spread to work out how many different clusters or populations are present!

How K-means clustering works:

K-means picks k points, known as centroids, one for each cluster.
Each data point forms a cluster with its closest centroid, giving k clusters.
It finds the centroid of each cluster based on the existing cluster members. Here we have new centroids.
As we have new centroids, repeat steps 2 and 3: find the closest distance for each data point from the new centroids and associate it with the new k clusters. Repeat this process until convergence occurs, that is, until the centroids do not change.
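The steps above can be sketched in plain Python as follows. This is a simplified illustration under my own assumptions: the initial centroids are passed in by hand rather than chosen at random, an empty cluster keeps its previous centroid, and the six data points are made up.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Step 2: each data point joins the cluster of its closest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, centroids, max_iterations=100):
    clusters = assign(points, centroids)
    for _ in range(max_iterations):
        # Step 3: recompute each centroid from its current members.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # convergence: centroids did not change
            break
        centroids = new_centroids
        clusters = assign(points, centroids)
    return centroids, clusters


# Hypothetical data with two visible groups; the first two points seed k = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, [points[0], points[2]])
```

Even though both starting centroids sit inside the same group, the repeated assign-and-recompute loop pulls one of them over to the other group after a couple of iterations.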

How to figure out K's value:

In K-means, we have clusters, and each cluster has its own centroid. The sum of the squared differences between the centroid and the data points within a cluster constitutes the within sum of squares value for that cluster. When the sum of squares values for all the clusters are added together, the result is the total within sum of squares value for the cluster solution.

We know that this value keeps decreasing as the number of clusters increases. However, if you plot the result, you may see that the sum of squared distances decreases sharply up to some value of k and much more slowly after that. This is where we can find the optimal number of clusters.
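To see the sharp-then-slow decrease in numbers, here is a small sketch that computes the total within sum of squares for k = 1, 2, and 3. To keep it short, the cluster solutions are partitioned by hand on made-up one-dimensional data rather than produced by running K-means, so treat it as an illustration of the measure only.

```python
def within_cluster_ss(cluster):
    """Sum of squared differences between a cluster's points and its centroid."""
    n = len(cluster)
    centroid = tuple(sum(coords) / n for coords in zip(*cluster))
    return sum(sum((x - m) ** 2 for x, m in zip(point, centroid))
               for point in cluster)


def total_within_ss(clusters):
    """Total within sum of squares for a whole cluster solution."""
    return sum(within_cluster_ss(cluster) for cluster in clusters)


# Hypothetical 1-D data with two obvious groups, partitioned by hand.
solutions = {
    1: [[(1.0,), (2.0,), (3.0,), (11.0,), (12.0,), (13.0,)]],
    2: [[(1.0,), (2.0,), (3.0,)], [(11.0,), (12.0,), (13.0,)]],
    3: [[(1.0,), (2.0,), (3.0,)], [(11.0,), (12.0,)], [(13.0,)]],
}

totals = {k: total_within_ss(clusters) for k, clusters in solutions.items()}
```

Going from one cluster to two removes almost all of the total, while adding a third cluster barely helps, which points to k = 2 for this data.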

Random Forest

Random Forest is a term for an ensemble of decision trees. In Random Forest, we have a collection of decision trees (the "forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification with the most votes (over all the trees in the forest).

Each tree is planted & grown as follows:

If the number of cases in the training set is N, then a sample of N cases is taken at random but with replacement. This sample will be the training set for growing the tree.
If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on this m is used to split the node. The value of m is held constant during the forest growth.
Each tree is grown to the largest extent possible. There is no pruning.
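The three ingredients described above (sampling N cases with replacement, picking m of the M variables at random, and majority voting) can be sketched as below. The tree-growing step itself is left out, and the case IDs, the feature counts, the fixed random seed, and the votes are all hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(cases, rng):
    """Draw N cases at random with replacement from N training cases."""
    n = len(cases)
    return [cases[rng.randrange(n)] for _ in range(n)]


def random_feature_subset(num_features, m, rng):
    """Pick m of the M input variables (by index) to consider at a node."""
    return rng.sample(range(num_features), m)


def forest_vote(tree_predictions):
    """The forest chooses the classification with the most votes."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)        # fixed seed so the sketch is repeatable
cases = list(range(10))       # stand-ins for 10 training cases

sample = bootstrap_sample(cases, rng)
features = random_feature_subset(num_features=8, m=3, rng=rng)
winner = forest_vote(["play", "play", "stay home", "play", "stay home"])
```

Because the sampling is done with replacement, some cases usually appear more than once in a tree's training set while others are left out, which is part of what makes the trees differ from each other.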

Dimensionality Reduction Algorithms

In the last 4-5 years, there has been an exponential increase in data capturing at every possible stage. Corporations, government agencies, and research organizations are not only coming up with new sources, they are also capturing data in great detail.

For example, e-commerce companies are capturing more details about customers, such as their demographics, web crawling history, what they like or dislike, purchase history, feedback, and many others, to give them more personalized attention than your nearest grocery shopkeeper could.

As data scientists, the data we are offered also consists of many features. This sounds good for building a robust model, but there is a challenge: how would you identify the highly significant variables out of 1000 or 2000? In such cases, dimensionality reduction algorithms help us, along with various other approaches such as Decision Tree, Random Forest, PCA, Factor Analysis, identification based on the correlation matrix, missing value ratio, and others.
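Two of the simpler approaches named above, the missing value ratio and the correlation matrix, can be sketched in plain Python. The thresholds (0.5 and 0.95), the column names, and the values are my own assumptions for illustration, and the sketch assumes no column is constant (which would make the correlation undefined).

```python
import math


def missing_ratio(column):
    """Fraction of values in a column that are missing (None)."""
    return sum(value is None for value in column) / len(column)


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def paired_values(a, b):
    """Keep only the rows where both columns have a value."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return [x for x, _ in pairs], [y for _, y in pairs]


def select_features(table, max_missing=0.5, max_correlation=0.95):
    # Drop columns with too many missing values.
    kept = [name for name, column in table.items()
            if missing_ratio(column) <= max_missing]
    # Drop a column if it is nearly a copy of one already selected.
    selected = []
    for name in kept:
        redundant = any(
            abs(pearson(*paired_values(table[name], table[other]))) > max_correlation
            for other in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Hypothetical customer table.
table = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],    # nearly a copy of height_cm
    "shoe_size": [None, None, None, 42.0],    # mostly missing
    "age": [31.0, 25.0, 47.0, 38.0],
}

selected = select_features(table)
```

Four candidate variables shrink to two here; on a table with 1000 or 2000 columns the same two filters are often a cheap first pass before heavier methods such as PCA.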

Gradient Boosting Algorithms

GBM

GBM is a boosting algorithm used when we deal with plenty of data to make a prediction with high predictive power. Boosting is an ensemble learning technique that combines the predictions of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. These boosting algorithms often perform well in data science competitions such as Kaggle, AV Hackathon, and CrowdAnalytix.
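To show how several weak predictors add up to a stronger one, here is a toy sketch of gradient boosting for regression with squared error. Each weak predictor is a one-split "stump" fitted to the errors left by the predictors before it. The data, the learning rate of 0.5, and all the names are assumptions for illustration; real GBM libraries are far more elaborate.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, learning_rate=0.5):
    base = sum(ys) / len(ys)              # start from the mean of the targets
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # Each new stump is fitted to what the current ensemble gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.0, 3.1, 5.0, 5.2]

error_0 = mean_squared_error(boost(xs, ys, rounds=0), xs, ys)
error_1 = mean_squared_error(boost(xs, ys, rounds=1), xs, ys)
error_20 = mean_squared_error(boost(xs, ys, rounds=20), xs, ys)
```

A single stump can only produce two distinct values, yet the training error keeps falling as more of them are combined.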

XGBoost

XGBoost is another classic gradient boosting algorithm, known to be the difference between winning and losing in some Kaggle competitions.

Because it possesses both a linear model and the tree learning algorithm, XGBoost is a strong option when accuracy matters in such events. Additionally, the algorithm is reported to be nearly ten times faster than existing gradient boosting techniques.

Its support includes various objective functions, including regression, classification, and ranking.

One of the most intriguing aspects of XGBoost is that it is also referred to as a regularized boosting method. This helps to reduce overfitting. It also supports a variety of programming languages, including Scala, Java, R, Python, Julia, and C++.

It supports distributed training across many machines, including GCE, AWS, Azure, and Yarn clusters. XGBoost can also be used in conjunction with Spark, Flink, and other cloud dataflow systems, and it has cross-validation built in at every iteration of the boosting process.

LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:

Faster training speed and higher efficiency
Lower memory usage
Better accuracy
Parallel and GPU learning supported
Capable of handling large-scale data

It is a fast, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks. It was developed under the Distributed Machine Learning Toolkit Project of Microsoft.

Since LightGBM is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise. So when growing on the same leaf in LightGBM, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which can result in better accuracy than many existing boosting algorithms achieve.

Also, it is surprisingly very fast, hence the word ‘Light’.

CatBoost

CatBoost is an open-source machine learning algorithm from Yandex. It can easily integrate with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML.

The greatest advantage of CatBoost is that, unlike other machine learning models, it does not require extensive data training and can work with a wide range of data formats, without compromising on how robust it can be.

Before beginning the implementation, ensure that any missing data are handled appropriately.

CatBoost can deal with categorical variables automatically without raising a type conversion error, which lets you concentrate on tuning your model rather than fixing minor errors.

Conclusion

You should be familiar with common machine learning algorithms by this point. The sole purpose for which I have written this article is to get you started immediately. Start right away if you want to master machine learning algorithms. Use these algorithms to solve problems, gain a physical understanding of the process, and experience the fun!

How helpful was this article to you? In the comments section below, please share your thoughts on the machine learning algorithms.

Stay connected for more articles on Machine Learning.

CodeITronics