Introduction to Scikit-Learn

The Scikit-Learn package provides efficient implementations of a large number of common machine learning algorithms. It is characterized by a clean, uniform, and streamlined API, as well as by thorough online documentation. A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is straightforward.

Data Representation in Scikit-Learn

Data is typically held in a Pandas DataFrame or a NumPy array. For use in Scikit-Learn, we extract the features matrix and target array from the DataFrame, which we can do using standard Pandas DataFrame operations.

import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

%matplotlib inline
sns.set()
sns.pairplot(iris, hue='species', height=1.5);

# Feature matrix
X_iris = iris.drop('species', axis=1)
X_iris.shape

# Target array
y_iris = iris['species']
y_iris.shape
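The same split can be reproduced without seaborn. The following is a minimal sketch, assuming the copy of the iris data bundled with Scikit-Learn (loaded with `load_iris(as_frame=True)`), where the label column is named `target` rather than `species`. The variable names are illustrative.

```python
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled iris data as a DataFrame.
# Its label column is 'target', not 'species' as in the seaborn copy.
iris_df = load_iris(as_frame=True).frame

# Features matrix: every column except the label, shape (n_samples, n_features)
X = iris_df.drop('target', axis=1)

# Target array: the label column alone, shape (n_samples,)
y = iris_df['target']

print(X.shape, y.shape)
```

The features matrix is two-dimensional and the target array is one-dimensional, with the same number of rows in each.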

Scikit-Learn's Estimator API

The Scikit-Learn API paper lays out the following guiding principles:

Consistency: All objects share a common interface drawn from a limited set of methods, with consistent documentation.

Inspection: All specified parameter values are exposed as public attributes.

Limited object hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.

Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.

Sensible defaults: When models require user-specified parameters, the library defines an appropriate default value.

Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API, which provides a consistent interface for a wide range of machine learning applications.
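A few of these principles can be observed directly on an estimator. The sketch below is illustrative only: it picks `LinearRegression` and `PolynomialFeatures` as example estimators and uses `make_pipeline` to show composition.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Sensible defaults: no arguments are required to build an estimator.
model = LinearRegression()

# Inspection: hyperparameters are public attributes and are also
# available as a dict through get_params().
print(model.fit_intercept)
params = model.get_params()
print(params['fit_intercept'])

# Composition: a pipeline chains simpler estimators and is itself an
# estimator with the same fit()/predict() interface.
pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
step_names = [name for name, _ in pipeline.steps]
print(step_names)
```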

Basics of the API

Most commonly, the steps in using the Scikit-Learn estimator API are as follows.

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.

2. Choose model hyperparameters by instantiating this class with desired values.

3. Arrange data into a features matrix and target vector following the discussion above.

4. Fit the model to your data by calling the fit() method of the model instance.

5. Apply the model to new data:

   For supervised learning, often we predict labels for unknown data using the predict() method.

   For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
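As a rough illustration of the unsupervised variant of these steps, here is a minimal sketch that uses PCA on the iris features. PCA is an assumption chosen for the example (it is not discussed elsewhere in this text), and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA      # choose a class of model

pca = PCA(n_components=2)                  # choose hyperparameters
X, _ = load_iris(return_X_y=True)          # arrange data (no target is needed)
pca.fit(X)                                 # fit the model to the data
X_2d = pca.transform(X)                    # apply the model: transform the data

print(X_2d.shape)
```

The calls have the same shape as in the supervised case; only the final step uses transform() instead of predict().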

Simple Linear Regression

As an example of this process, let's consider a simple linear regression—that is, the common case of fitting a line to (x,y) data. We will use the following simple data for our regression example.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState()
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y);

x.shape

1. Choose a class of model

In Scikit-Learn, every class of model is represented by a Python class. So, for example, if we would like to compute a simple linear regression model, we can import the linear regression class:

from sklearn.linear_model import LinearRegression

2. Choose model hyperparameters

An important point is that a class of model is not the same as an instance of a model. Once we have decided on our model class, there are still some options open to us. Depending on the model class we are working with, we might need to answer one or more questions like the following:

a. Would we like to fit for the offset (i.e., y-intercept)?
b. Would we like the model to be normalized?
c. Would we like to preprocess our features to add model flexibility?
d. What degree of regularization would we like to use in our model?
e. How many model components would we like to use?

These are examples of the important choices that must be made once the model class is selected. These choices are often represented as hyperparameters, or parameters that must be set before the model is fit to data. In Scikit-Learn, hyperparameters are chosen by passing values at model instantiation. We will explore how you can quantitatively motivate the choice of hyperparameters in Hyperparameters and Model Validation.

For our linear regression example, we can instantiate the LinearRegression class and specify that we would like to fit the intercept using the fit_intercept hyperparameter:

model = LinearRegression(fit_intercept=True)
model
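To make the class-versus-instance distinction concrete, here is a small sketch. The two variable names are illustrative; the point is that instantiation only records hyperparameter values and does not touch any data.

```python
from sklearn.linear_model import LinearRegression

# Two instances of the same model class with different hyperparameters.
with_intercept = LinearRegression(fit_intercept=True)
through_origin = LinearRegression(fit_intercept=False)

# Instantiation only stores the hyperparameter values...
print(with_intercept.fit_intercept, through_origin.fit_intercept)

# ...no data has been seen yet, so learned attributes do not exist.
print(hasattr(with_intercept, 'coef_'))
```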

3. Arrange data into a features matrix and target vector

Here our target variable y is already in the correct form (a length-n_samples array), but we need to massage the data x to make it a matrix of size [n_samples, n_features]. In this case, this amounts to a simple reshaping of the one-dimensional array.

X = x[:, np.newaxis]
X.shape
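The reshaping step can be written in more than one way. As a brief sketch on a small made-up array, indexing with `np.newaxis` and calling `reshape(-1, 1)` should produce the same single-column matrix:

```python
import numpy as np

x = np.arange(5, dtype=float)      # one-dimensional, shape (5,)

X_newaxis = x[:, np.newaxis]       # add a trailing axis, shape (5, 1)
X_reshape = x.reshape(-1, 1)       # same result; -1 lets NumPy infer n_samples

print(x.shape, X_newaxis.shape, X_reshape.shape)
print(np.array_equal(X_newaxis, X_reshape))
```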

4. Fit the model to your data

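Now it is time to apply our model to data. This can be done with the fit() method of the model. Fitting stores the learned parameters in attributes whose names end with an underscore, such as coef_ and intercept_: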

model.fit(X, y)

print(model.coef_)
print(model.intercept_)
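Because the data were generated as y = 2x - 1 plus noise, the fitted slope and intercept should land near 2 and -1. The sketch below repeats the fit in a self-contained form; the fixed seed of 42 is an assumption added here so that the run is repeatable, and it is not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: a fixed seed (42) so the run is repeatable.
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)      # true slope 2, true intercept -1

model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

slope = model.coef_[0]             # coef_ holds one value per feature
intercept = model.intercept_
print(slope, intercept)
```

The recovered values will not match the true ones exactly, since the noise term shifts the best-fit line slightly.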

5. Predict labels for unknown data

Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data that was not part of the training set. In Scikit-Learn, this can be done using the predict() method. For the sake of this example, our "new data" will be a grid of x values, and we will ask what y values the model predicts.

xfit = np.linspace(-1, 11)
Xfit = xfit[:, np.newaxis]
Xfit[:5]

yfit = model.predict(Xfit)
yfit

plt.scatter(x, y)
plt.plot(Xfit, yfit);
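For a linear model, predict() amounts to evaluating the fitted line at the new points. The following sketch checks this by hand; the fixed seed and the name `yfit_manual` are assumptions introduced for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)     # assumption: fixed seed for repeatability
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
model = LinearRegression().fit(x[:, np.newaxis], y)

xfit = np.linspace(-1, 11)         # 50 evenly spaced points by default
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)

# The same predictions written out by hand from the learned parameters.
yfit_manual = model.coef_[0] * xfit + model.intercept_
print(np.allclose(yfit, yfit_manual))
```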

Supervised learning: Iris classification

from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, train_size=100)

Xtest.shape

from sklearn.naive_bayes import GaussianNB  # 1. choose model class
model = GaussianNB()                        # 2. instantiate model
model.fit(Xtrain, ytrain)                   # 3. fit model to data
y_model = model.predict(Xtest)              # 4. predict on new data

from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
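The accuracy score is the fraction of test samples whose predicted label matches the true label. Here is a self-contained sketch of the same workflow, assuming the iris copy bundled with Scikit-Learn and a fixed `random_state` (both are choices made for this illustration), with the score also computed by hand:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# train_size=100 leaves the remaining 50 of the 150 samples for testing.
# Assumption: random_state=1 is fixed only to make the split repeatable.
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, train_size=100, random_state=1)

y_model = GaussianNB().fit(Xtrain, ytrain).predict(Xtest)

acc = accuracy_score(ytest, y_model)
acc_manual = np.mean(ytest == y_model)   # fraction of matching labels
print(Xtest.shape, acc, acc_manual)
```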

Feature Engineering

One of the most important steps in using machine learning in practice is feature engineering: taking whatever information we have about our problem and turning it into numbers that we can use to build our feature matrix.

Derived Features

Another useful type of feature is one that is mathematically derived from some input features. We can convert a linear regression into a polynomial regression not by changing the model, but by transforming the input. For example, the data below clearly cannot be well described by a straight line. We can still fit a line to the data using LinearRegression and get the best possible linear fit, but a more flexible model is needed to describe the relationship between x and y. One way to get it is to add polynomial features, here x^2 and x^3, as extra columns of the feature matrix and fit the same LinearRegression to the expanded input. This idea of improving a model not by changing the model, but by transforming the inputs, is fundamental to many of the more powerful machine learning methods.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 2, 1, 3, 7])
plt.scatter(x, y)

from sklearn.linear_model import LinearRegression
X = x[:, np.newaxis]
model = LinearRegression().fit(X, y)
yfit = model.predict(X)
plt.scatter(x, y)
plt.plot(x, yfit)

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(X)
X2

model = LinearRegression().fit(X2, y)
yfit = model.predict(X2)
plt.scatter(x, y)
plt.plot(x, yfit)
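To see what the derived features contain, and how much they help, the sketch below rebuilds the expanded matrix by hand and compares the training R^2 of the two fits. Chaining the steps with `make_pipeline` and comparing with `score()` are choices made for this illustration, not steps taken in the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 2, 1, 3, 7])
X = x[:, np.newaxis]

# Derived features: columns are x, x^2, x^3 (no constant column).
X2 = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
expected = np.column_stack([x, x**2, x**3])
print(np.array_equal(X2, expected))

# The transform and the linear model chained into one estimator.
poly_model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
poly_model.fit(X, y)

# R^2 on the training data: straight line vs. cubic features.
r2_line = LinearRegression().fit(X, y).score(X, y)
r2_cubic = poly_model.score(X, y)
print(r2_line, r2_cubic)
```

Both scores are measured on the training points, so they show only that the cubic features fit this data more closely, not how either model would generalize.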

Missing Data

Another common need in feature engineering is the handling of missing data. Often the NaN value is used to mark missing values. For example, we might have a dataset that looks like this:

from numpy import nan
X = np.array([[nan, 0,   3],
              [3,   7,   9],
              [3,   5,   2],
              [4,   nan, 6],
              [8,   8,   1]])

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2
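With strategy='mean', each missing entry is replaced by the mean of the observed values in its column. As a quick check, the sketch below compares the imputer's stored fill values with a direct `np.nanmean` over the same array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[np.nan, 0,      3],
              [3,      7,      9],
              [3,      5,      2],
              [4,      np.nan, 6],
              [8,      8,      1]])

imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)

# The fitted imputer stores the per-column fill values in statistics_.
print(imp.statistics_)

# They agree with the column means computed while ignoring NaN.
print(np.allclose(imp.statistics_, np.nanmean(X, axis=0)))

# No missing values remain after the transform.
print(np.isnan(X2).any())
```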
