Get started with ML: How to train your own ML model

Safina Ma | 13/12/2024

Want to take the next step into machine learning? This blog, written by the AI experts of DataNorth, will help you set up your first machine learning model.

Fundamentals of Machine Learning

Before we set up our machine learning model, we first need to understand the fundamentals of machine learning. There are three important types of machine learning: supervised learning, unsupervised learning and reinforcement learning.

  • Supervised learning: uses labeled datasets to train algorithms.
  • Unsupervised learning: uses unlabeled datasets to train algorithms.
  • Reinforcement learning: learns by itself through trial and error, using the results of its actions to improve over time.

In this blog we will talk about supervised and unsupervised learning. Reinforcement learning uses more technical methods and algorithms, which would be the next step in your machine learning journey.

Setting up your Machine Learning Model

Installation

Before training the model you need to install some tools on your computer to run Python code and to set up your first machine learning model.

  1. Python: First, you’ll need to install Python. It’s the primary programming language for machine learning, and you can download it from python.org.
  2. IDE/Editor: You’ll also need an Integrated Development Environment (IDE) or a code editor to write your Python scripts. Popular choices include:
    • Jupyter Notebook: Ideal for experimenting with code in an interactive, notebook-style environment. You can install it via Anaconda or using the command pip install notebook.
    • VS Code: A powerful code editor with great support for Python. Download it from the official Visual Studio Code website and add the Python extension.
  3. Libraries: To run the machine learning models in this blog, you’ll need the following Python libraries:
    • NumPy: For handling arrays and mathematical operations. Install it using pip install numpy.
    • Scikit-learn: A machine learning library that includes algorithms like Linear Regression and KMeans. Install it with pip install scikit-learn.
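A quick way to check the installation is a short script that imports both libraries and prints their versions. This is a minimal sketch; the version numbers you see will depend on what pip installed.

```python
import numpy as np
import sklearn

# If both imports succeed, the libraries are installed correctly.
print("NumPy version:", np.__version__)
print("scikit-learn version:", sklearn.__version__)
```

If either import fails with a ModuleNotFoundError, run the matching pip install command again in the same Python environment.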

Once you have these tools installed, the next steps are to collect and prepare your data and then choose the machine learning algorithm you want to use.

Collecting your data

When starting your own machine learning model, the first thing you need to do is collect your data. This can be your own dataset, or a dataset from a platform like Kaggle, which provides datasets that you can use to build AI models.

Prepare your data

When you have your dataset, you have to decide whether its usability is high enough for your machine learning model. By usability we mean that the dataset has enough good-quality data to work from. If there is missing data, incorrect data or outliers, you first have to clean the dataset. You can follow the steps below to prepare the data before using it to train your model.

Key steps for Data Preparation

  • Data cleaning: remove duplicates and fix or remove missing and incorrect values to avoid biased results.
  • Feature engineering: create new features from existing data and encode categorical variables into numerical formats.
  • Normalization and scaling: normalize numerical features to ensure a similar scale using techniques like min-max scaling or z-score normalization.
  • Outlier management: identify outliers and decide how to handle them based on their impact on your model.
  • Data splitting: divide your dataset into training and test sets.
  • Feature selection: select relevant features to improve model performance and reduce overfitting.
  • Data validation: conduct checks to ensure the cleaning process hasn’t introduced biases or errors.
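A few of the steps above can be sketched in code. The example below is only an illustration: it assumes Pandas is installed (pip install pandas, which is not part of the installation list earlier in this blog), and the dataset, column names and split size are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy dataset: the column names and values are invented.
df = pd.DataFrame({
    "size_m2": [50, 80, 80, 120, 65, 200],
    "city": ["A", "B", "B", "A", "C", "B"],
    "price": [150, 240, 240, 330, 190, 610],
})

# Data cleaning: drop exact duplicate rows (the second and third rows match).
df = df.drop_duplicates().reset_index(drop=True)

# Feature engineering: one-hot encode the categorical "city" column.
df = pd.get_dummies(df, columns=["city"], dtype=float)

# Separate the input features (X) from the target (y).
X = df.drop(columns=["price"])
y = df["price"]

# Data splitting: hold out part of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Scaling: fit min-max scaling on the training set only,
# then apply the same transformation to the test set.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training set only is a deliberate choice here: it keeps information from the test set out of the training process.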

Where to Put the Dataset

Organize your files by saving your dataset in a folder on your computer, ideally in the directory where all your project files are stored. To load your dataset into a Python script, you can use a library like Pandas (install it with pip install pandas).
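As a rough sketch, loading a CSV file with Pandas could look like this. The folder, file name and column names are hypothetical, and the script writes a tiny example file first so that it runs on its own; with a real dataset you would only need the read_csv call.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical project folder and file name; replace these with your own.
project_dir = Path(tempfile.mkdtemp())
csv_path = project_dir / "my_dataset.csv"

# Write a tiny example file so this sketch is self-contained.
csv_path.write_text("feature_1,feature_2,target\n1,1,6\n1,2,8\n2,2,9\n")

# Load the CSV file into a DataFrame and inspect it.
df = pd.read_csv(csv_path)
print(df.head())
print(df.shape)
```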

Choosing a Model

Now that your dataset is ready, the next step is choosing a model. This depends on several factors, including the type of problem you are solving (classification, regression, clustering, etc.). 

Let’s break it down. At the beginning of this blog we talked about the types of machine learning: supervised learning, unsupervised learning and reinforcement learning. Let’s match these types with methods and algorithms to make it easier to choose your model.

Supervised learning approaches

Supervised learning involves training a model using labeled data, and it can be categorized into two main approaches: classification and regression. Classification focuses on assigning data points to predefined labels or classes, such as identifying whether an email is spam or not. In contrast, regression is used to predict continuous numerical values, such as forecasting house prices or stock market trends.

Key differences
Feature | Classification | Regression
Output | Discrete categories (e.g., 0, 1, “A”, “B”) | Continuous values (e.g., 3.5, 98.7)
Purpose | Classify into categories | Predict numerical output
Use case | Fraud detection, disease diagnosis | Price prediction, stock forecasting
Variable | Categorical | Continuous

Classification

This is a machine learning method that tries to predict the correct label for a given input. As an example we take two classification algorithms: Naive Bayes and Support Vector Machine. These algorithms build a model from the training dataset before making any predictions on new data. Algorithms to use with this method:

  • Naive Bayes
  • Support Vector Machine
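To make this concrete, here is a minimal sketch that trains both algorithms with scikit-learn. It assumes the small iris flower dataset that ships with scikit-learn, the GaussianNB variant of Naive Bayes and default settings for both models; none of these choices come from a specific project.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load a small labeled dataset: flower measurements (X) and species labels (y).
X, y = load_iris(return_X_y=True)

# Keep 30% of the rows aside to test the models on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train each classifier and record its accuracy on the test set.
accuracies = {}
for name, model in [("Naive Bayes", GaussianNB()),
                    ("Support Vector Machine", SVC())]:
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {accuracies[name]:.2f}")
```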

Regression

This is a machine learning method used to analyze the relationship between a dependent variable (target variable) and one or more independent variables (predictor variables). The goal is to find the function that best describes the connection between them. Algorithms to use for this method:

  • Linear regression
  • Decision tree (in its regression form)

Note that logistic regression, despite its name, predicts categories and is therefore a classification algorithm.

Unsupervised learning

Unsupervised learning is a type of machine learning that works with unlabeled data, discovering hidden patterns or structures within it. Two primary approaches are clustering and dimensionality reduction. Clustering aims to group similar data points into clusters based on shared patterns or features, such as segmenting customers by purchasing behavior. Dimensionality reduction simplifies datasets by reducing the number of features while preserving as much valuable information as possible, often used for visualizing high-dimensional data or improving model efficiency.

Key differences
Feature | Clustering | Dimensionality reduction
Output | Cluster labels for data points (e.g., “Cluster 1”, “Cluster 2”) | A dataset with fewer dimensions, typically in numerical form
Purpose | Grouping similar data points into clusters | Reducing data complexity while preserving key information
Use case | Customer segmentation, pattern recognition, anomaly detection | Data visualization, improving model performance
Variable | Uses original variables to form clusters | Transforms variables into new combinations or a lower-dimensional space

Clustering

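This is a machine learning method that groups unlabeled data points based on their similarities to each other. Features can be numeric or categorical, although distance-based algorithms such as K-means expect numeric input, so categorical features need to be encoded first. Algorithms to use with this method: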

  • K-means
  • Mean shift

Dimensionality reduction

This is a machine learning method for representing a given dataset using a lower number of features. It removes or combines features that carry little information, so that a model can work with fewer variables. Algorithms to use with this method:

  • Principal component analysis
  • Feature selection
  • Linear discriminant analysis
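As an illustration of the first algorithm in this list, the sketch below uses principal component analysis to reduce a dataset from four features to two. The iris dataset and the choice of two components are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load only the features; the labels are not needed for PCA.
X, _ = load_iris(return_X_y=True)

# Reduce the four original features to two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)

# Share of the original variance that each component keeps.
print(pca.explained_variance_ratio_)
```

The explained variance ratio gives a rough idea of how much information is preserved after the reduction.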

Training the Model

In this section we will go through the steps of training your own machine learning model, using the two machine learning types discussed earlier: supervised and unsupervised learning. Now we will focus on the technical aspects of setting up the machine learning model.

Supervised learning

The machine learning algorithm is trained on a labeled dataset. This means that for each example in the training dataset, the algorithm knows what the correct output is. The algorithm uses that knowledge to try to generalize to new examples that it has never seen before. Using labeled inputs and outputs, the model can measure its accuracy and learn over time.

For this example we will train a linear regression model together. We will use an example from sklearn (scikit-learn) to walk through the steps of setting up a linear regression.

Step 1: Open your IDE or code editor

To get started open your preferred IDE or code editor, such as Jupyter Notebook or VS Code. 

Step 2: Importing Libraries

Import a library (NumPy) for working with numbers and arrays (tables of numbers).

import numpy as np

Then bring in a tool called LinearRegression to find relationships between numbers.

from sklearn.linear_model import LinearRegression

Step 3: Creating Data (X and y)

What is X?

X is a table where each row represents a set of input numbers. Each input has two numbers (columns). For example, the first row [1, 1] means the first input is 1, and the second input is also 1.

What is y?

y is a set of output numbers that are calculated using a specific rule. Think of it as the “answers” that correspond to each row in X.

For each row in X, follow this rule: take the first number and multiply it by 1, then take the second number and multiply it by 2. Add these two results together, and finally, add 3 to get the total.
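The data for this step can be created as follows. The four rows of X are assumed example values (only the first row, [1, 1], is given in the text above); y is then computed with the rule just described.

```python
import numpy as np

# Each row is one input with two numbers (columns); example values.
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])

# Rule: y = 1 * first number + 2 * second number + 3
y = np.dot(X, np.array([1, 2])) + 3

print(X)
print(y)  # [ 6  8  9 11]
```

For the first row this gives (1×1) + (2×1) + 3 = 6.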

Step 4: Training the Model

reg = LinearRegression().fit(X, y): Trains the model to learn the relationship between X and y using Linear Regression.

Step 5: Checking the Fit

reg.score(X, y): Checks how well the model learned the relationship by returning the R² score (coefficient of determination); a score of 1.0 means the model fits the data perfectly.

Step 6: Finding the Relationship

reg.coef_: Shows the coefficients (weights) the model learned for each column in X. The result is [1., 2.], meaning the model multiplies the first column by 1 and the second by 2.

reg.intercept_: Shows the constant (starting point) the model learned. In this case the result is approximately 3.0.

Step 7: Making Predictions

reg.predict(np.array([[3, 5]])): Predicts the output for the new input [3, 5]. The formula used is (1×3) + (2×5) + 3 = 16, which results in [16.].

When you plot a simple linear regression, it looks like a straight line running through a set of points, where each point represents a relationship between two things. The goal is to draw the straight line that best fits those points. The line then helps you predict the value for data you have not seen yet.
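Put together, steps 2 to 7 can be run as one short script. This is a sketch using assumed example values for X; the results in the comments are approximate because of floating-point rounding.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 3: create the data. y follows the rule y = 1*x0 + 2*x1 + 3.
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3

# Step 4: train the model.
reg = LinearRegression().fit(X, y)

# Step 5: check the fit (R² score, 1.0 is a perfect fit).
r2 = reg.score(X, y)
print("R² score:", r2)

# Step 6: inspect the learned relationship.
print("Coefficients:", reg.coef_)    # approximately [1. 2.]
print("Intercept:", reg.intercept_)  # approximately 3.0

# Step 7: predict the output for a new input.
prediction = reg.predict(np.array([[3, 5]]))
print("Prediction:", prediction)     # approximately [16.]
```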

Unsupervised learning

This machine learning algorithm is trained on an unlabeled dataset. Because of that, the algorithm has to discover hidden patterns in the data without being told the correct answers by a human.

For this example we will train a k-means model together. Let’s go to sklearn again for an example and walk through the steps of setting up a k-means model.

Step 1: Open your IDE or code editor

To get started open your preferred IDE or code editor, such as Jupyter Notebook or VS Code. 

Step 2: Importing Libraries

Import a library (NumPy) for working with numbers and arrays (tables of numbers).

import numpy as np

Then bring in a tool called KMeans from scikit-learn’s cluster module.

from sklearn.cluster import KMeans

KMeans is used for grouping data points into clusters based on their similarities. It helps identify patterns by grouping similar data together.

Step 3: Creating Data (X)

First, we create some data that we want to group together. X is the data we are going to analyze: it holds the points we want to cluster (group) based on their similarities. We don’t need y in KMeans because it doesn’t use labeled data (it’s unsupervised learning). For KMeans, we just focus on the X data.
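For illustration, X could be created like this. The six points are assumed example values, chosen so that three lie close together on the left and three on the right.

```python
import numpy as np

# Six points with two coordinates each; there is no y because the data is unlabeled.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

print(X.shape)  # (6, 2)
```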

Step 4: Training the Model

Now we train the model. With KMeans, you tell the algorithm how many groups (clusters) to divide the data into, and it works out which points belong together. For example, if you ask KMeans for 2 clusters, it will find two sets of points that are close to each other.

Step 5: Checking the Clusters

Once the model is trained, we can check how it grouped the data by looking at which points belong to which group. KMeans gives a “label” to each point, stored in kmeans.labels_, which shows the group the point belongs to. For example, KMeans might say the first three points belong to one group and the last three belong to another.

Step 6: Making Predictions

Finally, you can ask the model to predict which group a new point will belong to by calling kmeans.predict.

When plotted, a KMeans clustering graph shows data points grouped into clusters, each represented by a different color. The algorithm assigns each point to the cluster whose center (centroid) is nearest, where the centroid is the average position of all points in that group. The clusters help identify patterns in the data based on how close the points are to each other.
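The KMeans steps can be combined into one short script. This is a sketch with assumed example points; n_clusters=2, n_init=10 and random_state=0 are choices made for the example. Which group is numbered 0 and which is numbered 1 can differ between runs and versions, so the script does not rely on specific label numbers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Step 3: create the data (six unlabeled points).
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Step 4: train the model and ask for 2 clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 5: check which cluster each point was assigned to.
labels = kmeans.labels_
print("Labels:", labels)

# The centroids are the average positions of the points in each cluster.
centers = kmeans.cluster_centers_
print("Centroids:", centers)

# Step 6: predict the cluster for two new points.
new_labels = kmeans.predict(np.array([[0, 0], [12, 3]]))
print("New points:", new_labels)
```

The first three points share one label and the last three share the other; the new point [0, 0] joins the group on the left and [12, 3] the group on the right.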

Conclusion

In this blog we discussed the fundamentals of machine learning and gave you examples of how to set up your first machine learning model. Whether you are using supervised learning like linear regression for labeled data or unsupervised learning like k-means clustering for unlabeled data, the process involves understanding the problem, preparing the data and choosing the right model. By following these steps you can start building your own machine learning model. Overall, it’s important to experiment with different models and see which one works best for your data.

If you are ready to take the next step in your machine learning journey but want more expertise, we at DataNorth are here to help. Our AI engineers hold bachelor’s and master’s degrees in this field and can advise you throughout your machine learning process. Don’t hesitate to contact us.