Introduction to Machine Learning

Machine Learning Made Simple: A Guide for Beginners

Which NBA player handles the ball most like Luka Doncic? Ever wondered which tennis player’s serving style is most like Serena Williams’s? Which cheese has a flavor profile closest to Brie? Among the vast array of tropical forests, which one shares the most similarities with the Amazon? Which contemporary film director can be compared to the legendary Alfred Hitchcock in cinematic style? Is watermelon really similar in nutritional value to grapes and cherries? Is the world truly divided socio-economically from North to South? And for those captivated by the prose of Jane Austen, which modern author comes closest in narrative style? Which employee profile stays with a company the longest? Is it really hard to distinguish between a Pekingese and a Tibetan Spaniel? Which recent news story best describes the promise of AI?

In this two-hour course, we won’t answer all these questions, but we’ll explore approaches that might. With the right data representing objects of interest—like foods with nutritional values, tennis players with service statistics, or countries with socioeconomic indicators—these questions become easier to tackle. We can assess similarities, group similar objects, and interpret their characteristics. If the data is already grouped, we can even build models to classify new objects into these groups.

Finding groups is called clustering, or, in fancy machine learning lingo, unsupervised learning. Once the groups exist, classifying new objects into them is known as classification, or, for the machine learning aficionados, supervised learning.

Most of the training material is presented in the videos, but a concise tutorial is also available for download in a single document. In the videos, we use Orange, a free and powerful data mining toolbox. While you don't need Orange to complete this course, if you're interested, you can download it and explore a video series with training materials on how to use it.

The lessons below are a gentle introduction to classification and clustering or, if we wanted to sound more scientific :) (we don’t), to supervised and unsupervised machine learning.

The material in this course was developed by the Biolab group at the University of Ljubljana and is offered under the Creative Commons CC BY-NC-ND license.

Chapter 1: Data

Machine learning always starts with the data. In this lesson, we’ll explore different types of data that form the foundation of machine learning, from familiar tabular data like Excel spreadsheets to more complex data such as images and text. You’ll see examples of student grades, socio-economic data, employee records, and even images of dogs. While this lesson focuses on understanding and preparing data, later lessons will dive into how we use this data for tasks like grouping, identifying patterns, and making predictions.

It is now time to watch the video! Don’t worry—it’s not a long one, and if you’re new to computer science, machine learning, or AI, there’s no need to be concerned. This video, along with the entire lecture series, starts from the basics and requires no prior knowledge. We’ll guide you through everything step by step!

Here’s a list of key concepts we have covered in this lesson, along with brief descriptions:

  1. Data Instances (Examples/Cases): Rows in a dataset, representing the individual objects of interest (e.g., students, countries).
  2. Attributes (Features): Columns in a dataset, representing variables or characteristics of the objects (e.g., student grades, life expectancy).
  3. Meta-Features: Additional information, such as names or geographical positions, that provides context for the data but is not used in the development of machine learning models.
  4. Numerical Variables: Features expressed as numbers, such as grades or income.
  5. Categorical Variables: Features expressed as categories, such as department or travel frequency.
  6. Class Attribute: A special attribute in a dataset used to categorize or predict an outcome, such as employee attrition (whether they left the company).
  7. Tabular Data: Structured data arranged in rows and columns, like a spreadsheet.
  8. Unstructured Data: Data types like images and text that don’t follow a tabular format but can still be analyzed through machine learning.
  9. Data Representation: The process of converting data, such as images and text, into numerical forms suitable for machine learning.
  10. Foundation Models: Large, pre-trained machine learning models that can process complex data types, such as images or text, by converting them into numerical representations. They provide a starting point for tasks like classification or prediction and are a core component of modern AI.
  11. Distance and Similarity: Measures used to compare data instances. Distance quantifies how far apart two data points are, while similarity shows how alike they are. These concepts help in tasks like grouping similar data items or finding patterns.
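
To make the idea of distance concrete, here is a minimal sketch that computes the Euclidean distance between rows of a tiny grades table and finds the closest neighbor of one student. The student names, subjects, and grades are made up for illustration and are not taken from the course dataset.

```python
import math

# Hypothetical student grades: each entry is a data instance (row),
# each subject is a numerical feature (column).
grades = {
    "Ana": {"English": 9, "Math": 6, "History": 8},
    "Bill": {"English": 4, "Math": 9, "History": 5},
    "Cynthia": {"English": 8, "Math": 7, "History": 9},
}


def euclidean(a, b):
    """Euclidean distance between two instances that share the same features."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in a))


def nearest(name, table):
    """Return the name of the instance closest to `name`, excluding itself."""
    others = (n for n in table if n != name)
    return min(others, key=lambda n: euclidean(table[name], table[n]))


print(nearest("Ana", grades))
```

A small distance means two students have similar grades across all subjects, while a large distance means their grade profiles differ.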

Now, please complete the following quiz:

What is the most common format of data in machine learning? (1pt.)

In the student grades dataset, what do the rows represent? (1pt.)

What are the columns in a dataset called in machine learning? (1pt.)

What kind of data types are included in the employee dataset? (1pt.)

In the employee dataset, what is the class attribute used for? (1pt.)

What do you think happens to images before they can be used in machine learning? (1pt.)

What do you think is the first step to using text data in machine learning? (1pt.)

What makes image and text data different from tabular data in machine learning? (1pt.)

Chapter 2: Clustering

In this lesson, we dive into one of the most exciting and common tasks in machine learning: clustering. Whether you're new to machine learning or data science, clustering is a powerful way to discover hidden groups in any dataset, without needing any prior labels or categories. We'll start by finding similar data points, like students with similar grades, and then move on to more advanced techniques like hierarchical clustering, where we can visualize groups through dendrograms. You'll see how clustering works on everything from student grades to socioeconomic data of countries. We also introduce t-SNE, a visualization tool that lets us explore clusters in a two-dimensional map, making it easier to see patterns in complex data. This lesson shows how clustering can unlock insights from your data without needing deep technical knowledge. No prior experience with machine learning is required—just curiosity to explore the patterns hiding in your data!

Now it’s time to dive into the video! Don’t worry if this is your first time hearing about clustering or machine learning—this video, like the whole series, starts from the ground up. We’ll walk you through the concepts in simple steps, no prior knowledge needed!

Here’s a list of key concepts from the second lecture on clustering, along with brief descriptions:

  1. Clustering: A machine learning technique used to group data instances based on their similarities without predefined labels.
  2. Neighbors: Data instances that are most similar to a selected instance, often measured by a distance metric like Euclidean distance.
  3. Hierarchical Clustering: A clustering method that groups data based on distances between data instances and represents the resulting groups visually using a tree-like diagram called a dendrogram.
  4. Dendrogram: A visual representation of the results of hierarchical clustering, showing how data instances are grouped together at various levels.
  5. Algorithm: A step-by-step procedure or set of rules followed by a computer to perform a task, solve a problem, or make decisions based on input data.
  6. Hierarchical Clustering Algorithm: A type of algorithm used in clustering that builds a hierarchy of clusters by either merging smaller clusters into larger ones (agglomerative) or dividing larger clusters into smaller ones (divisive).
  7. Unsupervised Learning: A type of machine learning where the algorithm discovers patterns in data without predefined outcomes or labels.
  8. Distance Metric: A method, such as Euclidean distance, used to measure how similar or different two data instances are from each other.
  9. High-Dimensional Data: Data with a large number of features or variables, often making it challenging to analyze and visualize due to its complexity. This type of data is also common in image or text analysis, where each data point can have hundreds or thousands of dimensions.
  10. t-SNE: A dimensionality reduction technique that maps high-dimensional data into a two-dimensional plane for easier visualization. This visualization can also be used for cluster discovery.
  11. Cluster Exploration: The process of analyzing clusters to understand the characteristics that differentiate them from each other.
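
The agglomerative variant of hierarchical clustering can be sketched in a few lines. The example below is only an illustration under stated assumptions: the points are invented, and the distance between two clusters is taken to be the distance between their closest members (single linkage), which is one of several possible choices and not necessarily the one used in the video.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2, points):
    """Distance between two clusters: the closest pair of their members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge
    history, which is the information a dendrogram draws.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical two-feature data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 1.2), (8.0, 8.0), (8.3, 7.9), (1.2, 0.8)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
```

Each entry in `merges` records which two clusters were joined and at what distance. A dendrogram displays exactly this sequence, with the merge distance on one axis.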

Some datasets have many features, sometimes hundreds or thousands. This is called high-dimensional data. It can be challenging to work with because, as the number of features increases, it becomes harder to measure similarity between data points. This is sometimes called the "curse of dimensionality." Techniques like t-SNE help by reducing the data to just two dimensions, making it easier to see patterns and groups.
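
One way to see this effect is to measure how much farther the farthest point is than the nearest one as the number of features grows. The sketch below uses uniformly random data, which is an assumption made purely for illustration; real datasets have structure and usually behave less extremely.

```python
import math
import random


def distance_contrast(n_features, n_points=200, seed=0):
    """Relative gap between the farthest and nearest point from a random query.

    Values near 0 mean that all points look about equally far away,
    so distance carries little information about similarity.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_features)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_features)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


for dims in (2, 10, 100, 1000):
    print(dims, round(distance_contrast(dims), 2))
```

With only a few features the nearest point is much closer than the farthest one. With many features the contrast shrinks, which is why "nearest" becomes less meaningful in high-dimensional data.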

Now, please complete the following quiz:

What is the primary goal of clustering in machine learning? (1pt.)

What does finding 'neighbors' in a dataset mean? (1pt.)

What is hierarchical clustering? (1pt.)

What is a dendrogram used for in clustering? (1pt.)

Which distance metric was used in the lecture for clustering? (1pt.)

What does t-SNE do? (1pt.)

Why is clustering considered unsupervised learning? (1pt.)

What is cluster exploration? (1pt.)

Chapter 3: Classification

In this lesson, we dive into predictive modeling, a key technique in supervised learning. Using employee data, we’ll show how models like Naive Bayes and logistic regression can help predict which employees are likely to leave a company. This could, for instance, empower HR to take action. You'll also learn how to evaluate the accuracy of these models using cross-validation, and why different models might be better suited for different tasks. This video will help you understand the power of machine learning predictions, even if you're just starting to explore the field!

Time for the video! Don’t worry if predictive modeling sounds complicated—it’s simpler than you think. In this video, we’ll walk you through how machine learning models make predictions and how we can use them to help with real-world decisions, like predicting which employees will leave the company and which will stay. Whether you’re new to machine learning or just curious, this video breaks it all down step by step!

Here are the essential concepts covered in the third lecture on supervised learning, with explanations to help you understand the core ideas:

  1. Supervised Learning: A machine learning approach where models are trained on labeled data to make predictions based on input features.
  2. Predictive Modeling: The process of building models to predict outcomes, such as which employees are likely to leave a company.
  3. Class: The target variable in classification tasks, such as predicting whether an employee will leave (class: "left" or "stayed").
  4. Naive Bayes Classifier: A simple supervised learning model used for classification tasks, particularly effective for categorical data.
  5. Logistic Regression: A commonly used supervised learning model for binary classification problems, like predicting employee attrition (leave/stay).
  6. Cross-Validation: A method for estimating model accuracy by splitting the data into several parts, training the model on all but one part, testing it on the remaining part, and repeating until every part has served as the test set.
  7. Prediction: The output of a machine learning model, providing an estimate of the likelihood of a certain outcome based on input data.
  8. Classification Accuracy: A measure of how often the model makes correct predictions, typically expressed as a percentage.
  9. Confusion Matrix: A tool used to compare predicted outcomes with actual outcomes, helping to assess model performance.
  10. Training Data Set: The portion of the data used to train the machine learning model.
  11. Test Data Set: The portion of the data used to evaluate the performance of the trained model, typically not seen by the model during training.
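
The following sketch ties several of these concepts together: it trains a very small Naive Bayes classifier, evaluates it with cross-validation, and reports classification accuracy and a confusion matrix. Everything here is an assumption for illustration: the employee records are invented, the function names are hypothetical, and the smoothing and fold assignment are simple choices rather than what Orange does internally.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(model, row):
    """Pick the class with the highest smoothed probability score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            # Add-one smoothing; the 2 assumes two possible values per
            # feature, which holds for this toy data only.
            score *= (value_counts[(i, label)][value] + 1) / (count + 2)
        if score > best_score:
            best_class, best_score = label, score
    return best_class


def cross_validate(rows, labels, k):
    """k-fold cross-validation: each instance is tested exactly once."""
    confusion = Counter()  # (actual, predicted) -> count
    for fold in range(k):
        test_idx = set(range(fold, len(rows), k))
        train_rows = [r for i, r in enumerate(rows) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_naive_bayes(train_rows, train_labels)
        for i in test_idx:
            confusion[(labels[i], predict(model, rows[i]))] += 1
    correct = sum(n for (actual, pred), n in confusion.items() if actual == pred)
    return correct / len(rows), confusion


# Hypothetical employee records: (works overtime, travel frequency).
rows = [
    ("yes", "frequent"), ("yes", "rare"), ("no", "rare"), ("no", "rare"),
    ("yes", "frequent"), ("no", "frequent"), ("no", "rare"), ("yes", "frequent"),
]
labels = ["left", "left", "stayed", "stayed", "left", "stayed", "stayed", "left"]

accuracy, confusion = cross_validate(rows, labels, k=4)
print("accuracy:", accuracy)
for (actual, predicted), n in sorted(confusion.items()):
    print(f"actual={actual} predicted={predicted}: {n}")
```

The model is always tested on rows it did not see during training, so the accuracy is a fairer estimate of how it would perform on new employees than accuracy measured on the training data.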

Now, it's time to take the quiz:

What is supervised learning? (1pt.)

What is predictive modeling used for? (1pt.)

What is a 'class' in a classification task? (1pt.)

Which model was used to predict employee attrition in the video? (1pt.)

What is logistic regression commonly used for? (1pt.)

What is the purpose of cross-validation? (1pt.)

What does an accuracy of 79% mean in a predictive model? (1pt.)

What does a confusion matrix help with? (1pt.)

What is the main task of a supervised learning model? (1pt.)

Why, in the video, might HR choose Naive Bayes over logistic regression for predicting employee attrition? (1pt.)

Chapter 4: Images and Text

In this lesson, we explore how machine learning works with images and text. Starting with dog breeds, we’ll see how deep learning models, like convolutional neural networks, can turn images into numbers, a process called embedding. Once converted, these images can be clustered or classified just like tabular data. We then move on to text, using language representation models to transform articles into numerical representations, allowing us to map and analyze them. Whether it’s identifying dog breeds, analyzing news articles, or working with medical images, this video demonstrates how the same machine learning techniques apply to unstructured data like images and text.

Ready to see how machine learning works with images and text? In this video, you’ll discover how we transform visual and written data into numerical formats that machine learning models can understand. Whether it’s identifying patterns in images or clustering articles, this video introduces practical examples that make the process easy to follow, even if you’re just starting out!

Here are the key concepts covered in the fourth video on machine learning with images and text, along with brief explanations to help you grasp the fundamental ideas:

  1. Embedding: The process of converting images or text into numerical data, that is, into vectors in a vector space that machine learning models can analyze.
  2. Vector Space: A mathematical space where data is represented as vectors, allowing machine learning models to compute similarities and distances between them.
  3. Convolutional Neural Networks (CNN): A type of deep learning model used for processing and classifying image data.
  4. Neural Networks: A family of machine learning models inspired by the human brain, capable of learning patterns from data, especially useful for complex data types like images and text.
  5. Clustering with Images: Grouping similar images based on their numerical representations, often using distance metrics.
  6. t-SNE for Images: A technique that reduces high-dimensional data (like image embeddings) to two dimensions for easier visualization.
  7. Text Embedding: The conversion of text data into numerical vectors using models like sBERT, allowing machine learning algorithms to process and analyze text.
  8. Nearest Neighbor Search: A method used to find similar data points (e.g., articles) based on their numerical representations in vector space.
  9. Foundation Models: Pre-trained models developed to handle various types of data, such as images, text, sounds, and more, by embedding them into numerical formats.
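
To illustrate how embedding and nearest neighbor search fit together, the sketch below turns short texts into vectors and finds the article most similar to a query. It deliberately uses simple word counts as a stand-in for a learned model such as sBERT, so it only matches exact words. The articles, the query, and the function names are all invented for this example.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy embedding: one word count per vocabulary entry (not a learned model)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Similarity of two vectors based on the angle between them."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical article headlines.
articles = [
    "new model learns to recognize dog breeds from images",
    "central bank raises interest rates again",
    "team wins the basketball final in overtime",
]
vocabulary = sorted({word for article in articles for word in article.split()})
vectors = [embed(article, vocabulary) for article in articles]


def nearest_article(query):
    """Return the article whose vector is most similar to the query's vector."""
    q = embed(query, vocabulary)
    scores = [cosine_similarity(q, v) for v in vectors]
    return articles[scores.index(max(scores))]


print(nearest_article("images of dog breeds"))
```

A foundation model plays the role of `embed` here, but its vectors capture meaning rather than exact wording, so texts about the same topic end up close together even when they share no words.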

Here's the quiz:

What does 'embedding' refer to in machine learning? (1pt.)

What is a Convolutional Neural Network (CNN) commonly used for? (1pt.)

What is a neural network? (1pt.)

What is a vector space in the context of machine learning? (1pt.)

What is the purpose of clustering images? (1pt.)

How is t-SNE used with images? (1pt.)

What is text embedding? (1pt.)

What does t-SNE do with text embeddings? (1pt.)

What is nearest neighbor search used for in text analysis? (1pt.)

What are foundation models used for in machine learning? (1pt.)

Chapter 5: Looking Ahead

Congratulations on completing the course! You’ve taken a big step in understanding the foundational concepts of machine learning, from handling data, clustering, and predictive modeling to working with images and text. Machine learning offers limitless possibilities, and now you have the tools to explore them. Whether you're looking to apply these techniques to real-world problems or dive deeper into more advanced topics, the journey has just begun. Keep experimenting, keep learning, and remember—the world of machine learning is full of exciting opportunities waiting for you to discover!