
Data Mining 101

Elective Course for PhD Students at University of Ljubljana

This course is an introduction to data science for non-computer scientists. It covers data preparation, clustering, regression and classification, model evaluation, and embedding of unstructured data.

Type of course: Lectures + Homework Assignments

Course Code: B2126 (UL MF) | BZ1718 (UL BTF)

ECTS: 5

Course name in Slovenian: Uvod v podatkovno rudarjenje (UL MF) | Uvod v znanost v podatkih (UL BTF) | Odkrivanje znanj iz podatkov (UL Statistika)

Semester: Fall 2024 (November and December)

Location for lectures: Lecture room P3, Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, Ljubljana

Time of the Lectures: The lectures will take place in November and December, between 5:15 PM and 7:00 PM, on the following days:

  • Monday, November 11, 2024 (Demšar)
  • Monday, November 18, 2024 (Demšar)
  • Monday, November 25, 2024 (Demšar)
  • Wednesday, November 27, 2024 (Zupan)
  • Wednesday, December 4, 2024 (Zupan)

Prerequisites: No prior knowledge of the topics is assumed. This course will not use computer programming, and no prior statistics or data science knowledge is required.

Language: All course materials and lectures will be conducted in English.

Course Content

This is a data science course intended for beginners and non-computer science students. We particularly encourage students from social sciences, humanities, natural sciences, engineering, and arts to enroll. No prior knowledge of statistics, computer science, or math is required. The course has a gentle learning curve, with additional video material and lecture notes available for all students. The course covers the following state-of-the-art topics:

Lecture 1: Data Profiling and Introduction to Orange

Lecturer: Janez Demšar

  • Data preparation: features, data instances and attribute-based data profiling.
  • Visual programming and exploratory data analysis in [Orange](http://orangedatamining.com). Selected data visualization techniques.
  • Case studies: Slovenian surnames, socio-economic data analysis, geo-weather patterns.

Lecture 2: Data Clustering

Lecturer: Janez Demšar

  • Distances and similarities.
  • Hierarchical clustering (distances and linkages) and k-means clustering.
  • Visualization of clustering results. Explanation of clusters.
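
The course itself requires no programming; all of the above is done visually in Orange. For readers who are curious about what happens under the hood, the following is a minimal sketch of Euclidean distance and k-means in plain Python. It is not Orange's implementation. The function names, the toy points, and the choice to use the first k points as starting centroids are assumptions made purely for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iter=100):
    """Plain k-means. Assumption: the first k points serve as the
    starting centroids, which keeps the example reproducible."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return labels, centroids


# Invented toy data: two visibly separate groups of points.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Each point ends up with the label of the closest centroid, so the points near (1, 1) share one label and the points near (5, 5) share the other.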

Lecture 3: Data Visualization & Statistics

Lecturer: Janez Demšar

  • Discover essential data visualization methods and statistical tools for analyzing data.
  • Learn about p-value dredging and its implications in data science.
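
Why p-value dredging is a problem can be illustrated with a short calculation. The sketch below assumes that all tests are independent and that no real effect exists in any of them; the function name is hypothetical. Under those assumptions, the chance of at least one "significant" result among m tests at threshold alpha is 1 - (1 - alpha)^m.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    falls below `alpha` even though no real effect exists."""
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 5, 20, 100):
    print(m, round(chance_of_false_positive(m), 3))
```

With 20 such tests the chance is already roughly 64%, which is why a single small p-value found after trying many comparisons is weak evidence.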

Lecture 4: Projections and Dimensionality Reduction

Lecturer: Blaž Zupan

  • Dimensionality reduction techniques, including PCA, t-SNE, and MDS, for simplifying and visualizing multivariate datasets.
  • Learn how to interpret projections for meaningful insights.
  • Case studies: analyzing human resource data, visualizing animal kingdom data, exploring large molecular biology datasets.
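
In the course, projections are computed with Orange widgets rather than code. As an illustration of the idea behind PCA only, here is a minimal plain-Python sketch that finds the first principal component by power iteration on the covariance matrix. The function name, the iteration count, and the toy data are assumptions; production tools use more robust numerical methods.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the direction of greatest variance and the projection
    of every row onto that direction."""
    n, d = len(rows), len(rows[0])
    # Center the data: subtract each column's mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # One-dimensional coordinates of each row along the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Invented toy data lying close to the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

Because the toy points lie close to a line with slope 2, the second entry of the recovered direction is about twice the first, and a single coordinate per point (its score) preserves almost all of the variation.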

Lecture 5: Explainable AI with Trees & Nomograms

Lecturer: Blaž Zupan

  • Introduction to classification.
  • Inference from classification trees, why they are great for explanations, and why they do not help us much.
  • Another classifier: the naive Bayesian model. Why it is more stable and why its nomograms can give us great insights into data.
  • Case studies: passengers on the Titanic, iris flower dataset, explaining clusters with the nomograms.
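
The lecture builds these models in Orange without any code. For the curious, the naive Bayesian classifier can be sketched in a few lines of plain Python. Everything below is an assumption for illustration: the function names, the use of Laplace smoothing, and the toy table, which is invented and only loosely styled after the Titanic case study (it is not real passenger data).

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and, per class, feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)            # feature index -> observed values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            values[j].add(value)
    return class_counts, value_counts, values


def log_scores(model, row):
    """Log prior plus one additive log term per feature, for each class."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)
        for j, value in enumerate(row):
            # Laplace smoothing avoids zero probabilities for unseen pairs.
            p = (value_counts[(j, label)][value] + 1) / (count + len(values[j]))
            score += math.log(p)
        scores[label] = score
    return scores


def predict(model, row):
    scores = log_scores(model, row)
    return max(scores, key=scores.get)


# Invented toy table: (sex, travel class) -> survived.
rows = [
    ("female", "first"), ("female", "second"), ("female", "third"),
    ("male", "first"), ("male", "third"), ("male", "third"),
    ("male", "second"), ("female", "first"),
]
labels = ["yes", "yes", "no", "no", "no", "no", "no", "yes"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("female", "first")))
print(predict(model, ("male", "third")))
```

Each feature contributes its own additive term to a class score, independently of the other features. That additive structure is what makes it possible to display the model feature by feature, as a nomogram does.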

Lecturers

Prof. Dr. Blaž Zupan teaches artificial intelligence and machine learning at the University of Ljubljana and Baylor College of Medicine. His research has focused on explainable AI and combinations of machine learning and data visualization techniques. He runs a twenty-member bioinformatics laboratory, which also develops Orange, a comprehensive open-source toolbox for machine learning.

Prof. Dr. Janez Demšar researches machine learning and data mining, with an emphasis on data visualization. He spends most of his time programming Orange, an open-source, component-based toolbox for machine learning and data mining. He also teaches courses in programming and in didactics of computer science.

Both Demšar and Zupan have received best teacher awards at the Faculty of Computer Science, with Demšar receiving the award every year (except one :) ) since the students introduced it about 20 years ago. They have jointly received the Slovenian "Puh" innovation award for leading the development of Orange Data Mining, the software that will be used in the course.

Software Tool

In the course, we will be using Orange Data Mining, a free, popular open-source tool designed for data visualization and analysis in machine learning and artificial intelligence. Renowned for its ease of use and user-friendly interface, Orange employs a visual programming approach that allows users to create data analysis workflows through an intuitive drag-and-drop system. This makes it especially appealing for newcomers, as it offers a gentle learning curve while still providing robust capabilities for more advanced users. Orange's modular design and comprehensive library of widgets enable users to perform complex data manipulations, statistical analyses, and predictive modeling without needing extensive programming knowledge.

The screenshot below shows Orange in action: in the course, we will learn how to construct the workflows of components that read and process the data, build and evaluate predictive models, and visualize the data and results.

Enrollment Information

This is an elective course offered to all students at the University of Ljubljana. Students need to enroll at their own Faculty, which will then send their enrollment information to the Faculty of Computer and Information Science.

Course Materials

All course materials will be provided to enrolled students at the start of the course and will be available on the course's homepage (Moodle). The materials include lecture notes, short optional educational videos, and quizzes.

Homework Assignments and Grading

The course will include six practical homework assignments, each involving the use of Orange Data Mining software and requiring students to analyze chosen datasets. Assignments will be submitted through quiz-like questionnaires.

The final grade for the course will be computed based on scores from the homework assignments. There will be no final exam. Optional "bonus" assignments may be provided.

Course Attendance

This course is primarily organized for on-site attendance to facilitate interactive learning and engagement. However, we understand that circumstances may occasionally prevent students from attending in person. For those instances, comprehensive study materials (lecture notes, recorded videos, and essential literature) will be made available. These resources are designed to ensure that students can thoroughly understand the course content and successfully complete the required homework assignments, even if they miss an on-site session.
