Recommender Systems

Additional materials for teachers: recommender systems

The page contains additional materials for teachers who want to conduct the lesson on the Cartoon Recommendation System. The lesson is designed so that students, without using computers, measure the similarity of their tastes in cartoons and create recommendations based on this.

  • About Recommender Systems: introduces the basic idea of recommender systems with a few thoughts that can serve as an introduction to the lesson.
  • Data Entry Page: more details on preparing the data entry website, primarily on how, if desired, we can prepare our own data.
  • Computer Demonstration: describes the workflow for the Orange program. If we want to include a computer, we can use a ready-made workflow that doesn't require prior knowledge of the program; here, we describe the details for those interested in understanding what each part of the workflow does.
  • Background for Teachers: explains real-world recommender systems in more detail.

Placement in the Curriculum

The activity flirts with mathematics: students are essentially looking for intersections of sets. If more math is desired, it can also involve unions and even written division with decimals. Additionally, we draw graphs similar to sociograms; while these may not be a part of the elementary school curriculum, they essentially represent relationships and are foundational to many mathematical topics.

Given the topic—recommendation systems—it obviously fits into computer science and informatics.

The activity can be linked to other subjects if, instead of cartoons or musicians, students choose other items, such as their favorite books. It is also interesting as a group activity, and the "tasteogram" for the entire class can sometimes be seen as an approximation of a sociogram — to the extent that taste similarity overlaps with socializing.


This material is being developed as part of the DALI4US project.

Chapter 1: About Recommender Systems

A completely unrelated but interesting fact: GPS must account for the theory of relativity. Since satellites are further from the centre of the Earth than receivers, they experience weaker gravity, and their clocks run faster. The difference is +45.8 μs per day! That may sound negligible, but signals travel at the speed of light, which covers more than ten kilometers in 45.8 μs — so a single day's uncorrected clock error would render positions useless.

Arthur C. Clarke was probably right when he wrote that "any sufficiently advanced technology is indistinguishable from magic". However, let us hope he didn't assume that this magic would overly astonish, shock, or even frighten anyone, because in that case he would have been mistaken. On the contrary: any sufficiently widespread technology quickly becomes something we take for granted. Once it integrates into daily life, no one cares whether it relies on semiconductors or magical spells. Take, for example, the fact that our phones pinpoint our location with meter-level accuracy at any given moment (and even constantly share it with our loved ones). This surprises no one, even though a common Earthling (and most of the slightly less common ones too) knows little more about how this technology (or magic?) works than about the spells in Harry Potter.

One of the technologies that has become so common we barely notice it anymore is recommender systems. In music apps and video platforms, they suggest songs and videos we might like; in online stores, they offer products we're likely to buy; on trip-planning websites, they recommend new routes similar to those we’ve walked, run, or cycled before; and on social media, they supposedly help us discover new friends.

Recommender systems influence our decisions, so it's important to understand how they work. At the same time, their inner workings are simple enough to explain to fourth graders. Not only can we explain them, but we can also easily (and without a computer) demonstrate them.

Recommender systems — unlike the GPS mentioned in the introduction — are not exactly rocket science. What they need to accomplish and how they do it is obvious. If we are looking to buy a new rear derailleur for a bike, they should also offer us a front derailleur, a chain, and a cassette.

In 2024, 500 hours of new videos were uploaded to YouTube every minute. How should YouTube decide who to recommend them to?

Connections between products — what to offer alongside what — must, of course, be inferred from data. Manually inputting this information would be time-consuming, less accurate, and often impossible. Recommender systems rely on observing similarities. Two products, videos, images, songs etc., are considered similar if they interest the same people. If Ann, Bella and Charlie listened to both Taylor Swift and Dua Lipa, the two singers are likely similar, at least in style. (The author of this text is not an expert on the subject though.) Therefore, when Daisy listens to one of them, it makes sense to recommend her the other as well.

Real recommender systems use both approaches, and many others besides.

We can also take the opposite approach. Instead of measuring similarities between musicians (i.e. two musicians are similar if they are listened to by the same people) and recommending all artists similar to Taylor Swift to each of her listeners, we can focus on measuring similarities between users. Two shoppers, two video viewers, or two planners of Sunday trips are similar if they buy, watch, and choose the same products, videos, and trips. If we need to make a recommendation for Joe, and find that Benny and Teddy are similar to him, we will recommend what they liked.
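The user-based approach can be sketched in a few lines of Python. All names and listening data below are invented for illustration; similarity is simply the number of shared artists, as in the classroom activity:

```python
from collections import Counter

# Hypothetical listening data: who listens to which artists.
listens = {
    "Ann":     {"Taylor Swift", "Dua Lipa", "Adele"},
    "Bella":   {"Taylor Swift", "Dua Lipa", "Beyonce"},
    "Charlie": {"Taylor Swift", "Dua Lipa"},
    "Joe":     {"Taylor Swift", "Adele"},
}

def recommend(user, k=2):
    """Recommend artists liked by the k users most similar to `user`,
    where similarity is the number of shared artists (the intersection)."""
    others = [u for u in listens if u != user]
    others.sort(key=lambda u: len(listens[u] & listens[user]), reverse=True)
    votes = Counter()
    for u in others[:k]:
        votes.update(listens[u] - listens[user])  # only what `user` hasn't heard yet
    return [artist for artist, _ in votes.most_common()]

recommend("Joe")  # Joe's nearest peers are Ann and Bella → "Dua Lipa" comes first
```

With real data one would break ties and weight the votes more carefully, but the structure — find similar users, collect what they liked — is the same.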

Here, we will take the second approach, as it is more engaging for classroom use: students will measure the similarity of their tastes in music, cartoons, movies, books, or anything else, and use this information to create recommendations for one another.

Chapter 2: Data Entry Page

After students choose their favorite cartoons, draw the "tasteogram," and derive recommendations from it, the lesson can logically continue by having students enter their choices on the website using tablets or computers. This will allow us to show the recommendation system based on the data for the entire class.

Data Entry Page

To prepare for this, we first set up the data entry form before the lesson. First, we select the type of data: cartoons, music groups, our own data, or data from a previous event. The first two are already prepared. We will write about our own data below. If we want to repeat the lesson with data we previously uploaded, we select the fourth option and enter the key. If the URL for the past event was https://data.pumice.si/grandparent-train, we enter grandparent-train here.

We can also set how many things the students must choose at a minimum and how many they can choose at a maximum. The current setting of 7 is suitable for groups of 8-10 members choosing from 50-70 items. Since these are typical group sizes and item numbers, this setting is probably appropriate.

We have tested the activity with varying numbers of items to choose from. As a rough rule of thumb, the number of items each student selects should be about one tenth of the total number of items.

The idea is that there should be enough choice (so not everyone selects the same things, and students can "profile" themselves), but not so much that choices spread too thinly and too few are shared. A larger number of items increases diversity. Requiring more selections obviously increases intersections, but if students have to select too many things, their choices can overlap too much, and the task becomes overwhelming.

If they are not limited to a specific number of items, the similarity calculation becomes complicated. This is described in the next section.

The "Shuffle item order" option is used if we won’t be using cards but instead students will select items directly from the website. In this case, it makes sense to show each student a different order.

Data Entry Page - Links

When we click Prepare page for activity, we get a link for data entry, which we enter into the tablets or computers that the students will use, and a link for downloading the collected data, which we input into the Orange workflow. If we click on the latter link directly in the browser, we will download an Excel file with the selection data — of course, after the students have entered their choices.

Additionally, here we can get a PDF with the cards.

Preparing Your Own Data

If we want to upload our own data, we prepare it in Excel. There are two possible formats.

We can prepare a table with two columns. The first contains links to images (for example, cartoon posters), and the second contains names (the titles of the cartoons), which will appear below the cards. You can see an example of a file with series.

The second option is a table with more columns. The first two columns contain the information described above, and the others can include any additional data. The first row of the table must contain the column names. The website will not use the data from the additional columns, but these columns will be used in the data table to show which student selected which item. Advanced users of Orange can use this for deeper analysis, but this is not described here.

If we upload our own data, the website will also generate a PDF with the cards. To make them look better, it's recommended that the images have a similar (preferably the same) aspect ratio. If some images are portrait-oriented and others are landscape-oriented, the cards with the landscape images will have a lot of empty space.

The Page Used by Students

Selection Page

The link we get on the form preparation page, such as https://data.pumice.si/prince-apple, is the page where students select cartoons, musicians, or other items. Once they select the appropriate number of items and enter their name (the system ensures that two students don’t enter the same name, so they may need to add the first letter of their last name), they can submit their selection and pass the tablet to the next student.

Currently, the page is not suitable for use on smaller screens, such as phones.

Chapter 3: Computer Demonstration

Teachers who do not wish to delve into how the Orange workflow used in the classroom is structured can safely skip this section.

If we decide to use computers for the activity, we prepare the data entry link before the lesson by selecting pre-prepared data (Cartoons or Music Groups) on the website or by uploading our own data, as described in the previous section.

To work with a computer, we need to install the Orange program and the Networks and Pumice add-ons. We install them by selecting them in Preferences / Add-ons, confirming the selection, and restarting Orange.

No special knowledge of Orange is required, as a ready-made Orange workflow is available on the Pumice website. We simply enter the link for downloading the collected data, and we can immediately observe the "tasteogram" for the whole class and recommendations for each student.

Understanding the workflow is not necessary, but it may be useful, so this guide explains its components.

File Widget

The File widget retrieves the collected data. In the URL field, we enter the link to the page for downloading the data collected in the classroom. If the URL used for data collection in the tablets was https://data.pumice.si/grandparent-train, then the data will be available at https://data.pumice.si/grandparent-train/data. The default URL loads test data.

Distances Widget

The Distances widget calculates the distances between students. The distance measure used is based on Jaccard similarity, which divides the intersection size of two selections by the size of their union. Although students only consider intersections, the ranking by similarity remains the same because each student selects the same number of items.

There is nothing to change here. Other distance measures may give reasonable results, but Jaccard is the most suitable for our purposes and ensures consistency with the method used by students.

Neighbors Network Widget

Next is the Neighbors Network widget. The input to this widget is the distance matrix. It constructs a network by connecting each student to their three most similar peers.

You may want to open this widget and check that the number of neighbors is set to the (default) value of 3.

We can experiment by increasing or decreasing the number of neighbors to get broader or more specific recommendations. One extreme is connecting all students to everyone, in which case recommendations will simply reflect the most popular choices in the class. The other extreme is each student having only one neighbor, meaning they will receive recommendations solely based on a single classmate.
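To illustrate what the widget computes (not its actual implementation), here is a minimal sketch: given pairwise distances, connect each student to their k nearest peers with directed edges. The students and distance values below are made up:

```python
# Made-up distance matrix between four students (smaller = more similar).
students = ["Ann", "Ben", "Cid", "Dee"]
dist = {
    ("Ann", "Ben"): 0.2, ("Ann", "Cid"): 0.7, ("Ann", "Dee"): 0.5,
    ("Ben", "Cid"): 0.4, ("Ben", "Dee"): 0.9, ("Cid", "Dee"): 0.3,
}

def d(a, b):
    """Distances are symmetric, so look the pair up in either order."""
    return dist.get((a, b), dist.get((b, a)))

def neighbors_network(k=3):
    """Connect every student to their k nearest peers (directed edges)."""
    edges = {}
    for s in students:
        peers = sorted((p for p in students if p != s), key=lambda p: d(s, p))
        edges[s] = peers[:k]
    return edges

neighbors_network(k=2)  # with only four students, k=2 keeps the network sparse
```

Note that the edges are directed: Ann may count Ben among her nearest peers without Ann being among Ben's.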

Network Explorer Widget

The Network Explorer widget visualizes similarities between students. The connections are directed: each student has exactly three outgoing arrows. Sometimes, it may seem like only two arrows are visible, but this happens when lines overlap.

Recommendations Widget

The Recommendations widget requires the similarity network from Neighbors Network and the cartoon data from File, so both need to be connected. The widget needs the cartoon data to display titles and images and the network to calculate recommendations.

Chapter 4: Background for Teachers

The recommendation system we created in the class is, of course, quite simplified. At the end, here is a section for teachers who wish to deepen their understanding.

Graphs

Friendship graphs are characterized by triangles: two people with a common acquaintance are likely to know each other. In contrast, adversarial networks are characterized by quadrilaterals: two people with a common enemy usually do not hate each other, but they probably share another common enemy. This creates a quadrilateral of enemies.

We presented the similarities in the form of a graph. A graph is a term used by mathematicians to refer to a set of points, some of which are connected. Graphs are useful in many domains; teachers are most likely to encounter them in the form of a sociogram of students.

A graph is not necessarily always drawn. Facebook and Twitter certainly don't print a graph of friends and followers, as it would be too large (three billion people with countless connections). Instead, they use the relationship between friends or followers for everything they do, from recommending new connections to selecting ads. They are usually transparent about this and explain that they recommend following someone because, for example, they share common followers.

Connections can have different weights: some are stronger, while others are weaker, as some pairs are more similar than others. Connections with greater weights can be given more importance than those with smaller ones. Once we introduce weights, it is no longer necessary for each person to be connected to a specific number of other people: we can connect everyone to everyone, but weight the connections based on similarities, which can also be 0.

There can also exist multiple graphs simultaneously. Twitter certainly doesn't just track who follows whom and whose tweets people like. It is also interested in what we write about: two users are considered similar if they write about similar topics. This creates multiple similarity relationships, which are combined with different weights.

Calculating similarity

In the class, we calculated similarity simply based on the number of common cartoons. Mathematicians would inform us that this is called the size of the intersection of two sets of cartoons.

In practice, this would be a poor measure of similarity, as two people who have chosen (bought, listened to, or watched) many items will usually also have more of them in common, thus overestimating their similarity. A more common approach to account for the number of selected items is to divide the number of items both have chosen by the total number of items chosen by at least one of them. Or, as mathematicians would put it, to divide the size of the intersection by the size of the union. This definition of similarity is useful enough to have earned its own name: we call it the Jaccard index.
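The Jaccard index is straightforward to compute. In the sketch below, the cartoon titles and the two students' selections are invented; each student picks seven items, as in the classroom setting:

```python
def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0

# Invented selections: seven cartoons each, three in common.
ann = {"Frozen", "Shrek", "Moana", "Up", "Brave", "Cars", "Coco"}
ben = {"Shrek", "Moana", "Up", "Bolt", "Luca", "Soul", "Rio"}

jaccard(ann, ben)  # 3 shared out of 11 distinct items → 3/11 ≈ 0.27
```

Two identical selections give 1, two disjoint selections give 0, and choosing many items no longer inflates the similarity, since the union grows along with the intersection.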

For a lesson, this approach would be inconvenient, as it requires either decimal numbers or the comparison of fractions. If Ann shares 4 out of 12 cartoons in common with Ben, and 3 out of 10 with Cedric, which of them is more similar? Which is larger, 4 / 12 or 3 / 10? This method is more suitable for older students, but keep in mind that it will require quite a bit of comparison: each group member will need to calculate similarities with all other members (say, seven) and then either sort these fractions by size or convert them into decimal numbers for easier comparison.
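For checking such comparisons, Python's `fractions` module compares fractions exactly, with no rounding. Using the numbers from the example above:

```python
from fractions import Fraction

# Ann shares 4 of 12 cartoons with Ben and 3 of 10 with Cedric.
ann_ben = Fraction(4, 12)     # = 1/3  ≈ 0.333
ann_cedric = Fraction(3, 10)  # = 3/10 = 0.300

ann_ben > ann_cedric  # True: Ben is the more similar of the two
```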

Of course, there is nothing wrong with this if we want to practice arithmetic. However, if we want to focus on recommendations and not let the mathematics take up all our energy and time, we specify that each person must select a fixed number of items. In this case, comparing based on the number of shared cartoons gives the same ranking by similarity as we would get if we calculated the Jaccard measure.

Profiles of things and users

In fact, a graph is not usually the first thing that comes to mind when creating a recommendation system. Graphs are useful for social networks (in fact, the terms network or web are often synonyms for the word graph), but they are less commonly used for recommendations in stores or on music and video platforms. In those cases, users and items are typically profiled.

For those who are interested in the subject and not intimidated by mathematics (no formulas, but still: mathematics!), let's illustrate the process with an example.

In 2006, Netflix released anonymized data for nearly half a million users who had rated 18,000 movies. Altogether, the data set included a total of 100 million ratings. Netflix withheld some three million of these ratings as a test set and offered a million-dollar prize to the team that could exceed the accuracy of their in-house system for predicting the ratings by more than 10%. The prize was awarded in late 2009.

This doesn't go beyond high school mathematics: the attributes of each movie and the preferences of each viewer are represented as fifty-dimensional vectors, and the predicted rating of a movie by a viewer is calculated as their dot product.

The winners' approach was based on the idea that each movie could be described using around 50 attributes. These can be thought of as different "dimensions" of a movie — for instance, how funny, romantic, scary, or visually impressive it is, whether it features well-known actors, or has an unusual storyline. The degree to which a movie possesses a certain attribute is expressed by positive or negative numbers: for example, the number representing humor will be higher for a funnier movie and more negative for a sadder one.

On the other side — the assumption goes — the same 50 attributes can be used to describe viewer preferences. A movie's predicted rating is then determined by the match between the movie's attributes and the viewer's preferences.

But — who will define these 50 attributes and manually input them for all 18,000 movies? And more importantly, who will do the same for half a million viewers? How would they determine each viewer's taste? By interviewing each viewer to find out whether they prefer comedies or horror movies?! Of course not, that's precisely why we have the data.

This is merely an intuitive explanation. What we are describing can be mathematically expressed through derivatives: the attributes of movies and viewer preferences are adjusted in the direction opposite to the gradient.

One possible approach is to describe each movie and each viewer's preferences using 50 random (positive and negative) numbers within an appropriate range. We don't concern ourselves too much about what specific attributes these 50 numbers represent. Then, for each movie rating — derived from matching the attributes and preferences — we check how it deviates from the true value and adjust (invented!) movie attributes and viewer preferences in a way that will reduce the error. For example, if the movie rating was too high, we examine which attributes of the movie are positive. If the movie is supposed to be funny, we decrease the preference component for humor, as this viewer doesn't enjoy funny movies. We also look at negative attributes: if the movie is not supposed to have famous actors, we increase the viewer's preference for famous actors, as that will lower the rating for this movie. If the movie rating was too low, we do the opposite.

We treat the movie symmetrically: if the rating was too high or too low, we examine the viewer's preferences, adjusting the (assumed) attributes of the movie accordingly.

Actually … earlier, we discussed humor and famous actors, even though all the data is randomly generated, and we decided that we wouldn't predefine movie attributes or tastes. That's the beauty of it. We don't care what the attributes are. If we go over all the ratings of all the movies multiple times (say, forty times), the attributes and preferences will become more accurate with each pass. We will probably never know exactly what each of the fifty attributes really means, but we will know that they reflect the characteristics of movies and tastes in a way that makes sense for movie ratings.
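The adjustment process described above can be sketched as a small gradient-descent loop. The ratings below are an invented toy data set (a handful of ratings instead of 100 million, 3 hidden attributes instead of 50), so this is an illustration of the idea, not the winning system:

```python
import random

# Invented ratings: (viewer, movie, rating on a 1-5 scale).
ratings = [("Ann", "A", 5), ("Ann", "B", 1), ("Ben", "A", 4),
           ("Ben", "C", 2), ("Cid", "B", 5), ("Cid", "C", 4)]

K = 3      # number of hidden attributes (around 50 in the Netflix case)
LR = 0.01  # how strongly each error nudges the profiles

random.seed(0)
viewers = {v for v, _, _ in ratings}
movies = {m for _, m, _ in ratings}

# Start from small random profiles; we never say what the attributes mean.
P = {v: [random.uniform(-0.1, 0.1) for _ in range(K)] for v in viewers}
Q = {m: [random.uniform(-0.1, 0.1) for _ in range(K)] for m in movies}

def predict(viewer, movie):
    """Predicted rating: how well the two profiles match (dot product)."""
    return sum(p * q for p, q in zip(P[viewer], Q[movie]))

for _ in range(4000):  # "go over all the ratings multiple times"
    for v, m, r in ratings:
        err = r - predict(v, m)  # positive if our prediction was too low
        for k in range(K):
            p, q = P[v][k], Q[m][k]
            P[v][k] += LR * err * q  # nudge the viewer's preferences ...
            Q[m][k] += LR * err * p  # ... and the movie's attributes, symmetrically
```

After enough passes, `predict` reproduces the known ratings closely, and the learned profiles can then be used to predict ratings for movies a viewer has not seen. (A real system would also add regularization so the profiles generalize instead of merely memorizing.)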

The described process is known as matrix factorization, and in this case specifically, it is a variant of singular value decomposition (SVD). Real-world systems, in addition to this, also consider numerous other factors, such as changes in taste over time.

Each movie and each viewer is thus described by 50 numbers representing their profile. We do not know what these numbers mean, but we do know that a viewer's satisfaction with a movie depends on how well the viewer's and the movie's profiles match.

Profiles are useful for many things. A pair of movies—or a pair of viewers—is similar to the extent that their profiles are similar. We can use profiles to find groups of similar movies; these groups will reflect "genres" defined by viewers' tastes. Or we can find groups of viewers, which might be useful for the marketing department interested in better meeting the needs of different types of viewers.

A similar idea of profiling—using slightly different mathematics—is also used in deep neural networks for recognizing, organizing, and searching images, analyzing texts, and more. However, that is beyond the scope of this material on recommendation systems.