Game playing and reinforcement learning

How computers play games and a lesson about reinforcement learning. When is something (artificially) intelligent? Is AI even possible?

The topic of the lessons Intelligent Paper and The Candy Computer is a fundamental one: what exactly is artificial intelligence?

We strive to integrate artificial intelligence into various school subjects, which is why most lessons revolve around data. Intelligent Paper and The Candy Computer do not follow this pattern and are not tied to any specific subject (except computer science). We include them because, more than other lessons, they challenge the question of when something can truly be called intelligent.

At the same time, both lessons are so interesting that it would be a shame to leave them out.


This material is being developed as part of the DALI4US project.

Chapter 1: Introduction

One of the pioneers of artificial intelligence, Donald Michie (1923 – 2007), invented MENACE (Matchbox Educable Noughts and Crosses Engine), a mechanical computer that learned to play noughts and crosses against a human opponent. Back in 1961, even top scientists like Michie didn't always have computers at their disposal.

Instead of paper cups, he used 304 matchboxes filled with colored beads, each box with a V-shaped piece of paper at the front. To make a move, he would shake a box and pick the bead that landed at the tip of the V. After a loss he removed beads from the boxes, and after a win he added more. Although the computer worked, it is impractical for our purposes, as it requires 304 matchboxes (or papers and cups), and Michie also needed 220 games, played over two days, to train it.

The game we use here, Hexapawn, was invented by Martin Gardner (1914 – 2010), the popular writer of articles and games in "recreational mathematics." The advantage of this game over noughts and crosses is that it has far fewer possible positions (and therefore requires fewer cups or matchboxes). The game is simple enough that we can analyze all possible positions and see that the player who moves second can always win by playing intelligently, yet this is not immediately obvious. There is no need to analyze this with children, except perhaps after completing the activity. The game's simplicity doesn't make the activity any less interesting: the computer is "smart" simply because it managed to find the optimal strategy.

Chapter 2: The Tree of Positions and Winning Strategies

Any game played in moves can be represented by a tree, like the one below for tic-tac-toe.

Playing the game is like a journey through the tree. We start at the top and always travel down, deciding on one of the branches at each move. In a single-player game whose goal is to reach a target, all moves are chosen by the same player. In a two-player game, like tic-tac-toe, the players take turns choosing a move, that is, the next branch in the tree.

The tree ends in the "leaves." Each leaf is associated with an outcome: a win for one player, a win for the other, or a draw.
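To make the structure concrete, here is one way such a tree could be held in code: a minimal Python sketch, with the `Node` class and outcome labels chosen just for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Inner nodes have one child per possible move; leaves have no
    # children and instead carry the outcome of the finished game.
    children: dict[str, "Node"] = field(default_factory=dict)  # move -> subtree
    outcome: str | None = None  # name of the winner, or "draw", at a leaf
```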

A good player chooses branches that lead them to victory. In some positions, one of the players can ensure a win with smart play. Most obviously, this is a position where the player can win in a single move; or a position where the player has a move that, regardless of the opponent's response, leads to a position with a winning move; or a position that leads to a position that leads to … You get it, right?
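This recursive rule translates almost word for word into code. Here is a minimal sketch over the `Node` tree from above, assuming a game without draws (Hexapawn has none): the player to move can force a win exactly when some move leads to a position from which the opponent, who moves next, cannot.

```python
def can_force_win(node: Node, to_move: str, other: str) -> bool:
    """True if the player `to_move` can force a win from this position."""
    if node.outcome is not None:
        return node.outcome == to_move   # leaf: the game is already decided
    # Try every available move; one winning continuation is enough.
    return any(
        not can_force_win(child, other, to_move)   # opponent moves next
        for child in node.children.values()
    )
```

Calling `can_force_win(root, "white", "black")` on the full Hexapawn tree would return False; since the game has no draws, that means the winning strategy belongs to black.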

Sometimes a game is decided before it even starts: in the starting position, one of the players can choose moves that will certainly lead them to victory. Such games are not very interesting in practice, but they are interesting for research. In some games, the winning strategy belongs to the player who moves first ("white"); in others, to the player who moves second ("black"). Tic-tac-toe is an example of a game where white, with a smart selection of moves, can ensure they do not lose. Hexapawn is an example of a game with a winning strategy for black. That's why this activity is designed so that the computer makes the second move.

Not interesting in practice?! In fact, every game with no element of chance, in which both players have complete information about the state of the game, has a strategy for one of the players that will certainly lead to a win or at least a draw. This includes chess, but chess has around 10^120 possible positions, so such a strategy cannot be found the way we found it for Hexapawn, by searching the entire tree.

The computer's task is to find the winning strategy. When explaining the process, we can once again use the tree. When the computer loses a game, we remove the candy corresponding to its last move, the one that let the human win. It is important to remove only that last candy, not all of the candies along the way, because the last move is the only one we can be certain was wrong.

By gradually removing candies, we cut off parts of the tree that potentially lead to defeat. This leaves only moves that will certainly lead the computer to victory. The tree does not turn into a single, linear path: the human player still has two possible moves at the beginning, and they can make different decisions in later moves. The computer can also — at least in principle — have several possible responses in some positions that lead to a win.

An interesting aspect is the role of the plastic cups and candies: they serve as the "memory" of our computer, eventually recording branches of the tree where the computer has never lost.
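For readers who want to experiment, here is a toy simulation of the candy computer. The class name, position encoding, and helpers are assumptions made for the sketch: one cup per position the computer can face, one candy per legal move, and exactly one candy removed after every lost game.

```python
import random

class CandyComputer:
    def __init__(self, legal_moves):
        # One "cup" per position, holding one candy per legal move.
        self.cups = {pos: list(moves) for pos, moves in legal_moves.items()}
        self.last_draw = None   # (position, move) of the most recent draw

    def choose(self, position):
        """Shake the cup for this position and draw a candy at random."""
        move = random.choice(self.cups[position])
        self.last_draw = (position, move)
        return move

    def punish(self):
        """After a loss, permanently remove only the last candy drawn."""
        position, move = self.last_draw
        self.cups[position].remove(move)
```

An emptied cup would mean the computer resigns whenever that position comes up, and a fuller simulation would then punish the previous draw instead; the sketch leaves that detail out.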

Chapter 3: Reinforcement Learning

What about when a game doesn't have a clear winning strategy? Perhaps because of an element of chance? Or because the complexity of the game makes it impossible to traverse all possible positions? Or perhaps it's not even a game, but rather a goal we want to achieve, where different paths to that goal can be more or less successful, simpler or more complex, cheaper or more expensive?

In this case, we turn to reinforcement learning. In the earlier material on predictive models, we learned about machine learning algorithms that predict a certain outcome or class based on data. Reinforcement learning is more complex: its task is to build a model that, in a given situation, predicts which of the possible actions is more likely to lead to the goal, or will do so more successfully. Unlike in learning predictive models, there is no immediate feedback about whether a decision was correct. After making a decision, we find ourselves in a new situation and face a new decision. Only at the end do we learn how successful we were.

Reinforcement learning procedures generally work by "rewarding" all the moves we made after successfully reaching the goal; higher rewards are given to the last moves, with earlier ones receiving progressively smaller rewards. This way, the correct decisions are reinforced, while the bad ones fade away.
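Here is a minimal sketch of such a scheme, with an assumed discount factor `gamma` controlling how quickly the reward fades for earlier moves; the score table stands in for whatever model is actually being trained.

```python
def reinforce(episode, reward, scores, gamma=0.9):
    """Spread `reward` over the (state, action) pairs of one played game,
    giving the most credit to the last move and less to earlier ones."""
    credit = reward
    for state, action in reversed(episode):   # walk back from the goal
        scores[(state, action)] = scores.get((state, action), 0.0) + credit
        credit *= gamma                       # earlier moves earn less
```

After many games, actions with high scores in a given state get chosen more often, which is exactly the reinforcement of correct decisions described above.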

Teaching the computer with candies is, in some sense, reinforcement learning. More precisely: "punishment learning" (the term is informal and spontaneous :), since we punish bad decisions. Here the "punishment," unlike in reinforcement learning, applies only to the last decision, and it is not gradual but definite: instead of merely reducing the likelihood of choosing the wrong move again, we prohibit it altogether.

Chapter 4: What Is Artificial Intelligence?

The description of the lesson The Candy Computer ends with a reflection on whether a computer that learns to play a game optimally can already be considered intelligent. Children will surely object, because they know how it works.

But is a computer really intelligent just because children don’t know how it works? Just like the candy computer, a regular computer always works deterministically, based on rules, procedures, and programs written by humans. In principle, there is no difference between a real computer and the candy one. This very idea probably fascinated Donald Michie, the British scientist who first created a prototype of such a computer — otherwise, he wouldn’t have spent two days playing tic-tac-toe with little boxes, even though he surely knew what the experiment’s final outcome would be.

So, is artificial intelligence possible or not?

The last two paragraphs of this text were written ten years before the arrival of ChatGPT and other large language models. Since their arrival, the Turing test has lost some of its significance, while these paragraphs — especially the Chinese room experiment — have only grown more relevant.

Everything depends on what we call (artificial) intelligence. Philosophers of artificial intelligence — and real philosophers, too — love to discuss this and have come up with quite a few interesting thought experiments. Alan Turing (1912 – 1954), who appears at every corner of computer science, devised a test that we now call the Turing test: a computer will be considered intelligent when test participants, conversing (say, via keyboard) with both a computer and a human in another room, won't be able to tell which is which. Philosopher John Searle (1932 – ) argues that a computer will never truly be intelligent because it will never understand what it outputs. He illustrated this with the Chinese room experiment: imagine a person inside a room who pretends to understand Chinese by receiving slips of paper with Chinese characters and returning written responses — not based on understanding but by following detailed instructions from a book. To an outside observer, this person would seem to know Chinese, but in reality, they wouldn't understand anything. Therefore, the Turing test is flawed. As for what it really means to "understand," Ludwig Wittgenstein (1889 – 1951) attempted to describe it with his Beetle box experiment, where we imagine a group of people with small boxes that …

Actually, it doesn’t really matter. The point is that the question of whether computers can truly be intelligent is more a matter of philosophy than computer science. So let’s leave it to the philosophers.