Deep Learning

What deep neural networks look like, and how they recognize images ...

Deep learning is a branch of machine learning that uses deep neural networks — massive neural networks with dozens of layers of neurons — as its model. These networks are responsible for the recent prominence of artificial intelligence. It all started with image and audio recognition, then progressed to image generation, and eventually led to large language models capable of generating text that is difficult to distinguish from human-written content. (A large language model rather smugly generated the ending of the previous sentence, from »that is« onward.)

The basic idea behind neural networks is simple, and we will explain it briefly below. However, the details are too numerous and too diverse to delve into here. At Pumice (at least for now), there are no activities explaining how the technology works—we primarily use deep models rather than dissecting them.


This material is being developed as part of the DALI4US project.

Chapter 1: Introduction

We assigned professions to dwarves based on their appearance and the tools they carried. When we gave the same task to a computer, we had already described each dwarf with a list of attributes—does it have a shovel, does it have a lantern, what color are its shoes and belt, which way is its hat facing... But could it recognize the dwarves' professions without this data, directly from images?

Difficult, right? An image is described by the colors of individual pixels. In the case of the dwarves, we know which pixels are white and which are black, but could their professions really be inferred from such data?

With houses, shrines, or other images, the challenge is even greater: their characteristics are less obvious, more embedded within the image. Yet, when it comes to classifying places of worship, the shape of the roof, for instance, tells us quite a lot, but not necessarily everything. It is often about an »impression«. We intuitively recognize that a particular shrine is Buddhist rather than Hindu, even if, at least as Westerners, we may not be able to explicitly explain why. Our impression is therefore implicit; we cannot describe it using concrete attributes.

Similarly, the way images — and other similar data, such as text and audio recordings — are represented in a computer is also somewhat implicit. Deep models represent an image using a specific number of variables, say 2048 variables, whose meanings we cannot explain. However, we do know that within them lies a description of the image’s content, good enough not only to distinguish cats from dogs but also to differentiate a Buddhist shrine from a Hindu one. These numbers essentially form a kind of »profile« of the image.

Chapter 2: Neural Networks

Neural networks are made up of neurons. So let's start with — once upon a time, there was a neuron.

A Single Neuron

Imagine a person visiting a doctor because they feel unwell. As they enter the office, they list their symptoms: coughing, a slight fever, a headache — but no vomiting. They also have some rashes on their abdomen, but those have been there for a while … The doctor listens and adds things up. Let's say this is a flu specialist who, in the end, only cares about one thing: could this person have the flu or not? Each symptom adds or subtracts a few points, and in the end, the doctor sums up all the points—if the total exceeds a certain threshold, they conclude it's the flu.
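
To make the point-counting concrete, here's a minimal sketch in Python. The symptoms, weights, and threshold are all invented for illustration:

```python
# Invented weights: positive symptoms add points toward "flu",
# negative ones subtract them.
weights = {"cough": 2.0, "fever": 3.0, "headache": 1.0,
           "vomiting": -2.0, "rash": -1.0}
threshold = 3.0

def looks_like_flu(symptoms):
    """symptoms maps each symptom name to 1 (present) or 0 (absent)."""
    score = sum(weights[name] * value for name, value in symptoms.items())
    return score > threshold

print(looks_like_flu({"cough": 1, "fever": 1, "headache": 1,
                      "vomiting": 0, "rash": 1}))  # 5.0 points -> True
```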

Of course, this is a simplification. Symptoms are interconnected or may even exclude one another, and the doctor takes all of this into account. They also see the patient in front of them, know their medical history, and consider other conditions …

Doctors don’t consciously calculate like this, but they still use an unconscious model of weighing evidence for and against a diagnosis.

That’s how a human expert works. But could a computer learn to do this? Learn in the sense of building its own model for flu detection? A simple way would be to take the approach of the Candy Computer, which initially makes random guesses but gradually adjusts its decisions based on past outcomes. The model could start by assigning random weights to symptoms. When seeing the first patient, it would make a prediction, then check whether it was correct. If the patient had the flu, the model would increase the weights of present symptoms and decrease those of absent ones. Otherwise, it would do the opposite. After enough training examples, the model could become quite good.
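
Here is a sketch of that trial-and-error training in Python. We assume symptoms are 0/1 values and, as one reasonable reading of the recipe above, adjust the weights only when the prediction was wrong:

```python
import random

def train(patients, n_symptoms, epochs=20, lr=0.1):
    """patients: list of (symptoms, had_flu) pairs; symptoms is a list of 0/1."""
    weights = [random.uniform(-1, 1) for _ in range(n_symptoms)]
    for _ in range(epochs):
        for symptoms, had_flu in patients:
            predicted_flu = sum(w * s for w, s in zip(weights, symptoms)) > 0
            if predicted_flu == had_flu:
                continue  # correct guess: leave the weights alone
            # Wrong guess: if the patient had the flu, raise the weights of
            # present symptoms and lower those of absent ones; otherwise
            # do the opposite.
            direction = 1 if had_flu else -1
            weights = [w + lr * direction * (1 if s else -1)
                       for w, s in zip(weights, symptoms)]
    return weights
```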

Logistic function

If we wanted to predict the probability of having the flu, we would need a formula to transform the sum of weights — which can be arbitrarily large or even negative — into a probability. Since probabilities must be between 0 and 1, a popular choice is the logistic function, which is defined as:

$$\frac{1}{1 + e^{-x}}$$

where $x$ is the sum of the weights. If the sum of the weights is 0, this formula gives a probability of 0.5. That makes sense: since $e^0 = 1$, we get $\frac{1}{1 + 1} = 0.5$. As the sum increases, $e^{-x}$ becomes smaller, and the probability approaches 1, since the denominator is just 1 plus a very small number. Conversely, the more negative the sum, the larger $e^{-x}$ becomes, making the denominator larger and the probability closer to 0.
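
The formula is one line of Python, and evaluating it at a few points confirms the behavior just described:

```python
import math

def logistic(x):
    return 1 / (1 + math.exp(-x))

print(logistic(0))    # 0.5: no evidence either way
print(logistic(5))    # ~0.993: strongly positive sum, probability near 1
print(logistic(-5))   # ~0.007: strongly negative sum, probability near 0
```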

What we just described is a neuron — if you're coming from the fields of machine learning and artificial intelligence. Statisticians, on the other hand, would recognize this as logistic regression, which is a type of linear model.

We can represent a neuron visually like this:

This neuron receives five inputs (such as symptoms), multiplies each by its corresponding weight, sums the results, applies the logistic function to convert the sum into a probability, and returns that probability.
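
In code, the whole neuron fits into a few lines; the five inputs and weights below are made up:

```python
import math

def neuron(inputs, weights):
    """Weighted sum of the inputs, squashed into a probability."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 / (1 + math.exp(-total))

print(neuron([1, 1, 0, 0, 1], [2.0, 1.5, -1.0, 0.5, -2.0]))  # ~0.82
```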

Multiple Neurons

A doctor who only diagnoses the flu wouldn't have many patients. They might compensate by teaming up with equally specialized colleagues: one diagnosing pneumonia, another identifying COVID, yet another recognizing chickenpox, and three others specializing in different types of stomach viruses. Each would be trained on their specific dataset and would develop their own model for their respective disease.

While writing this, I realized that calling these individuals »doctors« might be a stretch; they're more like disease-specific classifiers than »healers«. Nevertheless, let's continue and frame the problem differently: our task is to determine whether a person is sick or not, without caring to identify the specific illness.

At first glance, this task seems similar — it just requires a »chief doctor« who gathers diagnoses from the specialized experts. But this new task also introduces a new type of data: whereas before we knew whether a patient had a specific disease or not, now we only know whether they are sick in general.

We could handle this in the same way as before: forget about the flu and COVID specialists and instead train a single expert who learns weights based on whether a person is sick or not. This might work — or it might not. Maybe individual symptoms mean nothing on their own and only gain significance in combinations. For example, a headache suggests illness only if accompanied by joint pain and fever (flu), or fever plus loss of taste and smell (an older strain of COVID), or chest pain and difficulty breathing (pneumonia). On its own, a headache might just indicate dehydration or a stiff neck — not an illness.

Given this, it makes sense to keep some structure: even though our final goal is simply to determine whether a person is sick or not, we can build a two-level model. The first level predicts individual diseases, and the second combines them into a final decision.

Neural Network

This idea deserves a new title: such a structure is called a neural network.

Imagine that there aren't just three intermediate neurons but, say, ten. We've drawn only a few to keep the diagram readable.

To dispel any discomfort about this section: later, we'll see that the neurons in the first layer don't actually »know« which disease they specialize in.

The neurons in the first layer receive symptom data—each neuron gets all symptoms. They sum the weighted symptoms, with each one implicitly specializing in a particular disease. Some might not predict diseases at all but rather states like dehydration, fatigue, or stress.

At the top level, there's a neuron that receives predictions from the first layer and combines them into a final decision. Like any neuron, it uses weights: assigning a certain number of points if the flu neuron predicts flu, another number for pneumonia, and so on. It also accounts for non-disease conditions: if the dehydration neuron predicts dehydration, the »top-level neuron« will reduce the probability of illness accordingly.
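
Here is a sketch of this two-level model's forward pass, with random weights standing in for trained ones (ten »specialists«, five symptoms, all numbers invented):

```python
import math
import random

def logistic(x):
    return 1 / (1 + math.exp(-x))

def layer(inputs, weight_rows):
    """Each row of weights defines one neuron; every neuron sees all inputs."""
    return [logistic(sum(w * x for w, x in zip(row, inputs)))
            for row in weight_rows]

random.seed(0)
n_symptoms, n_specialists = 5, 10
hidden_weights = [[random.uniform(-1, 1) for _ in range(n_symptoms)]
                  for _ in range(n_specialists)]
output_weights = [random.uniform(-1, 1) for _ in range(n_specialists)]

symptoms = [1, 0, 1, 1, 0]
opinions = layer(symptoms, hidden_weights)               # first level: the specialists
sick_probability = layer(opinions, [output_weights])[0]  # second level: the »chief doctor«
print(sick_probability)
```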

Once we invented a single neuron, we also found a way to train it: when it sees a patient, it adjusts its weights based on whether its prediction was correct. Here, we do the same. When we find out whether a training example was actually sick or not, the »top-level neuron« rewards or penalizes the experts, adjusting the importance it assigns to their opinions. The experts then do the same: if they're penalized by the »top-level neuron«, they penalize the symptoms that misled them (by adjusting their weights) and reward the symptoms that correctly suggested the opposite decision. Likewise, if an expert is rewarded, it strengthens the weights of the symptoms that led to the correct prediction and weakens the weights of those that misled it.

This learning process is called backpropagation: the error is determined at the network's output, and its consequences (penalties and rewards) are propagated backward through the network.
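
A numpy sketch of one such reward-and-penalty step for our tiny network. The sizes, learning rate, and starting weights are arbitrary; the update rules are the standard gradients for logistic neurons:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1 / (1 + np.exp(-x))

W1 = rng.normal(size=(10, 5))   # 10 hidden "experts", 5 symptom weights each
w2 = rng.normal(size=10)        # the "top-level neuron's" weights

def train_step(symptoms, is_sick, lr=0.5):
    global W1, w2
    opinions = logistic(W1 @ symptoms)   # the experts' opinions
    output = logistic(w2 @ opinions)     # predicted probability of illness
    error = is_sick - output             # reward (+) or penalty (-) at the output
    # Each expert's share of the blame, scaled by how much its opinion counted:
    blame = error * w2 * opinions * (1 - opinions)
    w2 += lr * error * opinions              # the chief reweighs the experts
    W1 += lr * np.outer(blame, symptoms)     # the experts reweigh their symptoms
    return output

train_step(np.array([1., 0., 1., 1., 0.]), is_sick=1)
```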

Also, note that expert weights can be negative if an expert indicates that someone is not sick.

We've omitted (except in the side notes) that this method does not train intermediate experts to specialize in specific diseases. If we only know whether a patient is sick, but not which disease they have, we can't explicitly train an expert for flu.

Even more interesting: remember that initially, the experts are making random guesses. We'll do the same here: at first, all weights will be random. No one will know what they're predicting. The network will make nonsensical decisions and issue rewards and penalties accordingly.

Over time, something interesting will happen: the network will self-organize. Suppose one of the first-layer neurons is slightly (even just a tiny bit) biased toward predicting flu. Eventually, a flu patient will appear, and that neuron will randomly suggest flu and be rewarded. As a result, it will strengthen flu-related symptoms, making it more likely to predict flu in the future. Over time, it will become a flu specialist.

Similarly, another neuron might randomly lean toward dehydration and gradually specialize in it. A third might become a pneumonia expert.

For this to work, we need enough neurons in the first layer. Then, probability will take care of the rest.

We don’t explicitly train first-layer neurons for specific diseases. And if our only goal is to determine whether someone is sick, we don’t even need to know what those intermediate neurons are doing. We know that they specialize in different diseases and conditions, but which does what is irrelevant. In fact, it may not even be possible to determine. Yes, we could analyze their weight distributions and identify some patterns, but many of them might not correspond clearly to any particular disease.

And that's perfectly fine! We simply call the first layer the hidden layer and the final layer the output layer.

A minor drawback: training such a network requires more data. The hidden neurons need time to specialize, and the rewards and penalties are too indirect to yield quick results.

Deep Neural Networks

What if we added another hidden layer? That is, instead of passing their decisions directly to the output neuron, the first-layer neurons would pass them to a second layer of hidden neurons. Each neuron from the first layer would connect to all neurons in the second layer, and only then would the second-layer neurons communicate with the final output neuron.
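
In code, the extra layer changes almost nothing: the forward pass simply becomes a loop. Reusing the layer function from the earlier sketch:

```python
def forward(inputs, all_layers):
    """Pass the activations through any number of layers, each given as a
    list of weight rows (one row per neuron)."""
    activations = inputs
    for weight_rows in all_layers:
        activations = layer(activations, weight_rows)
    return activations
```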

Again, imagine that each layer contains far more neurons than drawn. Also, the number of neurons in each layer doesn't necessarily have to be the same.

Is this extra complexity necessary? That depends on the difficulty of the problem we're trying to solve with our neural network. It also depends on the amount of data we have, since training such a network is even more demanding. For disease prediction? Probably not.

Some researchers asked: what if we added even more hidden layers? Not just one more, but a dozen or more.

This, of course, didn’t work. The main reason was simple: they just didn’t have enough data to train such a network.

The past tense here is intentional. Now it works — because we do have enough data, and there have also been theoretical and technological advances that made it possible.

We hinted that each neuron receives inputs from all neurons in the previous layer — just like all first-layer neurons receive all symptoms. This would mean an enormous number of weights, and the more weights there are, the more data we need for training. However, it turned out that this isn’t necessary.

For example, if we want to build a neural network for image recognition, we structure the connections to focus on local information. The inputs — our »symptoms« — are the colors of individual pixels in an image. Instead of connecting every neuron to every pixel, it's more logical to connect neurons to nearby pixels. This way, neurons in the first layers can detect small features like edges, those in the next layers can combine edges into lines, and deeper layers can merge lines into more complex shapes. At even higher levels, these shapes come together into recognizable structures—perhaps an oval containing two symmetrical circles (eyes?), a triangular shape below (a nose?), and a horizontal line (a mouth?).

Layers that process local regions of an image in this way are called convolutional layers. Different problems require different network architectures. In networks that process sequences, such as text, it makes sense for neurons to take earlier words into account: recurrent layers do this by »looping back« over time, feeding the network's state at one step into the next. In text processing, it's often crucial to remember words that appeared much earlier, which is where attention mechanisms come in: parts of the network that learn which earlier pieces of the input are relevant and draw on them when needed.
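
To illustrate just the convolution part: here's a minimal numpy sketch of sliding a small kernel over an image. The kernel below is a hand-written vertical-edge detector; in a real network, the kernel values are learned weights:

```python
import numpy as np

def convolve2d(image, kernel):
    """Each output value depends only on the small patch under the kernel,
    not on the whole image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])
image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # dark left half, bright right half
print(convolve2d(image, edge_kernel))   # strongest response along the edge
```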

Alongside different network structures, various learning methods have been developed, that is, different ways of adjusting weights. Even the logistic function isn’t the only possible activation function for neurons; for certain tasks, other mathematical functions have proven to be more effective.

But the most important factor remains: data. Deep neural networks still contain millions or even billions of adjustable weights, requiring massive amounts of training data. We don’t have a billion examples of flu patients, but then again, we don’t need a deep neural network to predict the flu. Deep networks are essential for object recognition in images, and large companies do have billions of images for training. The same applies to networks that process text, audio, and video—such networks are feasible today because we finally have enough data.

Chapter 3: Embeddings

Let’s stick with images for a bit longer.

Imagine a neural network trained to determine whether a given image contains a cat. The input consists of pixel colors, followed by a dozen hidden layers of neurons connected as required for image recognition. The output neurons use suitable activation functions, and we apply the appropriate weight adjustment techniques — details that we don’t need to worry about in this section. Let’s assume we end up with a network that successfully identifies cats.

Think about how this network is structured: neurons from the last hidden layer — let’s say there are 2048 of them — connect to the output neuron. This output neuron receives 2048 numbers as input, representing the final, processed »symptoms« of a cat. It computes a weighted sum, and if the result exceeds a certain threshold (e.g., greater than 0), the network classifies the image as containing a cat.

Now, suppose we want to train the network to recognize houses. Instead of starting with random weights, we reuse the ones from the cat classifier and fine-tune them for houses. This strategy makes sense — and it works in practice. Why? Recall how image recognition works: lower layers detect edges, lines, and abstract shapes. The difference between cats and houses only emerges at higher layers, while the lower layers remain largely unchanged. When switching from cats to houses, only the upper layers need to be »retrained.«
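
Here is a hedged sketch of this reuse, known as transfer learning, in PyTorch. We assume the cat classifier was built on torchvision's ResNet-50; the specific model and sizes are one possible choice, not a prescription:

```python
import torch
import torchvision

# Pretrained lower layers: the edge, line, and shape detectors we want to keep.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
for param in model.parameters():
    param.requires_grad = False          # freeze everything...

# ...and replace the top with a fresh output neuron for "house or not".
model.fc = torch.nn.Linear(2048, 1)      # only these weights will be trained
```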

After training the network to recognize houses, we could extend it further, to recognize clouds, for example.

But why not train the network for multiple categories simultaneously? We could add multiple output neurons: one for cats, one for houses, one for clouds, another for maps, another for faces... and a hundred more for other objects. The goal is for the correct output neuron to »fire« depending on the image content. During training, we adjust weights for all output neurons: if an image contains a cat but the network mistakenly activates the »house« neuron, the responsible neurons are penalized. If the »cat« neuron fires correctly, the contributing neurons are rewarded; if it doesn’t fire, the neurons that suppressed it are penalized.

This way, the network learns to recognize multiple objects at once.

Now comes the trick: once the network is trained, we remove the output neurons. We feed in a new image and, instead of getting a classification (cat, house, cloud, etc.), we extract the output of the last hidden layer — those 2048 numbers.

In a way, these 2048 numbers encode the image’s content. They contain the essence of whether the image depicts a cat, house, or cloud—after all, the removed output neurons were able to distinguish them. And because the network was trained on such a variety of objects, these 2048 numbers aren’t just useful for classifying things the network has seen before, but also for recognizing new objects it has never encountered.
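
A sketch of extracting such an embedding, again assuming torchvision's ResNet-50, whose last hidden layer is in fact 2048-dimensional (real image preprocessing is omitted here):

```python
import torch
import torchvision

model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.fc = torch.nn.Identity()           # remove the output neurons
model.eval()

image = torch.rand(1, 3, 224, 224)       # stand-in for a preprocessed image
with torch.no_grad():
    embedding = model(image)
print(embedding.shape)                   # torch.Size([1, 2048])
```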

We don’t know what each individual number represents, but we do know that together they describe the image’s content. If someone gives us 40 images of places of worship for different religions (one of the activities in Pumice), the images from each group will have similar numerical representations. Using standard machine learning techniques (logistic regression works well here; it’s essentially a single neuron), the computer can learn to differentiate the types of places of worship, just as it can distinguish between different professions of dwarves, types of quadrilaterals, or animal species. The only difference is that in traditional classification, we know what each feature represents (e.g., »has fur?«, »produces milk?«), whereas in images, the features remain unknown.
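
For instance, with scikit-learn, the final classifier takes a few lines; the embeddings and labels below are random placeholders for the 40 images and their religions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 2048))   # placeholder: one row per image
labels = rng.integers(0, 4, size=40)       # placeholder: four religions

classifier = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print(classifier.predict(embeddings[:5]))
```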

Additionally, similar images produce similar numerical representations — not in terms of color or composition, but in terms of content, since the network was trained to focus on content.

The number 2048 isn’t magic, but it turns out to be appropriate.

When we describe an image with 2048 numbers, we say that we’ve embedded it in a 2048-dimensional space — these 2048 numbers serve as the »coordinates« of the image in this high-dimensional space. This is called an embedding.

The idea of embeddings is not limited to images. We use them for words (each word can be represented by, say, 300 numbers encoding its meaning), for audio recordings, and more.
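
Similarity between embeddings, whether of images or words, is commonly measured with cosine similarity: vectors pointing in similar directions describe similar content. The two 300-dimensional vectors below are random stand-ins for real word embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
word_a = rng.normal(size=300)   # stand-in for one word's embedding
word_b = rng.normal(size=300)   # stand-in for another word's embedding
print(cosine_similarity(word_a, word_b))
```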