How an LLM Works and How It Is Trained

AI Module · Lab 4

Denis Dréano & Raphaël Holzer

3MOC Computer Science · Gymnase du Bugnon

AI module schedule

Week	Lecture (Tuesday)	Lab (Thursday)	Reading
Winter break	-	-	-
23	What is AI?	Comparing AIs, prompt engineering	Blog exercise
24	History of AI	Installing and analysing an LLM	Reading 2-3
25	Life cycle of an LLM	Fine-tuning an LLM	Reading 4
26	How an LLM works	Training an LLM	Reading 5
27	Recent developments and social issues	-	Exercise
Special week	-	-	-
Easter break	-	-	-
29	Exam	-	-

Lab 4: How an LLM Works and How It Is Trained

Setting up the environment
Machine learning refresher
- Training a linear model
- Training one layer of neurons
- Training two layers of neurons
From text to numbers: tokenization and embedding
From numbers to text: probability scores
Training a GRU
From neural networks to ChatGPT: the Transformer architecture
Linear algebra, GPUs and geopolitics

Setting up the environment

Downloading the lab package
VS Code
uv
Python kernel

Configuration

#| eval: false
uv sync

Registering the Jupyter kernel (once per lab)

#| eval: false
.venv/bin/python -m ipykernel install --user --name=ia-a-05 --display-name="IA A-05 (Python 3.11)"

Selecting the kernel in VS Code: Cmd+Shift+P → Python: Select Interpreter → ia-a-05

#| eval: false
quarto preview ./IA-A-05-SL.qmd --no-browser --no-watch-inputs

Part 1: Machine learning refresher

Linear model
Neural network

1.1 The Boston Housing dataset

Linear regression: the model

The Boston Housing dataset describes 506 districts of Boston (1970s), each with 13 features and a median home price in thousands of dollars.

import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import boston_housing

(x_train_full, y_train), (x_test, y_test) = boston_housing.load_data()

feature_names = [
    "CRIM",    # crime rate
    "ZN",      # % residential land
    "INDUS",   # % industrial zones
    "CHAS",    # borders the river (0/1)
    "NOX",     # nitrogen oxide concentration
    "RM",      # average number of rooms  <- most positively correlated with price
    "AGE",     # % homes built before 1940
    "DIS",     # distance to employment centres
    "RAD",     # highway accessibility
    "TAX",     # property tax rate
    "PTRATIO", # pupil/teacher ratio
    "B",       # demographic index
    "LSTAT",   # % lower-income population
]

print(f"Train : {x_train_full.shape}, {x_train_full.shape[0]} districts, {x_train_full.shape[1]} features")
print(f"Test  : {x_test.shape}")
print(f"Price : min={y_train.min():.1f} k$  max={y_train.max():.1f} k$  mean={y_train.mean():.1f} k$")
print()
print(f"  {'Feature':<10}  District 0   District 1")
print("-" * 35)
for name, v0, v1 in zip(feature_names, x_train_full[0], x_train_full[1]):
    print(f"  {name:<10}  {v0:>10.3f}   {v1:>10.3f}")

Train : (404, 13), 404 districts, 13 features
Test  : (102, 13)
Price : min=5.0 k$  max=50.0 k$  mean=22.4 k$

  Feature     District 0   District 1
-----------------------------------
  CRIM             1.232        0.022
  ZN               0.000       82.500
  INDUS            8.140        2.030
  CHAS             0.000        0.000
  NOX              0.538        0.415
  RM               6.142        7.610
  AGE             91.700       15.700
  DIS              3.977        6.270
  RAD              4.000        2.000
  TAX            307.000      348.000
  PTRATIO         21.000       14.700
  B              396.900      395.380
  LSTAT           18.720        3.110

The linear regression model

Here is an example of a linear regression model. Identify in the figure and in the code:

the input $X$
the output $Y$
the model $f_\theta$
the model parameters $\theta$
the data $(X_i, Y_i)$
the cost function

flowchart TD
  X["X"] --> F["f_θ(X)"]
  F --> Y["Y"]
  T["θ"] --> F

  style X fill:#F5F5F5, color:#111111, stroke:#111111
  style F fill:#111111, color:#FAFAFA, stroke:#111111
  style T fill:#EBEBEB, color:#666666, stroke:#AAAAAA
  style Y fill:#F5F5F5, color:#111111, stroke:#111111


Linear regression

import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ars-typographica')

rng = np.random.default_rng(42)
x = np.linspace(0, 36, 7)
y = 2.5 * x + 40 + rng.normal(0, 5, 7)

a, b = np.polyfit(x, y, 1)
x_line = np.linspace(0, 36, 200)

fig, ax = plt.subplots()
ax.scatter(x, y, color='#111111', s=20, zorder=3)
ax.plot(x_line, a * x_line + b, color='#111111', linewidth=1.8)

ax.set_xlabel('Age (months)')
ax.set_ylabel('Height (cm)')

plt.tight_layout()
plt.show()

Height of a child measured between 0 and 36 months

Gradient descent with Keras (standardised data)

import numpy as np
import keras

rng = np.random.default_rng(42)
x = np.linspace(0, 36, 7)
y = 2.5 * x + 40 + rng.normal(0, 5, 7)

# Standardisation (zero mean, unit variance)
x_mean, x_std = x.mean(), x.std()
y_mean, y_std = y.mean(), y.std()
x_n = (x - x_mean) / x_std
y_n = (y - y_mean) / y_std

model = keras.Sequential([keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss='mse')

print(f"{'Epoch':>5}  {'Error (MSE)':>12}  {'a (est.)':>10}  {'b (est.)':>10}")
print("-" * 46)

for epoch in range(0, 200, 20):
    history = model.fit(x_n, y_n, epochs=20, verbose=0)
    w_n, b_n = model.layers[0].get_weights()
    loss = history.history['loss'][-1]

    # Undo the standardisation: a = w_n·(y_std/x_std), b = b_n·y_std + y_mean - a·x_mean
    a_orig = w_n[0][0] * y_std / x_std
    b_orig = b_n[0] * y_std + y_mean - a_orig * x_mean

    print(f"{epoch+20:>5}  {loss:>12.4f}  {a_orig:>10.4f}  {b_orig:>10.4f}")

print("\nTrue values:  a = 2.5,  b = 40")

Epoch   Error (MSE)    a (est.)    b (est.)
----------------------------------------------

   20        0.3481      1.0470     64.6047
   40        0.1715      1.4928     56.5801
   60        0.0927      1.7904     51.2228
   80        0.0576      1.9891     47.6462
  100        0.0420      2.1218     45.2585
  120        0.0350      2.2103     43.6644
  140        0.0319      2.2695     42.6002
  160        0.0305      2.3089     41.8897
  180        0.0299      2.3353     41.4154
  200        0.0296      2.3529     41.0987

True values:  a = 2.5,  b = 40
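
To make the update rule behind the Keras run above explicit, here is a minimal NumPy sketch of the same gradient descent. It assumes full-batch updates (with Keras's default batch size of 32 and only 7 points, each epoch should amount to one gradient step) and a starting point of w = b = 0, which is a choice made here, whereas Keras draws the initial weight at random. It runs 2,000 steps instead of 200 so that the result can be compared with the closed-form fit from np.polyfit.

```python
import numpy as np

# Same synthetic data as in the cells above.
rng = np.random.default_rng(42)
x = np.linspace(0, 36, 7)
y = 2.5 * x + 40 + rng.normal(0, 5, 7)

# Standardise both variables, as in the Keras version.
x_n = (x - x.mean()) / x.std()
y_n = (y - y.mean()) / y.std()

w, b = 0.0, 0.0   # assumed starting point (not what Keras uses)
eta = 0.01        # learning rate, same value as in the Keras cell

for step in range(2000):
    y_hat = w * x_n + b                 # forward pass
    error = y_hat - y_n
    grad_w = 2 * np.mean(error * x_n)   # d(MSE)/dw
    grad_b = 2 * np.mean(error)         # d(MSE)/db
    w -= eta * grad_w                   # theta <- theta - eta * gradient
    b -= eta * grad_b

# Convert the parameters back to the original units (months, cm).
a_orig = w * y.std() / x.std()
b_orig = b * y.std() + y.mean() - a_orig * x.mean()

# Closed-form least-squares fit, for comparison.
a_ref, b_ref = np.polyfit(x, y, 1)

print(f"gradient descent: a = {a_orig:.4f}, b = {b_orig:.4f}")
print(f"np.polyfit      : a = {a_ref:.4f}, b = {b_ref:.4f}")
```

The two lines of output should agree closely. Neither needs to equal a = 2.5, b = 40 exactly, since the 7 points contain noise.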

Linear regression as a neuron

flowchart LR
  X["x"] --> W["× w"]
  W --> S["Σ"]
  B["b"] --> S
  S --> Y["ŷ = w·x + b"]

  style X fill:#F5F5F5, color:#111111, stroke:#111111
  style W fill:#EBEBEB, color:#666666, stroke:#AAAAAA
  style B fill:#EBEBEB, color:#666666, stroke:#AAAAAA
  style S fill:#111111, color:#FAFAFA, stroke:#111111
  style Y fill:#F5F5F5, color:#111111, stroke:#111111

A single neuron with no activation function:

Input: $x$ (age)
Weight: $w$ (slope $a$)
Bias: $b$ (intercept)
Output: $\hat{y} = w \cdot x + b$ (estimated height)
Parameters: $\theta = (w, b)$

Evolution of the regression lines

import numpy as np
import matplotlib.pyplot as plt
import keras

plt.style.use('ars-typographica')

rng = np.random.default_rng(42)
x = np.linspace(0, 36, 7)
y = 2.5 * x + 40 + rng.normal(0, 5, 7)

x_mean, x_std = x.mean(), x.std()
y_mean, y_std = y.mean(), y.std()
x_n = (x - x_mean) / x_std
y_n = (y - y_mean) / y_std

model = keras.Sequential([keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss='mse')

x_line = np.linspace(0, 36, 200)
x_line_n = (x_line - x_mean) / x_std

fig, ax = plt.subplots()
ax.scatter(x, y, color='#111111', s=20, zorder=5)

epochs_display = [1, 5, 20, 50, 200]
colors = ['#DDDDDD', '#BBBBBB', '#999999', '#555555', '#111111']

prev = 0
for n_ep, color in zip(epochs_display, colors):
    model.fit(x_n, y_n, epochs=n_ep - prev, verbose=0)
    prev = n_ep
    w_n, b_n = model.layers[0].get_weights()
    a = w_n[0][0] * y_std / x_std
    b = b_n[0] * y_std + y_mean - a * x_mean
    ax.plot(x_line, a * x_line + b, color=color, linewidth=1.5,
            label=f'epoch {n_ep}')

ax.set_xlabel('Age (months)')
ax.set_ylabel('Height (cm)')
ax.legend(fontsize=8)
plt.tight_layout()
plt.show()

Decrease of the MSE

import numpy as np
import matplotlib.pyplot as plt
import keras

plt.style.use('ars-typographica')

rng = np.random.default_rng(42)
x = np.linspace(0, 36, 7)
y = 2.5 * x + 40 + rng.normal(0, 5, 7)

x_n = (x - x.mean()) / x.std()
y_n = (y - y.mean()) / y.std()

model = keras.Sequential([keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss='mse')

history = model.fit(x_n, y_n, epochs=200, verbose=0)

fig, ax = plt.subplots()
ax.plot(history.history['loss'], color='#111111', linewidth=1.5)
ax.set_xlabel('Epoch')
ax.set_ylabel('MSE')
ax.set_title('Decrease of the cost function')
plt.tight_layout()
plt.show()

Neural networks

A neuron
A layer of neurons
Deep learning

Single neuron

flowchart LR
  subgraph X["x"]
    x1["x₁"]
    x2["x₂"]
    x3["x₃"]
    x4["x₄"]
  end

  x1 --> w1["× w₁"]
  x2 --> w2["× w₂"]
  x3 --> w3["× w₃"]
  x4 --> w4["× w₄"]

  w1 --> S["Σ"]
  w2 --> S
  w3 --> S
  w4 --> S

  S --> F["h"]
  F --> Y["y"]

  style x1 fill:#F5F5F5, color:#111111, stroke:#111111
  style x2 fill:#F5F5F5, color:#111111, stroke:#111111
  style x3 fill:#F5F5F5, color:#111111, stroke:#111111
  style x4 fill:#F5F5F5, color:#111111, stroke:#111111
  style w1 fill:#EBEBEB, color:#666666, stroke:#AAAAAA
  style w2 fill:#EBEBEB, color:#666666, stroke:#AAAAAA
  style w3 fill:#EBEBEB, color:#666666, stroke:#AAAAAA
  style w4 fill:#EBEBEB, color:#666666, stroke:#AAAAAA
  style S  fill:#111111, color:#FAFAFA, stroke:#111111
  style F  fill:#111111, color:#FAFAFA, stroke:#111111
  style Y  fill:#F5F5F5, color:#111111, stroke:#111111
  style X  fill:#FAFAFA, color:#111111, stroke:#CCCCCC

A model is a parameterised function $f_\theta$:

\[Y = f_\theta(X)\]

Here:

$X = (x_1, x_2, x_3, x_4)$
$Y = h(w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + w_4 \cdot x_4)$
$\theta = (w_1, w_2, w_3, w_4)$ (the weights)
$h$ is called the activation function

In other words:

$f_\theta(X) = h(\theta \cdot X)$ (dot product)

Or:

$f_\theta(X) = h(\theta \cdot X^T)$ (matrix product)
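
The single neuron above can be sketched in a few lines of NumPy. The weights and the choice of a sigmoid for $h$ are hypothetical values picked for illustration; the slides do not fix them.

```python
import numpy as np

def sigmoid(z):
    """One possible activation function h (an assumption, not fixed by the slides)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, theta, h=sigmoid):
    """f_theta(x) = h(theta . x): weighted sum followed by the activation."""
    return h(np.dot(theta, x))

x = np.array([1.0, 2.0, 3.0, 4.0])           # input (x1, x2, x3, x4)
theta = np.array([0.5, -0.25, 0.0, 0.125])   # hypothetical weights (w1..w4)

z = np.dot(theta, x)   # weighted sum: 0.5 - 0.5 + 0.0 + 0.5
y = neuron(x, theta)   # activation applied to the weighted sum

print(f"weighted sum z = {z}, output y = {y:.4f}")
```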

Layer of neurons

flowchart LR
  subgraph X["x"]
    x1["x₁"]
    x2["x₂"]
    x3["x₃"]
    x4["x₄"]
  end

  subgraph C["dense layer"]
    n1["f"]
    n2["f"]
    n3["f"]
  end

  subgraph Y["ŷ"]
    y1["ŷ₁"]
    y2["ŷ₂"]
    y3["ŷ₃"]
  end

  x1 --> n1 & n2 & n3
  x2 --> n1 & n2 & n3
  x3 --> n1 & n2 & n3
  x4 --> n1 & n2 & n3

  n1 --> y1
  n2 --> y2
  n3 --> y3

  style x1 fill:#F5F5F5, color:#111111, stroke:#111111
  style x2 fill:#F5F5F5, color:#111111, stroke:#111111
  style x3 fill:#F5F5F5, color:#111111, stroke:#111111
  style x4 fill:#F5F5F5, color:#111111, stroke:#111111
  style n1 fill:#111111, color:#FAFAFA, stroke:#111111
  style n2 fill:#111111, color:#FAFAFA, stroke:#111111
  style n3 fill:#111111, color:#FAFAFA, stroke:#111111
  style y1 fill:#F5F5F5, color:#111111, stroke:#111111
  style y2 fill:#F5F5F5, color:#111111, stroke:#111111
  style y3 fill:#F5F5F5, color:#111111, stroke:#111111
  style X  fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style C  fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style Y  fill:#FAFAFA, color:#111111, stroke:#CCCCCC

A model is a parameterised function

\[Y = f_\theta(X)\]

$X = (x_1, x_2, x_3, x_4)$
$Y = (y_1, y_2, y_3)$
$\theta = \begin{bmatrix} w_{1,1} & \cdots & w_{1,4} \\ \vdots & \ddots & \vdots \\ w_{3,1} & \cdots & w_{3,4} \end{bmatrix}$
$f_\theta(X) = h(\theta \cdot X^T)$

This is why linear algebra becomes important!
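
As a small illustration of the dense layer above, the sketch below computes $h(\theta \cdot X^T)$ with a 3×4 weight matrix. The weight values and the use of ReLU as $h$ are assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np

def relu(z):
    """Activation h; ReLU is an assumed choice for this example."""
    return np.maximum(z, 0.0)

# Hypothetical 3x4 weight matrix: one row per neuron, one column per input.
theta = np.array([
    [1.0, 0.0, -1.0,  0.0],
    [0.0, 1.0,  0.0, -1.0],
    [0.5, 0.5,  0.5,  0.5],
])
x = np.array([1.0, 2.0, 3.0, 4.0])

z = theta @ x   # three weighted sums computed in one matrix-vector product
y = relu(z)     # activation applied element-wise

# Each output is the same as evaluating one neuron with one row of theta.
y_by_neuron = np.array([relu(np.dot(row, x)) for row in theta])

print("z =", z)
print("y =", y)
```

The matrix form and the neuron-by-neuron form give the same three outputs, which is the reason a whole layer can be written as a single matrix product.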

Deep learning

graph LR
  subgraph X["x"]
    x1["x₁"]
    x2["x₂"]
    x3["x₃"]
    x4["x₄"]
  end

  subgraph C1["layer 1"]
    n11["f"]
    n12["f"]
    n13["f"]
    n14["f"]
    n15["f"]
  end

  subgraph C2["layer 2"]
    n21["f"]
    n22["f"]
    n23["f"]
  end

  subgraph C3["layer 3"]
    n31["f"]
    n32["f"]
    n33["f"]
    n34["f"]
  end

  subgraph Y["ŷ"]
    y1["ŷ₁"]
    y2["ŷ₂"]
    y3["ŷ₃"]
    y4["ŷ₄"]
  end

  x1 --> n11 & n12 & n13 & n14 & n15
  x2 --> n11 & n12 & n13 & n14 & n15
  x3 --> n11 & n12 & n13 & n14 & n15
  x4 --> n11 & n12 & n13 & n14 & n15

  n11 --> n21 & n22 & n23
  n12 --> n21 & n22 & n23
  n13 --> n21 & n22 & n23
  n14 --> n21 & n22 & n23
  n15 --> n21 & n22 & n23

  n21 --> n31 & n32 & n33 & n34
  n22 --> n31 & n32 & n33 & n34
  n23 --> n31 & n32 & n33 & n34

  n31 --> y1
  n32 --> y2
  n33 --> y3
  n34 --> y4

  style x1 fill:#F5F5F5, color:#111111, stroke:#111111
  style x2 fill:#F5F5F5, color:#111111, stroke:#111111
  style x3 fill:#F5F5F5, color:#111111, stroke:#111111
  style x4 fill:#F5F5F5, color:#111111, stroke:#111111
  style n11 fill:#111111, color:#FAFAFA, stroke:#111111
  style n12 fill:#111111, color:#FAFAFA, stroke:#111111
  style n13 fill:#111111, color:#FAFAFA, stroke:#111111
  style n14 fill:#111111, color:#FAFAFA, stroke:#111111
  style n15 fill:#111111, color:#FAFAFA, stroke:#111111
  style n21 fill:#111111, color:#FAFAFA, stroke:#111111
  style n22 fill:#111111, color:#FAFAFA, stroke:#111111
  style n23 fill:#111111, color:#FAFAFA, stroke:#111111
  style n31 fill:#111111, color:#FAFAFA, stroke:#111111
  style n32 fill:#111111, color:#FAFAFA, stroke:#111111
  style n33 fill:#111111, color:#FAFAFA, stroke:#111111
  style n34 fill:#111111, color:#FAFAFA, stroke:#111111
  style y1 fill:#F5F5F5, color:#111111, stroke:#111111
  style y2 fill:#F5F5F5, color:#111111, stroke:#111111
  style y3 fill:#F5F5F5, color:#111111, stroke:#111111
  style y4 fill:#F5F5F5, color:#111111, stroke:#111111
  style X  fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style C1 fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style C2 fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style C3 fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style Y  fill:#FAFAFA, color:#111111, stroke:#CCCCCC

  linkStyle default interpolate basis

Dense neural network with 3 layers (5, 3 and 4 neurons).

A model is a parameterised function $f_\theta$:

\[Y = f_\theta(X)\]

Model	Layers	Hidden dimension
GPT-2 small	12	768
GPT-2 large	36	1,280
GPT-3	96	12,288
LLaMA 3 8B	32	4,096
LLaMA 3 70B	80	8,192
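
The 4 → 5 → 3 → 4 network from the figure can be sketched as a loop over weight matrices. This is only an illustration: the weights are random, biases are left out (as in the neuron formulas earlier), and tanh is an assumed activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the figure: 4 inputs, then layers of 5, 3 and 4 neurons.
sizes = [4, 5, 3, 4]

# One weight matrix per layer, shape (n_out, n_in); random values for illustration.
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, weights, h=np.tanh):
    """Apply the layers one after another: a <- h(W @ a)."""
    a = x
    for W in weights:
        a = h(W @ a)
    return a

x = np.array([1.0, 2.0, 3.0, 4.0])
y = forward(x, weights)

# Number of weights: 4*5 + 5*3 + 3*4.
n_params = sum(W.size for W in weights)

print("output shape:", y.shape)
print("number of weights:", n_params)
```

The models in the table follow the same principle, only with far more layers, much wider layers and additional components such as attention.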

Neural networks: backpropagation

graph LR
  subgraph X["x"]
    x1["x₁"]
    x2["x₂"]
    x3["x₃"]
    x4["x₄"]
  end

  subgraph C1["layer 1"]
    n11["f"]
    n12["f"]
    n13["f"]
    n14["f"]
    n15["f"]
  end

  subgraph C2["layer 2"]
    n21["f"]
    n22["f"]
    n23["f"]
  end

  subgraph C3["layer 3"]
    n31["f"]
    n32["f"]
    n33["f"]
    n34["f"]
  end

  subgraph Y["ŷ"]
    y1["ŷ₁"]
    y2["ŷ₂"]
    y3["ŷ₃"]
    y4["ŷ₄"]
  end

  x1 --> n11 & n12 & n13 & n14 & n15
  x2 --> n11 & n12 & n13 & n14 & n15
  x3 --> n11 & n12 & n13 & n14 & n15
  x4 --> n11 & n12 & n13 & n14 & n15

  n11 --> n21 & n22 & n23
  n12 --> n21 & n22 & n23
  n13 --> n21 & n22 & n23
  n14 --> n21 & n22 & n23
  n15 --> n21 & n22 & n23

  n21 --> n31 & n32 & n33 & n34
  n22 --> n31 & n32 & n33 & n34
  n23 --> n31 & n32 & n33 & n34

  n31 --> y1
  n32 --> y2
  n33 --> y3
  n34 --> y4

  style x1 fill:#F5F5F5, color:#111111, stroke:#111111
  style x2 fill:#F5F5F5, color:#111111, stroke:#111111
  style x3 fill:#F5F5F5, color:#111111, stroke:#111111
  style x4 fill:#F5F5F5, color:#111111, stroke:#111111
  style n11 fill:#111111, color:#FAFAFA, stroke:#111111
  style n12 fill:#111111, color:#FAFAFA, stroke:#111111
  style n13 fill:#111111, color:#FAFAFA, stroke:#111111
  style n14 fill:#111111, color:#FAFAFA, stroke:#111111
  style n15 fill:#111111, color:#FAFAFA, stroke:#111111
  style n21 fill:#111111, color:#FAFAFA, stroke:#111111
  style n22 fill:#111111, color:#FAFAFA, stroke:#111111
  style n23 fill:#111111, color:#FAFAFA, stroke:#111111
  style n31 fill:#111111, color:#FAFAFA, stroke:#111111
  style n32 fill:#111111, color:#FAFAFA, stroke:#111111
  style n33 fill:#111111, color:#FAFAFA, stroke:#111111
  style n34 fill:#111111, color:#FAFAFA, stroke:#111111
  style y1 fill:#F5F5F5, color:#111111, stroke:#111111
  style y2 fill:#F5F5F5, color:#111111, stroke:#111111
  style y3 fill:#F5F5F5, color:#111111, stroke:#111111
  style y4 fill:#F5F5F5, color:#111111, stroke:#111111
  style X  fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style C1 fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style C2 fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style C3 fill:#FAFAFA, color:#111111, stroke:#CCCCCC
  style Y  fill:#FAFAFA, color:#111111, stroke:#CCCCCC

  linkStyle default interpolate basis

The inference function (forward pass) becomes a composition of functions, one per layer:

\[ \begin{aligned} y &= f_\theta(x) \\ &= L^3_{\theta_3} \circ L^2_{\theta_2} \circ L^1_{\theta_1}(x) \end{aligned} \]

Differentiation follows the chain rule. For example, for the parameters $\theta_1$ of the first layer:

\[ \frac{\partial \mathcal{L}}{\partial \theta_1} = \frac{\partial \mathcal{L}}{\partial y} \cdot \frac{\partial L^3}{\partial L^2} \cdot \frac{\partial L^2}{\partial L^1} \cdot \frac{\partial L^1}{\partial \theta_1} \]

In practice, optimisation by gradient descent becomes a series of matrix computations.
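
To show the chain rule turning into matrix operations, here is a hedged sketch of backpropagation by hand. It uses a smaller network than the figure (two layers instead of three), a tanh activation and a squared-error loss, all of which are assumptions made for this example. The hand-written gradient is compared with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])        # input
t = 1.0                          # target value (hypothetical)
W1 = rng.normal(size=(3, 2))     # layer 1 weights
W2 = rng.normal(size=(1, 3))     # layer 2 weights

def loss_fn(W1, W2):
    """Forward pass followed by the squared-error loss."""
    a1 = np.tanh(W1 @ x)
    y = (W2 @ a1)[0]
    return (y - t) ** 2

# Forward pass, keeping the intermediate values.
z1 = W1 @ x
a1 = np.tanh(z1)
y = (W2 @ a1)[0]

# Backward pass: the chain rule, applied from the output back to the input.
dL_dy = 2 * (y - t)                    # dL/dy
dL_dW2 = dL_dy * a1[np.newaxis, :]     # dL/dW2, shape (1, 3)
dL_da1 = dL_dy * W2[0]                 # dL/da1, shape (3,)
dL_dz1 = dL_da1 * (1 - a1 ** 2)        # tanh'(z) = 1 - tanh(z)^2
dL_dW1 = np.outer(dL_dz1, x)           # dL/dW1, shape (3, 2)

# Numerical check of dL/dW1 with central finite differences.
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss_fn(W_plus, W2) - loss_fn(W_minus, W2)) / (2 * eps)

max_diff = np.max(np.abs(numeric - dL_dW1))
print("largest difference between the two gradients:", max_diff)
```

The difference should be tiny, which suggests that the hand-derived formulas match the true derivative. Frameworks automate exactly this bookkeeping.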

Gradient descent

Problem: minimise $\mathcal{L}(\theta)$, which is often non-convex and, for an LLM, has on the order of $10^9$ parameters or more.

Algorithm:

\[\theta \leftarrow \theta - \eta \cdot \nabla_\theta \mathcal{L}\]

$\eta$: learning rate
$\nabla_\theta \mathcal{L}$: gradient computed by backpropagation

Backpropagation = derivative of a composition (chain rule), where $z$ is the neuron's weighted sum before the activation and $\hat{y} = h(z)$:

\[\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}\]

In practice, PyTorch computes all of this automatically (loss.backward()).
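
As a worked instance of this chain rule, assume a single neuron with a sigmoid activation and a squared-error loss (both choices are made here for illustration and are not fixed by the slides):

\[ z = w x + b, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \mathcal{L} = (\hat{y} - y)^2 \]

The three factors are then:

\[ \frac{\partial \mathcal{L}}{\partial \hat{y}} = 2(\hat{y} - y), \qquad \frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y}), \qquad \frac{\partial z}{\partial w} = x \]

Multiplying them gives the gradient used in the update of $w$:

\[ \frac{\partial \mathcal{L}}{\partial w} = 2(\hat{y} - y) \cdot \hat{y}(1 - \hat{y}) \cdot x \]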

From text to vectors, from vectors to text

Tokenization
Embedding
Logit

Predicting the next token

Objective: maximise the probability of the next token

\[P(\text{token}_{t+1} \mid \text{token}_1, \ldots, \text{token}_t)\]

Example:

“Bonjour mon …” → coeur (42%) · choeur (28%) · trésor (15%) · … · choeur (0.0001%) · …

\[P(\text{"coeur"} \mid \text{"Bonjour"}, \text{"mon"})\]
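
One common way to obtain such probabilities is to apply a softmax to the raw scores (logits) produced by the network. The sketch below assumes a tiny four-token vocabulary and made-up logits; neither comes from a real model.

```python
import numpy as np

def softmax(logits):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtracting the max avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical candidate tokens and logits for the prompt "Bonjour mon ...".
tokens = ["coeur", "trésor", "ami", "chat"]
logits = np.array([2.0, 1.0, 0.5, -1.0])

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token:<8} {p:.1%}")

# Greedy decoding: pick the most probable token.
best = tokens[int(np.argmax(probs))]
print("most probable next token:", best)
```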

Fundamental problem:

Words are not numbers, and gradient-based optimisation only works well in a continuous space. → Tokens therefore need a vector representation.

Tokenization and embedding

Full pipeline:

\[\text{Words} \xrightarrow{\text{tokenization}} \text{IDs} \xrightarrow{\text{embedding}} \text{vectors} \xrightarrow{\text{attention}} \text{context} \xrightarrow{\text{softmax}} P(\text{next token})\]

Tokenization

“Fonctionnement” → ['Fonct', 'ion', 'nement'] → [12043, 287, 3890]

BPE (Byte-Pair Encoding) algorithm, with a vocabulary of about 50,000 tokens (50,257 for GPT-2).
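
The core idea of BPE training can be sketched as follows: start from single characters and repeatedly merge the most frequent adjacent pair. This is a simplified toy version under stated assumptions: it works on characters rather than bytes, ignores word-boundary markers, and uses a made-up three-word corpus with hypothetical frequencies.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: word -> number of occurrences.
corpus = {"fonction": 5, "fonctionnement": 3, "action": 4}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):                 # learn 4 merge rules
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print("learned merges:", merges)
for symbols in words:
    print(list(symbols))
```

Frequent fragments end up as single tokens while rare words stay split into several pieces, which is the behaviour illustrated by the "Fonctionnement" example.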

Embedding

Each token ID → a vector of dimension $d$ (e.g. 768, 4096…)

Distance between vectors ≈ semantic similarity.

king − man + woman ≈ queen
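
The analogy above can be reproduced with a toy embedding table. The 2-dimensional vectors below are invented for illustration (one axis loosely meaning "royalty", the other "gender"). Real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import numpy as np

# Hypothetical vocabulary and embedding matrix: one row per token ID.
vocab = ["king", "queen", "man", "woman"]
E = np.array([
    [1.0,  1.0],   # king
    [1.0, -1.0],   # queen
    [0.0,  1.0],   # man
    [0.0, -1.0],   # woman
])

def embed(word):
    """Token -> ID -> row of the embedding matrix."""
    token_id = vocab.index(word)
    return E[token_id]

def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = embed("king") - embed("man") + embed("woman")
nearest = max(vocab, key=lambda word: cosine(query, embed(word)))

print("king - man + woman is closest to:", nearest)
```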

Embedding

Figure: vector representation space of tokens.

Transformer architecture

The dominant architecture since 2017 (Attention Is All You Need, Vaswani et al.)

Key mechanism: self-attention

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]

Each token can “look at” all the other tokens in the context.
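
The attention formula above can be written almost literally in NumPy. In this sketch Q, K and V are random matrices for 3 tokens; in a real Transformer they would be computed from the token embeddings by learned projections, which are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # one score per (query, key) pair
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 3, 4, 2             # illustrative sizes
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_v))

out, weights = attention(Q, K, V)
print("attention weights (row i = how much token i looks at each token):")
print(weights.round(3))
print("output shape:", out.shape)
```

GPT-style models additionally apply a causal mask so that a token only attends to earlier positions; this sketch leaves that out.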

Tools used in practice:

torch: tensor computation and autograd
transformers: pre-trained models
jax: a high-performance alternative (Google)

GPUs and geopolitics

The gold bar of AI

Second-hand GPU: $50,000

Why GPUs?

CPU: general-purpose

A few powerful cores (8–32)
Optimised for sequential tasks
Large cache memory
Limited parallelism

Poorly suited to large-scale matrix multiplication.

GPU: specialised

Thousands of small cores (H100: 16,896 CUDA cores)
Optimised for parallel computation
Very fast HBM memory (80 GB, 3.35 TB/s)
Designed for video games, later adopted for deep learning
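
A rough back-of-the-envelope calculation shows why this parallelism matters. The sketch below counts the floating-point operations for one matrix-vector product at GPT-3's hidden dimension (12,288) and divides by the H100's quoted FP32 peak of 67 TFLOPS. Both figures come from elsewhere in these slides. The result is an idealised lower bound that ignores memory transfers, so it should be read as an order of magnitude only.

```python
# One d x d weight matrix applied to one token vector of size d.
d = 12_288                       # hidden dimension of GPT-3
flops_matvec = 2 * d * d         # one multiply and one add per weight

peak_flops_per_s = 67e12         # H100 SXM FP32 peak (67 TFLOPS), idealised
ideal_seconds = flops_matvec / peak_flops_per_s

print(f"operations for one matrix-vector product: {flops_matvec:,}")
print(f"idealised time at peak throughput: {ideal_seconds * 1e6:.1f} microseconds")

# The d output values are d independent dot products, so they can be
# computed at the same time on different cores.
independent_dot_products = d
```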

GPU vs CPU

Hardware comparison

Specification	Apple M5 Max (2026)	NVIDIA RTX 5090 (2025)	NVIDIA H100 SXM (2022)
Main use	Pro laptop / on-device AI	High-end gaming / personal AI	Datacenter / LLM training
Architecture	Apple Silicon (3 nm, TSMC)	Blackwell (4 nm, TSMC)	Hopper (4 nm, TSMC)
GPU cores	40 Apple GPU cores	21,760 CUDA cores	16,896 CUDA cores
FP32 TFLOPS	~20 TFLOPS (estimated)	~105 TFLOPS	67 TFLOPS
FP16 TFLOPS (AI)	~30 TFLOPS (estimated)	210 TFLOPS (Tensor FP16)	1,979 TFLOPS (Tensor FP16, with sparsity)
Memory (VRAM)	128 GB unified (CPU+GPU)	32 GB GDDR7	80 GB HBM3
Memory bandwidth	~400 GB/s	1,792 GB/s	3,350 GB/s
Indicative price	~5,000–7,000 CHF (MacBook)	~2,000 USD (card only)	~25,000–40,000 USD (card only)
Power draw	~80–100 W (whole SoC)	575 W	350–700 W
Cooling	Passive + laptop fans	Dedicated heatsink and fans (desktop PC)	Liquid cooling (server)
AI use case	LLMs <30B locally, inference	LLMs <30B locally, fine-tuning	Training GPT-4-class LLMs, clusters

Compute cost

Model	Parameters	GPU-hours (training)	Estimated cost
GPT-2 (2019)	1.5 B	~300	< 50 k$
LLaMA 1 7B (2023)	7 B	82,432	~500 k$
GPT-4 (2023, estimated)	~1,800 B	> 25 M	> 100 M$
Llama 4 Scout (2025)	109 B	~5 M	> 50 M$

Inference (generating one response) is far cheaper, but it is multiplied by billions of requests.
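
The table can be read with a little arithmetic. The sketch below takes the LLaMA 1 7B row and derives the price per GPU-hour that the estimate implies. The cluster size used for the wall-clock calculation is a hypothetical value chosen for illustration, not a figure from the slides.

```python
# Figures from the LLaMA 1 7B row of the table.
gpu_hours = 82_432
estimated_cost_usd = 500_000

# Price per GPU-hour implied by the two figures above.
implied_rate = estimated_cost_usd / gpu_hours
print(f"implied price: {implied_rate:.2f} $ per GPU-hour")

# Hypothetical cluster size: how long would the run take on 1,000 GPUs,
# assuming perfect parallel scaling (an optimistic simplification)?
n_gpus = 1_000
wall_clock_days = gpu_hours / n_gpus / 24
print(f"about {wall_clock_days:.1f} days on {n_gpus:,} GPUs")
```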

GPU farms and investments

Player	Infrastructure	Announced investment
Microsoft / OpenAI	Azure (>400 k H100 GPUs)	80 B$ (2025)
Google DeepMind	TPU v5 (proprietary)	75 B$ (2025)
Meta	350 k H100 GPUs	65 B$ (2025)
xAI (Elon Musk)	Colossus, Memphis	6 B$

A single H100 chip costs about 30,000 $ · delivery lead time: 6–12 months (2023–2024)

Geopolitical tensions

US export controls

Exports of NVIDIA A100/H100 chips to China restricted since 2022
GPUs have become a foreign-policy tool
Rare earths: mining dependency (Congo, China, Australia)

Consequence: a race for alternative suppliers and sovereign chips (EU Chips Act)

DeepSeek, January 2025

A Chinese model trained despite the restrictions:

DeepSeek-R1: performance ≈ GPT-4o
Trained on H800 GPUs (a cut-down version that was allowed for export)
Aggressive optimisation techniques (MoE, MLA)
Open-source: immediate worldwide impact

→ Restricting access to chips can accelerate local innovation.

Conclusion

Quiz
Reading

Quiz 5

Explain how the tensions over rare earths between China and the United States are connected to the mathematics of neural networks.
Explain how LLMs turn text into vectors, and why.
Draw a diagram of a deep learning network and show how it translates into mathematical terms.

For Thursday

Lab 4: Training an LLM
Reading 5: Artificial Intelligence Index Report 2025

Questions for Reading 5:

On pages 3, 4 and 5, the authors give the twelve main takeaways of the report. Choose 5 and illustrate each with a concrete example.