We present Transformer Explainer (SimulasiLLM), an interactive, browser-based educational tool that visualizes the internal mechanisms of GPT-2 style transformer models during text generation. The system provides real-time, step-by-step visualization of token embeddings, query-key-value (Q/K/V) projections, causal masking, softmax-normalized attention weights, and autoregressive sampling with adjustable temperature and top-k parameters. Built with plain HTML, CSS, and JavaScript and no external dependencies, the tool enables learners, educators, and researchers to develop an intuitive understanding of how large language models process and generate text. By rendering each computational stage of the transformer architecture as an interactive, manipulable display, Transformer Explainer bridges the gap between abstract mathematical formulations and concrete, observable model behavior. The tool simulates a GPT-2 Small architecture and is freely accessible as an open-source web application.
Large language models (LLMs) based on the transformer architecture have become foundational to modern natural language processing, powering systems that generate text, translate languages, answer questions, and write code. Despite their widespread adoption and profound impact, the internal mechanisms by which transformers process sequences and generate output remain opaque to most practitioners, students, and even many researchers. The mathematical formulations describing self-attention, positional encoding, and autoregressive decoding, while precise, do not readily convey an intuitive understanding of the computational dynamics at work.
This opacity presents a significant educational challenge. As transformer-based models become integral to an increasing number of applications, the need for accessible, interactive tools that demystify their internal workings grows correspondingly. Static diagrams in textbooks and papers, while valuable, cannot capture the dynamic, data-dependent nature of attention weight computation. Video tutorials offer temporal progression but lack interactivity. Existing code implementations, though functional, require significant programming expertise to interpret and modify.
Transformer Explainer (SimulasiLLM) addresses this gap by providing a fully interactive, browser-based visualization tool that renders each stage of the transformer inference pipeline as a manipulable, observable display. The key contributions of this work are: (1) a step-by-step, real-time visualization of the full GPT-2-style inference pipeline, from token embedding through autoregressive sampling; (2) interactive controls for the temperature and top-k sampling parameters with immediate visual feedback on the output distribution; and (3) a zero-dependency, single-file implementation in plain HTML, CSS, and JavaScript whose source code itself serves as a learning resource.
The transformer architecture, introduced by Vaswani et al. (2017), fundamentally changed sequence modeling by replacing recurrent computation with a purely attention-based mechanism. This section provides a concise overview of the architectural components that Transformer Explainer visualizes, establishing the conceptual framework for the tool's design.
Transformers process discrete tokens by mapping each to a continuous vector representation through an embedding layer. Given a vocabulary V and an embedding dimension d, each input token t_i is mapped to a dense vector e_i ∈ ℝ^d. Since the self-attention mechanism is inherently permutation-invariant, positional information must be injected through positional encodings, which are added to the token embeddings to produce the final input representations.
In the GPT-2 architecture (Radford et al., 2019), learned positional embeddings are used rather than the sinusoidal encodings of the original transformer. The combined token and positional embeddings form the input to the first transformer block.
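The input representation described above can be sketched in a few lines of vanilla JavaScript. This is an illustrative toy, not the tool's actual code: the dimensions are tiny and the embedding tables are random stand-ins for GPT-2's trained weights.

```javascript
// Input representation sketch: token embedding plus learned positional embedding.
// Toy dimensions and random tables stand in for GPT-2's trained weights.
const d = 4; // toy embedding dimension (GPT-2 Small uses 768)
const vocab = { "the": 0, "cat": 1, "sat": 2 };
const rand = () => Math.random() - 0.5;
const table = (rows) =>
  Array.from({ length: rows }, () => Array.from({ length: d }, rand));

const tokenEmb = table(Object.keys(vocab).length); // one row per vocabulary entry
const posEmb = table(8);                           // one row per position

// x_i = tokenEmb[t_i] + posEmb[i]
function embed(tokens) {
  return tokens.map((tok, i) =>
    tokenEmb[vocab[tok]].map((v, j) => v + posEmb[i][j]));
}

const X = embed(["the", "cat", "sat"]); // 3 rows of length d
```

The element-wise sum of the token row and the position row gives each token a representation that depends on both its identity and its location in the sequence.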
The self-attention mechanism enables each token in a sequence to attend to all other tokens, computing context-dependent representations. For each input vector, three projections are computed: a query (Q), a key (K), and a value (V), obtained by multiplying the input by learned weight matrices WQ, WK, and WV respectively.
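The three projections reduce to plain matrix products. The sketch below uses toy sizes, and WQ, WK, WV are randomly initialized placeholders for the learned weight matrices rather than trained parameters.

```javascript
// Q/K/V projections as plain matrix products (toy sizes; WQ/WK/WV are random
// stand-ins for the learned weight matrices, not trained parameters).
function matmul(A, B) {
  return A.map(row =>
    B[0].map((_, j) => row.reduce((s, v, k) => s + v * B[k][j], 0)));
}

const d = 4, dk = 2;
const rand = () => Math.random() - 0.5;
const mat = (r, c) =>
  Array.from({ length: r }, () => Array.from({ length: c }, rand));

const X = mat(3, d); // 3 token embeddings, one per row
const WQ = mat(d, dk), WK = mat(d, dk), WV = mat(d, dk);

const Q = matmul(X, WQ); // queries, one row per token
const K = matmul(X, WK); // keys
const V = matmul(X, WV); // values
```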
The attention weights are computed as the scaled dot product of queries and keys, followed by softmax normalization:

A = softmax(QKᵀ / √d_k)

where d_k is the dimension of the key vectors; the attention output is the weighted sum AV. The scaling factor √d_k prevents the dot products from growing excessively large in magnitude, which would push the softmax function into regions of extremely small gradients.
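The weight computation can be written directly: score every query row against every key row, divide by √d_k, and softmax-normalize each row. This is a sketch of the standard computation, not the tool's actual source.

```javascript
// Scaled dot-product attention weights: each query row is scored against every
// key, divided by sqrt(dk), then softmax-normalized per row.
function softmax(row) {
  const m = Math.max(...row); // subtract max for numerical stability
  const exps = row.map(v => Math.exp(v - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

function attentionWeights(Q, K) {
  const dk = K[0].length;
  return Q.map(q =>
    softmax(K.map(k =>
      q.reduce((s, v, i) => s + v * k[i], 0) / Math.sqrt(dk))));
}

const Q = [[1, 0], [0, 1]];
const K = [[1, 0], [0, 1]];
const W = attentionWeights(Q, K); // each row sums to 1
```

With these toy inputs, each query aligns with its matching key, so the diagonal weights dominate each row.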
In autoregressive language models such as GPT-2, a causal mask is applied to the attention scores before softmax normalization. This mask ensures that each token can only attend to itself and to tokens at earlier positions in the sequence, preventing information leakage from future tokens during generation. The mask sets the attention scores for future positions to negative infinity, causing them to become zero after the softmax operation.
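The masking step is a one-line transformation of the score matrix: every entry above the diagonal is replaced with -Infinity before the softmax. A minimal sketch:

```javascript
// Causal masking: scores at future positions (column j > row i) are set to
// -Infinity, so the subsequent softmax assigns them exactly zero weight.
function softmax(row) {
  const m = Math.max(...row);
  const exps = row.map(v => Math.exp(v - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

function applyCausalMask(scores) {
  return scores.map((row, i) => row.map((s, j) => (j > i ? -Infinity : s)));
}

const scores = [
  [0.5, 0.9, 0.1],
  [0.2, 0.7, 0.3],
  [0.4, 0.6, 0.8],
];
const weights = applyCausalMask(scores).map(softmax);
// weights[0] = [1, 0, 0]: the first token can only attend to itself.
```

Because Math.exp(-Infinity) evaluates to 0 in JavaScript, the masked positions drop out of the softmax numerator and denominator alike, yielding the lower-triangular weight matrix the tool displays.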
Text generation in GPT-2 proceeds autoregressively: the model generates one token at a time, appending each generated token to the input sequence and reprocessing the entire extended sequence to produce the next token. At each step, the model outputs a probability distribution over the vocabulary, from which the next token is sampled. Sampling strategies such as temperature scaling and top-k filtering control the diversity and quality of generated text.
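The generation loop described above can be sketched as follows. Here `modelStep` is a hypothetical stand-in that returns a probability distribution over a toy three-token vocabulary; in the actual tool, the full simulated pipeline is recomputed at each step.

```javascript
// Autoregressive generation loop (sketch). `modelStep` is a hypothetical
// stand-in returning a distribution over a toy 3-token vocabulary.
function modelStep(tokens) {
  // Toy distribution: favor the token after the last one, cyclically.
  const next = (tokens[tokens.length - 1] + 1) % 3;
  return [0, 1, 2].map(t => (t === next ? 0.8 : 0.1));
}

function sample(probs) {
  let r = Math.random();
  for (let t = 0; t < probs.length; t++) {
    r -= probs[t];
    if (r <= 0) return t;
  }
  return probs.length - 1;
}

function generate(prompt, steps) {
  const seq = prompt.slice();
  for (let i = 0; i < steps; i++) {
    // Append the sampled token and reprocess the extended sequence.
    seq.push(sample(modelStep(seq)));
  }
  return seq;
}

const out = generate([0], 4); // prompt of one token plus 4 generated tokens
```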
Transformer Explainer is designed as a single-page, client-side web application that simulates the inference pipeline of a GPT-2 Small model. The system architecture prioritizes interactivity, visual clarity, and pedagogical effectiveness over computational fidelity, making deliberate design choices to render the internal mechanics of the transformer visible and manipulable.
The tool decomposes the transformer inference pipeline into discrete, observable stages, each rendered as a dedicated visual section within the interface. This decomposition follows the natural computational flow of the model: token embedding and positional encoding, Q/K/V projection, causal masking, softmax-normalized attention weighting, output probability computation, and autoregressive sampling.
Each stage is presented as an interactive panel that updates in real time as the user modifies inputs or parameters. This allows learners to observe how changes propagate through the pipeline, building causal intuition about transformer behavior.
The system provides six primary interactive sections:
| Section | Content Displayed | Pedagogical Purpose |
|---|---|---|
| Embedding Display | Token embeddings as numerical vectors | Shows how discrete tokens become continuous representations |
| Attention Core | Q/K/V matrices, attention weight heatmap | Reveals how tokens relate to each other through attention |
| Causal Mask | Triangular mask matrix visualization | Demonstrates autoregressive constraint on attention |
| Probability Distribution | Softmax output over vocabulary tokens | Shows how the model selects the next token |
| Query Token Details | Per-token Q/K/V vector values | Enables deep inspection of individual token representations |
| Generated Output | Autoregressive token-by-token output | Illustrates the sequential generation process |
Two primary generation parameters are exposed to the user for real-time adjustment:
Temperature (default: 0.8): Controls the sharpness of the probability distribution over vocabulary tokens. Lower temperatures concentrate probability mass on the most likely tokens, producing more deterministic output. Higher temperatures flatten the distribution, increasing diversity and randomness. The visualization updates the probability distribution display in real time as the temperature slider is adjusted.
Top-k Sampling (default: 5): Restricts sampling to the k most probable tokens at each generation step, setting the probability of all other tokens to zero and renormalizing. This prevents the model from selecting highly improbable tokens while maintaining controlled diversity within the top candidates.
The attention mechanism visualization is the central pedagogical component of Transformer Explainer, providing detailed, interactive renderings of the computational steps that underlie self-attention in GPT-2.
For each token in the input sequence, the tool displays the computed query, key, and value vectors. These projections are rendered as labeled numerical arrays, allowing users to inspect the specific values that determine how each token interacts with others in the attention computation. By selecting different tokens, users can observe how the projection vectors vary across positions and how these variations drive differences in attention patterns.
The attention weights—computed as the softmax-normalized dot products of queries and keys—are rendered as an interactive heatmap. Each cell in the heatmap represents the attention weight from one query token to one key token, with color intensity proportional to the weight magnitude. This visualization enables users to observe which tokens attend most strongly to which others, and how these patterns change as the input sequence changes.
The causal mask is rendered as a distinct matrix overlay showing which attention connections are permitted and which are blocked. Permitted connections (lower-triangular entries) are displayed prominently, while blocked connections (upper-triangular entries) are visually suppressed. This visualization directly demonstrates the autoregressive property: each token can only "see" tokens at its own position and earlier positions, never future tokens.
The mask visualization is particularly effective for learners who struggle with the abstract concept of causal masking from textual descriptions alone. By seeing the triangular structure and understanding its role in preventing information leakage, users develop a concrete mental model of how autoregressive constraints shape the attention computation.
Transformer Explainer provides a step-by-step visualization of the autoregressive text generation process, rendering each stage from initial input processing through final token selection.
The generation process is displayed as an iterative loop: at each step, the system processes the current token sequence through the simulated transformer pipeline, computes the output probability distribution, samples a token, and appends it to the sequence. This loop is animated token by token, allowing users to observe how each sampled token extends the sequence and conditions the computation of the next distribution.
The temperature parameter is applied to the logits before softmax normalization according to the formula:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i are the raw logits and T is the temperature. The tool visualizes the probability distribution at different temperature settings, allowing users to observe how lower temperatures sharpen the distribution toward the highest-probability tokens while higher temperatures flatten it toward uniformity.
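The temperature-scaled softmax is a small function. The sketch below uses made-up logits to show the sharpening and flattening effect; it is not the tool's actual source.

```javascript
// Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T).
function softmaxT(logits, T) {
  const scaled = logits.map(z => z / T);
  const m = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map(v => Math.exp(v - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const logits = [2.0, 1.0, 0.1];
const cold = softmaxT(logits, 0.5); // sharper: more mass on the top logit
const hot = softmaxT(logits, 2.0);  // flatter: mass spread more evenly
```

Dividing the logits by T < 1 amplifies their differences before exponentiation, concentrating probability on the argmax; T > 1 shrinks the differences, pushing the distribution toward uniform.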
After temperature scaling, the top-k filtering step retains only the k highest-probability tokens and redistributes their probability mass through renormalization. The visualization displays which tokens survive the top-k cutoff and which are eliminated, providing clear insight into how this sampling strategy balances diversity and quality. With the default setting of k = 5, users can observe that only a small subset of the full vocabulary is considered at each generation step.
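Top-k filtering and renormalization can be sketched in a few lines; the probabilities below are made-up values for illustration.

```javascript
// Top-k filtering: keep the k most probable tokens, zero the rest, renormalize.
function topK(probs, k) {
  const keptIndices = probs
    .map((p, i) => [p, i])
    .sort((a, b) => b[0] - a[0]) // rank tokens by descending probability
    .slice(0, k)
    .map(([, i]) => i);
  const kept = new Set(keptIndices);
  const filtered = probs.map((p, i) => (kept.has(i) ? p : 0));
  const sum = filtered.reduce((a, b) => a + b, 0);
  return filtered.map(p => p / sum); // redistribute mass over survivors
}

const probs = [0.4, 0.3, 0.2, 0.07, 0.03];
const out = topK(probs, 2); // only the two largest survive, rescaled to sum to 1
```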
The final probability distribution over candidate tokens is rendered as a ranked list showing each token and its associated probability. This display updates in real time as temperature and top-k parameters are adjusted, providing immediate visual feedback on how these parameters reshape the distribution. Users can observe the direct relationship between parameter settings and sampling behavior, developing practical intuition for hyperparameter tuning in text generation applications.
Transformer Explainer is built entirely with vanilla web technologies, requiring zero external dependencies, libraries, or frameworks. This design choice serves multiple objectives:
| Technology | Role | Rationale |
|---|---|---|
| HTML5 | Document structure and semantic layout | Universal browser support, no build step required |
| CSS3 | Styling, animations, heatmap rendering | Hardware-accelerated animations, no CSS framework overhead |
| Vanilla JavaScript | Simulation logic, DOM manipulation, interactivity | Zero dependency, minimal bundle size, full transparency |
The zero-dependency architecture ensures that the application can be served from any static hosting provider, runs in any modern browser without installation, and presents no supply-chain security risks. The entire application is contained within a single HTML file, maximizing portability and minimizing deployment complexity.
The tool simulates a GPT-2 Small architecture with the following configuration:
| Parameter | Value |
|---|---|
| Model | GPT-2 Small (simulation) |
| Embedding Dimension | 768 |
| Attention Heads | 12 |
| Transformer Blocks | 12 |
| Vocabulary Size | 50,257 |
| Context Window | 1,024 tokens |
It is important to note that the tool provides a pedagogical simulation rather than a full inference engine. The Q/K/V projections, attention weights, and probability distributions are computed using representative numerical values that demonstrate the correct mathematical relationships and computational flow. This approach enables real-time interactivity in the browser without requiring the download or execution of the full 124M-parameter GPT-2 model, while preserving the structural and behavioral fidelity necessary for educational purposes.
The decision to avoid frameworks such as React, Vue, or Angular was deliberate and pedagogically motivated. By implementing all interactivity through direct DOM manipulation and vanilla JavaScript, the tool's source code itself serves as a learning resource. Students can inspect the implementation to see exactly how attention scores are computed, how the causal mask is applied, and how the sampling procedure operates—without navigating framework-specific abstractions, build pipelines, or dependency trees.
This approach also eliminates version compatibility issues, build tool requirements, and the need for Node.js or npm on the user's machine. The application can be opened directly from the filesystem or deployed to any static hosting service without modification.
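As an illustration of this framework-free style, a heatmap cell can be styled by mapping an attention weight in [0, 1] to a CSS color string and assigning it via direct DOM manipulation. The palette below is an assumption for illustration; the tool's actual color scheme may differ.

```javascript
// Framework-free heatmap styling sketch: map an attention weight in [0, 1]
// to a CSS rgba() string (illustrative palette, not the tool's actual one).
function heatColor(w) {
  const alpha = Math.min(1, Math.max(0, w)); // clamp to [0, 1]
  return `rgba(30, 100, 200, ${alpha.toFixed(2)})`;
}

// In the browser, each heatmap cell is a plain element updated directly:
//   cell.style.backgroundColor = heatColor(weight);
const c = heatColor(0.5); // "rgba(30, 100, 200, 0.50)"
```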
Transformer Explainer addresses a well-documented challenge in machine learning education: the difficulty of translating mathematical abstractions into operational understanding. The self-attention mechanism, while concisely expressed as a matrix operation, involves complex data-dependent computation that static representations cannot fully convey.
The tool targets several specific conceptual barriers that learners commonly encounter when studying transformer models: the data-dependent nature of attention weight computation, the role of the causal mask in enforcing the autoregressive constraint, and the effect of sampling parameters such as temperature and top-k on the output distribution.
The tool is designed to serve multiple user groups: learners building a first intuition for transformer internals, educators demonstrating attention and sampling mechanics in the classroom, and researchers seeking a quick, inspectable reference for the GPT-2 inference pipeline.
Compared to existing educational resources for transformer understanding, Transformer Explainer offers several distinct advantages:
| Approach | Interactivity | Dependencies | Accessibility |
|---|---|---|---|
| Textbook diagrams | None (static) | None | High (but limited depth) |
| Video tutorials | None (passive) | None | High (but no exploration) |
| Jupyter notebooks | Moderate | Python, PyTorch, etc. | Low (requires setup) |
| BertViz / exBERT | High | Python, model weights | Low (requires installation) |
| Transformer Explainer | High | None (pure browser) | High (zero setup) |
Transformer Explainer (SimulasiLLM) demonstrates that the core mechanisms of transformer-based language models can be made accessible and intuitive through carefully designed interactive visualization. By decomposing the GPT-2 inference pipeline into observable, manipulable stages—from token embedding through autoregressive sampling—the tool enables learners to build concrete mental models of how LLMs process and generate text. The zero-dependency, browser-based architecture ensures universal accessibility, requiring no installation, configuration, or technical prerequisites beyond a modern web browser.
The deliberate design choice to implement the entire tool in pure HTML, CSS, and JavaScript, without frameworks or external libraries, serves a dual purpose: it eliminates all barriers to deployment and usage, and it makes the tool's source code itself a transparent, inspectable learning resource.
Future directions for this work include replacing the representative numerical values with the actual GPT-2 Small weights for numerically exact in-browser inference, and extending the visualization to cover individual attention heads and the full stack of transformer blocks.
The complete source code is available at https://github.com/romizone/simulasillm under the MIT license, and a live deployment is accessible at https://simulasillm.vercel.app/.