Transformer Explainer: An Interactive Visualization Tool for Understanding GPT-2 Attention Mechanisms and Text Generation

Romi Nur Ismanto
Independent AI Research Lab, Jakarta, Indonesia
rominur@gmail.com
February 2025

Abstract

We present Transformer Explainer (SimulasiLLM), an interactive, browser-based educational tool that visualizes the internal mechanisms of GPT-2-style transformer models during text generation. The system provides real-time, step-by-step visualization of token embeddings, query-key-value (Q/K/V) projections, causal masking, softmax-normalized attention weights, and autoregressive sampling with adjustable temperature and top-k parameters. Built entirely in plain HTML, CSS, and JavaScript with no external dependencies, the tool enables learners, educators, and researchers to develop an intuitive understanding of how large language models process and generate text. By rendering each computational stage of the transformer architecture as an interactive, manipulable display, Transformer Explainer bridges the gap between abstract mathematical formulations and concrete, observable model behavior. The tool simulates a GPT-2 Small architecture and is freely accessible as an open-source web application.

Keywords: transformer architecture, attention mechanism, GPT-2, interactive visualization, educational tool, causal masking, autoregressive generation, self-attention, large language models, explainability

1. Introduction

Large language models (LLMs) based on the transformer architecture have become foundational to modern natural language processing, powering systems that generate text, translate languages, answer questions, and write code. Despite their widespread adoption and profound impact, the internal mechanisms by which transformers process sequences and generate output remain opaque to most practitioners, students, and even many researchers. The mathematical formulations describing self-attention, positional encoding, and autoregressive decoding, while precise, do not readily convey an intuitive understanding of the computational dynamics at work.

This opacity presents a significant educational challenge. As transformer-based models become integral to an increasing number of applications, the need for accessible, interactive tools that demystify their internal workings grows correspondingly. Static diagrams in textbooks and papers, while valuable, cannot capture the dynamic, data-dependent nature of attention weight computation. Video tutorials offer temporal progression but lack interactivity. Existing code implementations, though functional, require significant programming expertise to interpret and modify.

Transformer Explainer (SimulasiLLM) addresses this gap by providing a fully interactive, browser-based visualization tool that renders each stage of the transformer inference pipeline as a manipulable, observable display. The key contributions of this work are (i) a real-time, step-by-step visualization of token embeddings, Q/K/V projections, causal masking, softmax-normalized attention weights, and autoregressive sampling; (ii) interactive controls for temperature and top-k sampling parameters with immediate visual feedback; and (iii) a zero-dependency implementation in pure HTML, CSS, and JavaScript that runs in any modern browser without installation.

2. Transformer Architecture Overview

The transformer architecture, introduced by Vaswani et al. (2017), fundamentally changed sequence modeling by replacing recurrent computation with a purely attention-based mechanism. This section provides a concise overview of the architectural components that Transformer Explainer visualizes, establishing the conceptual framework for the tool's design.

2.1 Token Embeddings and Positional Encoding

Transformers process discrete tokens by mapping each to a continuous vector representation through an embedding layer. Given a vocabulary V and an embedding dimension d, each input token t_i is mapped to a dense vector e_i ∈ ℝ^d. Since the self-attention mechanism is inherently permutation-invariant, positional information must be injected through positional encodings, which are added to the token embeddings to produce the final input representations.

In the GPT-2 architecture (Radford et al., 2019), learned positional embeddings are used rather than the sinusoidal encodings of the original transformer. The combined token and positional embeddings form the input to the first transformer block.
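The embedding step described above can be sketched in a few lines of vanilla JavaScript, matching the tool's zero-dependency style. The vocabulary, dimension (d = 4 rather than 768), and lookup tables below are illustrative toy stand-ins, not the tool's actual data:

```javascript
// Illustrative sketch: summing token and (learned) positional embeddings.
// The vocabulary and tables are toy stand-ins for learned parameters.
const vocab = { "the": 0, "cat": 1, "sat": 2 };

const tokenEmbedding = [          // one row per vocabulary entry, d = 4
  [0.1, -0.2, 0.3, 0.0],
  [0.5, 0.1, -0.4, 0.2],
  [-0.3, 0.6, 0.2, -0.1],
];
const positionEmbedding = [       // one row per sequence position
  [0.01, 0.02, 0.03, 0.04],
  [0.05, 0.04, 0.03, 0.02],
  [0.02, 0.01, 0.00, -0.01],
];

// Each input vector is the element-wise sum of the token's embedding
// and the embedding of the position it occupies in the sequence.
function embed(tokens) {
  return tokens.map((tok, pos) =>
    tokenEmbedding[vocab[tok]].map((v, j) => v + positionEmbedding[pos][j])
  );
}
```

Because the sum depends on position, the same token produces different input vectors at different positions, which is exactly how permutation invariance is broken.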

2.2 Self-Attention Mechanism

The self-attention mechanism enables each token in a sequence to attend to all other tokens, computing context-dependent representations. For each input vector, three projections are computed: a query (Q), a key (K), and a value (V), obtained by multiplying the input by learned weight matrices W_Q, W_K, and W_V, respectively.

The attention weights are computed as the scaled dot product of queries and keys, followed by softmax normalization:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Dividing by √d_k prevents the dot products from growing excessively large in magnitude, which would push the softmax function into regions of extremely small gradients.
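The formula maps directly to code. The following is a minimal single-head sketch in plain JavaScript (the tool's implementation language); the function names and array-of-arrays matrix representation are illustrative, not the tool's actual API:

```javascript
// Numerically stable softmax over a single row of scores.
function softmax(row) {
  const m = Math.max(...row);                  // subtract max to avoid overflow
  const exps = row.map(x => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Scaled dot-product attention for one head.
// Q, K, V are [seqLen][dk] arrays of numbers.
function attention(Q, K, V) {
  const dk = K[0].length;
  // scores[i][j] = (q_i · k_j) / sqrt(dk)
  const scores = Q.map(q =>
    K.map(k => q.reduce((s, qv, t) => s + qv * k[t], 0) / Math.sqrt(dk))
  );
  const weights = scores.map(softmax);         // normalize each query row
  // output[i] = sum_j weights[i][j] * v_j
  return weights.map(w =>
    V[0].map((_, t) => w.reduce((s, wj, j) => s + wj * V[j][t], 0))
  );
}
```

Each output row is a weighted average of the value vectors, with weights summing to 1 across the key positions.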

2.3 Causal Masking

In autoregressive language models such as GPT-2, a causal mask is applied to the attention scores before softmax normalization. This mask ensures that each token can only attend to itself and to tokens at earlier positions in the sequence, preventing information leakage from future tokens during generation. The mask sets the attention scores for future positions to negative infinity, causing them to become zero after the softmax operation.
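This masking step is a one-line transformation. The sketch below (function name illustrative) sets blocked entries to -Infinity, which Math.exp maps to exactly 0 when the softmax is applied afterward:

```javascript
// Apply a causal mask to a square matrix of raw attention scores:
// entries above the diagonal (j > i, i.e. future positions) are blocked
// by setting them to -Infinity before softmax normalization.
function applyCausalMask(scores) {
  return scores.map((row, i) =>
    row.map((score, j) => (j > i ? -Infinity : score))
  );
}
```

After the softmax, each row therefore distributes all of its probability mass over the current and earlier positions only.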

2.4 Autoregressive Generation

Text generation in GPT-2 proceeds autoregressively: the model generates one token at a time, appending each generated token to the input sequence and reprocessing the entire extended sequence to produce the next token. At each step, the model outputs a probability distribution over the vocabulary, from which the next token is sampled. Sampling strategies such as temperature scaling and top-k filtering control the diversity and quality of generated text.

3. System Design

Transformer Explainer is designed as a single-page, client-side web application that simulates the inference pipeline of a GPT-2 Small model. The system architecture prioritizes interactivity, visual clarity, and pedagogical effectiveness over computational fidelity, making deliberate design choices to render the internal mechanics of the transformer visible and manipulable.

3.1 Visualization Approach

The tool decomposes the transformer inference pipeline into discrete, observable stages, each rendered as a dedicated visual section within the interface. This decomposition follows the natural computational flow of the model:

Input Text → Token Embedding → Q/K/V Projection → Attention Score → Causal Mask → Softmax → Probability Distribution → Token Sampling → Output

Each stage is presented as an interactive panel that updates in real time as the user modifies inputs or parameters. This allows learners to observe how changes propagate through the pipeline, building causal intuition about transformer behavior.

3.2 Interactive Components

The system provides six primary interactive sections:

Table 1: Interactive visualization sections and their pedagogical purpose

| Section | Content Displayed | Pedagogical Purpose |
| Embedding Display | Token embeddings as numerical vectors | Shows how discrete tokens become continuous representations |
| Attention Core | Q/K/V matrices, attention weight heatmap | Reveals how tokens relate to each other through attention |
| Causal Mask | Triangular mask matrix visualization | Demonstrates autoregressive constraint on attention |
| Probability Distribution | Softmax output over vocabulary tokens | Shows how the model selects the next token |
| Query Token Details | Per-token Q/K/V vector values | Enables deep inspection of individual token representations |
| Generated Output | Autoregressive token-by-token output | Illustrates the sequential generation process |

3.3 Parameter Controls

Two primary generation parameters are exposed to the user for real-time adjustment:

Temperature (default: 0.8): Controls the sharpness of the probability distribution over vocabulary tokens. Lower temperatures concentrate probability mass on the most likely tokens, producing more deterministic output. Higher temperatures flatten the distribution, increasing diversity and randomness. The visualization updates the probability distribution display in real time as the temperature slider is adjusted.

Top-k Sampling (default: 5): Restricts sampling to the k most probable tokens at each generation step, setting the probability of all other tokens to zero and renormalizing. This prevents the model from selecting highly improbable tokens while maintaining controlled diversity within the top candidates.

4. Attention Mechanism Visualization

The attention mechanism visualization is the central pedagogical component of Transformer Explainer, providing detailed, interactive renderings of the computational steps that underlie self-attention in GPT-2.

4.1 Q/K/V Projection Display

For each token in the input sequence, the tool displays the computed query, key, and value vectors. These projections are rendered as labeled numerical arrays, allowing users to inspect the specific values that determine how each token interacts with others in the attention computation. By selecting different tokens, users can observe how the projection vectors vary across positions and how these variations drive differences in attention patterns.

4.2 Attention Weight Heatmap

The attention weights—computed as the softmax-normalized dot products of queries and keys—are rendered as an interactive heatmap. Each cell in the heatmap represents the attention weight from one query token to one key token, with color intensity proportional to the weight magnitude. This visualization enables users to observe which earlier tokens each position attends to most strongly, and how these patterns change as the input text is modified.

4.3 Causal Mask Visualization

The causal mask is rendered as a distinct matrix overlay showing which attention connections are permitted and which are blocked. Permitted connections (lower-triangular entries) are displayed prominently, while blocked connections (upper-triangular entries) are visually suppressed. This visualization directly demonstrates the autoregressive property: each token can only "see" tokens at its own position and earlier positions, never future tokens.

The mask visualization is particularly effective for learners who struggle with the abstract concept of causal masking from textual descriptions alone. By seeing the triangular structure and understanding its role in preventing information leakage, users develop a concrete mental model of how autoregressive constraints shape the attention computation.

5. Text Generation Pipeline

Transformer Explainer provides a step-by-step visualization of the autoregressive text generation process, rendering each stage from initial input processing through final token selection.

5.1 Autoregressive Sampling Demonstration

The generation process is displayed as an iterative loop: at each step, the system processes the current token sequence through the simulated transformer pipeline, computes the output probability distribution, samples a token, and appends it to the sequence. This loop is animated token by token, allowing users to observe how each sampled token extends the context and how the probability distribution shifts at every subsequent step.
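The loop above can be sketched as follows; `model` and `sample` are hypothetical stand-ins for the simulated pipeline and the sampling strategy, not the tool's actual API:

```javascript
// Autoregressive generation loop: reprocess the whole sequence each step,
// sample a token from the resulting distribution, append it, and repeat.
function generate(model, prompt, steps, sample) {
  const tokens = prompt.slice();   // copy so the prompt is not mutated
  for (let i = 0; i < steps; i++) {
    const probs = model(tokens);   // full sequence is reprocessed each step
    tokens.push(sample(probs));    // append the sampled token and continue
  }
  return tokens;
}

// Greedy sampler for illustration: always pick the most probable token.
const argmax = probs =>
  probs.reduce((best, p, i) => (p > probs[best] ? i : best), 0);
```

Swapping `argmax` for a stochastic sampler (temperature- and top-k-adjusted, as in Sections 5.2 and 5.3) changes only the `sample` argument, not the loop itself.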

5.2 Temperature Effects

The temperature parameter is applied to the logits before softmax normalization according to the formula:

p(t_i) = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i are the raw logits and T is the temperature. The tool visualizes the probability distribution at different temperature settings, allowing users to observe how lower temperatures sharpen the distribution toward the top-ranked tokens while higher temperatures flatten it across the vocabulary.
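As a concrete sketch (function name illustrative), the formula translates to code by dividing the logits by T before the usual softmax:

```javascript
// Temperature-scaled softmax, following the formula above.
// Lower T sharpens the distribution; higher T flattens it.
function softmaxWithTemperature(logits, T) {
  const scaled = logits.map(z => z / T);
  const m = Math.max(...scaled);                // subtract max for stability
  const exps = scaled.map(z => Math.exp(z - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}
```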

5.3 Top-k Filtering

After temperature scaling, the top-k filtering step retains only the k highest-probability tokens and redistributes their probability mass through renormalization. The visualization displays which tokens survive the top-k cutoff and which are eliminated, providing clear insight into how this sampling strategy balances diversity and quality. With the default setting of k = 5, users can observe that only a small subset of the full vocabulary is considered at each generation step.
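A minimal sketch of this filtering step (function name illustrative): rank the tokens, keep the top k, zero the rest, and renormalize the surviving probability mass:

```javascript
// Top-k filtering: keep the k highest-probability tokens, zero out the
// rest, and renormalize so the survivors sum to 1.
function topKFilter(probs, k) {
  const ranked = probs
    .map((p, i) => [p, i])            // pair each probability with its index
    .sort((a, b) => b[0] - a[0])      // sort descending by probability
    .slice(0, k);                     // retain only the top k entries
  const keep = new Set(ranked.map(([, i]) => i));
  const mass = ranked.reduce((s, [p]) => s + p, 0);
  return probs.map((p, i) => (keep.has(i) ? p / mass : 0));
}
```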

5.4 Probability Distribution Display

The final probability distribution over candidate tokens is rendered as a ranked list showing each token and its associated probability. This display updates in real time as temperature and top-k parameters are adjusted, providing immediate visual feedback on how these parameters reshape the distribution. Users can observe the direct relationship between parameter settings and sampling behavior, developing practical intuition for hyperparameter tuning in text generation applications.

6. Implementation Details

6.1 Technology Stack

Transformer Explainer is built entirely with vanilla web technologies, requiring zero external dependencies, libraries, or frameworks. This design choice serves multiple objectives:

Table 2: Technology stack and design rationale

| Technology | Role | Rationale |
| HTML5 | Document structure and semantic layout | Universal browser support, no build step required |
| CSS3 | Styling, animations, heatmap rendering | Hardware-accelerated animations, no CSS framework overhead |
| Vanilla JavaScript | Simulation logic, DOM manipulation, interactivity | Zero dependencies, minimal bundle size, full transparency |

The zero-dependency architecture ensures that the application can be served from any static hosting provider, runs in any modern browser without installation, and presents no supply-chain security risks. The entire application is contained within a single HTML file, maximizing portability and minimizing deployment complexity.

6.2 Simulation Model

The tool simulates a GPT-2 Small architecture with the following configuration:

Table 3: Simulated GPT-2 Small model parameters

| Parameter | Value |
| Model | GPT-2 Small (simulation) |
| Embedding Dimension | 768 |
| Attention Heads | 12 |
| Transformer Blocks | 12 |
| Vocabulary Size | 50,257 |
| Context Window | 1,024 tokens |

It is important to note that the tool provides a pedagogical simulation rather than a full inference engine. The Q/K/V projections, attention weights, and probability distributions are computed using representative numerical values that demonstrate the correct mathematical relationships and computational flow. This approach enables real-time interactivity in the browser without requiring the download or execution of the full 124M-parameter GPT-2 model, while preserving the structural and behavioral fidelity necessary for educational purposes.

6.3 No-Framework Philosophy

The decision to avoid frameworks such as React, Vue, or Angular was deliberate and pedagogically motivated. By implementing all interactivity through direct DOM manipulation and vanilla JavaScript, the tool's source code itself serves as a learning resource. Students can inspect the implementation to see exactly how attention scores are computed, how the causal mask is applied, and how the sampling procedure operates—without navigating framework-specific abstractions, build pipelines, or dependency trees.
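As an illustration of this pattern (the function below is a hypothetical sketch, not the tool's actual code), a display such as the ranked probability list can be rebuilt as a plain HTML string and assigned to an element's innerHTML whenever a slider changes—no virtual DOM or component framework involved:

```javascript
// Framework-free rendering: build the markup for a ranked probability
// display as a plain string. An event handler would assign the result to
// someElement.innerHTML each time a parameter slider changes.
function renderProbabilityList(probs, tokens) {
  return probs
    .map((p, i) => [p, tokens[i]])    // pair probabilities with token text
    .sort((a, b) => b[0] - a[0])      // rank from most to least probable
    .map(([p, tok]) => `<li>${tok}: ${(p * 100).toFixed(1)}%</li>`)
    .join("");
}
```

Because the rendering logic is an ordinary pure function, it can be read, tested, and modified without any build tooling.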

This approach also eliminates version compatibility issues, build tool requirements, and the need for Node.js or npm on the user's machine. The application can be opened directly from the filesystem or deployed to any static hosting service without modification.

7. Educational Impact

Transformer Explainer addresses a well-documented challenge in machine learning education: the difficulty of translating mathematical abstractions into operational understanding. The self-attention mechanism, while concisely expressed as a matrix operation, involves complex data-dependent computation that static representations cannot fully convey.

7.1 Bridging the Understanding Gap

The tool targets several conceptual barriers that learners commonly encounter when studying transformer models, including the permutation invariance of attention and the resulting need for positional information, the role of the causal mask in enforcing the autoregressive constraint, and the effect of sampling parameters such as temperature and top-k on the output distribution.

7.2 Target Audiences

The tool is designed to serve multiple user groups: learners building a first intuition for transformer internals, educators demonstrating attention and sampling mechanics, and researchers and practitioners communicating these concepts to broader audiences.

7.3 Advantages Over Existing Resources

Compared to existing educational resources for transformer understanding, Transformer Explainer offers several distinct advantages:

Table 4: Comparison with existing educational approaches

| Approach | Interactivity | Dependencies | Accessibility |
| Textbook diagrams | None (static) | None | High (but limited depth) |
| Video tutorials | None (passive) | None | High (but no exploration) |
| Jupyter notebooks | Moderate | Python, PyTorch, etc. | Low (requires setup) |
| BertViz / exBERT | High | Python, model weights | Low (requires installation) |
| Transformer Explainer | High | None (pure browser) | High (zero setup) |

8. Conclusion and Future Work

Transformer Explainer (SimulasiLLM) demonstrates that the core mechanisms of transformer-based language models can be made accessible and intuitive through carefully designed interactive visualization. By decomposing the GPT-2 inference pipeline into observable, manipulable stages—from token embedding through autoregressive sampling—the tool enables learners to build concrete mental models of how LLMs process and generate text. The zero-dependency, browser-based architecture ensures universal accessibility, requiring no installation, configuration, or technical prerequisites beyond a modern web browser.

The deliberate design choice to implement the entire tool in pure HTML, CSS, and JavaScript, without frameworks or external libraries, serves a dual purpose: it eliminates all barriers to deployment and usage, and it makes the tool's source code itself a transparent, inspectable learning resource.

This work remains under active development, with future extensions planned to deepen and broaden the visualization. The complete source code is available at https://github.com/romizone/simulasillm under the MIT license, and a live deployment is accessible at https://simulasillm.vercel.app/.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30, pp. 5998–6008.
  2. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Technical Report.
  3. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report.
  4. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (NeurIPS), 33, pp. 1877–1901.
  5. Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, pp. 4171–4186.
  6. Vig, J. (2019). A Multiscale Visualization of Attention in the Transformer Model. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 37–42.
  7. Hoover, B., Strobelt, H., & Gehrmann, S. (2020). exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models. Proceedings of the 58th Annual Meeting of the ACL: System Demonstrations, pp. 187–196.
  8. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. Proceedings of the 8th International Conference on Learning Representations (ICLR).
  9. Fan, A., Lewis, M., & Dauphin, Y. (2018). Hierarchical Neural Story Generation. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 889–898.
  10. Alammar, J. (2018). The Illustrated Transformer. Blog Post. https://jalammar.github.io/illustrated-transformer/