We present an interactive web-based simulator that visualizes the performance characteristics of three heterogeneous processor architectures—Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Tensor Processing Units (TPUs)—when executing representative AI and deep learning workloads. The simulator features a real-time processing race where three processor lanes compete to complete identical tasks, exposing fundamental architectural differences through animated core visualization grids, live performance metrics, and ASCII architecture diagrams. Users select from four workload types (matrix multiplication, CNN training, batch inference, NLP transformer operations), configure batch sizes and simulation speed, and observe how each processor's design philosophy translates to throughput differences. The system implements a tick-based simulation engine with stochastic jitter to produce realistic performance variance, color-coded processor dashboards with pulsing core animations, timestamped event logging, and ranked finish analysis with speedup calculations. Built as a single HTML file with zero external dependencies using vanilla JavaScript, CSS3 Grid/Flexbox, and interval-driven animation, the application is deployed on both GitHub Pages and Vercel. We describe the simulation model, processor abstraction layer, workload parameterization, and visualization architecture.
The rapid expansion of deep learning has driven the development of specialized hardware accelerators designed to handle the massive parallelism inherent in neural network computations. While CPUs remain the general-purpose backbone of computing, GPUs have become the de facto standard for training deep learning models, and TPUs represent Google's purpose-built approach to tensor operations. Understanding the architectural differences between these processors and their implications for AI workload performance is essential for practitioners making hardware selection decisions.
However, direct access to all three processor types for comparative benchmarking is impractical for most learners and practitioners. GPU clusters require significant investment, and TPU access is limited to Google Cloud Platform. Existing educational materials rely primarily on static benchmark tables, performance charts, and theoretical descriptions that fail to convey the dynamic nature of parallel processing.
GPU vs CPU vs TPU Simulator addresses this educational gap by providing a browser-based interactive tool that simulates the relative performance characteristics of all three processor types across representative AI workloads. Users observe a visual race between processors, examine animated core grids, and analyze real-time metrics—developing intuitive understanding of why certain architectures excel at specific workload patterns. The key contributions of this work are: (1) parameterized processor models that reproduce performance ratios consistent with published benchmarks, (2) a real-time race visualization with animated core grids that conveys each architecture's parallelism model, and (3) a zero-dependency, single-file implementation deployable on any static host.
The divergence between CPU, GPU, and TPU architectures reflects fundamentally different design philosophies. CPUs optimize for single-thread performance with deep pipelines, branch prediction, out-of-order execution, and large cache hierarchies. GPUs sacrifice single-thread performance for massive parallelism, organizing thousands of simpler cores into streaming multiprocessors that execute identical instructions across data elements (SIMT model). TPUs take specialization further, implementing a systolic array architecture where processing elements are arranged in a grid and data flows between them in a pipelined fashion, optimizing specifically for matrix multiply-accumulate operations.
Jouppi et al. (2017) described the TPU architecture in detail, reporting 15–30x performance improvements over contemporary CPUs and GPUs for inference workloads. Subsequent TPU generations (v2, v3, v4) extended this advantage to training, with TPU v4 pods delivering 1.1 exaflops of peak performance. Our simulator abstracts these architectural differences into configurable processor models that produce performance ratios consistent with published benchmarks.
MLPerf (Mattson et al., 2020) established standardized benchmarks for comparing ML hardware across training and inference tasks. However, MLPerf results are static tables requiring specialized hardware to reproduce. Tools like AI Benchmark (Ignatov et al., 2019) focus on mobile processors. Our simulator complements these by providing an interactive, visual experience that runs in any web browser without specialized hardware.
Interactive visualizations have proven effective for teaching computing concepts. TensorFlow Playground demonstrates neural network training, CNN Explainer visualizes convolutional operations, and Transformer Explainer (our prior work) visualizes attention mechanisms. However, no comparable interactive tool exists for comparing hardware architectures in the context of AI workloads. Our work fills this gap with a processor-centric visualization.
The simulator abstracts each processor type into a parameterized model that captures the essential performance characteristics without requiring cycle-accurate simulation. Each model is defined by its core count, clock frequency, parallelism factor, and workload-specific throughput scaling.
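These parameters can be represented as plain objects. The sketch below uses values from the tables in this section, but the property names are illustrative rather than the project's actual identifiers:

```javascript
// Illustrative processor models; values mirror the parameter tables below.
// Property names are our own, not the simulator's actual identifiers.
const PROCESSORS = {
  cpu: { cores: 8,     clockGHz: 5.0,  parallelism: 16 },    // 8 cores / 16 threads, MIMD
  gpu: { cores: 4096,  clockGHz: 1.8,  parallelism: 4096 },  // SIMT over CUDA cores
  tpu: { cores: 16384, clockGHz: 0.94, parallelism: 16384 }  // 128x128 systolic array
};

// Workload-specific throughput factors, normalized to CPU = 1.0x.
const WORKLOAD_FACTORS = {
  matmul:    { cpu: 1.0, gpu: 25, tpu: 45 },
  cnn:       { cpu: 1.0, gpu: 30, tpu: 40 },
  inference: { cpu: 1.0, gpu: 20, tpu: 50 },
  nlp:       { cpu: 1.0, gpu: 22, tpu: 35 }
};
```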
| Parameter | Value | Rationale |
|---|---|---|
| Core Count | 8 cores / 16 threads | Representative high-end desktop CPU |
| Clock Frequency | 5.0 GHz | Modern boost clock for single-thread performance |
| Cache Hierarchy | L1: 32KB, L2: 256KB, L3: 16MB | Standard multi-level cache architecture |
| Parallelism Model | MIMD (multi-thread) | Independent instruction streams per core |
| Strengths | Low latency, branch handling, sequential logic | Deep pipeline with branch prediction |
The CPU model excels at workloads with complex control flow and data-dependent branching but is constrained by its limited thread count for massively parallel tensor operations. In the simulation, CPU processing speed reflects this by applying a parallelism penalty proportional to the workload's inherent parallelism.
| Parameter | Value | Rationale |
|---|---|---|
| CUDA Cores | 4096 | Representative high-end data center GPU |
| Clock Frequency | 1.8 GHz | Lower per-core frequency, massive parallelism |
| Memory | 24GB HBM2e | High-bandwidth memory for tensor throughput |
| Parallelism Model | SIMT (single instruction, multiple threads) | Thousands of threads executing identical kernels |
| Strengths | Massive parallelism, high memory bandwidth | Optimized for data-parallel operations |
The GPU model achieves its performance advantage through massive parallelism—4096 CUDA cores executing the same instruction on different data elements simultaneously. The simulation models this by scaling GPU throughput proportionally to the data-parallel dimension of each workload, producing 10–50x speedups over CPU for suitable tasks.
| Parameter | Value | Rationale |
|---|---|---|
| Systolic Array | 128 × 128 | Matrix multiply unit with 16,384 multiply-accumulate units |
| Clock Frequency | 940 MHz | Lower frequency, higher efficiency per operation |
| Memory | 32GB HBM | Unified high-bandwidth memory |
| Parallelism Model | Systolic data flow | Data flows through processing elements in a pipeline |
| Strengths | Matrix operations, deterministic latency | Purpose-built for tensor computations |
The TPU model represents the most specialized architecture, achieving peak efficiency for matrix multiply-accumulate operations through its systolic array. The 128×128 array processes 16,384 multiply-accumulate operations per cycle in a pipelined fashion. The simulation reflects this by giving TPU the highest throughput for matrix-heavy workloads while applying efficiency penalties for operations that don't map well to the systolic array structure.
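As a quick sanity check of these figures, the array's peak multiply-accumulate throughput follows directly from its dimensions and clock frequency. The arithmetic below is ours, not part of the simulator:

```javascript
// Peak throughput of a 128x128 systolic array clocked at 940 MHz.
const arrayDim = 128;
const clockHz = 940e6;
const macsPerCycle = arrayDim * arrayDim;      // 16,384 multiply-accumulates per cycle
const macsPerSecond = macsPerCycle * clockHz;  // ~1.54e13 MAC/s
const opsPerSecond = macsPerSecond * 2;        // counting multiply and add separately: ~30.8 TOPS
```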
The simulator supports four representative AI workloads, each characterized by distinct computational patterns that exercise the three architectures differently.
| Workload | Operation | Dominant Pattern | CPU Factor | GPU Factor | TPU Factor |
|---|---|---|---|---|---|
| Matrix Multiplication | 1024 × 1024 GEMM | Dense linear algebra | 1.0x | 25x | 45x |
| CNN Training | ResNet-50, 1 epoch | Conv2D + backprop | 1.0x | 30x | 40x |
| Batch Inference | Image classification | Forward pass only | 1.0x | 20x | 50x |
| NLP Transformer | BERT-Base forward | Attention + FFN | 1.0x | 22x | 35x |
Users configure batch sizes from 32 to 256. Larger batch sizes increase the total computation and amplify the parallelism advantage of GPU and TPU architectures. The simulation models this relationship so that CPUs scale linearly with batch size (limited parallelism), GPUs scale sub-linearly (parallelism saturates at high batch sizes), and TPUs maintain near-linear throughput scaling due to the systolic array's efficient data flow.
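The text does not pin down an exact formula, so the sketch below is one plausible formulation. The exponents are illustrative guesses chosen to reproduce the described scaling behavior, not values taken from the simulator:

```javascript
// Hypothetical batch-scaling model: relative completion time of each
// processor for a given batch size and workload factor. Exponents are
// illustrative: CPU time grows linearly with batch size, GPU time grows
// sub-linearly, and TPU time stays nearly flat (near-linear throughput).
function relativeTime(processor, batchSize, factor) {
  const work = batchSize / 32;              // normalized to the smallest batch
  const exponent = { cpu: 1.0, gpu: 0.4, tpu: 0.1 }[processor];
  return Math.pow(work, exponent) / factor; // larger factor => faster finish
}
```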
A speed multiplier (adjustable via slider) scales the animation tick rate without affecting the relative performance ratios between processors. This allows users to observe the race dynamics at their preferred pace—slow for detailed analysis or fast for rapid comparison.
The simulation runs on a setInterval-driven tick system. Each tick advances all three processors by their respective throughput amounts, calculated from the processor model parameters and current workload configuration. The tick interval is set to 50ms, providing smooth visual updates at 20 frames per second.
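The tick loop can be sketched as follows. Function and property names are ours; the progress update is kept pure so it can be driven by setInterval in the browser or called directly:

```javascript
const TICK_MS = 50; // 20 visual updates per second

// Advance every processor by its per-tick increment, clamped at 100%.
// A hypothetical speedMultiplier scales the amount of work per tick.
function advanceTick(state, speedMultiplier = 1) {
  for (const p of state.processors) {
    if (p.progress >= 100) continue; // already finished
    p.progress = Math.min(100, p.progress + p.incrementPerTick * speedMultiplier);
  }
  state.elapsedMs += TICK_MS * speedMultiplier;
  return state;
}

// Browser side (sketch): setInterval(() => advanceTick(state, speed), TICK_MS);
```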
To prevent the simulation from appearing artificially deterministic, each processor's progress increment includes a random jitter component, sampled from a uniform distribution spanning ±15% of the base increment.
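A minimal sketch of this jitter, assuming it is applied to each per-tick increment (the helper name is ours):

```javascript
// Apply ±15% uniform jitter to a base per-tick increment.
// Math.random() is uniform on [0, 1), so (2 * Math.random() - 1) is
// uniform on [-1, 1); scaling by 0.15 gives the ±15% band.
function jitteredIncrement(base) {
  const jitter = (2 * Math.random() - 1) * 0.15;
  return base * (1 + jitter);
}
```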
This produces realistic variation in finish times across simulation runs while preserving the expected performance ordering. The jitter also creates visual dynamism in the progress bars, making the race more engaging.
When a processor's cumulative progress reaches 100%, the simulation records its finish timestamp and assigns a rank (1st, 2nd, 3rd) with corresponding medal badges. The global elapsed timer continues until all processors complete. After all finish, the system calculates speedup ratios relative to the slowest processor and displays a comprehensive results summary.
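The ranking and speedup logic can be sketched as follows (function and field names are illustrative, not the project's):

```javascript
// Rank finished processors by finish time and compute speedups
// relative to the slowest processor, as described above.
function rankResults(finishTimes) {
  // finishTimes: e.g. { cpu: 42.0, gpu: 1.7, tpu: 0.9 } in seconds
  const ranked = Object.entries(finishTimes).sort((a, b) => a[1] - b[1]);
  const slowest = ranked[ranked.length - 1][1];
  return ranked.map(([name, t], i) => ({
    name,
    rank: i + 1,          // 1st, 2nd, 3rd
    finishTime: t,
    speedup: slowest / t  // slowest processor gets speedup 1.0
  }));
}
```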
The centerpiece of the simulator is the processing race—three horizontal progress bars representing CPU, GPU, and TPU advancing toward 100% completion. Each lane is color-coded (cyan for CPU, green for GPU, orange for TPU) and displays the processor name, current progress percentage, and status badge (IDLE, RUNNING, or FINISHED). A global elapsed timer above the lanes shows cumulative time in seconds.
Each processor card displays an animated grid representing its processing elements:
| Processor | Grid Size | Visual Representation | Animation |
|---|---|---|---|
| CPU | 4 × 4 (16 cores) | Large squares, widely spaced | Sequential pulsing, 1–2 active at a time |
| GPU | 16 × 16 (256 cores) | Small dots, densely packed | Wave-pattern pulsing across grid |
| TPU | 8 × 8 (64 units) | Medium squares, uniform grid | Diagonal sweep pattern (systolic flow) |
The grid animations are designed to convey each architecture's parallelism model: CPU cores activate individually (limited parallelism), GPU cores activate in large waves (SIMT execution), and TPU units activate in diagonal sweeps (systolic data flow).
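These activation patterns can be expressed as simple index predicates over the grid. The sketch below is our reconstruction of the idea, not the project's actual animation code:

```javascript
// Return the "active" cells of an n x n core grid at a given tick,
// one predicate per architecture's animation style (illustrative).
function activeCells(arch, n, tick) {
  const cells = [];
  for (let r = 0; r < n; r++) {
    for (let c = 0; c < n; c++) {
      let active = false;
      if (arch === 'cpu')      active = r * n + c === tick % (n * n); // one core at a time
      else if (arch === 'gpu') active = r === tick % n;               // whole row: SIMT wave
      else if (arch === 'tpu') active = (r + c) % n === tick % n;     // diagonal: systolic sweep
      if (active) cells.push([r, c]);
    }
  }
  return cells;
}
```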
ASCII-style technical diagrams illustrate the internal structure of each processor, providing educational context for the performance differences:
```
CPU Architecture:
┌───────────────┬───────────────┐
│ Control Unit  │ Branch Pred.  │
├─────┬─────┬───┴───────────────┤
│ L1$ │ L2$ │     L3 Cache      │
├─────┴─┬───┴───┬───────┬───────┤
│ Core0 │ Core1 │ Core2 │ Core3 │
└───────┴───────┴───────┴───────┘

GPU Architecture:
┌─────────────────────────────┐
│ GPC0:  SM0  SM1  SM2  SM3   │
│ GPC1:  SM4  SM5  SM6  SM7   │
│ Each SM: 128 CUDA Cores     │
├─────────────────────────────┤
│ HBM2e Memory Controller     │
└─────────────────────────────┘

TPU Architecture:
┌─────────────────────────────┐
│   128x128 Systolic Array    │
│   [PE][PE][PE] ... [PE]     │
│   [PE][PE][PE] ... [PE]     │
│   Matrix Multiply Unit      │
├─────────────────────────────┤
│ Unified HBM Controller      │
└─────────────────────────────┘
```
Each processor card displays live metrics updated on every simulation tick:
| Metric | Description | Update Frequency |
|---|---|---|
| Operations/Second | Simulated throughput based on processor model | Every tick (50ms) |
| Tasks Completed | Cumulative count of completed work units | On completion of each unit |
| Throughput | Data processed per second (GB/s or TFLOPS) | Every tick |
| Start Time | Timestamp when processing began | Once at start |
| Elapsed Time | Running duration since start | Every tick |
| Finish Time | Timestamp when processing completed | Once at completion |
| Progress | Percentage complete with animated bar | Every tick |
A scrollable console at the bottom of the interface displays timestamped log entries tracking simulation events. Events include processor initialization, workload assignment, milestone completions (25%, 50%, 75%), finish events with timing data, and final rankings with speedup calculations. The log provides a textual record complementing the visual race display.
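A minimal sketch of such timestamped logging. The real simulator writes entries to a scrollable console element; the helper below is illustrative and logs to an in-memory array instead:

```javascript
// Append a timestamped entry to an in-memory log. Elapsed simulation
// time is formatted to two decimal places, e.g. "[1.50s] GPU reached 50%".
function logEvent(log, elapsedMs, message) {
  const seconds = (elapsedMs / 1000).toFixed(2);
  log.push(`[${seconds}s] ${message}`);
  return log;
}
```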
Upon completion of all three processors, the simulator presents a comprehensive results panel showing the final rankings with medal badges, each processor's finish time, and speedup ratios relative to the slowest processor.
The results display reinforces the quantitative learning by presenting multiple representations of the same performance data—numerical, visual, and chronological.
| Processor | Primary Color | Background Tint | Accent |
|---|---|---|---|
| CPU | Cyan (#06b6d4) | #0e2433 | #22d3ee |
| GPU | Green (#22c55e) | #0e2418 | #4ade80 |
| TPU | Orange (#f97316) | #241509 | #fb923c |
The dark theme with processor-specific color coding creates immediate visual association between each lane and its processor type. The color scheme is consistent across progress bars, core grids, status badges, architecture diagrams, and results displays.
| Control | Type | Options | Default |
|---|---|---|---|
| Workload Type | Dropdown | Matrix Mul / CNN Training / Batch Inference / NLP Transformer | Matrix Multiplication |
| Batch Size | Dropdown | 32 / 64 / 128 / 256 | 64 |
| Simulation Speed | Range slider | 0.5x – 3x | 1x |
| Start / Reset | Buttons | — | — |
The interface uses CSS Grid and Flexbox to adapt across viewport sizes. On desktop, processor cards are displayed in a three-column layout for side-by-side comparison. On tablet, the layout transitions to a stacked arrangement with preserved race lanes. On mobile, all elements stack vertically with touch-optimized controls.
The project is deployed on two platforms simultaneously, demonstrating the portability of the zero-dependency architecture:
| Platform | URL | Deployment Method | Status |
|---|---|---|---|
| GitHub Pages | romizone.github.io/gpu-cpu-tpu-simulator | Automatic on push to main | Completed |
| Vercel | gpu-cpu-tpu-simulator.vercel.app | Automatic on push to main | Completed |
Both deployments serve the identical index.html file with no build step, transpilation, or bundling required. The zero-dependency architecture ensures identical behavior across both CDN networks.
```shell
git clone https://github.com/romizone/gpu-cpu-tpu-simulator.git
cd gpu-cpu-tpu-simulator
npx serve .
# Or: python3 -m http.server 3000
# Or: open index.html (direct file access)
```
| Layer | Technology | Purpose |
|---|---|---|
| Markup | HTML5 | Semantic structure, accessibility |
| Styling | CSS3 | Grid/Flexbox layout, animations, responsive design, dark theme |
| Logic | Vanilla JavaScript (ES6+) | Simulation engine, event handling, DOM manipulation |
| Animation | setInterval + CSS transitions | Tick-based progress updates, core grid pulsing |
| Hosting | GitHub Pages + Vercel | Dual-platform static deployment with CDN |
| Source Control | GitHub | Version control with automatic deployment triggers |
The entire application resides in a single index.html file containing HTML structure, CSS styles, and JavaScript logic. No npm packages, frameworks, build tools, or external CDN references are used. This architecture provides instant loading, full source transparency for learners, and trivial portability across static hosts.
```
gpu-cpu-tpu-simulator/
├── index.html      # Complete application (single-file, all-in-one)
│   ├── <style>     # CSS: dark theme, processor colors, animations, responsive
│   ├── <body>      # HTML: race lanes, processor cards, controls, results
│   └── <script>    # JS: simulation engine, tick loop, core animations, logging
├── LICENSE         # MIT License
└── README.md       # Documentation and usage guide
```
The simulator serves as an educational bridge between theoretical hardware architecture knowledge and practical performance intuition. Key learning outcomes include:
| Learning Objective | Simulator Feature | Insight Gained |
|---|---|---|
| Parallelism disparity | Core visualization grids | GPU model has 512x more cores than CPU (4096 vs. 8); quantity vs. quality trade-off |
| Workload-architecture fit | Workload selector | TPU dominates matrix ops; GPU excels at training; CPU handles sequential logic |
| Batch size scaling | Batch size control | Larger batches amplify GPU/TPU advantage over CPU |
| Systolic array concept | TPU architecture diagram | Data flows through processing elements in a pipelined grid |
| Performance magnitude | Processing race | GPU/TPU finish 20–50x faster than CPU for parallel workloads |
| Real-world variance | Stochastic jitter | Performance is not perfectly deterministic; variance exists in real hardware |
| Browser | CSS Grid | CSS Animations | ES6+ JavaScript | Overall |
|---|---|---|---|---|
| Chrome 90+ | Full | Full | Full | Full support |
| Edge 90+ | Full | Full | Full | Full support |
| Safari 15+ | Full | Full | Full | Full support |
| Firefox 103+ | Full | Full | Full | Full support |
| Chrome (Android) | Full | Full | Full | Full support |
| Safari (iOS 15+) | Full | Full | Full | Full support |
GPU vs CPU vs TPU Simulator demonstrates that the fundamental performance characteristics of heterogeneous processor architectures can be made accessible through interactive browser-based visualization. By staging a visual race between three processors across representative AI workloads, users develop intuitive understanding of why GPUs and TPUs dramatically outperform CPUs for data-parallel tensor operations, and how workload characteristics influence the relative advantage of each architecture.
The animated core visualization grids convey the parallelism disparity between architectures more effectively than static diagrams or benchmark tables. The configurable workload selection and batch size controls enable users to explore the workload-architecture fit landscape, discovering that TPUs excel at matrix-heavy inference, GPUs dominate training workloads, and CPUs remain competitive only for sequential or control-flow-heavy operations.
Built as a single HTML file with zero external dependencies and deployed on both GitHub Pages and Vercel, the application is instantly accessible, fully transparent, and suitable for educational contexts ranging from computer architecture courses to AI practitioner workshops. The dual-platform deployment demonstrates the portability benefits of dependency-free web development.
The complete source code is available at https://github.com/romizone/gpu-cpu-tpu-simulator and live demos are accessible at GitHub Pages and Vercel.