GPU vs CPU vs TPU Simulator: An Interactive Web-Based Visualization of Heterogeneous Processor Performance for AI Workloads

Romi Nur Ismanto
Jakarta AI Research Lab, Jakarta, Indonesia
rominur@gmail.com
March 2026

Abstract

We present an interactive web-based simulator that visualizes the performance characteristics of three heterogeneous processor architectures—Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Tensor Processing Units (TPUs)—when executing representative AI and deep learning workloads. The simulator features a real-time processing race where three processor lanes compete to complete identical tasks, exposing fundamental architectural differences through animated core visualization grids, live performance metrics, and ASCII architecture diagrams. Users select from four workload types (matrix multiplication, CNN training, batch inference, NLP transformer operations), configure batch sizes and simulation speed, and observe how each processor's design philosophy translates to throughput differences. The system implements a tick-based simulation engine with stochastic jitter to produce realistic performance variance, color-coded processor dashboards with pulsing core animations, timestamped event logging, and ranked finish analysis with speedup calculations. Built as a single HTML file with zero external dependencies using vanilla JavaScript, CSS3 Grid/Flexbox, and interval-driven animation, the application is deployed on both GitHub Pages and Vercel. We describe the simulation model, processor abstraction layer, workload parameterization, and visualization architecture.

Keywords: GPU, CPU, TPU, heterogeneous computing, AI accelerators, performance simulation, CUDA cores, systolic array, matrix multiplication, CNN training, transformer inference, interactive visualization, educational tool, zero dependencies

1. Introduction

The rapid expansion of deep learning has driven the development of specialized hardware accelerators designed to handle the massive parallelism inherent in neural network computations. While CPUs remain the general-purpose backbone of computing, GPUs have become the de facto standard for training deep learning models, and TPUs represent Google's purpose-built approach to tensor operations. Understanding the architectural differences between these processors and their implications for AI workload performance is essential for practitioners making hardware selection decisions.

However, direct access to all three processor types for comparative benchmarking is impractical for most learners and practitioners. GPU clusters require significant investment, and TPU access is limited to Google Cloud Platform. Existing educational materials rely primarily on static benchmark tables, performance charts, and theoretical descriptions that fail to convey the dynamic nature of parallel processing.

GPU vs CPU vs TPU Simulator addresses this educational gap by providing a browser-based interactive tool that simulates the relative performance characteristics of all three processor types across representative AI workloads. Users observe a visual race between processors, examine animated core grids, and analyze real-time metrics, developing an intuitive understanding of why certain architectures excel at specific workload patterns. The key contributions of this work are the parameterized processor models, the workload characterization, the tick-based simulation engine with stochastic jitter, and the visualization architecture, each described in the sections that follow.

2. Related Work

2.1 Heterogeneous Computing Architectures

The divergence between CPU, GPU, and TPU architectures reflects fundamentally different design philosophies. CPUs optimize for single-thread performance with deep pipelines, branch prediction, out-of-order execution, and large cache hierarchies. GPUs sacrifice single-thread performance for massive parallelism, organizing thousands of simpler cores into streaming multiprocessors that execute identical instructions across data elements (SIMT model). TPUs take specialization further, implementing a systolic array architecture where processing elements are arranged in a grid and data flows between them in a pipelined fashion, optimizing specifically for matrix multiply-accumulate operations.

Jouppi et al. (2017) described the TPU architecture in detail, reporting 15–30x performance improvements over contemporary CPUs and GPUs for inference workloads. Subsequent TPU generations (v2, v3, v4) extended this advantage to training, with TPU v4 pods delivering 1.1 exaflops of peak performance. Our simulator abstracts these architectural differences into configurable processor models that produce performance ratios consistent with published benchmarks.

2.2 Performance Benchmarking Tools

MLPerf (Mattson et al., 2020) established standardized benchmarks for comparing ML hardware across training and inference tasks. However, MLPerf results are static tables requiring specialized hardware to reproduce. Tools like AI Benchmark (Ignatov et al., 2019) focus on mobile processors. Our simulator complements these by providing an interactive, visual experience that runs in any web browser without specialized hardware.

2.3 Educational Visualizations

Interactive visualizations have proven effective for teaching computing concepts. TensorFlow Playground demonstrates neural network training, CNN Explainer visualizes convolutional operations, and Transformer Explainer (our prior work) visualizes attention mechanisms. However, no comparable interactive tool exists for comparing hardware architectures in the context of AI workloads. Our work fills this gap with a processor-centric visualization.

3. Processor Models

The simulator abstracts each processor type into a parameterized model that captures the essential performance characteristics without requiring cycle-accurate simulation. Each model is defined by its core count, clock frequency, parallelism factor, and workload-specific throughput scaling.

3.1 CPU Model

Table 1: CPU model specifications
Parameter | Value | Rationale
Core Count | 8 cores / 16 threads | Representative high-end desktop CPU
Clock Frequency | 5.0 GHz | Modern boost clock for single-thread performance
Cache Hierarchy | L1: 32KB, L2: 256KB, L3: 16MB | Standard multi-level cache architecture
Parallelism Model | MIMD (multi-thread) | Independent instruction streams per core
Strengths | Low latency, branch handling, sequential logic | Deep pipeline with branch prediction

The CPU model excels at workloads with complex control flow and data-dependent branching but is constrained by its limited thread count for massively parallel tensor operations. In the simulation, CPU processing speed reflects this by applying a parallelism penalty proportional to the workload's inherent parallelism.

3.2 GPU Model

Table 2: GPU model specifications
Parameter | Value | Rationale
CUDA Cores | 4096 | Representative high-end data center GPU
Clock Frequency | 1.8 GHz | Lower per-core frequency, massive parallelism
Memory | 24GB HBM2e | High-bandwidth memory for tensor throughput
Parallelism Model | SIMT (single instruction, multiple threads) | Thousands of threads executing identical kernels
Strengths | Massive parallelism, high memory bandwidth | Optimized for data-parallel operations

The GPU model achieves its performance advantage through massive parallelism—4096 CUDA cores executing the same instruction on different data elements simultaneously. The simulation models this by scaling GPU throughput proportionally to the data-parallel dimension of each workload, producing 10–50x speedups over CPU for suitable tasks.

3.3 TPU Model

Table 3: TPU model specifications
Parameter | Value | Rationale
Systolic Array | 128 × 128 | Matrix multiply unit with 16,384 multiply-accumulate units
Clock Frequency | 940 MHz | Lower frequency, higher efficiency per operation
Memory | 32GB HBM | Unified high-bandwidth memory
Parallelism Model | Systolic data flow | Data flows through processing elements in a pipeline
Strengths | Matrix operations, deterministic latency | Purpose-built for tensor computations

The TPU model represents the most specialized architecture, achieving peak efficiency for matrix multiply-accumulate operations through its systolic array. The 128×128 array processes 16,384 multiply-accumulate operations per cycle in a pipelined fashion. The simulation reflects this by giving TPU the highest throughput for matrix-heavy workloads while applying efficiency penalties for operations that don't map well to the systolic array structure.
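The three models above can be collected into a small parameter table. The sketch below takes its values from Tables 1–3, but the identifiers themselves are illustrative assumptions, not the simulator's actual code:

```javascript
// Parameterized processor models (values from Tables 1-3).
// Field names are illustrative assumptions, not the simulator's real keys.
const PROCESSORS = {
  cpu: { name: "CPU", units: 16,    clockGHz: 5.0,  parallelism: "MIMD" },    // 8 cores / 16 threads
  gpu: { name: "GPU", units: 4096,  clockGHz: 1.8,  parallelism: "SIMT" },    // CUDA cores
  tpu: { name: "TPU", units: 16384, clockGHz: 0.94, parallelism: "systolic" } // 128 x 128 MAC array
};
```

Keeping the models as plain data makes the unit-count disparity (16 vs. 4096 vs. 16,384) directly available to both the throughput calculation and the core-grid visualization.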

4. Workload Parameterization

The simulator supports four representative AI workloads, each characterized by distinct computational patterns that exercise the three architectures differently.

Table 4: Supported workloads and their computational characteristics
Workload | Operation | Dominant Pattern | CPU Factor | GPU Factor | TPU Factor
Matrix Multiplication | 1024 × 1024 GEMM | Dense linear algebra | 1.0x | 25x | 45x
CNN Training | ResNet-50, 1 epoch | Conv2D + backprop | 1.0x | 30x | 40x
Batch Inference | Image classification | Forward pass only | 1.0x | 20x | 50x
NLP Transformer | BERT-Base forward | Attention + FFN | 1.0x | 22x | 35x

4.1 Batch Size Scaling

Users configure batch sizes from 32 to 256. Larger batch sizes increase the total computation and amplify the parallelism advantage of GPU and TPU architectures. The simulation models this relationship:

T_processor = T_base × (batch_size / 32) / speedup_factor_processor

The simulation adopts this linear simplification for all three processors. On real hardware, CPUs scale linearly with batch size (limited parallelism), GPUs scale sub-linearly (parallelism saturates at high batch sizes), and TPUs maintain near-linear throughput scaling due to the systolic array's efficient data flow.
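The timing formula translates directly to code. In the sketch below, the speedup factors are the matrix-multiplication column of Table 4, and baseTime is an arbitrary illustrative constant, not a value from the simulator:

```javascript
// T_processor = T_base * (batch_size / 32) / speedup_factor_processor
// Speedup factors from Table 4 (matrix multiplication workload).
const SPEEDUP = { cpu: 1.0, gpu: 25, tpu: 45 };

// Simulated completion time in arbitrary units; baseTime is illustrative.
function simulatedTime(processor, batchSize, baseTime = 10.0) {
  return baseTime * (batchSize / 32) / SPEEDUP[processor];
}
```

For example, at batch size 128 the CPU takes 40 units while the GPU takes 1.6, a 25x gap that holds at every batch size in this baseline model.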

4.2 Simulation Speed Control

A speed multiplier (adjustable via slider) scales the animation tick rate without affecting the relative performance ratios between processors. This allows users to observe the race dynamics at their preferred pace—slow for detailed analysis or fast for rapid comparison.

5. Simulation Engine

5.1 Tick-Based Architecture

The simulation runs on a setInterval-driven tick system. Each tick advances all three processors by their respective throughput amounts, calculated from the processor model parameters and current workload configuration. The tick interval is set to 50ms, providing smooth visual updates at 20 frames per second.

Tick Event → Calculate Progress Increment (per processor) → Apply Stochastic Jitter → Update Progress Bars → Animate Core Grids → Update Metrics Display → Check Completion → Log Events
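The pipeline above can be sketched in vanilla JavaScript as follows. The identifiers (tick, startRace, increment) are assumptions rather than the simulator's actual code, and the per-tick step is kept pure so it can be exercised without a timer:

```javascript
// One simulation tick: advance each unfinished processor by a jittered
// increment. Pure function; `rand` is injectable for deterministic testing.
function tick(progress, processors, rand = Math.random) {
  const next = { ...progress };
  for (const p of processors) {
    if (next[p.id] >= 100) continue;
    const jitter = (rand() - 0.5) * 0.3; // uniform in [-0.15, +0.15]
    next[p.id] = Math.min(100, next[p.id] + p.increment * (1 + jitter));
  }
  return next;
}

// Interval-driven wrapper: 50 ms ticks = 20 visual updates per second.
function startRace(processors, onTick, tickMs = 50) {
  let progress = Object.fromEntries(processors.map(p => [p.id, 0]));
  const timer = setInterval(() => {
    progress = tick(progress, processors);
    onTick(progress); // update bars, core grids, and metrics here
    if (processors.every(p => progress[p.id] >= 100)) clearInterval(timer);
  }, tickMs);
  return timer;
}
```

Separating the pure tick function from the setInterval wrapper also makes the speed-multiplier control trivial: only tickMs changes, so relative progress between processors is unaffected.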

5.2 Stochastic Jitter Model

To prevent the simulation from appearing artificially deterministic, each processor's progress increment includes a random jitter component. The jitter is sampled from a uniform distribution scaled to ±15% of the base increment:

Δprogress_t = Δ_base × (1 + jitter), where jitter ~ U(−0.15, 0.15)

This produces realistic variation in finish times across simulation runs while preserving the expected performance ordering. The jitter also creates visual dynamism in the progress bars, making the race more engaging.

5.3 Completion Detection and Ranking

When a processor's cumulative progress reaches 100%, the simulation records its finish timestamp and assigns a rank (1st, 2nd, 3rd) with corresponding medal badges. The global elapsed timer continues until all processors complete. After all finish, the system calculates speedup ratios relative to the slowest processor and displays a comprehensive results summary.
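The ranking and speedup computation might be sketched as follows; the function and field names are assumptions, not the simulator's own:

```javascript
// Rank processors by finish time; speedup is measured against the slowest
// processor, as described above.
function rankResults(finishTimes) {
  const slowest = Math.max(...Object.values(finishTimes));
  return Object.entries(finishTimes)
    .sort(([, a], [, b]) => a - b)
    .map(([id, time], i) => ({ id, rank: i + 1, time, speedup: slowest / time }));
}
```

With illustrative finish times of 45 s (CPU), 1.8 s (GPU), and 1.0 s (TPU), the TPU ranks first with a 45x speedup over the CPU.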

6. Visualization Architecture

6.1 Processing Race Display

The centerpiece of the simulator is the processing race—three horizontal progress bars representing CPU, GPU, and TPU advancing toward 100% completion. Each lane is color-coded (cyan for CPU, green for GPU, orange for TPU) and displays the processor name, current progress percentage, and status badge (IDLE, RUNNING, or FINISHED). A global elapsed timer above the lanes shows cumulative time in seconds.

6.2 Core Visualization Grids

Each processor card displays an animated grid representing its processing elements:

Table 5: Core visualization grid parameters
Processor | Grid Size | Visual Representation | Animation
CPU | 4 × 4 (16 cores) | Large squares, widely spaced | Sequential pulsing, 1–2 active at a time
GPU | 16 × 16 (256 cores) | Small dots, densely packed | Wave-pattern pulsing across grid
TPU | 8 × 8 (64 units) | Medium squares, uniform grid | Diagonal sweep pattern (systolic flow)

The grid animations are designed to convey each architecture's parallelism model: CPU cores activate individually (limited parallelism), GPU cores activate in large waves (SIMT execution), and TPU units activate in diagonal sweeps (systolic data flow).
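These patterns can be driven by a pure function that maps the current tick to the set of active cells. The selection rules below are illustrative guesses at each pattern, not the simulator's actual logic:

```javascript
// Cells (row-major indices) that pulse on a given tick, per Table 5's
// animation patterns. The exact selection rules are assumptions.
function activeCells(kind, tick, rows, cols) {
  const active = [];
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      const i = r * cols + c;
      if (kind === "cpu" && i === tick % (rows * cols)) active.push(i); // one core at a time
      if (kind === "gpu" && c === tick % cols) active.push(i);          // column-wide wave
      if (kind === "tpu" &&
          (r + c) % (rows + cols) === tick % (rows + cols)) active.push(i); // diagonal sweep
    }
  }
  return active;
}
```

The TPU rule lights the anti-diagonal r + c = tick, mimicking how operands ripple through a systolic array one wavefront per cycle.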

6.3 Architecture Diagrams

ASCII-style technical diagrams illustrate the internal structure of each processor, providing educational context for the performance differences:

CPU Architecture:
┌─────────────────┬──────────────────┐
│ Control Unit    │ Branch Predictor │
├────────┬────────┼──────────────────┤
│  L1$   │  L2$   │     L3 Cache     │
├────────┴────────┴──────────────────┤
│ Core0  Core1  Core2  ...  Core7    │
└────────────────────────────────────┘

GPU Architecture:
┌────────────────────────────┐
│ GPC0: SM0  SM1 ... SM7     │
│ ...                        │
│ GPC3: SM24 SM25 ... SM31   │
│ Each SM: 128 CUDA Cores    │
├────────────────────────────┤
│ HBM2e Memory Controller    │
└────────────────────────────┘

TPU Architecture:
┌────────────────────────────┐
│   128x128 Systolic Array   │
│   [PE][PE][PE]...[PE]      │
│   [PE][PE][PE]...[PE]      │
│   Matrix Multiply Unit     │
├────────────────────────────┤
│   Unified HBM Controller   │
└────────────────────────────┘

6.4 Real-Time Metrics Dashboard

Each processor card displays live metrics updated on every simulation tick:

Table 6: Real-time metrics displayed per processor
Metric | Description | Update Frequency
Operations/Second | Simulated throughput based on processor model | Every tick (50ms)
Tasks Completed | Cumulative count of completed work units | On completion of each unit
Throughput | Data processed per second (GB/s or TFLOPS) | Every tick
Start Time | Timestamp when processing began | Once at start
Elapsed Time | Running duration since start | Every tick
Finish Time | Timestamp when processing completed | Once at completion
Progress | Percentage complete with animated bar | Every tick

6.5 Event Log Console

A scrollable console at the bottom of the interface displays timestamped log entries tracking simulation events. Events include processor initialization, workload assignment, milestone completions (25%, 50%, 75%), finish events with timing data, and final rankings with speedup calculations. The log provides a textual record complementing the visual race display.
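Milestone detection and timestamp formatting for such a log might look like the following sketch; the message format and helper names are assumptions:

```javascript
// Milestones logged as each processor's progress crosses them (see above).
const MILESTONES = [25, 50, 75];

// Milestones newly crossed between two consecutive progress readings.
function crossedMilestones(prevPct, newPct) {
  return MILESTONES.filter(m => prevPct < m && newPct >= m);
}

// Timestamped log line; the "[t s]" prefix format is an illustrative choice.
function logLine(elapsedMs, message) {
  return `[${(elapsedMs / 1000).toFixed(2)}s] ${message}`;
}
```

Comparing consecutive readings (rather than checking the current value alone) guarantees each milestone is logged exactly once even when a fast processor jumps past several thresholds in a single tick.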

7. Results and Analysis Display

Upon completion of all three processors, the simulator presents a comprehensive results panel: the ranked finish order with medal badges, each processor's elapsed time, and speedup ratios computed relative to the slowest processor.

The results display reinforces the quantitative learning by presenting multiple representations of the same performance data—numerical, visual, and chronological.

8. User Interface Design

8.1 Visual Design System

Table 7: Color-coded processor theming
Processor | Primary Color | Background Tint | Accent
CPU | Cyan (#06b6d4) | #0e2433 | #22d3ee
GPU | Green (#22c55e) | #0e2418 | #4ade80
TPU | Orange (#f97316) | #241509 | #fb923c

The dark theme with processor-specific color coding creates immediate visual association between each lane and its processor type. The color scheme is consistent across progress bars, core grids, status badges, architecture diagrams, and results displays.

8.2 Interactive Controls

Table 8: User-configurable simulation parameters
Control | Type | Options | Default
Workload Type | Dropdown | Matrix Mul / CNN Training / Batch Inference / NLP Transformer | Matrix Multiplication
Batch Size | Dropdown | 32 / 64 / 128 / 256 | 64
Simulation Speed | Range slider | 0.5x – 3x | 1x
Start / Reset | Buttons | n/a | n/a

8.3 Responsive Layout

The interface uses CSS Grid and Flexbox to adapt across viewport sizes. On desktop, processor cards are displayed in a three-column layout for side-by-side comparison. On tablet, the layout transitions to a stacked arrangement with preserved race lanes. On mobile, all elements stack vertically with touch-optimized controls.

9. Deployment

9.1 Dual-Platform Deployment

The project is deployed on two platforms simultaneously, demonstrating the portability of the zero-dependency architecture:

Table 9: Deployment platforms
Platform | URL | Deployment Method | Status
GitHub Pages | romizone.github.io/gpu-cpu-tpu-simulator | Automatic on push to main | Completed
Vercel | gpu-cpu-tpu-simulator.vercel.app | Automatic on push to main | Completed

Both deployments serve the identical index.html file with no build step, transpilation, or bundling required. The zero-dependency architecture ensures identical behavior across both CDN networks.

9.2 Local Development

git clone https://github.com/romizone/gpu-cpu-tpu-simulator.git
cd gpu-cpu-tpu-simulator
npx serve .
# Or: python3 -m http.server 3000
# Or: open index.html (direct file access)

10. Implementation Details

10.1 Technology Stack

Table 10: Core technology stack
Layer | Technology | Purpose
Markup | HTML5 | Semantic structure, accessibility
Styling | CSS3 | Grid/Flexbox layout, animations, responsive design, dark theme
Logic | Vanilla JavaScript (ES6+) | Simulation engine, event handling, DOM manipulation
Animation | setInterval + CSS transitions | Tick-based progress updates, core grid pulsing
Hosting | GitHub Pages + Vercel | Dual-platform static deployment with CDN
Source Control | GitHub | Version control with automatic deployment triggers

10.2 Zero-Dependency Architecture

The entire application resides in a single index.html file containing HTML structure, CSS styles, and JavaScript logic. No npm packages, frameworks, build tools, or external CDN references are used. This architecture keeps the application instantly loadable, fully inspectable through view-source, and portable to any static host without a build step.

10.3 Project Structure

gpu-cpu-tpu-simulator/
├── index.html              # Complete application (single-file, all-in-one)
│   ├── <style>             # CSS: dark theme, processor colors, animations, responsive
│   ├── <body>              # HTML: race lanes, processor cards, controls, results
│   └── <script>            # JS: simulation engine, tick loop, core animations, logging
├── LICENSE                 # MIT License
└── README.md               # Documentation and usage guide

11. Educational Value

The simulator serves as an educational bridge between theoretical hardware architecture knowledge and practical performance intuition. Key learning outcomes include:

Table 11: Educational objectives and simulator features
Learning Objective | Simulator Feature | Insight Gained
Parallelism disparity | Core visualization grids | GPU has 256x more cores than CPU; quantity vs. quality trade-off
Workload-architecture fit | Workload selector | TPU dominates matrix ops; GPU excels at training; CPU handles sequential logic
Batch size scaling | Batch size control | Larger batches amplify GPU/TPU advantage over CPU
Systolic array concept | TPU architecture diagram | Data flows through processing elements in a pipelined grid
Performance magnitude | Processing race | GPU/TPU finish 20–50x faster than CPU for parallel workloads
Real-world variance | Stochastic jitter | Performance is not perfectly deterministic; variance exists in real hardware

12. Browser Compatibility

Table 12: Browser compatibility
Browser | CSS Grid | CSS Animations | ES6+ JavaScript | Overall
Chrome 90+ | Full | Full | Full | Full support
Edge 90+ | Full | Full | Full | Full support
Safari 15+ | Full | Full | Full | Full support
Firefox 103+ | Full | Full | Full | Full support
Chrome (Android) | Full | Full | Full | Full support
Safari (iOS 15+) | Full | Full | Full | Full support

13. Future Work

14. Conclusion

GPU vs CPU vs TPU Simulator demonstrates that the fundamental performance characteristics of heterogeneous processor architectures can be made accessible through interactive browser-based visualization. By staging a visual race between three processors across representative AI workloads, users develop intuitive understanding of why GPUs and TPUs dramatically outperform CPUs for data-parallel tensor operations, and how workload characteristics influence the relative advantage of each architecture.

The animated core visualization grids convey the parallelism disparity between architectures more effectively than static diagrams or benchmark tables. The configurable workload selection and batch size controls enable users to explore the workload-architecture fit landscape, discovering that TPUs excel at matrix-heavy inference, GPUs dominate training workloads, and CPUs remain competitive only for sequential or control-flow-heavy operations.

Built as a single HTML file with zero external dependencies and deployed on both GitHub Pages and Vercel, the application is instantly accessible, fully transparent, and suitable for educational contexts ranging from computer architecture courses to AI practitioner workshops. The dual-platform deployment demonstrates the portability benefits of dependency-free web development.

The complete source code is available at https://github.com/romizone/gpu-cpu-tpu-simulator and live demos are accessible at GitHub Pages and Vercel.

References

  1. Jouppi, N.P., Young, C., Patil, N., et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 1–12.
  2. Mattson, P., Cheng, C., Diamos, G., et al. (2020). MLPerf Training Benchmark. Proceedings of Machine Learning and Systems (MLSys), 2, 336–349.
  3. Ignatov, A., Timofte, R., Chou, W., et al. (2019). AI Benchmark: Running Deep Neural Networks on Android Smartphones. Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
  4. Hennessy, J.L. and Patterson, D.A. (2019). Computer Architecture: A Quantitative Approach. 6th Edition. Morgan Kaufmann.
  5. NVIDIA Corporation (2020). NVIDIA A100 Tensor Core GPU Architecture. NVIDIA Whitepaper.
  6. Jouppi, N.P., Yoon, D.H., Ashcraft, M., et al. (2021). Ten Lessons from Three Generations of Tensor Processing Units. Proceedings of the 48th Annual International Symposium on Computer Architecture (ISCA).
  7. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  8. Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT.
  9. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30.
  10. Smilkov, D., Carter, S., Sculley, D., Viégas, F.B., and Wattenberg, M. (2017). Direct-Manipulation Visualization of Deep Networks. ICML Visualization for Deep Learning Workshop.