GPU vs CPU vs TPU Simulator: An Interactive Web-Based Visualization of Heterogeneous Processor Performance for AI Workloads

Romi Nur Ismanto
Jakarta AI Research Lab, Jakarta, Indonesia
rominur@gmail.com
March 2026

Abstract

We present an interactive web-based simulator that visualizes the performance characteristics of three heterogeneous processor architectures—Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Tensor Processing Units (TPUs)—when executing representative AI and deep learning workloads. The simulator features a real-time processing race where three processor lanes compete to complete identical tasks, exposing fundamental architectural differences through animated core visualization grids, live performance metrics, and ASCII architecture diagrams. Users select from four workload types (matrix multiplication, CNN training, batch inference, NLP transformer operations), configure batch sizes and simulation speed, and observe how each processor's design philosophy translates to throughput differences. The system implements a tick-based simulation engine with stochastic jitter to produce realistic performance variance, color-coded processor dashboards with pulsing core animations, timestamped event logging, and ranked finish analysis with speedup calculations. Built as a single HTML file with zero external dependencies using vanilla JavaScript, CSS3 Grid/Flexbox, and interval-driven animation, the application is deployed on both GitHub Pages and Vercel. We describe the simulation model, processor abstraction layer, workload parameterization, and visualization architecture.

Keywords: GPU, CPU, TPU, heterogeneous computing, AI accelerators, performance simulation, CUDA cores, systolic array, matrix multiplication, CNN training, transformer inference, interactive visualization, educational tool, zero dependencies

1. Introduction

The rapid expansion of deep learning has driven the development of specialized hardware accelerators designed to handle the massive parallelism inherent in neural network computations. While CPUs remain the general-purpose backbone of computing, GPUs have become the de facto standard for training deep learning models, and TPUs represent Google's purpose-built approach to tensor operations. Understanding the architectural differences between these processors and their implications for AI workload performance is essential for practitioners making hardware selection decisions.

However, direct access to all three processor types for comparative benchmarking is impractical for most learners and practitioners. GPU clusters require significant investment, and TPU access is limited to Google Cloud Platform. Existing educational materials rely primarily on static benchmark tables, performance charts, and theoretical descriptions that fail to convey the dynamic nature of parallel processing.

GPU vs CPU vs TPU Simulator addresses this educational gap by providing a browser-based interactive tool that simulates the relative performance characteristics of all three processor types across representative AI workloads. Users observe a visual race between processors, examine animated core grids, and analyze real-time metrics, developing an intuitive understanding of why certain architectures excel at specific workload patterns. The key contributions of this work are the parameterized processor models, the workload characterization, the tick-based simulation engine with stochastic jitter, and the visualization architecture, each described in the sections that follow.

2. Related Work

2.1 Heterogeneous Computing Architectures

The divergence between CPU, GPU, and TPU architectures reflects fundamentally different design philosophies. CPUs optimize for single-thread performance with deep pipelines, branch prediction, out-of-order execution, and large cache hierarchies. GPUs sacrifice single-thread performance for massive parallelism, organizing thousands of simpler cores into streaming multiprocessors that execute identical instructions across data elements (SIMT model). TPUs take specialization further, implementing a systolic array architecture where processing elements are arranged in a grid and data flows between them in a pipelined fashion, optimizing specifically for matrix multiply-accumulate operations.

Jouppi et al. (2017) described the TPU architecture in detail, reporting 15–30x performance improvements over contemporary CPUs and GPUs for inference workloads. Subsequent TPU generations (v2, v3, v4) extended this advantage to training, with TPU v4 pods delivering 1.1 exaflops of peak performance. Our simulator abstracts these architectural differences into configurable processor models that produce performance ratios consistent with published benchmarks.

2.2 Performance Benchmarking Tools

MLPerf (Mattson et al., 2020) established standardized benchmarks for comparing ML hardware across training and inference tasks. However, MLPerf results are static tables requiring specialized hardware to reproduce. Tools like AI Benchmark (Ignatov et al., 2019) focus on mobile processors. Our simulator complements these by providing an interactive, visual experience that runs in any web browser without specialized hardware.

2.3 Educational Visualizations

Interactive visualizations have proven effective for teaching computing concepts. TensorFlow Playground demonstrates neural network training, CNN Explainer visualizes convolutional operations, and Transformer Explainer (our prior work) visualizes attention mechanisms. However, no comparable interactive tool exists for comparing hardware architectures in the context of AI workloads. Our work fills this gap with a processor-centric visualization.

3. Processor Models

The simulator abstracts each processor type into a parameterized model that captures the essential performance characteristics without requiring cycle-accurate simulation. Each model is defined by its core count, clock frequency, parallelism factor, and workload-specific throughput scaling.

3.1 CPU Model

Table 1: CPU model specifications
Parameter | Value | Rationale
Core Count | 8 cores / 16 threads | Representative high-end desktop CPU
Clock Frequency | 5.0 GHz | Modern boost clock for single-thread performance
Cache Hierarchy | L1: 32KB, L2: 256KB, L3: 16MB | Standard multi-level cache architecture
Parallelism Model | MIMD (multi-thread) | Independent instruction streams per core
Strengths | Low latency, branch handling, sequential logic | Deep pipeline with branch prediction

The CPU model excels at workloads with complex control flow and data-dependent branching but is constrained by its limited thread count for massively parallel tensor operations. In the simulation, CPU processing speed reflects this by applying a parallelism penalty proportional to the workload's inherent parallelism.

3.2 GPU Model

Table 2: GPU model specifications
Parameter | Value | Rationale
CUDA Cores | 4096 | Representative high-end data center GPU
Clock Frequency | 1.8 GHz | Lower per-core frequency, massive parallelism
Memory | 24GB HBM2e | High-bandwidth memory for tensor throughput
Parallelism Model | SIMT (single instruction, multiple threads) | Thousands of threads executing identical kernels
Strengths | Massive parallelism, high memory bandwidth | Optimized for data-parallel operations

The GPU model achieves its performance advantage through massive parallelism—4096 CUDA cores executing the same instruction on different data elements simultaneously. The simulation models this by scaling GPU throughput proportionally to the data-parallel dimension of each workload, producing 10–50x speedups over CPU for suitable tasks.

3.3 TPU Model

Table 3: TPU model specifications
Parameter | Value | Rationale
Systolic Array | 128 × 128 | Matrix multiply unit with 16,384 multiply-accumulate units
Clock Frequency | 940 MHz | Lower frequency, higher efficiency per operation
Memory | 32GB HBM | Unified high-bandwidth memory
Parallelism Model | Systolic data flow | Data flows through processing elements in a pipeline
Strengths | Matrix operations, deterministic latency | Purpose-built for tensor computations

The TPU model represents the most specialized architecture, achieving peak efficiency for matrix multiply-accumulate operations through its systolic array. The 128×128 array processes 16,384 multiply-accumulate operations per cycle in a pipelined fashion. The simulation reflects this by giving TPU the highest throughput for matrix-heavy workloads while applying efficiency penalties for operations that don't map well to the systolic array structure.
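The three models above can be collected into a small parameter table. The sketch below takes its values from Tables 1–3, but the identifiers themselves are illustrative assumptions, not the simulator's actual code:

```javascript
// Parameterized processor models (values from Tables 1-3).
// Field names are illustrative assumptions, not the simulator's real keys.
const PROCESSORS = {
  cpu: { name: "CPU", units: 16,    clockGHz: 5.0,  parallelism: "MIMD" },    // 8 cores / 16 threads
  gpu: { name: "GPU", units: 4096,  clockGHz: 1.8,  parallelism: "SIMT" },    // CUDA cores
  tpu: { name: "TPU", units: 16384, clockGHz: 0.94, parallelism: "systolic" } // 128 x 128 MAC array
};
```

Keeping the models as plain data makes the unit-count disparity (16 vs. 4096 vs. 16,384) directly available to both the throughput calculation and the core-grid visualization.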

4. Workload Parameterization

The simulator supports four representative AI workloads, each characterized by distinct computational patterns that exercise the three architectures differently.

Table 4: Supported workloads and their computational characteristics
Workload | Operation | Dominant Pattern | CPU Factor | GPU Factor | TPU Factor
Matrix Multiplication | 1024 × 1024 GEMM | Dense linear algebra | 1.0x | 25x | 45x
CNN Training | ResNet-50, 1 epoch | Conv2D + backprop | 1.0x | 30x | 40x
Batch Inference | Image classification | Forward pass only | 1.0x | 20x | 50x
NLP Transformer | BERT-Base forward | Attention + FFN | 1.0x | 22x | 35x

4.1 Batch Size Scaling

Users configure batch sizes from 32 to 256. Larger batch sizes increase the total computation and amplify the parallelism advantage of GPU and TPU architectures. The simulation models this relationship:

T_processor = T_base × (batch_size / 32) / speedup_factor_processor

The simulation adopts this linear simplification for all three processors. On real hardware, CPUs scale linearly with batch size (limited parallelism), GPUs scale sub-linearly (parallelism saturates at high batch sizes), and TPUs maintain near-linear throughput scaling due to the systolic array's efficient data flow.
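The timing formula translates directly to code. In the sketch below, the speedup factors are the matrix-multiplication column of Table 4, and baseTime is an arbitrary illustrative constant, not a value from the simulator:

```javascript
// T_processor = T_base * (batch_size / 32) / speedup_factor_processor
// Speedup factors from Table 4 (matrix multiplication workload).
const SPEEDUP = { cpu: 1.0, gpu: 25, tpu: 45 };

// Simulated completion time in arbitrary units; baseTime is illustrative.
function simulatedTime(processor, batchSize, baseTime = 10.0) {
  return baseTime * (batchSize / 32) / SPEEDUP[processor];
}
```

For example, at batch size 128 the CPU takes 40 units while the GPU takes 1.6, a 25x gap that holds at every batch size in this baseline model.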

4.2 Simulation Speed Control

A speed multiplier (adjustable via slider) scales the animation tick rate without affecting the relative performance ratios between processors. This allows users to observe the race dynamics at their preferred pace—slow for detailed analysis or fast for rapid comparison.

5. Simulation Engine

5.1 Tick-Based Architecture

The simulation runs on a setInterval-driven tick system. Each tick advances all three processors by their respective throughput amounts, calculated from the processor model parameters and current workload configuration. The tick interval is set to 50ms, providing smooth visual updates at 20 frames per second.

Tick Event → Calculate Progress Increment (per processor) → Apply Stochastic Jitter → Update Progress Bars → Animate Core Grids → Update Metrics Display → Check Completion → Log Events
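The pipeline above can be sketched in vanilla JavaScript as follows. The identifiers (tick, startRace, increment) are assumptions rather than the simulator's actual code, and the per-tick step is kept pure so it can be exercised without a timer:

```javascript
// One simulation tick: advance each unfinished processor by a jittered
// increment. Pure function; `rand` is injectable for deterministic testing.
function tick(progress, processors, rand = Math.random) {
  const next = { ...progress };
  for (const p of processors) {
    if (next[p.id] >= 100) continue;
    const jitter = (rand() - 0.5) * 0.3; // uniform in [-0.15, +0.15]
    next[p.id] = Math.min(100, next[p.id] + p.increment * (1 + jitter));
  }
  return next;
}

// Interval-driven wrapper: 50 ms ticks = 20 visual updates per second.
function startRace(processors, onTick, tickMs = 50) {
  let progress = Object.fromEntries(processors.map(p => [p.id, 0]));
  const timer = setInterval(() => {
    progress = tick(progress, processors);
    onTick(progress); // update bars, core grids, and metrics here
    if (processors.every(p => progress[p.id] >= 100)) clearInterval(timer);
  }, tickMs);
  return timer;
}
```

Separating the pure tick function from the setInterval wrapper also makes the speed-multiplier control trivial: only tickMs changes, so relative progress between processors is unaffected.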

5.2 Stochastic Jitter Model

To prevent the simulation from appearing artificially deterministic, each processor's progress increment includes a random jitter component. The jitter is sampled from a uniform distribution scaled to ±15% of the base increment:

Δprogress_t = Δ_base × (1 + jitter), where jitter ~ U(−0.15, 0.15)

This produces realistic variation in finish times across simulation runs while preserving the expected performance ordering. The jitter also creates visual dynamism in the progress bars, making the race more engaging.

5.3 Completion Detection and Ranking

When a processor's cumulative progress reaches 100%, the simulation records its finish timestamp and assigns a rank (1st, 2nd, 3rd) with corresponding medal badges. The global elapsed timer continues until all processors complete. After all finish, the system calculates speedup ratios relative to the slowest processor and displays a comprehensive results summary.
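The ranking and speedup computation might be sketched as follows; the function and field names are assumptions, not the simulator's own:

```javascript
// Rank processors by finish time; speedup is measured against the slowest
// processor, as described above.
function rankResults(finishTimes) {
  const slowest = Math.max(...Object.values(finishTimes));
  return Object.entries(finishTimes)
    .sort(([, a], [, b]) => a - b)
    .map(([id, time], i) => ({ id, rank: i + 1, time, speedup: slowest / time }));
}
```

With illustrative finish times of 45 s (CPU), 1.8 s (GPU), and 1.0 s (TPU), the TPU ranks first with a 45x speedup over the CPU.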

6. Visualization Architecture

6.1 Processing Race Display

The centerpiece of the simulator is the processing race—three horizontal progress bars representing CPU, GPU, and TPU advancing toward 100% completion. Each lane is color-coded (cyan for CPU, green for GPU, orange for TPU) and displays the processor name, current progress percentage, and status badge (IDLE, RUNNING, or FINISHED). A global elapsed timer above the lanes shows cumulative time in seconds.

6.2 Core Visualization Grids

Each processor card displays an animated grid representing its processing elements:

Table 5: Core visualization grid parameters
Processor | Grid Size | Visual Representation | Animation
CPU | 4 × 4 (16 cores) | Large squares, widely spaced | Sequential pulsing, 1–2 active at a time
GPU | 16 × 16 (256 cores) | Small dots, densely packed | Wave-pattern pulsing across grid
TPU | 8 × 8 (64 units) | Medium squares, uniform grid | Diagonal sweep pattern (systolic flow)

The grid animations are designed to convey each architecture's parallelism model: CPU cores activate individually (limited parallelism), GPU cores activate in large waves (SIMT execution), and TPU units activate in diagonal sweeps (systolic data flow).
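These patterns can be driven by a pure function that maps the current tick to the set of active cells. The selection rules below are illustrative guesses at each pattern, not the simulator's actual logic:

```javascript
// Cells (row-major indices) that pulse on a given tick, per Table 5's
// animation patterns. The exact selection rules are assumptions.
function activeCells(kind, tick, rows, cols) {
  const active = [];
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      const i = r * cols + c;
      if (kind === "cpu" && i === tick % (rows * cols)) active.push(i); // one core at a time
      if (kind === "gpu" && c === tick % cols) active.push(i);          // column-wide wave
      if (kind === "tpu" &&
          (r + c) % (rows + cols) === tick % (rows + cols)) active.push(i); // diagonal sweep
    }
  }
  return active;
}
```

The TPU rule lights the anti-diagonal r + c = tick, mimicking how operands ripple through a systolic array one wavefront per cycle.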

6.3 Architecture Diagrams

ASCII-style technical diagrams illustrate the internal structure of each processor, providing educational context for the performance differences:

CPU Architecture:
┌─────────────────┬──────────────────┐
│ Control Unit    │ Branch Predictor │
├────────┬────────┼──────────────────┤
│  L1$   │  L2$   │     L3 Cache     │
├────────┴────────┴──────────────────┤
│ Core0  Core1  Core2  ...  Core7    │
└────────────────────────────────────┘

GPU Architecture:
┌────────────────────────────┐
│ GPC0: SM0  SM1 ... SM7     │
│ ...                        │
│ GPC3: SM24 SM25 ... SM31   │
│ Each SM: 128 CUDA Cores    │
├────────────────────────────┤
│ HBM2e Memory Controller    │
└────────────────────────────┘

TPU Architecture:
┌────────────────────────────┐
│   128x128 Systolic Array   │
│   [PE][PE][PE]...[PE]      │
│   [PE][PE][PE]...[PE]      │
│   Matrix Multiply Unit     │
├────────────────────────────┤
│   Unified HBM Controller   │
└────────────────────────────┘

6.4 Real-Time Metrics Dashboard

Each processor card displays live metrics updated on every simulation tick:

Table 6: Real-time metrics displayed per processor
Metric | Description | Update Frequency
Operations/Second | Simulated throughput based on processor model | Every tick (50ms)
Tasks Completed | Cumulative count of completed work units | On completion of each unit
Throughput | Data processed per second (GB/s or TFLOPS) | Every tick
Start Time | Timestamp when processing began | Once at start
Elapsed Time | Running duration since start | Every tick
Finish Time | Timestamp when processing completed | Once at completion
Progress | Percentage complete with animated bar | Every tick

6.5 Event Log Console

A scrollable console at the bottom of the interface displays timestamped log entries tracking simulation events. Events include processor initialization, workload assignment, milestone completions (25%, 50%, 75%), finish events with timing data, and final rankings with speedup calculations. The log provides a textual record complementing the visual race display.
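Milestone detection and timestamp formatting for such a log might look like the following sketch; the message format and helper names are assumptions:

```javascript
// Milestones logged as each processor's progress crosses them (see above).
const MILESTONES = [25, 50, 75];

// Milestones newly crossed between two consecutive progress readings.
function crossedMilestones(prevPct, newPct) {
  return MILESTONES.filter(m => prevPct < m && newPct >= m);
}

// Timestamped log line; the "[t s]" prefix format is an illustrative choice.
function logLine(elapsedMs, message) {
  return `[${(elapsedMs / 1000).toFixed(2)}s] ${message}`;
}
```

Comparing consecutive readings (rather than checking the current value alone) guarantees each milestone is logged exactly once even when a fast processor jumps past several thresholds in a single tick.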

7. Results and Analysis Display

Upon completion of all three processors, the simulator presents a comprehensive results panel: the ranked finish order with medal badges, each processor's elapsed time, and speedup ratios computed relative to the slowest processor.

The results display reinforces the quantitative learning by presenting multiple representations of the same performance data—numerical, visual, and chronological.

8. User Interface Design

8.1 Visual Design System

Table 7: Color-coded processor theming
Processor | Primary Color | Background Tint | Accent
CPU | Cyan (#06b6d4) | #0e2433 | #22d3ee
GPU | Green (#22c55e) | #0e2418 | #4ade80
TPU | Orange (#f97316) | #241509 | #fb923c

The dark theme with processor-specific color coding creates immediate visual association between each lane and its processor type. The color scheme is consistent across progress bars, core grids, status badges, architecture diagrams, and results displays.

8.2 Interactive Controls

Table 8: User-configurable simulation parameters
Control | Type | Options | Default
Workload Type | Dropdown | Matrix Mul / CNN Training / Batch Inference / NLP Transformer | Matrix Multiplication
Batch Size | Dropdown | 32 / 64 / 128 / 256 | 64
Simulation Speed | Range slider | 0.5x – 3x | 1x
Start / Reset | Buttons | n/a | n/a

8.3 Responsive Layout

The interface uses CSS Grid and Flexbox to adapt across viewport sizes. On desktop, processor cards are displayed in a three-column layout for side-by-side comparison. On tablet, the layout transitions to a stacked arrangement with preserved race lanes. On mobile, all elements stack vertically with touch-optimized controls.

9. Deployment

9.1 Dual-Platform Deployment

The project is deployed on two platforms simultaneously, demonstrating the portability of the zero-dependency architecture:

Table 9: Deployment platforms
Platform | URL | Deployment Method | Status
GitHub Pages | romizone.github.io/gpu-cpu-tpu-simulator | Automatic on push to main | Completed
Vercel | gpu-cpu-tpu-simulator.vercel.app | Automatic on push to main | Completed

Both deployments serve the identical index.html file with no build step, transpilation, or bundling required. The zero-dependency architecture ensures identical behavior across both CDN networks.

9.2 Local Development

git clone https://github.com/romizone/gpu-cpu-tpu-simulator.git
cd gpu-cpu-tpu-simulator
npx serve .
# Or: python3 -m http.server 3000
# Or: open index.html (direct file access)

10. Implementation Details

10.1 Technology Stack

Table 10: Core technology stack
Layer | Technology | Purpose
Markup | HTML5 | Semantic structure, accessibility
Styling | CSS3 | Grid/Flexbox layout, animations, responsive design, dark theme
Logic | Vanilla JavaScript (ES6+) | Simulation engine, event handling, DOM manipulation
Animation | setInterval + CSS transitions | Tick-based progress updates, core grid pulsing
Hosting | GitHub Pages + Vercel | Dual-platform static deployment with CDN
Source Control | GitHub | Version control with automatic deployment triggers

10.2 Zero-Dependency Architecture

The entire application resides in a single index.html file containing HTML structure, CSS styles, and JavaScript logic. No npm packages, frameworks, build tools, or external CDN references are used. This architecture keeps the application instantly loadable, fully inspectable through view-source, and portable to any static host without a build step.

10.3 Project Structure

gpu-cpu-tpu-simulator/
├── index.html              # Complete application (single-file, all-in-one)
│   ├── <style>             # CSS: dark theme, processor colors, animations, responsive
│   ├── <body>              # HTML: race lanes, processor cards, controls, results
│   └── <script>            # JS: simulation engine, tick loop, core animations, logging
├── LICENSE                 # MIT License
└── README.md               # Documentation and usage guide

11. Educational Value

The simulator serves as an educational bridge between theoretical hardware architecture knowledge and practical performance intuition. Key learning outcomes include:

Table 11: Educational objectives and simulator features
Learning Objective | Simulator Feature | Insight Gained
Parallelism disparity | Core visualization grids | GPU has 256x more cores than CPU; quantity vs. quality trade-off
Workload-architecture fit | Workload selector | TPU dominates matrix ops; GPU excels at training; CPU handles sequential logic
Batch size scaling | Batch size control | Larger batches amplify GPU/TPU advantage over CPU
Systolic array concept | TPU architecture diagram | Data flows through processing elements in a pipelined grid
Performance magnitude | Processing race | GPU/TPU finish 20–50x faster than CPU for parallel workloads
Real-world variance | Stochastic jitter | Performance is not perfectly deterministic; variance exists in real hardware

12. Browser Compatibility

Table 12: Browser compatibility
Browser | CSS Grid | CSS Animations | ES6+ JavaScript | Overall
Chrome 90+ | Full | Full | Full | Full support
Edge 90+ | Full | Full | Full | Full support
Safari 15+ | Full | Full | Full | Full support
Firefox 103+ | Full | Full | Full | Full support
Chrome (Android) | Full | Full | Full | Full support
Safari (iOS 15+) | Full | Full | Full | Full support

13. Future Work

14. Conclusion

GPU vs CPU vs TPU Simulator demonstrates that the fundamental performance characteristics of heterogeneous processor architectures can be made accessible through interactive browser-based visualization. By staging a visual race between three processors across representative AI workloads, users develop intuitive understanding of why GPUs and TPUs dramatically outperform CPUs for data-parallel tensor operations, and how workload characteristics influence the relative advantage of each architecture.

The animated core visualization grids convey the parallelism disparity between architectures more effectively than static diagrams or benchmark tables. The configurable workload selection and batch size controls enable users to explore the workload-architecture fit landscape, discovering that TPUs excel at matrix-heavy inference, GPUs dominate training workloads, and CPUs remain competitive only for sequential or control-flow-heavy operations.

Built as a single HTML file with zero external dependencies and deployed on both GitHub Pages and Vercel, the application is instantly accessible, fully transparent, and suitable for educational contexts ranging from computer architecture courses to AI practitioner workshops. The dual-platform deployment demonstrates the portability benefits of dependency-free web development.

The complete source code is available at https://github.com/romizone/gpu-cpu-tpu-simulator and live demos are accessible at GitHub Pages and Vercel.

References

  1. Jouppi, N.P., Young, C., Patil, N., et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 1–12.
  2. Mattson, P., Cheng, C., Diamos, G., et al. (2020). MLPerf Training Benchmark. Proceedings of Machine Learning and Systems (MLSys), 2, 336–349.
  3. Ignatov, A., Timofte, R., Chou, W., et al. (2019). AI Benchmark: Running Deep Neural Networks on Android Smartphones. Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
  4. Hennessy, J.L. and Patterson, D.A. (2019). Computer Architecture: A Quantitative Approach. 6th Edition. Morgan Kaufmann.
  5. NVIDIA Corporation (2020). NVIDIA A100 Tensor Core GPU Architecture. NVIDIA Whitepaper.
  6. Jouppi, N.P., Yoon, D.H., Ashcraft, M., et al. (2021). Ten Lessons from Three Generations of Tensor Processing Units. Proceedings of the 48th Annual International Symposium on Computer Architecture (ISCA).
  7. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  8. Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT.
  9. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30.
  10. Smilkov, D., Carter, S., Sculley, D., Viégas, F.B., and Wattenberg, M. (2017). Direct-Manipulation Visualization of Deep Networks. ICML Visualization for Deep Learning Workshop.