AI News Presenter Simulator: Virtual News Anchor with Bilingual Text-to-Speech and Animated SVG Avatar

Romi Nur Ismanto
Independent AI Research Lab, Jakarta, Indonesia
rominur@gmail.com
February 2026

Abstract

We present AI News Presenter Simulator, a browser-based virtual news anchor application that combines animated SVG avatars with real-time Text-to-Speech (TTS) synthesis and synchronized subtitles. The system renders a professional female presenter avatar with lip-sync mouth animation, automatic eye blink cycles, subtle breathing motion, and intensified glow effects during broadcast. Audio generation leverages the Web Speech API with bilingual support for Bahasa Indonesia and English, featuring adjustable speech rate (0.5x–2x), pitch control (0.5–2.0), multiple voice selection per language, and automatic voice switching on language change. The TV studio experience includes a running BREAKING news ticker, blinking LIVE badge, real-time clock display, SVG news desk, audio equalizer animation, floating particles, spotlight effects, and glassmorphism UI styling. Real-time subtitles with word-level highlighting provide visual reinforcement of spoken content. The entire application is built as a single HTML file with zero external dependencies using vanilla HTML5, CSS3, and JavaScript, deployed on Vercel with edge CDN distribution. We describe the avatar animation system, TTS integration pipeline, subtitle synchronization mechanism, and responsive design architecture.

Keywords: text-to-speech, TTS, Web Speech API, SVG animation, virtual presenter, news anchor, lip-sync, bilingual, subtitle synchronization, single-page application, zero dependencies, browser-based

1. Introduction

The broadcasting industry has increasingly explored virtual presenters as cost-effective alternatives to human anchors for routine news delivery, weather updates, and informational segments. Traditional virtual presenter systems require substantial infrastructure—video generation models, GPU-intensive lip-sync networks such as Wav2Lip or SadTalker, dedicated backend servers, and complex audio-video synchronization pipelines. These requirements place virtual presenter technology beyond the reach of educators, small newsrooms, and individual creators.

AI News Presenter Simulator demonstrates that a compelling virtual news anchor experience can be achieved entirely within the browser using standard web technologies. By combining SVG-based avatar animation with the Web Speech API for text-to-speech synthesis, the system eliminates the need for external servers, GPU hardware, or third-party API subscriptions. The result is a zero-cost, zero-dependency application that runs on any modern browser.

The key contributions of this work are:

  1. A fully browser-based virtual news anchor packaged as a single HTML file with zero external dependencies, requiring no servers, GPUs, or API subscriptions.
  2. An SVG avatar animation engine that combines lip-sync, randomized eye blinks, breathing motion, and a broadcast glow effect.
  3. A bilingual TTS pipeline for Bahasa Indonesia and English built on the Web Speech API, with adjustable rate, pitch, and per-language voice selection.
  4. A subtitle synchronization mechanism that maps word boundary events to pre-rendered word spans for karaoke-style highlighting.
  5. A TV studio UI layer (news ticker, LIVE badge, real-time clock, equalizer, glassmorphism styling) with a responsive layout for desktop, tablet, and mobile viewports.

2. Related Work

Virtual presenter systems have evolved along two trajectories: deep learning-based approaches and web-based approaches. On the deep learning side, SadTalker (Zhang et al., 2023) generates realistic talking head videos from a single image and audio input using 3D motion coefficients, while Wav2Lip (Prajwal et al., 2020) achieves accurate lip-sync by training on the LRS2 dataset. These systems produce photorealistic results but require GPU inference, introducing latency, cost, and infrastructure dependencies.

The Web Speech API (W3C, 2012) provides browser-native text-to-speech synthesis without server-side processing. While the synthesis quality varies across browsers and operating systems, modern implementations on Chrome (Google TTS), Edge (Microsoft Azure TTS), and Safari (Apple TTS) deliver acceptable quality for informational content. The API exposes SpeechSynthesisUtterance events including boundary events that report the character offset of each spoken word—a capability we leverage for subtitle synchronization.

SVG animation for character representation has been explored in educational contexts and interactive storytelling. Unlike raster-based approaches, SVG avatars scale to any resolution without quality loss, render efficiently on low-powered devices, and can be manipulated programmatically through CSS and JavaScript. The combination of SVG avatars with TTS has not been extensively explored for news presentation scenarios, which is the gap this work addresses.

3. System Architecture

AI News Presenter Simulator is architected as a single-page application contained entirely within one HTML file. The application comprises four major subsystems: the SVG Avatar Engine, the TTS Pipeline, the Subtitle Synchronizer, and the Studio UI Layer. All subsystems communicate through a shared JavaScript state object and DOM event listeners.

3.1 Application Flow

User Input (Text + Language) → TTS Engine (Web Speech API) → Avatar Animation Controller → Lip-Sync + Glow Effects → Subtitle Synchronizer (Word Boundary Events) → Real-Time Display (Ticker + Clock + Equalizer)
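
As a rough illustration of how the shared JavaScript state object coordinates these stages (a minimal sketch; the property and event names are assumptions, not identifiers from the project source), the state might look like:

// Illustrative shared state object; names are assumptions.
const state = {
  language: 'id-ID',   // active TTS locale ('id-ID' or 'en-US')
  rate: 1.0,           // speech rate, 0.5x–2.0x
  pitch: 1.0,          // pitch, 0.5–2.0
  voice: null,         // selected SpeechSynthesisVoice
  speaking: false      // true while an utterance is in progress
};

// Subsystems react to broadcast lifecycle changes via plain DOM events.
document.addEventListener('broadcast:start', () => { state.speaking = true; });
document.addEventListener('broadcast:end', () => { state.speaking = false; });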

3.2 SVG Avatar Engine

The presenter avatar is a hand-crafted inline SVG element with individually addressable components for the face, hair, eyes, mouth, blazer, and body. Each component uses SVG gradient fills for depth and realism. The animation system manages four concurrent animation loops:

Table 1: Avatar animation subsystems
Animation        | Trigger                  | Mechanism                                 | Frequency
Lip-sync (mouth) | TTS speaking state       | CSS class toggle on SVG mouth path        | ~150ms cycle
Eye blink        | Interval timer           | SVG ellipse ry attribute animation        | Every 3–5 seconds (randomized)
Breathing        | CSS animation (infinite) | Subtle translateY on torso group          | 4-second cycle
Broadcast glow   | TTS speaking state       | CSS box-shadow pulse on avatar container  | 2-second pulse cycle

The lip-sync animation operates by toggling between open and closed mouth SVG paths. When the TTS engine is actively speaking, a JavaScript interval alternates the mouth state at approximately 150ms intervals, creating the visual impression of speech. The mouth animation is synchronized with the TTS state rather than individual phonemes—a deliberate design choice that avoids the complexity of phoneme-to-viseme mapping while maintaining visual believability at the application's typical viewing distance.
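
A minimal sketch of this loop is shown below; the element ID and CSS class name ('avatar-mouth', 'mouth-open') are illustrative assumptions rather than the project's actual identifiers.

// Interval-based lip-sync: alternate the mouth shape while TTS is speaking.
const mouth = document.getElementById('avatar-mouth');
let lipSyncTimer = null;

function startLipSync() {
  // Toggle between the open and closed mouth paths roughly every 150 ms.
  lipSyncTimer = setInterval(() => mouth.classList.toggle('mouth-open'), 150);
}

function stopLipSync() {
  clearInterval(lipSyncTimer);
  mouth.classList.remove('mouth-open'); // return to the closed-mouth rest pose
}

// Wired to the utterance lifecycle, e.g.:
// utterance.onstart = startLipSync;  utterance.onend = stopLipSync;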

3.3 Text-to-Speech Pipeline

The TTS subsystem wraps the Web Speech API SpeechSynthesis interface with language management, voice selection, and parameter control:

Table 2: TTS configuration parameters
Parameter   | Range                   | Default              | Control
Language    | id-ID, en-US/en-GB      | id-ID                | Toggle button
Speech Rate | 0.5x – 2.0x             | 1.0x                 | Range slider
Pitch       | 0.5 – 2.0               | 1.0                  | Range slider
Voice       | Available system voices | First matching voice | Dropdown select

When the user switches language, the system performs a cascading update: the voice list is re-filtered for the new locale, the UI labels (buttons, headers, placeholders) switch to the corresponding language, the news template presets update, and any active broadcast is stopped. Voice availability depends on the user's operating system and browser—Chrome on macOS typically offers 20+ voices, while Chrome on Linux may offer fewer options.
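
A minimal sketch of this configuration follows; it uses only standard Web Speech API calls, but the function and variable names are illustrative assumptions. Note that speechSynthesis.getVoices() may return an empty list until the browser's voiceschanged event has fired.

// Configure and speak an utterance for the selected language, rate, and pitch.
function speak(text, lang, rate, pitch) {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = lang;    // 'id-ID', 'en-US', or 'en-GB'
  utterance.rate = rate;    // 0.5–2.0
  utterance.pitch = pitch;  // 0.5–2.0

  // Pick the first system voice whose locale matches the requested language.
  const voices = speechSynthesis.getVoices();
  utterance.voice = voices.find(v => v.lang.startsWith(lang.split('-')[0])) || null;

  speechSynthesis.cancel();  // stop any broadcast that is still in progress
  speechSynthesis.speak(utterance);
  return utterance;
}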

3.4 Subtitle Synchronization

Real-time subtitle display with word-level highlighting is achieved through the SpeechSynthesisUtterance.onboundary event. This event fires at word boundaries during speech synthesis, providing the character index and length of the currently spoken word. The subtitle system:

  1. Pre-renders the full transcript as a sequence of <span> elements, one per word.
  2. Listens for boundary events and maps the reported character offset to the corresponding word span.
  3. Applies a highlight CSS class to the current word, creating a karaoke-style visual effect.
  4. Auto-scrolls the subtitle container to keep the highlighted word visible.

The word highlight uses a distinct background color and increased font weight, providing clear visual indication of the current reading position. This feature is particularly valuable for language learners and hearing-impaired users who benefit from simultaneous audio and visual text presentation.
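
A minimal sketch of this mechanism follows; it assumes an existing SpeechSynthesisUtterance named utterance, and the element ID and class name are illustrative rather than taken from the project source.

// Pre-render one <span> per word and highlight it when its boundary event fires.
const container = document.getElementById('subtitle');
const text = utterance.text;
const words = [];
let pos = 0;

for (const word of text.split(/\s+/).filter(Boolean)) {
  const start = text.indexOf(word, pos); // character offset of this word
  pos = start + word.length;
  const span = document.createElement('span');
  span.textContent = word + ' ';
  container.appendChild(span);
  words.push({ start, span });
}

utterance.onboundary = (event) => {
  if (event.name !== 'word') return;
  // Find the last word whose offset does not exceed the reported charIndex.
  const current = words.filter(w => w.start <= event.charIndex).pop();
  if (!current) return;
  words.forEach(w => w.span.classList.remove('highlight'));
  current.span.classList.add('highlight');
  current.span.scrollIntoView({ block: 'nearest', behavior: 'smooth' });
};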

4. Studio UI Design

The visual design replicates the aesthetics of a professional TV news studio using pure CSS techniques:

4.1 Visual Components

Table 3: Studio UI elements
Element            | Implementation                    | Behavior
LIVE Badge         | CSS animation (opacity pulse)     | Blinks continuously during broadcast
News Ticker        | CSS translateX animation          | Continuous horizontal scroll with BREAKING prefix
Real-time Clock    | JavaScript setInterval            | Updates every second (HH:MM:SS WIB format)
News Desk          | SVG with gradient fills           | Static foreground element in front of avatar
Audio Equalizer    | CSS animation (scaleY randomized) | Animated bars during broadcast, static when idle
Floating Particles | CSS keyframe animation            | Subtle floating dots for ambient depth
Spotlight          | CSS radial-gradient overlay       | Intensifies during broadcast mode
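
The real-time clock from Table 3 can be sketched as follows; the element ID is an assumption, and the en-GB locale is chosen only because it yields colon-separated HH:MM:SS digits.

// Update the studio clock every second in WIB (UTC+7, Asia/Jakarta).
const clockEl = document.getElementById('studio-clock');
const clockFormat = new Intl.DateTimeFormat('en-GB', {
  timeZone: 'Asia/Jakarta',
  hour12: false,
  hour: '2-digit', minute: '2-digit', second: '2-digit'
});
setInterval(() => {
  clockEl.textContent = clockFormat.format(new Date()) + ' WIB';
}, 1000);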

4.2 Glassmorphism Design

The control panel and subtitle container use glassmorphism styling—semi-transparent backgrounds with backdrop-filter: blur()—creating a modern, layered visual effect. This design choice allows the studio background and avatar to remain partially visible behind the controls, reinforcing the immersive broadcast environment.

4.3 Responsive Layout

The layout adapts to three viewport categories using CSS media queries:

Table 4: Responsive breakpoints
Viewport | Width          | Layout Adaptation
Desktop  | ≥ 1024px       | Full studio layout with side-by-side controls
Tablet   | 768px – 1023px | Stacked layout, reduced ticker speed
Mobile   | < 768px        | Single column, touch-optimized controls, compact avatar

5. Implementation Details

5.1 Technology Stack

Table 5: Core technology stack
Layer          | Technology         | Purpose
Markup         | HTML5              | Document structure, semantic elements, inline SVG avatar
Styling        | CSS3               | Animations, glassmorphism, gradients, responsive layout (600+ lines)
Logic          | Vanilla JavaScript | TTS control, animation orchestration, state management
TTS Engine     | Web Speech API     | Browser-native speech synthesis (zero cost, no API key)
Avatar         | Inline SVG         | Vector-based presenter with gradient fills and CSS animation
Typography     | Google Fonts       | Playfair Display, Source Sans 3, JetBrains Mono
Hosting        | Vercel             | Static deployment with edge CDN and CI/CD
Source Control | GitHub             | Version control with automatic Vercel deployment on push

5.2 Zero-Dependency Architecture

A defining architectural decision is the zero-dependency approach: the entire application—HTML structure, CSS styling (600+ lines), SVG avatar definition, and JavaScript logic—resides in a single index.html file. No npm packages, no build step, no bundler, no framework. This choice yields several advantages:

  1. Deployment is trivial: the single file can be served from any static host or opened directly from the filesystem.
  2. There is no build toolchain or dependency tree to install, update, or audit, which keeps maintenance simple.
  3. The complete source is readable in one place, keeping the application fully transparent for educational inspection.
  4. The browser loads one document with no framework bundles or build artifacts.

5.3 News Template System

The application includes four pre-built news script templates to demonstrate its capabilities without requiring users to write content:

Table 6: Pre-built news templates
Template       | Language   | Topic
Berita Utama   | Indonesian | General news headline
Berita Ekonomi | Indonesian | Economic/financial news
Tech News      | English    | Technology news
World News     | English    | International news

Templates automatically switch when the user changes language, ensuring the displayed content always matches the selected TTS language. Users can also write custom scripts in the text area for any topic.
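
A minimal sketch of this template-switching behavior is shown below; the object shape, element ID, and placeholder texts are illustrative rather than the project's actual data.

// Presets keyed by TTS locale; repopulate the selector when the language changes.
const templates = {
  'id-ID': [
    { title: 'Berita Utama', text: '...' },
    { title: 'Berita Ekonomi', text: '...' }
  ],
  'en-US': [
    { title: 'Tech News', text: '...' },
    { title: 'World News', text: '...' }
  ]
};

function onLanguageChange(lang) {
  const select = document.getElementById('template-select');
  select.innerHTML = '';
  for (const t of templates[lang]) {
    select.add(new Option(t.title, t.text)); // title shown to the user, script as value
  }
}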

6. Deployment

6.1 Vercel Deployment (Recommended)

The project is deployed on Vercel with the following configuration:

{
  "rewrites": [{ "source": "/(.*)", "destination": "/index.html" }],
  "headers": [{
    "source": "/(.*)",
    "headers": [
      { "key": "Cache-Control", "value": "public, max-age=3600, s-maxage=86400" },
      { "key": "X-Content-Type-Options", "value": "nosniff" },
      { "key": "X-Frame-Options", "value": "DENY" }
    ]
  }]
}

The configuration enables SPA routing, sets appropriate cache headers for static content, and applies security headers to prevent content-type sniffing and clickjacking attacks. Vercel's edge CDN ensures low-latency delivery globally.

6.2 Local Development

git clone https://github.com/romizone/ai-news-presenter.git
cd ai-news-presenter
npx serve .
# Or: python3 -m http.server 3000

Due to the zero-dependency architecture, the application can also be opened directly as a local file (e.g., open index.html on macOS) without a web server, though some browsers restrict Web Speech API access under the file:// protocol.

7. Browser Compatibility

The Web Speech API is the primary compatibility constraint. The following table summarizes TTS support across major browsers:

Table 7: Browser TTS compatibility
Browser            | TTS Support | Indonesian Voices | Boundary Events
Chrome (Desktop)   | Full        | Yes (Google TTS)  | Yes
Edge (Desktop)     | Full        | Yes (Azure TTS)   | Yes
Safari (macOS/iOS) | Full        | Limited           | Partial
Firefox            | Partial     | OS-dependent      | Limited
Chrome (Android)   | Full        | Yes               | Yes

Chrome and Edge provide the best experience due to their comprehensive voice collections and reliable boundary event firing. The word highlighting feature degrades gracefully on browsers with limited boundary event support—subtitles still display but without word-level tracking.
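
Because the transcript spans are rendered before playback begins, a browser that never fires boundary events still displays the full subtitle, only without word-level tracking. Absence of the API itself can be detected up front, as in this illustrative sketch (element ID assumed):

// Disable the broadcast control entirely when speech synthesis is unavailable.
if (!('speechSynthesis' in window)) {
  document.getElementById('broadcast-btn').disabled = true;
}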

8. Project Structure

ai-news-presenter/
├── index.html              # Complete application (single-file, all-in-one)
│   ├── <style>             # CSS: animations, glassmorphism, responsive (600+ lines)
│   ├── <body>              # HTML: structure + inline SVG avatar
│   └── <script>            # JS: TTS, animation, state management
├── package.json            # Project metadata and npm scripts
├── vercel.json             # Vercel routing and cache configuration
├── LICENSE                 # MIT License
└── README.md               # Documentation

9. Future Work

The current v1.0.0 release establishes the foundation for more advanced virtual presenter capabilities. Planned enhancements include:

10. Conclusion

AI News Presenter Simulator demonstrates that a compelling virtual news anchor experience can be built entirely with standard web technologies—HTML5, CSS3, and vanilla JavaScript—without external dependencies, server infrastructure, or API costs. The combination of SVG avatar animation, Web Speech API synthesis, and word-level subtitle synchronization creates an engaging broadcast simulation that runs on any modern browser.

The zero-dependency, single-file architecture makes the application immediately deployable, easily maintainable, and fully transparent for educational purposes. The bilingual support for Bahasa Indonesia and English, combined with four pre-built news templates, provides a ready-to-use tool for educators, content creators, and developers exploring virtual presenter technology.

The complete source code is available at https://github.com/romizone/ai-news-presenter and a live demo is accessible at https://files-navy-three.vercel.app.

References

  1. Zhang, W., Cun, X., Wang, X., et al. (2023). SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  2. Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V.P., and Jawahar, C.V. (2020). A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. Proceedings of the 28th ACM International Conference on Multimedia.
  3. W3C. (2012). Web Speech API Specification. World Wide Web Consortium. https://wicg.github.io/speech-api/
  4. MDN Web Docs. (2024). SpeechSynthesisUtterance: boundary event. Mozilla Developer Network. https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesisUtterance
  5. Vercel Inc. (2024). Vercel Documentation: Edge Network. Vercel. https://vercel.com/docs
  6. Dahlbäck, N., Jönsson, A., and Ahrenberg, L. (1993). Wizard of Oz Studies: Why and How. Knowledge-Based Systems, 6(4), 258–266.
  7. Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (2000). Embodied Conversational Agents. MIT Press.
  8. W3C. (2011). Scalable Vector Graphics (SVG) 1.1 (Second Edition). World Wide Web Consortium. https://www.w3.org/TR/SVG11/