Published: April 28, 2026
Alibaba’s Taotian Future Life Lab, operating under the Alibaba Token Hub division, has officially launched HappyHorse-1.0, an AI video generation model that holds the number one Elo ranking on the Artificial Analysis Video Arena in both the Text-to-Video and Image-to-Video categories. Developer and enterprise API access went live on April 27, 2026 through fal, making HappyHorse-1.0 immediately available via four endpoints: text-to-video, image-to-video, reference-to-video, and video-edit.
HappyHorse-1.0 is a 15-billion-parameter unified Transformer that generates synchronized video and audio in a single forward pass, with native lip-sync support across seven languages. The model produces 1080p output in approximately 38 seconds on a single NVIDIA H100 GPU.
What can HappyHorse-1.0 do?
HappyHorse-1.0 is a unified 40-layer self-attention Transformer that generates video and audio jointly in a single forward pass, without cross-attention modules and without a separate audio post-processing step. This architecture means the model produces synchronized audiovisual output natively, including lip-sync across seven languages:
- English,
- Mandarin,
- Cantonese,
- Japanese,
- Korean,
- German,
- French.
The model supports four API endpoints:
- text-to-video (generate a video from a text prompt),
- image-to-video (animate a still image),
- reference-to-video (maintain consistent character identity across shots),
- video-edit (modify existing video content).
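As a minimal sketch, endpoint selection can be expressed as a small helper. The `alibaba/happy-horse/<endpoint>` path format is taken from the URL pattern quoted later in this article and should be treated as an assumption, not confirmed API documentation:

```python
# The four documented endpoint tasks.
ENDPOINTS = ("text-to-video", "image-to-video", "reference-to-video", "video-edit")

def endpoint_id(task: str) -> str:
    """Return the fal application ID for one of the four HappyHorse-1.0 tasks.

    The "alibaba/happy-horse" prefix mirrors the path shown on fal's model
    listing as described in this article; verify it against fal's docs.
    """
    if task not in ENDPOINTS:
        raise ValueError(f"unknown task {task!r}; expected one of {ENDPOINTS}")
    return f"alibaba/happy-horse/{task}"
```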
Output is available at 720p and 1080p, in aspect ratios including 16:9, 9:16, 1:1, 4:3, and 3:4, making it suitable for platforms ranging from YouTube to TikTok and Instagram.
Camera direction fidelity is a notable feature: HappyHorse-1.0 responds to specific cinematographic cues such as “slow dolly push-in,” “overhead crane shot,” and wind intensity variations. The model also supports multi-shot sequences with consistent character identity across frames, which is relevant for product promos, social content, and short-form storytelling.
HappyHorse-1.0 benchmarks and technical specs
On the Artificial Analysis Video Arena, which ranks models by blind human preference votes (users compare two unlabeled clips side by side), HappyHorse-1.0 earned an Elo of 1333 in Text-to-Video and 1392 in Image-to-Video, both evaluated without audio. When audio is included in the evaluation, the model scores 1238 Elo. These scores place HappyHorse-1.0 above every other video generation model currently benchmarked on the platform, including ByteDance’s Dreamina Seedance 2.0, which trails by nearly 115 Elo points in Text-to-Video.
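To give the gap some intuition: assuming the Arena uses the standard Elo expected-score formula (a reasonable but unverified assumption about its rating method), a 115-point lead translates to roughly a 66% chance that the higher-rated model wins any given blind comparison:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# HappyHorse-1.0 at 1333 vs. a rival roughly 115 points behind:
p = elo_expected_score(1333, 1333 - 115)
# p is approximately 0.66
```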
The model contains 15 billion parameters arranged in a 40-layer self-attention Transformer architecture. On a single NVIDIA H100 GPU, HappyHorse-1.0 generates 1080p video in approximately 38 seconds, and a 5-second clip at 256p resolution in roughly 2 seconds. Development was led by Zhang Di, a 15-year AI industry veteran who previously served as a VP at Kuaishou and as the technical architect of Kling AI before rejoining Alibaba in late 2025.
How does HappyHorse-1.0 compare to other AI video models?
HappyHorse-1.0 is the first video model to reach the top position on the Artificial Analysis leaderboard. Its primary differentiator compared to competitors like Dreamina Seedance 2.0, Sora, Veo, Runway, Pika, and Luma is the joint audio-video generation capability. Where most competing models generate silent video and require a separate audio pipeline, HappyHorse-1.0 produces video with synchronized dialogue, ambient sound, and Foley effects in a single inference step.
The native multilingual lip-sync across seven languages is another area where HappyHorse-1.0 stands apart. Most competing models either lack lip-sync entirely or support only English. The combination of high visual fidelity (confirmed by the Arena’s blind preference voting) with native audio makes HappyHorse-1.0 particularly relevant for use cases that require talking-head content, product demonstrations with voiceover, or multilingual social media production.
HappyHorse-1.0 pricing and availability
HappyHorse-1.0 is available now through fal’s generative media platform. Pricing is set at $0.14 per second of generated video at 720p resolution and $0.28 per second at 1080p, with no minimum spend or subscription requirement. Enterprise pricing is available on request. fal guarantees full commercial rights for all generated outputs.
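At the listed per-second rates, estimating a clip's cost is simple arithmetic. A minimal sketch, with the rates copied from the launch pricing above and the function name hypothetical:

```python
# USD per second of generated video, per the launch pricing.
RATE_PER_SECOND = {"720p": 0.14, "1080p": 0.28}

def clip_cost(duration_s: float, resolution: str = "720p") -> float:
    """Estimate the cost in USD of a generated clip at the given resolution."""
    try:
        rate = RATE_PER_SECOND[resolution]
    except KeyError:
        raise ValueError(f"unsupported resolution {resolution!r}") from None
    return round(duration_s * rate, 2)

# A 10-second 1080p clip costs 10 * 0.28 = $2.80
```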
Developers can integrate HappyHorse-1.0 using fal’s Python and JavaScript SDKs. The four API endpoints are accessible at fal.ai/models/alibaba/happy-horse/ followed by text-to-video, image-to-video, reference-to-video, or video-edit. A playground for non-technical users is also available on the fal website.
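A minimal Python sketch using fal's official client (`pip install fal-client`). The application ID follows the path quoted above, and the payload field names (`prompt`, `resolution`, `aspect_ratio`) are assumptions, so check the endpoint's schema in fal's documentation before relying on them:

```python
def build_arguments(prompt: str, resolution: str = "1080p",
                    aspect_ratio: str = "16:9") -> dict:
    """Assemble a request payload. Field names here are hypothetical."""
    if resolution not in ("720p", "1080p"):
        raise ValueError(f"unsupported resolution {resolution!r}")
    return {"prompt": prompt, "resolution": resolution, "aspect_ratio": aspect_ratio}

def generate_text_to_video(prompt: str, **kwargs) -> dict:
    """Submit a text-to-video request and block until the result is ready.

    Requires a FAL_KEY in the environment. The import is deferred so the
    payload helper above stays dependency-free.
    """
    import fal_client
    return fal_client.subscribe(
        "alibaba/happy-horse/text-to-video",  # ID taken from this article; verify on fal.ai
        arguments=build_arguments(prompt, **kwargs),
    )
```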
HappyHorse-1.0 is also available via Alibaba Cloud Model Studio (Bailian), where an introductory 10% discount is offered for early access users. The model was developed by Alibaba’s Taotian Future Life Lab and is accessible globally from launch.
For full technical details and API documentation, visit the official HappyHorse-1.0 page on fal.