

The Explosion of Generative AI Video. Between the Hype of the Giants and the Shadow of Open Source. Ovi and the Other Rebels.

Generative AI · Business · Startups


2024 was to generative AI video what 1991 was to grunge: the sudden explosion of something that had been brewing for a long time. In February 2024 OpenAI dropped a bombshell with Sora 1, showing generated videos that were hailed as a technological miracle; just a year later, the landscape has multiplied exponentially.

Google responded with Veo, Meta pulled Movie Gen out of its hat, and dozens of university labs and startups began publishing their own models at a feverish pace. Just like in that Seattle music explosion, where alongside Nirvana, bands like Soundgarden, Pearl Jam, and Alice in Chains emerged, today alongside the tech giants, open-source projects are appearing, promising to democratize a technology still largely locked away in the data centers of big tech.

The timeline of this acceleration is dizzying. Since Sora 1 showed the world that it was possible to generate photorealistic videos from simple text prompts, the sector has been in a frantic race. In September 2024, Google launched Veo 2, focusing on a cinematic style and impressive visual quality. Meta, not wanting to be left behind, introduced Movie Gen in October 2024, a system capable of producing videos up to 16 seconds long with synchronized music and sound effects. And then, in September 2025, Sora 2 arrived, with the long-awaited addition of synchronized audio, including dialogue and sound effects. Meanwhile, the open-source world has not been idle: projects like Tencent's HunyuanVideo, Mochi 1, Open-Sora 1.3, and now Ovi have begun to offer concrete alternatives, albeit with very different approaches and results.

Yet, behind the glittering marketing and breathtaking demo videos, the reality is more complex and less democratic than press releases would have us believe. The models from the tech giants are still largely inaccessible, locked behind endless waitlists, premium subscriptions, or simply not available to the public. And this brings us to the central question: is the hype generated by big companies proportional to the results actually accessible to users? Or are we facing yet another marketing operation that promises revolutions while delivering limited access and proprietary models that are impossible to modify or study?

The Giants and Their Golden Locks

Sora 2, presented by OpenAI with great media fanfare in September 2025, represents a significant evolutionary leap on paper. The model generates videos up to 20 seconds long with synchronized audio, dialogue, and sound effects, promising cinematic quality and impressive narrative coherence. The demo videos show complex scenes, with fluid camera movements and a realism bordering on photorealism. However, as is often the case in the OpenAI world, access is far from open. Sora 2 is only available to ChatGPT Plus and Pro subscribers, with costs starting at $20 per month and potentially rising considerably for those who want to generate more than a few clips a month. There is no stable public API, no possibility of downloading the model weights, and no technical documentation that allows a real understanding of how the system works. It's as if Led Zeppelin had released "Stairway to Heaven" but only on a jukebox accessible with a monthly fee, without ever releasing the record.

Google, for its part, has bet everything on Veo 3, a model that emphasizes a cinematic style and professional-level visual quality. Veo 3 generates videos up to 2 minutes long, supports high resolutions, and includes synchronized music and ambient audio. Here too, however, access is reserved for premium users of Google One AI Premium, with costs starting at $19.99 per month. The VideoFX platform, where Veo 3 is implemented, offers a user-friendly interface but remains a completely closed system: no code available, no possibility of fine-tuning, no transparency on training data. It is the classic walled garden approach: it works well, has an impeccable aesthetic, but you are completely at the mercy of Mountain View's decisions.

Meta has chosen a seemingly more open path with Movie Gen and the Vibes platform, integrated into the Meta AI app. Movie Gen generates videos up to 16 seconds long with music and sound effects, and Vibes allows users to create, remix, and share short clips in a social and creative experience. At the moment, the service is free, which makes it the most accessible among the giants' models. But here the crucial question arises, well highlighted by a critical analysis by Facta: how long will it remain free? Meta's strategy is clear: create ecosystem dependency, collect user data in industrial quantities, and then eventually monetize when the user base is large enough and locked in. Movie Gen is not available for download, there is no self-hosted version, and the technical documentation is vague and partial. It is the freemium model taken to the extreme: free today, but who knows about tomorrow.

The point is that all these models, however technically impressive, share the same structural limitation: they are proprietary systems, inaccessible in their internal workings, and modifiable only in the superficial parameters that the companies decide to expose. You cannot study them, you cannot adapt them to your specific needs, you cannot verify how they were trained or what data they have seen. And above all, you cannot guarantee that they will continue to exist or be accessible in six months or a year. When you rely on a closed-source model, you are in fact renting technology, not owning it. And the rent can go up, the contract can change, the service can be discontinued. It has happened with Google APIs, with Amazon services, with Microsoft platforms. Why should it be different with generative models?

Image from the Sora 2 trailer

Ovi: Garage Band vs. Major Label

And this is where Ovi comes in, a project developed by Character.AI and released as fully open source in September 2025. To continue with the musical metaphor, if Sora and Veo are the multi-million dollar productions of major record labels, with state-of-the-art recording studios and unlimited budgets, Ovi is the band that records its EP in the garage, with assembled equipment and a lot of passion. But as often happens in the history of music, it is precisely from those garages that the most interesting and innovative records emerge.

Ovi is based on an architecture that its creators call twin backbone cross-modal fusion, an approach that sounds complicated but is conceptually elegant. Instead of first generating the video and then adding the audio in post-production, or vice versa, Ovi models the two modalities as a single generative process. The system uses two identical DiT (Diffusion Transformer) modules, one for video and one for audio, which are trained jointly through bidirectional cross-attention mechanisms. In practice, the two modules constantly talk to each other during the generation process, exchanging temporal and semantic information. This allows for a natural synchronization between images and sounds, without the need for separate pipelines or post-hoc alignments.
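To make the idea concrete, here is a minimal, purely illustrative sketch of a twin-tower fusion block in PyTorch. This is not Ovi's actual code: the layer names, dimensions, and toy token counts are invented, and the real model stacks many such blocks inside full DiT towers with timestep conditioning. The point is simply to show what "bidirectional cross-attention" means in practice: each modality attends to itself, then queries the other.

```python
# Illustrative sketch only: a toy "twin backbone" fusion block, not Ovi's implementation.
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    """One fusion step: each tower self-attends, then attends to the other tower."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Bidirectional cross-attention: video queries audio, audio queries video.
        self.video_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # Self-attention inside each modality.
        v, _ = self.video_self(video_tokens, video_tokens, video_tokens)
        a, _ = self.audio_self(audio_tokens, audio_tokens, audio_tokens)
        v = self.norm_v(video_tokens + v)
        a = self.norm_a(audio_tokens + a)
        # Exchange of information between the two towers.
        v_fused, _ = self.video_from_audio(v, a, a)   # video attends to audio
        a_fused, _ = self.audio_from_video(a, v, v)   # audio attends to video
        return v + v_fused, a + a_fused

# Toy forward pass: 32 video tokens and 48 audio tokens, hidden size 256.
video = torch.randn(1, 32, 256)
audio = torch.randn(1, 48, 256)
block = CrossModalFusionBlock()
video_out, audio_out = block(video, audio)
print(video_out.shape, audio_out.shape)  # torch.Size([1, 32, 256]) torch.Size([1, 48, 256])
```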

The technical documentation published on arXiv is transparent and detailed. The training is divided into two phases: first, an audio tower that mirrors the architecture of a pre-trained video model is initialized and trained from scratch on hundreds of thousands of hours of raw audio. In this phase, the model learns to generate realistic sound effects and speech with rich identity and emotion. In the second phase, the two towers are trained jointly on a vast video corpus, exchanging timing through scaled RoPE embeddings and semantics through bidirectional cross-attention. The result is a model capable of simultaneously generating synchronized video and audio, including dialogue, sound effects, and background music.
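The "scaled RoPE" idea can also be sketched in a few lines. One plausible reading, illustrated below with assumed token rates (the real figures and the paper's exact scaling may differ), is to express both towers' rotary positions in seconds rather than in token indices, so that an audio token and a video frame occurring at the same instant carry matching positional phases when they cross-attend.

```python
# Illustrative sketch of rescaling rotary positions onto a shared time axis.
# Token rates and dimensions are invented for clarity; not Ovi's actual values.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for a batch of (possibly fractional) positions."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions[:, None] * freqs[None, :]   # shape: (seq_len, dim / 2)

seconds = 5.0
video_fps, audio_tokens_per_sec = 24, 50          # hypothetical token rates
video_pos = torch.arange(int(seconds * video_fps), dtype=torch.float32) / video_fps
audio_pos = torch.arange(int(seconds * audio_tokens_per_sec), dtype=torch.float32) / audio_tokens_per_sec

# Both towers now index positions in seconds, so a video token at t = 2.0 s and an
# audio token at t = 2.0 s see identical rotary phases during cross-attention.
video_angles = rope_angles(video_pos, dim=64)
audio_angles = rope_angles(audio_pos, dim=64)
print(video_angles.shape, audio_angles.shape)  # (120, 32) and (250, 32)
```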

Now, it must be said clearly: Ovi does not beat Sora 2 or Veo 3 in pure visual quality. The generated videos are limited to 5 seconds, compared to Sora's 20 or Veo's 120. The resolution is lower, the fluidity of movement is less refined, and the ability to handle complex scenes with many moving elements is still nascent. But this comparison, while inevitable, is also a bit misleading. It's like comparing a Hollywood production with an independent short film: of course, the technical difference is obvious, but the latter is not necessarily less interesting or useful than the former.

Ovi's real strength is not its technical superiority, which honestly does not exist, but the open-source philosophy that governs it. The code is fully available on GitHub, the model weights are downloadable, and the documentation is accessible and understandable. You can study how the system works, modify it, adapt it to your needs, and integrate it into larger projects. You can fine-tune it on specific datasets, experiment with different architectures, and contribute to the community with improvements and fixes. And above all, you can do it locally, on your own hardware, without depending on remote servers, APIs that can change or be disabled, or usage policies that change overnight.

Of course, the hardware requirements are not trivial. To run Ovi smoothly, you need at least a high-end GPU, with 24GB of VRAM or more, and a good amount of system RAM. It's not exactly something you can run on your laptop while on the train. But for a small company, a creative studio, a university lab, or even an enthusiast with a dedicated budget, it is absolutely feasible. We are talking about a few thousand euros of hardware, versus hundreds of euros per month for subscriptions to closed-source services that could change conditions or disappear tomorrow.
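Before committing to a local setup, it is worth checking what your GPU actually offers; a few lines of PyTorch are enough. The 24 GB figure below is the ballpark mentioned above, not a hard limit: half-precision weights, quantization, or CPU offloading can squeeze large models into less VRAM, at the cost of speed.

```python
# Quick sanity check before trying to run a large open video model locally:
# how much VRAM does the available GPU actually expose?
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 24:
        print("Below the ~24 GB often suggested for full-precision video models;")
        print("consider fp16/bf16 weights, CPU offloading, or smaller checkpoints.")
else:
    print("No CUDA GPU detected; generation will be impractical on CPU.")
```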

And there is another often underestimated aspect: the ability to verify what the model has learned and how it uses it. With closed-source models, you are completely in the dark about the training data. Did they include copyrighted material? Did they use videos without the creators' consent? Did they introduce problematic biases? You can't know. With Ovi, at least in theory, you can analyze the code, study the architectural decisions, and have a deeper understanding of what is really happening under the hood. It is not just a matter of ethical transparency, but also of technical control and debugging capability.

Image from the Ovi trailer

The Other "Rebels" of Open-Source Video

Ovi is not alone in its battle to democratize generative AI video. A diverse ecosystem of open-source projects has formed around it, each with its own specific approaches, strengths, and limitations. It's a bit like the hardcore punk scene of the eighties: many small independent labels, many bands playing in basements, few resources but a lot of determination.

HunyuanVideo, developed by Tencent, is perhaps the most ambitious project in this landscape. It aims to generate videos up to 10 seconds long with a visual quality that approaches commercial models, and it supports high resolutions. The architecture is based on diffusion transformers, similar to Sora's, and the model is trained on a huge dataset of Chinese and international videos. Its strength is the fluidity of movement and temporal coherence, but audio is still absent. And here you can see the difference between a project supported by a corporation like Tencent and smaller initiatives: the resources are there, the results too, but accessibility is limited by prohibitive hardware requirements and a complex setup.

Mochi 1, on the other hand, comes from the startup Genmo, which released it under a permissive Apache 2.0 license. It is a text-to-video model that trades resolution and duration for openness, and it has found an audience among digital artists who want to experiment with motion without learning traditional animation software. The quality is variable, but the creative potential is considerable. Here too, however, audio is completely absent, and the clips last only a few seconds.

Open-Sora 1.3 is an attempt by the community to replicate the architecture of the original Sora based on publicly available information. Without access to OpenAI's code, the developers had to reverse-engineer the technical descriptions and related papers, creating an architecture that in theory should work similarly. The results are interesting but still far from the quality of Sora 1, let alone Sora 2. The fluidity is often interrupted by artifacts, temporal coherence is fragile, and handling complex scenes is problematic. But it is a living project, with an active community that continues to improve the code.

AnimateDiff deserves a special mention because it has a completely different approach. Instead of being a standalone model, it is an extension of Stable Diffusion that adds animation capabilities. You install AnimateDiff, connect it to your Stable Diffusion setup, and you can turn your generations into short animations. It is popular among those who already use Stable Diffusion for generative art, because it allows integrating animation into the existing workflow without having to learn a new system from scratch. But here too, no audio and very short durations.
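For those already in the Hugging Face diffusers ecosystem, an AnimateDiff run looks roughly like the sketch below. The checkpoint names are just examples (any Stable Diffusion 1.5-compatible model works), and the sampler settings are a matter of taste rather than requirements.

```python
# Rough example of AnimateDiff via the diffusers library; checkpoint IDs and
# sampler settings are illustrative, not the only valid choices.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

# Motion module that adds temporal layers on top of a Stable Diffusion 1.5 UNet.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
# Any SD 1.5-compatible checkpoint can serve as the image backbone.
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)
pipe.enable_model_cpu_offload()  # helps on GPUs with limited VRAM

result = pipe(
    prompt="a watercolor fox walking through a snowy forest",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
)
export_to_gif(result.frames[0], "fox.gif")
```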

CogVideoX from Tsinghua University certainly deserves a mention. Developed by the THUDM lab and updated in November 2024 with version 1.5, it is one of the most mature open-source projects in the landscape. CogVideoX-5B generates videos up to 10 seconds at 720x480 resolution, and version 1.5 introduces support for image-to-video at any resolution. The architecture is based on diffusion transformers with an expert transformer that handles complex movements better than previous models. It is particularly appreciated in the community for the quality of its temporal coherence and for having surpassed competitors like VideoCrafter-2.0 and Open-Sora in benchmarks. The code is fully available on GitHub and Hugging Face, with detailed documentation. The only limitation remains the absence of audio, but for those looking for pure video generation with good quality and decent durations, CogVideoX is one of the most solid options.
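CogVideoX is also wrapped in diffusers, which makes a first test relatively painless on a 24 GB card. A minimal text-to-video run, with illustrative settings, might look like this:

```python
# Minimal CogVideoX text-to-video example via diffusers; prompt and settings are
# illustrative. CPU offload and VAE tiling help keep memory within ~24 GB.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

frames = pipe(
    prompt="a lighthouse on a rocky coast at sunset, waves crashing below",
    num_frames=49,            # roughly six seconds at 8 fps
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(frames, "lighthouse.mp4", fps=8)
```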

LTX-Video, by Lightricks, launched in November 2024 with an ambitious promise: real-time video generation. With 2 billion parameters in the initial version (and 13 billion in the version released in May 2025), LTX-Video is the first DiT model capable of generating 1216x704 video at 30 FPS faster than it takes to watch it. Lightricks claims it is 30 times faster than comparable models, which makes it particularly interesting for applications that require rapid iterations. It is fully open source, integrated into ComfyUI for those who already use that workflow, and has an active community contributing improvements to motion consistency and scene coherence. Here too, no audio, but the generation speed is a significant competitive advantage for those doing prototyping or iterative creative work.

The picture that emerges is clear: the open-source ecosystem is vibrant, creative, and rapidly evolving. But it is also fragmented, often under-resourced, and still quite behind the commercial giants in terms of absolute quality. Most of these projects do not include audio, and when they do, the quality is lower. The durations are short, the hardware requirements demanding, and the setup complex. They are not plug-and-play solutions, but tools that require technical skills, patience, and a willingness to experiment.

Democracy or Illusion?

This brings us to the heart of the matter: is open source in the field of generative AI video a true democratic alternative, or is it just an illusion for tinkerers with too many GPUs and too much free time? The answer, as is often the case, is nuanced and depends on who you are and what you want to do.

If you are a company that needs to produce high-quality video content for marketing campaigns, commercial models like Sora 2 or Veo 3 are probably still the best choice. The quality is superior, the interface is user-friendly, and technical support exists. You pay more, of course, but you get immediate results without having to manage complex infrastructures.

But if you are a researcher, a developer, an artist who wants to experiment, or a small entity with technical skills but a limited budget, then projects like Ovi become invaluable. They offer you freedom, control, and the ability to build something customized. You can integrate the model into broader creative pipelines, adapt it to specific needs, and not depend on corporate decisions that are beyond your control.

True democratization, however, requires more than just open code. It requires clear documentation, active communities, educational resources, and a progressive reduction in hardware requirements. These projects need to become more accessible, installation needs to become simpler, and tutorials need to be understandable even to those who are not machine learning experts. And it requires economic sustainability: many open-source projects in the AI field are developed by small teams or even single individuals working in their spare time, without stable funding. How long can they last? How can they compete with the research labs of tech giants that have budgets of millions of dollars?

The near future will likely be a hybrid scenario. Commercial models will continue to dominate in terms of absolute quality and ease of use, but open-source projects will colonize specific niches: academic research, experimental artistic applications, custom integrations, use cases where control and transparency matter more than visual perfection. Ovi and its counterparts will not replace Sora or Veo, but they will offer a concrete alternative for those who want or need that alternative.

And perhaps, just as it happened with punk and indie rock, it is from these garage bands that the ideas and innovations will emerge that tomorrow the corporate giants will absorb into their mainstream products. The history of technology is full of examples of open-source projects that anticipated trends later adopted by the industry. Linux, Python, TensorFlow itself. It would not be the first time that the garage beats the million-dollar recording studio. Not in terms of budget or glitter, but in terms of ideas, freedom, and the ability to change the rules of the game.