Google Introduces LLM for Video Generation from Text and Audio Input

Google has unveiled VideoPoet, an advanced language model (LLM) capable of transforming text into dynamic videos.

DQC Bureau

Google has unveiled VideoPoet, an advanced language model (LLM) capable of transforming text into dynamic videos. To demonstrate VideoPoet's capabilities, Google Research produced a short film composed of numerous clips generated by the model.

The experiment involved asking Bard to write a short narrative about a globetrotting raccoon as a series of prompts. VideoPoet then generated a distinct video clip for each prompt, demonstrating its ability to translate textual input into engaging visual content. The result marks a significant step in harnessing language models for multimedia content creation and storytelling.

Google's researchers have introduced VideoPoet, a large language model (LLM) engineered to handle multimodal inputs, including text, images, video, and audio, with the primary goal of video creation. Built on a decoder-only architecture, VideoPoet can generate content for tasks it has not been specifically trained on.
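
To make the decoder-only idea concrete, here is a minimal sketch under simplifying assumptions: each modality is first encoded into discrete tokens in one shared vocabulary, and a single autoregressive model extends the flattened sequence one token at a time. The special token ids and the sample_next callable are illustrative placeholders, not VideoPoet's actual interface.

```python
# Minimal sketch of decoder-only, multimodal token generation.
# Token ids and the model interface are illustrative assumptions,
# not VideoPoet's real API.

from typing import Callable, List

BOS, SEP = 0, 1  # assumed special tokens: sequence start / modality boundary

def build_prompt(text_tokens: List[int],
                 image_tokens: List[int],
                 audio_tokens: List[int]) -> List[int]:
    # All modalities are flattened into one sequence; separator tokens
    # let the decoder tell where each modality begins and ends.
    return [BOS] + text_tokens + [SEP] + image_tokens + [SEP] + audio_tokens

def generate(prompt: List[int],
             sample_next: Callable[[List[int]], int],
             max_new_tokens: int = 8) -> List[int]:
    # Standard autoregressive decoding: each new token is conditioned
    # on the prompt plus everything generated so far.
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(sample_next(seq))
    return seq[len(prompt):]  # the newly generated (e.g. video) tokens

# Toy stand-in for a trained model: echoes the last token id plus one.
video_tokens = generate(build_prompt([5, 6], [7], [8]),
                        sample_next=lambda seq: (seq[-1] + 1) % 100)
print(video_tokens)
```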

The model follows the two-step training process common to other LLMs: initial pre-training followed by task-specific adaptation. In the pre-training phase, VideoPoet establishes a foundational framework that serves as a versatile base for various video generation tasks. According to the researchers, this adaptability lets VideoPoet move beyond its initial training into diverse video-related applications, underscoring its potential as a versatile tool for multimedia content creation.
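
As an illustration only, both phases can be viewed as the same next-token objective run over different data mixtures. The sketch below assumes a generic PyTorch model and placeholder data loaders; it is not VideoPoet's actual training code.

```python
import torch
import torch.nn.functional as F

def run_phase(model: torch.nn.Module,
              optimizer: torch.optim.Optimizer,
              loader) -> None:
    # Next-token prediction: predict token t+1 from tokens 0..t and
    # minimize cross-entropy against the shifted targets.
    for tokens in loader:                      # tokens: (batch, seq) int64
        logits = model(tokens[:, :-1])         # (batch, seq-1, vocab)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Phase 1: broad multimodal pre-training builds the general-purpose base.
# Phase 2: the same objective on task-specific data (e.g. text-to-video
# pairs) adapts that base. The loaders below are assumed placeholders.
# run_phase(model, optimizer, pretrain_loader)
# run_phase(model, optimizer, text_to_video_loader)
```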

“VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator,” reads a post on the Google Research website.

Unlike most current video models, which rely on diffusion (adding noise to training data and learning to reconstruct it), VideoPoet takes a different approach, combining diverse video generation capabilities within a unified language model.
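
To set the two recipes side by side, here is a hedged sketch under simplifying assumptions: diffusion iteratively denoises an array of pixels, while the LLM route predicts discrete video tokens autoregressively and decodes them back into frames. The denoiser, lm, and detokenize callables are hypothetical stand-ins, not real library calls.

```python
from typing import Callable, List

def diffusion_style(noise, denoiser: Callable, steps: int = 50):
    # Diffusion: start from pure noise and iteratively remove it,
    # stepping the sample back toward clean video frames.
    x = noise
    for t in reversed(range(steps)):
        x = denoiser(x, t)
    return x

def llm_style(prompt_tokens: List[int],
              lm: Callable[[List[int]], List[int]],
              detokenize: Callable[[List[int]], object]):
    # LLM route: one autoregressive model emits discrete video tokens,
    # which a separate codec maps back into frames.
    video_tokens = lm(prompt_tokens)
    return detokenize(video_tokens)
```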

Whereas typical systems feature separate components trained independently for specific tasks, VideoPoet integrates all of these functions into a single LLM. This unified design lets one model understand and execute a variety of video generation tasks without separate specialized components.

By consolidating these capabilities, VideoPoet represents a departure from conventional methods, showcasing a more unified and efficient way to handle the intricacies of video generation within a cohesive language model framework.
