Scientists have posited that the development of vision acted as a hyper-accelerant in the evolution of organic life. Simple life forms that previously had only a limited vantage on the world could suddenly see what surrounded them: things to eat and predators to avoid.
The recent progress in AI’s vision capabilities, punctuated by the release of GPT-4-Vision last month, is a milestone of similar significance. As AI learns to see the world around it, new abilities get unlocked for machine-powered insight, ushering in a new evolutionary phase of our techno-powered civilization.
Let’s explore the practical implications of generative vision through the most universally appreciated aspects of the world we live in: Cat Videos.
Autonomous Video Editing
What’s novel about what we’ve put together here is that it’s 100% automated. No human editing is involved. Everything is powered by machines. And while, for the time being, your humble protagonist plays an instrumental role pressing buttons in a Google Colab notebook, I mostly just kick my feet up and eat a sandwich while the AI video editor we’ve coded does the rest.
How It Works
Let’s take our AI video editor for a spin by putting it to work on some TikTok videos. Here are the steps involved (a rough code sketch follows the list):
Our AI editor searches TikTok for trending cat videos.
It breaks each video into frames and analyzes them to understand the contents.
It generates an insightful commentary to accompany the video.
It analyzes its own commentary to identify the most humorous lines.
The commentaries are fed to a text-to-speech model to generate audio.
The system edits the video and audio together into a video compilation.
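For readers who want to see the shape of the thing, here is a minimal sketch of how such a pipeline might be wired together. It assumes the OpenAI Python SDK (GPT-4-Vision for frame analysis, a text-to-speech model for narration) and moviepy for the final edit; the TikTok search-and-download step is omitted, and the function names, model choices, and file paths are illustrative rather than the exact code running in our Colab notebook.

```python
# Sketch of an autonomous video-commentary pipeline (illustrative, not the exact notebook code).
import base64
import cv2
from openai import OpenAI
from moviepy.editor import VideoFileClip, AudioFileClip, concatenate_videoclips

client = OpenAI()

def sample_frames(path, every_n=30):
    """Grab every Nth frame of a clip and return them as base64-encoded JPEGs."""
    frames, cap, i = [], cv2.VideoCapture(path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf).decode())
        i += 1
    cap.release()
    return frames

def write_commentary(frames):
    """Ask the vision model to look at sampled frames and write one witty line about the clip."""
    content = [{"type": "text", "text": "Describe this cat video and write one witty line of commentary."}]
    content += [{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}} for f in frames[:8]]
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=200,
    )
    return resp.choices[0].message.content

def pick_funniest(commentaries, k=5):
    """Have the model rank its own commentaries and return the indices of the funniest ones."""
    joined = "\n".join(f"{i}: {c}" for i, c in enumerate(commentaries))
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": f"Return the indices of the {k} funniest lines, comma-separated:\n{joined}"}],
    )
    return [int(i) for i in resp.choices[0].message.content.split(",") if i.strip().isdigit()]

def narrate(text, out_path="commentary.mp3"):
    """Turn a commentary line into narration audio with a text-to-speech model."""
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
    speech.stream_to_file(out_path)
    return out_path

def assemble(video_paths, audio_paths, out_path="compilation.mp4"):
    """Lay each narration over its clip and join the clips into one compilation."""
    clips = [VideoFileClip(v).set_audio(AudioFileClip(a)) for v, a in zip(video_paths, audio_paths)]
    concatenate_videoclips(clips).write_videofile(out_path)
```

Chaining these together is the whole editor: sample frames from each downloaded clip, generate and rank commentaries, narrate the winners, and hand the clip/audio pairs to assemble() for the final cut.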
Video Outputs
Within a minute or so, it spits out an original piece of content: a compilation of feline commentary showcasing AI’s attempts to contribute to this cherished pastime.
Applications and Implications
This isn’t the most intelligent use of AI vision capabilities. But because every step in this process is 100% automated, the general approach opens up a wide range of use cases that are worth scratching your head over.
In the West, the most obvious use cases for this technology involve sales and marketing:
Imagine a Fast Fashion brand being able to analyze the latest styles coming out of a particular subculture so as to co-opt trends faster.
Imagine a grocery chain being able to understand which products a customer is interested in based on which aisles they stop and meander in.
Imagine any marketer being able to analyze their content for insights on what’s performing best.
Imagine a brand being able to generate infinite edits of an ad to personalize to thousands of audiences.
Within totalitarian systems, these capabilities become infinitely more troubling. China leads the world in the development and export of surveillance technology and is second only to the US when it comes to computer vision research. One shudders to imagine what the surveillance apps of the future might look like.
More AI Cat Content
You're not here to fret about the future of civilization, so let’s turn our attention back to the infinite cat video production machine we’ve assembled.
And because it’s fully automated, this is an all-you-can-eat buffet.
Addition designs and develops AI solutions for modern brands.
Read about us in the Wall Street Journal
Visit our website to learn about the work we do with brands and agencies
Super awesome. Love this. NOW... could you expand the machine to have it text-to-generate original video? Also curious how you're thinking about rights/usage as it relates to video scraped from TT.
Yeah for sure, LLM agents can trigger generative videos and edit them together. The only significant limitation on that until recently has been API/model access for video generation models, but that’s changing (for example, the recent release of the SD Video model).
In terms of rights, TikTok is just a demo. I wouldn’t mean to suggest scraping them for commercial use, although there could be a lot of interesting use cases for insight discovery using this pipeline.