I’ve been pushing modern AI video models to see how far they can go in generating immersive 360-degree videos. What’s truly remarkable is that these models, despite not being explicitly trained on 360 content, can produce convincing spherical VR videos. The key? Precise, insistent prompting. It’s astonishing how much you can achieve by being exact with your instructions.
My tests involved Google DeepMind’s Veo 3 and the Hailuo 02 model, both state-of-the-art video generation tools. The prompt I used was quite specific: “Generate a 360-degree equirectangular video. The output format must be 360-degree equirectangular. The scene is a lush waterfall, rendered as a complete 360×180 spherical VR video. The camera is stationary, capturing a full 360-degree view of the entire environment, which must be rendered in equirectangular format. Water cascades down mossy rocks into a clear pool. Final instruction: The video must be a 360-degree, spherical, equirectangular video. 360 video format is mandatory. Include ambient waterfall sounds. No subtitles.”
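Notice how the format requirement is restated at the beginning, middle, and end of the prompt. If you want to reuse that pattern, here is a tiny sketch of how I’d template it in Python; the function name and wording are my own illustration, not an API for any of these models:

```python
def build_360_prompt(scene: str, audio: str | None = None) -> str:
    """Wrap a scene description in redundant 360/equirectangular
    instructions, restated at the start, middle, and end of the prompt.
    (Illustrative template only; the exact wording is an assumption.)"""
    parts = [
        "Generate a 360-degree equirectangular video. "
        "The output format must be 360-degree equirectangular.",
        f"The scene: {scene} Rendered as a complete 360x180 spherical VR video. "
        "The camera is stationary, capturing a full 360-degree view of the "
        "entire environment, which must be rendered in equirectangular format.",
        "Final instruction: The video must be a 360-degree, spherical, "
        "equirectangular video. 360 video format is mandatory.",
    ]
    if audio:
        parts.append(f"Include {audio}. No subtitles.")
    return " ".join(parts)

print(build_360_prompt(
    "A lush waterfall. Water cascades down mossy rocks into a clear pool.",
    audio="ambient waterfall sounds",
))
```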
The results were genuinely impressive. Veo 3 handles audio natively, so the ambient waterfall sounds were seamlessly integrated. Hailuo 02, while visually stunning, doesn’t generate audio, so I had to add it afterward using MMAudio V2. Audio aside, both models delivered results that were nearly seamless. This held true even with high-motion scenes, like drone flight paths, which typically make seam artifacts in 360 videos much more apparent.
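If you do end up adding audio in post, the final mux step is straightforward. Here is a minimal sketch using FFmpeg via Python’s subprocess module, assuming the silent Hailuo clip and the MMAudio V2 track already exist as separate files (the filenames are placeholders):

```python
import subprocess

def mux_audio(video_path: str, audio_path: str, out_path: str) -> None:
    """Copy the video stream untouched and attach the generated audio,
    trimming to the shorter of the two streams."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,   # silent AI-generated video
            "-i", audio_path,   # audio generated by, e.g., MMAudio V2
            "-c:v", "copy",     # don't re-encode the video stream
            "-c:a", "aac",      # encode audio to AAC for MP4 compatibility
            "-shortest",        # stop at the end of the shorter stream
            out_path,
        ],
        check=True,
    )

mux_audio("hailuo_waterfall.mp4", "waterfall_ambience.wav", "waterfall_with_audio.mp4")
```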
The Unsung Hero: Veo 3’s Native 360 Capabilities
Google’s Veo 3, released in 2025, is a powerful text-to-video model that generates high-quality 8-second clips at 720p or 1080p. One standout feature that often goes unnoticed is its native audio generation: it produces ambient sound, sound effects, and even dialogue directly from the prompt, so no external audio editing is needed. For workflow efficiency, that’s a game-changer.
What I found particularly compelling is Veo 3’s innate ability to produce 360-degree spherical video in equirectangular format. All it requires is a simple instruction in the prompt, such as “make it 360 degrees.” I successfully generated immersive 360×180 spherical videos of lush waterfall scenes, complete with cascading water and mossy rocks, all from a stationary camera covering the full environment. The output is VR-ready and plays correctly on platforms like YouTube that support 360-degree playback, once you inject the spherical metadata those players look for (the generated MP4 doesn’t carry it; see the sketch below). The native audio generation means the ambient waterfall sounds are perfectly synchronized, which significantly enhances the immersion.
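For YouTube specifically, the usual way to flag a file as equirectangular is Google’s open-source spatial-media tool, which injects the spherical metadata 360 players expect. A minimal sketch, assuming you’ve cloned https://github.com/google/spatial-media and are running from the repo root, with placeholder filenames:

```python
import subprocess

# Inject spherical (360) metadata so players such as YouTube recognize
# the file as an equirectangular video. Requires the spatialmedia module
# from https://github.com/google/spatial-media.
subprocess.run(
    [
        "python", "spatialmedia",
        "-i",                       # inject metadata (rather than inspect)
        "veo3_waterfall.mp4",       # placeholder input filename
        "veo3_waterfall_360.mp4",   # placeholder output filename
    ],
    check=True,
)
```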
For more on the broader capabilities of AI video models, you might check out my comparison of leading models in Best AI Video Models of June 2025.
Tackling the Seam: Challenges and Clever Workflows
Despite the excellent results, one persistent challenge with these AI-generated 360-degree videos is the seam artifact: the visible line where the left and right edges of the equirectangular projection meet. It becomes more noticeable, even distracting, in high-motion videos. I experimented with inpainting the start frame to remove the seam visually, and with image-to-video approaches in Midjourney, but maintaining a perfectly seamless edge throughout the motion proved difficult. Still images are, predictably, easier to fix than moving content. Interestingly, current text-to-video models seem to produce better seam coherence than image-to-video workflows, likely because they are trained on sequences and have a better understanding of motion context.
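If you want to gauge how bad a seam is before bothering with repair, one quick heuristic is to compare the leftmost and rightmost pixel columns of each frame; in a perfect equirectangular wrap they should match. A minimal sketch with OpenCV and NumPy (the interpretation thresholds in the final comment are my own rough assumption, not a standard):

```python
import cv2
import numpy as np

def seam_error(video_path: str) -> float:
    """Mean absolute difference between the left and right edge columns,
    averaged over all frames. 0 means a mathematically perfect wrap."""
    cap = cv2.VideoCapture(video_path)
    diffs = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        left = frame[:, 0].astype(np.float32)    # leftmost pixel column
        right = frame[:, -1].astype(np.float32)  # rightmost pixel column
        diffs.append(np.abs(left - right).mean())
    cap.release()
    return float(np.mean(diffs))

err = seam_error("veo3_waterfall.mp4")
print(f"seam error: {err:.2f}")  # roughly: under 5 is hard to notice, over 20 is visible
```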
To address the seam directly, I developed a simple workflow: horizontally shift the video so the seam sits in the center of the frame, a process I call “rolling” the video. I used Gemini to build a custom browser-based tool that shifts the video’s content by 50% horizontally and writes out a new file in which the seam is a single, clean line down the middle, perfectly prepped for an inpainting model to work on.
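The same “roll” is easy to reproduce outside the browser. Because yawing an equirectangular sphere by 180 degrees is just a horizontal pixel shift, np.roll does the whole job; here is a minimal sketch with OpenCV that rewrites each frame shifted by half its width (filenames and codec are placeholders):

```python
import cv2
import numpy as np

def roll_video(in_path: str, out_path: str) -> None:
    """Shift every frame 50% horizontally so the equirectangular seam
    (originally at the frame edges) lands in the center of the image."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        out.write(np.roll(frame, w // 2, axis=1))  # wrap pixels around horizontally
    cap.release()
    out.release()

roll_video("veo3_waterfall.mp4", "veo3_waterfall_rolled.mp4")
```

If you prefer a one-liner, FFmpeg’s v360 filter should express the same operation as a pure sphere rotation, e.g. `ffmpeg -i in.mp4 -vf "v360=e:e:yaw=180" out.mp4`, which also makes clear why the shift is lossless.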
From there, I can use an inpainting model such as Wan 2.1 14B VACE to fill and smooth over the seam area, dramatically improving the final quality. The shifting step matters because video inpainting, especially frame by frame, is computationally expensive and slow. By combining the horizontal shift with inpainting focused on a single key frame, I strike a practical balance: powerful enough for most applications, and affordable enough for casual experimentation.
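Most inpainting pipelines take an image-plus-mask pair, so the only extra asset you need is a mask covering the now-centered seam. A minimal sketch that builds a white vertical strip down the middle of a black frame (the 2% strip width is my assumption; tune it to how wide the artifact actually is):

```python
import cv2
import numpy as np

def seam_mask(width: int, height: int, strip_frac: float = 0.02) -> np.ndarray:
    """White vertical strip centered in a black frame, marking the
    rolled seam region for an inpainting model."""
    mask = np.zeros((height, width), dtype=np.uint8)
    half = max(1, int(width * strip_frac / 2))
    mask[:, width // 2 - half : width // 2 + half] = 255
    return mask

cv2.imwrite("seam_mask.png", seam_mask(1920, 1080))
```

Once the key frame is inpainted and the fix propagated, rolling the video back by the same 50% restores the original orientation.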
Experiments with Midjourney’s video model highlighted the same seam challenges. Initial frames with the seam removed showed promise, but the model struggled to maintain seam continuity during motion, especially in high-motion POV drone shots. Hailuo 02 performed better than Midjourney in this regard but still exhibited minor seam artifacts under challenging conditions. This suggests that while these models are incredibly capable, the wrap-around geometry of the equirectangular projection still presents a real hurdle.
The role of audio is also noteworthy. Veo 3’s native support means you can prompt both video and sound at once. For models that don’t support audio, post-process addition is essential, which adds complexity. Still, the fact that these models can generate 360 content from detailed prompts without explicit 360 training is a major step forward.
The Future of Immersive AI Content
High-motion test cases reveal some current limitations: even the best models struggle with perfect seam concealment during rapid camera movements like drone shots. The seams tend to flicker or ripple during motion, though static scenes are almost perfect. I believe future improvements will come from better inpainting algorithms, more training on spherical projections, and more refined prompt engineering.
We’re witnessing a significant shift. These recent models can generate 360 videos that are good enough for many immersive applications, all through precise prompting. Post-processing techniques, like shifting and seam inpainting, make the results even more convincing. This work is just getting started, but the implications are profound: the barrier for creating spherical VR content with AI is dropping fast, with no need for models trained specifically on 360 data.
Looking ahead, I expect text-to-video to keep widening its lead over still-image workflows in seam consistency and overall immersion. The challenges are clear: improve seam concealment during high motion, automate the post-processing, and refine prompt accuracy. For now, this approach opens new possibilities for virtual tours, creative storytelling, and immersive experiences, all built from straightforward prompts and minimal edits.
For those interested in trying this yourself, I recommend experimenting with public models like Veo 3 or Hailuo 02, coupled with simple shifting and inpainting workflows. The cost is very low, and the quality for preliminary projects can be surprisingly high. There’s a lot of room for experimentation, and I’m excited to see how these tools will take shape in the coming months.
The ability of models like Veo 3 and Hailuo 02 to produce high-quality 360-degree content without specialized training underscores a broader trend: AI models are becoming more versatile and adaptable. This means that designers and content creators don’t need to wait for purpose-built 360-degree generation tools; they can start experimenting with existing, powerful text-to-video models today. The implications for VR content creation, gaming, virtual tourism, and even architectural visualization are substantial. Imagine generating an entire virtual walkthrough of a building from a simple text description, complete with realistic lighting and ambient sounds. The current capabilities are a strong indicator that this future is closer than we think.
While the cost of video inpainting for perfect results across an entire video can be prohibitive, the focused approach of correcting key frames after horizontal shifting makes it a practical solution. This cost-effectiveness means that even independent creators can experiment with high-quality 360 video production without breaking the bank. The emphasis on prompt engineering also places more power in the hands of the user. The more detailed and specific your prompt, the better the output, reducing the need for extensive post-production.
This is a testament to the rapid advancements in AI. What was once the domain of specialized hardware and software, requiring extensive technical expertise, is now becoming accessible through intuitive prompting. This democratization of immersive content creation will undoubtedly lead to a surge in innovative applications and experiences. The minor seam artifacts are a small price to pay for such a significant leap in capability and accessibility.
The ongoing competition between models like Veo 3 and Hailuo 02 will only accelerate these improvements. As models become more efficient and better at understanding complex spatial relationships, the need for manual post-processing will likely diminish. For now, embracing these clever workflows allows creators to capitalize on current AI strengths while anticipating future breakthroughs.