
HeartMuLa (3B) Is the First Local Music Model That Feels Close, But AI Music Needs Editing to Really Take Off

I do not think AI music takes off in a big way until we get good autoregressive models with editing. Not just generate another song or regenerate the chorus, but the kind of editing you would expect in a DAW: keep the good parts, replace the bad parts, preserve timing, and fix one line without destroying the rest. The current state of the industry is a lot of one-shot generation that is impressive but ultimately frustrating for anyone trying to do actual work.

That is why I am paying attention to HeartMuLa. The new open-source release, HeartMuLa-oss-3B, is a 3-billion-parameter song generation model you can run locally. It is not the biggest model in the world, but it is a meaningful datapoint: open source is getting close to the feel of closed systems like Suno, and it is doing it with a design that clearly cares about lyrical fidelity. HeartMuLa-oss-3B set out to reproduce commercial-grade systems using academic-scale resources, and getting this close on that budget is a significant achievement for the open-source community.

One immediate impression from the demos is that the vocals can be surprisingly clear. The model seems tuned to make words legible, and in that narrow lane it does a good job. Suno still feels better overall to me, but HeartMuLa is a real competitor in the "my lyrics are understandable" category, which is harder than people think. There is a clear trade-off here, though: by optimizing so heavily for lyric error rate, the model seems to have become less ambitious with instrumental action.

What HeartMuLa is, technically

HeartMuLa is a full framework rather than just a single model. The architecture consists of four integrated components that attempt to bridge the gap between text and high-fidelity audio. First, there is HeartCLAP, which handles the audio-text alignment. Then HeartTranscriptor handles lyric recognition. The core of the efficiency comes from HeartCodec, a music codec tokenizer that operates at a low frame rate of 12.5 Hz. This allows it to preserve fine acoustic details without making the sequences so long that they become computationally impossible to manage on local hardware.
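The frame-rate claim is worth making concrete. A quick back-of-the-envelope calculation shows why 12.5 Hz matters for local hardware; the 50 Hz comparison point is my assumption of a typical rate for general audio codecs, not a number from the HeartMuLa paper:

```python
def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of codec frames needed to represent `duration_s` of audio."""
    return int(duration_s * frame_rate_hz)

song = 180  # a three-minute song

low = frames_for(song, 12.5)   # HeartCodec's reported frame rate
high = frames_for(song, 50.0)  # an assumed rate for a general-purpose codec

# At 12.5 Hz the global sequence is 2,250 frames instead of 9,000, a 4x
# reduction in the context length the autoregressive model has to attend over.
print(low, high)  # 2250 9000
```

Every frame still carries residual tokens for acoustic detail, so the total token count is higher, but the long-range attention cost is driven by this per-frame count.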

The generation model itself uses hierarchical transformers. A global transformer predicts the base tokens to capture structure, while a local transformer handles the residual details. This is a smart way to handle the computational constraints of long-sequence generation. It is the kind of technical pragmatism I like to see in open-source projects. You can find more about how I track these kinds of releases in my ai-aggregator project.
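The global/local split above can be sketched as a nested decoding loop. The interfaces here (`global_model`, `local_model`) are hypothetical stand-ins, not HeartMuLa's actual API; this only illustrates the control flow of hierarchical generation:

```python
from typing import Callable, List

def generate(global_model: Callable, local_model: Callable,
             n_frames: int, residual_depth: int) -> List[List[int]]:
    """Hierarchical decoding sketch: one coarse token per frame from the
    global model, expanded into residual tokens by the local model."""
    frames = []
    base_history: List[int] = []
    for t in range(n_frames):
        # Global transformer: attends over the full frame history,
        # which is what captures long-range song structure.
        base = global_model(base_history)
        # Local transformer: expands that base token into residual tokens
        # for the same frame, filling in fine acoustic detail cheaply.
        residuals = [local_model(base, d) for d in range(residual_depth)]
        frames.append([base] + residuals)
        base_history.append(base)
    return frames

# Toy stand-ins so the sketch runs end to end.
toy_global = lambda hist: len(hist) % 4
toy_local = lambda base, depth: base * 10 + depth

out = generate(toy_global, toy_local, n_frames=3, residual_depth=2)
print(out)  # [[0, 0, 1], [1, 10, 11], [2, 20, 21]]
```

The point of the split is that only the global model pays full attention cost over the song; the local model works frame by frame, which keeps long-sequence generation tractable.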

HeartMuLa vs Suno Trade-offs

HeartMuLa prioritizes lyric accuracy and local accessibility, while Suno maintains a lead in musical arrangement.

The Lyric Optimization Trap

HeartMuLa uses a four-stage training paradigm: warmup, pretraining, supervised finetuning, and reinforcement learning. Throughout this process, there is a heavy emphasis on lyrical accuracy. They even use structural markers like intro, verse, and chorus to guide the generation. While this results in a model that actually says what you told it to say, it can feel a bit boxed in. In music, the arrangement is part of the understandability. If the instruments are too static or conservative, the vocals can feel disconnected from the track, even if the syllables are clear.
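To make the structural markers concrete, here is what a tagged lyric sheet might look like, plus a small helper that counts the sections. The bracketed `[intro]`/`[verse]`/`[chorus]` syntax is my assumption based on the markers described, not HeartMuLa's documented input format, and the lyrics are invented for illustration:

```python
lyrics = """\
[intro]
[verse]
Cold light on the station floor
[chorus]
We run until the morning comes
[verse]
Tickets folded in your coat
[chorus]
We run until the morning comes
"""

def section_counts(tagged: str) -> dict:
    """Count how often each structural marker appears in a tagged lyric sheet."""
    counts: dict = {}
    for line in tagged.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            tag = line[1:-1]
            counts[tag] = counts.get(tag, 0) + 1
    return counts

print(section_counts(lyrics))  # {'intro': 1, 'verse': 2, 'chorus': 2}
```

Markers like these are exactly what makes the output predictable, and also what makes it feel boxed in: the model is rewarded for following the sheet, not for surprising you.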

This reminds me of the early days of AI image generation where models were great at specific textures but struggled with composition. HeartMuLa is great at the words, but the soul of the music—the instrumental action—needs more work. I suspect the team over-optimized for the lyric error rate metric. It is a classic case of what you measure is what you get.

Why Editing is the Killer Feature

If you have spent any time with these tools, you know the pain. You get a great track, but the bridge is terrible. Today, your only real option is to reroll and hope for the best. That is not a workflow; it is a slot machine. The first model that allows for true autoregressive editing will win. We need the ability to lock a section and regenerate only a specific window, or change the lyrics without losing the singer’s identity. HeartMuLa’s support for multi-condition inputs like reference audio and section tags is a step toward this, but the UI and the underlying model flexibility are not quite there yet.
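At the token level, the editing primitive I am asking for is simple to state. This is a sketch of what lock-and-regenerate could look like, not an existing HeartMuLa feature; `regenerate` is a hypothetical model call that would condition on both sides of the window:

```python
from typing import Callable, List

def edit_window(tokens: List[int], start: int, end: int,
                regenerate: Callable[[List[int], List[int]], List[int]]) -> List[int]:
    """Replace tokens[start:end] using left and right context, keeping
    everything outside the window untouched. Holding the window length
    fixed is what preserves timing."""
    left, right = tokens[:start], tokens[end:]
    new_window = regenerate(left, right)
    assert len(new_window) == end - start, "edit must preserve duration"
    return left + new_window + right

# Toy regenerator that just fills the window with a marker token.
song = list(range(10))
fixed = edit_window(song, 4, 7, lambda l, r: [99] * 3)
print(fixed)  # [0, 1, 2, 3, 99, 99, 99, 7, 8, 9]
```

The hard part is not the slicing, it is making the model condition on future context (the locked right side) when it was trained left to right, which is why true infilling is still rare in music models.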

There is a 7B parameter version of HeartMuLa coming which apparently shows significant improvements when scaled. I am curious to see if that version takes more risks with the instrumentals. If they can keep the vocal clarity while adding more musical ambition, they will have something very dangerous. For now, it is a great tool for anyone who wants to run music generation locally without relying on a cloud provider’s credits or privacy policies.

If you want to see the research yourself, check out the HeartMuLa arXiv preprint. My take is consistent: the tech is impressive, but we are still waiting for the tool that lets us actually finish a song without a hundred rerolls.