[Hero image: futuristic microphone with flowing sound waves transforming into glowing transcript tiles toward a silhouetted listener]

Making ASR Accessible: Introducing WhisperX Studio

To honour my previous promise to open source more of the work I’ve done building Valeon, I’ve continued to develop tools that make the utilities integrated into the blog accessible to a general audience. Mostly, this is because the tools in their current form are so tightly integrated into how Valeon was built that they really wouldn’t work for anyone else. The architecture makes sense when you can see the whole machine at once: the content collections, the asset pipeline, the audio generation, the alignment artefacts, the caching rules, the render-time components. Taken out of that context, it becomes brittle, opinionated, and frustrating. I don’t want “open source” to mean “here’s the code, good luck.” I want it to mean: here’s a tool you can use immediately, and here’s a codebase you can read and actually learn from.

That’s one of the reasons I made Valeon TTS Studio: a general tool that anyone can use right away, or clone and study to understand how to integrate automated text-to-speech generation into their own workflows. It includes the same kind of real-world constraints I had to solve for: chunking rules, sane defaults, predictable output, and the little practical decisions that don’t show up in glossy demos. TTS was the first half of the “audio engine.” But the moment you add TTS to writing, you almost inevitably feel the pull toward the second half: word-level highlighting. Not because it’s flashy, but because it changes how audio is experienced. Suddenly, listening isn’t separate from reading. The text becomes a timeline. The timeline becomes navigable. And the distance between “I’m listening” and “I’m understanding” gets a little smaller.
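
Concretely, that “timeline” is just a lookup problem: given the player’s current time and a list of word timestamps, find the word whose interval contains it. Here’s a minimal TypeScript sketch, assuming alignment data shaped as start/end pairs per word (the names are illustrative, not Valeon’s actual schema):

```ts
// Illustrative shape for word-level alignment data.
interface AlignedWord {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Binary search over sorted, non-overlapping word intervals for the
// word containing the current playback time; -1 if we're between words.
function activeWordIndex(words: AlignedWord[], currentTime: number): number {
  let lo = 0;
  let hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (currentTime < words[mid].start) {
      hi = mid - 1;
    } else if (currentTime >= words[mid].end) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}
```

Wire that up to an audio element’s timeupdate event and highlighting follows almost for free; seek to a word’s start on click and the timeline becomes navigable.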

The problem is that alignment (true word-level timestamps you can trust) is a different kind of beast. Valeon’s current pipeline leans heavily on Montreal Forced Aligner (MFA) and the surrounding glue that makes it behave like a product: normalised speech-text input, language and dictionary considerations, batching, artefact formats, and a lot of unglamorous edge-case handling. MFA is open source, and anyone can deploy it by reading the documentation, but that’s not the point. The point is usability. The tight integration between my MFA deployment and Valeon’s pipeline means that, as it stands, it would be practically unusable for anyone else. So the direction I’m moving in now is not “ship my internal pipeline” but “generalise the capability”: a clean API endpoint with MFA as the backend, where you can submit audio + text and get back alignment artefacts that are actually useful.
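
That endpoint doesn’t exist yet, so treat this as a sketch of the shape I have in mind rather than a published contract; the URL, field names, and response format below are all placeholders:

```ts
// Hypothetical client for the planned MFA-backed alignment endpoint.
// URL, fields, and response shape are placeholders, not a published API.
async function alignAudio(audio: Blob, text: string) {
  const form = new FormData();
  form.append("audio", audio, "speech.wav");
  form.append("text", text);

  const res = await fetch("https://example.com/v1/align", {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`Alignment failed: ${res.status}`);

  // The artefact that matters: one timestamped entry per word.
  return (await res.json()) as {
    words: { word: string; start: number; end: number }[];
  };
}
```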

Getting to that point wasn’t easy. When I first decided I wanted TTS and word highlighting inside Valeon, I truly had no idea what I was doing. I didn’t have a neat plan. I had a desire, a vague mental image of what it should feel like, and a long stretch of trial-and-error between me and anything real. I tested service after service, not because I enjoyed the churn, but because each one taught me something about the shape of the problem. ElevenLabs. Microsoft’s offerings. AWS. Different pricing models, different output quality, different failure modes, different trade-offs hiding behind the same marketing language. And along the way I ran into a couple of tools that, while not perfect, did a “good enough” job for basic transcription without demanding a full infrastructure project from the user.

One of those was WhisperX: a transcription and alignment pipeline built on top of Whisper that does an impressive job of producing usable timestamps. In terms of “how far you can get without building a research lab,” it’s genuinely excellent. It’s fast, it’s accurate enough for real workflows, and the outputs are structured in a way that makes them easy to feed into UI features like word highlighting. It’s the kind of tool you discover and immediately think: this solves 80% of the problem for 80% of people. And if you’re trying to ship a product feature, not win a benchmark, that matters.
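
To make “structured in a useful way” concrete: WhisperX returns segments that are themselves broken down into words, each carrying its own timestamps and an alignment confidence. Roughly like this, simplified from memory (check the model’s actual output for exact field names):

```ts
// Simplified sketch of WhisperX-style output; field names approximate.
interface WhisperXWord {
  word: string;
  start: number; // seconds
  end: number;
  score: number; // alignment confidence
}

interface WhisperXSegment {
  start: number;
  end: number;
  text: string;
  words: WhisperXWord[];
}

interface WhisperXResult {
  segments: WhisperXSegment[];
}

// Flattening segments yields exactly the word list a highlighter consumes.
const toWordList = (result: WhisperXResult): WhisperXWord[] =>
  result.segments.flatMap((segment) => segment.words);
```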

But there’s a catch, and it’s the same catch that follows almost every useful ML model: deployment is the tax you pay to use it. If you don’t want to buy or rent infrastructure, you’re either running it locally (which isn’t always feasible) or you’re learning how to host it (which is often its own project). That’s where I ran into Replicate: a service that runs these models and charges you per prediction at rates that are surprisingly reasonable. In the first week of deploying Valeon with word-level audio highlighting, the blog was using WhisperX behind the scenes via Replicate. It worked. It was practical. It let me iterate without spinning up a dedicated GPU box. And it made something clear to me: a lot of people don’t need “perfect alignment at scale.” They need a button they can press that gives them a transcript they can ship.

Unfortunately, “press a button” is not what most people get when the interface is a CLI command or an API call tucked inside a script. If you already live in terminals, if you already think in request payloads, Replicate is easy. If you don’t, it may as well be locked behind a wall. The model is accessible, but the workflow isn’t. And this is where the real problem shows itself: the barrier isn’t intelligence, it’s interface. The barrier isn’t capability, it’s packaging. This is not about dumbing anything down. It’s about removing unnecessary friction so people can actually use the tools that already exist.
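
To make the contrast concrete, this is roughly what that “API call tucked inside a script” looks like with Replicate’s official JavaScript client. The model slug and input names below are illustrative; the actual WhisperX deployment on Replicate documents its own:

```ts
import Replicate from "replicate";

// The client reads REPLICATE_API_TOKEN from the environment by default.
const replicate = new Replicate();

// Illustrative model identifier and inputs; check the model page for
// the real slug, version hash, and parameter names.
const output = await replicate.run("some-owner/whisperx:version-hash", {
  input: {
    audio: "https://example.com/speech.mp3",
    language: "en",
    align_output: true, // request word-level timestamps
  },
});

console.log(output);
```

Trivial if you already live in this world; a wall if you don’t. That gap is exactly what the Studio is for.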

So I made WhisperX Studio.

WhisperX Studio is the same idea as Valeon TTS Studio, applied to transcription and timestamps: a React-based wrapper with a bring-your-own-keys approach. You create an account on Replicate, top up your balance, generate an API key, paste that key into WhisperX Studio, and it should just work. No server setup. No GPU instance. No “clone this repo, install CUDA, and pray.” You’re paying for predictions directly, and you’re getting the benefits of a hosted runtime without having to become an infrastructure engineer first. For a lot of creators, writers, indie developers, and small teams, that middle path is the difference between “I might try this someday” and “I shipped it this afternoon.”
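
In spirit, bring-your-own-keys is nothing more elaborate than keeping the key in the user’s browser and attaching it to each prediction request. A hypothetical sketch of the pattern (not the Studio’s actual code):

```ts
// Hypothetical bring-your-own-keys helpers: the key lives only in the
// user's browser storage and never touches anyone else's server.
const KEY_NAME = "replicate_api_token";

function saveApiKey(key: string): void {
  localStorage.setItem(KEY_NAME, key.trim());
}

function loadApiKey(): string {
  const key = localStorage.getItem(KEY_NAME);
  if (!key) throw new Error("Paste your Replicate API key first.");
  return key;
}
```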

I’ve attempted to expose all the parameters the model has available while keeping the UI as friendly as possible. That balance matters. Too few controls and the tool feels like a toy; too many and it feels like a cockpit. WhisperX Studio tries to sit in that sweet spot where you can run a simple transcription in seconds, but also fine-tune when you need to: language, diarisation options where applicable, different output formats, and the kinds of knobs you only care about after you’ve used the thing enough to notice where it breaks. The goal isn’t to hide complexity. It’s to make complexity optional until you need it.
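
One way to keep complexity optional is to split the options themselves: a couple of fields you always see, and an advanced group that stays collapsed until asked for. An illustrative shape (the names are mine, not the model’s exact parameters):

```ts
// Illustrative split between always-visible and advanced options;
// field names are placeholders, not the model's exact parameters.
interface BasicOptions {
  audioUrl: string;
  language?: string; // auto-detect when omitted
}

interface AdvancedOptions {
  diarize?: boolean;      // label speakers, where supported
  alignOutput?: boolean;  // word-level timestamps
  outputFormat?: "json" | "srt" | "vtt";
}

type TranscriptionRequest = BasicOptions & { advanced?: AdvancedOptions };
```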

If you’re already comfortable self-hosting, or you’d rather keep everything in your own environment, you can simply clone the GitHub repository and deploy an instance yourself. If you just want the utility, a hosted instance is available. Pick the workflow that matches how you work. That’s the point. These tools are not meant to be another gated platform you’re locked into. They’re meant to be bridges (small, practical bridges) from “this model exists” to “this model is now part of my process.”

And to be clear: WhisperX Studio isn’t the endgame for Valeon’s audio engine. It’s a stepping stone. WhisperX is “good enough” for a huge range of transcription use-cases, and in many cases it’s more than good enough. But forced alignment through MFA remains the gold standard for the kind of tight, text-accurate word highlighting I want as the default inside Valeon. What I’m building in parallel (slowly, carefully) is a generalised MFA-backed API that anyone can integrate without inheriting the entire Valeon pipeline. WhisperX Studio exists because people shouldn’t have to wait for that future to get value today.

If you’ve ever wanted to add transcripts to your content, generate word timestamps for highlighting, or build a reading experience that actually respects the listener, WhisperX Studio is for you. Not because it’s revolutionary, but because it’s usable. It turns an API-shaped capability into a human-shaped tool. And that, more than anything, is what open sourcing should feel like: not a drop of code into the ocean, but a door you can actually walk through.
