Making ASR Accessible: Introducing WhisperX Studio
To honour my previous promise to open source more of the work I’ve done building Valeon, I’ve continued developing standalone tools that make the utilities baked into the blog accessible to a general audience. Mostly, that’s because those utilities in their current form are so tightly integrated into how Valeon was built that they really wouldn’t work for anyone else. The architecture makes sense when you can see the whole machine at once—the content collections, the asset pipeline, the audio generation, the alignment artefacts, the caching rules, the render-time components—but taken out of that context it becomes brittle, opinionated, and frustrating. I don’t want “open source” to mean “here’s the code, good luck.” I want it to mean: here’s a tool you can use immediately, and here’s a codebase you can read and actually learn from.
That’s one of the reasons I made Valeon TTS Studio: a general tool that anyone can use right away, or clone and study to understand how to integrate automated text-to-speech generation into their own workflows. It includes the same kind of real-world constraints I had to solve for—chunking rules, sane defaults, predictable output, and the little practical decisions that don’t show up in glossy demos. TTS was the first half of the “audio engine.” But the moment you add TTS to writing, you almost inevitably feel the pull toward the second half: word-level highlighting. Not because it’s flashy, but because it changes how audio is experienced. Suddenly, listening isn’t separate from reading. The text becomes a timeline. The timeline becomes navigable. And the distance between “I’m listening” and “I’m understanding” gets a little smaller.
The problem is that alignment—true word-level timestamps you can trust—is a different kind of beast. Valeon’s current pipeline leans heavily on Montreal Forced Aligner (MFA) and the surrounding glue that makes it behave like a product: normalised speech text, language and dictionary considerations, batching, artefact formats, and a lot of unglamorous edge-case handling. MFA is open source, and anyone can deploy it by reading the documentation, but that’s not the point. The point is usability. The tight integration between my MFA deployment and Valeon’s pipeline means that, as it stands, it would be practically unusable for anyone else. So the direction I’m moving in now is not “ship my internal pipeline,” but “generalise the capability”: a clean API endpoint with MFA as the backend, where you can submit audio + text and get back alignment artefacts that are actually useful.
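To make that concrete, here is a purely illustrative TypeScript sketch of the contract I have in mind: audio plus text in, word timestamps out. None of these names are the real endpoint or schema; they are placeholders for the shape of the thing.

```ts
// Purely illustrative request/response shapes for a generalised,
// MFA-backed alignment endpoint. Every name here is a placeholder.

interface AlignmentRequest {
  audioUrl: string;   // a fetchable audio file
  text: string;       // the exact text spoken in the audio
  language?: string;  // e.g. "en": selects the dictionary/acoustic model
}

interface AlignedWord {
  word: string;       // token as it appears in the submitted text
  start: number;      // seconds from the start of the audio
  end: number;        // seconds from the start of the audio
}

interface AlignmentResponse {
  words: AlignedWord[];
}

// Sketch of a client call against such an endpoint (placeholder URL).
async function align(req: AlignmentRequest): Promise<AlignmentResponse> {
  const res = await fetch("https://example.com/v1/align", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`Alignment request failed: ${res.status}`);
  return res.json();
}
```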
Getting to that point wasn’t easy. When I first decided I wanted TTS and word highlighting inside Valeon, I truly had no idea what I was doing. I didn’t have a neat plan. I had a desire, a vague mental image of what it should feel like, and a long stretch of trial-and-error between me and anything real. I tested service after service, not because I enjoyed the churn, but because each one taught me something about the shape of the problem. ElevenLabs. Microsoft’s offerings. AWS. Different pricing models, different output quality, different failure modes, different trade-offs hiding behind the same marketing language. And along the way I ran into a couple of tools that—while not perfect—did a “good enough” job for basic transcription without demanding a full infrastructure project from the user.
One of those was WhisperX: a transcription and alignment pipeline built on top of Whisper that does an impressive job of producing usable timestamps. In terms of “how far you can get without building a research lab,” it’s genuinely excellent. It’s fast, it’s accurate enough for real workflows, and the outputs are structured in a way that makes them easy to feed into UI features like word highlighting. It’s the kind of tool you discover and immediately think: this solves 80% of the problem for 80% of people. And if you’re trying to ship a product feature—not win a benchmark—that matters.
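To show why those outputs slot so neatly into UI features, here is a rough TypeScript sketch of the kind of word-level structure a WhisperX-style result contains and how a player can turn it into highlighting. Treat the field names as an approximation; they vary between versions and output formats.

```ts
// Approximate shape of WhisperX-style aligned output (field names may
// differ between versions and output formats).
interface Word {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

interface Segment {
  text: string;
  start: number;
  end: number;
  words: Word[];
}

// Given the current playback position, return the index of the word
// that should be highlighted, or -1 if none is active.
function activeWordIndex(words: Word[], currentTime: number): number {
  return words.findIndex((w) => currentTime >= w.start && currentTime < w.end);
}

// Wiring it to an <audio> element: update the highlight on every tick.
function attachHighlighter(
  audio: HTMLAudioElement,
  words: Word[],
  onHighlight: (index: number) => void
): void {
  audio.addEventListener("timeupdate", () => {
    onHighlight(activeWordIndex(words, audio.currentTime));
  });
}
```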
But there’s a catch, and it’s the same catch that follows almost every useful ML model: deployment is the tax you pay to use it. If you don’t want to buy or rent infrastructure, you’re either running it locally (which isn’t always feasible) or you’re learning how to host it (which is often its own project). That’s where I ran into Replicate: a service that runs these models and charges you per prediction at rates that are surprisingly reasonable. In the first week of deploying Valeon with word-level audio highlighting, the blog was using WhisperX behind the scenes via Replicate. It worked. It was practical. It let me iterate without spinning up a dedicated GPU box. And it made something clear to me: a lot of people don’t need “perfect alignment at scale.” They need a button they can press that gives them a transcript they can ship.
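If you do live in code, the Replicate route really is only a handful of lines with their official JavaScript client. The snippet below is a minimal sketch: the model identifier and the input field names are placeholders, because each WhisperX deployment on Replicate defines its own, so check the page of the model you actually use.

```ts
// Minimal sketch of running a WhisperX deployment on Replicate with the
// official `replicate` npm client. The model identifier and input field
// names below are placeholders; check the model page you actually use.
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN, // your Replicate API key
});

async function transcribe(audioUrl: string): Promise<unknown> {
  // replicate.run() creates a prediction and waits for the result.
  return replicate.run(
    "someone/whisperx:version-hash", // placeholder identifier
    {
      input: {
        audio_file: audioUrl, // assumed input name: varies by deployment
        language: "en",
      },
    }
  );
}
```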
Unfortunately, “press a button” is not what most people get when the interface is a CLI command or an API call tucked inside a script. If you already live in terminals, if you already think in request payloads, Replicate is easy. If you don’t, it may as well be locked behind a wall. The model is accessible, but the workflow isn’t. And this is where the real problem shows itself: the barrier isn’t intelligence, it’s interface. The barrier isn’t capability, it’s packaging. This is not about dumbing anything down. It’s about removing unnecessary friction so people can actually use the tools that already exist.
So I made WhisperX Studio.
WhisperX Studio is the same idea as Valeon TTS Studio, applied to transcription and timestamps: a React-based wrapper with a bring-your-own-keys approach. You create an account on Replicate, top up your balance, generate an API key, paste that key into WhisperX Studio, and it should just work. No server setup. No GPU instance. No “clone this repo, install CUDA, and pray.” You’re paying for predictions directly, and you’re getting the benefits of a hosted runtime without having to become an infrastructure engineer first. For a lot of creators, writers, indie developers, and small teams, that middle path is the difference between “I might try this someday” and “I shipped it this afternoon.”
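Under the hood, bring-your-own-keys is deliberately simple: the key stays in your browser and is only attached to requests you make yourself. The sketch below is a generic illustration of that pattern against Replicate’s HTTP prediction endpoint, not WhisperX Studio’s actual code; the body fields depend on the model you call.

```ts
// Generic illustration of the bring-your-own-keys pattern: the key is
// stored client-side and sent only on requests the user makes themselves.
// Not WhisperX Studio's actual code.

const KEY_STORAGE = "replicate_api_key";

function saveApiKey(key: string): void {
  localStorage.setItem(KEY_STORAGE, key.trim());
}

function loadApiKey(): string | null {
  return localStorage.getItem(KEY_STORAGE);
}

// Create a prediction against Replicate's HTTP API with the stored key.
// The body fields (model version, input names) depend on the model used.
async function createPrediction(version: string, input: Record<string, unknown>) {
  const key = loadApiKey();
  if (!key) throw new Error("No API key saved yet.");
  const res = await fetch("https://api.replicate.com/v1/predictions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${key}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ version, input }),
  });
  if (!res.ok) throw new Error(`Replicate error: ${res.status}`);
  return res.json(); // includes an id and status you can poll
}
```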
I’ve attempted to expose all the parameters the model has available while keeping the UI as friendly as possible. That balance matters. Too few controls and the tool feels like a toy; too many and it feels like a cockpit. WhisperX Studio tries to sit in that sweet spot where you can run a simple transcription in seconds, but also fine-tune when you need to—language, diarisation options where applicable, different output formats, and the kinds of knobs you only care about after you’ve used the thing enough to notice where it breaks. The goal isn’t to hide complexity. It’s to make complexity optional until you need it.
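As a rough illustration of what “complexity optional” looks like in practice, here is the kind of option surface involved. The names are placeholders rather than the tool’s real schema; the point is that a plain run needs no configuration at all and the knobs only appear when you reach for them.

```ts
// Illustrative option surface: placeholder names, not WhisperX Studio's
// real schema, but the same idea of optional complexity.
interface TranscriptionOptions {
  language?: string;                      // e.g. "en"; unset lets the model detect it
  diarize?: boolean;                      // speaker labels, where the model supports it
  outputFormat?: "json" | "srt" | "vtt";  // structured output vs. subtitle formats
}

// Sensible defaults: a simple transcription should need no tuning at all.
const DEFAULTS: TranscriptionOptions = {
  diarize: false,
  outputFormat: "json",
};

// Power users override only the knobs they care about.
function resolveOptions(overrides: TranscriptionOptions = {}): TranscriptionOptions {
  return { ...DEFAULTS, ...overrides };
}
```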
If you’re already comfortable self-hosting or you’d rather keep everything in your own environment, you can simply clone the GitHub repository and deploy an instance yourself. If you just want the utility, a hosted instance is available. Pick the workflow that matches how you work. That’s the point. These tools are not meant to be another gated platform you’re locked into. They’re meant to be bridges—small, practical bridges—from “this model exists” to “this model is now part of my process.”
And to be clear: WhisperX Studio isn’t the endgame for Valeon’s audio engine. It’s a stepping stone. WhisperX is “good enough” for a huge range of transcription use-cases, and in many cases it’s more than good enough. But forced alignment through MFA remains the gold standard for the kind of tight, text-accurate word highlighting I want as the default inside Valeon. What I’m building in parallel—slowly, carefully—is a generalised MFA-backed API that anyone can integrate without inheriting the entire Valeon pipeline. WhisperX Studio exists because people shouldn’t have to wait for that future to get value today.
If you’ve ever wanted to add transcripts to your content, generate word timestamps for highlighting, or build a reading experience that actually respects the listener, WhisperX Studio is for you. Not because it’s revolutionary, but because it’s usable. It turns an API-shaped capability into a human-shaped tool. And that, more than anything, is what open sourcing should feel like: not a drop of code into the ocean, but a door you can actually walk through.


