
Valeon: Listening in Motion


In the last update, I promised to keep more of the scaffolding in the open: to show not just what the site looks like from the street, but also some of the wiring behind the walls. Since then, most of the work has happened in one particular corner of the house: audio. How it’s generated, how it sounds, how it behaves as you move between devices, and how it can help you read more attentively instead of becoming background noise. This update is a small tour of those changes from the reader’s side, with just enough technical detail to make the choices legible.

The biggest shift is that the entire audio synthesis flow has been rebuilt around a simple realisation: OpenAI’s TTS is more context-aware than I was giving it credit for. When you feed it isolated sentences, it does a competent job. When you feed it whole paragraphs or sections, it starts to breathe a little: pauses become more natural, emphasis lands where you’d intuitively place it, and the voice picks up changes in tone that make long pieces feel like they’re being read by a person who understands the argument, not just the words. That sounds obvious in hindsight, but the previous pipeline treated TTS more like a dumb formatter than a responsive listener. The new pipeline does the opposite: it tries very hard not to interrupt the model’s sense of context unless it absolutely has to.

Practically, that means the chunking rules for synthesis have been rewritten with context, not just character counts, at the centre. OpenAI’s TTS can handle up to 4096 characters in a single request, which is more generous than many tools and gives us room to keep things intact. So the site now behaves roughly like this: if a post has no section headings and fits under that ~4k character ceiling, it gets sent as one continuous chunk, allowing the voice to ride the full arc of the piece without being chopped up. When a post is longer and has headings, each top-level heading is treated as a meaningful boundary: the heading itself is synthesised as its own tiny chunk, and then the body of that section is kept together as long as it fits within the limit. Only when a section’s body is too long do we start splitting at paragraph boundaries, and even then we only fall back to sentence- or length-based splits if a single paragraph is so long that it exceeds the limit on its own. The point isn’t to show off clever rules; it’s to let the model see enough of the surrounding text to decide where to put weight, softness, or urgency, especially in essays where the emotional tone shifts gradually rather than sentence by sentence.
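The fallback ladder described above (whole section, then paragraphs, then sentences, then raw length) can be sketched as a pair of pure functions. The names, types, and greedy packing strategy here are illustrative, not the site’s actual code:

```typescript
const CHAR_LIMIT = 4096; // OpenAI TTS per-request ceiling

interface Section {
  heading?: string;     // top-level heading, if any
  paragraphs: string[]; // body paragraphs in order
}

// Prefer the largest unit that fits: the whole body, then paragraphs,
// and only for oversized paragraphs, sentences or hard length cuts.
function chunkSection(section: Section, limit = CHAR_LIMIT): string[] {
  const chunks: string[] = [];
  if (section.heading) chunks.push(section.heading); // heading as its own tiny chunk

  const body = section.paragraphs.join("\n\n");
  if (body.length <= limit) {
    if (body.length > 0) chunks.push(body); // keep the section intact
    return chunks;
  }

  // Fall back to paragraph boundaries, packing greedily under the limit.
  let current = "";
  for (const para of section.paragraphs) {
    const pieces = para.length <= limit ? [para] : splitLongParagraph(para, limit);
    for (const piece of pieces) {
      const candidate = current ? current + "\n\n" + piece : piece;
      if (candidate.length <= limit) {
        current = candidate;
      } else {
        if (current) chunks.push(current);
        current = piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Sentence-level (then hard length) splitting for oversized paragraphs.
function splitLongParagraph(para: string, limit: number): string[] {
  const sentences = para.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? [para];
  const out: string[] = [];
  let current = "";
  for (const s of sentences) {
    if ((current + s).length <= limit) {
      current += s;
    } else {
      if (current.trim()) out.push(current.trim());
      current = s.length <= limit ? s : "";
      if (s.length > limit) {
        // Last resort: a single sentence longer than the limit is cut by length.
        for (let i = 0; i < s.length; i += limit) out.push(s.slice(i, i + limit).trim());
      }
    }
  }
  if (current.trim()) out.push(current.trim());
  return out;
}
```

The point of the ordering is that every fallback is strictly a last resort: the model only loses context when the character ceiling forces it.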

This idea of “don’t reinvent what the model is already good at” has spilled over into how the site handles mathematical content as well. Previously, I had a home-grown system for turning LaTeX into speech, stitching together various rules to make formulas sound halfway natural. It worked, but it had all the usual flaws of a one-person tool: lots of edge cases, and a nagging sense that I was rebuilding something that the broader community had already spent far more time and care on. So I’ve switched that part of the pipeline over to Speech Rule Engine in ClearSpeak mode. Instead of me guessing at how, say, a nested fraction or a matrix should be read aloud, SRE applies a mature, well-tested set of rules designed for exactly this purpose: accessible, intelligible speech for mathematical content. The result, in my admittedly biased ears, is a clear step up in how posts with equations sound: less robotic, more like something you could follow with your eyes closed and still reconstruct the structure of the expression.

That shift is part of a broader lesson from the last few weeks: not everything needs a custom solution. There are places where building from first principles is the right call, and there are places where the most honest move is to stand on the shoulders of open-source work that has already gone through years of refinement. To honour that, the tech stack page now explicitly credits the libraries and tools in play, including SRE, MathJax, and Montreal Forced Aligner, along with links to their home projects. If you’re curious about the machinery behind the audio or the rendering, you can go and explore the source yourself.

On the visual side of mathematics, the site has also moved to using MathJax for LaTeX rendering. This change is less dramatic from a distance (you’ll still see familiar inline equations and display formulas), but it means the typography, spacing, and accessibility hooks for math are now handled by an engine that’s been battle-tested across thousands of sites. Equations are crisper, more consistent, and better behaved on different devices and zoom levels. It also opens the door to more sophisticated handling of math in the future (for example, richer semantics or alternative reading modes) without having to rework everything from scratch.
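For readers curious what “moving to MathJax” looks like in practice, a minimal MathJax 3 configuration is just an object assigned before the library loads. This is a generic sketch, not the site’s exact options:

```javascript
// Minimal MathJax 3 configuration sketch; the site's actual options may differ.
window.MathJax = {
  tex: {
    inlineMath: [["$", "$"], ["\\(", "\\)"]],    // delimiters for inline equations
    displayMath: [["$$", "$$"], ["\\[", "\\]"]], // delimiters for display formulas
  },
  options: {
    // Skip code blocks and similar tags when scanning the page for math.
    skipHtmlTags: ["script", "noscript", "style", "textarea", "pre", "code"],
  },
};
```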

Outside of math, a few quality-of-life changes have landed quietly but should make the day-to-day experience of reading and listening smoother. The audio player now remembers where you left off. If you’re listening to a long essay on your commute, pause halfway, and come back later, the site will pick up from roughly where you stopped instead of sending you back to the beginning. It’s a small thing, but for a publication that leans toward longer pieces, it matters. I built this one for myself as much as for anyone else: I review a lot of drafts and published posts by listening to them in bed on my phone, and having to hunt for the right minute marker every time was just enough friction to be annoying.
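The resume behaviour is conceptually tiny: persist the playback time under a per-post key, and on return, restore it minus a few seconds so you regain context. A sketch, assuming a localStorage-style key-value store; the key scheme, rewind amount, and names are hypothetical:

```typescript
// Any localStorage-compatible string store works here.
interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical per-post key scheme.
function resumeKey(slug: string): string {
  return `audio-position:${slug}`;
}

// Persist the current playback time (in whole seconds) for a post.
function savePosition(store: KVStore, slug: string, seconds: number): void {
  store.setItem(resumeKey(slug), String(Math.floor(seconds)));
}

// Restore it, rewinding a few seconds so the listener regains context,
// and falling back to the start for missing or invalid values.
function restorePosition(store: KVStore, slug: string, rewind = 3): number {
  const raw = store.getItem(resumeKey(slug));
  const seconds = raw === null ? NaN : Number(raw);
  if (!Number.isFinite(seconds) || seconds <= 0) return 0;
  return Math.max(0, seconds - rewind);
}
```

In the browser the store would simply be `window.localStorage`, with `savePosition` called periodically from the player’s `timeupdate` event.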

Images have also been given a bit more respect in the content pipeline. Embedded images in Markdown are now processed through a custom rehype-figure plugin during the content sync phase. Rather than treating them as generic inline blobs, the site wraps them in proper figure and figcaption structures, which means captions behave more predictably, alignment and spacing are more consistent, and future improvements (like better galleries or lightbox behaviour) have a clean foundation to sit on. From a reader’s perspective, this should show up as images feeling more integrated with the essay rather than glued on top of it: less “floating rectangle,” more “deliberate part of the page.”
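The core of that transform is small. The real plugin walks the document tree via rehype/unified utilities during content sync; this standalone sketch shows just the wrapping step on plain hast-like nodes, with all names illustrative:

```typescript
// A minimal hast-like node shape, enough to illustrate the transform.
interface HastNode {
  type: string;
  tagName?: string;
  properties?: Record<string, unknown>;
  children?: HastNode[];
  value?: string;
}

// Wrap an <img> in a <figure>, promoting its alt text to a <figcaption>.
function toFigure(img: HastNode): HastNode {
  const alt = typeof img.properties?.alt === "string" ? img.properties.alt : "";
  const children: HastNode[] = [img];
  if (alt) {
    children.push({
      type: "element",
      tagName: "figcaption",
      children: [{ type: "text", value: alt }],
    });
  }
  return { type: "element", tagName: "figure", children };
}
```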

The most visible (and, for me, exciting) new feature, though, is word-level highlighting tied to audio playback. When you enable it, the site will highlight each word as it’s spoken and gently auto-scroll the page to keep the current line in view. Under the hood, this is powered by a Dockerised deployment of Montreal Forced Aligner (MFA) behind a custom Python endpoint that plugs into the audio synthesis pipeline. For each post, the pipeline sends the final audio and a normalised version of the text to MFA, which then force-aligns the two and returns precise timestamps for each word. Those timestamps are baked into the HTML as data attributes, and the player simply walks through them in sync with the audio. From a technical standpoint, it’s one of the more intricate bits of plumbing on the site; from a reader’s standpoint, it’s something you can simply flip on in the utilities menu and forget about.
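With per-word timestamps baked into the page, the player-side walk reduces to finding the word whose time span covers the current playback position. A sketch of that lookup (names are illustrative, not the site’s actual code), using a binary search over the aligned spans:

```typescript
// One aligned word, as produced by forced alignment: [start, end) in seconds.
interface WordSpan {
  start: number;
  end: number;
}

// Binary search for the word active at `time`; returns -1 during pauses
// between words or outside the audio entirely.
function activeWordIndex(words: WordSpan[], time: number): number {
  let lo = 0;
  let hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const w = words[mid];
    if (time < w.start) hi = mid - 1;
    else if (time >= w.end) lo = mid + 1;
    else return mid;
  }
  return -1;
}
```

In the browser this would run on each `timeupdate` tick, toggling a highlight class on the element carrying the matching data attribute.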

By default, the highlighting is off. I don’t want to overwhelm first-time visitors with moving parts or turn every reading session into karaoke. But if you’re someone who likes to listen and read at the same time, or if you’re reviewing a piece closely and want to see exactly how a sentence sounds as it’s spoken, it can be surprisingly helpful. I’ve been using it a lot myself on mobile (again, usually in bed, scrolling with one hand while half-asleep), which is a good stress test for how forgiving the interaction needs to be. The auto-scroll is intentionally gentle; it nudges rather than yanks, trying to stay out of the way unless you’ve drifted far from the current line.
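The “nudge, don’t yank” rule boils down to a comfort band: scroll only when the highlighted word has drifted outside the middle portion of the viewport. A pure-decision sketch, with the threshold purely illustrative:

```typescript
// Decide whether to auto-scroll, given the highlighted word's offset from the
// top of the viewport (in pixels). With the default margin, the comfort band
// is the middle half of the screen; inside it, we leave the reader alone.
function shouldAutoScroll(wordTop: number, viewportHeight: number, margin = 0.25): boolean {
  const upper = viewportHeight * margin;
  const lower = viewportHeight * (1 - margin);
  return wordTop < upper || wordTop > lower;
}
```

When it does fire, a smooth `scrollIntoView({ block: "center", behavior: "smooth" })` keeps the correction gentle rather than abrupt.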

There is, however, an honest limitation: at the moment, the word-highlighting feature does not work on posts that include LaTeX. The combination of rendered math and alignment-friendly text introduces enough complexity that I haven’t yet wired it up in a way that I’m happy with. For now, those posts will still have audio, and the equations will still render cleanly with MathJax, but the per-word highlight will sit out. This is on the list to fix: there are promising approaches involving parallel “speech-friendly” text for math that the aligner can use behind the scenes, but I’d rather ship a solid experience for plain-text posts now than hold everything back until the hardest corner case is resolved.

All of these changes share a common thread: making the site feel more like a companion and less like a static archive. Audio that breathes with the text instead of flattening it. Math that looks and sounds the way your mind expects it to. Images that sit comfortably in the flow of reading. A player that respects your time by remembering where you left off. And, when you want it, a highlighted trail through every word of a piece, so you can follow the cadence as well as the content. None of this is flashy, and that’s the point. The aim is still what it has always been: rigorous ideas, clear language, humane tools. Valeon should be a place where you can sit with a difficult thought, on whatever device you’re holding, and have the technology quietly help rather than demand your attention.

As ever, thank you for reading, listening, and occasionally peeking behind the scaffolding with me.
