Growing in the Gaps
The last update — Growing into the Valeon Ecosystem — ended with a promise and a direction. VocaSync had been put under load and rebuilt around what that load revealed. The design language had been unified across all Valeon products. Plutarc was next. ShipSpace had moved from hypothesis to architectural planning.
What it did not mention was anything about OEMI. Or a Fourier transform experiment that had no business being as interesting as it turned out to be.
This is not a new pattern. Before VocaSync existed, there were two browser-based tools: Valeon TTS Studio, which made text-to-speech usable without a deployment pipeline, and WhisperX Studio, which put automatic speech recognition in the browser and asked what accessibility in that space could actually look like. Neither was a product in the full sense. Both were questions that needed answering before a platform could be built around the answers. VocaSync is, in part, what those questions produced.
The work that happens in the gaps is not filler. It is often the work that makes the next thing possible — the contained experiment that maps unfamiliar territory, the rewrite that brings a product into line with the standard the rest of the stack has since reached. It does not announce itself. It ships, or it informs, and then you write about it after the fact.
This update covers that kind of work. Two things have shipped that were not loudly announced, and one idea is still forming that I am not ready to make precise. What follows is an account of all three.
OEMI Gets a Proper Identity
OEMI has been running quietly for a while now, doing what it was designed to do. But the UI was telling a different story to the one the rest of Valeon tells — it looked like a product from a different house.
That changes now. The UI has been rebuilt from the ground up to match the design philosophy running across Plutarc, VocaSync, and the Valeon blog. Same visual language, same spatial logic, same sense that these things belong to one another. The Valeon umbrella should be immediately legible across every product under it.
More substantially: the identity submission, editing, and review pipeline has been completely reworked. The old flow had friction in the wrong places. The new one borrows the rendering architecture from Valeon — the same pipeline that handles post rendering on this blog — and applies it to identity pages. The result is a far more coherent authoring experience, and an identity page that renders with the same fidelity I expect from the rest of the stack.
The editor upgrade deserves its own mention. OEMI now uses TipTap, which is the same direction the Valeon author dashboard is heading. One editor abstraction, adopted consistently, rather than a different tool per product. Small decision, meaningful over time.
Wavegram: A Fourier Experiment That Shipped
Wavegram started as a question: what does audio actually look like when you pull it apart with a Fast Fourier Transform and render it as an image?
The answer, it turns out, is interesting enough to put in front of people — and more complete than I initially described.
Wavegram is a two-tab single-page application. The forward direction is what I originally led with: drop in an audio file of up to sixty seconds, and it produces a square PNG spectrogram entirely in the browser. No server, no uploads, no network calls. The audio is decoded via the Web Audio API, mixed to mono, resampled to 16 kHz, and run through a Short-Time Fourier Transform with a Hann window before the magnitudes are written to the image. There are two precision modes: 8-bit grayscale, and 16-bit, where each magnitude is packed across the red and green channels for substantially cleaner output. The 16-bit mode is the recommended default.
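To make the forward pass concrete, here is a minimal sketch of its two core steps: taking the magnitude spectrum of one Hann-windowed frame, and packing a magnitude into 16 bits across the red and green channels. It is not Wavegram's actual code. The function names are hypothetical, the transform is a naive DFT rather than an FFT to keep the example short, and both the byte order (red as the high byte) and the assumption that magnitudes are already normalised to the range 0 to 1 are mine.

```typescript
// Periodic Hann window of length n, the form usually used for STFT analysis.
function hannWindow(n: number): Float64Array {
  const w = new Float64Array(n);
  for (let i = 0; i < n; i++) w[i] = 0.5 - 0.5 * Math.cos((2 * Math.PI * i) / n);
  return w;
}

// Magnitude spectrum of one windowed frame via a naive O(n^2) DFT.
// Returns n/2 + 1 bins, from DC up to the Nyquist frequency.
function frameMagnitudes(frame: Float64Array, win: Float64Array): Float64Array {
  const n = frame.length;
  const bins = n / 2 + 1;
  const mags = new Float64Array(bins);
  for (let k = 0; k < bins; k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const x = frame[t] * win[t];
      const angle = (-2 * Math.PI * k * t) / n;
      re += x * Math.cos(angle);
      im += x * Math.sin(angle);
    }
    mags[k] = Math.hypot(re, im);
  }
  return mags;
}

// Quantise a normalised magnitude in [0, 1] to 16 bits and split it into two
// 8-bit channel values. Assumed layout: red = high byte, green = low byte.
function packMagnitude16(value: number): [number, number] {
  const clamped = Math.min(1, Math.max(0, value));
  const q = Math.round(clamped * 65535);
  return [q >> 8, q & 0xff];
}

// Inverse of packMagnitude16: rebuild the normalised magnitude from R and G.
function unpackMagnitude16(r: number, g: number): number {
  return ((r << 8) | g) / 65535;
}
```

The point of the second channel is resolution: 8 bits give 256 magnitude levels per pixel, while 16 bits give 65,536, which matters when the image has to be turned back into audio.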
The backward direction is the more interesting part. The same PNG can be loaded back in, the spectrogram read, and the audio reconstructed via Griffin-Lim phase recovery running in a Web Worker, with three quality presets from fast to high-fidelity. Phase is discarded in the forward pass and estimated iteratively on the return, so the reconstruction is faithful to the original but not bit-exact. That is a property of magnitude-only spectrograms, not a limitation of the implementation.
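For readers who have not met Griffin-Lim before, the loop is short enough to sketch. The version below is an illustration under my own assumptions, not the Wavegram implementation: it uses a naive DFT instead of an FFT, works on full symmetric spectra, starts from zero phase so the result is deterministic (a random starting phase is the more common choice), and all function names are hypothetical. Each iteration keeps the phase of the current estimate, swaps in the target magnitudes, and resynthesises.

```typescript
type Spectrum = { re: Float64Array; im: Float64Array };

function hann(n: number): Float64Array {
  const w = new Float64Array(n);
  for (let i = 0; i < n; i++) w[i] = 0.5 - 0.5 * Math.cos((2 * Math.PI * i) / n);
  return w;
}

// Naive O(n^2) DFT. sign = -1 is the forward transform, +1 the unscaled inverse.
function dft(re: Float64Array, im: Float64Array, sign: number): Spectrum {
  const n = re.length;
  const out = { re: new Float64Array(n), im: new Float64Array(n) };
  for (let k = 0; k < n; k++) {
    for (let t = 0; t < n; t++) {
      const a = (sign * 2 * Math.PI * k * t) / n;
      const c = Math.cos(a);
      const s = Math.sin(a);
      out.re[k] += re[t] * c - im[t] * s;
      out.im[k] += re[t] * s + im[t] * c;
    }
  }
  return out;
}

function stft(signal: Float64Array, size: number, hop: number): Spectrum[] {
  const w = hann(size);
  const frames: Spectrum[] = [];
  for (let start = 0; start + size <= signal.length; start += hop) {
    const frame = new Float64Array(size);
    for (let t = 0; t < size; t++) frame[t] = signal[start + t] * w[t];
    frames.push(dft(frame, new Float64Array(size), -1));
  }
  return frames;
}

// Least-squares inverse STFT: windowed overlap-add, divided by the summed squared window.
function istft(frames: Spectrum[], size: number, hop: number): Float64Array {
  const w = hann(size);
  const length = (frames.length - 1) * hop + size;
  const out = new Float64Array(length);
  const norm = new Float64Array(length);
  frames.forEach((f, i) => {
    const x = dft(f.re, f.im, 1).re; // real part of the unscaled inverse
    for (let t = 0; t < size; t++) {
      out[i * hop + t] += (x[t] / size) * w[t];
      norm[i * hop + t] += w[t] * w[t];
    }
  });
  for (let i = 0; i < length; i++) if (norm[i] > 1e-8) out[i] /= norm[i];
  return out;
}

function magnitudes(frames: Spectrum[]): Float64Array[] {
  return frames.map((f) => f.re.map((r, k) => Math.hypot(r, f.im[k])));
}

// Sum of squared differences between a signal's STFT magnitudes and the target.
function spectralError(signal: Float64Array, target: Float64Array[], size: number, hop: number): number {
  let err = 0;
  magnitudes(stft(signal, size, hop)).forEach((m, i) =>
    m.forEach((v, k) => { err += (v - target[i][k]) * (v - target[i][k]); }));
  return err;
}

function griffinLim(target: Float64Array[], size: number, hop: number, iterations: number): Float64Array {
  // Zero-phase start: the target magnitudes with no imaginary part.
  let signal = istft(target.map((m) => ({ re: Float64Array.from(m), im: new Float64Array(size) })), size, hop);
  for (let it = 0; it < iterations; it++) {
    const projected = stft(signal, size, hop).map((f, i) => {
      const re = new Float64Array(size);
      const im = new Float64Array(size);
      for (let k = 0; k < size; k++) {
        const mag = Math.hypot(f.re[k], f.im[k]);
        // Keep the estimated phase, replace the magnitude with the target's.
        re[k] = mag > 1e-12 ? (f.re[k] / mag) * target[i][k] : target[i][k];
        im[k] = mag > 1e-12 ? (f.im[k] / mag) * target[i][k] : 0;
      }
      return { re, im };
    });
    signal = istft(projected, size, hop);
  }
  return signal;
}
```

Calling griffinLim with the magnitudes of a known signal and comparing spectralError before and after shows the estimate moving towards a signal whose spectrogram matches the target. It never recovers the original samples exactly, which is the "faithful but not bit-exact" behaviour described above. More iterations cost more time, which is presumably what a set of quality presets trades against.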
What makes this work cleanly is that the PNG is self-describing. Every parameter needed to reconstruct the audio is embedded directly in the image — in the first sixteen rows, rendered as 8×8 black-and-white squares: sample rate, FFT and hop size, sample count, precision flag, and a CRC-16 to validate the lot. The format carries a four-byte magic sequence. Lossy image formats are rejected at the decoder by signature, because JPEG and WebP compression corrupts the magnitude values. Any Wavegram PNG decodes with no external state, by design.
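The three mechanisms in that paragraph can each be sketched in a few lines: rejecting lossy containers by their file signature, reading a header bit back out of an 8×8 block, and validating the header bytes with a CRC-16. This is a hedged illustration, not the Wavegram codec. The post does not say which CRC-16 variant is used, so CRC-16/CCITT-FALSE is an assumption, as are the white-means-one convention, the single-channel pixel buffer, and every function name. The actual field layout and magic bytes are not reproduced here because the post does not specify them.

```typescript
type ImageFormat = "png" | "jpeg" | "webp" | "unknown";

function hasSignature(bytes: Uint8Array, signature: number[], offset = 0): boolean {
  return signature.every((b, i) => bytes[offset + i] === b);
}

// Identify the container from its leading bytes so lossy formats can be refused early.
function sniffImageFormat(bytes: Uint8Array): ImageFormat {
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) return "jpeg";
  // WebP is a RIFF container: "RIFF" at offset 0 and "WEBP" at offset 8.
  if (hasSignature(bytes, [0x52, 0x49, 0x46, 0x46]) && hasSignature(bytes, [0x57, 0x45, 0x42, 0x50], 8)) {
    return "webp";
  }
  return "unknown";
}

const BLOCK = 8; // each header bit occupies a BLOCK x BLOCK square of pixels

// Paint one bit into a single-channel pixel buffer (assumed: white = 1, black = 0).
function writeBlockBit(pixels: Uint8Array, width: number, col: number, row: number, bit: number): void {
  for (let y = 0; y < BLOCK; y++) {
    for (let x = 0; x < BLOCK; x++) {
      pixels[(row * BLOCK + y) * width + col * BLOCK + x] = bit ? 255 : 0;
    }
  }
}

// Read a bit back by thresholding the block's mean brightness.
function readBlockBit(pixels: Uint8Array, width: number, col: number, row: number): number {
  let sum = 0;
  for (let y = 0; y < BLOCK; y++) {
    for (let x = 0; x < BLOCK; x++) {
      sum += pixels[(row * BLOCK + y) * width + col * BLOCK + x];
    }
  }
  return sum / (BLOCK * BLOCK) > 127 ? 1 : 0;
}

// CRC-16/CCITT-FALSE (assumed variant): polynomial 0x1021, initial value 0xFFFF.
function crc16(bytes: Uint8Array): number {
  let crc = 0xffff;
  for (let i = 0; i < bytes.length; i++) {
    crc ^= bytes[i] << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}
```

A decoder built this way would sniff the format first, read the header blocks into bytes, recompute the CRC over everything except the stored checksum, and refuse to reconstruct audio if the two values differ. Large blocks with a mean-brightness threshold make the header tolerant of small pixel errors, while the checksum catches anything that still gets through.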
The source is on GitHub if you want to pull it apart. The DSP and codec logic is kept entirely free of any DOM or React dependency — it runs identically in the browser, in a Web Worker, and in Vitest under Node.
Wavegram was a contained experiment with a specific goal: understand what is possible at the intersection of audio, frequency analysis, and in-browser rendering. That goal has been met. What I did not expect is that meeting it would illuminate something directly relevant to where VocaSync is heading.
The Signal I Have Not Finished Reading Yet
I will not overstate what I know here, because I do not know enough yet to state it precisely.
What I can say is this: spectrograms are not decorative. They are the intermediate representation that most synthesis and voice cloning models actually operate on. Mel spectrograms sit at the centre of how XTTS v2, StyleTTS 2, and the rest of the VocaSync Studio stack produce and evaluate audio. Wavegram probed that domain from the browser side, and the distance between what it can do and what a studio-grade workflow might need is shorter than I initially assumed.
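For context on what separates a linear spectrogram like Wavegram's from a mel spectrogram, here is a small sketch of the mel scale itself. It uses the widely used formula mel = 2595 · log10(1 + f / 700). The function names, the band count, and the 0 to 8 kHz range (the Nyquist limit of a 16 kHz signal) are illustrative assumptions, and say nothing about the parameters any particular model uses.

```typescript
// Convert a frequency in Hz to mels using the common 2595 * log10(1 + f / 700) formula.
function hzToMel(hz: number): number {
  return 2595 * Math.log10(1 + hz / 700);
}

// Inverse of hzToMel.
function melToHz(mel: number): number {
  return 700 * (Math.pow(10, mel / 2595) - 1);
}

// Centre frequencies in Hz of `bands` filters spaced evenly on the mel scale
// between minHz and maxHz (the two edges themselves are excluded).
function melBandCentres(bands: number, minHz: number, maxHz: number): number[] {
  const lo = hzToMel(minHz);
  const hi = hzToMel(maxHz);
  return Array.from({ length: bands }, (_, i) => melToHz(lo + ((i + 1) * (hi - lo)) / (bands + 1)));
}
```

Calling melBandCentres(10, 0, 8000) yields centres that are packed closely at low frequencies and spread out towards 8 kHz. A mel spectrogram is, roughly, a linear spectrogram whose bins have been pooled into bands like these, so the two representations are one filterbank apart.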
There is something there. I am not ready to say what it becomes yet.
More when I am.