Diarization and Transcription

2025-07-06 I had it in my mind that it would be great to be able to do transcriptions without pumping audio into some random website.

n8n

I was in the middle of figuring out n8n and had a workflow that called Whisper (with an OpenAI key) for the transcript. The transcription worked, but it just puked out one big block of text. I (Gemma and I) then vibe coded a sentence breaker. It was decent, set up to break every 4 sentences or 100 words. There were some bizarre splits, though.
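For anyone curious what a breaker like that can look like, here is a minimal sketch in Python. It is not the code from the n8n workflow; the function name and the naive "split after . ! or ?" rule are assumptions made for illustration. A rule that simple is also one plausible source of odd splits, since it treats abbreviations like "Dr." as sentence endings.

```python
import re


def break_into_paragraphs(text, max_sentences=4, max_words=100):
    """Group sentences into paragraphs of at most 4 sentences or about 100 words.

    Hypothetical helper for illustration, not the original workflow code.
    """
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    paragraphs = []
    current = []
    word_count = 0
    for sentence in sentences:
        current.append(sentence)
        word_count += len(sentence.split())
        # Close the paragraph as soon as either threshold is reached.
        if len(current) >= max_sentences or word_count >= max_words:
            paragraphs.append(" ".join(current))
            current = []
            word_count = 0

    # Keep whatever is left over as a final, shorter paragraph.
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Calling `"\n\n".join(break_into_paragraphs(transcript))` would turn the wall of text into blank-line-separated paragraphs.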

Then I was feeling good and thought that I could set up a Google Form that would accept my audio file from anywhere. n8n would watch for changes to the folder on my home machine, and the transcription would be done when I returned home.

Once that was done, my next test was to upload… well, it was then I learned that Whisper can only take a file of up to 25 MB, and I couldn’t figure out how to get splitter / merge code to work. Plus, if there was more than one person speaking, it could not make those distinctions.
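In hindsight, the splitter / merge idea comes down to two pieces of bookkeeping: deciding where to cut, and shifting the timestamps back when stitching the results together. The sketch below covers only that bookkeeping, assuming a roughly constant bitrate. The function names, the 10% headroom, and reading "25 MB" as 25,000,000 bytes are all assumptions, and the actual audio cutting (with a tool such as ffmpeg) is left out.

```python
import math

# Assumption: treat the 25 MB limit conservatively as 25,000,000 bytes.
MAX_BYTES = 25 * 1000 * 1000


def plan_chunks(file_bytes, duration_s, limit_bytes=MAX_BYTES, headroom=0.9):
    """Return (start, end) times in seconds so each chunk stays under the limit.

    Assumes size grows evenly with time (constant bitrate); headroom leaves
    some slack for files where that is only approximately true.
    """
    if file_bytes <= 0 or duration_s <= 0:
        raise ValueError("file_bytes and duration_s must be positive")
    count = max(1, math.ceil(file_bytes / (limit_bytes * headroom)))
    step = duration_s / count
    return [(i * step, min((i + 1) * step, duration_s)) for i in range(count)]


def merge_chunk_segments(chunks, per_chunk_segments):
    """Shift each chunk's segment times by the chunk start and concatenate."""
    merged = []
    for (chunk_start, _chunk_end), segments in zip(chunks, per_chunk_segments):
        for seg in segments:
            merged.append({
                "start": seg["start"] + chunk_start,
                "end": seg["end"] + chunk_start,
                "text": seg["text"],
            })
    return merged
```

Cutting at fixed times can land in the middle of a word, so a real version would probably want to cut at a silence or overlap the chunks slightly.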

Docker

I had been trying to use Docker prior to this, but I’m still not sure I have a handle on it. It’s a program, but it creates a container, and the container can access the computer’s resources (how much is settable). Regardless, I got the bright idea that running locally would be best, and Docker was going to make this work for me.

Gemma and I started on the vibe again. Many bumps and bruises and an upgrade from Radeon to NVIDIA hardware later, I finally had the container batch transcribing. Again there was no distinction between speakers. Just all words.

After some googling, I found this is referred to as diarization. Essentially the file is run once for the transcription with timestamps, then run again for changes in voices with timestamps, and then the data is merged to create Speaker 1, Speaker 2, etc.
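The merge step is the easiest part to show in code. Here is a minimal sketch, assuming the transcription pass produces segments with `start`, `end`, and `text`, and the diarization pass produces turns with `start`, `end`, and `speaker`. Those field names and the "largest overlap wins" rule are assumptions for illustration, not what any particular library outputs.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the time shared by two intervals (0 if none)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker = "Unknown"
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker = turn["speaker"]
                best_overlap = shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


def format_transcript(labeled):
    """Join consecutive segments from the same speaker into one line each."""
    lines = []
    for seg in labeled:
        if lines and lines[-1][0] == seg["speaker"]:
            lines[-1][1].append(seg["text"])
        else:
            lines.append((seg["speaker"], [seg["text"]]))
    return "\n".join(f"{speaker}: {' '.join(texts)}" for speaker, texts in lines)
```

A segment that straddles a speaker change gets handed whole to whoever talked longer inside it, which is a simplification; splitting at word-level timestamps would be more precise.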

Getting this to work took almost a week of trying different things. Then a shoutout to Joseph C. Topping, who shared code doing what I was looking to accomplish. I presented that code to a new chat in AI Studio and it seemed to help it along as it created a Docker environment. Eventually I got it to work on a single audio file first, and then I managed to get it to batch anything new in the folder.
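The "batch anything new in the folder" part can be sketched without any of the heavy machinery. This is not the code that ended up in the container; it is a minimal illustration that assumes a file counts as "new" when no matching `.txt` transcript exists yet. The folder layout, the suffix list, and the `transcribe` callable are all hypothetical stand-ins.

```python
from pathlib import Path

# Assumption: which file types count as audio.
AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a", ".flac"}


def find_new_audio(in_dir, out_dir):
    """Return audio files in in_dir that have no .txt transcript in out_dir yet."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    pending = []
    for path in sorted(in_dir.iterdir()):
        if path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES:
            if not (out_dir / (path.stem + ".txt")).exists():
                pending.append(path)
    return pending


def process_folder(in_dir, out_dir, transcribe):
    """Run transcribe(path) on every new audio file and save the text.

    transcribe is a placeholder for the real transcription and diarization
    step; it takes a Path and returns a string.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for audio in find_new_audio(in_dir, out_dir):
        text = transcribe(audio)
        (out_dir / (audio.stem + ".txt")).write_text(text, encoding="utf-8")
        done.append(audio.name)
    return done
```

Running `process_folder` on a timer, or each time the container starts, gives the batch behavior: files that already have a transcript are skipped on the next pass.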

I learned a few things. Transcription and diarization are not as simple as I thought they would be. By the time I got it all working, my desire to implement voice print had waned. Maybe I will swing back and pick it up later.

It wasn’t long after I moved on from this project that Nick Gray posted this killer tweet on a simple way to get the transcription and diarization done. Check it out. Easier input and far superior output. Thanks for sharing Nick!

Gemini 2.5 Pro is probably the best AI model for speech-to-text transcription right now

I'll tell you why you should be doing a lot more voice recordings and getting the transcripts

But first let me show you exactly how to do this for free

If you have a voice note from your…
— Nick Gray / How to Make Friends (@nickgraynews) July 14, 2025