World's most boring audiobook: German-English text-to-speech
A lot of my projects start from a great deal of "I wonder what the result might look like". This one held no surprises.
A few days ago I quickly hacked together a pocket translator using an M5Cardputer and the WikDict SQLite dictionaries.
Thinking about what else I could do with these databases, I had a stupid thought:
Let’s make an audiobook of German words and their English translations.
I hadn’t done any TTS in… ages. Luckily, with the help of Gemini I could find a decent modern library I could run locally, after failing with espeak-ng (which does not work well on Mac) and Mozilla TTS (which has few models and is unmaintained). I ended up using Coqui TTS, with two models (one for English and one for German) that work reasonably well.
Input data preparation
The first thing was preparing a list of the most important German words from the WikDict databases.
sqlite3 de-en.sqlite3
Supposedly, the importance column in the de-en.sqlite3 database measures some proxy of exactly that:
CREATE TABLE translation(
lexentry TEXT,
sense_num TEXT,
sense,
written_rep TEXT,
trans_list,
score,
is_good,
importance
);
I wanted the German nouns to come together with their article, and otherwise get the part of speech: it is good to know whether something is a verb or an adjective. This information is in the de.sqlite3 database:
ATTACH DATABASE 'de.sqlite3' AS de;
SELECT DISTINCT
t.lexentry,
t.written_rep,
t.trans_list,
e.vocable,
e.part_of_speech,
e.gender,
t.importance
FROM translation AS t
JOIN de.entry AS e
ON t.lexentry = e.lexentry
ORDER BY importance DESC
LIMIT 10;
This looked reasonably good, so I created a separate sqlite3 file with this:
ATTACH DATABASE 'top_de.sqlite3' AS top;
CREATE TABLE top.top AS
SELECT DISTINCT
t.lexentry,
t.written_rep,
t.trans_list,
e.vocable,
e.part_of_speech,
e.gender,
t.importance
FROM translation AS t
JOIN de.entry AS e
ON t.lexentry = e.lexentry
ORDER BY importance DESC
LIMIT 10000;
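To sanity-check the new table from Python, something like this works (file and column names as above; the helper name is my own):

```python
import sqlite3

def fetch_top(conn, limit=5):
    """Return the highest-importance rows from the top table created above."""
    return conn.execute(
        "SELECT written_rep, trans_list, part_of_speech, gender "
        "FROM top ORDER BY importance DESC LIMIT ?",
        (limit,),
    ).fetchall()

# conn = sqlite3.connect("top_de.sqlite3")  # then: fetch_top(conn)
```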
The top 10000 words looked reasonably large but not overwhelming. But, if you think about it, at 5 seconds per word this is going to be close to 14 hours. Not the most exciting “audiobook”.
Audio file generation
I got Gemini to write most of this. Although I set it up as a Poetry project, I’m posting it as a script here. You will need to install TTS (capitalised) and pydub:
If you happen to run this, it might take a while: it took my Mac M1 around 5 hours to generate the 4 GB of wav files.
Final cleanup
To wrap up (I could have done this in Python, but did not bother), I grouped the resulting wav files into folders of 1000 files each and converted each folder into a concatenated mp3:
for f in *.wav; do echo "file '$f'"; done > input.txt  # Avoid issues with spaces and long command lines
ffmpeg -f concat -safe 0 -i input.txt -c:a libmp3lame -q:a 4 file.mp3
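The grouping step itself is glossed over above; if you would rather script it, here is a small sketch (the chunk size matches the post, the folder naming is my own):

```python
import shutil
from pathlib import Path

def group_files(src, chunk_size=1000):
    """Move wav files from src into subfolders of chunk_size files each."""
    src = Path(src)
    wavs = sorted(src.glob("*.wav"))
    for i, wav in enumerate(wavs):
        dest = src / f"part_{i // chunk_size:02d}"
        dest.mkdir(exist_ok=True)
        shutil.move(str(wav), dest / wav.name)
    return [d for d in sorted(src.iterdir()) if d.is_dir()]
```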
If you do this, you will end up with around 14 hours of todlangweilig (dead boring) audio.