World's most boring audiobook: German-English text-to-speech
A lot of my projects start from a great deal of "I wonder what the result might look like". This one held no surprises.
A few days ago I quickly hacked together a pocket translator using an M5Cardputer and the WikDict SQLite dictionaries.
Thinking about what else I could do with these databases, I had a stupid thought:
Let’s make an audiobook of German words and their English translations.
I hadn’t done any TTS in… ages. Luckily, with the help of Gemini I could find a decent modern library I could run locally, after failing with espeak-ng (which does not work well on Mac) and Mozilla TTS (which has few models and is unmaintained). I ended up using Coqui TTS, with two models (one for English and one for German) that work reasonably well.
Input data preparation
The first thing was preparing a list of the most important German words from the WikDict databases.
sqlite3 de-en.sqlite3
Supposedly, the importance column in the de-en.sqlite3 database measures some proxy of exactly that:
CREATE TABLE translation(
lexentry TEXT,
sense_num TEXT,
sense,
written_rep TEXT,
trans_list,
score,
is_good,
importance
);
I wanted the German nouns to come together with their article, and otherwise get the part of speech: it is good to know whether something is a verb or an adjective. This information is in the de.sqlite3 database:
ATTACH DATABASE 'de.sqlite3' AS de;
SELECT DISTINCT
t.lexentry,
t.written_rep,
t.trans_list,
e.vocable,
e.part_of_speech,
e.gender,
t.importance
FROM translation AS t
JOIN de.entry AS e
ON t.lexentry = e.lexentry
ORDER BY importance DESC
LIMIT 10;
This looked reasonably good, so I created a separate sqlite3 file with this:
ATTACH DATABASE 'top_de.sqlite3' AS top;
CREATE TABLE top.top AS
SELECT DISTINCT
t.lexentry,
t.written_rep,
t.trans_list,
e.vocable,
e.part_of_speech,
e.gender,
t.importance
FROM translation AS t
JOIN de.entry AS e
ON t.lexentry = e.lexentry
ORDER BY importance DESC
LIMIT 10000;
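To sanity-check the new table from Python, something like this works (file and column names as above; the helper name is my own):

```python
import sqlite3

def fetch_top(conn, limit=5):
    """Return the highest-importance rows from the top table created above."""
    return conn.execute(
        "SELECT written_rep, trans_list, part_of_speech, gender "
        "FROM top ORDER BY importance DESC LIMIT ?",
        (limit,),
    ).fetchall()

# conn = sqlite3.connect("top_de.sqlite3")  # then: fetch_top(conn)
```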
The top 10000 words looked reasonably large but not overwhelming. But, if you think about it, at 5 seconds per word this is going to be close to 14 hours. Not the most exciting “audiobook”.
Audio file generation
I got Gemini to write most of this. Although I set it up as a Poetry project, I’m posting it as a script here. You will need to install TTS (capitalised) and pydub:
If you happen to run this, it might take a while: it took my Mac M1 around 5 hours to generate the 4 GB of wav files.
Final cleanup
To wrap up (I could have done this in Python, but did not bother), I grouped the resulting wav files into folders of 1000 files each and converted each folder into a concatenated mp3:
for f in *.wav; do echo "file '$f'"; done > input.txt  # Avoid issues with spaces and long command lines
ffmpeg -f concat -safe 0 -i input.txt -c:a libmp3lame -q:a 4 file.mp3
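The grouping step itself is glossed over above; if you would rather script it, here is a small sketch (the chunk size matches the post, the folder naming is my own):

```python
import shutil
from pathlib import Path

def group_files(src, chunk_size=1000):
    """Move wav files from src into subfolders of chunk_size files each."""
    src = Path(src)
    wavs = sorted(src.glob("*.wav"))
    for i, wav in enumerate(wavs):
        dest = src / f"part_{i // chunk_size:02d}"
        dest.mkdir(exist_ok=True)
        shutil.move(str(wav), dest / wav.name)
    return [d for d in sorted(src.iterdir()) if d.is_dir()]
```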
If you do this, you will end up with around 14 hours of todlangweilig (dead boring) audio.