The file can be downloaded at the end of the post.
As you may already know, I'm travelling to Iceland soon and started learning Icelandic a few months ago. It's advancing slowly but steadily, but I found a problem: when you are teaching yourself a new language, an invaluable tool is a list of the most common words.
I was able to find the 100 most common words in a research paper (Íslenskur Orðasjóður - Building a large…). I don't want to dismiss their results, but in a published paper you can't count hann (he), or the abbreviation f., as a word, I think. They do explain the procedure in the paper, and it looks pretty good; it's just that the list they give leaves a little to be desired, and I could not find a way to use the corpus they generated to get the frequency list.
I decided to do something different. First I thought of sampling a lot of Icelandic data (online news and such), but I didn't want to spend that much time, so I downloaded a dump of is.wikipedia.org. A meagre 42 MB of compressed data; it could have been even smaller if I had been the one doing the sampling. The paper's crawl amounted to 142 GB, in comparison. Truly an amazing corpus!
After I had the data, I wrote a small script that moved all the HTML files to the same directory:

# Navigate the directory tree and move all HTML files here
for FILE in $(find ./ -name '*.html' -type f); do
    mv "$FILE" ./
done
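That loop works, but word-splitting over the output of find breaks on filenames containing spaces. A more robust sketch (my embellishment, not the original script; assumes GNU mv for the -t option) lets find do the moving itself:

```shell
# Move every HTML file from the subdirectories into the current
# directory. -mindepth 2 skips files already here; -exec ... +
# batches the moves and is safe for any filename.
find . -mindepth 2 -name '*.html' -type f -exec mv -t . {} +
```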
My idea was then to cat all these files into one big HTML file and do the word-frequency analysis there. Problem: cat *.html > file does not work; *.html expanded to too many results (around 60 thousand, I think). Instead of writing a script (the solution I should have used) I just cat-ed letter by letter, as in cat A*.html > is-a.dat. I should have used a script similar to the one I created for Christmas:
for i in `seq 1 $FILES`; do
    ls *.jpg | head -n $NUM | tail -n $COLUMNS > F$i
done
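With hindsight, the "too many arguments" problem has a standard fix: pipe the file names from find into xargs, which invokes cat in batches that stay under the shell's argument-list limit. A sketch of what I could have run (not what I actually did at the time):

```shell
# Concatenate every HTML file into one big data file. xargs -0
# splits the name list into as many cat invocations as needed,
# so the argument-list limit is never hit.
find . -name '*.html' -type f -print0 | xargs -0 cat > WPislenska.dat
```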
That is roughly the original code: in the file F$i I would get a list of all the files to cat together. Anyway, I did it by hand. On my way to the final file I found several letter combinations with a lot of pages (Flokkur, meaning Category, for instance) which cat also could not manage (I think the problem was the shell, more than cat). I removed them, because the category pages could have a strong bias towards certain words.
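Removing them could also have been scripted. Assuming the category pages all start with the Flokkur prefix (an assumption about the dump's file naming, not something I verified), something like:

```shell
# Delete category pages, whose long link lists would bias the
# word counts towards category-related vocabulary.
find . -maxdepth 1 -name 'Flokkur*.html' -type f -delete
```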
Once I had this really big WPislenska.dat file, it was just standard command line tricks (which I got from the Linux…):
tr ' ' '\n' < WPislenska.dat | sort | uniq -c | sort -g -r > IslenskaWF-FromWP.dat
This turns spaces into newlines, sorts the result alphabetically, counts unique words, and orders them in decreasing frequency.
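A small refinement I could have added: fold case and split on punctuation before counting, so that Og, og and og, collapse into one entry. A sketch (the output file name is mine; note that byte-based tr will not lowercase accented Icelandic letters such as Þ or Á, so this is only approximate):

```shell
# Split on runs of non-letter characters instead of single spaces,
# lowercase everything, then count as before.
grep -oE '[[:alpha:]]+' WPislenska.dat \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -g -r > IslenskaWF-clean.dat
```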
IslenskaWF-FromWP.dat contains word-frequency counts for the data from this Wikipedia dump. The next step was the maddening one: removing all the HTML entities and Wikipedia words (like page, visitors, users…) and finding the English translations, via Wiktionary, my Icelandic dictionary, my Icelandic learning course and Ubiquity.
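The entity removal, at least, could have been automated. A rough sed pass (the patterns are my guesses at the usual offenders, not an exhaustive filter):

```shell
# Strip HTML tags and named/numeric entities before counting.
sed -E 's/<[^>]*>//g; s/&[a-zA-Z]+;//g; s/&#[0-9]+;//g' \
    WPislenska.dat > WPislenska-noentities.dat
```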
The final result is this file, with the 100 most common words in the Icelandic Wikipedia.