11 November 2005 @ 01:01 am
Dictionaries, Dictionaries...  
So, I've been working on finishing up my Japanese language drill tools. The sticking point had been the vocabulary bit, and so I've been working on an editor, along with word lists to put in it and such. Vocabulary's a much tougher nut to crack than Kanji, because the way the Japanese organize kanji is very... Organized. And the organization is (almost) universally used when talking about Kanji. Vocabulary, on the other hand... Not so much.

Well, I've decided that one logical thing to do is to take the JLPT word lists I can find and use those as primary source material (I already found word frequency lists long ago that I've long since modified to my own nefarious ends, and there are also various Japanese books I might or might not use). Problem is, the list is a web page. Well, that's not a big problem, I can grab the page and save it. But the only definitions are links to ANOTHER web page (one link per word), so that means grabbing all of those pages... For something like 5000 words total for all four JLPT levels.

Still, not a big problem. I can fetch the pages in a script and parse them (it'll take a long time, sure, especially since I put in a delay between fetches -- grabbing them as fast as they'll come is kind of anti-social -- but I'm patient). Except I do everything in Unicode, and the original list of words is in shift-JIS. And the online dictionary it links to is in EUC. Okay, not a big problem, I have an EUC to Unicode conversion table already, and I found a shift-JIS to Unicode table in fairly short order, too.
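For the curious, here's roughly what the fetch-with-a-delay and convert-to-Unicode steps look like. This is only a sketch in Python, not the script I actually wrote: it leans on Python's built-in `shift_jis` and `euc_jp` codecs instead of my hand-rolled conversion tables, and the names `decode_page` and `fetch_politely`, the `fetch` callback, and the two-second delay are all made up for illustration.

```python
import time
from typing import Callable, Dict, Iterable


def decode_page(raw: bytes, encoding: str) -> str:
    """Turn raw page bytes into a Unicode string.

    errors="replace" swaps malformed bytes for U+FFFD instead of
    raising, so one bad page doesn't kill a long run.
    """
    return raw.decode(encoding, errors="replace")


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],      # hypothetical: returns the raw bytes for a URL
    encoding: str = "euc_jp",           # assumed encoding of the dictionary pages
    delay_seconds: float = 2.0,         # assumed delay; pick whatever seems polite
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between requests."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait between fetches rather than hammering the server.
            sleep(delay_seconds)
        pages[url] = decode_page(fetch(url), encoding)
    return pages
```

Passing `fetch` and `sleep` in as arguments means the loop can be tried out with canned bytes and no real waiting before pointing it at an actual site.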

Except I was writing the program in TCL (already had a bunch of code for this lying around from writing the drill tools, which are written in TCL/TK -- TCL being the most natural way to talk to TK if you ask me; Perl, etc. are more awkward), and there was some sort of weird bug where it wasn't handling the binary data correctly, and after trying to get at it three different ways, it was clear I wasn't going to be able to work around it. So, I re-wrote the whole thing in Perl in fairly short order.

And now it more or less works, except the data is messy as all hell and will take some cleanup, but that's okay. That's what I wrote the dictionary editor for in the first place.

The funny thing, though, is that the TCL array handling code was way faster than the Perl code on startup (it's loading all sorts of crap to build the dictionary, on the order of 10k records each in the EUC and shift-JIS tables, and almost 50k in the frequency table), which doesn't make any sense to me at all. My only guess is that maybe Perl was doing some extra work at initialization to optimize lookups, but it was wasted energy, because even a slow lookup for the handful of characters it needs for every word doesn't matter when all the runtime is tied up in fetching HTML and thumb-twiddling.
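
If you want to see what that startup cost amounts to, here's a rough sketch of loading one conversion table into a hash and timing it. Again this is Python rather than my actual TCL or Perl, and the file layout is an assumption on my part: two hex columns (source code, then Unicode code point) with `#` comments, the way the Unicode consortium's mapping files are laid out. `load_table` is a made-up name.

```python
import time
from typing import Dict, Iterable


def load_table(lines: Iterable[str]) -> Dict[int, str]:
    """Build a {source code: Unicode character} map from table lines."""
    table: Dict[int, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Assumed layout: "0x8140<TAB>0x3000<TAB># comment"
        source_hex, unicode_hex = line.split()[:2]
        table[int(source_hex, 16)] = chr(int(unicode_hex, 16))
    return table


if __name__ == "__main__":
    sample = ["# tiny stand-in for a real 10k-line table",
              "0x8140\t0x3000\t# IDEOGRAPHIC SPACE"]
    start = time.perf_counter()
    table = load_table(sample)
    elapsed = time.perf_counter() - start
    print(f"loaded {len(table)} entries in {elapsed:.6f}s")
```

Timing just the load like this is the easy way to check whether startup or lookup is where the time goes, though with the fetch delay in the loop it hardly matters either way.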
In the mood: productive