November 11th, 2005


Dictionaries, Dictionaries...

So, I've been trying to finish up my Japanese language drill tools. The sticking point had been the vocabulary bit, and so I've been working on an editor, along with word lists to put in it and such. Vocabulary's a much tougher nut to crack than kanji, because the way the Japanese organize kanji is very... Organized. And the organization is (almost) universally used when talking about kanji. Vocabulary, on the other hand... Not so much.

Well, I've decided that one logical thing to do is to use the JLPT word lists I can find as primary source material (I already found word frequency lists long ago that I've long since modified to my own nefarious ends, and there are also various Japanese books I might or might not use). Problem is, the list is a web page. Well, that's not a big problem, I can grab the page and save it. But the only definitions are links to ANOTHER web page (one link per word), so that means grabbing all of those pages... For something like 5000 words total for all four JLPT levels.

Still, not a big problem. I can fetch the pages in a script and parse them (it'll take a long time, sure, especially since I put in a delay between fetches -- grabbing them as fast as they'll come is kind of anti-social -- but I'm patient). Except I do everything in Unicode, and the original list of words is in Shift-JIS. And the online dictionary it links to is in EUC. Okay, not a big problem, I have an EUC to Unicode conversion table already, and I found a Shift-JIS to Unicode table in fairly short order, too.
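For the curious, the fetch-with-a-delay and convert-to-Unicode steps look roughly like this. It's a minimal sketch in Python, not the actual script: it swaps in Python's built-in `shift_jis` and `euc_jp` codecs where I used my own conversion tables, and the function names and the two-second default delay are made up for illustration.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_bytes(url: str) -> bytes:
    """Download one page and return its raw, undecoded bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode raw page bytes into a Unicode string.

    Bytes that are invalid in the given encoding become U+FFFD
    instead of raising, since scraped pages are often messy.
    """
    return raw.decode(encoding, errors="replace")


def fetch_politely(
    urls: Iterable[str],
    encoding: str,
    delay_seconds: float = 2.0,  # assumed value, tune to taste
    fetch: Callable[[str], bytes] = fetch_bytes,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between requests."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause needed before the first request
        pages[url] = to_unicode(fetch(url), encoding)
    return pages
```

The word list would go through this with `encoding="shift_jis"` and the dictionary pages with `encoding="euc_jp"`. The `fetch` and `sleep` parameters are only there so the loop can be tried out without touching the network.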

Except I was writing the program in TCL (I already had a bunch of code for this lying around from writing the drill tools, which are written in TCL/TK -- TCL being the most natural way to talk to TK if you ask me; Perl, etc. are more awkward), and there was some sort of weird bug where it wasn't handling the binary data correctly, and after trying to get at it three different ways, it was clear I wasn't going to be able to work around it. So, I re-wrote the whole thing in Perl in fairly short order.

And now it more or less works, except the data is messy as all hell and will take some cleanup, but that's okay. That's what I wrote the dictionary editor for in the first place.

The funny thing, though, is that the TCL array handling code was way faster than the Perl code on startup (it's loading all sorts of crap to build the dictionary, on the order of 10k records each in the EUC and Shift-JIS tables, and almost 50k in the frequency table), which doesn't make any sense to me at all. My only guess is that maybe Perl was doing extra work on initialization to optimize lookups, but it was wasted energy, because even a slow lookup for the handful of characters it needs for every word doesn't matter when all the runtime is tied up in fetching HTML and thumb-twiddling.
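To make the load-versus-lookup point concrete, here's a rough sketch of loading a conversion table into a hash and timing it. It's Python rather than the TCL or Perl I actually used, and the table format (two hex columns, `#` comments) is an assumption for the example, not necessarily what my tables look like. The names `load_table` and `timed` are made up.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def load_table(lines: Iterable[str]) -> Dict[int, str]:
    """Build a legacy-code -> Unicode-character lookup table.

    Assumed line format: "0x889F  0x4E9C  # optional comment".
    Blank lines and comment-only lines are skipped.
    """
    table: Dict[int, str] = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        fields = line.split()
        if len(fields) < 2:
            continue
        legacy_code = int(fields[0], 16)
        table[legacy_code] = chr(int(fields[1], 16))
    return table


def timed(func: Callable, *args) -> Tuple[object, float]:
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    sample = ["# Shift-JIS to Unicode (sample)", "0x889F\t0x4E9C", "0x82A0\t0x3042"]
    table, load_seconds = timed(load_table, sample)
    _, lookup_seconds = timed(table.get, 0x889F)
    print(f"load: {load_seconds:.6f}s, one lookup: {lookup_seconds:.6f}s")
```

Either cost is small next to the deliberate delay between page fetches, so the startup difference is more of a curiosity than a real problem.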


Finally got around to posting the new membership rates in the Nippon2007 community; they've been in effect for over a month (although I didn't have the new flyers until about a month ago -- they were a little late getting those out).

In other news, I note that there's an active community for the WorldCon in LA next year; I'm feeling completely apathetic about the community and am utterly uninterested in attending anything in LA, so I won't be bothering to subscribe. I'd probably go if I had an infinite amount of money to throw around, but I don't -- I don't even have enough to attend even if it was someplace remotely interesting.
  • Current Music
    O-Zone - Dragostea Din Tei