Douglas Triggs (doubt72) wrote,

Dictionaries, Dictionaries...

So, I've been working on finishing up my Japanese language drill tools. The sticking point had been the vocabulary bit, so I've been working on an editor, along with word lists to put in it and such. Vocabulary's a much tougher nut to crack than kanji, because the way the Japanese organize kanji is very... organized. And that organization is (almost) universally used when talking about kanji. Vocabulary, on the other hand... not so much.

Well, I've decided that one logical thing to do is to take the JLPT word lists I can find and use those as primary source material (I already found word frequency lists long ago that I've long since modified to my own nefarious ends, and there are also various Japanese books I might or might not use). Problem is, it's a web page. Well, that's not a big problem; I can grab the page and save it. But the only definitions are links to ANOTHER web page (one link per word), so that means grabbing all of those pages... For something like 5000 words total across all four JLPT levels.
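(For the curious, the link-harvesting step looks something like this. My actual scripts were TCL and then Perl, and I don't have the real markup of the list page in front of you here, so this is a Python sketch with made-up HTML and made-up `dict.cgi` URLs standing in for the real ones.)

```python
import re

# Made-up stand-in for the saved JLPT list page: one <a href> per word,
# each pointing at the online dictionary. The real page's markup and URLs
# differ; only the shape of the problem is the same.
SAMPLE_LIST_PAGE = """
<table>
<tr><td><a href="dict.cgi?word=au">&#20250;&#12358;</a></td></tr>
<tr><td><a href="dict.cgi?word=aoi">&#38738;&#12356;</a></td></tr>
</table>
"""

def extract_definition_links(html):
    """Return every href on the list page -- one dictionary link per word."""
    return re.findall(r'<a\s+href="([^"]+)"', html)

# Two words on the sample page means two dictionary pages to go fetch;
# on the real list it's on the order of 5000.
links = extract_definition_links(SAMPLE_LIST_PAGE)
```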

Still, not a big problem. I can fetch the pages in a script and parse them (it'll take a long time, sure, especially since I put in a delay between fetches -- grabbing them as fast as they'll come is kind of anti-social -- but I'm patient). Except I do everything in Unicode, and the original list of words is in shift-JIS. And the online dictionary it links to is in EUC. Okay, not a big problem, I have an EUC to Unicode conversion table already, and I found a shift-JIS to Unicode table in fairly short order, too.
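(The fetch-and-decode loop is simple enough to sketch. Again, this is Python rather than the TCL/Perl I actually used, the two-second delay is an arbitrary number for the sketch rather than my real setting, and Python's built-in shift_jis and euc_jp codecs stand in for the hand-built conversion tables.)

```python
import time
import urllib.request

def fetch_page(url, encoding, delay=2.0):
    """Fetch one page, decode it to Unicode, then wait a bit.

    Grabbing pages as fast as the server will serve them is anti-social,
    so every fetch is followed by a pause. The delay value here is just a
    placeholder for this sketch.
    """
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()
    time.sleep(delay)
    return raw.decode(encoding)

# The list page would be decoded with encoding="shift_jis" and the
# dictionary pages with encoding="euc_jp". Same text, two byte encodings:
# b"\x93\xfa\x96\x7b" (shift-JIS) and b"\xc6\xfc\xcb\xdc" (EUC-JP) both
# decode to the same two kanji.
```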

Except I was writing the program in TCL (I already had a bunch of code for this lying around from writing the drill tools, which are written in TCL/TK -- TCL being the most natural way to talk to TK if you ask me; Perl, etc., are more awkward), and there was some sort of weird bug where it wasn't handling the binary data correctly, and after trying to get at it three different ways, it was clear I wasn't going to be able to work around it. So I re-wrote the whole thing in Perl in fairly short order.

And now it more or less works, except the data is messy as all hell and will take some cleanup, but that's okay. That's why I wrote the dictionary editor in the first place.

The funny thing, though, is that the TCL array handling code was way faster than the Perl code on startup (it's loading all sorts of crap to build the dictionary, on the order of 10k records each in the EUC and shift-JIS tables, and almost 50k in the frequency table), which doesn't make any sense to me at all. My only guess is that maybe Perl was doing extra work on initialization to optimize lookups, but that's wasted energy, because even a slow lookup for the handful of characters it needs for every word doesn't matter when all the runtime is tied up in fetching HTML and thumb-twiddling.
