a post on a retired blog, Digital Bridge

Wordsmith

Posted Mar 10, 2006 in Digital Bridge, Mac, Programming, Python

In my linguistics classes, we’re working with text processing and analysis software (like WordCruncher) that’s Windows-only, and it’s a shame. How hard would it be to write a text analysis engine in Perl or Python and a frontend in PyObjC? It can’t be that hard… Perhaps I’ll use that as my learning program for Python: start small and build up. So the next question is what text analysis software ought to do. There have got to be a ton of different ways to look at a text computationally: wordprint analyses, statistics of various types, etc. The engine would also have to support tagging the text, so you could say “This word is a verb, 3rd person singular present active indicative” or “This is a conjunction” or whatever. I really only have experience with WordCruncher, but in my research class we looked at some Oxford tools a month or two ago which seemed to be the same sort of thing.
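To make the tagging idea concrete, here is a rough Python sketch of one way it might look. Everything in it is an assumption for illustration: the `Token` class, the `find_tagged` helper, and the tag names (`pos`, `person`, and so on) are made up here and don’t come from WordCruncher or any other existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One word of a text plus free-form grammatical tags (hypothetical design)."""
    form: str
    tags: dict = field(default_factory=dict)


def find_tagged(tokens, **wanted):
    """Return the tokens whose tags contain every requested key/value pair."""
    return [
        t for t in tokens
        if all(t.tags.get(key) == value for key, value in wanted.items())
    ]


# A tiny hand-tagged sentence; the tag vocabulary is invented for this example.
sentence = [
    Token("she", {"pos": "pronoun"}),
    Token("writes", {
        "pos": "verb", "person": "3", "number": "singular",
        "tense": "present", "voice": "active", "mood": "indicative",
    }),
    Token("and", {"pos": "conjunction"}),
    Token("reads", {
        "pos": "verb", "person": "3", "number": "singular",
        "tense": "present", "voice": "active", "mood": "indicative",
    }),
]

verbs = find_tagged(sentence, pos="verb", number="singular")
print([t.form for t in verbs])
```

Keeping the tags in a plain dictionary means a conjunction can carry one tag and a verb can carry six without the engine needing to know the grammar of any particular language ahead of time.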

But in all reality, to make this project worth my time (and to keep my interest), it has to be something I care about. Piling on statistics without end won’t cut it. So, what does it need to do to be useful? For me, it’d be nice to make a list of the top x (50 or 100 or 1000 or whatnot) most frequent words in a text. Foreign language texts are important to me as well, and I can see this tool (let’s call it Wordsmith for now) being of some use in preparing texts for Riverglen Press “publication.” In fact, that’s a good way to ensure that it will be useful, to me at least: focus it on helping with production for Riverglen Press. Excellent…
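The “top x words” list is probably the easiest place to start. A minimal sketch in Python, assuming a “word” is just a run of letters (with optional internal apostrophes) and that case doesn’t matter; the `top_words` name and those tokenizing rules are placeholders, not anything Wordsmith has settled on:

```python
import re
import unicodedata
from collections import Counter

# Letters only (any script), optionally joined by internal apostrophes,
# so "don't" stays one word while digits and punctuation are skipped.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def top_words(text, n=50):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Normalize to NFC so an accented letter typed as base + combining mark
    # is counted the same as its precomposed form.
    normalized = unicodedata.normalize("NFC", text).lower()
    return Counter(WORD_PATTERN.findall(normalized)).most_common(n)


print(top_words("The cat saw the dog. The dog ran.", 2))
```

Because the pattern is defined in terms of Unicode letters rather than `a-z`, the same function should cope with accented and non-Latin text, though languages written without spaces between words would need a real tokenizer instead.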

[tags]text processing, Mac, WordCruncher, Perl, Python, PyObjC[/tags]
