Thursday, August 21, 2008

Russian OCR Magic

For the past few months, give or take a year, I've been looking for a good OCR application to pull Russian text out of images. Finally, I can say that I've found one that can help me complete my translation project for this year. I have no doubt that the Tesseract project will be able to do this for me eventually, but at the moment it has bugs and glitches I can't work around. I was not happy with Tesseract's output, so I went on to search for bigger, better things, and voilà, I found an application written by Russian scientists. It gives great results compared to what I've found previously. It's called CuneiForm and it's all open source. You can download the zip and run it on Windows or in Wine. If you really want to get down and dirty, there's also the source. As of this writing, CuneiForm is at version V.12. It supports 20 languages: English, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Mixed Russian-English, Ukrainian, Danish, Swedish, Finnish, Serbian, Croatian, Polish and others. Enjoy.


Watch these videos to learn how to translate text in images:

How to install CuneiForm on Windows (if you see question marks instead of Russian text in CuneiForm, you will have to change your locale settings or just wing it):
VIDEO screencast.com/t/wrST2VB3
How to extract Russian text out of images into text files:
VIDEO screencast.com/t/WFJVpHvS
How to translate your newly extracted Russian text with free online tools (e.g. translation2.paralink.com, google.com/translate):
VIDEO screencast.com/t/bDIhI6XPq

I hope this helps someone, somehow. I'm thinking about using Python with Pyrex, or something else more practical, to automate this task and future ones. Thanks. -A
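A minimal sketch of what that automation might look like in plain Python, under some loud assumptions: it assumes a command-line `cuneiform` executable is on the PATH (as in the Linux port; the Windows zip from the videos is driven through its GUI), and that it accepts `-l` for the language, `-f` for the output format and `-o` for the output file. The function names `build_command` and `ocr_folder` and the list of image suffixes are made up for illustration. By default it only builds the commands without running anything, so you can check them first.

```python
import subprocess
from pathlib import Path

# Image types to pick up; this list is an assumption, adjust to your scans.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def build_command(image, out_dir, language="rus", exe="cuneiform"):
    """Build the argument list for one OCR run (hypothetical helper).

    Assumes a CLI of the form:
        cuneiform -l <language> -f text -o <output file> <image file>
    """
    image = Path(image)
    output = Path(out_dir) / (image.stem + ".txt")
    return [exe, "-l", language, "-f", "text", "-o", str(output), str(image)]


def ocr_folder(image_dir, out_dir, language="rus", dry_run=True):
    """Build (and optionally run) one OCR command per image in image_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = []
    for image in sorted(Path(image_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not an image
        cmd = build_command(image, out_dir, language)
        commands.append(cmd)
        if not dry_run:
            # Raises CalledProcessError if the OCR program reports failure.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `ocr_folder("scans", "text")` returns the list of commands it would run; passing `dry_run=False` actually runs them and leaves one `.txt` file per image in the output folder, ready to paste into a translation tool.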
