Google OCR for Sanskrit

Ever since I learned that Google’s Cloud Vision API can recognize Sanskrit text in the Dēvanāgarī script, I have been excited to see what is possible. Earlier this month Arun posted a comparison of Google OCR and Sanskrit OCR on Sanskrit text, and it looks like Google OCR isn’t bad. I decided to run a few tests of my own.

To access the API, I first had to sign up for Google Cloud and set up authentication. Then I simply followed these instructions for accessing the API, using Python (more or less this script). The script takes an input file and an output directory, both of which must be on Google Cloud Storage.

I chose a PDF containing about 20 pages of Śālikanātha’s Prakaraṇapañcikā, uploaded it to Google Cloud Storage, and pointed the script at it. Magically, it populated the target folder (also on Google Cloud Storage) with a set of JSON files containing the OCR data. I was only interested in the text, so I extracted it with the following Python script:

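import json
import os
import sys

# Path to one OCR result file (JSON), passed as the first command-line argument.
path = sys.argv[1]

# Read as UTF-8 explicitly so the Dēvanāgarī text survives on any platform.
with open(path, encoding='utf-8') as f:
    data = json.load(f)

# A result file can hold more than one page response; blank pages have no
# fullTextAnnotation, so fall back to an empty string for those.
pages = [r.get('fullTextAnnotation', {}).get('text', '') for r in data['responses']]

# Write the text next to the input file, swapping the extension for .txt.
with open(os.path.splitext(path)[0] + '.txt', 'w', encoding='utf-8') as output:
    output.write('\n'.join(pages))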

Running it on each JSON file put the OCR output in .txt files, which I concatenated to give the “raw” text data for those pages of the PDF. The results will need a lot of cleaning up, but I think it’s a huge leap forward.
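The concatenation step can be sketched roughly as follows. This is not the exact procedure I used, just one way to do it, and it assumes the .txt files keep the page numbers from the JSON file names (something like output-1-to-2.txt, output-3-to-4.txt). The function names page_key and concatenate are made up for this example. The point of the numeric sort is that a plain alphabetical sort would put page 10 before page 2.

```python
import re
from pathlib import Path


def page_key(path):
    """Sort key: the first integer in the file name (assumed to be the starting page)."""
    match = re.search(r'\d+', path.stem)
    return int(match.group()) if match else 0


def concatenate(txt_dir, combined_path):
    """Join every .txt file in txt_dir, in page order, into combined_path."""
    combined_path = Path(combined_path)
    parts = sorted(
        # Skip the combined file itself in case it lives in the same folder.
        (p for p in Path(txt_dir).glob('*.txt')
         if p.resolve() != combined_path.resolve()),
        key=page_key,
    )
    text = '\n'.join(p.read_text(encoding='utf-8') for p in parts)
    combined_path.write_text(text, encoding='utf-8')
    # Return the file names in the order they were joined, for a quick sanity check.
    return [p.name for p in parts]
```

Calling concatenate('ocr_txt', 'raw.txt') (both names hypothetical) would write the combined text to raw.txt and return the list of page files in the order they were joined.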