an OCR Engine in JavaScript

Tesseract.js is a JavaScript library that extracts text in almost any language from images. Below is an example tested with Node.js, reading the image from the local file system.

Note that you will need to download the trained language data for the language you want to recognize. Unzip the file and put it somewhere in your solution. The call to Tesseract will then look something like the example below. You will probably want to save a reference to the object returned by create, because loading the language file takes some time.


var fs = require('fs');
var Tesseract = require('tesseract.js');
var path = require('path');

// Asynchronous read
fs.readFile('text_file.txt', function (err, data) {
  if (err) {
    return console.error(err);
  }
  console.log('Asynchronous read: ' + data.toString());
});

var imagePath = path.join(__dirname, 'image_text.png');

console.log('Here is the path to the file: ' + imagePath);

Tesseract.create({ langPath: 'eng.traineddata' }).recognize(imagePath, 'eng')
  .then(function (result) {
    console.log(result.text);
  });
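
Since loading the language data is slow, it helps to build the object returned by create only once and reuse it for every image. Here is a minimal sketch of one way to do that. The helper name lazyWorker is my own invention and not part of Tesseract.js; it wraps any factory function and caches whatever that function returns.

```javascript
// Hypothetical helper (not part of Tesseract.js): calls the factory the
// first time a worker is requested and returns the same object afterwards.
function lazyWorker(createWorker) {
  var worker = null;
  return function getWorker() {
    if (worker === null) {
      worker = createWorker();
    }
    return worker;
  };
}

// Example wiring, assuming tesseract.js is installed and exposes create()
// as in the listing above:
//
// var Tesseract = require('tesseract.js');
// var getWorker = lazyWorker(function () {
//   return Tesseract.create({ langPath: 'eng.traineddata' });
// });
// getWorker().recognize(imagePath, 'eng').then(function (result) {
//   console.log(result.text);
// });
```

Every call to getWorker() after the first returns the cached object, so the language file should only need to be loaded once per process.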

I tested this with one example image. The recognition isn’t perfect, though…