Image Text Searcher Using Python & Tesseract
This project is a tool built with Python & Tesseract OCR that identifies the words in a given image and counts the occurrences of a given target word.
The main aim, extracting text data from a given image, has been achieved.
This project:
1. reads an image with text data
2. identifies text using Tesseract OCR
3. searches for the target word
4. prints the number of occurrences
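The four steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the pytesseract wrapper and Pillow are installed, and the function names (read_image_text, count_occurrences, search_image) are hypothetical. Matching is assumed to be case-insensitive and on whole words only, which the project itself may or may not do.

```python
import re


def read_image_text(image_path):
    """Steps 1 and 2: open the image and let Tesseract recognize its text."""
    # Imported lazily so the counting helper below works without OCR installed.
    # Assumption: the pytesseract wrapper and Pillow are the libraries in use.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def count_occurrences(text, target):
    """Steps 3 and 4: count whole-word, case-insensitive matches of target."""
    pattern = r"\b" + re.escape(target) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def search_image(image_path, target):
    """Run the whole pipeline on one image and return the count."""
    return count_occurrences(read_image_text(image_path), target)
```

Calling print(search_image("page.png", "invoice")) would then print the count for a hypothetical image file named page.png. Keeping the counting step separate from the OCR step makes it easy to check the matching rules on plain strings.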
An additional use-case of this project is a CSV Convertor that reads an image of a table and outputs the data as a Comma Separated Value (CSV) file.
This project has been attempted in Python using the Tesseract OCR module.
FUTURE SCOPE (shortcomings still to be overcome for the CSV Convertor):
1. The extracted data is a single long string, so a way is needed to isolate the columns and the content within each.
2. The most basic attempt was to split the string into a list of detected words, but the list has the same shortcoming: it does not preserve the column boundaries.
3. Further steps may include:
   1. Using OpenCV to create a bounding box around the columns, and subsequently the cells, so as to identify the data in each cell as accurately as possible.
   2. Using Artificial Neural Networks to train a model to identify cells in an image, read them in a LEFT-RIGHT, TOP-BOTTOM order, and create a list accordingly.
4. The second approach, i.e., 3.2, is the most promising but not the only one.
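As a rough illustration of the LEFT-RIGHT, TOP-BOTTOM ordering mentioned above, the sketch below groups word boxes into rows by vertical position and writes them out as CSV. It is neither of the two approaches listed (no OpenCV, no neural network); it is a simpler stand-in that assumes word positions come from pytesseract's image_to_data. The helper names and the row_tolerance value are hypothetical, and every word is treated as its own cell, so multi-word cells would still be split.

```python
import csv
import io


def ocr_word_boxes(image_path):
    """Return recognized words with their pixel positions."""
    # Assumption: pytesseract and Pillow are installed; imported lazily so the
    # grouping helpers below can be tried on hand-made boxes without OCR.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return [
        {"text": text, "left": left, "top": top}
        for text, left, top in zip(data["text"], data["left"], data["top"])
        if text.strip()  # skip the empty entries Tesseract emits for layout levels
    ]


def boxes_to_rows(words, row_tolerance=10):
    """Group boxes into rows top to bottom, each row sorted left to right."""
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        # A word joins the current row if its top edge is close to the row's first word.
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w["text"] for w in sorted(row, key=lambda w: w["left"])] for row in rows]


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

The row_tolerance of 10 pixels is an arbitrary starting value; it would need tuning to the image resolution and line spacing.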