Text Extraction From Image And PDF Using Python and Tesseract
Text extraction from both images and PDFs in Python using Tesseract OCR (handled by pytesseract)
Description:
Extracts text from both images (.png and .jpg) and PDFs using the Tesseract OCR engine. This project can be modified further to extract particular text from an image or PDF.
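As a rough illustration of the "particular text" idea, the sketch below filters OCR output with a regular expression. It assumes the OCR step has already produced a plain string; the function name `find_particular_text`, the sample text, and the date pattern are all hypothetical and not part of this project.

```python
import re


def find_particular_text(ocr_text, pattern):
    """Return every non-overlapping match of `pattern` in the OCR output."""
    return [match.group(0) for match in re.finditer(pattern, ocr_text)]


# Hypothetical OCR output and pattern: pull dates written as DD/MM/YYYY.
sample = "Invoice 1042\nDate: 12/03/2021\nDue: 11/04/2021"
dates = find_particular_text(sample, r"\b\d{2}/\d{2}/\d{4}\b")
print(dates)
```

OCR output often contains stray characters, so a real pattern may need to be more forgiving than this one.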
Requirements:
Tesseract OCR:
Installation details for Tesseract: https://github.com/tesseract-ocr/tesseract/wiki#windows
(the path to the Tesseract installation folder must be added to the PATH environment variable)
Pytesseract:
pip install pytesseract
PyMuPDF:
pip install PyMuPDF
Pillow:
pip install Pillow
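Before running the extractor, it may help to confirm that the requirements above can actually be found. The following is a minimal sketch, not part of the project: `check_requirements` is a hypothetical helper, and it assumes the Tesseract executable is named `tesseract` and that PyMuPDF and Pillow are importable as `fitz` and `PIL`.

```python
import importlib.util
import shutil


def check_requirements():
    """Report which of the listed requirements can be found on this machine."""
    return {
        # The Tesseract executable must be reachable through PATH.
        "tesseract": shutil.which("tesseract") is not None,
        # Import names differ from the pip package names for two of these.
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        "PyMuPDF": importlib.util.find_spec("fitz") is not None,
        "Pillow": importlib.util.find_spec("PIL") is not None,
    }


for name, found in check_requirements().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If `tesseract` is reported missing even though it is installed, the installation folder is probably not on PATH yet.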
Usage:
Go to the project folder and open a command prompt (terminal). At the prompt, type:
python text_extractor.py --file path_to_file
For example: python text_extractor.py --file test.pdf
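The source of text_extractor.py is not shown here, so the following is only a sketch of how a script with this command line could be put together. The names `parse_args`, `detect_kind`, `extract_text`, and `main` are assumptions, as is the choice to render each PDF page to an image at 2x zoom before running OCR on it.

```python
import argparse
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg"}  # image types named in the description
PDF_SUFFIXES = {".pdf"}


def parse_args(argv=None):
    """Parse the --file option shown in the usage line."""
    parser = argparse.ArgumentParser(description="Extract text from an image or PDF.")
    parser.add_argument("--file", required=True, help="path to a .png, .jpg, or .pdf file")
    return parser.parse_args(argv)


def detect_kind(path):
    """Return "image" or "pdf" based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image"
    if suffix in PDF_SUFFIXES:
        return "pdf"
    raise ValueError(f"unsupported file type: {suffix or path}")


def extract_text(path):
    """OCR an image directly, or render each PDF page and OCR the result."""
    # Imported lazily so argument parsing works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    if detect_kind(path) == "image":
        with Image.open(path) as image:
            return pytesseract.image_to_string(image)

    import fitz  # PyMuPDF

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            # Assumed setting: 2x zoom so small print is easier to recognize.
            pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            pages.append(pytesseract.image_to_string(image))
    return "\n".join(pages)


def main(argv=None):
    """Print the text extracted from the file given with --file."""
    args = parse_args(argv)
    print(extract_text(args.file))
```

Calling `main()` under an `if __name__ == "__main__":` guard would connect this sketch to the command shown above. For PDFs that already contain a text layer, PyMuPDF's `page.get_text()` could return the text without OCR, but that is a different approach from the one sketched here.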