7 Best Python OCR Libraries for Image-to-Text Conversion

Optical Character Recognition (OCR) is a technology that extracts machine-readable text from images, scanned documents, and even handwritten notes. Python's OCR tools have evolved significantly over the years, and today's libraries range from classic OCR engines to deep learning models and cloud services.

This article will cover the top seven OCR libraries in Python, highlighting their strengths, unique features, and code examples to help you get started.

1. Tesseract OCR (pytesseract)

Tesseract is one of the most popular and widely used OCR engines, and the pytesseract package makes it available from Python. Originally developed at HP and later sponsored by Google, Tesseract is now maintained as an open-source project and provides high-quality OCR for over 100 languages.

Key Features:

Open-source and free to use.
Supports multiple languages, including non-Latin alphabets.
Recognizes text in images and scanned documents (PDF pages need to be converted to images first).
Can be customized with custom training data for specialized use cases.
Works well with pre-processing tools like OpenCV to improve accuracy.

To install Tesseract OCR on Linux, follow these steps depending on your distribution:

sudo apt install tesseract-ocr     [On Debian, Ubuntu and Mint]
sudo yum install tesseract         [On RHEL/CentOS/Fedora and Rocky/AlmaLinux]
sudo emerge -a app-text/tesseract  [On Gentoo Linux]
sudo apk add tesseract-ocr         [On Alpine Linux]
sudo pacman -S tesseract           [On Arch Linux]
sudo zypper install tesseract-ocr  [On OpenSUSE]
sudo pkg install tesseract         [On FreeBSD]

Once Tesseract is installed, install the pytesseract package with pip to use the engine from Python.

pip3 install pytesseract
OR
pip install pytesseract

Here’s an example that uses Tesseract OCR through the pytesseract library to extract text from an image.

import pytesseract
from PIL import Image

# Load an image
img = Image.open("image_sample.png")

# Use Tesseract to extract text
text = pytesseract.image_to_string(img)

# Print the extracted text
print(text)
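
The key features above mention pairing Tesseract with OpenCV to improve accuracy. A minimal sketch of that idea follows, assuming the opencv-python package is installed alongside pytesseract. The function names, the Otsu thresholding step, and the default page segmentation mode are illustrative choices, not a recommended pipeline for every image.

```python
def build_tesseract_config(psm=6, oem=3):
    """Build a Tesseract config string (engine mode and page segmentation mode)."""
    return f"--oem {oem} --psm {psm}"


def ocr_with_preprocessing(image_path, psm=6):
    """Convert an image to grayscale, binarize it with Otsu's method, then run Tesseract."""
    # Imported here so the config helper above works without the OCR stack installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # With THRESH_OTSU the threshold is chosen automatically; the 0 passed here is ignored.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # pytesseract accepts NumPy arrays as well as PIL images.
    return pytesseract.image_to_string(binary, config=build_tesseract_config(psm))
```

Calling ocr_with_preprocessing("image_sample.png") returns the recognized text as a string. Binarization tends to help on clean scans with uneven lighting, but it can hurt on photos with heavy shadows, so it is worth comparing the output against the unprocessed image.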

2. EasyOCR

EasyOCR is another Python OCR library that supports more than 80 languages and is easy for beginners to use. It is built on deep learning models running on PyTorch, making it a good choice for those who want to leverage modern OCR technology.

Key Features:

High accuracy with deep learning models.
Supports a wide range of languages.
Can detect vertical text and text in images that mix several languages.
Simple and easy-to-understand API.

To install EasyOCR on Linux, use pip (the command is the same on every distribution).

pip3 install easyocr
OR
pip install easyocr

Once the installation is complete, you can use EasyOCR to extract text from an image.

import easyocr

# Initialize the OCR reader
reader = easyocr.Reader(['en'])

# Extract text from an image
result = reader.readtext('image_sample.png')

# Print the extracted text
for detection in result:
    print(detection[1])
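
Each detection returned by readtext is a (bounding box, text, confidence) tuple, which is why the loop above prints detection[1]. The sketch below uses the confidence value to drop weak detections. The helper names and the 0.5 threshold are assumptions for illustration and should be tuned for your images.

```python
def filter_detections(detections, min_confidence=0.5):
    """Return the text of detections whose confidence meets the threshold."""
    return [
        text
        for _bbox, text, confidence in detections
        if confidence >= min_confidence
    ]


def read_confident_text(image_path, languages=("en",), min_confidence=0.5):
    """Run EasyOCR on an image and keep only the confident detections."""
    # Imported here so filter_detections can be used without EasyOCR installed.
    import easyocr

    reader = easyocr.Reader(list(languages))
    return filter_detections(reader.readtext(image_path), min_confidence)
```

For example, read_confident_text("image_sample.png", min_confidence=0.6) returns a list of strings rather than the full detection tuples.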

3. OCRopus

OCRopus is an open-source OCR system originally developed by Thomas Breuel at DFKI with sponsorship from Google. While it is primarily used for historical documents and books, OCRopus can also be applied to a wide variety of text extraction tasks.

Key Features:

Specializes in document layout analysis and text extraction.
Built with modularity in mind, enabling easy customization.
Can work with multi-page documents and large datasets.

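OCRopus is driven through separate command-line scripts rather than a single Python call. Here’s an example that runs them from Python, assuming the scripts are on your PATH and a recognition model is available (the model file name below is a placeholder). The recognized text is written to a .txt file next to each line image.

import glob
import subprocess

# Binarize the page, segment it into lines, then recognize each line
subprocess.run(['ocropus-nlbin', 'image_sample.png', '-o', 'book'], check=True)
subprocess.run(['ocropus-gpageseg'] + sorted(glob.glob('book/????.bin.png')), check=True)
subprocess.run(['ocropus-rpred', '-m', 'en-default.pyrnn.gz'] + sorted(glob.glob('book/????/??????.bin.png')), check=True)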

4. PyOCR

PyOCR is a Python wrapper around several OCR engines, including Tesseract and CuneiForm. It provides a simple interface for integrating OCR functionality into Python applications.

Key Features:

Can interface with multiple OCR engines.
Provides a simple API for text extraction.
Can be combined with image preprocessing libraries for improved results.

PyOCR requires Tesseract (OCR engine) and Pillow (image processing library). You can install them using the following commands:

sudo apt install tesseract-ocr     [On Debian, Ubuntu and Mint]
sudo yum install tesseract         [On RHEL/CentOS/Fedora and Rocky/AlmaLinux]
sudo emerge -a app-text/tesseract  [On Gentoo Linux]
sudo apk add tesseract-ocr         [On Alpine Linux]
sudo pacman -S tesseract           [On Arch Linux]
sudo zypper install tesseract-ocr  [On OpenSUSE]
sudo pkg install tesseract         [On FreeBSD]

Now, you can install the pyocr and pillow libraries using pip:

pip3 install pyocr pillow
OR
pip install pyocr pillow

Here’s a Python example that extracts text from an image using PyOCR and Tesseract:

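import pyocr
from PIL import Image

# List the OCR engines PyOCR can find (Tesseract or CuneiForm)
tools = pyocr.get_available_tools()
if not tools:
    raise RuntimeError("No OCR engine found; install Tesseract first")

# Use the first available engine
tool = tools[0]

# Load the image
img = Image.open('image_sample.png')

# Extract text from the image
text = tool.image_to_string(img)

# Print the extracted text
print(text)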
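
Because PyOCR can interface with several engines, taking the first available tool may not give you the engine you expect. The sketch below picks a tool by name instead. The pick_tool and extract_text helpers are hypothetical names introduced here, and the matching relies on the get_name() method that PyOCR tools expose.

```python
def pick_tool(tools, preferred="Tesseract"):
    """Return the first tool whose name contains `preferred`, else the first tool."""
    if not tools:
        raise RuntimeError("No OCR engine found; install Tesseract or CuneiForm first.")
    for tool in tools:
        if preferred.lower() in tool.get_name().lower():
            return tool
    return tools[0]


def extract_text(image_path, preferred="Tesseract"):
    """Extract text from an image with the preferred PyOCR engine."""
    # Imported here so pick_tool can be used without PyOCR installed.
    import pyocr
    from PIL import Image

    tool = pick_tool(pyocr.get_available_tools(), preferred)
    return tool.image_to_string(Image.open(image_path))
```

Falling back to the first tool keeps the script usable on machines where only one engine is installed.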

5. PaddleOCR

PaddleOCR is an OCR toolkit developed by Baidu on top of PaddlePaddle, its deep learning framework. It supports more than 80 languages and offers strong accuracy thanks to its deep learning models.

Key Features:

High performance, especially for images with complex backgrounds.
Supports text detection, recognition, and layout analysis.
Includes pre-trained models for a variety of languages.

To install PaddleOCR in Linux, use:

pip3 install paddlepaddle paddleocr
OR
pip install paddlepaddle paddleocr

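Here’s a Python example that extracts text from an image using the paddleocr library. It follows the PaddleOCR 2.x interface; newer major releases changed some of these arguments, so check the documentation for your installed version.

from paddleocr import PaddleOCR

# Initialize the OCR
ocr = PaddleOCR(use_angle_cls=True, lang='en')

# Perform OCR on an image
result = ocr.ocr('image_sample.png', cls=True)

# Each entry is [bounding_box, (text, confidence)]; print only the text
for line in result[0] or []:
    print(line[1][0])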
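
In the PaddleOCR 2.x result format, the output is a list with one entry per image, and each entry holds [bounding box, (text, confidence)] pairs, or None when nothing is detected. The helper below is a small sketch, under that assumption, for pulling out just the text. The name paddle_text_lines and the confidence threshold are illustrative.

```python
def paddle_text_lines(result, min_confidence=0.0):
    """Flatten a PaddleOCR 2.x style result into a list of recognized strings."""
    lines = []
    for page in result or []:
        # A page can be None when no text was detected in that image.
        for _box, (text, confidence) in page or []:
            if confidence >= min_confidence:
                lines.append(text)
    return lines
```

With the example above, paddle_text_lines(result, min_confidence=0.8) would keep only the lines the model is fairly sure about.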

6. Kraken

Kraken is a high-performance OCR library specifically designed for historical and multilingual text. It began as a fork of OCRopus and adds features for complex layouts and text extraction.

Key Features:

Best suited for old books and multilingual OCR.
Handles complex text layouts and historical fonts.
Uses machine learning for better recognition accuracy.

To install Kraken in Linux, use:

pip3 install kraken
OR
pip install kraken

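Here’s a Python example that extracts text from an image using the kraken library. It assumes a recognition model has already been downloaded; the file name model.mlmodel is a placeholder, and module names can differ between kraken releases.

from PIL import Image
from kraken import binarization, pageseg, rpred
from kraken.lib import models

# Binarize the image and split it into text lines
img = binarization.nlbin(Image.open("image_sample.png"))
segmentation = pageseg.segment(img)

# Load a recognition model (placeholder path) and print each recognized line
model = models.load_any("model.mlmodel")
for record in rpred.rpred(model, img, segmentation):
    print(record.prediction)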

7. Textract (AWS)

Amazon Textract is AWS’s cloud-based OCR service that can analyze documents and forms and extract text with high accuracy. It integrates with other AWS services and is billed per page processed.

Key Features:

Cloud-based OCR with scalable solutions.
Supports document structure analysis, including tables and forms.
Integration with AWS services for further data processing.

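Textract is a hosted service, so there is nothing to install locally beyond boto3, the AWS SDK for Python. You also need AWS credentials and a default region configured: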

pip3 install boto3
OR
pip install boto3

Here is an example Python script that uses AWS Textract to extract text from a document (for example, an image file or a single-page scanned PDF).

import boto3

# Initialize a Textract client
client = boto3.client('textract')

# Path to the image or PDF file you want to analyze
file_path = "path_to_your_file.png"  # Replace with your file path

# Open the file in binary mode
with open(file_path, 'rb') as document:
    # Call Textract to analyze the document
    response = client.detect_document_text(Document={'Bytes': document.read()})

# Print the extracted text
for item in response['Blocks']:
    if item['BlockType'] == 'LINE':
        print(item['Text'])
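
The response contains several block types (PAGE, LINE, WORD), which is why the loop above filters on LINE. A small sketch of wrapping that filter in reusable helpers follows. The names textract_lines and detect_lines are hypothetical, and the threshold uses Textract's 0 to 100 confidence scale.

```python
def textract_lines(response, min_confidence=0.0):
    """Return the text of LINE blocks in a Textract response, in document order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def detect_lines(file_path, min_confidence=0.0):
    """Send a local file to Textract and return its recognized lines."""
    # Imported here so textract_lines can be used on saved responses without boto3.
    import boto3

    client = boto3.client("textract")
    with open(file_path, "rb") as document:
        response = client.detect_document_text(Document={"Bytes": document.read()})
    return textract_lines(response, min_confidence)
```

Keeping the parsing separate from the API call makes it easy to run the filter on stored responses without paying for another request.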

Conclusion

Choosing the right OCR library in Python depends on the specific use case, the language requirements, and the complexity of the documents you’re processing. Whether you’re working on historical documents, multilingual texts, or simple scanned PDFs, these libraries provide powerful tools for text extraction.

For beginners, Tesseract and EasyOCR are excellent starting points due to their ease of use and wide adoption. For more advanced or specialized tasks, PaddleOCR, OCRopus, and Kraken offer greater flexibility, while AWS Textract suits teams that prefer a managed cloud service.
