Top 7 Open Source OCR Models for Document Processing

Mohammad Abu Mattar 09 Jan, 2026 03 Mins read

Turn your documents into clean digital copies with these powerful open source OCR models. No more dealing with messy text extraction: get clean, accurate markdown from PDFs, images, and scanned documents.

The OCR Revolution

Beyond Basic Text Extraction

Complete document understanding

These modern OCR models do way more than just read text. They understand document structure, tables, diagrams, math equations, and multiple languages, turning everything into well-formatted markdown.

Local Processing Benefits

Privacy and control

Run these models right on your own machine without sending sensitive documents to cloud services. You keep complete control over your data while getting accuracy that rivals enterprise solutions.

olmOCR-2-7B-1025

High-Performance Document OCR

Allen Institute’s flagship model

The model is fine-tuned from Qwen2.5-VL-7B-Instruct using GRPO reinforcement learning and scores 82.4 on the olmOCR-bench evaluation.

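import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the model. It is a Qwen2.5-VL fine-tune, so it needs the vision-language
# model class and a processor, not a text-only tokenizer and causal LM.
model_id = "allenai/olmOCR-2-7B-1025"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
# The processor is loaded from the base model the checkpoint was fine-tuned from
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Process document: one user turn holding the page image and the instruction
image = Image.open("path/to/page.png").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this document to markdown"},
        ],
    }
]

# Render the prompt, generate, and decode only the newly generated tokens
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
result = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]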

Key Features

Mathematical equations: Handles complex math expressions
Table recognition: Keeps table structure and formatting intact
Layout understanding: Manages multi-column documents and complex layouts
Large-scale processing: Built for processing millions of documents
Automated retries: Includes error handling and automatic rotation correction
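The retry and rotation-correction idea from the list above can be sketched generically. This is an illustration only, not olmOCR's actual implementation: `run_ocr` and `looks_valid` are hypothetical callables you would supply, and the image only needs a Pillow-style `rotate(angle, expand=True)` method.

```python
from typing import Any, Callable, Optional


def ocr_with_rotation_retry(
    image: Any,
    run_ocr: Callable[[Any], str],       # hypothetical: image -> extracted text
    looks_valid: Callable[[str], bool],  # hypothetical: sanity check on the text
    angles: tuple = (0, 90, 180, 270),
) -> Optional[str]:
    """Try OCR at each rotation and return the first output that passes the check."""
    for angle in angles:
        # Always rotate the original image so the angles do not accumulate
        candidate = image if angle == 0 else image.rotate(angle, expand=True)
        try:
            text = run_ocr(candidate)
        except Exception:
            continue  # treat a failed attempt like an invalid result and retry
        if looks_valid(text):
            return text
    return None  # every attempt failed or produced invalid text
```

A simple `looks_valid` could require a minimum number of alphabetic characters; stricter checks depend on the documents being processed.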

PaddleOCR VL

Efficient Multilingual OCR

Ultra-compact vision-language model

Integrates a NaViT-style visual encoder with an ERNIE language model, supporting 109 languages with minimal resource usage.

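from paddleocr import PaddleOCRVL

# Initialize the PaddleOCR-VL pipeline. This needs a recent paddleocr release
# that ships the PaddleOCRVL class; check the docs for your installed version.
pipeline = PaddleOCRVL()

# Process the document, then print and save each page's result as markdown
output = pipeline.predict("path/to/document.pdf")
for res in output:
    res.print()
    res.save_to_markdown(save_path="output")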

Key Features

109 languages: Including Chinese, English, Japanese, Arabic, Hindi, and Thai
Complex elements: Tables, formulas, charts recognition
High accuracy: State-of-the-art performance on OmniDocBench
Fast inference: Optimized for real-world deployment
Compact size: Efficient resource utilization

OCRFlux-3B

Multimodal Document Conversion

Preview release from ChatDOC

Fine-tuned from Qwen2.5-VL-3B-Instruct for clean markdown output from PDFs and images.

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Load the model and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "ChatDOC/OCRFlux-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("ChatDOC/OCRFlux-3B")

# Process image: one user turn holding the page image and the instruction
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Convert to markdown"},
        ],
    }
]

# Render the chat prompt and load the image referenced in the messages
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt"
).to(model.device)

# Generate, then decode only the new tokens so the prompt is not echoed back
generated_ids = model.generate(**inputs, max_new_tokens=4096)
new_tokens = generated_ids[:, inputs.input_ids.shape[1]:]
generated_text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

Key Features

Clean markdown: Structured output with proper formatting
Cross-page tables: Merges table data across multiple pages
Consumer hardware: Runs on RTX 3090 and similar GPUs
Scalable deployment: vLLM inference support
State-of-the-art accuracy: Superior parsing quality
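To make the cross-page table feature concrete, here is a minimal sketch of joining a markdown table that continues on the next page. It is not OCRFlux's algorithm, only an illustration of the idea; the helper names are made up, and it assumes the continuation fragment either repeats the first table's header or has no header at all.

```python
def split_row(line: str) -> list[str]:
    """Split a markdown table row like '| a | b |' into trimmed cells."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(line: str) -> bool:
    """True for the '|---|---|' line that follows a markdown table header."""
    return all(cell and set(cell) <= set("-:") for cell in split_row(line))


def merge_tables(first: str, second: str) -> str:
    """Append the rows of `second` to `first` when the column counts match.

    If `second` repeats the header of `first`, the repeated header and its
    separator line are dropped before the rows are appended.
    """
    first_lines = [l for l in first.strip().splitlines() if l.strip()]
    second_lines = [l for l in second.strip().splitlines() if l.strip()]
    if len(split_row(first_lines[0])) != len(split_row(second_lines[0])):
        raise ValueError("tables have different column counts")
    repeats_header = (
        len(second_lines) >= 2
        and is_separator(second_lines[1])
        and split_row(second_lines[0]) == split_row(first_lines[0])
    )
    if repeats_header:
        second_lines = second_lines[2:]
    return "\n".join(first_lines + second_lines)
```

Deciding whether two tables on adjacent pages belong together at all is the hard part, and that is where a model-based approach earns its keep; the string handling above only covers the final join.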

MiniCPM-V 4.5

Mobile-Optimized OCR

Latest in the MiniCPM-V series

Built on Qwen3-8B and SigLIP2-400M, it delivers strong performance for text recognition in images, documents, and videos.

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load model (the Hugging Face repo id uses an underscore: MiniCPM-V-4_5)
model_id = 'openbmb/MiniCPM-V-4_5'
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = model.eval().cuda()

# Process image
image = Image.open('path/to/image.jpg').convert('RGB')
question = 'Convert this document to text'

# The remote-code chat() helper accepts PIL images mixed with text in `content`
msgs = [{'role': 'user', 'content': [image, question]}]
result = model.chat(msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.7)
print(result)

Key Features

Mobile deployment: Optimized for edge devices
Multi-image processing: Handles multiple images simultaneously
Video OCR: Text recognition in video content
State-of-the-art benchmarks: Leading performance across evaluations
Practical efficiency: Everyday application ready

InternVL2.5-4B

Compact Multimodal Understanding

Efficient vision-language model

Combines InternViT vision encoder with Qwen2.5 language model for comprehensive OCR and understanding.

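import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL2_5-4B',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL2_5-4B',
    trust_remote_code=True
)

# Preprocess: chat() expects a pixel tensor, not a PIL image. This uses one
# 448x448 tile with ImageNet normalization; the model card's load_image helper
# splits large pages into several tiles for better detail.
transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open('path/to/image.jpg').convert('RGB')
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

# Process image: the <image> tag marks where the picture goes in the prompt
question = '<image>\nExtract all text from this image'
generation_config = dict(max_new_tokens=1024, do_sample=False)

response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)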

Key Features

Dynamic resolution: 448x448 pixel tile processing
Resource efficient: Suitable for constrained environments
Text recognition: Strong OCR performance
Reasoning tasks: Advanced multimodal understanding
Compact architecture: 4 billion parameters total
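Dynamic resolution with 448x448 tiles means a page is resized to a grid of fixed-size tiles whose shape follows the page's aspect ratio. The sketch below picks such a grid under a tile budget. It is a simplified illustration rather than InternVL's exact preprocessing; the function names and the `max_tiles=12` default are assumptions.

```python
def choose_tile_grid(
    width: int, height: int, max_tiles: int = 12, tile: int = 448
) -> tuple[int, int]:
    """Return (cols, rows) whose aspect ratio is closest to the image's.

    Only grids with cols * rows <= max_tiles are considered. Among grids with
    the same aspect ratio, the one whose pixel area is closest to the image's
    area wins, so small images are not blown up into many tiles.
    """
    target = width / height
    best, best_key = (1, 1), None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            ratio_error = abs(target - cols / rows)
            area_error = abs(cols * rows * tile * tile - width * height)
            key = (ratio_error, area_error)
            if best_key is None or key < best_key:
                best, best_key = (cols, rows), key
    return best


def tile_boxes(cols: int, rows: int, tile: int = 448) -> list[tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) for an image resized to the grid size."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]
```

After resizing the page to `cols * 448` by `rows * 448` pixels, each box can be passed to Pillow's `Image.crop` to produce one tile.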

Granite Vision 3.3 2b

Document Understanding Specialist

IBM’s vision-language model

Built on Granite 3.1-2b-instruct with SigLIP2 vision encoder for automated content extraction.

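from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model
model_id = "ibm-granite/granite-vision-3.3-2b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

# Process image: build the prompt with the chat template so the image
# placeholder and role tokens are inserted in the format the model expects
image = Image.open("path/to/document.png").convert("RGB")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the text from this document."},
        ],
    }
]
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(text=text_prompt, images=image, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens so the prompt is not echoed back
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]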

Key Features

Table extraction: Automated table content extraction
Chart recognition: Infographics and plot understanding
Multi-page support: Handles multi-page documents
Image segmentation: Advanced visual processing
Enhanced safety: Improved security features

TrOCR Large (SROIE Fine-tuned)

Specialized Text Recognition

Transformer-based OCR

Encoder-decoder architecture combining BEiT image transformer with RoBERTa text transformer.

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the printed-text checkpoint (TrOCR large, fine-tuned on SROIE)
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-printed')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-printed')

# Process image: the input should be a crop containing a single line of text
image = Image.open('path/to/image.jpg').convert('RGB')
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

Key Features

Single-line text: Optimized for printed text recognition
High accuracy: State-of-the-art performance on benchmarks
Transformer architecture: Modern deep learning approach
Pre-trained models: Leverages large-scale training data
Sequence processing: Handles text as sequential tokens
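Because TrOCR reads one text line per image, a full page has to be cut into line crops before recognition. A classic way to do that is a horizontal projection profile: find runs of rows that contain ink. The sketch below works on an already binarized page given as nested lists, so it has no dependencies; it is not part of TrOCR, and real pipelines usually rely on a dedicated text detector instead.

```python
def find_text_lines(
    binary_rows: list[list[int]], min_height: int = 1
) -> list[tuple[int, int]]:
    """Return (top, bottom) row ranges that contain ink, with bottom exclusive.

    `binary_rows` is a page as rows of 0 (background) and 1 (ink).
    Runs shorter than `min_height` rows are discarded as noise.
    """
    lines, start = [], None
    for y, row in enumerate(binary_rows):
        has_ink = any(row)
        if has_ink and start is None:
            start = y  # a text line begins on this row
        elif not has_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))  # the line ended on the previous row
            start = None
    # Close a line that runs to the bottom edge of the page
    if start is not None and len(binary_rows) - start >= min_height:
        lines.append((start, len(binary_rows)))
    return lines
```

Each `(top, bottom)` range can then be turned into a crop with Pillow, for example `image.crop((0, top, image.width, bottom))`, and passed to the processor one line at a time.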

Choosing the Right Model

Use Case Considerations

Match model to needs

High accuracy: olmOCR-2-7B-1025, OCRFlux-3B
Efficiency: PaddleOCR VL, InternVL2.5-4B
Multilingual: PaddleOCR VL, MiniCPM-V 4.5
Mobile/Edge: MiniCPM-V 4.5, InternVL2.5-4B
Specialized: TrOCR (printed text), Granite Vision (documents)

Performance Optimization

Hardware considerations

GPU memory: Match model size to available VRAM
Batch processing: Use models supporting multiple images
Quantization: Consider quantized versions for efficiency
Local deployment: All models support local inference
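Matching model size to VRAM is mostly arithmetic: parameter count times bytes per parameter gives the memory needed for the weights. The helper below is a back-of-the-envelope sketch; the function names and the 20% overhead factor are assumptions, and real usage also depends on image resolution and output length.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_in_vram(
    num_params: float, vram_gb: float, dtype: str = "bf16", overhead: float = 1.2
) -> bool:
    """Rough check; `overhead` is an assumed margin for activations and KV cache."""
    return weight_memory_gb(num_params, dtype) * overhead <= vram_gb


# A 7B-parameter model such as olmOCR-2-7B-1025:
print(weight_memory_gb(7e9, "bf16"))  # 14.0 GB of weights in bfloat16
print(weight_memory_gb(7e9, "int4"))  # 3.5 GB of weights when quantized to 4 bits
```

By this estimate a 7B model in bfloat16 fits a 24 GB card but not a 16 GB one, which is where the quantized variants become useful.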

Implementation Best Practices

Preprocessing

Optimize input quality

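from PIL import Image

# Image preprocessing
def preprocess_image(image_path, max_side=2240):
    image = Image.open(image_path).convert('RGB')
    # Downscale if the longest side exceeds max_side; thumbnail() resizes
    # in place and preserves the aspect ratio, so the page is not distorted
    if max(image.size) > max_side:
        image.thumbnail((max_side, max_side), Image.Resampling.LANCZOS)
    return image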

Error Handling

Robust processing

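import logging

def safe_ocr_processing(model, image_path):
    """Run OCR on one image and return None instead of raising on failure."""
    try:
        image = preprocess_image(image_path)
        # `model.process` is a placeholder: substitute the inference call
        # for whichever model you chose above
        result = model.process(image)
        return result
    except Exception as e:
        logging.error(f"OCR processing failed: {e}")
        return None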

Output Formatting

Structured results

from datetime import datetime

def format_ocr_output(raw_text, confidence_scores=None, model_version="olmOCR-2-7B-1025"):
    """Format OCR output with metadata"""
    return {
        "text": raw_text,
        "confidence": confidence_scores,
        "timestamp": datetime.now().isoformat(),
        "model_version": model_version
    }

These open source OCR models represent the cutting edge of document processing technology. Choose based on your specific requirements (accuracy, speed, language support, or hardware constraints) and start converting documents with high-quality results.
