Deep Learning Meets OCR: My FastAPI-Powered Document Cleaning Tool


This content originally appeared on DEV Community and was authored by John

Document Cleaner API — Clean Up Scanned Docs with AI + FastAPI
Hey folks!
I recently wrapped up a project that combines deep learning, OCR, and FastAPI to make scanned documents more readable and searchable. Whether you're working with messy handwritten notes, low-contrast scans, or old documents, this tool helps clean them up and exports them as OCR-ready PDFs.
I call it the “Document Cleaner API,” and it’s live on Google Cloud Run if you want to try it.
🧠 What It Does
The app takes scanned .jpg, .png, or zipped image files and:

  • Cleans and denoises them using a pretrained deep learning model (DnCNN) and OpenCV image processing.
  • Auto-tunes the model weights for best OCR clarity on each batch: it tests the candidate weights on 20% of the images or 10 images, whichever is smaller, and uses the best-scoring weight for the whole batch. It then tunes the OpenCV processing parameters per image.
  • Returns both cleaned PNGs and a PDF optimized for OCR.
  • Works as both a CLI tool and a REST API.
  • Is designed for cloud deployment (GCP / Docker-ready).
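
The per-image tuning step can be pictured as a small search over cleanup settings. The sketch below is only an illustration, not the project's code: it assumes a plain grid search, and `clean`, `ocr_score`, and the parameter names `blur_ksize` and `contrast` are hypothetical stand-ins for the OpenCV cleanup and the Tesseract-based scoring.

```python
from itertools import product
from typing import Any, Callable, Dict, Iterable, Tuple


def tune_per_image(
    image: Any,
    param_grid: Dict[str, Iterable[Any]],
    clean: Callable[..., Any],
    ocr_score: Callable[[Any], float],
) -> Tuple[Dict[str, Any], float]:
    """Try every parameter combination on one image; return the best one and its score."""
    names = list(param_grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # `clean` and `ocr_score` are hypothetical hooks supplied by the caller.
        score = ocr_score(clean(image, **params))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without OpenCV or Tesseract installed.
    grid = {"blur_ksize": [1, 3, 5], "contrast": [1.0, 1.5]}
    fake_clean = lambda img, blur_ksize, contrast: (blur_ksize, contrast)
    fake_score = lambda cleaned: -abs(cleaned[0] - 3) + cleaned[1]
    print(tune_per_image(None, grid, fake_clean, fake_score))
```

Passing the cleanup and scoring functions in as arguments keeps the search loop independent of any particular image library.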

Tech Stack

  • Python 3.10
  • FastAPI (https://fastapi.tiangolo.com/) for the web server
  • PyTorch for deep learning
  • OpenCV for image cleanup
  • Tesseract OCR for text recognition
  • Deployed via Google Cloud Run

Try It Live

The API is live and public on Cloud Run.
https://document-cleaning-cli-111-777-888-7777-934773375188.us-central1.run.app/
You can test it by uploading a .png or .zip.
Example — Clean a Single Image in your terminal
Run:
```bash
curl -X POST -F "file=@sample.png" \
  https://document-cleaning-cli-111-777-888-7777-934773375188.us-central1.run.app/process-document/
```
Example — Clean a ZIP of Images in your terminal
Run:
```bash
curl -X POST -F "file=@your_batch.zip" \
  https://document-cleaning-cli-111-777-888-7777-934773375188.us-central1.run.app/process-batch/ \
  --output cleaned_output.zip
```
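
If you would rather call the batch endpoint from Python, here is a rough equivalent of the curl command above using only the standard library. The `/process-batch/` path and the `file` form field come from the curl example; the function names, the timeout, and the assumption that the response body is the cleaned ZIP are mine.

```python
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple


def build_multipart(field: str, filename: str, payload: bytes) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, Content-Type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def clean_batch(base_url: str, zip_path: str, out_path: str, timeout: float = 300.0) -> Path:
    """POST a ZIP to /process-batch/ and save the response body to out_path."""
    src = Path(zip_path)
    body, content_type = build_multipart("file", src.name, src.read_bytes())
    request = urllib.request.Request(
        base_url.rstrip("/") + "/process-batch/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Assumption: the service answers with the cleaned ZIP as the raw response body.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        Path(out_path).write_bytes(response.read())
    return Path(out_path)
```

Call it as `clean_batch(base_url, "your_batch.zip", "cleaned_output.zip")`, where `base_url` is the Cloud Run address shown above. The repository's own Python example uses `requests`; this version just avoids the extra dependency.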

Auto-Tuning Per Batch
When you upload a ZIP of images, the API:

  1. Samples up to 20% of the images (max 10)
  2. Runs OCR tests using different model weights
  3. Picks the best-performing one
  4. Applies it to the entire batch for maximum clarity

This helps maintain high quality while keeping runtime fast—perfect for bulk jobs.
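
The four steps above can be sketched in a few lines. This is a minimal illustration rather than the code in the repo: rounding the 20% up, sampling at least one image, averaging the scores, and the `score` callback (standing in for "clean with this weight, then measure OCR quality") are all assumptions.

```python
import random
from statistics import mean
from typing import Any, Callable, Sequence


def sample_size(n_images: int, percent: int = 20, cap: int = 10) -> int:
    """20% of the batch or `cap`, whichever is smaller (rounded up, at least 1)."""
    if n_images <= 0:
        return 0
    rounded_up = -(-n_images * percent // 100)  # integer ceiling, no float error
    return max(1, min(cap, rounded_up))


def pick_best_weight(
    images: Sequence[Any],
    weights: Sequence[Any],
    score: Callable[[Any, Any], float],
    seed: int = 0,
) -> Any:
    """Score each candidate weight on a random sample and return the best one."""
    if not images or not weights:
        raise ValueError("need at least one image and one candidate weight")
    sample = random.Random(seed).sample(list(images), sample_size(len(images)))
    # `score(image, weight)` is a hypothetical hook: higher means cleaner OCR output.
    return max(weights, key=lambda w: mean(score(img, w) for img in sample))


if __name__ == "__main__":
    for n in (3, 12, 50, 500):
        print(n, "images ->", sample_size(n), "sampled")
```

Because the sample is capped at 10 images, the tuning cost stays roughly constant no matter how large the ZIP is.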
Local Setup

If you want to run it locally or tweak it:
Run:
```bash
git clone https://github.com/jcaperella29/Document_cleaning_CLI.git
cd Document_cleaning_CLI
pip install -r requirements.txt
```

Make sure you have Tesseract OCR installed:

  • Linux: sudo apt install tesseract-ocr
  • macOS: brew install tesseract
  • Windows: Tesseract Download

Ideas for Usage

Whether you’re automating document workflows or just trying to make old PDFs legible again, here are a few ideas:

  • Clean up scanned lab notebooks
  • Prep historical documents for OCR archiving
  • Make handwritten notes searchable
  • Integrate into pipelines with Python, Bash, or Node.js

Example integrations are in the repository:

  • “curl” + shell script for batch runs
  • Python “requests” snippet for automation
  • Node.js + Axios setup for full-stack integration

Project Structure

├── main.py # FastAPI routes
├── processor.py # Image cleanup logic (DnCNN + OCR)
├── model_weights/ # .mat weight files
├── uploads/ # Temp folder for input
├── processed/ # Output folder for cleaned files
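
To show how these pieces might fit together for a batch request, here is a rough sketch of the unpack, clean, and repack flow. It is not the repository's code: it works in memory instead of going through uploads/ and processed/, the `_cleaned.png` naming is made up, and `clean` is a hypothetical stand-in for the logic in processor.py.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Callable

# Input types named in the post; anything else in the archive is ignored.
IMAGE_SUFFIXES = {".jpg", ".png"}


def process_zip(zip_bytes: bytes, clean: Callable[[bytes], bytes]) -> bytes:
    """Run `clean` on every image in a ZIP and return a new ZIP of the results."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            name = PurePosixPath(info.filename)
            if info.is_dir() or name.suffix.lower() not in IMAGE_SUFFIXES:
                continue  # skip folders and non-image files
            cleaned = clean(src.read(info))
            # Keep only the base name so folder paths inside the archive are dropped.
            # Note: two inputs with the same base name would collide here.
            dst.writestr(f"{name.stem}_cleaned.png", cleaned)
    return out_buf.getvalue()
```

A FastAPI route in main.py could hand the uploaded bytes to a function like this and return the result as the response body.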

🙏 Feedback Welcome!

If you:

  • Have feature suggestions
  • Want to try a custom model
  • Need help deploying your own version

Feel free to open an issue or drop a star ⭐ over at:

GitHub Repo: jcaperella29/Document_cleaning_CLI
Thanks for reading! Always happy to connect with fellow developers working on AI, bioinformatics, or productivity tools.




John | Sciencx (2025-03-18T15:33:17+00:00) Deep Learning Meets OCR: My FastAPI-Powered Document Cleaning Tool. Retrieved from https://www.scien.cx/2025/03/18/deep-learning-meets-ocr-my-fastapi-powered-document-cleaning-tool/
