Automate data entry from PDF to Excel with Python

I will show you today how you can automate the process of importing PDF data into Excel. This can save a lot of time for everyone doing this daily.

This is our test data:

Let’s jump to the code!

First we need to import dependencies

from tika …


This content originally appeared on DEV Community and was authored by Stokry

I will show you today how you can automate the process of importing PDF data into Excel. This can save a lot of time for everyone doing this daily.

This is our test data:
enter image description here

Let's jump to the code!

First we need to import dependencies

from tika import parser
import pprint
from collections import defaultdict
import re
import pandas as pd

After that we need to define PrettyPrinter and get the content of the PDF file and convert it into a list:

pp = pprint.PrettyPrinter(indent=3)
parsedPDF = parser.from_file("final-test.pdf")

content = parsedPDF['content']
contentlist = content.split('\n')

Remove empty strings in the list resulting from the split

contentlist = list(filter(lambda a: a != '', contentlist))

Create an iterator and other flags that we will use to for the algorithm, Iterator of the contents of PDF per line:

iterateContent = iter(contentlist)

Dictionary placeholder of the data scraped

data = defaultdict(dict)

Our counter to count how many blocks did we able to get

 cntr = 0

Indicator which line are we in a specific block of data

 line = 1

The algorithm will use the flags cntr and line to determine if we are in a new block or existing block

while True:
    try:
        string = next(iterateContent)
    except StopIteration:
        break

    if re.match('^[A-Z\s]+$', string):
        cntr += 1           

        data[cntr]['Name'] = string
        line = 2
        print('matched')

    elif line == 2:
        data[cntr]['Address'] = string
        line += 1

    elif line == 3:
        data[cntr]['Website'] = string
        line += 1
print("Total data:", len(data.keys()))

Setting up the data into Dataframe

df = pd.DataFrame(data.values())
df.index += 1
print(df)

Write the dataframe into excel

writer = pd.ExcelWriter("dataframe.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='output', index=False)
writer.save()

Our final results:

enter image description here

Thank you all.


This content originally appeared on DEV Community and was authored by Stokry


Print Share Comment Cite Upload Translate Updates
APA

Stokry | Sciencx (2021-04-14T07:35:50+00:00) Automate data entry from PDF to Excel with Python. Retrieved from https://www.scien.cx/2021/04/14/automate-data-entry-from-pdf-to-excel-with-python/

MLA
" » Automate data entry from PDF to Excel with Python." Stokry | Sciencx - Wednesday April 14, 2021, https://www.scien.cx/2021/04/14/automate-data-entry-from-pdf-to-excel-with-python/
HARVARD
Stokry | Sciencx Wednesday April 14, 2021 » Automate data entry from PDF to Excel with Python., viewed ,<https://www.scien.cx/2021/04/14/automate-data-entry-from-pdf-to-excel-with-python/>
VANCOUVER
Stokry | Sciencx - » Automate data entry from PDF to Excel with Python. [Internet]. [Accessed ]. Available from: https://www.scien.cx/2021/04/14/automate-data-entry-from-pdf-to-excel-with-python/
CHICAGO
" » Automate data entry from PDF to Excel with Python." Stokry | Sciencx - Accessed . https://www.scien.cx/2021/04/14/automate-data-entry-from-pdf-to-excel-with-python/
IEEE
" » Automate data entry from PDF to Excel with Python." Stokry | Sciencx [Online]. Available: https://www.scien.cx/2021/04/14/automate-data-entry-from-pdf-to-excel-with-python/. [Accessed: ]
rf:citation
» Automate data entry from PDF to Excel with Python | Stokry | Sciencx | https://www.scien.cx/2021/04/14/automate-data-entry-from-pdf-to-excel-with-python/ |

Please log in to upload a file.




There are no updates yet.
Click the Upload button above to add an update.

You must be logged in to translate posts. Please log in or register.