This content originally appeared on Envato Tuts+ Tutorials and was authored by Abder-Rahman Ali
I really admire Portable Document Format (PDF) files. They are immensely popular with people because you get the same exact content and layout irrespective of your operating system, reading device or software being used.
Anyone who has worked with plain text files in Python might expect PDF files to be just as easy, but they are a bit different. PDF documents are binary files and are more complex than plain text files, since they can contain different font types, colors, and so on.
However, that doesn't mean PDF documents are hard to work with in Python. It is rather simple once you use an external module.
Initial Set Up
As I mentioned above, using an external module would be the key. The module we will be using in this tutorial is PyPDF2. As it is an external module, the first step we have to take is to install it. For that, we will be using pip, which is (based on Wikipedia):
A package management system used to install and manage software packages written in Python. Many packages can be found in the Python Package Index (PyPI).
You can follow the steps mentioned in the official guide for installing pip. There is a good chance that pip was installed automatically for you if you downloaded Python from python.org.
PyPDF2 can now be installed by typing the following command in your terminal:
pip install PyPDF2
Great! You now have PyPDF2 installed, and you're ready to start playing with PDF documents.
PyPDF2 Basics
Before we dig deeper, I would like to give you a brief overview of the PyPDF2 module. This is a completely free and open source library that can do a lot of things with PDF documents. You can use the library not only for reading from a PDF file but also for writing, splitting and merging.
A lot of things have changed in the library from its older versions. For this tutorial, I am going to use the version 2.11.1 of the library.
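Since the library has changed between versions, it can help to confirm which version is actually installed before following along. The snippet below is a minimal sketch that uses the standard library's importlib.metadata; the helper name installed_version is my own and is not part of PyPDF2.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# installed_version() is a hypothetical helper written for this tutorial.
pypdf2_version = installed_version("PyPDF2")
if pypdf2_version is None:
    print("PyPDF2 is not installed")
else:
    print("PyPDF2 version:", pypdf2_version)
```

If the printed version differs from 2.11.1, some names used in this tutorial may behave differently on your machine.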
The PyPDF2 library doesn't require any dependency for its regular features. However, you will need some dependencies to work with cryptography and images in PDF files. Automatic installation of all dependencies is possible with the command:
pip install PyPDF2[full]
However, if you know that you will need to encrypt and decrypt PDF documents with AES (Advanced Encryption Standard), you will need to install the cryptography-related dependencies:
pip install PyPDF2[crypto]
I should also point out that RC4 encryption is supported with the standalone installation of PyPDF2 without any dependencies.
Reading a PDF Document
The sample file we will be working with in this tutorial is a PDF version of Beauty and the Beast hosted on Project Gutenberg. Go ahead and download the file to follow the tutorial, or you can simply use any PDF file you like.
The following code will get you set up for extracting additional information from the file:
import PyPDF2

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PyPDF2.PdfReader(book)
The first line imports the PyPDF2 module for us to use in our program. We then use the built-in open() function to open our PDF file in binary mode.
Once the file is open, we use the PdfReader class from the module to create a PdfReader object by passing it our book as the parameter. We are now ready to handle a variety of reading operations on our book.
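To see why binary mode matters, it helps to know that a PDF file normally begins with the bytes %PDF- followed by a version number. The sketch below checks for that marker. The function names are hypothetical helpers of mine, not part of PyPDF2, and this is only a quick sanity check rather than real validation of the file.

```python
# A PDF file is expected to begin with the bytes "%PDF-" followed by a version.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(first_bytes):
    """Return True if the given bytes start with the PDF header marker."""
    return first_bytes.startswith(PDF_MAGIC)


def file_looks_like_pdf(path):
    """Read only the first few bytes of a file, in binary mode, and check them."""
    with open(path, "rb") as handle:
        return looks_like_pdf(handle.read(len(PDF_MAGIC)))


# Demonstration on in-memory bytes, so no file is needed to try it out.
print(looks_like_pdf(b"%PDF-1.4\n"))      # True
print(looks_like_pdf(b"Hello, world\n"))  # False
```

With the sample book downloaded, calling file_looks_like_pdf('beauty-and-the-beast.pdf') should be a cheap way to catch a wrong or truncated download before handing the file to PdfReader.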
More Operations on PDF Documents
After reading the PDF document, we can now carry out different operations on the document, as we will see in this section.
Number of Pages
The number of pages in a PDF document is accessible through a read-only property of the PdfReader class called pages. This property gives us a list of Page objects, which represent the individual pages of the PDF file.
You can easily get the number of pages by passing this list of Page objects to the built-in len() function.
import PyPDF2

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PyPDF2.PdfReader(book)
    number_of_pages = len(book_reader.pages)

    # Outputs: 48
    print(number_of_pages)
In this case, the returned value was 48 which is equal to the number of pages in our document.
Directly Accessing a Page Number
We have seen in the previous section that the pages property of the PdfReader class returns a list of Page objects. You can directly access any page from the list by specifying its index. Consider the following example, in which I retrieve the second item from a list of languages.
languages = ["French", "English", "Hindi"]

# Outputs: English
print(languages[1])
Directly accessing a page from the PDF document will work similarly. Here is an example:
import PyPDF2

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PyPDF2.PdfReader(book)
    page_list = book_reader.pages

    first_page = page_list[0]
    last_page = page_list[-1]
Now that we have learned how to access a Page object based on the page number, let's see how to do the reverse and get the page number from a page object. The PdfReader class has a very handy method called get_page_number() that you can use to get the page number of a given page. All you need to do is pass the Page object as a parameter to the get_page_number() method.
import random
from PyPDF2 import PdfReader

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PdfReader(book)
    page_list = book_reader.pages

    last_page = page_list[-1]
    # Outputs: 47
    print(book_reader.get_page_number(last_page))

    some_page = page_list[random.randint(15, 35)]
    # Outputs: 19
    print(book_reader.get_page_number(some_page))
In the above example, we first get the page number for the last page in our PDF document. It comes out to 47 because indexing starts at 0, so a value of 47 actually refers to page 48.
We also try the same method with a page selected at random from indices 15 to 35. The output is 19 in this particular instance, but it will vary with every execution.
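Because the returned value is a zero-based index while readers usually think in one-based page numbers, a pair of small conversion helpers can prevent off-by-one mistakes. This is only a sketch: both function names are hypothetical and are not provided by PyPDF2.

```python
def index_to_page_number(index):
    """Convert a zero-based page index to a one-based page number."""
    if index < 0:
        raise ValueError("index must be zero or greater")
    return index + 1


def page_number_to_index(page_number, total_pages):
    """Convert a one-based page number to a zero-based index, with bounds checking."""
    if not 1 <= page_number <= total_pages:
        raise ValueError(f"page number must be between 1 and {total_pages}")
    return page_number - 1


total_pages = 48  # page count reported for the sample book
print(index_to_page_number(47))               # 48
print(page_number_to_index(48, total_pages))  # 47
```

The index returned by page_number_to_index() can be used directly with the pages list, and the value returned by get_page_number() can be passed to index_to_page_number() when you want to show it to a reader.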
Page Mode and Page Layout
The library also allows you to easily access the page mode and page layout information for your PDF document. You simply need to use the properties called page_mode and page_layout to do so.
All the valid page mode values are shown in the table below:
/UseNone        | Do not show outlines or thumbnails panels
/UseOutlines    | Show outlines (aka bookmarks) panel
/UseThumbs      | Show page thumbnails panel
/FullScreen     | Fullscreen view
/UseOC          | Show Optional Content Group (OCG) panel
/UseAttachments | Show attachments panel
The table below shows all the valid page layout values:
/NoLayout       | Layout explicitly not specified
/SinglePage     | Show one page at a time
/OneColumn      | Show one column at a time
/TwoColumnLeft  | Show pages in two columns, odd-numbered pages on the left
/TwoColumnRight | Show pages in two columns, odd-numbered pages on the right
/TwoPageLeft    | Show two pages at a time, odd-numbered pages on the left
/TwoPageRight   | Show two pages at a time, odd-numbered pages on the right
In order to check our page mode and page layout, we can use the following script:
from PyPDF2 import PdfReader

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PdfReader(book)

    # Outputs: None
    print(book_reader.page_mode)

    # Outputs: None
    print(book_reader.page_layout)
In the case of our PDF document, the returned value is None for both, which means that neither the page mode nor the page layout is specified.
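If you want to print something friendlier than the raw value, the two tables above can be turned into lookup dictionaries. The sketch below assumes that the value returned by page_mode or page_layout compares equal to the plain strings listed in the tables; the describe() helper is my own and is not part of PyPDF2.

```python
PAGE_MODES = {
    "/UseNone": "Do not show outlines or thumbnails panels",
    "/UseOutlines": "Show outlines (aka bookmarks) panel",
    "/UseThumbs": "Show page thumbnails panel",
    "/FullScreen": "Fullscreen view",
    "/UseOC": "Show Optional Content Group (OCG) panel",
    "/UseAttachments": "Show attachments panel",
}

PAGE_LAYOUTS = {
    "/NoLayout": "Layout explicitly not specified",
    "/SinglePage": "Show one page at a time",
    "/OneColumn": "Show one column at a time",
    "/TwoColumnLeft": "Show pages in two columns, odd-numbered pages on the left",
    "/TwoColumnRight": "Show pages in two columns, odd-numbered pages on the right",
    "/TwoPageLeft": "Show two pages at a time, odd-numbered pages on the left",
    "/TwoPageRight": "Show two pages at a time, odd-numbered pages on the right",
}


def describe(value, descriptions):
    """Return a readable description for a page mode or page layout value."""
    if value is None:
        # None means the document does not specify the setting at all.
        return "not specified"
    return descriptions.get(value, f"unrecognized value: {value}")


print(describe(None, PAGE_MODES))             # not specified
print(describe("/UseOutlines", PAGE_MODES))   # Show outlines (aka bookmarks) panel
print(describe("/SinglePage", PAGE_LAYOUTS))  # Show one page at a time
```

In a real script you would pass book_reader.page_mode or book_reader.page_layout as the first argument instead of a literal string.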
Extract Metadata
The PdfReader class also has a property called metadata that returns the document information dictionary for the PDF file that you are reading. This metadata can contain information such as the author name, title of the document, creation date, and producer. The following example tries to extract all of this information from our own PDF document.
from PyPDF2 import PdfReader

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PdfReader(book)
    book_metadata = book_reader.metadata

    # Beauty and the Beast
    print(book_metadata.title)

    # Anonymous
    print(book_metadata.author)

    # 2006-11-30 01:13:00-08:00
    print(book_metadata.creation_date)

    # pdfeTeX-1.21a
    print(book_metadata.producer)
Please keep in mind that some PDF files could have all of these values set to None.
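Since any of these values can be None, it is worth handling missing fields explicitly when you display them. The following is a minimal sketch: summarize_metadata() is a hypothetical helper, and the SimpleNamespace object only stands in for the real metadata object so that the example runs without a PDF file.

```python
from types import SimpleNamespace

# The four attributes used in the example above.
FIELDS = ("title", "author", "creation_date", "producer")


def summarize_metadata(metadata, placeholder="(not set)"):
    """Build a dict of readable metadata values, replacing None with a placeholder."""
    summary = {}
    for field in FIELDS:
        # getattr with a default also copes with metadata itself being None.
        value = getattr(metadata, field, None)
        summary[field] = placeholder if value is None else str(value)
    return summary


# Stand-in for a metadata object where only the title is available.
fake_metadata = SimpleNamespace(title="Beauty and the Beast", author=None)
for field, value in summarize_metadata(fake_metadata).items():
    print(f"{field}: {value}")
```

With a real document you would pass book_reader.metadata in place of fake_metadata.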
Extract Text
We have been wandering around the file so far, so let's see what's inside. The extract_text() method will be our friend in this task. The script to extract text from a page of the PDF document is as follows:
from PyPDF2 import PdfReader

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PdfReader(book)
    page_list = book_reader.pages

    story_page = page_list[6]
    page_text = story_page.extract_text()
    print(page_text)
The output that I got after executing the above script is shown below:
[002] BEAUTY AND THE BEAST. Once upon a time, in a very far-off country, there lived a mer- chant who had been so fortunate in all his undertakings that he was enormously rich. As he had, however, six sons and six daughters,hefoundthathismoneywasnottoomuchtoletthem allhaveeverythingtheyfancied,astheywereaccustomedtodo. But one day a most unexpected misfortune befell them. Their house caught fire and was speedily burnt to the ground, with all the splendid furniture, the books, pictures, gold, silver, and precious goods it contained; and this was only the beginning of
I was able to extract all the text on the page. However, as you can see, the extract_text() method doesn't get the spacing between the words right in some places. The final result depends on a variety of factors, one of them being the generator used to create the PDF file. This means that you won't face this issue in all PDF files, but some of them are bound to have messed-up spacing upon text extraction.
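Some of this can be tidied up after extraction. The sketch below uses the standard re module to rejoin words that were split by a hyphen at a line break (such as "mer- chant") and to collapse runs of whitespace. The function tidy_extracted_text() is my own helper, not a PyPDF2 feature. It cannot restore spaces that are missing entirely, and it may wrongly join a word whose hyphen was meant to be there.

```python
import re


def tidy_extracted_text(text):
    """Rejoin hyphen-split words and collapse whitespace in extracted text."""
    # "mer- chant" or "mer-\nchant" becomes "merchant"; "far-off" is left alone
    # because no whitespace follows its hyphen.
    text = re.sub(r"(\w)-\s+(\w)", r"\1\2", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = "in a very far-off country, there lived a mer- chant who"
print(tidy_extracted_text(sample))
# in a very far-off country, there lived a merchant who
```

In the script above, you would call tidy_extracted_text(page_text) before printing.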
Conclusion
As we can see, Python makes it simple to work with PDF documents. This tutorial just scratched the surface of this topic, and you can find more details on the different operations you can perform on PDF documents on the PyPDF2 documentation page.
Abder-Rahman Ali | Sciencx (2016-01-17T09:14:02+00:00) How to Work With PDF Documents Using Python. Retrieved from https://www.scien.cx/2016/01/17/how-to-work-with-pdf-documents-using-python/