How do I extract text from a PDF in Python?

You are probably familiar with PDFs: they are among the most important and widely used digital document formats. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.

Extracting Text from PDF File

The Python package PyPDF2 can be used to achieve what we want (text extraction), although it can do more than we need. It can also be used to generate, decrypt, and merge PDF files.

Note: For more information, refer to Working with PDF files in Python

Installation

To install this package, type the following command in the terminal.

pip install PyPDF2

Example:

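Input PDF: example.pdf (image not reproduced)

# Uses the PyPDF2 1.x/2.x names; these were removed in PyPDF2 3.0.
import PyPDF2

# Open the PDF in binary read mode
pdfFileObj = open('example.pdf', 'rb')

# Wrap the file object in a PDF reader
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Print the number of pages in the document
print(pdfReader.numPages)

# Fetch the first page (page numbers start at 0)
pageObj = pdfReader.getPage(0)

# Extract and print the text of that page
print(pageObj.extractText())

# Close the underlying file
pdfFileObj.close()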

Output: (image not reproduced; it shows the page count followed by the extracted text of the first page)

Let us try to understand the above code in chunks:

```
pdfFileObj = open('example.pdf', 'rb')
```
We open example.pdf in binary mode and save the file object as pdfFileObj.
```
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
```
Here, we create an object of the PdfFileReader class of the PyPDF2 module, passing it the PDF file object, and get a PDF reader object back.
```
print(pdfReader.numPages)
```
The numPages property gives the number of pages in the PDF file. In our case, it is 20 (see the first line of the output).
```
pageObj = pdfReader.getPage(0)
```
Now, we create an object of the PageObject class of the PyPDF2 module. The PDF reader object has a getPage() method, which takes a page number (starting from index 0) as its argument and returns the page object.
```
print(pageObj.extractText())
```
The page object has an extractText() method that extracts the text from the PDF page.
```
pdfFileObj.close()
```
Finally, we close the PDF file object.
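The camelCase names used above (PdfFileReader, numPages, getPage, extractText) were removed in PyPDF2 3.0. The following is a minimal sketch of the same steps with the newer names (PdfReader, pages, extract_text), extended to read every page. It assumes PyPDF2 3.x is installed; the helper names collect_text and read_pdf_text are illustrative and not part of the library.

```python
def collect_text(reader):
    """Join the text of every page exposed by a PdfReader-like object."""
    parts = []
    for page in reader.pages:
        # Fall back to an empty string if a page yields no text
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


def read_pdf_text(path):
    """Open `path` with PyPDF2 3.x and return the text of all pages."""
    # Imported here so collect_text() stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    # The context manager closes the file even if extraction fails
    with open(path, "rb") as handle:
        reader = PdfReader(handle)
        print(len(reader.pages))  # page count, the equivalent of numPages
        return collect_text(reader)
```

Calling read_pdf_text('example.pdf') should print the page count and return the text of the whole document as one string.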

Learn to use Python to extract text from PDFs


In this blog, we are going to examine the most popular libraries for processing PDFs with Python. A lot of information is shared as PDFs, and often we need to extract some details for further processing.

To identify the most popular Python libraries, I looked across StackOverflow and Reddit and ran lots of Google searches. I identified numerous packages, each with its own strengths and weaknesses. Specifically, users across the internet seem to be using: PyPDF2, Textract, tika, pdfPlumber, pdfMiner.

During my research, however, for one reason or another, I was only able to get 3 of these libraries to work as expected. For some of the others, the setup was too complicated (missing dependencies, strange error messages, etc.).

Let us quickly review all these libraries anyway.

PyPDF2

Rating: 3/5

The good news with PyPDF2 was that it was a breeze to install. The documentation lacks easy-to-follow examples, but pay close enough attention and you can figure it out eventually.

The bad news, however, is that the results were not great.

It identified the right text but, for some reason, broke it up into multiple lines.

The code:

import PyPDF2

fhandle = open(r'D:\examplepdf.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(fhandle)
pagehandle = pdfReader.getPage(0)
print(pagehandle.extractText())
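One way to cope with the stray line breaks is to post-process the extracted string. The sketch below is not part of PyPDF2; unwrap_lines is a hypothetical helper. It assumes that a blank line marks a real paragraph break and that any single line break can safely become a space, which will not hold for every document (tables and lists, for instance).

```python
import re


def unwrap_lines(text):
    """Collapse single line breaks into spaces, keeping blank-line paragraph breaks."""
    # A run of newlines with only whitespace in between counts as a paragraph break
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, squeeze all whitespace (including newlines) to single spaces
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]
    # Drop paragraphs that were empty after cleaning
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)


print(unwrap_lines("Hello\nworld\n\nNext\nparagraph"))
```

The helper would be applied to the string returned by extractText() before any further processing.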

Textract

Rating: 0/5

Off to a promising start with the number of people raving about this library. The documentation is also good.

Unfortunately, the latest version has a bug which throws an error every time you try to extract text from a PDF. Following the bug through the library’s dev forum, there may be a fix in the works. Fingers crossed.

Apache Tika

Rating: 2/5

Apache Tika has a Python library which apparently lets you extract text from PDFs. Installing the Python library is simple enough, but it will not work unless you have Java installed.

At least that is the theory. I did not want to install Java, so I remained stuck at the “RuntimeError: Unable to start Tika server.” error.

According to this Medium blog (no affiliation), however, once you get it working, it is terrific. So, let’s go with a 2/5 rating.

The code would apparently look something like:

from tika import parser

file = r'D:\examplepdf.pdf'
file_data = parser.from_file(file)
text = file_data['content']
print(text)

pdfPlumber

Rating: 5/5

Right when I started losing faith in the existence of a simple-to-use Python library for mining text out of PDFs, along comes pdfPlumber.

The documentation is not too bad; within minutes, the whole thing gets going. The results are as good as they can be.

It is worth noting, however, that the library specifically says it works best on machine-generated PDFs rather than scanned documents, which is what I used.

The code:

import pdfplumber

with pdfplumber.open(r'D:\examplepdf.pdf') as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())

pdfMiner3

Rating: 4/5

I will be honest: in typical Pythonic fashion, I glanced at the documentation (twice!) and failed to understand how I was meant to run this package. The same goes for the original pdfMiner (not the version 3 that I am reviewing here). I even installed it and tried a few things with no success.

Luckily, a kind stranger on StackOverflow came to my rescue. Once you go through the example provided, it is actually easy to follow. Oh, and the results are as good as you would expect.

The code can be found in the linked StackOverflow post.

PDF -> JPEG -> Text

Another way that this problem could be addressed is by transforming the PDF file into an image. This could be done either programmatically or by taking a screenshot of each page.

Once you have the image files, you can use the Tesseract OCR library to extract the text from them.
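The PDF-to-image-to-text route might be sketched as follows. This is an illustration under stated assumptions rather than the author's code: it assumes the pdf2image package (which needs Poppler) for rendering pages and the pytesseract wrapper (which needs the Tesseract binary) for OCR. The helper names ocr_pages and pdf_to_text_via_ocr are hypothetical.

```python
def ocr_pages(images, ocr=None):
    """Run OCR on each page image and join the results with blank lines."""
    if ocr is None:
        # Assumption: pytesseract and the Tesseract binary are installed
        import pytesseract
        ocr = pytesseract.image_to_string
    # Strip the whitespace OCR engines tend to leave around each page
    return "\n\n".join(ocr(image).strip() for image in images)


def pdf_to_text_via_ocr(path):
    """Render every page of `path` to an image, then OCR each image."""
    # Assumption: pdf2image and Poppler are installed
    from pdf2image import convert_from_path

    # convert_from_path returns one PIL image per page
    return ocr_pages(convert_from_path(path))
```

Passing the OCR function as a parameter keeps the page-joining logic separate from the engine, so a different OCR backend could be swapped in.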


How do I extract specific text from a PDF in Python?

Step 1: Import all libraries. Step 2: Convert the PDF file to text and read the data. Step 3: Use the findall() function from the regular expressions module (re) to extract keywords.
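Step 3 can be sketched with the standard re module. The example assumes the PDF text has already been extracted into a string. The sample text, the keyword list, and the helper name find_keywords are made up for illustration.

```python
import re


def find_keywords(text, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword in `text`."""
    counts = {}
    for word in keywords:
        # re.escape guards against keywords containing regex metacharacters
        pattern = r"\b" + re.escape(word) + r"\b"
        counts[word] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Stand-in for text extracted from a PDF
sample = "Invoice total: 120 USD. Invoice date: 2021-03-04."

print(find_keywords(sample, ["invoice", "total"]))  # {'invoice': 2, 'total': 1}
print(re.findall(r"\d{4}-\d{2}-\d{2}", sample))     # ['2021-03-04']
```

The second call shows that findall() works just as well for patterns, such as dates, as for fixed keywords.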

Can you extract data from a PDF with Python?

There are a couple of Python libraries you can use to extract data from PDFs. For example, you can use the PyPDF2 library to extract text from PDFs where the text is laid out sequentially, i.e. in lines or forms. You can also extract tables from PDFs with the Camelot library.