How do I extract text from a PDF in Python?

    You are probably familiar with PDFs; they are among the most widely used digital document formats. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.

    Extracting Text from PDF File

    The Python package PyPDF2 can do what we want (text extraction), although it can do more than we need. It can also be used to generate, decrypt, and merge PDF files.

    Note: For more information, refer to Working with PDF files in Python

    Installation

    To install this package, run the following command in the terminal.

    pip install PyPDF2

    Example:

    Input PDF: example.pdf

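    import PyPDF2

    # Note: PdfFileReader, numPages, getPage and extractText are the names used by PyPDF2 releases before 3.0.
    pdfFileObj = open('example.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())
    pdfFileObj.close()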

    Output: the page count (20 for this file), followed by the text of the first page.

    Let us try to understand the above code in chunks:

    • pdfFileObj = open('example.pdf', 'rb')

      We open example.pdf in binary mode and save the file object as pdfFileObj.

    • pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

      Here, we create an object of the PdfFileReader class from the PyPDF2 module, passing it the PDF file object, and get a PDF reader object back.

    • print(pdfReader.numPages)

      The numPages property gives the number of pages in the PDF file. In our case, it is 20 (the first line of the output).

    • pageObj = pdfReader.getPage(0)

      Now, we create an object of the PageObject class of the PyPDF2 module. The PDF reader object has a method getPage(), which takes a page number (starting from index 0) as its argument and returns the page object.

    • print(pageObj.extractText())

      The page object has a method extractText() that extracts the text from the PDF page.

    • pdfFileObj.close()

      Finally, we close the PDF file object.
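
The names used in this walkthrough (PdfFileReader, numPages, getPage, extractText) belong to older PyPDF2 releases; PyPDF2 3.0 removed them, and the project continues under the name pypdf. Below is a minimal sketch of the same steps with the newer names, assuming pypdf is installed (`pip install pypdf`). The helper names `first_page_summary` and `summarize_pdf` are made up for this illustration and are not part of the library.

```python
def first_page_summary(reader):
    """Return (page_count, first_page_text) for a PdfReader-like object."""
    pages = reader.pages                  # list-like sequence of page objects
    page_count = len(pages)               # replaces the old numPages property
    # pages[0].extract_text() replaces getPage(0).extractText()
    first_text = pages[0].extract_text() if page_count else ""
    return page_count, first_text


def summarize_pdf(path):
    """Open the PDF at `path` with pypdf and summarize it."""
    from pypdf import PdfReader           # imported here so the helper above works without pypdf
    with open(path, "rb") as pdf_file:    # the with-block closes the file for us
        return first_page_summary(PdfReader(pdf_file))
```

Calling `summarize_pdf('example.pdf')` would return the page count and the first page's text together, and the `with` block takes over the job of the explicit `close()` call.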

    Learn to use Python to extract text from PDFs

    In this blog, we are going to examine the most popular libraries for processing PDFs with Python. A lot of information is shared as PDFs, and often we need to extract some details for further processing.

    To identify the most popular Python libraries, I looked across StackOverflow, Reddit, and a lot of Google searches. I identified numerous packages, each with its own strengths and weaknesses. Specifically, users across the internet seem to be using: PyPDF2, Textract, tika, pdfPlumber, pdfMiner.

    During my research, however, for one reason or another, I was only able to get 3 of these libraries to work as expected. For some of these libraries, the setup was too complicated (missing dependencies, strange error messages, etc.).

    Let us quickly review all these libraries anyway.

    PyPDF2

    Rating: 3/5

    The good news with PyPDF2 was that it was a breeze to install. The documentation is somewhat lacking in easy-to-follow examples, but pay close enough attention and you can figure it out eventually.

    The bad news, however, is that the results were not great.

    It identified the right text, but for some reason it broke it up into multiple lines.

    The code:

    import PyPDF2
    fhandle = open(r'D:\examplepdf.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(fhandle)
    pagehandle = pdfReader.getPage(0)
    print(pagehandle.extractText())
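
If the broken lines are a problem, one possible workaround is to post-process the extracted string. The sketch below is plain Python, not a PyPDF2 feature: the hypothetical helper `rejoin_lines` merges single line breaks into spaces and treats blank lines as paragraph breaks. It assumes every single line break is spurious, so it would also flatten genuine lists or tables.

```python
import re


def rejoin_lines(text):
    """Merge hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside each paragraph, collapse every run of whitespace (newlines included) to one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```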

    Textract

    Rating: 0/5

    Textract got off to a promising start, given the number of people raving about this library. The documentation is also good.

    Unfortunately, the latest version has a bug that throws an error every time you try to extract text from a PDF. Judging by the discussion of the bug on the library’s dev forum, there may be a fix in the works. Fingers crossed.

    Apache Tika

    Rating: 2/5

    Apache Tika has a Python library which apparently lets you extract text from PDFs. Installing the Python library is simple enough, but it will not work unless you have Java installed.

    At least that is the theory. I did not want to install Java, so I got no further than the error “RuntimeError: Unable to start Tika server.”

    According to a Medium blog post (no affiliation), however, once you get it working, it is terrific. So, let’s go with a 2/5 rating.

    The code would apparently look something like:

    from tika import parser
    file = r'D:\examplepdf.pdf'
    file_data = parser.from_file(file)
    text = file_data['content']
    print(text)

    pdfPlumber

    Rating: 5/5

    Right when I started losing faith in the existence of a simple-to-use Python library for mining text out of PDFs, along comes pdfPlumber.

    The documentation is not too bad; within minutes, the whole thing gets going. The results are as good as they can be.

    Worth noting, however, that the library specifically says it works best on machine-generated PDFs rather than scanned documents, and a machine-generated PDF is what I used.

    The code:

    import pdfplumber
    with pdfplumber.open(r'D:\examplepdf.pdf') as pdf:
        first_page = pdf.pages[0]
        print(first_page.extract_text())
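
To read the whole document rather than just the first page, the same call can be repeated for every page. This is a sketch under a few assumptions: the helper names are invented here, and the `or ""` guard is there because `extract_text()` may return `None` or an empty string for a page without a text layer, depending on the pdfplumber version.

```python
def join_page_texts(pages, separator="\n\n"):
    """Concatenate the text of each page object, skipping pages with no text."""
    texts = (page.extract_text() or "" for page in pages)
    return separator.join(t for t in texts if t.strip())


def extract_all_text(path):
    """Extract text from every page of the PDF at `path` using pdfplumber."""
    import pdfplumber                     # imported lazily; requires `pip install pdfplumber`
    with pdfplumber.open(path) as pdf:
        return join_page_texts(pdf.pages)
```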

    pdfMiner3

    Rating: 4/5

    I will be honest; in a typical Pythonic way, I glanced at the documentation (twice!) and failed to understand how I was meant to run this package. This includes pdfMiner as well (not the version 3 that I am reviewing here). I even installed it and tried a few things with no success.

    Alas, to my rescue came a kind stranger on StackOverflow. Once you go through the example provided, it is actually easy to follow. Oh, and the results are as good as you would expect.

    The code can be found in the linked StackOverflow post.
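
Since that post is not reproduced here, the following is a substitute rather than the code the author used: a minimal sketch based on pdfminer.six, a community-maintained fork that offers a high-level `extract_text` function, and not on pdfMiner3 itself. The wrapper `pdf_to_text` and its `extract` parameter are hypothetical, added only so the function can be exercised without the library installed.

```python
def pdf_to_text(path, extract=None):
    """Return the text of the PDF at `path`, stripped of leading/trailing whitespace."""
    if extract is None:
        # Assumes pdfminer.six is installed: pip install pdfminer.six
        from pdfminer.high_level import extract_text as extract
    return extract(path).strip()
```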

    PDF -> JPEG -> Text

    Another way that this problem could be addressed is by transforming the PDF file into an image. This could be done either programmatically or by taking a screenshot of each page.

    Once you have the image files, you can use the tesseract library to extract the text out of them.
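
The programmatic route could look like the sketch below. It assumes the third-party packages pdf2image and pytesseract are installed, along with the Poppler and Tesseract programs they wrap; apart from Tesseract, none of these are named in the article, and the function names are illustrative.

```python
def ocr_images(images, ocr):
    """Run the callable `ocr` on each image and join the per-page results."""
    return "\n".join(ocr(image).strip() for image in images)


def ocr_pdf(path, dpi=300):
    """Render each page of the PDF to an image, then OCR it with Tesseract."""
    from pdf2image import convert_from_path    # needs Poppler on the system
    import pytesseract                          # needs the Tesseract binary
    images = convert_from_path(path, dpi=dpi)   # one PIL image per page
    return ocr_images(images, pytesseract.image_to_string)
```

A higher `dpi` generally gives the OCR engine more detail to work with, at the cost of slower rendering.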

    How do I extract specific text from a PDF in Python?

    Step 1: Import the required libraries. Step 2: Convert the PDF file to text and read the data. Step 3: Use the “findall()” function of regular expressions to extract keywords.
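
Step 3 can be sketched with the standard `re` module. The sample text, the keyword list, and the helper name `find_keywords` are invented for this example; in practice the text would come from whichever extraction library you chose.

```python
import re


def find_keywords(text, keywords):
    """Return every whole-word, case-insensitive occurrence of the given keywords."""
    # re.escape guards against keywords that contain regex metacharacters.
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.findall(pattern, text, flags=re.IGNORECASE)


sample = "Invoice 42: total due 100 USD. Invoice date: 2020-01-01."
print(find_keywords(sample, ["invoice", "total"]))  # ['Invoice', 'total', 'Invoice']
```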

    Can you extract data from a PDF with Python?

    Several Python libraries can extract data from PDFs. For example, you can use the PyPDF2 library to extract text that is laid out sequentially, i.e. in lines or forms. You can also extract tables from PDFs with the Camelot library.

    How do I extract text from a PDF?

    1. Open the PDF document you wish to convert.
    2. Go to the Convert Tab > Convert To > Text on the toolbar.
    3. Choose a file name and location to save the .txt document that will contain the extracted text.
    4. Click Save to extract the text to the selected file.

    Can Python convert PDF text?

    Modern document-processing Python API creates a TXT document from PDF with professional quality. Test the highest quality PDF to TXT conversion right in your browser. Powerful Python library allows converting PDF files to almost all TXT document formats.