How do I extract text from a PDF in Python?
All of you must be familiar with what PDFs are. In fact, they are one of the most important and widely used digital media. PDF stands for Portable Document Format, and its files use the .pdf extension. The format is used to present and exchange documents reliably, independent of software, hardware, or operating system.

Extracting Text from PDF File

The Python package PyPDF2 can be used to achieve what we want (text extraction), although it can do more than what we need. This package can also be used to generate, decrypt, and merge PDF files.

Note: For more information, refer to Working with PDF files in Python

Installation

To install this package, type the below command in the terminal.

    pip install PyPDF2
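A minimal sketch of page-by-page extraction with PyPDF2, assuming PyPDF2 2.x or later (where the reader class is PdfReader and pages expose extract_text()). The function names join_page_text and extract_pdf_text and the page separator format are illustrative choices, not part of the library.

```python
from typing import Iterable, List


def join_page_text(pages: Iterable) -> str:
    """Concatenate the text of page-like objects that expose extract_text()."""
    chunks: List[str] = []
    for number, page in enumerate(pages, start=1):
        # Image-only pages may yield an empty result, so fall back to ""
        text = page.extract_text() or ""
        chunks.append(f"--- page {number} ---\n{text.strip()}")
    return "\n".join(chunks)


def extract_pdf_text(path: str) -> str:
    """Read every page of the PDF at `path` and return its text."""
    # Imported here so join_page_text stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader  # assumes PyPDF2 2.x or later

    with open(path, "rb") as handle:
        reader = PdfReader(handle)
        # Pages are read while the file is still open
        return join_page_text(reader.pages)
```

Calling extract_pdf_text("example.pdf") (a hypothetical file name) would return the text of every page, each preceded by a "--- page N ---" marker. Scanned PDFs that contain only images will likely come back empty, since no text layer exists to read.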
In this blog, we are going to examine the most popular libraries for processing PDFs with Python. A lot of information is shared in the form of PDFs, and often we need to extract some details for further processing.

To identify the most popular Python libraries, I looked across StackOverflow, Reddit, and lots of Google searches. I identified numerous packages, each with its own strengths and weaknesses. Specifically, users across the internet seem to be using PyPDF2, Textract, tika, pdfPlumber, and pdfMiner.

During my research, however, for one reason or another, I was only able to get 3 of these libraries to work as expected. For some of these libraries, the setup was too complicated (missing dependencies, strange error messages, etc.). Let us quickly review all these libraries anyway.

PyPDF2
Rating: 3/5

The good news with PyPDF2 was that it was a breeze to install. The documentation is somewhat lacking in easy examples to follow, but pay close enough attention, and you can figure it out eventually. The bad news, however, is that the results were not great: it identified the right text, but for some reason it broke it up into multiple lines.

The code:

    import PyPDF2

    # PyPDF2 3.0 removed PdfFileReader, getPage and extractText in favour of the names below
    with open(r'D:\examplepdf.pdf', 'rb') as fhandle:
        pdf_reader = PyPDF2.PdfReader(fhandle)
        page = pdf_reader.pages[0]
        print(page.extract_text())

Textract
Rating: 0/5

This one got off to a promising start, given the number of people raving about the library. The documentation is also good. Unfortunately, the latest version has a bug which throws an error every time you try to extract text from a PDF. Following the bug through the library’s dev forum, there may be a fix in the works. Fingers crossed.

Apache Tika
Rating: 2/5

Apache Tika has a Python library which apparently lets you extract text from PDFs. Installing the Python library is simple enough, but it will not work unless you have Java installed. At least that is the theory.
I did not want to install Java, so I never got past the “RuntimeError: Unable to start Tika server.” error. According to a Medium blog post (no affiliation), however, once you get it working, it is terrific. So, let’s go with a 2/5 rating.

The code would apparently look something like:

    from tika import parser

    file = r'D:\examplepdf.pdf'
    file_data = parser.from_file(file)
    text = file_data['content']
    print(text)

pdfPlumber
Rating: 5/5

Right when I started losing faith in the existence of a simple-to-use Python library for mining text out of PDFs, along comes pdfPlumber. The documentation is not too bad; within minutes, the whole thing gets going. The results are as good as they can be. Worth noting, however, that the library specifically says it works best on machine-generated PDFs, which is what I used, rather than scanned documents.

The code:

    import pdfplumber

    with pdfplumber.open(r'D:\examplepdf.pdf') as pdf:
        # Assumed body (missing from the listing): print the first page's text
        print(pdf.pages[0].extract_text())

pdfMiner3
Rating: 4/5

I will be honest: in a typical pythonic way, I glanced at the documentation (twice!) and failed to understand how I was meant to run this package. The same goes for the original pdfMiner, not just version 3 that I am reviewing here. I even installed it and tried a few things with no success. Luckily, a kind stranger on StackOverflow came to my rescue. Once you go through the example provided, it is actually easy to follow. Oh, and the results are as good as you would expect. The code can be found in the linked StackOverflow post.

PDF -> JPEG -> Text

Another way to address this problem is to transform the PDF file into images. This could be done either programmatically or by taking a screenshot of each page. Once you have the image files, you can use the tesseract library to extract the text out of them.

How do I extract specific text from a PDF in Python?
Step 1: Import all libraries.
Step 2: Convert the PDF file to txt format and read the data.
Step 3: Use the “.findall()” function of regular expressions to extract keywords.
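Step 3 can be sketched as follows, assuming the text has already been extracted from the PDF into a string. The helper count_keywords and the sample text are illustrative only; they stand in for whatever Step 2 produced.

```python
import re
from typing import Dict, Iterable


def count_keywords(text: str, keywords: Iterable[str]) -> Dict[str, int]:
    """Count case-insensitive occurrences of each keyword, matched at word boundaries."""
    counts: Dict[str, int] = {}
    for keyword in keywords:
        # re.escape treats the keyword as literal text; \b anchors the match at word boundaries
        pattern = r"\b" + re.escape(keyword) + r"\b"
        counts[keyword] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Stand-in for text that an earlier step extracted from a PDF (hypothetical sample)
sample_text = "Invoice 1042: total due 250.00 USD. Invoice date 2021-03-05."

print(count_keywords(sample_text, ["invoice", "total"]))  # {'invoice': 2, 'total': 1}
print(re.findall(r"\d{4}-\d{2}-\d{2}", sample_text))      # ['2021-03-05']
```

The same re.findall call works for any pattern, not just fixed keywords, as the date example shows. Matching at word boundaries avoids counting "total" inside a longer word such as "totally".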
Can you extract data from a PDF with Python?
Yes. Several Python libraries can extract data from PDFs. For example, you can use the PyPDF2 library to extract text from PDFs where the text is laid out sequentially, i.e. in lines or forms. You can also extract tables from PDFs with the Camelot library.
How do I extract text from a PDF?
1. Open the PDF document you wish to convert.
2. Go to the Convert tab > Convert To > Text on the toolbar.
3. Choose a file name and location to save the .txt document that will contain the extracted text.
4. Click Save to extract the text to the selected file.

Can Python convert PDF text?
Yes. Python document-processing libraries can convert a PDF into a plain-text (TXT) document.