Python processing PDF: Installation and use of PyMuPDF!
Hello everyone, I am Python Artificial Intelligence Technology
1. Introduction to PyMuPDF
1. Introduction
Before introducing PyMuPDF, let's first look at MuPDF. As the name suggests, PyMuPDF is the Python binding for MuPDF.
MuPDF
MuPDF is a lightweight PDF, XPS and e-book viewer. MuPDF consists of software libraries, command line tools, and viewers for various platforms.
The renderer in MuPDF is tailor-made for high-quality anti-aliased graphics. It renders text with measurements and spacing accurate to within a fraction of a pixel for maximum fidelity in reproducing the appearance of a printed page on the screen.
The viewer is small and fast, yet complete. It supports multiple document formats such as PDF, XPS, OpenXPS, CBZ, EPUB and FictionBook 2. You can use the mobile viewer to annotate and fill out forms in PDF documents (this feature will soon be available for desktop viewers as well).
The command line tools allow you to annotate, edit, and convert documents to other formats such as HTML, SVG, PDF, and CBZ. You can also write scripts in JavaScript to manipulate documents.
PyMuPDF
PyMuPDF (current version 1.18.17) is a Python binding that supports MuPDF (current version 1.18.*).
With PyMuPDF, you can access files with extensions ".pdf", ".xps", ".oxps", ".cbz", ".fb2" or ".epub". In addition, about 10 popular image formats can also be processed like documents: ".png", ".jpg", ".bmp", ".tiff", etc.
2. Functions
For all supported document types:
- Decrypt files
- Access meta information, links and bookmarks
- Render the page in raster format (PNG and other formats) or vector format SVG
- Search for text
- Extract text and images
- Convert to other formats: PDF, (X)HTML, XML, JSON, text
- For PDF documents, a large number of additional functions exist: documents can be created, joined or split. Pages can be inserted, deleted, rearranged or modified in many ways (including annotations and form fields).
- Images and fonts can be extracted or inserted
- Full support for embedded files
- PDF files can be reformatted to support double-sided printing, posterization, and applying logos or watermarks
- Full support for password protection: decryption, encryption, encryption method selection, permission levels and user/owner password settings
- Supports PDF optional content concepts for images, text and drawings
- Can access and modify low-level PDF structures
- Command line module "python -m fitz ...", a versatile utility with the following features:
- Encryption/decryption/optimization
- Creating sub-documents
- Joining documents
- Image/font extraction
- Full support for embedded files
- Layout-preserving text extraction (all documents)
New: layout-preserving text extraction!
The script fitzcli.py provides text extraction in different formats via the subcommand "gettext". Of particular interest is layout preservation, which produces text that resembles the original physical layout as closely as possible, flowing around areas with images and reproducing text in tables and multi-column layouts.
3. Installation
PyMuPDF can be installed from the source code or from wheels.
Wheels for Windows, Linux and macOS are available in the download section of PyPI. They cover 64-bit Python versions 3.6 through 3.9, and there are also 32-bit versions for Windows. Wheels for the Linux ARM architecture have recently been added as well: look for the platform tag manylinux2014_aarch64.
PyMuPDF has no mandatory external dependencies beyond the standard library. A few methods are available only if certain packages are installed:
- Pillow: Required when using Pixmap.pil_save() and Pixmap.pil_tobytes()
- fontTools: Required when using Document.subset_fonts()
- pymupdf-fonts: An optional collection of fonts that can be used with the text output methods
Use the pip installation command:
pip install PyMuPDF
Import library:
import fitz
Instructions on naming fitz
The standard Python import statement for this library is import fitz. There are historical reasons for this:
MuPDF’s original rendering library was called Libart.
After Artifex Software acquired the MuPDF project, the development focus shifted to writing a new modern graphics library called "Fitz". Fitz started as an R&D project to replace the aging Ghostscript graphics library, but became the rendering engine for MuPDF (quoted from Wikipedia).
4. How to use
1. Import the library and check the version
import fitz
print(fitz.__doc__)
PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library. Version date: 2021-08-05 00:00:01. Built for Python 3.8 on linux (64-bit).
2. Open the document
doc = fitz.open(filename)
This creates the Document object doc. filename must be a Python string naming a file that already exists.
You can also open a document from memory data, or create a new empty PDF. You can also use documents as context managers.
3. Document methods and properties
Method/property | Description
Document.page_count | Number of pages (int)
Document.metadata | Metadata (dict)
Document.get_toc() | Table of contents (list)
Document.load_page() | Read a page (Page)
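Example:
>>> doc.page_count
1
>>> doc.metadata
{'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '福昕阅读器PDF打印机 版本 10.0.130.3456', 'creationDate': "D:20210810173328+08'00'", 'modDate': "D:20210810173328+08'00'", 'trapped': '', 'encryption': None}
4. Get the metadata
PyMuPDF fully supports standard metadata. Document.metadata is a Python dictionary with the following keys: format, title, author, subject, keywords, creator, producer, creationDate, modDate, trapped and encryption. It is available for all document types, though not all entries always contain data. Metadata fields are strings, or None if not otherwise indicated. Also note that not all of them always contain meaningful data, even when they are not None.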
5. Get the table of contents
toc = doc.get_toc()
6. Pages
Page handling is at the core of MuPDF's functionality.
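First, a Page must be created. This is a method of Document:
page = doc.load_page(pno) # loads page number 'pno' of the document (0-based)
page = doc[pno] # the short form
Any integer -inf < pno < page_count can be used here. Negative numbers count backwards from the end, so doc[-1] is the last page.
A more advanced approach is to use the document as an iterator over its pages, as in "for page in doc".
a. Inspect a page's links, annotations or form fields
When a document is displayed with viewer software, links show up as "hot areas". If you click while the cursor shows a hand symbol, you are usually taken to the target encoded in that hot area. All links of a page are returned by page.get_links(), which yields a Python list of dictionaries. Links can also be traversed with an iterator, page.links().
Pages of PDF documents may also contain annotations (Annot) or form fields (Widget), each with its own iterator: page.annots() and page.widgets().
b. Render a page
Calling pix = page.get_pixmap() creates a raster image of the page's content. pix is a Pixmap object which (in this case) contains an RGB image of the page and can be used for many purposes.
The method Page.get_pixmap() offers many variations for controlling the image: resolution, colorspace (for example, to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, and so on.
For example, to create an RGBA image (that is, one containing an alpha channel), specify pix = page.get_pixmap(alpha=True).
A Pixmap has many methods and attributes. Among them are the integers width and height (each in pixels) and stride (the number of bytes of one horizontal image line). The attribute samples represents a rectangular area of bytes holding the image data (a Python bytes object).
You can also create a vector image of a page with page.get_svg_image().
c. Save the page image to a file
The image can simply be stored in a PNG file with pix.save().
d. Extract text and images
All text, images and other information of a page can be extracted in many different forms and levels of detail with text = page.get_text(opt).
Use one of the following strings for opt to obtain different formats: "text", "blocks", "words", "html", "dict", "json", "rawdict", "rawjson", "xhtml" or "xml".
e. Search for text
You can find out exactly where on a page a certain text string appears, for example with areas = page.search_for("mupdf").
This delivers a list of rectangles, each surrounding one occurrence of the string "mupdf" (case-insensitive). You can use this information to highlight those areas (PDF only) or to create a cross reference of the document.
7. PDF maintenance
PDF is the only document type that can be modified with PyMuPDF. Other file types are read-only.
However, you can convert any document (including images) to PDF with Document.convert_to_pdf() and then apply all PyMuPDF features to the conversion result.
Document.save() always stores a PDF on disk in its current (possibly modified) state.
You can usually choose whether to save to a new file or just append your modifications to the existing file ("incremental save"), which is often much faster.
The following describes how to manipulate PDF documents.
a. Modify, create, rearrange and delete pages
There are several ways to manipulate the so-called page tree (a structure describing all pages), including Document.delete_page(), Document.copy_page(), Document.move_page(), Document.select(), Document.insert_page() and Document.new_page().
b. Join and split PDF documents
The method Document.insert_pdf() copies pages between different PDF documents. A simple joiner, with doc1 and doc2 being opened PDFs, is doc1.insert_pdf(doc2), which appends all of doc2 to the end of doc1.
To split doc1, for example into a new document holding its first 10 and its last 10 pages, create an empty PDF and call insert_pdf() on it with the from_page and to_page parameters.
c. Save
Document.save() always saves the document in its current state.
You can write changes back to the original PDF by specifying the option incremental=True. This process is (usually) very fast, because changes are appended to the original file without completely rewriting it.
d. Close
While the program continues to run, it is often desirable to "close" a document in order to return control of the underlying file to the operating system.
This is done with the Document.close() method. Apart from closing the underlying file, it also frees the buffers associated with the document.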