
Converting PDF to text in C#

Nov 24, 2016 pm 01:17 PM

Update

February 27, 2014: This article originally only described using PDFBox to parse PDF files. It has now been extended to include routines for using IFilter and iTextSharp.

This article and the corresponding Visual Studio project have been updated to the latest PDFBox version (1.8.4). The complete project, including all dependencies, can be downloaded from http://www.squarepdf.net/how-to-convert-pdf-to-text-in-net-sample-project/ (resolving the dependencies is a bit tricky).

How to parse PDF files

The main ways to extract text from PDF files in .NET are:

Microsoft’s IFilter interface and Adobe’s IFilter implementation;

iTextSharp;

PDFBox.

Unfortunately, none of these PDF parsing solutions is perfect. Each one is discussed below.
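Because no single approach is perfect, one option is to try several extractors in turn and keep the first usable result. The following is only a minimal sketch of that idea, not part of the original sample project: the class name PdfTextFallback and the method ExtractWithFallback are hypothetical, and each extractor is assumed to be one of the ExtractTextFromPdf routines from this article wrapped as a Func<string, string>.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: not part of IFilter, iTextSharp or PDFBox.
public static class PdfTextFallback
{
    // Tries each extractor in order and returns the first non-empty result.
    public static string ExtractWithFallback(
        string path, IEnumerable<Func<string, string>> extractors)
    {
        var errors = new List<Exception>();

        foreach (Func<string, string> extract in extractors)
        {
            try
            {
                string text = extract(path);
                if (!string.IsNullOrWhiteSpace(text))
                {
                    return text;
                }
            }
            catch (Exception ex)
            {
                // Remember the failure and move on to the next extractor.
                errors.Add(ex);
            }
        }

        // Every extractor either failed or returned no text.
        throw new AggregateException(
            "No extractor produced text for " + path, errors);
    }
}
```

The order of the extractors is a design choice: putting the fastest or most reliable one first keeps the common case cheap, while the slower ones only run when it fails.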

Adobe PDF IFilter

To use the IFilter interface to parse PDF files, you need:

Windows 2000 or later

Adobe Acrobat or Reader 7.0.5+ (or standalone Adobe PDF IFilter [adobe.com])

IFilter COM wrapper class [dotlucene.net]

Sample code:

using IFilter;
 
// ...
 
public static string ExtractTextFromPdf(string path) {
  return DefaultParser.Extract(path); 
}

Disadvantages:

Relies on unreliable COM interop to talk to the IFilter interface (and the combination of the IFilter COM wrapper and Adobe PDF IFilter is particularly troublesome).

Requires Adobe IFilter to be installed separately on the target system, which is a pain if you need to distribute an indexing solution to others.

iTextSharp

iTextSharp (http://sourceforge.net/projects/itextsharp/) is a .NET port of iText (http://itextpdf.com/), a Java library for manipulating PDFs. It focuses primarily on creating and editing PDFs rather than reading them, but it also supports extracting text from PDFs (although it is a bit overkill for that purpose).

Sample code:

using System.Text; // StringBuilder
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
 
// ...
  
public static string ExtractTextFromPdf(string path)
{
  using (PdfReader reader = new PdfReader(path))
  {
    StringBuilder text = new StringBuilder();
 
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }
 
    return text.ToString();
  }
}

Credit: Member number 10364982

Disadvantages:

Requires a commercial license if the AGPL license does not suit your project.

PDFBox

PDFBox is another Java PDF library. It can also be used with the original Java Lucene (see LucenePDFDocument).

Fortunately, PDFBox has a .NET version built with IKVM.NET (available from the PDFBox download page).

To use PDFBox in .NET, you need to reference:

IKVM.OpenJDK.Core.dll

IKVM.OpenJDK.SwingAWT.dll

pdfbox-1.8.4.dll

You also need to copy the following files to the bin folder:

commons-logging.dll

fontbox-1.8.4.dll

IKVM.OpenJDK.Util.dll

IKVM.Runtime.dll
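Since a missing file typically only shows up as a load error at run time, it can help to check the output folder up front. The sketch below is an illustration rather than part of the sample project: the class name PdfBoxDependencyCheck and the method FindMissing are hypothetical, and the file names are simply the ones from the two lists above (PDFBox 1.8.4).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: reports which PDFBox/IKVM files are absent.
public static class PdfBoxDependencyCheck
{
    // File names copied from the two lists above (PDFBox 1.8.4 build).
    public static readonly string[] RequiredFiles =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.8.4.dll",
        "commons-logging.dll",
        "fontbox-1.8.4.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the required files that are not present in the given folder.
    public static List<string> FindMissing(string folder)
    {
        var missing = new List<string>();
        foreach (string name in RequiredFiles)
        {
            if (!File.Exists(Path.Combine(folder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Calling PdfBoxDependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) at startup checks the folder the application was launched from, which is normally the bin folder.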

It is very simple to use PDFBox to parse PDF:

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
 
// ...
 
private static string ExtractTextFromPdf(string path)
{
  PDDocument doc = null;
  try {
    doc = PDDocument.load(path);
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
  }
  finally {
    if (doc != null) {
      doc.close();
    }
  }
}

The size of the compiled output grows by almost 18 MB in total:

IKVM.OpenJDK.Core.dll (4 MB)

IKVM.OpenJDK.SwingAWT.dll (6 MB)

pdfbox-1.8.4.dll (4 MB)

commons-logging.dll (82 kB)

fontbox-1.8.4.dll (180 kB)

IKVM.OpenJDK.Util.dll (2 MB)

IKVM.Runtime.dll (1 MB)
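As a quick sanity check on that total, the listed sizes can simply be added up. This is only a back-of-the-envelope sketch: the class name DependencySize is hypothetical, the figures are the rounded values from the list above, and 1 MB is taken as 1000 kB. The rounded sizes sum to roughly 17.3 MB, which is consistent with the "almost 18 MB" figure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: adds up the rounded sizes listed in the article.
public static class DependencySize
{
    // Approximate sizes in kB, copied from the list above (1 MB = 1000 kB).
    public static readonly Dictionary<string, double> SizesKb =
        new Dictionary<string, double>
        {
            { "IKVM.OpenJDK.Core.dll", 4000 },
            { "IKVM.OpenJDK.SwingAWT.dll", 6000 },
            { "pdfbox-1.8.4.dll", 4000 },
            { "commons-logging.dll", 82 },
            { "fontbox-1.8.4.dll", 180 },
            { "IKVM.OpenJDK.Util.dll", 2000 },
            { "IKVM.Runtime.dll", 1000 }
        };

    // Total size of all listed files, in MB.
    public static double TotalMb()
    {
        return SizesKb.Values.Sum() / 1000.0;
    }
}
```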

Speed is acceptable: parsing the U.S. Copyright Act PDF (5.1 MB) took 13 seconds.
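To reproduce a measurement like this on your own files, a small timing harness is enough. The following is a sketch under stated assumptions: the class name ExtractionTimer and the method Measure are hypothetical, and the extractor passed in is assumed to be one of the ExtractTextFromPdf routines from this article. Running the same file several times separates the first call, which includes one-time startup costs such as loading the IKVM.NET runtime, from later calls.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: times repeated runs of a text extractor.
public static class ExtractionTimer
{
    // Runs the extractor 'runs' times on the same file and returns the
    // elapsed time of each run. The first entry includes any one-time
    // startup cost; later entries reflect the steady-state speed.
    public static TimeSpan[] Measure(
        Func<string, string> extract, string path, int runs)
    {
        var timings = new TimeSpan[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            extract(path);
            watch.Stop();
            timings[i] = watch.Elapsed;
        }
        return timings;
    }
}
```

For example, ExtractionTimer.Measure(ExtractTextFromPdf, path, 3) returns three timings; comparing the first with the other two shows how much of the total is startup overhead.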

Thanks to bobrien100 for the improvement suggestions.

Disadvantages:

IKVM.NET dependency (18 MB)

Speed (especially the startup time of IKVM.NET)

