ImageEn for Delphi and C++ Builder

 

ImageEn Forum
 Need to extract all plain text from a PDF file

PeterPanino

1018 Posts

Posted - Feb 12 2025 :  11:53:39
I have tried the following:

var ThisPdfDoc := iexPdfiumCore.TPdfDocument.Create;
try
  ThisPdfDoc.LoadFromFile(APdfFile);
  for var i := 0 to ThisPdfDoc.PageCount - 1 do
  begin
    //ThisPdfDoc.Pages[i]. -> Unfortunately, there is no method to extract the text from the whole page!
  end;
finally
  ThisPdfDoc.Free;
end;

xequte

39298 Posts

Posted - Feb 12 2025 :  18:41:35
Please see the example at:

http://www.imageen.com/help/TIEPdfViewer.SelText.html

Alternatively you can iterate through all the text objects in the page:

http://www.imageen.com/help/TIEPdfViewer.Objects.html

Nigel
Xequte Software
www.imageen.com

PeterPanino

1018 Posts

Posted - Nov 25 2025 :  04:24:10
No, I need a method that extracts only the plain text from ALL pages of a PDF, without any viewer or image overhead. Can't the ImageEn PDFium DLL do this?

PeterPanino

1018 Posts

Posted - Nov 25 2025 :  07:09:26
This is my solution:

function ExtractPdfText(const FileName: string): string;
var
  Doc: TPdfDocument;
  i, j, CharCount: Integer;
  PageText: string;
begin
  Result := '';
  Doc := TPdfDocument.Create();
  try
    Doc.LoadFromFile(FileName);

    for i := 0 to Doc.PageCount - 1 do
    begin
      Doc.ActivePageIndex := i;
      CharCount := Doc.Pages[i].GetCharCount;

      // Pre-allocate string length for performance
      SetLength(PageText, CharCount);

      for j := 0 to CharCount - 1 do
        PageText[j + 1] := Doc.Pages[i].ReadChar(j);

      Result := Result + PageText + sLineBreak;
    end;
  finally
    Doc.Free;
  end;
end;
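
A possible variation on the function above, offered only as a sketch: it collects the text in a TStringBuilder instead of concatenating onto Result once per page, which may help on documents with many pages. It uses only the TPdfDocument members that appear in this thread (LoadFromFile, PageCount, ActivePageIndex, Pages[i].GetCharCount, Pages[i].ReadChar) and assumes that ReadChar returns a single Char, as the assignment into PageText suggests. The name ExtractPdfTextBuffered is hypothetical, and the unit name iexPdfiumCore is taken from the first post.

```delphi
uses
  System.SysUtils,   // TStringBuilder
  iexPdfiumCore;     // TPdfDocument (unit name as used in the first post)

// Hypothetical variant of ExtractPdfText that buffers all pages in one builder.
function ExtractPdfTextBuffered(const FileName: string): string;
var
  Doc: TPdfDocument;
  Buffer: TStringBuilder;
  i, j, CharCount: Integer;
begin
  Buffer := TStringBuilder.Create;
  try
    Doc := TPdfDocument.Create;
    try
      Doc.LoadFromFile(FileName);
      for i := 0 to Doc.PageCount - 1 do
      begin
        Doc.ActivePageIndex := i;
        CharCount := Doc.Pages[i].GetCharCount;

        // Assumption: ReadChar(j) returns one Char for character index j
        for j := 0 to CharCount - 1 do
          Buffer.Append(Doc.Pages[i].ReadChar(j));

        // Separate pages with a line break, as the original function does
        Buffer.AppendLine;
      end;
    finally
      Doc.Free;
    end;
    Result := Buffer.ToString;
  finally
    Buffer.Free;
  end;
end;
```

Whether this is measurably faster than the original depends on the documents involved; it has not been benchmarked here.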

xequte

39298 Posts

Posted - Nov 25 2025 :  14:39:50


Nigel
Xequte Software
www.imageen.com

PeterPanino

1018 Posts

Posted - Nov 26 2025 :  09:58:03
Hi Nigel,

I am encountering strange errors where PdfDoc.LoadFromFile randomly raises an exception:

var PdfDoc := TPdfDocument.Create;
PdfDoc.LoadFromFile(FileName);  // randomly raises an exception


Which is faster: TPdfDocument.LoadFromFile or TPdfDocument.LoadFromStream?

BTW, is TPdfDocument thread-safe?

PeterPanino

1018 Posts

Posted - Nov 26 2025 :  15:57:55
The issue was caused by multiple threads accessing multiple PDF files concurrently. (I was searching every PDF on my computer for several words, using 12 threads.)

The Problem:

TPdfDocument (ImageEn's PDF library) suffers internal state corruption when multiple threads reuse instances or process PDFs concurrently. Identical PDFs in different directories failed randomly: the first one found worked, while later ones failed with "corrupted file" errors.

The Solution:

FCriticalSectionPdf: TCriticalSection;  // Shared by all threads; create it once before the workers start

function ExtractPdfTextThreadLocal(const FileName: string): string;
var
  PdfDoc: TPdfDocument;  // Local, so each call works on its own instance
begin
  Result := '';
  FCriticalSectionPdf.Enter;  // Only ONE thread can enter
  try
    PdfDoc := TPdfDocument.Create;
    try
      PdfDoc.LoadFromFile(FileName);
      // Extract text...
    finally
      PdfDoc.Free;
    end;
  finally
    FCriticalSectionPdf.Leave;  // Release lock
  end;
end;


What It Does:

Serializes ALL PDF processing: only one thread can open, read, or process a PDF at any time. All other threads wait in line.

Trade-off:

Correctness: Every PDF is processed successfully.
Performance: Slower (all PDFs are processed sequentially, not in parallel).
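
For completeness, here is one way the lock's lifetime could be handled, as a sketch rather than a tested implementation. The unit name PdfTextLock, the variable PdfLock, and the function ExtractPdfTextSerialized are hypothetical. The extraction loop is the one posted earlier in this thread, and the unit name iexPdfiumCore comes from the first post. Creating the critical section in the unit's initialization section means it exists before any worker thread can call the function.

```delphi
unit PdfTextLock;  // hypothetical unit name

interface

function ExtractPdfTextSerialized(const FileName: string): string;

implementation

uses
  System.SyncObjs,  // TCriticalSection
  iexPdfiumCore;    // TPdfDocument (unit name as used in the first post)

var
  PdfLock: TCriticalSection;  // one lock shared by every thread in the process

function ExtractPdfTextSerialized(const FileName: string): string;
var
  Doc: TPdfDocument;
  PageText: string;
  i, j, CharCount: Integer;
begin
  Result := '';
  PdfLock.Enter;  // all PDF work happens inside the lock
  try
    Doc := TPdfDocument.Create;
    try
      Doc.LoadFromFile(FileName);
      for i := 0 to Doc.PageCount - 1 do
      begin
        Doc.ActivePageIndex := i;
        CharCount := Doc.Pages[i].GetCharCount;
        SetLength(PageText, CharCount);
        for j := 0 to CharCount - 1 do
          PageText[j + 1] := Doc.Pages[i].ReadChar(j);
        Result := Result + PageText + sLineBreak;
      end;
    finally
      Doc.Free;
    end;
  finally
    PdfLock.Leave;  // always released, even if LoadFromFile raises
  end;
end;

initialization
  PdfLock := TCriticalSection.Create;

finalization
  PdfLock.Free;

end.
```

Only the PDF access is serialized this way. Work that does not touch TPdfDocument, such as searching the returned string for words, can still run in parallel on the worker threads once the function has returned.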

xequte

39298 Posts

Posted - Nov 26 2025 :  16:47:04
Hi Peter

PDFium is not thread-safe:

https://groups.google.com/g/pdfium/c/HeZSsM_KEUk?pli=1


Nigel
Xequte Software
www.imageen.com