ImageEn Forum
 Need to extract all plain text from a PDF file

T O P I C    R E V I E W
PeterPanino Posted - Feb 12 2025 : 11:53:39
I have tried the following:

var ThisPdfDoc := iexPdfiumCore.TPdfDocument.Create;
try
  ThisPdfDoc.LoadFromFile(APdfFile);
  for var i := 0 to ThisPdfDoc.PageCount - 1 do
  begin
    //ThisPdfDoc.Pages[i]. -> Unfortunately, there is no method to extract the text from the whole page!
  end;
finally
  ThisPdfDoc.Free;
end;

7   L A T E S T    R E P L I E S    (Newest First)
xequte Posted - Nov 26 2025 : 16:47:04
Hi Peter

PDFium is not thread-safe:

https://groups.google.com/g/pdfium/c/HeZSsM_KEUk?pli=1


Nigel
Xequte Software
www.imageen.com
PeterPanino Posted - Nov 26 2025 : 15:57:55
The issue was caused by multiple threads accessing multiple PDF files concurrently. (I was searching all PDFs on my entire, very modern computer for several words, using 12 (!) threads.)

The Problem:

TPdfDocument (ImageEn's PDF library) has internal state corruption when multiple threads reuse instances or process PDFs concurrently. Identical PDFs in different directories would fail randomly - the first found worked, later ones failed with "corrupted file" errors.

The Solution:

FCriticalSectionPdf: TCriticalSection;  // System.SyncObjs; must be created once before any thread uses it

function ExtractPdfTextThreadLocal(const FileName: string): string;
var
  PdfDoc: TPdfDocument;
begin
  Result := '';
  FCriticalSectionPdf.Enter;  // Only ONE thread can enter
  try
    PdfDoc := TPdfDocument.Create;
    try
      PdfDoc.LoadFromFile(FileName);
      // Extract text...
    finally
      PdfDoc.Free;
    end;
  finally
    FCriticalSectionPdf.Leave;  // Release lock
  end;
end;


What It Does:

Serializes ALL PDF processing - only one thread can open/read/process a PDF at any time. All other threads wait in line.

Trade-off:

Correctness: Every PDF is processed successfully
Performance: Slower (all PDFs processed sequentially, not in parallel)
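
The snippet above leaves out where the lock is created and freed. Below is a minimal, self-contained sketch of the same serialization idea that can be compiled without ImageEn. PdfLock, RunSerialized, ActiveCount and MaxActive are hypothetical names made up for this illustration, and the Sleep call only stands in for the TPdfDocument load-and-extract step.

```pascal
program SerializedWorkDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs, System.Threading;

var
  // Hypothetical name: one process-wide lock guarding the non-thread-safe code.
  PdfLock: TCriticalSection;
  ActiveCount: Integer = 0;  // callers currently inside the guarded section
  MaxActive: Integer = 0;    // highest value ActiveCount ever reached

// Runs Work while holding PdfLock, so at most one thread is inside at a time.
function RunSerialized(const Work: TFunc<string>): string;
begin
  PdfLock.Enter;
  try
    // Both counters are only touched while the lock is held.
    Inc(ActiveCount);
    if ActiveCount > MaxActive then
      MaxActive := ActiveCount;
    try
      Result := Work();
    finally
      Dec(ActiveCount);
    end;
  finally
    PdfLock.Leave;
  end;
end;

begin
  // Create the lock before any worker starts; free it after all have finished.
  PdfLock := TCriticalSection.Create;
  try
    // 12 parallel iterations, standing in for the 12 search threads.
    TParallel.For(1, 12,
      procedure(Index: Integer)
      begin
        RunSerialized(
          function: string
          begin
            Sleep(10);  // placeholder for the real PDF load-and-extract work
            Result := 'file ' + IntToStr(Index);
          end);
      end);
    Writeln('Max concurrent callers inside the lock: ', MaxActive);
  finally
    PdfLock.Free;
  end;
end.
```

Because the counters change only while the lock is held, MaxActive never exceeds 1, which is the property the critical section is meant to guarantee. Work that does not touch the PDF library (for example, searching the extracted text) can stay outside RunSerialized and still run in parallel.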
PeterPanino Posted - Nov 26 2025 : 09:58:03
Hi Nigel,

I am encountering strange errors, where PdfDoc.LoadFromFile randomly throws an exception:

var PdfDoc: TPdfDocument;
PdfDoc := TPdfDocument.Create;
PdfDoc.LoadFromFile(FileName);


Which is faster: TPdfDocument.LoadFromFile or TPdfDocument.LoadFromStream?

BTW, is TPdfDocument thread-safe?
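
One way to settle the LoadFromFile versus LoadFromStream question is to measure both on the same files. The sketch below is only a generic timing harness built on TStopwatch from System.Diagnostics. TimeIt is a hypothetical helper name, and the Sleep call is a placeholder where the actual create, load and free sequence would go.

```pascal
program TimeLoadDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

// Runs Action the given number of times and returns the total elapsed milliseconds.
function TimeIt(const Action: TProc; Runs: Integer): Int64;
var
  Watch: TStopwatch;
  i: Integer;
begin
  Watch := TStopwatch.StartNew;
  for i := 1 to Runs do
    Action();
  Result := Watch.ElapsedMilliseconds;
end;

begin
  Writeln('Elapsed ms: ',
    TimeIt(
      procedure
      begin
        // Placeholder: create a document, load it from a file or a stream, free it.
        Sleep(5);
      end,
      20));
end.
```

Running the harness once per loading method, on the same set of files and with several repetitions, should give a fairer comparison than a single call, since the first read of a file is often dominated by disk caching.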
xequte Posted - Nov 25 2025 : 14:39:50


Nigel
Xequte Software
www.imageen.com
PeterPanino Posted - Nov 25 2025 : 07:09:26
This is my solution:

function ExtractPdfText(const FileName: string): string;
var
  Doc: TPdfDocument;
  i, j, CharCount: Integer;
  PageText: string;
begin
  Result := '';
  Doc := TPdfDocument.Create();
  try
    Doc.LoadFromFile(FileName);

    for i := 0 to Doc.PageCount - 1 do
    begin
      Doc.ActivePageIndex := i;
      CharCount := Doc.Pages[i].GetCharCount;

      // Pre-allocate string length for performance
      SetLength(PageText, CharCount);

      for j := 0 to CharCount - 1 do
        PageText[j + 1] := Doc.Pages[i].ReadChar(j);

      Result := Result + PageText + sLineBreak;
    end;
  finally
    Doc.Free;
  end;
end;
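
The loop above appends each page to Result with string concatenation. For documents with many pages, collecting the pages in a TStringBuilder is a common alternative. The sketch below shows only that accumulation step and makes no ImageEn calls: JoinPages and TPageTextFunc are hypothetical names, and the callback stands in for the per-page character loop.

```pascal
program JoinPagesDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical callback: returns the plain text of one zero-based page.
  TPageTextFunc = reference to function(PageIndex: Integer): string;

// Joins the text of PageCount pages, appending sLineBreak after each page.
function JoinPages(PageCount: Integer; const GetPageText: TPageTextFunc): string;
var
  Builder: TStringBuilder;
  i: Integer;
begin
  Builder := TStringBuilder.Create;
  try
    for i := 0 to PageCount - 1 do
    begin
      Builder.Append(GetPageText(i));
      Builder.Append(sLineBreak);
    end;
    Result := Builder.ToString;
  finally
    Builder.Free;
  end;
end;

begin
  // Stand-in page source: three pages with dummy text.
  Write(JoinPages(3,
    function(PageIndex: Integer): string
    begin
      Result := 'text of page ' + IntToStr(PageIndex + 1);
    end));
end.
```

Whether this is noticeably faster than plain concatenation depends on the number and size of the pages, so it is worth measuring before switching.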
PeterPanino Posted - Nov 25 2025 : 04:24:10
No, I need a method that only extracts ALL PLAINTEXT FROM ALL PDF-PAGES without any Viewer or image overhead. Can't the ImageEn PDFIUM DLL do this?
xequte Posted - Feb 12 2025 : 18:41:35
Please see the example at:

http://www.imageen.com/help/TIEPdfViewer.SelText.html

Alternatively you can iterate through all the text objects in the page:

http://www.imageen.com/help/TIEPdfViewer.Objects.html

Nigel
Xequte Software
www.imageen.com