| Author |
Topic  |
|
|
Merlin
Germany
7 Posts |
Posted - Mar 07 2025 : 06:51:39
|
Hello,
Is it possible to subsequently convert a PDF into a PDF with OCR content or do the pages of the PDF file have to be exported and then reassembled using TIEVisionSearchablePDFGenerator?
An example program would be great :)
Thanx |
|
|
xequte
    
39450 Posts |
Posted - Mar 07 2025 : 19:07:06
|
Sorry, do you mean that you have a PDF that contains images of text (not text itself), and you want to convert it into a PDF where the text is available (text has been OCR'ed)?
Nigel Xequte Software www.imageen.com
|
 |
|
|
Merlin
Germany
7 Posts |
Posted - Mar 10 2025 : 04:46:37
|
Hello
yes, I want to apply text recognition to a pdf file that does not contain any text. To do this, the file must be loaded, the individual pages exported as images and then the text content must be determined with the text recognition via pdfGen : TIEVisionSearchablePDFGenerator.
Hmm, maybe there's a small example available if I do not have to use external libraries for the export of the individual PDF pages.
Thanks |
 |
|
|
xequte
    
39450 Posts |
Posted - Mar 10 2025 : 19:46:39
|
Why not do it as follows:
// Convert "in.pdf" (pages are images) to "out.pdf" (text in pages now selectable)
// Note: langPath is not declared in this snippet; it is assumed to hold the
// output folder, including the trailing path delimiter
ImageEnMView1.MIO.LoadFromFile( 'D:\in.pdf' );
pdfGen := IEVisionLib.createSearchablePDFGenerator('./', IEOCRLanguageList[OCR_English_language].Code);
pdfGen.beginDocument(PAnsiChar(AnsiString(langPath + 'out')), PAnsiChar(AnsiString('title')));
for i := 0 to ImageEnMView1.ImageCount - 1 do
begin
  ImageEnMView1.SelectedImage := i; // Show the image being processed
  pdfGen.addPage(ImageEnMView1.IEBitmap.GetIEVisionImage());
end;
pdfGen.endDocument();
You will need to add iepdf32.dll to your EXE folder.
Nigel Xequte Software www.imageen.com
|
 |
|
|
Merlin
Germany
7 Posts |
Posted - Mar 12 2025 : 08:20:53
|
Hello Nigel,
thank you, I will give it a try :) |
 |
|
|
AndNit
 
Brazil
94 Posts |
Posted - May 07 2026 : 22:00:28
|
First, I'd like to thank you for the explanation; it's very simple and I've already implemented it in my code. However, I noticed that the file with OCR is MUCH larger. How can I solve this?
Here's my code. Thank you.
for i := 0 to imgMPdf.MIO.ParamsCount - 1 do
begin
  imgMPdf.MIO.Params[i].PDF_PaperSize := iepAuto;
  imgMPdf.MIO.Params[i].PDF_Compression := ioPDF_JPEG; // or ioPDF_G4FAX for monochrome images
end;
IEGlobalSettings().PDFEngine := ieenLegacy;
imgMPdf.MIO.SaveToFilePDF(CaminhoPDF);
IEGlobalSettings().PDFEngine := ieenAuto;
//
pdfGen := IEVisionLib.createSearchablePDFGenerator('./', IEOCRLanguageList[OCR_Portuguese_Language].Code);
pdfGen.beginDocument(PAnsiChar(AnsiString(CaminhoPDF)), PAnsiChar(AnsiString('title')));
for i := 0 to imgMPdf.ImageCount - 1 do
begin
  imgMPdf.SelectedImage := i; // Show the image being processed
  pdfGen.addPage(imgMPdf.IEBitmap.GetIEVisionImage());
end;
pdfGen.endDocument(); |
 |
|
|
AndNit
 
Brazil
94 Posts |
Posted - May 07 2026 : 22:04:58
|
I'd like to take this opportunity to ask which OCR is most accurate for all situations... running text, tables, text under images, etc...
TIEVisionOCREngine::ievOCRFAST |
 |
|
|
xequte
    
39450 Posts |
Posted - May 09 2026 : 07:59:02
|
Hi
TIEVisionOCREngine should generally just be left as ievOCRDefault.
The main thing is which language files you use:
- LSTM - Standard
- LSTM - Slow, Highest Quality
- LSTM + Legacy
Naturally, the second one should generally give the best results.
Nigel Xequte Software www.imageen.com
|
 |
|
|
AndNit
 
Brazil
94 Posts |
Posted - May 09 2026 : 14:35:17
|
Perfect, everything worked out, thank you Nigel.
Now, how do I create this PDF with all the metadata?
PDF_Title, PDF_Author, PDF_Subject, PDF_Keywords, PDF_Creator, PDF_Producer, etc. |
 |
|
|
xequte
    
39450 Posts |
Posted - May 09 2026 : 17:04:27
|
Unfortunately PDFium does not support meta-data at this time.
Nigel Xequte Software www.imageen.com
|
 |
|
|
AndNit
 
Brazil
94 Posts |
Posted - May 10 2026 : 17:21:34
|
Okay, thank you for your reply.
The generated PDF is not a PDF/A; if I'm not mistaken, ImageEn doesn't generate PDF/A, correct?
Do you suggest any way to include the metadata and convert the PDF to PDF/A after creating it?
Thank you. |
 |
|
|
xequte
    
39450 Posts |
Posted - May 10 2026 : 17:33:44
|
Hi
No, it is not PDF/A, you would need to use a post converter for that. I'm afraid I don't know what third party tool would be best for that.
Nigel Xequte Software www.imageen.com
|
 |
|
|
AndNit
 
Brazil
94 Posts |
Posted - May 11 2026 : 11:50:30
|
I appreciate the information from the forum; I implemented the metadata processing and conversion to PDF/A using Ghostscript. |
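The thread does not show how the Ghostscript step was done. As a rough sketch only, the program below writes a pdfmark file with the document information fields and builds a command line for Ghostscript's pdfwrite device. All file names, the helper names (PsEscape, WriteDocInfoMarks, BuildGsArgs) and the metadata values are hypothetical. It assumes a copy of Ghostscript's PDFA_def.ps template that has been edited to reference an ICC profile, and recent Ghostscript versions may also need explicit permission to read that profile.

```delphi
program GsPdfASketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Escape characters that are special inside a PostScript string literal
function PsEscape(const S: string): string;
begin
  Result := StringReplace(S, '\', '\\', [rfReplaceAll]);
  Result := StringReplace(Result, '(', '\(', [rfReplaceAll]);
  Result := StringReplace(Result, ')', '\)', [rfReplaceAll]);
end;

// Write a pdfmark file that sets the document information dictionary.
// ASCII only: non-ASCII metadata needs extra encoding that is not shown here.
procedure WriteDocInfoMarks(const AMarksFile, ATitle, AAuthor, ASubject,
  AKeywords: string);
var
  LLines: TStringList;
begin
  LLines := TStringList.Create;
  try
    LLines.Add('[ /Title (' + PsEscape(ATitle) + ')');
    LLines.Add('  /Author (' + PsEscape(AAuthor) + ')');
    LLines.Add('  /Subject (' + PsEscape(ASubject) + ')');
    LLines.Add('  /Keywords (' + PsEscape(AKeywords) + ')');
    LLines.Add('  /DOCINFO pdfmark');
    LLines.SaveToFile(AMarksFile, TEncoding.ASCII);
  finally
    LLines.Free;
  end;
end;

// Build the argument string for the Ghostscript console executable.
// APdfADefFile is assumed to be an edited copy of Ghostscript's PDFA_def.ps.
function BuildGsArgs(const AInPdf, AOutPdf, APdfADefFile,
  AMarksFile: string): string;
begin
  Result := '-dPDFA=2 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite ' +
    '-sColorConversionStrategy=RGB -dPDFACompatibilityPolicy=1 ' +
    '-sOutputFile=' + AnsiQuotedStr(AOutPdf, '"') + ' ' +
    AnsiQuotedStr(APdfADefFile, '"') + ' ' +
    AnsiQuotedStr(AInPdf, '"') + ' ' +
    AnsiQuotedStr(AMarksFile, '"');
end;

begin
  // Hypothetical file names and metadata values
  WriteDocInfoMarks('marks.txt', 'Scanned document (OCR)', 'Example Author',
    'Searchable PDF', 'ocr, pdfa');
  // Print the command line; running it (e.g. via CreateProcess) is left out
  Writeln('gswin64c.exe ' +
    BuildGsArgs('out.pdf', 'out_pdfa.pdf', 'PDFA_def.ps', 'marks.txt'));
end.
```

The pdfmark file is passed after the input PDF so that its fields apply to the converted document. Whether the result actually validates as PDF/A should be checked with a separate validator.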
 |
|
|
AndNit
 
Brazil
94 Posts |
Posted - May 31 2026 : 16:45:06
|
Good evening
How can I measure the OCR confidence score within this process?
pdfGen := IEVisionLib.createSearchablePDFGenerator('./', IEOCRLanguageList[OCR_Portuguese_Language].Code);
pdfGen.beginDocument(PAnsiChar(AnsiString(CaminhoPDF)), PAnsiChar(AnsiString('title')));
for i := 0 to imgMPdf.ImageCount - 1 do
begin
  imgMPdf.SelectedImage := i; // Show the image being processed
  pdfGen.addPage(imgMPdf.IEBitmap.GetIEVisionImage());
end;
pdfGen.endDocument(); |
 |
|
|
xequte
    
39450 Posts |
Posted - May 31 2026 : 16:48:03
|
Sorry, the information is not returned by that process.
Nigel Xequte Software www.imageen.com
|
 |
|
|
Harald

Germany
12 Posts |
Posted - Jun 05 2026 : 03:37:02
|
I would like to run this function in a background thread. I have scanned PDFs and want to convert them into searchable PDFs. Is IEVision capable of multithreading in this scenario, where multiple threads convert PDFs simultaneously? What is the best way to read the images from the PDF if I don’t want to use a visual TImageEnMView in the thread?
Thank you very much and best regards,
Harald
Document Management http://www.officemanager.de/en |
 |
|
|
Harald

Germany
12 Posts |
Posted - Jun 06 2026 : 07:31:31
|
For now, I've found a solution that works for me. The PDF is read using a DevExpress component, exported as images, and those images are then processed using IEVision OCR. The text isn't yet written back into the PDF, but this is sufficient for my purposes regarding the subsequent AI analysis of the documents.
function PdfExtractDocumentText(const AFileName: string;
  const AMaxPages: integer = -1;
  ALog: TLogCallback = nil): TArray<string>;
var
  LPDF: TdxPDFDocument;
  i, LCount: integer;
  LPageText: string;
  LOCR: TIEVisionOCR;
  LdxImage: TdxSmartImage;
  LieBitmap: TIEBitmap;
begin
  LPDF := TdxPDFDocument.Create;
  try
    LPDF.LoadFromFile(AFileName);
    // Process all pages unless AMaxPages sets a lower limit
    // (clamped so that it can never exceed the real page count)
    LCount := LPDF.PageCount;
    if (AMaxPages >= 0) and (AMaxPages < LCount) then
      LCount := AMaxPages;
    SetLength(result, LCount);
    for i := 0 to LCount - 1 do
    begin
      LPageText := LPDF.PageInfo[i].Text;
      // Fall back to OCR for pages that have no text but contain images
      if LPageText.IsEmpty and (LPDF.PageInfo[i].Images.Count > 0) then
        try
          if Assigned(ALog) then
            ALog('Page has no text but contains image(s). Running OCR text recognition.');
          LdxImage := TdxSmartImage.Create;
          LieBitmap := TIEBitmap.Create;
          try
            // Render the page to an image; skip the page if the export fails
            if not dxPDFDocumentExportToImageEx(LPDF, i, 1, LdxImage) then
              Continue;
            LieBitmap.Assign(LdxImage.GetAsBitmap);
            LOCR := IEVisionLib.createOCR(IEOCRLanguageList[OCR_English_language].Code);
            LPageText := LOCR.recognize(LieBitmap.GetIEVisionImage, IEVisionRect(0, 0, 0, 0)).c_str();
          finally
            LdxImage.Free;
            LieBitmap.Free;
          end;
        except
          on E: Exception do
            if Assigned(ALog) then
              ALog('Error while running OCR: ' + E.Message);
        end;
      result[i] := LPageText;
    end;
  finally
    LPDF.Free;
  end;
end;
Document Management https://www.officemanager.de/en |
 |
|
|
xequte
    
39450 Posts |
Posted - Jun 09 2026 : 00:50:19
|
Hi
I don't think there is a requirement for DevExpress here. It should all be possible within ImageEn:
http://www.imageen.com/help/TIEPdfViewer.html
The code will be something like (converting your code):
function PdfExtractDocumentText(const AFileName: string;
  const AMaxPages: integer = -1;
  ALog: TLogCallback = nil): TArray<string>;
var
  LPDF: TIEPdfViewer;
  i, LCount: integer;
  LPageText: string;
  LOCR: TIEVisionOCR;
  LieBitmap: TIEBitmap;
begin
  LPDF := TIEPdfViewer.Create();
  try
    LPDF.LoadFromFile(AFileName);
    // Process all pages unless AMaxPages sets a lower limit
    // (clamped so that it can never exceed the real page count)
    LCount := LPDF.PageCount;
    if (AMaxPages >= 0) and (AMaxPages < LCount) then
      LCount := AMaxPages;
    SetLength(result, LCount);
    for i := 0 to LCount - 1 do
    begin
      // Select page i; without this every iteration would read the same page
      // (assumes the TIEPdfViewer.PageIndex property, check the help page above)
      LPDF.PageIndex := i;
      LPageText := LPDF.GetText();
      // Fall back to OCR for pages that have no text
      if LPageText.IsEmpty then
        try
          if Assigned(ALog) then
            ALog('Page has no text. Running OCR text recognition.');
          LieBitmap := TIEBitmap.Create;
          try
            // Render the current page to a bitmap and OCR it
            LPDF.DrawTo( LieBitmap );
            LOCR := IEVisionLib.createOCR(IEOCRLanguageList[OCR_English_language].Code);
            LPageText := LOCR.recognize(LieBitmap.GetIEVisionImage, IEVisionRect(0, 0, 0, 0)).c_str();
          finally
            LieBitmap.Free;
          end;
        except
          on E: Exception do
            if Assigned(ALog) then
              ALog('Error while running OCR: ' + E.Message);
        end;
      result[i] := LPageText;
    end;
  finally
    LPDF.Free;
  end;
end;
Nigel Xequte Software www.imageen.com
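One possible way to call the function above, as a sketch rather than tested code: SavePdfTextToFile and its parameters are hypothetical, and TLogCallback (which is not declared anywhere in this thread) is assumed to accept nil, as the default parameter suggests.

```delphi
uses
  System.SysUtils, System.IOUtils;

// Hypothetical helper: extracts the text of a PDF with PdfExtractDocumentText
// (from the post above) and saves all page texts to one UTF-8 file,
// separated by blank lines.
procedure SavePdfTextToFile(const APdfFile, ATextFile: string);
var
  LPages: TArray<string>;
begin
  // AMaxPages and ALog keep their defaults: all pages, no logging
  LPages := PdfExtractDocumentText(APdfFile);
  TFile.WriteAllText(ATextFile,
    string.Join(sLineBreak + sLineBreak, LPages), TEncoding.UTF8);
end;
```

Because the function returns one string per page, the caller decides how pages are joined. Passing a log callback instead of nil would surface the OCR fallback messages.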
|
 |
|