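How to preserve spaces when reading a PDF document with Tesseract OCR in ASP.NET Core
I want to read a PDF document as text with all of its whitespace preserved. Below is my API code. I have already tried the approach from the linked question "How to preserve document structure in tesseract".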
[Authorize]
[HttpPost, DisableRequestSizeLimit]
[Route("ocr/extract-pdf")]
[ResponseCache(Location = ResponseCacheLocation.None, NoStore = true)]
public async Task<JsonResult> ExtractPDF(IFormFile file)
{
    try
    {
        if (file == null)
        {
            return new JsonResult(new
            {
                code = HttpStatusCode.NotFound,
                messages = new string[] { "File not found" }
            });
        }
        if (Path.GetExtension(file.FileName) != ".pdf")
        {
            return new JsonResult(new
            {
                code = HttpStatusCode.NotFound,
                messages = new string[] { "Invalid file extension, please upload a .pdf file" }
            });
        }
        string contentRootPath = _hostingEnvironment.ContentRootPath;
        string filePath = await DocumentUtil.SaveFiletodisk(contentRootPath + "\\assets\\OCR-PDFS", file);
        string text = TesseractOCRMapper.ExtractPDFUsingOCR(filePath);
        return new JsonResult(new
        {
            code = HttpStatusCode.OK,
            data = text,
            messages = new string[] { "Data extracted successfully" }
        });
    }
    catch (Exception ex)
    {
        _logger.LogError(this.GetType().Name + "." + Logger.GetCurrentMethod(), "Error saving data mapper: " + ex.Message, ex);
        // Return an error payload so that every code path returns a value.
        return new JsonResult(new
        {
            code = HttpStatusCode.InternalServerError,
            messages = new string[] { "Failed to extract text from the PDF" }
        });
    }
}
Here is my OCR class file:
// Assumed usings: the PDF types below match the Docotic.Pdf API,
// and the OCR types match the Tesseract .NET wrapper.
using System;
using System.IO;
using System.Text;
using BitMiracle.Docotic.Pdf;
using Tesseract;

public static class TesseractOCRMapper
{
    public static string ExtractPDFUsingOCR(string filePath)
    {
        var documentText = new StringBuilder();
        using (var pdf = new PdfDocument(filePath))
        {
            using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
            {
                for (int i = 0; i < pdf.PageCount; ++i)
                {
                    // Separate pages with a blank line.
                    if (documentText.Length > 0)
                        documentText.Append("\r\n\r\n");

                    PdfPage page = pdf.Pages[i];
                    string searchableText = page.GetText();

                    // Simple check if the page contains searchable text.
                    // We do not need to perform OCR in that case.
                    if (!string.IsNullOrEmpty(searchableText.Trim()))
                    {
                        documentText.Append(searchableText);
                        continue;
                    }

                    // This page is not searchable.
                    // Save the page as a high-resolution image.
                    PdfDrawOptions options = PdfDrawOptions.Create();
                    options.BackgroundColor = new PdfRgbColor(255, 255, 255);
                    options.HorizontalResolution = 300;
                    options.VerticalResolution = 300;
                    string pageImage = $"page_{i}.png";
                    page.Save(pageImage, options);

                    // Perform OCR on the rendered page image.
                    using (Pix img = Pix.LoadFromFile(pageImage))
                    {
                        using (Page recognizedPage = engine.Process(img))
                        {
                            Console.WriteLine($"Mean confidence for page #{i}: {recognizedPage.GetMeanConfidence()}");
                            string recognizedText = recognizedPage.GetText();
                            documentText.Append(recognizedText);
                        }
                    }
                    File.Delete(pageImage);
                }
            }
        }

        // Keep a copy of the extracted text on disk, then remove the uploaded file.
        using (var writer = new StreamWriter("result.txt"))
            writer.Write(documentText.ToString());
        DocumentUtil.RemoveFile(filePath);
        return documentText.ToString();
    }
}
Following those links, I created a file named ocrSettins, placed it in the tessdata/config folder, and added this line to it:
preserve_interword_spaces 1
But I still cannot read the PDF with its spaces preserved.
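One thing that may be worth checking, offered as a sketch and not a confirmed fix: the code above never passes the ocrSettins config file to the engine, so the setting in it is probably never loaded. Tesseract also usually looks for config files under tessdata/configs (plural), not tessdata/config. The Tesseract .NET wrapper exposes TesseractEngine.SetVariable, so the variable can be set in code and the config file lookup avoided entirely. The class and method names below (OcrSpacingSketch, RecognizeWithSpaces) are hypothetical, and the sketch assumes a Tesseract version that honors preserve_interword_spaces for the engine mode in use.

```csharp
using Tesseract;

// Hypothetical helper, not part of the original code.
public static class OcrSpacingSketch
{
    // Runs OCR on a single image and asks Tesseract to keep inter-word spacing.
    public static string RecognizeWithSpaces(string imagePath)
    {
        using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
        {
            // Set the variable directly instead of relying on a config file
            // being found; SetVariable returns false if the name is unknown.
            bool accepted = engine.SetVariable("preserve_interword_spaces", "1");
            if (!accepted)
            {
                throw new System.InvalidOperationException(
                    "Tesseract did not accept preserve_interword_spaces.");
            }

            using (Pix img = Pix.LoadFromFile(imagePath))
            using (Page page = engine.Process(img))
            {
                return page.GetText();
            }
        }
    }
}
```

Note that in ExtractPDFUsingOCR this setting can only affect pages that actually go through OCR. Pages that already contain searchable text are read with page.GetText() from the PDF library, so their spacing is determined by that library and not by Tesseract.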