Reading PDF content with the iTextSharp DLL in VB.NET or C#

kakaobank 2023. 5. 17. 23:20

How can I read PDF content with iTextSharp using the PdfReader class? My PDF may contain plain text or images of text.

using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);

        // Page numbers start at 1, not 0.
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

            // NOTE: this round trip through Encoding.Default can turn characters outside
            // the system code page into '?'. Drop the line if your PDFs contain such text.
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}

LGPL / FOSS iTextSharp 4.x

var pdfReader = new PdfReader(path); // a stream or byte array also works
byte[] pageContent = pdfReader.GetPageContent(pageNum); // page numbers start at 1, not 0
byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
string textFromPage = Encoding.UTF8.GetString(utf8); // raw content stream: text plus PDF operators

None of the other answers were useful to me; they all seem to target the AGPL v5 of iTextSharp. I could not find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.

Something else that may be very useful in conjunction with this:

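// Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
const string PdfTableFormat = @"\(.*\)Tj";
Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);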

List<string> ExtractPdfContent(string rawPdfContent)
{
    var matches = PdfTableRegex.Matches(rawPdfContent);

    var list = matches.Cast<Match>()
        .Select(m => m.Value
            .Substring(1) //remove leading (
            .Remove(m.Value.Length - 4) //remove trailing )Tj
            .Replace(@"\)", ")") //unencode parens
            .Replace(@"\(", "(")
            .Trim()
        )
        .ToList();
    return list;
}

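This extracts text-only data from the PDF. If the displayed text is Foo(bar), it is encoded in the PDF as (Foo\(bar\))Tj, and this method returns Foo(bar) as expected. The method also strips a lot of additional information, such as positional coordinates, from the raw PDF content.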
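One caveat with the pattern above: `.*` is greedy, so two Tj operators on the same line are captured as a single match. The following is a minimal sketch of a non-greedy variant, not part of the original answer. The names `PdfTextOperators`, `ExtractShownText`, `ShowText`, and `Escaped` are hypothetical, and the sketch assumes literal strings shown with `Tj` only; it does not handle `TJ` arrays, hex strings, or multi-byte font encodings.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class PdfTextOperators
{
    // Non-greedy: stops at the first ")Tj", so each Tj operator is its own match.
    private static readonly Regex ShowText =
        new Regex(@"\((.*?)\)\s*Tj", RegexOptions.Compiled);

    // Matches an escaped (, ) or \ so it can be replaced by the bare character.
    private static readonly Regex Escaped =
        new Regex(@"\\([()\\])", RegexOptions.Compiled);

    public static List<string> ExtractShownText(string rawPdfContent)
    {
        return ShowText.Matches(rawPdfContent)
            .Cast<Match>()
            .Select(m => Escaped.Replace(m.Groups[1].Value, "$1").Trim())
            .ToList();
    }
}
```

Calling `PdfTextOperators.ExtractShownText(@"(Foo\(bar\))Tj")` yields a single entry, `Foo(bar)`, and `(Hello)Tj (World)Tj` on one line yields two entries instead of one merged match.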

Here is a VB.NET solution based on Shravankumar Kumar's solution.

This gives you the text only. Images are a different story.

Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)

    Dim sOut = ""

    For i = 1 To oReader.NumberOfPages
        Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy

        sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
    Next

    oReader.Close()
    Return sOut
End Function

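In my case I only wanted the text from a specific area of the PDF document, so I used a rectangle around that area and extracted the text from it. In the sample below the coordinates cover the entire page. I don't have PDF authoring tools, so when it came time to narrow the rectangle down to the specific location, I guessed at the coordinates a few times until I found the area.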

Rectangle _pdfRect = new Rectangle(0f, 0f, 612f, 792f); // Entire US Letter page. PDF origin (0,0) is the bottom-left corner; 72 points per inch
RenderFilter _filter = new RegionTextRenderFilter(_pdfRect);
ITextExtractionStrategy _strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), _filter);
string _text = PdfTextExtractor.GetTextFromPage(_pdfReader, 1, _strategy); // _pdfReader: an already opened PdfReader

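As noted in the comments above, the resulting text does not keep any of the formatting found in the PDF document, but I was happy that it preserved the carriage returns. In my case, there were enough constants in the text that I could extract the values I needed.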
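Rather than guessing coordinates, the rectangle can be computed from measurements taken from the top-left corner of the page, as you would read them off a printed copy. This is a minimal sketch of that arithmetic only, assuming the 72 points per inch and bottom-left origin mentioned in the code comment above. The `PdfRegion` class and `FromTopLeftInches` method are hypothetical names and are not part of iTextSharp.

```csharp
// Hypothetical helper, named here for illustration only.
public static class PdfRegion
{
    public const float PointsPerInch = 72f;

    // Converts a box measured in inches from the page's top-left corner into
    // { llx, lly, urx, ury } in PDF points (origin at the bottom-left corner).
    public static float[] FromTopLeftInches(
        float pageHeightPoints, float left, float top, float width, float height)
    {
        float llx = left * PointsPerInch;
        float ury = pageHeightPoints - top * PointsPerInch;
        float urx = llx + width * PointsPerInch;
        float lly = ury - height * PointsPerInch;
        return new[] { llx, lly, urx, ury };
    }
}
```

For a box 3 inches wide and 2 inches tall, placed 1 inch from the left and top edges of a 792-point-high page, `PdfRegion.FromTopLeftInches(792f, 1f, 1f, 3f, 2f)` gives `{ 72, 576, 288, 720 }`. Those four values can then be passed in that order to `new Rectangle(llx, lly, urx, ury)`.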

Here is an improved version of Shravankumar Kumar's answer. I created special classes for the pages so that you can access the words in the PDF by text row and by position within that row.

using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

//create a list of pdf pages
var pages = new List<PdfPage>();

//load the pdf into the reader. NOTE: path can also be replaced with a byte array
using (PdfReader reader = new PdfReader(path))
{
    //loop all the pages and extract the text
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        pages.Add(new PdfPage()
        {
           content = PdfTextExtractor.GetTextFromPage(reader, i)
        });
    }
}

//use linq to create the rows and words by splitting on newline and space
pages.ForEach(x => x.rows = x.content.Split('\n').Select(y => 
    new PdfRow() { 
       content = y,
       words = y.Split(' ').ToList()
    }
).ToList());

The custom classes:

class PdfPage
{
    public string content { get; set; }
    public List<PdfRow> rows { get; set; }
}


class PdfRow
{
    public string content { get; set; }
    public List<string> words { get; set; }
}

Now you can get a word by row index and word index.

string myWord = pages[0].rows[12].words[4];

Or use LINQ to find the rows that contain a specific word.

//find the rows in a specific page containing a word
var myRows = pages[0].rows.Where(x => x.words.Any(y => y == "myWord1")).ToList();

//find the rows in all pages containing a word
var myRowsAllPages = pages.SelectMany(r => r.rows).Where(x => x.words.Any(y => y == "myWord2")).ToList();

' Writes the contents of a text file into a new PDF (the reverse of extraction).
Public Sub PDFTxtToPdf(ByVal sTxtfile As String, ByVal sPDFSourcefile As String)
    Using sr As New StreamReader(sTxtfile)
        Dim doc As New Document()
        PdfWriter.GetInstance(doc, New FileStream(sPDFSourcefile, FileMode.Create))
        doc.Open()
        doc.Add(New Paragraph(sr.ReadToEnd()))
        doc.Close()
    End Using
End Sub

Source URL: https://stackoverflow.com/questions/2550796/reading-pdf-content-with-itextsharp-dll-in-vb-net-or-c-sharp
