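I am parsing a .pdf using the acrobat.tlb library.
Hyphenated strings are split at the hyphens, with each piece returned on its own line and the hyphens removed.
For example: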
ABC-123-XXX-987
is parsed as:
ABC
123
XXX
987
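If I parse the text with iTextSharp, it returns the whole string as it appears in the file, which is the behavior I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp does not put the highlight in the correct place... hence acrobat.tlb.
I am using this code, from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf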
' filey = "*your full file name including directory here*"
AcroExchApp = CreateObject("AcroExch.App")
AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
' Open the [strfiley] pdf file
AcroExchAVDoc.Open(filey, "")
' Get the PDDoc associated with the open AVDoc
AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
sustext = "accessorizes"
suktext = "accessorises"
' get JavaScript Object
' note jso is related to PDDoc of a PDF
jso = AcroExchPDDoc.GetJSObject
' count
nCount = 0
nCount1 = 0
gbStop = False
bUSCnt = False
bUKCnt = False
' search for the text
If Not jso Is Nothing Then
    ' total number of pages
    nPages = jso.numpages
    ' Go through pages
    For i = 0 To nPages - 1
        ' check each word in a page
        nWords = jso.getPageNumWords(i)
        For j = 0 To nWords - 1
            ' get a word
            word = Trim(CStr(jso.getPageNthWord(i, j)))
            'If VarType(word) = VariantType.String Then
            If word <> "" Then
                ' compare the word with what the user wants
                If Trim(sustext) <> "" Then
                    result = StrComp(word, sustext, vbTextCompare)
                    ' if same
                    If result = 0 Then
                        nCount = nCount + 1
                        If bUSCnt = False Then
                            iUSCnt = iUSCnt + 1
                            bUSCnt = True
                        End If
                    End If
                End If
                If suktext <> "" Then
                    result1 = StrComp(word, suktext, vbTextCompare)
                    ' if same
                    If result1 = 0 Then
                        nCount1 = nCount1 + 1
                        If bUKCnt = False Then
                            iUKCnt = iUKCnt + 1
                            bUKCnt = True
                        End If
                    End If
                End If
            End If
        Next j
    Next i
    jso = Nothing
End If
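The code does the job of highlighting the text, but the For loop that fills the 'word' variable splits hyphenated strings into their component parts.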
For i = 0 To nPages - 1
    ' check each word in a page
    nWords = jso.getPageNumWords(i)
    For j = 0 To nWords - 1
        ' get a word
        word = Trim(CStr(jso.getPageNthWord(i, j)))
Does anyone know how to keep the whole string intact using acrobat.tlb? My fairly extensive searching has come up empty.
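I can understand that iTextSharp is cumbersome for highlighting text, because you have to draw a rectangle and it gets complicated, but the acrobat.tlb solution has its drawbacks too. It is not free, and few people will use it. A better solution for the rest of us is Spire.Pdf, which is free and easy to use. You can get it as a NuGet package. The code does the following: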
- Opens the .pdf
- Reads the text of each page
- Uses a regular expression to find matches
- Saves them to a list of strings, eliminating duplicates
- For each string in this list, searches the page and highlights the word
Code:
Dim pdf As PdfDocument = New PdfDocument("Path")
Dim pattern As String = "([A-Z,0-9]{3}[-][A-Z,0-9]{3})"
Dim matches As MatchCollection
Dim result As PdfTextFind() = Nothing
Dim content As New StringBuilder()
Dim matchList As New List(Of String)
For Each page As PdfPageBase In pdf.Pages
    'get text from current page only (reset the buffer so earlier pages are not re-scanned)
    content.Clear()
    content.Append(page.ExtractText())
    'find matches
    matches = Regex.Matches(content.ToString, pattern, RegexOptions.None)
    matchList.Clear()
    'Assign each match to a string list.
    For Each match As Match In matches
        matchList.Add(match.Value)
    Next
    'Eliminate duplicates.
    matchList = matchList.Distinct.ToList
    'for each string in list
    For i = 0 To matchList.Count - 1
        'find all occurrences of matchList(i) string in page and highlight it
        result = page.FindText(matchList(i)).Finds
        For Each find As PdfTextFind In result
            find.ApplyHighLight(Color.BlueViolet) 'you can set your color preference
        Next
    Next 'matchList
Next 'page
pdf.SaveToFile("New Path")
pdf.Close()
pdf.Dispose()
I am not very good with regular expressions, so you can substitute your own pattern. Anyway, that is my approach.
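Since the pattern is left open, here is a minimal sketch of one that would match a whole serial such as ABC-123-XXX-987 instead of a single pair of groups. It assumes a serial is two or more three-character groups of uppercase letters or digits joined by single hyphens, which is only a guess from the one example in the question. The module name SerialPatternSketch and the helper FindSerials are hypothetical, and the sketch uses only the .NET Regex and LINQ classes, not Spire.Pdf.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module SerialPatternSketch
    ' Assumption: a serial is two or more 3-character groups of A-Z or 0-9
    ' joined by single hyphens (inferred from the example ABC-123-XXX-987).
    Private Const SerialPattern As String = "\b[A-Z0-9]{3}(?:-[A-Z0-9]{3})+\b"

    ' Hypothetical helper: returns each distinct serial found in the input text,
    ' in order of first appearance.
    Function FindSerials(input As String) As List(Of String)
        Return Regex.Matches(input, SerialPattern).
            Cast(Of Match)().
            Select(Function(m) m.Value).
            Distinct().
            ToList()
    End Function

    Sub Main()
        Dim sample As String = "Unit ABC-123-XXX-987 replaces ABC-123-XXX-987 and QQ1-77Z."
        For Each serial As String In FindSerials(sample)
            Console.WriteLine(serial)
        Next
    End Sub
End Module
```

If the assumption about the serial format holds, each string returned by FindSerials could be passed to page.FindText in place of matchList(i) in the loop above. Serials with groups of other lengths, or with lowercase letters, would need a different pattern.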