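I have a program that parses PDF documents to extract information. When a new version of a PDF is released, the authors use bold or italic text to indicate new information, and strike-through or underlined text to indicate omitted text. Using the basic stripper class in PDFBox returns all of the text, but the formatting is removed, so I cannot tell whether a piece of text is new or omitted. I am currently using the project's example code below: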

    Dim doc As PDDocument = Nothing
    Try
        doc = PDDocument.load(RFPFilePath)
        Dim stripper As New PDFTextStripper()
        stripper.setAddMoreFormatting(True)
        stripper.setSortByPosition(True)
        rtxt_DocumentViewer.Text = stripper.getText(doc)
    Finally
        If doc IsNot Nothing Then
            doc.close()
        End If
    End Try

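If I simply copy the text from the PDF and paste it into a RichTextBox, which preserves the formatting, my parsing code works fine. I was planning to do this programmatically by opening the PDF, selecting all, copying, closing the document, and then pasting into my RichTextBox, but that seems clunky.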
A custom text stripper
As already mentioned in the comments, the bold and italic effects in the OP's sample document are generated by using a different font (containing bold or italic versions of the letters) to draw the text. The underline and strike-through effects in the sample document are generated by drawing a rectangle under or through the text line; the rectangle has the width of the text line and a very small height.

To extract this information, therefore, one has to extend PDFTextStripper so that it reacts to font changes and to rectangles near the text.
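The geometric part of this idea can be sketched independently of PDFBox. The following is a minimal sketch, assuming horizontal text on an unrotated page with the y axis pointing upwards. The class `LineDecorationClassifier`, its parameter names, the thickness limit, and the 10% horizontal tolerance are all hypothetical choices made for illustration, not part of PDFBox or of the answer's code. It decides whether a thin rectangle near a glyph looks like an underline or a strike-through, and it normalizes the corners first so that the order in which they were given does not matter.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not part of PDFBox): classifies thin rectangles drawn near text. */
final class LineDecorationClassifier {
    enum Decoration { NONE, UNDERLINE, STRIKE_THROUGH }

    /** Assumed limit: rectangles at least this tall are not treated as lines. */
    static final double MAX_LINE_THICKNESS = 2.0;

    /** Builds a rectangle from two opposite corners given in any order. */
    static Rectangle2D fromCorners(double x1, double y1, double x2, double y2) {
        Rectangle2D rect = new Rectangle2D.Double();
        rect.setFrameFromDiagonal(x1, y1, x2, y2); // normalizes the corner order
        return rect;
    }

    /**
     * @param rect       rectangle in the same coordinate system as the glyph
     * @param glyphX     left edge of the glyph
     * @param baselineY  y coordinate of the glyph's baseline
     * @param glyphWidth advance width of the glyph
     * @param ascent     height of tall glyphs above the baseline (positive)
     * @param descent    depth of descenders below the baseline (positive)
     */
    static Decoration classify(Rectangle2D rect, double glyphX, double baselineY,
                               double glyphWidth, double ascent, double descent) {
        // Only thin rectangles can be lines.
        if (rect.getHeight() >= MAX_LINE_THICKNESS) {
            return Decoration.NONE;
        }
        // The rectangle has to cover the glyph horizontally (with a small tolerance).
        double tolerance = glyphWidth * 0.1;
        if (rect.getMinX() > glyphX + tolerance
                || rect.getMaxX() < glyphX + glyphWidth - tolerance) {
            return Decoration.NONE;
        }
        // Position of the rectangle's lower edge relative to the baseline.
        double offset = rect.getMinY() - baselineY;
        if (offset > 0 && offset <= ascent) {
            return Decoration.STRIKE_THROUGH;
        }
        if (offset <= 0 && offset >= -2 * descent) {
            return Decoration.UNDERLINE;
        }
        return Decoration.NONE;
    }
}
```

In a real stripper the glyph metrics would come from the text position and its font, and rotated text or rotated pages would need the coordinates to be transformed into a common system first.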
Here is an example class extending PDFTextStripper that does just that:

    // Imports assume PDFBox 1.8.x, whose API this class uses
    // (getBaseFont, getTextPos, PDFOperator, registerOperatorProcessor).
    import java.awt.geom.Point2D;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSNumber;
    import org.apache.pdfbox.util.Matrix;
    import org.apache.pdfbox.util.PDFOperator;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;
    import org.apache.pdfbox.util.operator.OperatorProcessor;

    public class PDFStyledTextStripper extends PDFTextStripper {
        public PDFStyledTextStripper() throws IOException {
            super();
            // Collect every rectangle drawn with the "re" operator.
            registerOperatorProcessor("re", new AppendRectangleToPath());
        }

        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
            for (TextPosition textPosition : textPositions) {
                Set<String> style = determineStyle(textPosition);
                // Emit a style marker whenever the style changes.
                if (!style.equals(currentStyle)) {
                    output.write(style.toString());
                    currentStyle = style;
                }
                output.write(textPosition.getCharacter());
            }
        }

        Set<String> determineStyle(TextPosition textPosition) {
            Set<String> result = new HashSet<>();

            // Bold and italic are recognized by the font name only.
            if (textPosition.getFont().getBaseFont().toLowerCase().contains("bold"))
                result.add("Bold");
            if (textPosition.getFont().getBaseFont().toLowerCase().contains("italic"))
                result.add("Italic");

            // Underline and strike-through are recognized by nearby thin rectangles.
            if (rectangles.stream().anyMatch(r -> r.underlines(textPosition)))
                result.add("Underline");
            if (rectangles.stream().anyMatch(r -> r.strikesThrough(textPosition)))
                result.add("StrikeThrough");

            return result;
        }

        class AppendRectangleToPath extends OperatorProcessor {
            @Override
            public void process(PDFOperator operator, List<COSBase> arguments) {
                COSNumber x = (COSNumber) arguments.get(0);
                COSNumber y = (COSNumber) arguments.get(1);
                COSNumber w = (COSNumber) arguments.get(2);
                COSNumber h = (COSNumber) arguments.get(3);

                double x1 = x.doubleValue();
                double y1 = y.doubleValue();

                // create a pair of coordinates for the transformation
                double x2 = w.doubleValue() + x1;
                double y2 = h.doubleValue() + y1;

                Point2D p0 = transformedPoint(x1, y1);
                Point2D p1 = transformedPoint(x2, y1);
                Point2D p2 = transformedPoint(x2, y2);
                Point2D p3 = transformedPoint(x1, y2);

                rectangles.add(new TransformedRectangle(p0, p1, p2, p3));
            }

            Point2D.Double transformedPoint(double x, double y) {
                double[] position = {x, y};
                getGraphicsState().getCurrentTransformationMatrix().createAffineTransform()
                        .transform(position, 0, position, 0, 1);
                return new Point2D.Double(position[0], position[1]);
            }
        }

        static class TransformedRectangle {
            public TransformedRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
                this.p0 = p0;
                this.p1 = p1;
                this.p2 = p2;
                this.p3 = p3;
            }

            boolean strikesThrough(TextPosition textPosition) {
                Matrix matrix = textPosition.getTextPos();
                // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
                // and horizontal rectangular strikeThroughs with p0 at the left bottom and p2 at the right top

                // Check if rectangle horizontally matches (at least) the text
                if (p0.getX() > matrix.getXPosition()
                        || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                    return false;
                // Check whether rectangle vertically is at the right height to strike through
                double vertDiff = p0.getY() - matrix.getYPosition();
                if (vertDiff < 0
                        || vertDiff > textPosition.getFont().getFontDescriptor().getAscent() * textPosition.getFontSizeInPt() / 1000.0)
                    return false;
                // Check whether rectangle is small enough to be a line
                return Math.abs(p2.getY() - p0.getY()) < 2;
            }

            boolean underlines(TextPosition textPosition) {
                Matrix matrix = textPosition.getTextPos();
                // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
                // and horizontal rectangular underlines with p0 at the left bottom and p2 at the right top

                // Check if rectangle horizontally matches (at least) the text
                if (p0.getX() > matrix.getXPosition()
                        || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                    return false;
                // Check whether rectangle vertically is at the right height to underline
                double vertDiff = p0.getY() - matrix.getYPosition();
                if (vertDiff > 0
                        || vertDiff < textPosition.getFont().getFontDescriptor().getDescent() * textPosition.getFontSizeInPt() / 500.0)
                    return false;
                // Check whether rectangle is small enough to be a line
                return Math.abs(p2.getY() - p0.getY()) < 2;
            }

            final Point2D p0, p1, p2, p3;
        }

        final List<TransformedRectangle> rectangles = new ArrayList<>();
        Set<String> currentStyle = Collections.singleton("Undefined");
    }

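In addition to what PDFTextStripper does, this class also

- collects rectangles from the content (defined using the re instruction) using an instance of the AppendRectangleToPath operator processor inner class,
- checks text for the style variants from the sample document in determineStyle, and
- whenever the style changes, adds the new style to the result in writeString.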
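The last step, emitting a marker only when the style changes, can be illustrated without any PDF library. This is a minimal sketch under stated assumptions: the class `StyleChangeMarker` and its `mark` method are hypothetical, the input is simply one style set per character, and the styles are sorted before printing so that the marker text is deterministic (the class above prints a `HashSet` directly).

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of the marker logic; not part of PDFBox. */
final class StyleChangeMarker {
    /**
     * Returns the text with a marker such as "[Bold]" or "[]" inserted
     * in front of every character whose style differs from the previous one.
     */
    static String mark(String text, List<Set<String>> stylePerChar) {
        if (text.length() != stylePerChar.size()) {
            throw new IllegalArgumentException("need exactly one style set per character");
        }
        StringBuilder out = new StringBuilder();
        Set<String> current = null; // nothing emitted yet, so the first style is always written
        for (int i = 0; i < text.length(); i++) {
            Set<String> style = stylePerChar.get(i);
            if (!style.equals(current)) {
                out.append(new TreeSet<>(style)); // sorted, e.g. "[Bold, Italic]"
                current = style;
            }
            out.append(text.charAt(i));
        }
        return out.toString();
    }
}
```

Starting with no current style plays the same role as the "Undefined" initial value in the class above: the very first character always gets a marker, even if it is plain text.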
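Beware: this is a mere proof of concept! In particular

- the tests in TransformedRectangle.underlines(TextPosition) and TransformedRectangle.strikesThrough(TextPosition) are implemented very simplistically and only work for horizontal text without page rotation and for horizontal rectangular strike-throughs and underlines with p0 at the bottom left and p2 at the top right;
- all rectangles are collected, without checking whether they are actually filled with a visible color;
- the tests for "bold" and "italic" merely inspect the name of the font used, which generally is not sufficient.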
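As one possible way to make the font test a little less fragile, the name check can be combined with information from the font descriptor. The following is a minimal sketch, not the answer's implementation: the class `FontStyleHeuristics` and its parameters are hypothetical, and the caller is assumed to have read the font name, the descriptor's Flags value, and its FontWeight value from the PDF already. The flag positions (Italic is bit 7, ForceBold is bit 19) and the convention that a FontWeight of 700 means bold come from the PDF specification. It also strips the six-letter subset tag so that a tag such as `BOLDXY+` is not mistaken for a style.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical heuristic combining font name, descriptor flags and weight. */
final class FontStyleHeuristics {
    // Font descriptor flag bits as numbered in the PDF specification.
    static final int FLAG_ITALIC = 1 << 6;      // bit 7
    static final int FLAG_FORCE_BOLD = 1 << 18; // bit 19

    static Set<String> determine(String fontName, int descriptorFlags, float fontWeight) {
        String name = stripSubsetTag(fontName).toLowerCase(Locale.ROOT);
        Set<String> result = new LinkedHashSet<>();
        boolean bold = name.contains("bold")
                || fontWeight >= 700
                || (descriptorFlags & FLAG_FORCE_BOLD) != 0;
        boolean italic = name.contains("italic")
                || name.contains("oblique")
                || (descriptorFlags & FLAG_ITALIC) != 0;
        if (bold) result.add("Bold");
        if (italic) result.add("Italic");
        return result;
    }

    /** Removes a subset tag such as "ABCDEF+" from names like "ABCDEF+Arial-BoldMT". */
    static String stripSubsetTag(String fontName) {
        if (fontName == null) {
            return "";
        }
        return fontName.matches("[A-Z]{6}\\+.*") ? fontName.substring(7) : fontName;
    }
}
```

This remains a heuristic. Text can also be made to look bold by other means, for example by stroking the glyph outlines in addition to filling them, and neither the name nor the descriptor reveals that.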
Test output
Using the PDFStyledTextStripper like this

    String extractStyled(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFStyledTextStripper();
        stripper.setSortByPosition(true);
        return stripper.getText(document);
    }

(from ExtractText.java, called from the test method testExtractStyledFromExampleDocument)
one gets the result

    []This is an example of plain text
    [Bold]This is an example of bold text
    []
    [Underline]This is an example of underlined text[]
    [Italic]This is an example of italic text
    []
    [StrikeThrough]This is an example of strike through text[]
    [Italic, Bold]This is an example of bold, italic text

for the OP's sample document.
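To get from this marked-up text back to the question's goal of telling new text from omitted text, the markers can be split into styled runs again. The following is a minimal sketch under stated assumptions: the class `StyledRunParser` and its `Run` type are hypothetical, the marker syntax is assumed to be exactly the `[Style, Style]` form shown above, and literal square brackets in the document text would be misread as markers. The mapping of bold or italic to "new" and of underlined or struck-through to "omitted" follows the convention described in the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the "[Style]text" output format shown above. */
final class StyledRunParser {
    static final class Run {
        final Set<String> styles;
        final String text;

        Run(Set<String> styles, String text) {
            this.styles = styles;
            this.text = text;
        }

        /** Bold or italic text marks new information (convention from the question). */
        boolean isNew() {
            return styles.contains("Bold") || styles.contains("Italic");
        }

        /** Underlined or struck-through text marks omitted information. */
        boolean isOmitted() {
            return styles.contains("Underline") || styles.contains("StrikeThrough");
        }
    }

    // Matches "[]" as well as "[Bold]" or "[Italic, Bold]".
    private static final Pattern MARKER =
            Pattern.compile("\\[([A-Za-z]+(?:, ?[A-Za-z]+)*)?\\]");

    static List<Run> parse(String marked) {
        List<Run> runs = new ArrayList<>();
        Matcher matcher = MARKER.matcher(marked);
        Set<String> current = Collections.emptySet();
        int last = 0;
        while (matcher.find()) {
            addRun(runs, current, marked.substring(last, matcher.start()));
            current = new TreeSet<>();
            if (matcher.group(1) != null) {
                for (String style : matcher.group(1).split(", ?")) {
                    current.add(style);
                }
            }
            last = matcher.end();
        }
        addRun(runs, current, marked.substring(last));
        return runs;
    }

    // Whitespace-only stretches between markers are dropped.
    private static void addRun(List<Run> runs, Set<String> styles, String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            runs.add(new Run(styles, trimmed));
        }
    }
}
```

A more robust variant would keep the styled runs as objects inside the stripper instead of serializing them to text and parsing them again; the text round trip is used here only because it matches the output shown above.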
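PS: In the meantime the code of the PDFStyledTextStripper, in particular that of its inner class TransformedRectangle, has been changed slightly so that it also works for a sample document shared in a GitHub issue, cf. here.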