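I wrote a Perl script that uses Win32::OLE to read the contents of a Microsoft Word document.
My problem is with documents that contain numbered lists (starting with 1, 2, 3, …). My Perl script cannot get the numbers; I only get the text content, not the numbering.
Please suggest how to convert a numbered list to plain text so that both the numbering and the text are preserved.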
Solution
My blog post, Extract bullet lists from PowerPoint slides using Perl and Win32::OLE, shows how to do this with PowerPoint. The task turns out to be a bit simpler for Word.
    #!/usr/bin/env perl

    use strict;
    use warnings;
    use feature 'say';

    use Carp qw( croak );
    use Const::Fast;
    use Path::Class;
    use Try::Tiny;
    use Win32::OLE;
    use Win32::OLE::Const ('Microsoft.Word');
    use Win32::OLE::Enum;

    $Win32::OLE::Warn = 3;

    run(@ARGV);

    sub run {
        my $docfile = shift;
        # Croaks if it cannot resolve
        $docfile = file($docfile)->absolute->resolve;

        my $word = get_word();
        my $doc = $word->Documents->Open(
            {
                FileName           => "$docfile",
                ConfirmConversions => 0,
                AddToRecentFiles   => 0,
                Revert             => 0,
                ReadOnly           => 1,
            }
        );
        my $pars = Win32::OLE::Enum->new($doc->Paragraphs);

        while (my $par = $pars->Next) {
            print_paragraph($par);
        }
    }

    sub print_paragraph {
        my $par = shift;
        my $range = $par->Range;
        my $fmt = $range->ListFormat;
        my $bullet = $fmt->ListString;
        my $text = $range->Text;

        unless ($bullet) {
            say $text;
            return;
        }

        my $level = $fmt->ListLevelNumber;
        say ">" x $level, join(' ', $bullet, $text);
        return;
    }

    sub get_word {
        my $word;
        try { $word = Win32::OLE->GetActiveObject('Word.Application') }
        catch { croak $_ };
        return $word if $word;

        $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit });
        return $word if $word;

        croak sprintf('Cannot start Word: %s', Win32::OLE->LastError);
    }
Given a Word document that contains nested numbered lists, the script prints:
    This is a document
    >1. This is a numbered list
    >2. Second item in the numbered list
    >3. Third one
    Back to normal paragraph.
    >>a. Another list
    >>b. Yup, here comes the second item
    >>c. Not so sure what to put here
    >>>i. Sub-item
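The Object Browser in Word's VBA editor is indispensable for discovering properties such as ListFormat, ListString, and ListLevelNumber.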
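
The line-formatting step of print_paragraph can also be pulled out into a small pure function, which makes it possible to check without Word running. The following is only a sketch: format_paragraph is a hypothetical helper, not part of the script above. It assumes the three values come from ListFormat->ListString, ListFormat->ListLevelNumber, and Range->Text. It also strips the trailing paragraph mark (a carriage return) that Word includes at the end of a paragraph's Range->Text, and it tests the list string by length rather than truthiness so that a label of "0" would still be printed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): builds one output line
# from the list string, the list level, and the paragraph text.
sub format_paragraph {
    my ($bullet, $level, $text) = @_;

    # A paragraph's Range->Text ends with Word's paragraph mark (CR),
    # so remove trailing CR/LF characters before formatting.
    $text =~ s/[\r\n]+\z//;

    # Paragraphs that are not list items have an empty list string.
    return $text unless defined $bullet && length $bullet;

    # One ">" per list level, then the list label and the text.
    return ('>' x $level) . join(' ', $bullet, $text);
}

# Sample values standing in for what Word would return.
print format_paragraph('1.', 1, "This is a numbered list\r"), "\n";
print format_paragraph('',   0, "Back to normal paragraph.\r"), "\n";
print format_paragraph('i.', 3, "Sub-item\r"), "\n";
```

Inside print_paragraph, the two say calls could then be replaced by a single say format_paragraph($bullet, $level, $text), with $level read from ListLevelNumber only when the paragraph is a list item.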