How can I split an RTF file?

wer*_*tyk 2 .net parsing rtf

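I want to split an RTF file (using C# or VB.Net) into 2 or more parts at the string [BreakPage]. I have this file, which contains one [BreakPage] and needs to be split into two parts: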

{\rtf1\ansi\ansicpg1251\uc1\deff0\stshfdbch0\stshfloch0\stshfhich0\stshfbi0\deflang1049\deflangfe1049{\fonttbl{\f0\froman\fcharset204\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f38\froman\fcharset0\fprq2 Times New Roman;}{\f36\froman\fcharset238\fprq2 Times New Roman CE;}{\f39\froman\fcharset161\fprq2 Times New Roman Greek;}{\f40\froman\fcharset162\fprq2 Times New Roman Tur;}{\f41\froman\fcharset177\fprq2 Times New Roman (Hebrew);}{\f42\froman\fcharset178\fprq2 Times New Roman (Arabic);}{\f43\froman\fcharset186\fprq2 Times New Roman Baltic;}{\f44\froman\fcharset163\fprq2 Times New Roman (Vietnamese);}}{\colortbl;\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;\red255\green255\blue0;\red255\green255\blue255;\red0\green0\blue128;\red0\green128\blue128;\red0\green128\blue0;\red128\green0\blue128;\red128\green0\blue0;\red128\green128\blue0;\red128\green128\blue128;\red192\green192\blue192;}{\stylesheet{\ql\li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0\fs24\lang1049\langfe1049\cgrid\langnp1049\langfenp1049\snext0 Normal;}{\*\cs10\additive\ssemihidden Default Paragraph Font;}{\*\ts11\tsrowd\trftsWidthB3\trpaddl108\trpaddr108\trpaddfl3\trpaddft3\trpaddfb3\trpaddfr3\trcbpat1\trcfpat1\tscellwidthfts0\tsvertalt\tsbrdrt\tsbrdrl\tsbrdrb\tsbrdrr\tsbrdrdgl\tsbrdrdgr\tsbrdrh\tsbrdrv\ql\li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0\fs20\lang1024\langfe1024\cgrid\langnp1024\langfenp1024\snext11\ssemihidden Normal Table;}}{\*\latentstyles\lsdstimax156\lsdlockeddef0}{\*\rsidtbl\rsid2111663\rsid7154806\rsid15558346}{\*\generator Microsoft Word 11.0.5604;}{\info{\author Programmer}{\operator Programmer}{\creatim\yr2011\mo8\dy2\hr12\min45}{\revtim\yr2011\mo8\dy5\hr12\min34}{\version3}{\edmins1}{\nofpages1}{\nofwords5}{\nofchars34}{\nofcharsws38}{\vern24689}}\margl1701\margr850\margt1134\margb1134\widowctrl\ftnbj\aenddoc\noxlattoyen\expshrtn
\noultrlspc\dntblnsbdb\nospaceforul\hyphcaps0\horzdoc\dghspace120\dgvspace120\dghorigin1701\dgvorigin1984\dghshow0\dgvshow3\jcompress\viewkind1\viewscale100\nolnhtadjtbl\rsidroot15558346\fet0\sectd\linex0\sectdefaultcl\sftnbj{\*\pnseclvl1\pnucrm\pnstart1\pnindent720\pnhang{\pntxta.}}{\*\pnseclvl2\pnucltr\pnstart1\pnindent720\pnhang{\pntxta.}}{\*\pnseclvl3\pndec\pnstart1\pnindent720\pnhang{\pntxta.}}{\*\pnseclvl4\pnlcltr\pnstart1\pnindent720\pnhang{\pntxta)}}{\*\pnseclvl5\pndec\pnstart1\pnindent720\pnhang{\pntxtb(}{\pntxta)}}{\*\pnseclvl6\pnlcltr\pnstart1\pnindent720\pnhang{\pntxtb(}{\pntxta)}}{\*\pnseclvl7\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb(}{\pntxta)}}{\*\pnseclvl8\pnlcltr\pnstart1\pnindent720\pnhang{\pntxtb(}{\pntxta)}}{\*\pnseclvl9\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb(}{\pntxta)}}\pard\plain\ql\li0\ri0\nowidctlpar\faauto\rin0\lin0\itap0\fs24\lang1049\langfe1049\cgrid\langnp1049\langfenp1049{\b\insrsid7154806\charrsid7154806 Line 1\par}{\insrsid7154806\par}{\i\insrsid7154806\charrsid7154806 Line 3}{\lang1048\langfe1049\langnp1048\insrsid7154806\par}{\lang1048\langfe1049\langnp1048\insrsid2111663 [BreakPage]\par}{\insrsid7154806 Line4\par\par Line5\par}}

Can anyone help me?

Thanks!

Kon*_*lph 5

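The problem is that RTF keeps some (but not necessarily all) of its formatting information in a global header. To split RTF text so that each result is itself a validly formatted RTF document, you essentially need to know where the header information is and replicate it in every part.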
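To see why a plain string split is not enough, here is a minimal sketch in C#. `NaiveSplitDemo`, `NaiveSplit` and `BraceBalance` are hypothetical names made up for this illustration, and the brace counter is deliberately simplistic (it skips the character after a backslash but knows nothing about `\bin` data or anything else a real parser handles).

```csharp
using System;

static class NaiveSplitDemo
{
    // Cuts the raw RTF source at the marker, the way a plain string split would.
    public static string[] NaiveSplit(string rtf, string marker)
    {
        return rtf.Split(new[] { marker }, StringSplitOptions.None);
    }

    // Returns (number of '{') minus (number of '}') in the fragment,
    // ignoring escaped characters such as \{ and \}.
    public static int BraceBalance(string fragment)
    {
        int depth = 0;
        for (int i = 0; i < fragment.Length; i++)
        {
            char c = fragment[i];
            if (c == '\\') { i++; continue; }  // skip the character after a backslash
            if (c == '{') depth++;
            else if (c == '}') depth--;
        }
        return depth;
    }
}
```

For the small input `{\rtf1\ansi{\fonttbl{\f0 Times New Roman;}}\f0 Line1\par [BreakPage]\par Line4\par}`, the first fragment has a balance of 1 (its outer group is never closed) and the second has a balance of -1 and no `{\rtf1` header or font table at all, so neither fragment is a valid RTF document on its own.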

There are two ways to do this:

  1. Write an RTF parser
  2. Use an existing RTF parser

(1) is feasible, but it takes time. Fortunately, RTF parsers already exist, for example this one on CodeProject.

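Alternatively, you can load the RTF text into a RichTextBox, search the RichTextBox for the split text "[BreakPage]", programmatically select the first and second parts, and retrieve the RTF text of each using the SelectedRtf property.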
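A rough sketch of the RichTextBox approach in C# might look like the following. `RtfSplitter` and `SplitRtf` are hypothetical names, and the sketch assumes a Windows Forms project, a marker that appears as plain visible text, and a call from an STA thread; it has not been tuned for large documents.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

static class RtfSplitter
{
    // Returns one RTF document per part; the marker text itself is dropped.
    public static List<string> SplitRtf(string rtf, string marker)
    {
        if (string.IsNullOrEmpty(marker))
            throw new ArgumentException("marker must be non-empty", nameof(marker));

        var parts = new List<string>();
        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;  // throws ArgumentException if the RTF is invalid
            int start = 0;
            while (true)
            {
                // Only search while there is text left after 'start';
                // an empty search range is not a meaningful request.
                int hit = start < box.TextLength
                    ? box.Find(marker, start,
                               RichTextBoxFinds.MatchCase | RichTextBoxFinds.NoHighlight)
                    : -1;

                int end = hit < 0 ? box.TextLength : hit;
                box.Select(start, end - start);
                parts.Add(box.SelectedRtf);  // the control emits a full header for each part

                if (hit < 0) break;
                start = hit + marker.Length;
            }
        }
        return parts;
    }
}
```

Calling `RtfSplitter.SplitRtf(File.ReadAllText(path), "[BreakPage]")` on the file from the question should give two strings that each begin with `{\rtf`. Note that the paragraph break following the marker stays at the start of the second part, so you may want to trim it, and the header the control writes is its own, not a byte-for-byte copy of the Word header.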