Why does the text "fi" get cut out when I copy from a PDF or print a document?

Question


Tam*_*man 22 windows clipboard

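When I copy from a PDF file in Adobe Reader that contains

Define an operation

I instead see

Dene an operation

when I paste the text. Why is this?

How can I fix this annoying problem?

I have also seen this happen in the past when printing a Microsoft Office Word file to my printer.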

Answer 1

afr*_*ier 16

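This sounds like a font issue. The PDF is probably using the OpenType fi ligature in the word define, and the current font of the destination application is missing that glyph.

I don't know if there's an easy way to make Acrobat decompose the ligature on copy.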
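If the pasted text does contain the ligature as a single character (for example U+FB01, "ﬁ") and only the destination font fails to show it, the text can be repaired after pasting. The following is a minimal sketch under that assumption. The function name `decompose_ligatures` is made up for illustration, and this does not help when the ligature was dropped from the copied text entirely.

```python
# Map the Unicode Latin f-ligature presentation forms back to plain letters.
# Assumption: the clipboard text really contains these code points.
LIGATURE_TABLE = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def decompose_ligatures(text: str) -> str:
    """Replace single-character f-ligatures with their letter sequences."""
    return text.translate(LIGATURE_TABLE)


if __name__ == "__main__":
    pasted = "De\ufb01ne an operation"
    print(decompose_ligatures(pasted))  # Define an operation
```

`unicodedata.normalize("NFKC", text)` also decomposes these ligatures, but it rewrites many other compatibility characters as well, which is why an explicit table is used here.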

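Your printing problem is probably font-related as well. Something is probably allowing the printer to substitute the document's font with its own built-in font, and the printer's version of the font is also missing that particular glyph. You would have to tell Windows to always download fonts to the printer to fix this.

Another possibility when printing: Uniscribe may not be enabled. MS KB 2642020 discusses this and some possible workarounds (namely, using RAW type printing rather than EMF type printing). Although the context is slightly different from your specific problem, the cause may be the same and the same workarounds may apply.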

Answer 2

Joe*_*oey 10

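As the other answer notes, the problem here is ligatures. However, it has nothing at all to do with OpenType. The fundamental problem is that PDF is a pre-print format that concerns itself little with content and semantics, and is instead geared towards faithfully representing a page as it would be printed.

Text is laid out not as text but as runs of glyphs from a font at certain positions. So you get something like »Place glyph number 72 there, glyph number 101 there, glyph number 108 there, ...«. At that level there is fundamentally no notion of text at all. It is just a description of how the page looks. There are two problems with extracting meaning from a bunch of glyphs: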

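1. The spatial layout. Since the PDF already contains specific information about where to place each glyph, there is no actual text underlying it as there normally would be. Another side effect is that there are no spaces. Sure, there are if you look at the text, but not in the PDF. Why emit a blank glyph when you could just emit none at all? The result is the same, after all. So PDF readers have to carefully piece the text together again, inserting a space whenever they encounter a larger gap between glyphs.
2. PDF renders glyphs, not text. Most of the time the glyph IDs correspond to Unicode code points, or at least to ASCII codes, in the embedded fonts, which means you can often get ASCII or Latin-1 text back well enough, depending on who created the PDF in the first place (some garble everything in the process). But often even PDFs that let you get ASCII text out just fine will mangle everything that is not ASCII. This is especially horrible with complex scripts such as Arabic, which contain only ligatures and alternate glyphs after the layout stage, which means that Arabic PDFs almost never contain actual text.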
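To make the two problems concrete, here is a rough sketch of what a reader has to do for a single line of text. Everything in it is an assumption for illustration: the `Glyph` class, the `extract_text` function, the glyph IDs, and the gap threshold are invented, and real PDF readers are far more involved.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    glyph_id: int   # index into the embedded font, not a character
    x: float        # left edge of the glyph on the page
    width: float    # horizontal advance of the glyph


def extract_text(glyphs, to_unicode, gap_threshold=2.0):
    """Rebuild one line of text from positioned glyphs.

    A space is guessed wherever the gap between two glyphs exceeds
    gap_threshold. Glyphs without a Unicode mapping are silently dropped.
    """
    pieces = []
    previous_right = None
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if previous_right is not None and glyph.x - previous_right > gap_threshold:
            pieces.append(" ")
        pieces.append(to_unicode.get(glyph.glyph_id, ""))
        previous_right = glyph.x + glyph.width
    return "".join(pieces)


if __name__ == "__main__":
    # Hypothetical glyph IDs; 174 stands for an "fi" ligature glyph.
    line = [
        Glyph(68, 0, 7), Glyph(101, 7, 5), Glyph(174, 12, 6),
        Glyph(110, 18, 6), Glyph(101, 24, 5),
        Glyph(97, 34, 5), Glyph(110, 39, 6),
    ]
    mapping = {68: "D", 101: "e", 110: "n", 97: "a"}
    print(extract_text(line, mapping))                   # Dene an
    print(extract_text(line, {**mapping, 174: "fi"}))    # Define an
```

The first call shows how a ligature glyph with no mapping simply vanishes from the copied text, while the second shows the same glyphs with a mapping supplied.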

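The second problem is the one you are facing. A common culprit here is LaTeX, which uses an estimated 238982375 different fonts (each restricted to 256 glyphs) to produce its output. Different fonts for normal text, math (which uses more than one), and so on make things very difficult, especially since Metafont predates Unicode by almost two decades, so there never was a Unicode mapping. Umlauts are also rendered as a diaeresis superimposed on a letter, so for example you get »¨a« instead of »ä« when copying from a PDF (and of course you cannot search for it either).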
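For the umlaut case, copied text can sometimes be repaired afterwards. This is only a sketch and rests on an assumption: that the extractor emits the spacing diaeresis U+00A8 immediately before the base letter, as in the »¨a« example. The function name `repair_diaeresis` is made up.

```python
import re
import unicodedata

# Spacing diaeresis (U+00A8) followed by a basic Latin letter.
_DIAERESIS_BEFORE_LETTER = re.compile("\u00a8([A-Za-z])")


def repair_diaeresis(text: str) -> str:
    """Turn a diaeresis followed by a letter into one composed character."""
    def compose(match):
        # Letter plus combining diaeresis (U+0308), composed by NFC
        # where a precomposed character exists.
        return unicodedata.normalize("NFC", match.group(1) + "\u0308")
    return _DIAERESIS_BEFORE_LETTER.sub(compose, text)


if __name__ == "__main__":
    print(repair_diaeresis("M\u00a8adchen"))  # Mädchen
```

Letters with no precomposed form keep the combining mark after them, which is still valid Unicode.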

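Applications producing PDFs can opt to include the actual text as metadata. If they don't, you are at the mercy of how the embedded fonts are handled and whether the PDF reader can piece the original text together again. But »fi« being copied as blank or not at all is usually a sign of a LaTeX PDF. You should paint Unicode characters on stones and throw them at the producer, hoping they will switch to XeLaTeX and thus finally arrive in the 1990s of character encodings and font standards.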

Answer 3

Jan*_*gen 9

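You can replace most of these "broken" words with the originals. You can safely replace a word if:

- like dene or rey, it is not a real word
- like define or firefly, there is exactly one way to re-add ligature sequences (ff, fi, fl, ffi, or ffl) and make a real word

Most ligature problems fit these criteria. However, you cannot replace:

- us, because it is a real word, even though it might originally have been fluffs
  - also affirm, butterfly, fielders, fortifies, flimflam, misfits...
- cus, because it could become either cuffs or ficus
  - also stiffed/stifled, rifle/riffle, flung/fluffing...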
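The criteria above can be tried out on a toy word list. This is a simplified sketch, not the script from this answer: the names `candidate_fixes` and `safe_fix` are invented, the dictionary is a tiny made-up set, and ligature sequences are re-inserted naively at every position (up to two insertions).

```python
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")


def candidate_fixes(broken, dictionary, max_ligatures=2):
    """Return dictionary words reachable by re-inserting ligature sequences."""
    found = set()
    frontier = {broken}
    for _ in range(max_ligatures):
        next_frontier = set()
        for partial in frontier:
            for position in range(len(partial) + 1):
                for ligature in LIGATURES:
                    candidate = partial[:position] + ligature + partial[position:]
                    next_frontier.add(candidate)
                    if candidate in dictionary:
                        found.add(candidate)
        frontier = next_frontier
    return found


def safe_fix(broken, dictionary):
    """Return the unique fix for a broken word, or None if it is unsafe."""
    if broken in dictionary:
        return None  # already a real word, like "us"
    fixes = candidate_fixes(broken, dictionary)
    if len(fixes) == 1:
        return next(iter(fixes))
    return None  # no fix, or ambiguous like "cus"


if __name__ == "__main__":
    words = {"define", "firefly", "us", "fluffs", "cuffs", "ficus"}
    for broken in ("dene", "rey", "us", "cus"):
        print(broken, "->", safe_fix(broken, words))
```

With this toy dictionary, dene and rey each have one fix, us is rejected because it is itself a word, and cus is rejected because both cuffs and ficus are possible.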

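In this 496-thousand-word English dictionary, there are 16055 words that contain at least one ff, fi, fl, ffi, or ffl, which turn into 15879 words when their ligatures are removed. 173 of the missing words are collisions like cuffs and ficus, and the last 3 are missing because the dictionary contains the words ff, fi, and fl.

790 of these "ligature-removed" words are real words, like us, but 15089 are broken words. 14960 of the broken words can be safely replaced with the original word, which means 99.1% of the broken words are fixable and 93.2% of the original words that contain a ligature can be recovered after copy-pasting from a PDF. 6.8% of the words containing ligature sequences will be lost to collisions (cus) and sub-words (us), unless you choose some way (word or document context?) to pick the best replacement for each word that has no guaranteed replacement.

Below is the Python script I used to generate the statistics above. It expects a dictionary text file with one word per line. At the end, it writes a CSV file that maps fixable broken words to their original words.

Here is a link to download the CSV: http://www.filedropper.com/brokenligaturewordfixes Combine this mapping with something like a regex replacement script to replace most of the broken words.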

    import csv
    import itertools
    import operator
    import re


    dictionary_file_path = 'dictionary.txt'
    broken_word_fixes_file_path = 'broken_word_fixes.csv'
    ligatures = 'ffi', 'ffl', 'ff', 'fi', 'fl'


    with open(dictionary_file_path, 'r') as dictionary_file:
        dictionary_words = list(set(line.strip()
                                    for line in dictionary_file.readlines()))


    broken_word_fixes = {}
    ligature_words = set()
    ligature_removed_words = set()
    broken_words = set()
    multi_ligature_words = set()


    # Find broken word fixes for words with one ligature sequence
    # Example: "dene" --> "define"
    words_and_ligatures = list(itertools.product(dictionary_words, ligatures))
    for i, (word, ligature) in enumerate(words_and_ligatures):
        if i % 50000 == 0:
            print('1-ligature words {percent:.3g}% complete'
                  .format(percent=100 * i / len(words_and_ligatures)))
        for ligature_match in re.finditer(ligature, word):
            if word in ligature_words:
                multi_ligature_words.add(word)
            ligature_words.add(word)
            if word == ligature:
                break
            # Skip words that contain a larger ligature
            if (('ffi' in word and ligature != 'ffi') or
                    ('ffl' in word and ligature != 'ffl')):
                break
            # Replace ligatures with dots to avoid creating new ligatures
            # Example: "offline" --> "of.ine" to avoid creating "fi"
            ligature_removed_word = (word[:ligature_match.start()] +
                                     '.' +
                                     word[ligature_match.end():])
            # Skip words that contain another ligature
            if any(ligature in ligature_removed_word for ligature in ligatures):
                continue
            ligature_removed_word = ligature_removed_word.replace('.', '')
            ligature_removed_words.add(ligature_removed_word)
            if ligature_removed_word not in dictionary_words:
                broken_word = ligature_removed_word
                broken_words.add(broken_word)
                if broken_word not in broken_word_fixes:
                    broken_word_fixes[broken_word] = word
                else:
                    # Ignore broken words with multiple possible fixes
                    # Example: "cus" --> "cuffs" or "ficus"
                    broken_word_fixes[broken_word] = None


    # Find broken word fixes for word with multiple ligature sequences
    # Example: "rey" --> "firefly"
    multi_ligature_words = sorted(multi_ligature_words)
    numbers_of_ligatures_in_word = 2, 3
    for number_of_ligatures_in_word in numbers_of_ligatures_in_word:
        ligature_lists = itertools.combinations_with_replacement(
            ligatures, r=number_of_ligatures_in_word
        )
        words_and_ligature_lists = list(itertools.product(
            multi_ligature_words, ligature_lists
        ))
        for i, (word, ligature_list) in enumerate(words_and_ligature_lists):
            if i % 1000 == 0:
                print('{n}-ligature words {percent:.3g}% complete'
                      .format(n=number_of_ligatures_in_word,
                              percent=100 * i / len(words_and_ligature_lists)))
            # Skip words that contain a larger ligature
            if (('ffi' in word and 'ffi' not in ligature_list) or
                    ('ffl' in word and 'ffl' not in ligature_list)):
                continue
            ligature_removed_word = word
            for ligature in ligature_list:
                ligature_matches = list(re.finditer(ligature, ligature_removed_word))
                if not ligature_matches:
                    break
                ligature_match = ligature_matches[0]
                # Replace ligatures with dots to avoid creating new ligatures
                # Example: "offline" --> "of.ine" to avoid creating "fi"
                ligature_removed_word = (
                    ligature_removed_word[:ligature_match.start()] +
                    '.' +
                    ligature_removed_word[ligature_match.end():]
                )
            else:
                # Skip words that contain another ligature
                if any(ligature in ligature_removed_word for ligature in ligatures):
                    continue
                ligature_removed_word = ligature_removed_word.replace('.', '')
                ligature_removed_words.add(ligature_removed_word)
                if ligature_removed_word not in dictionary_words:
                    broken_word = ligature_removed_word
                    broken_words.add(broken_word)
                    if broken_word not in broken_word_fixes:
                        broken_word_fixes[broken_word] = word
                    else:
                        # Ignore broken words with multiple possible fixes
                        # Example: "ung" --> "flung" or "fluffing"
                        broken_word_fixes[broken_word] = None


    # Remove broken words with multiple possible fixes
    for broken_word, fixed_word in broken_word_fixes.copy().items():
        if not fixed_word:
            broken_word_fixes.pop(broken_word)


    number_of_ligature_words = len(ligature_words)
    number_of_ligature_removed_words = len(ligature_removed_words)
    number_of_broken_words = len(broken_words)
    number_of_fixable_broken_words = len(
        [word for word in set(broken_word_fixes.keys())
         if word and broken_word_fixes[word]]
    )
    number_of_recoverable_ligature_words = len(
        [word for word in set(broken_word_fixes.values())
         if word]
    )
    print(number_of_ligature_words, 'ligature words')
    print(number_of_ligature_removed_words, 'ligature-removed words')
    print(number_of_broken_words, 'broken words')
    print(number_of_fixable_broken_words,
          'fixable broken words ({percent:.3g}% fixable)'
          .format(percent=(
              100 * number_of_fixable_broken_words / number_of_broken_words
          )))
    print(number_of_recoverable_ligature_words,
          'recoverable ligature words ({percent:.3g}% recoverable)'
          '(for at least one broken word)'
          .format(percent=(
              100 * number_of_recoverable_ligature_words / number_of_ligature_words
          )))


    with open(broken_word_fixes_file_path, 'w+', newline='') as broken_word_fixes_file:
        csv_writer = csv.writer(broken_word_fixes_file)
        sorted_broken_word_fixes = sorted(broken_word_fixes.items(),
                                          key=operator.itemgetter(0))
        for broken_word, fixed_word in sorted_broken_word_fixes:
            csv_writer.writerow([broken_word, fixed_word])
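The answer suggests combining the CSV mapping with a regex replacement script but does not include one. A minimal sketch of what that might look like follows, assuming a two-column CSV (broken word, fixed word) with lowercase entries. The function names `load_fixes` and `fix_broken_words` are hypothetical, and the sample rows are made up rather than taken from the linked file.

```python
import csv
import io
import re


def load_fixes(csv_file):
    """Read (broken word, fixed word) rows into a dict."""
    return {row[0]: row[1] for row in csv.reader(csv_file) if len(row) == 2}


def fix_broken_words(text, fixes):
    """Replace whole words found in the mapping, keeping simple capitalisation."""
    def replace(match):
        word = match.group(0)
        fixed = fixes.get(word.lower())
        if fixed is None:
            return word  # not a known broken word, leave untouched
        if word.isupper():
            return fixed.upper()
        if word[0].isupper():
            return fixed[0].upper() + fixed[1:]
        return fixed
    return re.sub(r"[A-Za-z]+", replace, text)


if __name__ == "__main__":
    # In practice this would be open('broken_word_fixes.csv', newline='').
    sample_csv = io.StringIO("dene,define\nrey,firefly\n")
    fixes = load_fixes(sample_csv)
    print(fix_broken_words("Dene an operation for the rey.", fixes))
    # Define an operation for the firefly.
```

Matching on whole words only is a deliberate choice here: it keeps a broken word such as dene from being rewritten when it merely appears inside a longer, unrelated word.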

Asked:	13 years, 10 months ago
Viewed:	33814 times
Last active:	4 years, 9 months ago