Reading from a PowerPoint table in Python?

Question

I am using the python-pptx module to automatically update values in a PowerPoint file. I can extract all the text in the file with the following code:

from pptx import Presentation
prs = Presentation(path_to_presentation)
# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []
for slide in prs.slides:
  for shape in slide.shapes:
    if not shape.has_text_frame:
      continue
  for paragraph in shape.text_frame.paragraphs:
    for run in paragraph.runs:
      text_runs.append(run.text)

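This code extracts all the text in the file, but it cannot extract the text inside PowerPoint tables, and I want to update some of those values. I tried to adapt some code from this question: Reading text values in a PowerPoint table using pptx? but could not get it to work. Any ideas? Thanks.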

Answer 1

Ger*_*pez 5

This worked for me:

    def access_table():
        # prs is the Presentation object opened in the question's code
        slide = prs.slides[0]  # first slide
        table = slide.shapes[2].table  # shape index may be anywhere in 0..n
        for r in table.rows:
            s = ""
            for c in r.cells:
                s += c.text_frame.text + " | "
                # to write instead of read:
                # c.text_frame.text = "example"
            print(s)
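
Since the goal in the question is to update values, here is a minimal sketch of the write side. The function names `replace_table_text` and `update_file` and their parameters are made up for this example; only `has_table`, `table`, `rows`, `cells`, `text` and `save` come from python-pptx. It assumes a cell should be replaced only when its whole (stripped) text matches.

```python
def replace_table_text(prs, old, new):
    """Replace every table cell whose stripped text equals `old`; return the count."""
    changed = 0
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_table:
                continue  # not a table, nothing to update
            for row in shape.table.rows:
                for cell in row.cells:
                    if cell.text.strip() == old:
                        # Assigning to .text replaces the whole cell content,
                        # so character formatting inside the cell may be lost.
                        cell.text = new
                        changed += 1
    return changed


def update_file(in_path, out_path, old, new):
    """Open a .pptx file, replace matching table cells, and save a copy."""
    # Imported here so replace_table_text can be tried on stand-in objects
    # even where python-pptx is not installed.
    from pptx import Presentation

    prs = Presentation(in_path)
    count = replace_table_text(prs, old, new)
    prs.save(out_path)
    return count
```

Saving to a different `out_path` keeps the original file untouched while you check the result.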

Answer 2

Too*_*one 5

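How to extract all the text from the tables inside a slide-show presentation

The following code extracts the text from the tables in a slide-show presentation. Text that sits outside of tables is omitted, but you can modify my code to capture the text in non-table objects as well.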

from pptx import Presentation

def get_tables_from_presentation(pres):
   """
   The input parameter `pres` should receive
   an object returned by `pptx.Presentation()`

   EXAMPLE:
       ```
       import pptx
       p = "C:\\Users\\user\\Desktop\\power_point_pres.pptx"
       pres = pptx.Presentation(p)

       tables = get_tables_from_presentation(pres)
       ```
   """
   tables = list()
   for slide in pres.slides:
      for shp in iter(slide.shapes):
         if shp.has_table:
            table = shp.table
            tables.append(table)
   return tables


def iter_to_nonempty_table_cells(tbl):
   """
   :param tbl: 'pptx.table.Table'
          input table is NOT modified

   :return: iterator over the stripped text of each non-empty cell
   """
   for ridx in range(len(tbl.rows)):
      for cidx in range(len(tbl.columns)):
         cell = tbl.cell(ridx, cidx)
         txt = str(cell.text).strip()
         # skip cells that are empty after stripping whitespace
         if txt:
            yield txt


# establish read path
in_file_path = "C:\\Users\\user\\Desktop\\power_point_pres.pptx"

# Open slide-show presentation
pres = Presentation(in_file_path)

# extract tables from slide-show presentation
tables = get_tables_from_presentation(pres)

for tbl in tables:
   it = iter_to_nonempty_table_cells(tbl)
   print("".join(it))
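
If you want to keep the row and column structure instead of one flat stream of cell text, a small variation is possible. This is only a sketch; `table_to_rows` and `format_rows` are names invented here, and the " | " separator is an arbitrary choice.

```python
def table_to_rows(table):
    """Return the table as a list of rows, each a list of stripped cell strings."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def format_rows(rows, sep=" | "):
    """Join the cells of each row with `sep`, one row per line."""
    return "\n".join(sep.join(row) for row in rows)
```

With the `tables` list built above, `print(format_rows(table_to_rows(tables[0])))` would print the first table one row per line.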

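Notes on one of the other answers to this question

Someone else posted a semi-useful answer to this question, written in pseudo-code. They wrote the following: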

For r = 1 to tbl.rows.count
  For c = 1 to tbl.columns.count
     tbl.cell(r,c).Shape.Textframe.Text

The problem is that this is not Python.

In Python, syntax such as `For r = 1 to 10` is illegal. Instead, we can write either of the following:

# option 1: the usual way
for r in range(1, 11):
   print(r)

# option 2: the same numbers using itertools
from itertools import count, takewhile
for r in takewhile(lambda k: k <= 10, count(1)):
   print(r)

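Also, row indices start at r = 0, not r = 1.

The top-left cell of the table is tbl.cell(0,0), not tbl.cell(1,1).

There is no .count attribute on the rows or columns collections, so (For r = 1 to tbl.rows.count) makes no sense: tbl.rows.count does not exist.

tbl.cell(r,c).Shape does not work because objects instantiated from the class pptx.table._Cell have no attribute named Shape.

A cell object has the following attributes: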

fill
is_merge_origin
is_spanned
margin_bottom
margin_left
margin_right
margin_top
merge
part
span_height
span_width
split
text
text_frame
vertical_anchor

The fix looks like this:

# ----------------------------------------
# BEGIN SYNTACTICALLY INCORRECT CODE
# ----------------------------------------
# For r = 1 to tbl.rows.count
#   For c = 1 to tbl.columns.count
#      tbl.cell(r,c).Shape.Textframe.Text
# ----------------------------------------
# END SYNTACTICALLY INCORRECT CODE
# ----------------------------------------
# BEGIN SYNTACTICALLY CORRECT CODE
# ----------------------------------------
# rows and columns have no .count attribute, but len() works on both
for r in range(len(tbl.rows)):
    for c in range(len(tbl.columns)):
        print(tbl.cell(r, c).text)
# ----------------------------------------
# END SYNTACTICALLY CORRECT CODE
# ----------------------------------------

Notes on the original code

The `continue` keyword

In your original source code, you have the following for loop:

for shape in slide.shapes:
    if not shape.has_text_frame:
      continue

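That for loop does nothing.

The keyword `continue` simply means "skip the rest of the loop body and jump back to the start of the loop for the next item". However, there is no code after the `continue` and before the end of the loop body. That is, the loop would move on to the next shape anyway, without you writing `continue`, because the `continue` already sits at the very end of the loop body.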
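
For comparison, here is a sketch of how the loop might look if the intent was for the paragraph loop to run once per shape, so that the `continue` actually guards it. That intent is my assumption, and `collect_text_runs` is a name made up for this example.

```python
def collect_text_runs(prs):
    """Return the text of every run in every text-bearing shape."""
    text_runs = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue  # skip the rest of the body for this shape
            # These loops are inside the shape loop, so the `continue`
            # above decides whether they run for a given shape.
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
    return text_runs
```

Note that this still does not reach into tables, because a table shape has no text frame of its own.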

To learn more about `continue`, consider the following example:

for k in [1, 2, 3, 4, 5]:
    print("For k ==", k, "we have k % 2 == ", k % 2)
    if not k % 2 == 0:
        continue
    print("For k ==", k, "we got past the `continue`")

The output is:

For k == 1 we have k % 2 ==  1
For k == 2 we have k % 2 ==  0
For k == 2 we got past the `continue`
For k == 3 we have k % 2 ==  1
For k == 4 we have k % 2 ==  0
For k == 4 we got past the `continue`
For k == 5 we have k % 2 ==  1

Three pieces of code, written with or without the `continue` keyword, can all print exactly the same messages:

For k == 1 we have k % 2 ==  1
For k == 2 we have k % 2 ==  0
For k == 2 we got past the `continue`
For k == 3 we have k % 2 ==  1
For k == 4 we have k % 2 ==  0
For k == 4 we got past the `continue`
For k == 5 we have k % 2 ==  1
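
The three code variants themselves are not reproduced here, only their shared output. As a hedged illustration, these are three hypothetical loops (the function names are invented for this sketch) that produce the same lines: one with `continue`, one with a plain `if`, and one with a `continue` at the very end of the body, where it has no effect. They return the lines instead of printing them so the results are easy to compare.

```python
def _first(k):
    return f"For k == {k} we have k % 2 ==  {k % 2}"


def _second(k):
    return f"For k == {k} we got past the `continue`"


def with_continue(values):
    lines = []
    for k in values:
        lines.append(_first(k))
        if not k % 2 == 0:
            continue  # odd k: skip the second message
        lines.append(_second(k))
    return lines


def with_plain_if(values):
    lines = []
    for k in values:
        lines.append(_first(k))
        if k % 2 == 0:
            lines.append(_second(k))
    return lines


def with_trailing_continue(values):
    lines = []
    for k in values:
        lines.append(_first(k))
        if k % 2 == 0:
            lines.append(_second(k))
        continue  # last statement in the body: changes nothing
    return lines
```

Calling `print("\n".join(with_continue([1, 2, 3, 4, 5])))` prints the messages one per line.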

Answer 3

Ste*_*erg 3

Your code will miss more text than just tables; for example, it will not see text in shapes that are part of a group.
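
In python-pptx terms, a rough sketch of a walk that also descends into groups could look like this. `iter_shapes` and `collect_all_text` are names made up for the example, and detecting a group by the presence of a `shapes` attribute is an assumption of this sketch; comparing `shape.shape_type` with `MSO_SHAPE_TYPE.GROUP` is the more explicit check.

```python
def iter_shapes(shapes):
    """Yield every shape, descending into group shapes recursively."""
    for shape in shapes:
        # Assumption: only group shapes expose a `.shapes` collection.
        children = getattr(shape, "shapes", None)
        if children is not None:
            yield from iter_shapes(children)
        else:
            yield shape


def collect_all_text(prs):
    """Collect text from text frames and table cells, including grouped shapes."""
    texts = []
    for slide in prs.slides:
        for shape in iter_shapes(slide.shapes):
            if getattr(shape, "has_table", False):
                for row in shape.table.rows:
                    texts.extend(cell.text for cell in row.cells)
            elif shape.has_text_frame:
                texts.append(shape.text_frame.text)
    return texts
```

Other containers, such as charts or notes pages, are not covered by this sketch.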

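For tables, you need to do a couple of things:

Test each shape to see whether its .HasTable property is true. If it is, you can work with the shape's .Table object to extract the text. Conceptually, as very rough air code: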

For r = 1 to tbl.rows.count
   For c = 1 to tbl.columns.count
      tbl.cell(r,c).Shape.Textframe.Text ' is what you're after
