Reading from a PowerPoint table in Python?

tdo*_*uts 3 python powerpoint

I am using the python-pptx module to automatically update values in a PowerPoint file. I can extract the text in the file with the following code:

from pptx import Presentation
prs = Presentation(path_to_presentation)
# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []
for slide in prs.slides:
  for shape in slide.shapes:
    if not shape.has_text_frame:
      continue
  for paragraph in shape.text_frame.paragraphs:
    for run in paragraph.runs:
      text_runs.append(run.text)

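This code extracts the text in the file, but it cannot extract the text inside tables, and I want to update some of those values. I tried to implement some code from this question: Reading text values in a PowerPoint table using pptx? but could not get it to work. Any ideas? Thanks.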

Ger*_*pez 5

This worked for me:

    # prs is the Presentation object opened in the question
    def access_table():
        slide = prs.slides[0]  # first slide
        table = slide.shapes[2].table  # the shape index may differ: 0..n
        for r in table.rows:
            s = ""
            for c in r.cells:
                s += c.text_frame.text + " | "
                # to write instead of read:
                # c.text_frame.text = "example"
            print(s)
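To go from reading to updating, as the question asks, a minimal sketch could replace a placeholder string in every table cell. The helper name replace_in_tables and the run-by-run replacement strategy are assumptions made for illustration, not part of this answer. It relies only on the has_table, table, rows, cells, text_frame, paragraphs, runs and text attributes of python-pptx objects.

```python
def replace_in_tables(prs, old, new):
    """Replace `old` with `new` in every table cell; return the number of runs changed.

    `prs` is expected to be an object returned by pptx.Presentation().
    Hypothetical helper, not part of python-pptx.
    """
    changed = 0
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_table:
                continue  # not a table: text boxes, pictures, charts, ...
            for row in shape.table.rows:
                for cell in row.cells:
                    for paragraph in cell.text_frame.paragraphs:
                        # Editing run.text in place leaves the run's font settings alone.
                        for run in paragraph.runs:
                            if old in run.text:
                                run.text = run.text.replace(old, new)
                                changed += 1
    return changed
```

After calling it on a presentation opened with Presentation(path), the result can be written out with prs.save(). Note that a string PowerPoint has split across two runs will not match; this sketch does not try to handle that case.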


Too*_*one 5

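How to extract all text from tables inside a slide presentation

The following code extracts text from the tables in a slide presentation. Text outside of tables is omitted, but you could also modify my code to capture text in non-table objects.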

from pptx import Presentation

def get_tables_from_presentation(pres):
   """
   The input parameter `pres` should receive
   an object returned by `pptx.Presentation()`

   EXAMPLE:
       ```
       import pptx
       p = "C:\\Users\\user\\Desktop\\power_point_pres.pptx"
       pres = pptx.Presentation(p)

       tables = get_tables_from_presentation(pres)
       ```
   """
   tables = []
   for slide in pres.slides:
      for shp in slide.shapes:
         # only shapes that hold a table report has_table == True
         if shp.has_table:
            tables.append(shp.table)
   return tables


def iter_to_nonempty_table_cells(tbl):
   """
   :param tbl: 'pptx.table.Table'
          input table is NOT modified

   :return: iterator over the stripped text of non-empty cells
   """
   for ridx in range(len(tbl.rows)):
      for cidx in range(len(tbl.columns)):
         cell = tbl.cell(ridx, cidx)
         txt = cell.text.strip()
         # keep every cell with at least one non-whitespace character
         if txt:
            yield txt


# establish read path
in_file_path = "C:\\Users\\user\\Desktop\\power_point_pres.pptx"

# Open slide-show presentation
pres = Presentation(in_file_path)

# extract tables from slide-show presentation
tables = get_tables_from_presentation(pres)

for tbl in tables:
   it = iter_to_nonempty_table_cells(tbl)
   # separate the cell texts so they do not run together
   print(" | ".join(it))
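As noted above, the code can be adapted to capture text in non-table objects as well. One possible sketch of that adaptation follows; the generator name iter_presentation_text and the choice to report the slide number alongside each string are assumptions, not part of the answer's code.

```python
def iter_presentation_text(pres):
    """Yield (slide_number, text) for text-frame shapes and table cells alike.

    `pres` is expected to be an object returned by pptx.Presentation().
    Hypothetical helper; blank strings are skipped.
    """
    for slide_number, slide in enumerate(pres.slides, start=1):
        for shape in slide.shapes:
            if shape.has_text_frame:
                # text boxes, titles, placeholders, autoshapes with text
                text = shape.text_frame.text.strip()
                if text:
                    yield slide_number, text
            elif shape.has_table:
                # tables keep their text per cell
                for row in shape.table.rows:
                    for cell in row.cells:
                        text = cell.text.strip()
                        if text:
                            yield slide_number, text
```

Shapes that carry neither a text frame nor a table (pictures, for example) simply fall through both branches.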

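A note on one of the other answers to this question

Someone else posted a semi-useful answer to this question, written in pseudocode. They wrote the following: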

For r = 1 to tbl.rows.count
  For c = 1 to tbl.columns.count
     tbl.cell(r,c).Shape.Textframe.Text

The problem is, that is not Python.

In Python, syntax such as For r = 1 to 10 is illegal. Instead, we can write:

for r in range(1, 11):
   print(r)

# or, equivalently, with itertools
from itertools import count, takewhile
for r in takewhile(lambda k: k <= 10, count(1)):
   print(r)

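Also, row indices start at r = 0, not r = 1.

The top-left cell of the table is tbl.cell(0,0), not tbl.cell(1,1).

There is no such thing as a .count attribute on the rows or the columns. (For r = 1 to tbl.rows.count) makes no sense, because tbl.rows.count does not exist.

tbl.cell(r,c).Shape does not work, because objects instantiated from the class pptx.table._Cell have no attribute named Shape.

A cell object has the following attributes: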

  • fill
  • is_merge_origin
  • is_spanned
  • margin_bottom
  • margin_left
  • margin_right
  • margin_top
  • merge
  • part
  • span_height
  • span_width
  • split
  • text
  • text_frame
  • vertical_anchor

The fix looks like this:

# ----------------------------------------
# BEGIN SYNTACTICALLY INCORRECT CODE
# ----------------------------------------
# For r = 1 to tbl.rows.count
#   For c = 1 to tbl.columns.count
#      tbl.cell(r,c).Shape.Textframe.Text
# ----------------------------------------
# END SYNTACTICALLY INCORRECT CODE
# BEGIN SYNTACTICALLY CORRECT CODE
# ----------------------------------------
for r in range(len(tbl.rows)):
    for c in range(len(tbl.columns)):
        print(tbl.cell(r, c).text)
# ----------------------------------------
# END SYNTACTICALLY CORRECT CODE
# ----------------------------------------
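Building on the corrected loop, here is a sketch that collects a table into a list of rows and uses the is_spanned attribute from the list above to mark cells hidden by a merge. The function name table_to_rows and the choice of None as the marker are assumptions. It also assumes that len() works on tbl.rows and tbl.columns, which python-pptx's row and column collections are expected to support.

```python
def table_to_rows(tbl):
    """Return the table as a list of rows, each a list of cell strings.

    Cells covered by a merged neighbour (is_spanned) are reported as None.
    Hypothetical helper, not part of python-pptx.
    """
    rows = []
    for r in range(len(tbl.rows)):          # indices are 0-based
        row = []
        for c in range(len(tbl.columns)):
            cell = tbl.cell(r, c)
            row.append(None if cell.is_spanned else cell.text)
        rows.append(row)
    return rows
```

The top-left cell ends up at rows[0][0], matching tbl.cell(0, 0).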

Notes on the original code

The continue keyword

In your original source code, you have the following for loop:

for shape in slide.shapes:
    if not shape.has_text_frame:
      continue

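That for loop does nothing.

The continue keyword simply means "skip the rest of the loop body and move on to the next iteration". However, in your loop there is no code after continue and before the end of the loop body. That is, the loop would carry on to the next shape anyway, without you writing continue, because continue is already the last statement in the loop body.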
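For contrast, here is a sketch of the same loop with the paragraph loop indented inside the shape loop, so that continue actually has something to skip. Wrapping it in a function named collect_text_runs is an assumption made for illustration.

```python
def collect_text_runs(prs):
    """Return the text of every run in every text-frame shape of `prs`.

    `prs` is expected to be an object returned by pptx.Presentation().
    """
    text_runs = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue  # skips the paragraph loop below for this shape only
            # This loop is now part of the shape loop's body.
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
    return text_runs
```

With this indentation, shapes without a text frame are skipped and every other shape is visited, rather than only the last shape the loop happened to reach. Text inside tables is still not collected.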

To learn more about continue, consider the following example:

for k in [1, 2, 3, 4, 5]:
    print("For k ==", k, "we have k % 2 == ", k % 2)
    if not k % 2 == 0:
        continue
    print("For k ==", k, "we got past the `continue`")

The output is:

For k == 1 we have k % 2 ==  1
For k == 2 we have k % 2 ==  0
For k == 2 we got past the `continue`
For k == 3 we have k % 2 ==  1
For k == 4 we have k % 2 ==  0
For k == 4 we got past the `continue`
For k == 5 we have k % 2 ==  1

The following three pieces of code all print exactly the same messages, whether or not they use the continue keyword:

For k == 1 we have k % 2 ==  1
For k == 2 we have k % 2 ==  0
For k == 2 we got past the `continue`
For k == 3 we have k % 2 ==  1
For k == 4 we have k % 2 ==  0
For k == 4 we got past the `continue`
For k == 5 we have k % 2 ==  1
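As a sketch of that equivalence, the three functions below build the same list of messages for the same input. These variants are illustrative and not necessarily the ones this answer had in mind; the function names and the message wording are assumptions.

```python
def with_continue(values):
    """Skip the second message for odd k by using continue."""
    messages = []
    for k in values:
        messages.append(f"k == {k}: k % 2 == {k % 2}")
        if not k % 2 == 0:
            continue
        messages.append(f"k == {k}: got past the check")
    return messages


def with_if(values):
    """Same result: guard the second message with a plain if."""
    messages = []
    for k in values:
        messages.append(f"k == {k}: k % 2 == {k % 2}")
        if k % 2 == 0:
            messages.append(f"k == {k}: got past the check")
    return messages


def with_if_else(values):
    """Same result again: do nothing for odd k, append for even k."""
    messages = []
    for k in values:
        messages.append(f"k == {k}: k % 2 == {k % 2}")
        if not k % 2 == 0:
            pass
        else:
            messages.append(f"k == {k}: got past the check")
    return messages
```

In each variant continue (or its absence) only decides whether the statements after the test run for the current k; the loop itself always moves on to the next value.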


Ste*_*erg 3

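Your code will miss more text than just tables; for example, it will not see text in shapes that are part of a group.

For tables, you need to do a couple of things:

Test the shape to see whether its .HasTable property is true. If so, you can work with the shape's .Table object to extract the text. Conceptually, and very much air code: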

For r = 1 to tbl.rows.count
   For c = 1 to tbl.columns.count
      tbl.cell(r,c).Shape.Textframe.Text ' is what you're after
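The point about grouped shapes can be sketched in Python as a recursive walk. The generator name iter_leaf_shapes is an assumption, and so is the detection rule: it treats any shape that exposes a shapes attribute as a group, which is how python-pptx's group shapes are expected to behave. A stricter check would compare shape.shape_type against MSO_SHAPE_TYPE.GROUP from pptx.enum.shapes.

```python
def iter_leaf_shapes(shapes):
    """Yield every non-group shape in `shapes`, descending into groups.

    Assumption: a group shape exposes its members through a `shapes`
    attribute and ordinary shapes do not.
    """
    for shape in shapes:
        members = getattr(shape, "shapes", None)
        if members is not None:
            # a group: recurse, since groups may contain other groups
            yield from iter_leaf_shapes(members)
        else:
            yield shape
```

Calling iter_leaf_shapes(slide.shapes) in place of slide.shapes lets the has_text_frame and has_table tests from the other answers reach shapes nested inside groups.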