Get a table from Confluence in JSON format

Question

Ome*_*ega 1 python confluence confluence-rest-api

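I'm trying to get the contents of a table on a Confluence page as JSON. Everything is behind SSO, so I can only use an API key, and I haven't found a way to access Confluence with the requests library. Unfortunately, the Confluence API returns the page body as plain HTML.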

This is what I have so far. Can the Confluence library output the table as JSON (instead of showing the raw HTML inside a dictionary)?

from atlassian import Confluence
import os

user = "me@myself.com"
api_key = os.environ['confluence_api_key']
server = "https://xxxxxx.atlassian.net"
api_url = "/rest/api/content"
page_id = "12345"

confluence = Confluence(url=server, username=user, password=api_key)
page = confluence.get_page_by_title("TEST", "page 1", expand="body.storage")
content = page["body"]["storage"]
print(content)

The output looks like this:

{'value': '<p>Something something.</p><p /><table data-layout="default" ac:local-id="xxx"><colgroup><col style="width: 226.67px;" /><col style="width: 226.67px;" /><col style="width: 226.67px;" /></colgroup><tbody><tr><th><p><strong>name</strong></p></th><th><p><strong>type</strong></p></th><th><p><strong>comment</strong></p></th></tr><tr><td><p>text1</p></td><td><p>varchar(10)</p></td><td><p /></td></tr><tr><td><p>123</p></td><td><p>int</p></td><td><p /></td></tr></tbody></table>', 'representation': 'storage', 'embeddedContent': [], '_expandable': {'content': '/rest/api/content/12345'}}

With the requests library I get a 404 error:

import requests

# Reuses server, api_url, page_id, user and api_key from the snippet above
request_url = "{server}{api_url}/{page_id}?expand=body.storage".format(
    server=server, api_url=api_url, page_id=page_id
)

response = requests.get(request_url, auth=(user, api_key))

print(response.status_code)

Answer 1

Tra*_*nbi 5

Why use the requests library directly? The atlassian Python API already uses it and saves you some work.
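
If calling the REST endpoint through requests is still preferred, one likely cause of the 404 is a missing `/wiki` context path: on Confluence Cloud sites (`*.atlassian.net`) the content API is served under `/wiki/rest/api/content`, and the atlassian library appears to add that prefix on its own. The sketch below only builds the URL. `build_content_url` is a hypothetical helper, not part of any library, and it assumes a Cloud site.

```python
def build_content_url(server: str, page_id: str, expand: str = "body.storage") -> str:
    """Build the content URL for a Confluence Cloud page (hypothetical helper)."""
    # Assumption: Cloud sites serve the REST API under the /wiki context path.
    base = server.rstrip("/")
    if not base.endswith("/wiki"):
        base += "/wiki"
    return f"{base}/rest/api/content/{page_id}?expand={expand}"


if __name__ == "__main__":
    print(build_content_url("https://xxxxxx.atlassian.net", "12345"))
```

The resulting URL can be passed to `requests.get(url, auth=(user, api_key))` as in the question. The JSON response still carries the table as HTML in `body.storage.value`, so it needs parsing either way.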

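I happened to run into the same problem this week and had to parse the table with BeautifulSoup. For a general solution, I think it is best to load the tables as DataFrames: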

from atlassian import Confluence
import os
from bs4 import BeautifulSoup
import pandas as pd

user = "me@myself.com"
api_key = os.environ['confluence_api_key']
server = "https://xxxxxx.atlassian.net"

confluence = Confluence(url=server, username=user, password=api_key)
page = confluence.get_page_by_title("TEST", "page 1", expand="body.storage")
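# The storage-format body is an HTML string
body = page["body"]["storage"]["value"]

# One list of rows per table; each row holds the text of its header and data cells
tables_raw = [[[cell.text for cell in row("th") + row("td")]
               for row in table("tr")]
              for table in BeautifulSoup(body, features="lxml")("table")]

# One DataFrame per table found on the page
tables_df = [pd.DataFrame(table) for table in tables_raw]
for table_df in tables_df:
    print(table_df)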

You can then convert each DataFrame to JSON with to_json, depending on how you want the resulting dictionary structured...
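
As a rough sketch of that last step, the snippet below promotes the first row to column names and serializes each remaining row as one JSON object. The `rows` list is hard-coded to mirror the table in the question and stands in for one entry of `tables_raw`; `orient="records"` is only one of several layouts that `to_json` supports.

```python
import json

import pandas as pd

# Stand-in for one entry of tables_raw (values copied from the question's table)
rows = [
    ["name", "type", "comment"],
    ["text1", "varchar(10)", ""],
    ["123", "int", ""],
]

# Use the first row as the header and the rest as data
df = pd.DataFrame(rows[1:], columns=rows[0])

# One JSON object per table row
table_json = df.to_json(orient="records")
print(json.dumps(json.loads(table_json), indent=2))
```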

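Edit: note that styling information (and other tags, such as links) is lost with this approach, since only the text of each cell is kept. Keep that in mind if you want to update the page content after modifying it.
Also, if you want to use table contents as dictionary keys, you may want to change the row/column index.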
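
To illustrate the point about indexes: assuming the `name` column holds unique values, it can be set as the row index so that each name becomes a dictionary key. The sample rows are copied from the question's table and are not fetched from Confluence here.

```python
import pandas as pd

# Sample rows mirroring the question's table (header row first)
rows = [
    ["name", "type", "comment"],
    ["text1", "varchar(10)", ""],
    ["123", "int", ""],
]
df = pd.DataFrame(rows[1:], columns=rows[0])

# Key each row by its "name" cell; the remaining columns form the nested dict
by_name = df.set_index("name").to_dict(orient="index")
print(by_name["text1"])
```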

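Edit 2: this is an old answer, but since it recently got upvoted, I want to add that in this case the tables can be computed directly with pandas' built-in read_html instead of building tables_raw by hand: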

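from io import StringIO

# Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
tables_df = pd.read_html(StringIO(body))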

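This even sets the table headers as the DataFrame column names directly, and it has parameters for extracting links or parsing dates. However, the approach above is still valid, especially if you need more than cell.text in your DataFrame (in my original case I wanted to import icons).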
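
A minimal sketch of the read_html route, run on a shortened copy of the question's table with one link added for illustration (the example.com URL is made up). It assumes pandas 1.5 or later for `extract_links` and an installed HTML parser such as lxml.

```python
from io import StringIO

import pandas as pd

# Shortened copy of the question's table; the link is invented for this example
html = (
    "<table><tbody>"
    "<tr><th><p><strong>name</strong></p></th><th><p><strong>type</strong></p></th></tr>"
    '<tr><td><p><a href="https://example.com/text1">text1</a></p></td>'
    "<td><p>varchar(10)</p></td></tr>"
    "</tbody></table>"
)

# Leading rows made up only of <th> cells are used as the column names
df = pd.read_html(StringIO(html))[0]
print(list(df.columns))

# With extract_links="body", each body cell becomes a (text, href) tuple
df_links = pd.read_html(StringIO(html), extract_links="body")[0]
print(df_links.loc[0, "name"])
```

Cells without a link come back with `None` in the href position, so downstream code needs to handle both cases.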

Archived:	4 years, 5 months ago
Viewed:	8385 times
Last activity:	2 years, 10 months ago