How to extract plain text from .docx files using R

6 r extract docx

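Does anyone know of anything they can recommend for extracting just the plain text from an article in .docx format (preferably using R)?

Speed isn't crucial, and we could even use a website that has an API for uploading and extracting the files, but I have been unable to find one. I need to extract the introduction, the methods, the results, and the conclusion. I want to remove the abstract, the references, and especially the figures and tables. Thanks.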

Kat*_*tia 7

You can try using the readtext library:

library(readtext)
x <- readtext("/path/to/file/myfile.docx")
# x$text will contain the plain text in the file

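The variable x holds just the text (in x$text), without any formatting, so if you need to extract specific information you will have to perform a string search. For example, for the document you mentioned in your comment, one approach could be as follows: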

library(readtext)
doc.text <- readtext("test.docx")$text

# Split text into parts using new line character:
doc.parts <- strsplit(doc.text, "\n")[[1]]

# First line in the document- the name of the Journal
journal.name <- doc.parts[1]
journal.name
# [1] "International Journal of Science and Research (IJSR)"

# Similarly we can extract some other parts from a header
issn <-  doc.parts[2]
issue <- doc.parts[3]

# Search for the Abstract:
abstract.loc <- grep("Abstract:", doc.parts)[1]

# Search for the Keyword
Keywords.loc <- grep("Keywords:", doc.parts)[1]

# The text in between these 2 keywords will be abstract text:
abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ")

# Same way we can get Keywords text:
Background.loc <- Keywords.loc + grep("1\\.", doc.parts[-(1:Keywords.loc)])[1]
Keywords.text <- paste(doc.parts[Keywords.loc:(Background.loc-1)], collapse=" ")
Keywords.text
# [1] "Keywords: Nephronophtisis, NPHP1 deletion, NPHP4 mutations, Tunisian patients"

# Assuming that Methods is part 2
Methods.loc <- Background.loc + grep("2\\.", doc.parts[-(1:Background.loc)])[1]
Background.text <- paste(doc.parts[Background.loc:(Methods.loc-1)], collapse=" ")


# Assuming that Results is Part 3
Results.loc <- Methods.loc + grep("3\\.", doc.parts[-(1:Methods.loc)])[1]
Methods.text <- paste(doc.parts[Methods.loc:(Results.loc-1)], collapse=" ")

# Similarly with other parts. For example for Acknowledgements section:
Ack.loc <- grep("Acknowledgements", doc.parts)[1]
Ref.loc <- grep("References", doc.parts)[1]
Ack.text <- paste(doc.parts[Ack.loc:(Ref.loc-1)], collapse=" ")
Ack.text
# [1] "6. Acknowledgements We are especially grateful to the study participants. 
# This study was supported by a grant from the Tunisian Ministry of Health and 
# Ministry of Higher Education ...
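
The steps above repeat one pattern: find a start marker, find the next marker after it, and paste the lines in between. That pattern could be wrapped in a small helper. This is only a sketch: text_between is a hypothetical name, and the sample lines are made up to stand in for doc.parts.

```r
# Hypothetical helper (not part of readtext): collapse the lines from the
# first match of `start` up to, but not including, the first match of `end`
# that comes after it.
text_between <- function(lines, start, end) {
  start.loc <- grep(start, lines)[1]
  if (is.na(start.loc)) return(NA_character_)   # start marker not found

  # Only consider end markers located after the start marker
  end.hits <- grep(end, lines)
  end.loc <- end.hits[end.hits > start.loc][1]

  # If there is no end marker, run to the end of the document
  if (is.na(end.loc)) end.loc <- length(lines) + 1

  paste(lines[start.loc:(end.loc - 1)], collapse = " ")
}

# Made-up lines standing in for doc.parts
sample.parts <- c(
  "Some Journal",
  "Abstract: A short summary.",
  "Second abstract line.",
  "Keywords: alpha, beta",
  "1. Introduction"
)

abstract.text <- text_between(sample.parts, "Abstract:", "Keywords:")
keywords.text <- text_between(sample.parts, "Keywords:", "^1\\.")
```

Anchoring the heading pattern with ^ (as in "^1\\.") is a design choice of this sketch, so that a "1." appearing in the middle of a sentence is not mistaken for a section heading.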

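The exact approach depends on the structure shared by all the documents you need to search. For instance, if the first section is always named "Background", you can search for that word. However, if it is sometimes "Background" and sometimes "Introduction", you might want to search for the "1." pattern instead.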
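
If the section names vary between documents, one option is to split on the numbered headings themselves and then pick sections by position. The sketch below assumes that every heading sits on its own line in the form "<number>. Title" and that the reference list starts at a line beginning with "References". Both are assumptions about the documents, and split_sections is a hypothetical name.

```r
# Hypothetical helper: split lines into sections at numbered headings such
# as "1. Background". Everything from the first line matching `stop.pat`
# onward is dropped first, so numbered reference entries are not mistaken
# for headings.
split_sections <- function(lines, stop.pat = "^References") {
  stop.loc <- grep(stop.pat, lines)[1]
  if (!is.na(stop.loc)) lines <- lines[seq_len(stop.loc - 1)]

  # A heading: optional spaces, digits, a dot, spaces, then a capital letter
  heading.pat <- "^[[:space:]]*[0-9]+\\.[[:space:]]+[A-Z]"
  starts <- grep(heading.pat, lines)
  if (length(starts) == 0) return(list())

  # Each section ends right before the next heading (or at the last line)
  ends <- c(starts[-1] - 1, length(lines))

  sections <- lapply(seq_along(starts), function(i) {
    body <- lines[starts[i]:ends[i]]
    paste(body[-1], collapse = " ")   # drop the heading line itself
  })

  # Name each section after its heading text, without the number
  names(sections) <- sub("^[[:space:]]*[0-9]+\\.[[:space:]]+", "", lines[starts])
  sections
}

# Made-up lines standing in for doc.parts
sample.parts <- c(
  "Some Journal",
  "Abstract: Summary.",
  "Keywords: alpha, beta",
  "1. Background",
  "Background line one.",
  "Background line two.",
  "2. Methods",
  "Methods line.",
  "3. Results",
  "Results line.",
  "4. Conclusion",
  "Conclusion line.",
  "References",
  "1. Smith J. Some paper."
)

sections <- split_sections(sample.parts)
first.section <- sections[[1]]   # works whether it is "Background" or "Introduction"
```

Because the abstract and keywords precede the first numbered heading and the references follow the stop line, neither ends up in the result. Figures and tables embedded in the body text would still need separate handling.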