I am scraping http://www.progarchives.com/album.asp?id= and get a warning message:
Warning message:
XML content does not seem to be XML: http://www.progarchives.com/album.asp?id=2 http://www.progarchives.com/album.asp?id=3 http://www.progarchives.com/album.asp?id=4 http://www.progarchives.com/album.asp?id=5
The scraper works for each page individually, but not for the URLs b1=2:b2=1000.
library(RCurl)
library(XML)
getUrls <- function(b1, b2) {
  root <- "http://www.progarchives.com/album.asp?id="
  urls <- NULL
  for (bandid in b1:b2) {
    urls <- c(urls, paste(root, bandid, sep = ""))
  }
  return(urls)
}
prog.arch.scraper <- function(url) {
  SOURCE <- getUrls(b1 = 2, b2 = 1000)
  PARSED <- htmlParse(SOURCE)
  album <- xpathSApply(PARSED, "//h1[1]", xmlValue)
  date <- xpathSApply(PARSED, "//strong[1]", xmlValue)
  band <- xpathSApply(PARSED, "//h2[1]", xmlValue)
  return(c(band, album, date))
}
prog.arch.scraper(urls)
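
One likely cause, offered as a sketch rather than a confirmed diagnosis: htmlParse() expects a single document, while the function above hands it the whole vector of URLs at once. Parsing one URL at a time with lapply() avoids that. The helper names get_urls, first_or_na, scrape_one and scrape_all are hypothetical, the XPath expressions are copied from the question, and the page layout is assumed to be what the question implies.

```r
# Build the vector of album URLs; paste0() is vectorised, so no loop is needed.
get_urls <- function(b1, b2) {
  root <- "http://www.progarchives.com/album.asp?id="
  paste0(root, b1:b2)
}

# Return the first match of an XPath result, or NA when nothing matched.
first_or_na <- function(x) {
  if (length(x) > 0) x[[1]] else NA_character_
}

# Scrape a single page (requires the XML package; only touched when called).
scrape_one <- function(url) {
  parsed <- XML::htmlParse(url)
  data.frame(
    url   = url,
    band  = first_or_na(XML::xpathSApply(parsed, "//h2[1]", XML::xmlValue)),
    album = first_or_na(XML::xpathSApply(parsed, "//h1[1]", XML::xmlValue)),
    date  = first_or_na(XML::xpathSApply(parsed, "//strong[1]", XML::xmlValue)),
    stringsAsFactors = FALSE
  )
}

# Apply the scraper to each URL separately; pages that fail are skipped.
scrape_all <- function(urls, scrape = scrape_one) {
  rows <- lapply(urls, function(u) {
    tryCatch(scrape(u), error = function(e) {
      message("Skipping ", u, ": ", conditionMessage(e))
      NULL
    })
  })
  do.call(rbind, rows)
}

# Usage (performs network requests, so left commented out):
# albums <- scrape_all(get_urls(2, 1000))
```

Returning one data frame row per page keeps band, album and date aligned even when a page lacks one of the elements.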
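
I have a corpus of more than 5000 text files. I would like to get the individual word count for each file after running preprocessing on each one (converting to lowercase, removing stopwords, etc.). I haven't had any luck getting word counts for the individual text files. Any help would be appreciated.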
library(tm)
revs<-Corpus(DirSource("data/"))
revs<-tm_map(revs,content_transformer(tolower)) # tm >= 0.6 expects base functions wrapped in content_transformer()
revs<-tm_map(revs,removeWords, stopwords("english"))
revs<-tm_map(revs,removePunctuation)
revs<-tm_map(revs,removeNumbers)
revs<-tm_map(revs,stripWhitespace)
dtm<-DocumentTermMatrix(revs)
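
With the dtm built above, rowSums(as.matrix(dtm)) should give one total per document, since each row of a document-term matrix is one file. Note that DocumentTermMatrix() drops words shorter than three characters by default, so those totals would exclude very short words. The same idea can be sketched in base R without tm. The function name count_words, the sample texts and the tiny stopword list are all assumptions made for illustration.

```r
# Count the words left in each text after simple preprocessing.
# `texts` is a named character vector (one element per file);
# `stop_words` is a character vector of words to drop.
count_words <- function(texts, stop_words = character(0)) {
  vapply(texts, function(txt) {
    txt <- tolower(txt)                       # lowercase
    txt <- gsub("[[:punct:]]+", " ", txt)     # remove punctuation
    txt <- gsub("[[:digit:]]+", " ", txt)     # remove numbers
    tokens <- strsplit(txt, "[[:space:]]+")[[1]]
    tokens <- tokens[nzchar(tokens)]          # drop empty tokens
    tokens <- tokens[!tokens %in% stop_words] # remove stopwords
    length(tokens)
  }, integer(1))
}

# Hypothetical stand-in for two files read from "data/".
docs <- c(a.txt = "The cat sat on the mat.",
          b.txt = "Dogs, 2 dogs!")
counts <- count_words(docs, stop_words = c("the", "on", "and"))
print(counts)
```

For real files, the texts could be read with something like vapply(list.files("data/", full.names = TRUE), function(f) paste(readLines(f), collapse = " "), character(1)).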
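
I am using R's tm package to get word frequencies using a dictionary approach. I want to find all words that end in "esque", regardless of whether they are spelled "abcd-esque", "abcdesque", or "abcd esque" (since all of these spellings occur in my corpus). How do I create a regular expression for this? This is what I have so far. Any help/tips would be appreciated.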
text <- Corpus(DirSource("txt/"))
text <- tm_map(text,content_transformer(tolower)) # tm >= 0.6 expects base functions wrapped in content_transformer()
text <- tm_map(text,stripWhitespace)
dtm.text <- DocumentTermMatrix(text)
list<-inspect(
DocumentTermMatrix(text,list(dictionary = c("rose", "green", "esque")))
)
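
A possible starting point, sketched in base R rather than through the tm dictionary: a pattern that allows an optional hyphen or space before "esque". Because whitespace tokenisation would split "abcd esque" into two terms, one option is to normalise all three spellings to the joined form before building the document-term matrix. The names esque_pattern, find_esque and normalise_esque, and the sample sentence, are hypothetical.

```r
# A word, then an optional hyphen or single space, then "esque" at a word boundary.
esque_pattern <- "\\b[[:alpha:]]+[- ]?esque\\b"

# Extract every "-esque" word from each element of a character vector.
find_esque <- function(x) {
  regmatches(x, gregexpr(esque_pattern, x, perl = TRUE))
}

# Rewrite "abcd-esque" and "abcd esque" as "abcdesque" so that a
# tokeniser sees a single term regardless of the original spelling.
normalise_esque <- function(x) {
  gsub("\\b([[:alpha:]]+)[- ]esque\\b", "\\1esque", x, perl = TRUE)
}

sample_text <- "a kafka-esque tale, a kafkaesque mood and a kafka esque style"
matches <- find_esque(sample_text)[[1]]
print(matches)
print(normalise_esque(sample_text))
```

After normalisation, the count of "esque" words per document is lengths(find_esque(normalise_esque(x))).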
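
I have an Excel spreadsheet with multiple columns. I would like to automatically add unique ID numbers (starting in cell A2) for the duplicate values in column D (starting at D2). Is there any way to make the spreadsheet look like the following? Thanks.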
Column A    Column D
1           3
1           3
2           Bard
2           Bard
3           4ton
3           4ton
3           4ton
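
The numbering being asked for gives each distinct value in column D the rank of its first appearance. As an illustration of that logic only (in R, not as an Excel formula), assuming column D has been read into a character vector with the hypothetical name col_d:

```r
# Values of column D, as in the example above.
col_d <- c("3", "3", "Bard", "Bard", "4ton", "4ton", "4ton")

# match() returns, for each value, its position among the distinct values,
# so repeated values share one ID and IDs follow order of first appearance.
col_a <- match(col_d, unique(col_d))

print(data.frame(A = col_a, D = col_d))
```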

I'm trying to get cumulative sums for the previous row/year. Running cumsum(data$fonds) gives me the running totals of adjacent cells, which doesn't work for what I want to do. I would like my data to look like the following:
  year fond cumsum
1 1950    0      0
2 1951    1      0
3 1952    3      1
4 1953    0      4
5 1954    0      4
Any help would be appreciated.
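
One way to get a running total that excludes the current year, sketched on the sample data above: subtract each row's own value from the ordinary cumulative sum. The column name fond follows the table, and the helper name lagged_cumsum is hypothetical.

```r
# Sample data from the question.
data <- data.frame(year = 1950:1954, fond = c(0, 1, 3, 0, 0))

# Cumulative sum of all earlier rows: the usual running total
# minus the current row's own contribution.
lagged_cumsum <- function(x) cumsum(x) - x

data$cumsum <- lagged_cumsum(data$fond)
print(data)
```

This assumes the rows are already sorted by year; if not, the data would need ordering first, for example with data[order(data$year), ].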