I'm trying to figure out how to parse a web page

err*_*ter 2 parsing r web-scraping

I'm working on a summer project: scraping course information from my school's website.

I started here, collecting the course departments: http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=

Then I scrape the information from pages like that one. I have filtered what I need down to a list like this:

 [1] "91091  211 01     PRINC OF FINANCIAL ACCOUNTING     3.0   55   22       33    0 MW      12:45PM 02:05PM BAB   106        Rose-Green E"
 [2] "91092  211 02     PRINC OF FINANCIAL ACCOUNTING     3.0   53   18       35    0 TR      09:35AM 10:55AM BAB   123        STAFF"       
 [3] "91093  211 03     PRINC OF FINANCIAL ACCOUNTING     3.0   48   29       19    0 TR      05:30PM 06:50PM BAB   220        Hoskins J"   
 [4] "91094  212 01     MANAGEMENT ACCOUNTING             3.0   55   33       22    0 MWF     11:30AM 12:25PM BAB   106        Hoskins J"   
 [5] "91095  212 02     MANAGEMENT ACCOUNTING             3.0   55   27       28    0 TR      02:20PM 03:40PM BAB   106        Bryson R"

But my problems are the following:

www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=CS

I need to add the department from each URL. In the link above, the department is "CS", and I need to include it in every entry.
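One minimal way to do this (my own sketch, not from the post: it assumes the department code is whatever follows `segment=` at the end of the URL):

```r
# Sketch: extract the department code from a schedule URL.
# Assumes the department is everything after "segment=" (e.g. "CS").
url  <- "http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=CS"
dept <- sub(".*segment=", "", url)   # "CS"
```

The extracted code can then be added as a column to each department's rows before combining them.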

I also need to turn this into a table, or some other object I can use to reference the data:

                                                              Max               Wait                                                                                                                    
     CRN    Course     Title                          Credit Enrl Enrl Avail    List Days    Start   End     Bldg  Room       Instructor                                                                
     ------ ---------- ------------------------------ ------ ---- ---- -------- ---- ------- ------- ------- ----- ---------- -------------------- 

That is basically how the data is laid out on the page.

So my end goal is to walk through every link I scraped, grab all the course information (apart from the section type), and put it into one huge data.frame containing every course, like this:

Department CRN    Course     Title                      Credit  MaxEnrl  Enrl Avail WaitList  Days    Start   End     Bldg  Room       Instructor   
ACC        91095  212 02     MANAGEMENT ACCOUNTING      3.0     55       27     28    0      TR      02:20PM 03:40PM   BAB   106        Bryson R
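Once each department's rows are parsed into data.frames with the same columns, stacking them is a one-liner; a toy sketch (the column names and values here are illustrative, not the real parsed schema):

```r
# Toy sketch: stack per-department data.frames into one table with rbind.
# Department/CRN columns are illustrative placeholders.
acc <- data.frame(Department = "ACC", CRN = 91095, stringsAsFactors = FALSE)
cs  <- data.frame(Department = "CS",  CRN = 90001, stringsAsFactors = FALSE)
all_classes <- rbind(acc, cs)   # one row per course, Department preserved
```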

This is what I have working so far:

require(data.table)
require(gdata)
library(foreach)

uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=')
uah <- substring(uah[grep('fall2015', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")

gatherClasses <- function(url){

    dep <- readLines(url)

    # keep only lines that look like course rows: a 5-digit CRN, two spaces,
    # a 3-digit course number, a space, and a 2-digit section
    dep <- dep[grep('\\d{5}  \\d{3} \\d{2}', dep)]

    # drop the leading status column
    dep <- substring(dep, 6)

    # trim() is from gdata
    dep <- foreach(i=dep) %do% trim(i)

    return(dep)
}

x <- gatherClasses(uah[1])
x <-unlist(x)

I can't split the data in the right places, and I'm not sure what to try next.

Edit: (working now)

require(data.table)
require(gdata)
library(foreach)

uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=sum2015b.html&segment=')
uah <- substring(uah[grep('sum2015b', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")


gatherClasses <- function(url){

    L <- readLines(url)
    Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
    widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
    Data <- grep("\\d{5}  \\d{3}", L, value = TRUE)
    classes <- read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
    classes$department <-  unlist(strsplit(url, '='))[3]

    return(classes)
}

allClasses = foreach(i=uah) %do% gatherClasses(i)
allClasses <- do.call("rbind", allClasses)

write.table(allClasses, "c:/sum2015b.txt", sep="\t")

G. *_*eck 5

Read the lines into L, pull the "--- ---- etc." ruler line out into Fields, making sure it ends with exactly one space. Find the character positions of the spaces and diff them to get the field widths. Finally, grep out the data portion and read it with read.fwf as fixed-width fields. For example, for Art History:

URL <- "http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=ARH"
L <- readLines(URL)
Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
Data <- grep("\\d{5}  \\d{3} \\d{2}", L, value = TRUE)
read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)

giving:

   V1    V2     V3                          V4 V5 V6 V7 V8 V9 V10     V11     V12 V13 V14               V15
1     90628 100 01   ARH SURV:ANCIENT-MEDIEVAL  3 35 27  8  0  TR 12:45PM 02:05PM WIL 168           Joyce L
2     90630 101 01 ARH SURV:RENAISSANCE-MODERN  3 35 14 21  0  MW 12:45PM 02:05PM WIL 168         Stewart D
3     90631 101 02 ARH SURV:RENAISSANCE-MODERN  3 35  8 27  0  MW 03:55PM 05:15PM WIL 168         Stewart D
4     92269 101 03 ARH SURV:RENAISSANCE-MODERN  3 35  5 30  0  TR 11:10AM 12:30PM WIL 168 Shapiro Guanlao M
5     90632 101 04 ARH SURV:RENAISSANCE-MODERN  3 35 13 22  0  TR 02:20PM 03:40PM WIL 168 Shapiro Guanlao M
6     90633 301 01           ANCIENT GREEK ART  3 18  3 15  0  MW 02:20PM 03:40PM WIL 168           Joyce L
7     92266 306 01   COLLAPSE OF CIVILIZATIONS  3 10  4  6  0  TR 12:45PM 02:05PM SST 205           Sever T
8   W 90634 309 01   CONTEMPORARY ART & ISSUES  3 18 10  8  0  TR 09:35AM 10:55AM WIL 168         Stewart D
9     90635 320 01     ST: MODERN ARCHITECTURE  3 12  0 12  0  TR 11:10AM 12:30PM WIL 172          Takacs T
10    90636 400 01               SENIOR THESIS  3  0  0  0  0 TBA     TBA         TBA TBA           Joyce L
11    90637 400 02               SENIOR THESIS  3  0  0  0  0 TBA     TBA         TBA TBA         Stewart D
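The width trick in the answer can be checked on a toy ruler line (the string below is made up for illustration):

```r
# Toy check of the field-width computation: after padding the ruler line
# with exactly one trailing space, each field's width is the distance
# between consecutive spaces.
ruler  <- "------ --- --"
Fields <- sub(" *$", " ", ruler)               # "------ --- -- "
widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
# widths is 7, 4, 3: each field width includes its separating space
```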