I'm doing some web scraping, and this is the format of the data:
Sr.No. Course_Code Course_Name Credit Grade Attendance_Grade
The actual string I receive looks like this:
1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M
The fields I'm interested in are Course_Code, Course_Name, and Grade. In this example the values are:
Course_Code : CA727
Course_Name : PRINCIPLES OF COMPILER DESIGN
Grade : A
Is there a way to extract this information easily with a regular expression or some other technique, rather than parsing the string by hand? I'm using JRuby in 1.9 mode.
Phr*_*ogz 40
Let's use Ruby's named captures and a self-documenting regex!
course_line = /
  ^                    # Starting at the front of the string
  (?<SrNo>\d+)         # Capture one or more digits; call the result "SrNo"
  \s+                  # Eat some whitespace
  (?<Code>\S+)         # Capture all the non-whitespace you can; call it "Code"
  \s+                  # Eat some whitespace
  (?<Name>.+\S)        # Capture as much as you can
                       # (while letting the rest of the regex still work)
                       # Make sure you end with a non-whitespace character.
                       # Call this "Name"
  \s+                  # Eat some whitespace
  (?<Credit>\S+)       # Capture all the non-whitespace you can; call it "Credit"
  \s+                  # Eat some whitespace
  (?<Grade>\S+)        # Capture all the non-whitespace you can; call it "Grade"
  \s+                  # Eat some whitespace
  (?<Attendance>\S+)   # Capture all the non-whitespace; call it "Attendance"
  $                    # Make sure that we're at the end of the line now
/x
str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
parts = str.match(course_line)
puts "
Course Code: #{parts['Code']}
Course Name: #{parts['Name']}
Grade: #{parts['Grade']}".strip
#=> Course Code: CA727
#=> Course Name: PRINCIPLES OF COMPILER DESIGN
#=> Grade: A
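
When scraping, some rows (such as the header row) will not match, and `match` then returns `nil`. A minimal sketch of how the same named-capture pattern might be wrapped to skip such rows follows. The names `COURSE_LINE` and `parse_course` are hypothetical, not part of the original answer, and the pattern is the one above written compactly.

```ruby
# Same pattern as above, condensed; still uses the /x flag so the
# line breaks and indentation inside the regex are ignored.
COURSE_LINE = /
  ^(?<SrNo>\d+)\s+
  (?<Code>\S+)\s+
  (?<Name>.+\S)\s+
  (?<Credit>\S+)\s+
  (?<Grade>\S+)\s+
  (?<Attendance>\S+)$
/x

# Hypothetical helper: returns a hash of the three fields of interest,
# or nil when the line does not look like a course row.
def parse_course(line)
  m = COURSE_LINE.match(line.strip)
  return nil unless m
  { code: m['Code'], name: m['Name'], grade: m['Grade'] }
end

rows = [
  "Sr.No. Course_Code Course_Name Credit Grade Attendance_Grade",
  "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
]

# The header row does not start with a digit, so it yields nil and is dropped.
courses = rows.map { |row| parse_course(row) }.compact
```

Because the pattern requires the line to start with digits, the header row is rejected without any special-casing.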
Just for fun:
str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
tok = str.split(/\s+/)
# Take fields from both ends first; whatever remains is the course name.
data = {'Sr.No.' => tok.shift, 'Course_Code' => tok.shift,
        'Attendance_Grade' => tok.pop, 'Grade' => tok.pop, 'Credit' => tok.pop,
        'Course_Name' => tok.join(' ')}
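
The split approach works because only the course name can contain spaces: two fields come off the front, three off the back, and the rest is the name. A minimal sketch of the same idea as a reusable method, with a guard for rows that are too short, might look like this. The name `split_course` and the six-token minimum are assumptions made for illustration.

```ruby
# Hypothetical helper: split-based parsing of one course row.
def split_course(line)
  # With no argument, split breaks on runs of whitespace and
  # ignores leading and trailing whitespace.
  tok = line.split
  # Five fixed fields plus at least one word for the course name.
  return nil if tok.length < 6

  sr_no, code = tok.shift(2)              # first two tokens
  credit, grade, attendance = tok.pop(3)  # last three tokens, in original order
  {
    'Sr.No.'           => sr_no,
    'Course_Code'      => code,
    'Course_Name'      => tok.join(' '),  # everything left in the middle
    'Credit'           => credit,
    'Grade'            => grade,
    'Attendance_Grade' => attendance
  }
end

data = split_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
```

Unlike the regex, this does not check what the tokens look like, so a six-word header row would be parsed as if it were data; filter such rows out before calling it.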