Interpreting this raw text - strategy?

Dom*_*mra 7 ruby language-agnostic parsing text screen-scraping

I have this raw text:

________________________________________________________________________________________________________________________________
Pos Car  Competitor/Team                Driver                   Vehicle              Cap   CL Laps     Race.Time Fastest...Lap

1     6  Jason Clements                 Jason Clements           BMW M3               3200       10     9:48.5710   3 0:57.3228*
2    42  David Skillender               David Skillender         Holden VS Commodore  6000       10     9:55.6866   2 0:57.9409 
3    37  Bruce Cook                     Bruce Cook               Ford  Escort         3759       10     9:56.4388   4 0:58.3359 
4    18  Troy Marinelli                 Troy Marinelli           Nissan  Silvia       3396       10     9:56.7758   2 0:58.4443 
5    75  Anthony Gilbertson             Anthony Gilbertson       BMW M3               3200       10    10:02.5842   3 0:58.9336 
6    26  Trent Purcell                  Trent Purcell            Mazda RX7            2354       10    10:07.6285   4 0:59.0546 
7    12  Scott Hunter                   Scott Hunter             Toyota  Corolla      2000       10    10:11.3722   5 0:59.8921 
8    91  Graeme Wilkinson               Graeme Wilkinson         Ford  Escort         2000       10    10:13.4114   5 1:00.2175 
9     7  Justin Wade                    Justin Wade              BMW M3               4000       10    10:18.2020   9 1:00.8969 
10   55  Greg Craig                     Grag Craig               Toyota  Corolla      1840       10    10:18.9956   7 1:00.7905 
11   46  Kyle Orgam-Moore               Kyle Organ-Moore         Holden VS Commodore  6000       10    10:30.0179   3 1:01.6741 
12   39  Uptiles Strathpine             Trent Spencer            BMW Mini Cooper S    1500       10    10:40.1436   2 1:02.2728 
13  177  Mark Hyde                      Mark Hyde                Ford  Escort         1993       10    10:49.5920   2 1:03.8069 
14   34  Peter Draheim                  Peter Draheim            Mazda RX3            2600       10    10:50.8159  10 1:03.4396 
15    5  Scott Douglas                  Scott Douglas            Datsun  1200         1998        9     9:48.7808   3 1:01.5371 
16   72  Paul Redman                    Paul Redman              Ford  Focus          2lt         9    10:11.3707   2 1:05.8729 
17    8  Matthew Speakman               Matthew Speakman         Toyota  Celica       1600        9    10:16.3159   3 1:05.9117 
18   74  Lucas Easton                   Lucas Easton             Toyota  Celica       1600        9    10:16.8050   6 1:06.0748 
19   77  Dean Fuller                    Dean Fuller              Mitsubishi  Sigma    2600        9    10:25.2877   3 1:07.3991 
20   16  Brett Batterby                 Brett Batterby           Toyota  Corolla      1600        9    10:29.9127   4 1:07.8420 
21   95  Ross Hurford                   Ross Hurford             Toyota  Corolla      1600        8     9:57.5297   2 1:12.2672 
DNF  13  Charles Wright                 Charles Wright           BMW 325i             2700        9     9:47.9888   7 1:03.2808 
DNF  20  Shane Satchwell                Shane Satchwell          Datsun  1200 Coupe   1998        1     1:05.9100   1 1:05.9100 

Fastest Lap Av.Speed Is 152kph, Race Av.Speed Is 148kph
R=under lap record by greatest margin, r=under lap record, *=fastest lap time
________________________________________________________________________________________________________________________________
Issue# 2 - Printed Sat May 26 15:43:31 2012                     Timing System By NATSOFT (03)63431311 www.natsoft.com.au/results
Amended 

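I need to parse it into objects with the obvious fields: position, car, driver and so on. The problem is that I have no idea what strategy to use. If I split it on whitespace, I end up with a list like this: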

["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]

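Can you see the problem? I can't just interpret this list, because people may have only one name, or three words in a name, or many different words in the car. That makes it impossible to reference the list by index alone.

What about using the offsets defined by the column names? I'm not quite sure how that could be used, though.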

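Edit: So the algorithm I'm currently using works like this:

  1. Split the text on newlines, giving a collection of lines.
  2. Find the whitespace characters common to every line, FURTHEST RIGHT. That is, the positions (indexes) at which every line contains whitespace.
  3. Split the lines at those common positions.
  4. Trim the resulting pieces.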
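The steps above can be sketched roughly as follows. This is only a guess at the approach, not the asker's actual code: the helper name `split_on_common_blanks` is made up, and it treats every column that is blank on all lines as a separator rather than only the rightmost one of each gap, which gives the same fields once they are trimmed.

```ruby
# Hypothetical helper: split every line at the columns that are blank on all lines.
def split_on_common_blanks(text)
  lines = text.lines.map(&:chomp).reject(&:empty?)
  return [] if lines.empty?

  width  = lines.map(&:length).max
  padded = lines.map { |line| line.ljust(width) }

  # Columns where every line has a space.
  blank = (0...width).select { |i| padded.all? { |line| line[i] == ' ' } }

  # Group the remaining columns into runs of consecutive indexes (the fields).
  runs = ((0...width).to_a - blank).slice_when { |a, b| b != a + 1 }.to_a

  padded.map { |line| runs.map { |run| line[run.first..run.last].strip } }
end

p split_on_common_blanks("Jason Adams\nBobby Sacka\nJerry Louis")
# prints [["Jason", "Adams"], ["Bobby", "Sacka"], ["Jerry", "Louis"]]
```

The sample call uses three names of equal length, so the space between first and last name lines up and is taken for a column break.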

There are a few problems:

If the names all happen to have the same length, like this:

Jason Adams
Bobby Sacka
Jerry Louis

then it will interpret them as two separate items: (["Jason", "Adams", "Bobby", "Sacka", "Jerry", "Louis"]).

Whereas if they all have different lengths:

Dominic Bou
Bob Adams
Jerry Seinfeld

then it will correctly split after the last 'd' of Seinfeld, so we get a collection of three names (["Dominic Bou", "Bob Adams", "Jerry Seinfeld"]).

It's also quite brittle. I'm looking for a better solution.

pgu*_*rio 6

This is not a good case for regular expressions. What you really want is to discover the format and then unpack the lines:

lines = str.split "\n"

# you know the field names, so you can use them to find the column positions
fields = ['Pos', 'Car', 'Competitor/Team', 'Driver', 'Vehicle', 'Cap', 'CL Laps', 'Race.Time', 'Fastest...Lap']
header = lines.shift until header =~ /^Pos/
positions = fields.map{|f| header.index f}

# use that to construct an unpack format string; the trailing A* picks up
# the last field, which has no following column to measure against
format = 1.upto(positions.length-1).map{|x| "A#{positions[x] - positions[x-1]}"}.join + 'A*'
# A4A5A31A25A21A6A12A10A*
# note: right-aligned values (a Race.Time such as 10:02.5842) can start one
# character left of their header, so check the widths against the data

lines.each do |line|
  next unless line =~ /^(\d|DNF)/ # skip lines you're not interested in
  data = line.unpack(format).map{|x| x.strip}
  puts data.join(', ')
  # or better yet...
  car = Hash[fields.zip data]
  puts car['Driver']
end
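To see the header-driven unpack idea in isolation, here is a small self-contained sketch. It assumes a cut-down stand-in table rather than the full results file, and the names `TABLE`, `FIELDS` and `parse_table` are invented for the illustration. The header positions give the widths, and `A*` covers the last column.

```ruby
# A tiny stand-in table (not the real results file) with left-aligned text columns.
TABLE = <<~TEXT
  Pos Car Driver         Vehicle
  1     6 Jason Clements BMW M3
  DNF  13 Charles Wright BMW 325i
TEXT

FIELDS = %w[Pos Car Driver Vehicle].freeze

# Hypothetical helper: derive an unpack format from the header, then unpack each row.
def parse_table(text, fields)
  header, *rows = text.lines.map(&:chomp)
  starts = fields.map { |field| header.index(field) }
  # One "A<width>" per column, then "A*" so the last column is not dropped.
  format = starts.each_cons(2).map { |a, b| "A#{b - a}" }.join + 'A*'
  rows.map { |row| fields.zip(row.unpack(format).map(&:strip)).to_h }
end

RESULTS = parse_table(TABLE, FIELDS)
puts RESULTS.first['Driver'] # prints Jason Clements
puts RESULTS.last['Pos']     # prints DNF
```

Names of any length and multi-word vehicles survive intact here because the cut points come from the header, not from the spaces inside the values.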


Bhu*_*dha 6

http://blog.ryanwood.com/past/2009/6/12/slither-a-dsl-for-parsing-fixed-width-text-files This could solve your problem.

There are a few more examples there, and on GitHub.

Hope this helps!


ear*_*ils 5

I think it's easy enough to use fixed widths on each line.

#!/usr/bin/env ruby

# ruby parsing_winner.rb winners_list.txt 
# stop with a usage message instead of trying to open nil
abort "Usage: ruby parsing_winner.rb winners_list.txt" if ARGV.empty?
winner_file = File.open(ARGV.shift)
array_of_race_results, array_of_race_results_array  = [], []

class RaceResult

  attr_accessor :position, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest, :fastest_lap
  def initialize(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
    @position    = position 
    @car         = car 
    @team        = team  
    @driver      = driver  
    @vehicle     = vehicle  
    @cap         = cap  
    @cl_laps     = cl_laps  
    @race_time   = race_time 
    @fastest     = fastest
    @fastest_lap = fastest_lap 
  end

  def to_a
    # e.g. ["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
    [position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap]
  end
end

# Pos Car  Competitor/Team                Driver                   Vehicle              Cap   CL Laps     Race.Time Fastest...Lap

# 1     6  Jason Clements                 Jason Clements           BMW M3               3200       10     9:48.5710   3 0:57.3228*
# 2    42  David Skillender               David Skillender         Holden VS Commodore  6000       10     9:55.6866   2 0:57.9409
# etc...
winner_file.each_line do |line|
  next if line[/^____/] || line[/^\w{4,}|^\s|^Pos/] || line[0..3][/\=/]
  position    = line[0..3].strip
  car         = line[4..8].strip
  team        = line[9..39].strip
  driver      = line[40..64].strip
  vehicle     = line[65..85].strip
  cap         = line[86..91].strip
  cl_laps     = line[92..101].strip
  race_time   = line[102..113].strip
  fastest     = line[114..116].strip
  fastest_lap = line[117..-1].strip
  racer = RaceResult.new(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
  array_of_race_results << racer
  array_of_race_results_array << racer.to_a
end

puts "Race Results Objects: #{array_of_race_results}"
puts "Race Results: #{array_of_race_results_array.inspect}"

Output =>

Race Results Objects: [#<RaceResult:0x007fcc4a84b7c8 @position="1", @car="6", @team="Jason Clements", @driver="Jason Clements", @vehicle="BMW M3", @cap="3200", @cl_laps="10", @race_time="9:48.5710", @fastest="3", @fastest_lap="0:57.3228*">, #<RaceResult:0x007fcc4a84aa08 @position="2", @car="42", @team="David Skillender", @driver="David Skillender", @vehicle="Holden VS Commodore", @cap="6000", @cl_laps="10", @race_time="9:55.6866", @fastest="2", @fastest_lap="0:57.9409">, #<RaceResult:0x007fcc4a849ce8 @position="3", @car="37", @team="Bruce Cook", @driver="Bruce Cook", @vehicle="Ford  Escort", @cap="3759", @cl_laps="10", @race_time="9:56.4388", @fastest="4", @fastest_lap="0:58.3359">, #<RaceResult:0x007fcc4a8491f8 @position="4", @car="18", @team="Troy Marinelli", @driver="Troy Marinelli", @vehicle="Nissan  Silvia", @cap="3396", @cl_laps="10", @race_time="9:56.7758", @fastest="2", @fastest_lap="0:58.4443">, #<RaceResult:0x007fcc4b091ab8 @position="5", @car="75", @team="Anthony Gilbertson", @driver="Anthony Gilbertson", @vehicle="BMW M3", @cap="3200", @cl_laps="10", @race_time="10:02.5842", @fastest="3", @fastest_lap="0:58.9336">, #<RaceResult:0x007fcc4b0916a8 @position="6", @car="26", @team="Trent Purcell", @driver="Trent Purcell", @vehicle="Mazda RX7", @cap="2354", @cl_laps="10", @race_time="10:07.6285", @fastest="4", @fastest_lap="0:59.0546">, #<RaceResult:0x007fcc4b091298 @position="7", @car="12", @team="Scott Hunter", @driver="Scott Hunter", @vehicle="Toyota  Corolla", @cap="2000", @cl_laps="10", @race_time="10:11.3722", @fastest="5", @fastest_lap="0:59.8921">, #<RaceResult:0x007fcc4b090e88 @position="8", @car="91", @team="Graeme Wilkinson", @driver="Graeme Wilkinson", @vehicle="Ford  Escort", @cap="2000", @cl_laps="10", @race_time="10:13.4114", @fastest="5", @fastest_lap="1:00.2175">, #<RaceResult:0x007fcc4b090a78 @position="9", @car="7", @team="Justin Wade", @driver="Justin Wade", @vehicle="BMW M3", @cap="4000", @cl_laps="10", @race_time="10:18.2020", 
@fastest="9", @fastest_lap="1:00.8969">, #<RaceResult:0x007fcc4b090668 @position="10", @car="55", @team="Greg Craig", @driver="Grag Craig", @vehicle="Toyota  Corolla", @cap="1840", @cl_laps="10", @race_time="10:18.9956", @fastest="7", @fastest_lap="1:00.7905">, #<RaceResult:0x007fcc4b090258 @position="11", @car="46", @team="Kyle Orgam-Moore", @driver="Kyle Organ-Moore", @vehicle="Holden VS Commodore", @cap="6000", @cl_laps="10", @race_time="10:30.0179", @fastest="3", @fastest_lap="1:01.6741">, #<RaceResult:0x007fcc4b08fe48 @position="12", @car="39", @team="Uptiles Strathpine", @driver="Trent Spencer", @vehicle="BMW Mini Cooper S", @cap="1500", @cl_laps="10", @race_time="10:40.1436", @fastest="2", @fastest_lap="1:02.2728">, #<RaceResult:0x007fcc4b08fa38 @position="13", @car="177", @team="Mark Hyde", @driver="Mark Hyde", @vehicle="Ford  Escort", @cap="1993", @cl_laps="10", @race_time="10:49.5920", @fastest="2", @fastest_lap="1:03.8069">, #<RaceResult:0x007fcc4b08f628 @position="14", @car="34", @team="Peter Draheim", @driver="Peter Draheim", @vehicle="Mazda RX3", @cap="2600", @cl_laps="10", @race_time="10:50.8159", @fastest="10", @fastest_lap="1:03.4396">, #<RaceResult:0x007fcc4b08f218 @position="15", @car="5", @team="Scott Douglas", @driver="Scott Douglas", @vehicle="Datsun  1200", @cap="1998", @cl_laps="9", @race_time="9:48.7808", @fastest="3", @fastest_lap="1:01.5371">, #<RaceResult:0x007fcc4b08ee08 @position="16", @car="72", @team="Paul Redman", @driver="Paul Redman", @vehicle="Ford  Focus", @cap="2lt", @cl_laps="9", @race_time="10:11.3707", @fastest="2", @fastest_lap="1:05.8729">, #<RaceResult:0x007fcc4b08e9f8 @position="17", @car="8", @team="Matthew Speakman", @driver="Matthew Speakman", @vehicle="Toyota  Celica", @cap="1600", @cl_laps="9", @race_time="10:16.3159", @fastest="3", @fastest_lap="1:05.9117">, #<RaceResult:0x007fcc4b08e5e8 @position="18", @car="74", @team="Lucas Easton", @driver="Lucas Easton", @vehicle="Toyota  Celica", @cap="1600", @cl_laps="9", 
@race_time="10:16.8050", @fastest="6", @fastest_lap="1:06.0748">, #<RaceResult:0x007fcc4b08e1d8 @position="19", @car="77", @team="Dean Fuller", @driver="Dean Fuller", @vehicle="Mitsubishi  Sigma", @cap="2600", @cl_laps="9", @race_time="10:25.2877", @fastest="3", @fastest_lap="1:07.3991">, #<RaceResult:0x007fcc4b08ddc8 @position="20", @car="16", @team="Brett Batterby", @driver="Brett Batterby", @vehicle="Toyota  Corolla", @cap="1600", @cl_laps="9", @race_time="10:29.9127", @fastest="4", @fastest_lap="1:07.8420">, #<RaceResult:0x007fcc4a848348 @position="21", @car="95", @team="Ross Hurford", @driver="Ross Hurford", @vehicle="Toyota  Corolla", @cap="1600", @cl_laps="8", @race_time="9:57.5297", @fastest="2", @fastest_lap="1:12.2672">, #<RaceResult:0x007fcc4a847948 @position="DNF", @car="13", @team="Charles Wright", @driver="Charles Wright", @vehicle="BMW 325i", @cap="2700", @cl_laps="9", @race_time="9:47.9888", @fastest="7", @fastest_lap="1:03.2808">, #<RaceResult:0x007fcc4a847010 @position="DNF", @car="20", @team="Shane Satchwell", @driver="Shane Satchwell", @vehicle="Datsun  1200 Coupe", @cap="1998", @cl_laps="1", @race_time="1:05.9100", @fastest="1", @fastest_lap="1:05.9100">]
Race Results: [["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"], ["2", "42", "David Skillender", "David Skillender", "Holden VS Commodore", "6000", "10", "9:55.6866", "2", "0:57.9409"], ["3", "37", "Bruce Cook", "Bruce Cook", "Ford  Escort", "3759", "10", "9:56.4388", "4", "0:58.3359"], ["4", "18", "Troy Marinelli", "Troy Marinelli", "Nissan  Silvia", "3396", "10", "9:56.7758", "2", "0:58.4443"], ["5", "75", "Anthony Gilbertson", "Anthony Gilbertson", "BMW M3", "3200", "10", "10:02.5842", "3", "0:58.9336"], ["6", "26", "Trent Purcell", "Trent Purcell", "Mazda RX7", "2354", "10", "10:07.6285", "4", "0:59.0546"], ["7", "12", "Scott Hunter", "Scott Hunter", "Toyota  Corolla", "2000", "10", "10:11.3722", "5", "0:59.8921"], ["8", "91", "Graeme Wilkinson", "Graeme Wilkinson", "Ford  Escort", "2000", "10", "10:13.4114", "5", "1:00.2175"], ["9", "7", "Justin Wade", "Justin Wade", "BMW M3", "4000", "10", "10:18.2020", "9", "1:00.8969"], ["10", "55", "Greg Craig", "Grag Craig", "Toyota  Corolla", "1840", "10", "10:18.9956", "7", "1:00.7905"], ["11", "46", "Kyle Orgam-Moore", "Kyle Organ-Moore", "Holden VS Commodore", "6000", "10", "10:30.0179", "3", "1:01.6741"], ["12", "39", "Uptiles Strathpine", "Trent Spencer", "BMW Mini Cooper S", "1500", "10", "10:40.1436", "2", "1:02.2728"], ["13", "177", "Mark Hyde", "Mark Hyde", "Ford  Escort", "1993", "10", "10:49.5920", "2", "1:03.8069"], ["14", "34", "Peter Draheim", "Peter Draheim", "Mazda RX3", "2600", "10", "10:50.8159", "10", "1:03.4396"], ["15", "5", "Scott Douglas", "Scott Douglas", "Datsun  1200", "1998", "9", "9:48.7808", "3", "1:01.5371"], ["16", "72", "Paul Redman", "Paul Redman", "Ford  Focus", "2lt", "9", "10:11.3707", "2", "1:05.8729"], ["17", "8", "Matthew Speakman", "Matthew Speakman", "Toyota  Celica", "1600", "9", "10:16.3159", "3", "1:05.9117"], ["18", "74", "Lucas Easton", "Lucas Easton", "Toyota  Celica", "1600", "9", "10:16.8050", "6", "1:06.0748"], ["19", 
"77", "Dean Fuller", "Dean Fuller", "Mitsubishi  Sigma", "2600", "9", "10:25.2877", "3", "1:07.3991"], ["20", "16", "Brett Batterby", "Brett Batterby", "Toyota  Corolla", "1600", "9", "10:29.9127", "4", "1:07.8420"], ["21", "95", "Ross Hurford", "Ross Hurford", "Toyota  Corolla", "1600", "8", "9:57.5297", "2", "1:12.2672"], ["DNF", "13", "Charles Wright", "Charles Wright", "BMW 325i", "2700", "9", "9:47.9888", "7", "1:03.2808"], ["DNF", "20", "Shane Satchwell", "Shane Satchwell", "Datsun  1200 Coupe", "1998", "1", "1:05.9100", "1", "1:05.9100"]]
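The same slicing can also be driven by a table of ranges, so the offsets live in one place. The sketch below is one possible arrangement and not part of the answer: the ranges are copied from the slices above, while `COLUMNS`, `RaceRow` and `parse_row` are names made up here, and the sample row is rebuilt with `format` so that its spacing is unambiguous.

```ruby
# Column ranges copied from the slices in the answer above.
COLUMNS = {
  position: 0..3, car: 4..8, team: 9..39, driver: 40..64, vehicle: 65..85,
  cap: 86..91, cl_laps: 92..101, race_time: 102..113, fastest: 114..116,
  fastest_lap: 117..-1
}.freeze

# One member per column, in the same order as COLUMNS.
RaceRow = Struct.new(*COLUMNS.keys)

# Hypothetical helper: slice one line by the ranges; short lines give empty strings.
def parse_row(line)
  RaceRow.new(*COLUMNS.values.map { |range| line[range].to_s.strip })
end

# A row rebuilt with format, matching the layout of the results text.
sample = format('%-4s%3s  %-31s%-25s%-21s%-6s%7s   %11s %3s %s',
                '2', '42', 'David Skillender', 'David Skillender',
                'Holden VS Commodore', '6000', '10', '9:55.6866', '2', '0:57.9409')
row = parse_row(sample)
puts "#{row.position}: #{row.driver} (#{row.vehicle}) #{row.race_time}"
# prints 2: David Skillender (Holden VS Commodore) 9:55.6866
```

A `Struct` keeps the named accessors and `to_a` of the `RaceResult` class without the hand-written initializer.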


Mar*_*mas 4

You can use the fixed_width gem.

Your given file can be parsed with the following code:

require 'fixed_width'
require 'pp'

FixedWidth.define :cars do |d|
  d.head do |head|
    head.trap { |line| line !~ /\d/ }
  end
  d.body do |body|
    body.trap { |line| line =~ /^(\d|DNF)/ }
    body.column :pos, 4
    body.column :car, 5
    body.column :competitor, 31
    body.column :driver, 25
    body.column :vehicle, 21
    body.column :cap, 5
    body.column :cl_laps, 11
    body.column :race_time, 11
    body.column :fast_lap_no, 4
    body.column :fast_lap_time, 10
  end
end

pp FixedWidth.parse(File.open("races.txt"), :cars)

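The trap method identifies the lines that belong to each section. I used regular expressions:

  • The head regex looks for lines that contain no digits.
  • The body regex looks for lines that start with a digit or "DNF".

Each section has to pick up at the line immediately after the previous one ends. The column definitions simply state how many characters to grab for each field. The library strips the whitespace for you. If you wanted to generate a fixed-width file you could add alignment parameters, but you don't appear to need that.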

The result is a hash that starts like this:

{:head=>[{}, {}, {}],
 :body=>
  [{:pos=>"1",
    :car=>"6",
    :competitor=>"Jason Clements",
    :driver=>"Jason Clements",
    :vehicle=>"BMW M3",
    :cap=>"3200",
    :cl_laps=>"10",
    :race_time=>"9:48.5710",
    :fast_lap_no=>"3",
    :fast_lap_time=>"0:57.3228"},
   {:pos=>"2",
    :car=>"42",
    :competitor=>"David Skillender",
    :driver=>"David Skillender",
    :vehicle=>"Holden VS Commodore",
    :cap=>"6000",
    :cl_laps=>"10",
    :race_time=>"9:55.6866",
    :fast_lap_no=>"2",
    :fast_lap_time=>"0:57.9409"},
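For readers who want to see what the column widths amount to without installing the gem, here is a plain-Ruby sketch. It is not how fixed_width is implemented. The widths and the body regex are copied from the definition above, and `WIDTHS`, `body_line?` and `slice_by_widths` are hypothetical names.

```ruby
# Widths copied from the column definitions above, in order.
WIDTHS = {
  pos: 4, car: 5, competitor: 31, driver: 25, vehicle: 21,
  cap: 5, cl_laps: 11, race_time: 11, fast_lap_no: 4, fast_lap_time: 10
}.freeze

# Same test as the body trap: rows start with a digit or "DNF".
def body_line?(line)
  line.match?(/^(\d|DNF)/)
end

# Cut one line into consecutive slices of the given widths, stripping each one.
def slice_by_widths(line, widths = WIDTHS)
  offset = 0
  widths.each_with_object({}) do |(name, width), row|
    row[name] = line[offset, width].to_s.strip
    offset += width
  end
end

# A row rebuilt with format, matching the layout of the results text.
sample = format('%-4s%3s  %-31s%-25s%-21s%-6s%7s   %11s %3s %s',
                '1', '6', 'Jason Clements', 'Jason Clements',
                'BMW M3', '3200', '10', '9:48.5710', '3', '0:57.3228*')
row = slice_by_widths(sample) if body_line?(sample)
puts row[:fast_lap_time] # prints 0:57.3228
```

The widths add up to 127 characters, so the `*` marker in the last position of the first row falls outside the final slice, which agrees with the `:fast_lap_time=>"0:57.3228"` value shown in the result above.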