Can we create a regex that matches each founder in this list?

Art*_*llé 2 ruby regex

User @adventured posted this on Hacker News:

Paul Graham (31, Viaweb); Jan Koum (33, WhatsApp); Brian Acton (37, WhatsApp); Ev Williams (34, Twitter); Jack Dorsey (33, Square); Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); Garrett Camp (30, Uber); Travis Kalanick (32, Uber); Brian Chesky (27, Airbnb); Adam Neumann (31, WeWork); Reed Hastings (37, Netflix); Reid Hoffman (36, LinkedIn); Jack Ma (35, Alibaba); Jeff Bezos (30, Amazon); Jerry Sanders (33, AMD); Marc Benioff (35, Salesforce); Ross Perot (32, EDS); Peter Norton (39, Norton); Larry Ellison (33, Oracle); Mitch Kapor (32, Lotus); Leonard Bosack (32, Cisco); Sandy Lerner (29, Cisco); Gordon Moore (39, Intel); Mark Cuban (37, Broadcast.com); Scott Cook (31, Intuit); Nolan Bushnell (29, Atari); Paul Galvin (33, Motorola); Irwin Jacobs (52, Qualcomm); David Duffield (46, PeopleSoft | 64, Workday); Aneel Bhusri (39, Workday); Thomas Siebel (41, Siebel Systems); John McAfee (42, McAfee); Gary Hendrix (32, Symantec); Scott McNealy (28, Sun); Pierre Omidyar (28, eBay); Rich Barton (29, Expedia | 38, Zillow); Jim Clark (38, SGI | 49, Netscape); Charles Wang (32, CA); David Packard (27, HP); Craig Newmark (43, Craigslist); John Warnock (42, Adobe); Robert Noyce (30, Fairchild | 41, Intel); Rod Canion (37, Compaq); Jen-Hsun Huang (30, nVidia); James Goodnight (33, SAS); John Sall (28, SAS); Eli Harari (41, SanDisk); Sanjay Mehrotra (28, SanDisk); Al Shugart (48, Seagate); Finis Conner (34, Seagate); Henry Samueli (37, Broadcom); Henry Nicholas (32, Broadcom); Charles Brewer (36, Mindspring); William Shockley (45, Shockley); Ron Rivest (35, RSA); Adi Shamir (30, RSA); John Walker (32, Autodesk); Halsey Minor (30, CNet); David Filo (28, Yahoo); Jeremy Stoppelman (27, Yelp); Eric Lefkofsky (39, Groupon); Andrew Mason (29, Groupon); Markus Persson (30, Mojang); David Hitz (28, NetApp); Brian Lee (28, Legalzoom); Demis Hassabis (34, DeepMind); Tim Westergren (35, Pandora); Martin Lorentzon (37, Spotify); Ashar Aziz (44, FireEye); Kevin O'Connor (36, DoubleClick); 
Ben Silbermann (28, Pinterest); Evan Sharp (28, Pinterest); Steve Kirsch (38, Infoseek); Stephen Kaufer (36, TripAdvisor); Michael McNeilly (28, Applied Materials); Eugene McDermott (52, Texas Instruments); Richard Egan (43, EMC); Gary Kildall (32, Digital Research); Hasso Plattner (28, SAP); Robert Glaser (32, Real Networks); Patrick Byrne (37, Overstock.com); Marc Lore (33, Diapers.com); Ed Iacobucci (36, Citrix Systems); Ray Noorda (55, Novell); Tom Leighton (42, Akamai); Daniel Lewin (28, Akamai); Diane Greene (43, VMWare); Mendel Rosenblum (36, VMWare); Michael Mauldin (35, Lycos); Tom Anderson (33, MySpace); Chris DeWolfe (37, MySpace); Mark Pincus (41, Zynga); Caterina Fake (34, Flickr); Stewart Butterfield (31, Flickr | 36, Slack); Kevin Systrom (27, Instagram); Adi Tatarko (37, Houzz); Brian Armstrong (29, Coinbase); Pradeep Sindhu (43, Juniper); Peter Thiel (31, PayPal | 37, Palantir); Jay Walker (42, Priceline.com); Bill Coleman (48, BEA Systems); Evan Goldberg (35, NetSuite); Fred Luddy (48, ServiceNow); Michael Baum (41, Splunk); Nir Zuk (33, Palo Alto Networks); David Sacks (36, Yammer); Jack Smith (28, Hotmail); Sabeer Bhatia (28, Hotmail); Chad Hurley (28, YouTube); Andy Rubin (37, Danger | 41, Android); Rodney Brooks (36, iRobot); Jeff Hawkins (35, Palm); Tom Gosner (39, DocuSign); Niklas Zennström (37, Skype); Janus Friis (27, Skype); George Kurtz (40, CrowdStrike); Trip Hawkins (28, EA); Gabe Newell (33, Valve); David Bohnett (38, Geocities); Bill Gross (40, GoTo.com/Overture); Subrah Iyar (38, WebEx); Eric Yuan (41, Zoom); Min Zhu (47, WebEx); Bob Parsons (47, GoDaddy); Wilfred Corrigan (43, LSI); Joe Parkinson (33, Micron); Aart J. 
de Geus (32, Synopsys); Patrick Byrne (37, Overstock); Matthew Prince (34, Cloudflare); Ben Uretsky (28, DigitalOcean); Tom Preston-Werner (28, GitHub); Louis Borders (48, Webvan); John Moores (36, BMC Software); Vivek Ranadivé (40, Tibco); Pony Ma (27, Tencent); Robin Li (32, Baidu); Liu Qiangdong (29, JD.com); Lei Jun (40, Xiaomi); Ren Zhengfei (38, Huawei); Arkady Volozh (36, Yandex); Hiroshi Mikitani (34, Rakuten); Morris Chang (56, Taiwan Semi); Cheng Wei (29, Didi Chuxing); James Liang (29, Ctrip); Zhang Yiming (29, ByteDance);

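I am trying to write a regex in which each "match group" corresponds to one of these founders. I was able to match 136/144 of the entries, but I'm stuck on how to capture founders with pipe-separated entries (Elon Musk, David Duffield, Rich Barton, Robert Noyce, etc.). Here is an example: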

Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal);

I know I can escape the pipe with \|, but even wrapping the "paren part" with * doesn't seem to work.

This is the regex I created:

([A-Za-zé'.\/\s+-]+{2})\s+\(([0-9]+),\s+([A-Za-z0-9\s+.-\|]+\s?)\);

(I removed the final semicolon so that I can run the match after calling split(";") on the file contents.)
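
A note on why the pattern above stops at the first company: inside a character class, `.-\|` is read as a range from `.` to `|`, not as three separate characters. That range happens to include the pipe, but the comma is not in the class at all, so a body such as `Tesla | 31, SpaceX` cannot be consumed past `31`. The sketch below isolates that behaviour. The names `original_body` and `widened_body` are illustrative, and the widened class is only one possible adjustment, not a complete fix for the name part.

```ruby
# The company class from the question, isolated for inspection.
company_class = /[A-Za-z0-9\s+.-\|]/

pipe_ok  = company_class.match?("|")  # the pipe falls inside the range "." to "|"
comma_ok = company_class.match?(",")  # the comma is outside every range in the class

# Parenthesised part only, first with the original class ...
original_body = /\((\d+),\s+([A-Za-z0-9\s+.-\|]+)\)/
# ... then with the hyphen escaped and the comma added (an assumed adjustment).
widened_body  = /\((\d+),\s+([A-Za-z0-9\s+.\-|,]+)\)/

entry = "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal)"

before = original_body.match(entry)   # nil: matching stalls at the comma after "31"
after  = widened_body.match(entry)    # captures the whole body after the first age
```

With the widened class, the second capture still holds the raw `Tesla | 31, SpaceX | 27, PayPal` text, so the pairs have to be separated in a further step.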

I created a simple repro here: https://github.com/arthurcolle/founders

Here is the code inline, in case you don't want to clone the (very simple) repro:

rgx = /([A-Za-zé'.\/\s+-]+{2})\s+\(([0-9]+),\s+([A-Za-z0-9\s+.-\|]+\s?)\)/
FOUNDERS_FILE = "/Users/stochastic-thread/founders/founders.txt"

file = File.read(FOUNDERS_FILE)
items = file.split(";")
items.each {|item|
  matched = rgx.match(item)
  if matched and matched.size == 4
    group = "#{matched[1]},#{matched[2]},#{matched[3]}\n"
    puts group
    File.open("founders.csv", mode: "a") do |f|
      f.write(group)
    end
  end
}

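Is there a regex that matches every "founder-company" group, given that each founder may have founded several companies, each with its own age (in the specific format above, as in the Elon Musk entry)? (The ö character is Unicode, so I don't think I can really match it: when I put it in the name match group, I get an error saying multibyte characters don't work.)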

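I know I could find the entries that don't match the regex and handle them with a regex that matches only the parenthesized format, or even split again on the pipe, but I'm trying to find one "perfect regex" for this.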

Car*_*and 5

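The question asks only for the founders to be matched, so initially I have not included their companies. Later, however, I discuss one possible way to organize all of the information.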

String#scan can be used with the following regular expression, which I have defined in free-spacing mode to make it self-documenting.

r = /
    (?<=\A|;\s)  # match the beginning of the string or a semi-colon
                 # followed by a whitespace char in a positive lookbehind  
    [\p{L} ]+    # match one or more Unicode letters or spaces
    (?=\s\()     # match a whitespace followed by "(" in a positive lookahead
    /x           # free-spacing regex definition mode
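
As a small aside on free-spacing mode: with the `x` flag, unescaped whitespace in the pattern is ignored, which is why the pattern above uses `\s` and a space inside a bracket expression where a literal space must be matched. A minimal sketch of that behaviour (the variable names are illustrative):

```ruby
# In free-spacing mode, the space between "a" and "b" is not part of the pattern.
spaced = /a b/x

ignores_space  = spaced.match?("ab")    # matches: the pattern is effectively /ab/
needs_no_space = spaced.match?("a b")   # does not match: "a" is not followed by "b"

# Ways to require a real space while still using the x flag.
with_escape  = /a\sb/x.match?("a b")    # \s matches the space
with_bracket = /a[ ]b/x.match?("a b")   # a space inside [...] is kept
```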

str = "Paul Graham (31, Viaweb); Jan Koum (33, WhatsApp); Brian Acton (37, WhatsApp); " +
      "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); Garrett Camp (30, Uber); " +
      "Travis Kalanick (32, Uber);"

str.scan(r)
  #=> ["Paul Graham", "Jan Koum", "Brian Acton", "Elon Musk", "Garrett Camp",
  #    "Travis Kalanick"] 

Conventionally, this regular expression is written as follows.

/(?<=\A|; )[\p{L} ]+(?= \()/
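
The class `[\p{L} ]` covers letters (including accented ones such as ö) and spaces, but a few names in the full list also contain an apostrophe, a hyphen or a period (Kevin O'Connor, Tom Preston-Werner, Aart J. de Geus). One possible variation, offered as a sketch rather than as part of the original approach, is to add those three characters to the class. The name `name_rx` and the sample string are illustrative.

```ruby
# Same lookbehind and lookahead as above, with ', . and - added to the name class.
name_rx = /(?<=\A|;\s)[\p{L}'.\- ]+(?=\s\()/

# A few entries taken from the list in the question.
sample = "Kevin O'Connor (36, DoubleClick); Tom Preston-Werner (28, GitHub); " +
         "Aart J. de Geus (32, Synopsys); Niklas Zennström (37, Skype);"

names = sample.scan(name_rx)
  #=> ["Kevin O'Connor", "Tom Preston-Werner", "Aart J. de Geus", "Niklas Zennström"]
```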

If the other information is needed as well, one might create a hash such as the following.

r = /
    (?<=\A|;\s)  # match the beginning of the string or a semi-colon
                 # followed by a whitespace char in a positive lookbehind  
    [\p{L} ]+    # match one or more Unicode letters or spaces
    \([^)]+      # match a "(" followed by > 0 characters other than ")"
    /x                

h = str.scan(r).
        map { |s| s.split(/ \(/) }.
        each_with_object({}) do |(name, startups),h|
          h[name] = startups.split(/ *\| */).map do |s|
            age, co = s.split(/, +/)
            { age: age.to_i, co: co }
          end
        end
  #=> {"Paul Graham"    =>[{:age=>31, :co=>"Viaweb"}],
  #    "Jan Koum"       =>[{:age=>33, :co=>"WhatsApp"}],
  #    "Brian Acton"    =>[{:age=>37, :co=>"WhatsApp"}],
  #    "Elon Musk"      =>[{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
  #                        {:age=>27, :co=>"PayPal"}],
  #    "Garrett Camp"   =>[{:age=>30, :co=>"Uber"}],
  #    "Travis Kalanick"=>[{:age=>32, :co=>"Uber"}]}       

One can then easily perform calculations such as the following.

h.each_with_object(Hash.new { |h,k| h[k] = [] }) do |(name, cos),g|
  cos.each { |co| g[co[:co]] << name }
end
  #=> {"Viaweb"=>["Paul Graham"],
  #    "WhatsApp"=>["Jan Koum", "Brian Acton"],
  #    "Tesla"=>["Elon Musk"],
  #    "SpaceX"=>["Elon Musk"],
  #    "PayPal"=>["Elon Musk"],
  #    "Uber"=>["Garrett Camp", "Travis Kalanick"]} 
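
Other summaries follow the same pattern. As a further sketch, the snippet below derives each founder's youngest founding age and the founders with more than one company. The hash literal is a hand-written subset of `h` so that the snippet runs on its own, and `youngest` and `repeat_founders` are illustrative names.

```ruby
# A subset of the hash built above, written out so the example is self-contained.
h = {
  "Paul Graham"  => [{ age: 31, co: "Viaweb" }],
  "Elon Musk"    => [{ age: 32, co: "Tesla" }, { age: 31, co: "SpaceX" },
                     { age: 27, co: "PayPal" }],
  "Garrett Camp" => [{ age: 30, co: "Uber" }]
}

# Youngest age at which each founder started a company.
youngest = h.transform_values { |cos| cos.map { |c| c[:age] }.min }
  #=> {"Paul Graham"=>31, "Elon Musk"=>27, "Garrett Camp"=>30}

# Founders listed with more than one company.
repeat_founders = h.select { |_name, cos| cos.size > 1 }.keys
  #=> ["Elon Musk"]
```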

Conventionally, the regular expression used here is written like this:

/(?<=\A|; )[\p{L} ]+\([^\)]+/                

The steps in the calculation of h are as follows.

a = str.scan(r)
  #=> ["Paul Graham (31, Viaweb", "Jan Koum (33, WhatsApp", "Brian Acton (37, WhatsApp",
  #    "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal", "Garrett Camp (30, Uber",
  #    "Travis Kalanick (32, Uber"]
b = a.map { |s| s.split(/ \(/) }
  #=> [["Paul Graham", "31, Viaweb"], ["Jan Koum", "33, WhatsApp"],
  #    ["Brian Acton", "37, WhatsApp"],
  #    ["Elon Musk", "32, Tesla | 31, SpaceX | 27, PayPal"],
  #    ["Garrett Camp", "30, Uber"], ["Travis Kalanick", "32, Uber"]] 
h = b.each_with_object({}) do |(name, startups),h|
  h[name] = startups.split(/ *\| */).map do |s|
              age, co = s.split(/, +/)
              { age: age.to_i, co: co }
            end
end
  #=> <as above>

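To see how h is computed from b, consider the point at which the block variables and h hold the following values: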

name = "Elon Musk"
startups = "32, Tesla | 31, SpaceX | 27, PayPal"
h = {"Paul Graham" =>[{:age=>31, :co=>"Viaweb"}],
     "Jan Koum"    =>[{:age=>33, :co=>"WhatsApp"}],
     "Brian Acton" =>[{:age=>37, :co=>"WhatsApp"}]}

The block calculation then proceeds as follows.

c = startups.split(/ *\| */)
  #=> ["32, Tesla", "31, SpaceX", "27, PayPal"] 
d = c.map do |s|
  age, co = s.split(/, +/)
  { age: age.to_i, co: co }
end
  #=> [{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
  #    {:age=>27, :co=>"PayPal"}] 
h[name] = d
  #=> [{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
  #    {:age=>27, :co=>"PayPal"}] 

Now,

h #=> {"Paul Graham"=>[{:age=>31, :co=>"Viaweb"}],
  #    "Jan Koum"   =>[{:age=>33, :co=>"WhatsApp"}],
  #    "Brian Acton"=>[{:age=>37, :co=>"WhatsApp"}],
  #    "Elon Musk"  =>[{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
  #                    {:age=>27, :co=>"PayPal"}]} 
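
Since the question's script writes one CSV line per match, the same split-based idea can also produce one row per founder-company pair. The following is a minimal sketch under assumptions: the sample string stands in for the contents of founders.txt, and `entry_rx`, `rows` and `csv_lines` are illustrative names. It is a scan followed by two splits, not the single "perfect regex" the question asks for.

```ruby
# Stand-in for the file contents; the real input would be read from founders.txt.
str = "Paul Graham (31, Viaweb); Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); " +
      "Kevin O'Connor (36, DoubleClick);"

# Capture everything before "(" as the name and everything up to ")" as the body.
entry_rx = /(?<name>[^;(]+?)\s*\((?<body>[^)]*)\)/

rows = str.scan(entry_rx).flat_map do |name, body|
  # One "age, company" pair per pipe-separated segment.
  body.split(/\s*\|\s*/).map do |pair|
    age, company = pair.split(/,\s*/, 2)
    [name.strip, age.to_i, company]
  end
end

# One comma-separated line per founder-company pair.
csv_lines = rows.map { |row| row.join(",") }
  #=> ["Paul Graham,31,Viaweb", "Elon Musk,32,Tesla", "Elon Musk,31,SpaceX",
  #    "Elon Musk,27,PayPal", "Kevin O'Connor,36,DoubleClick"]
```

Taking the name as "anything other than `;` or `(`" sidesteps the question of which characters a name may contain, at the cost of accepting malformed names without complaint.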