User @adventured posted this on Hacker News:
Paul Graham (31, Viaweb); Jan Koum (33, WhatsApp); Brian Acton (37, WhatsApp); Ev Williams (34, Twitter); Jack Dorsey (33, Square); Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); Garrett Camp (30, Uber); Travis Kalanick (32, Uber); Brian Chesky (27, Airbnb); Adam Neumann (31, WeWork); Reed Hastings (37, Netflix); Reid Hoffman (36, LinkedIn); Jack Ma (35, Alibaba); Jeff Bezos (30, Amazon); Jerry Sanders (33, AMD); Marc Benioff (35, Salesforce); Ross Perot (32, EDS); Peter Norton (39, Norton); Larry Ellison (33, Oracle); Mitch Kapor (32, Lotus); Leonard Bosack (32, Cisco); Sandy Lerner (29, Cisco); Gordon Moore (39, Intel); Mark Cuban (37, Broadcast.com); Scott Cook (31, Intuit); Nolan Bushnell (29, Atari); Paul Galvin (33, Motorola); Irwin Jacobs (52, Qualcomm); David Duffield (46, PeopleSoft | 64, Workday); Aneel Bhusri (39, Workday); Thomas Siebel (41, Siebel Systems); John McAfee (42, McAfee); Gary Hendrix (32, Symantec); Scott McNealy (28, Sun); Pierre Omidyar (28, eBay); Rich Barton (29, Expedia | 38, Zillow); Jim Clark (38, SGI | 49, Netscape); Charles Wang (32, CA); David Packard (27, HP); Craig Newmark (43, Craigslist); John Warnock (42, Adobe); Robert Noyce (30, Fairchild | 41, Intel); Rod Canion (37, Compaq); Jen-Hsun Huang (30, nVidia); James Goodnight (33, SAS); John Sall (28, SAS); Eli Harari (41, SanDisk); Sanjay Mehrotra (28, SanDisk); Al Shugart (48, Seagate); Finis Conner (34, Seagate); Henry Samueli (37, Broadcom); Henry Nicholas (32, Broadcom); Charles Brewer (36, Mindspring); William Shockley (45, Shockley); Ron Rivest (35, RSA); Adi Shamir (30, RSA); John Walker (32, Autodesk); Halsey Minor (30, CNet); David Filo (28, Yahoo); Jeremy Stoppelman (27, Yelp); Eric Lefkofsky (39, Groupon); Andrew Mason (29, Groupon); Markus Persson (30, Mojang); David Hitz (28, NetApp); Brian Lee (28, Legalzoom); Demis Hassabis (34, DeepMind); Tim Westergren (35, Pandora); Martin Lorentzon (37, Spotify); Ashar Aziz (44, FireEye); Kevin O'Connor (36, DoubleClick); 
Ben Silbermann (28, Pinterest); Evan Sharp (28, Pinterest); Steve Kirsch (38, Infoseek); Stephen Kaufer (36, TripAdvisor); Michael McNeilly (28, Applied Materials); Eugene McDermott (52, Texas Instruments); Richard Egan (43, EMC); Gary Kildall (32, Digital Research); Hasso Plattner (28, SAP); Robert Glaser (32, Real Networks); Patrick Byrne (37, Overstock.com); Marc Lore (33, Diapers.com); Ed Iacobucci (36, Citrix Systems); Ray Noorda (55, Novell); Tom Leighton (42, Akamai); Daniel Lewin (28, Akamai); Diane Greene (43, VMWare); Mendel Rosenblum (36, VMWare); Michael Mauldin (35, Lycos); Tom Anderson (33, MySpace); Chris DeWolfe (37, MySpace); Mark Pincus (41, Zynga); Caterina Fake (34, Flickr); Stewart Butterfield (31, Flickr | 36, Slack); Kevin Systrom (27, Instagram); Adi Tatarko (37, Houzz); Brian Armstrong (29, Coinbase); Pradeep Sindhu (43, Juniper); Peter Thiel (31, PayPal | 37, Palantir); Jay Walker (42, Priceline.com); Bill Coleman (48, BEA Systems); Evan Goldberg (35, NetSuite); Fred Luddy (48, ServiceNow); Michael Baum (41, Splunk); Nir Zuk (33, Palo Alto Networks); David Sacks (36, Yammer); Jack Smith (28, Hotmail); Sabeer Bhatia (28, Hotmail); Chad Hurley (28, YouTube); Andy Rubin (37, Danger | 41, Android); Rodney Brooks (36, iRobot); Jeff Hawkins (35, Palm); Tom Gosner (39, DocuSign); Niklas Zennström (37, Skype); Janus Friis (27, Skype); George Kurtz (40, CrowdStrike); Trip Hawkins (28, EA); Gabe Newell (33, Valve); David Bohnett (38, Geocities); Bill Gross (40, GoTo.com/Overture); Subrah Iyar (38, WebEx); Eric Yuan (41, Zoom); Min Zhu (47, WebEx); Bob Parsons (47, GoDaddy); Wilfred Corrigan (43, LSI); Joe Parkinson (33, Micron); Aart J. 
de Geus (32, Synopsys); Patrick Byrne (37, Overstock); Matthew Prince (34, Cloudflare); Ben Uretsky (28, DigitalOcean); Tom Preston-Werner (28, GitHub); Louis Borders (48, Webvan); John Moores (36, BMC Software); Vivek Ranadivé (40, Tibco); Pony Ma (27, Tencent); Robin Li (32, Baidu); Liu Qiangdong (29, JD.com); Lei Jun (40, Xiaomi); Ren Zhengfei (38, Huawei); Arkady Volozh (36, Yandex); Hiroshi Mikitani (34, Rakuten); Morris Chang (56, Taiwan Semi); Cheng Wei (29, Didi Chuxing); James Liang (29, Ctrip); Zhang Yiming (29, ByteDance);
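I'm trying to write a regular expression in which each match group corresponds to one of these founders. I can capture 136 of the 144 entries, but I'm stuck on how to capture the founders whose entries contain pipes (Elon Musk, David Duffield, Rich Barton, Robert Noyce, etc.). Here is an example: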
Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal);
I know I can escape the pipe with \|, but even wrapping the "paren part" with * doesn't seem to work.
Here is the regular expression I wrote:
([A-Za-zé'.\/\s+-]+{2})\s+\(([0-9]+),\s+([A-Za-z0-9\s+.-\|]+\s?)\);
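(I dropped the trailing semicolon so that I can run the match on each piece after calling split(";") on the file contents.)
I've put a simple repro here: https://github.com/arthurcolle/founders
Here is the code inline, in case you'd rather not fetch the very simple repro: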
rgx = /([A-Za-zé'.\/\s+-]+{2})\s+\(([0-9]+),\s+([A-Za-z0-9\s+.-\|]+\s?)\)/
FOUNDERS_FILE = "/Users/stochastic-thread/founders/founders.txt"
file = File.read(FOUNDERS_FILE)
items = file.split(";")
items.each {|item|
  matched = rgx.match(item)
  if matched and matched.size == 4
    group = "#{matched[1]},#{matched[2]},#{matched[3]}\n"
    puts group
    File.open("founders.csv", mode: "a") do |f|
      f.write(group)
    end
  end
}
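Is there a regular expression that matches each "founder-company" group, given that a founder may have founded several companies, each with its own age (in the specific format above, as in the Elon Musk example)? (The ö character is Unicode, so I don't think I can really match it: when I put it in the name character class, I'm told that multibyte characters don't work.)
I know I could find the entries that the regex fails to match and handle them with a regex that only matches the parenthesized format, or even split again on the pipes, but I'm trying to find one "perfect regex" for this.
The question asks only for the founders to be matched, so at first I have not included their companies. Later, however, I discuss one possible way of organizing all of the information.
Use String#scan with the following regular expression, which I've defined in free-spacing mode to make it self-documenting.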
r = /
    (?<=\A|;\s)  # match the beginning of the string or a semi-colon
                 # followed by a whitespace char in a positive lookbehind
    [\p{L} ]+    # match one or more Unicode letters or spaces
    (?=\s\()     # match a whitespace followed by "(" in a positive lookahead
    /x           # free-spacing regex definition mode
str = "Paul Graham (31, Viaweb); Jan Koum (33, WhatsApp); Brian Acton (37, WhatsApp); " +
      "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); Garrett Camp (30, Uber); " +
      "Travis Kalanick (32, Uber);"
str.scan(r)
#=> ["Paul Graham", "Jan Koum", "Brian Acton", "Elon Musk", "Garrett Camp",
# "Travis Kalanick"]
Written conventionally, the regular expression is as follows.
/(?<=\A|; )[\p{L} ]+(?= \()/
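The character class above admits only letters and spaces, so names in the full list that contain other characters (Kevin O'Connor, Aart J. de Geus, Tom Preston-Werner) would not be returned. The following is a minimal sketch of one way to cover them, assuming that periods, apostrophes and hyphens are the only extra characters needed; name_rgx and sample are names introduced here for illustration. Because \p{L} matches any Unicode letter, the ö in Zennström needs no special treatment.

```ruby
# Assumption: names may also contain ".", "'" and "-" (hyphen last, so it is literal).
name_rgx = /(?<=\A|; )[\p{L} .'-]+(?= \()/

sample = "Kevin O'Connor (36, DoubleClick); Aart J. de Geus (32, Synopsys); " \
         "Tom Preston-Werner (28, GitHub); Niklas Zennström (37, Skype);"

names = sample.scan(name_rgx)
puts names
```

If other punctuation turns up in the data, it would have to be added to the character class as well.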
If the other information is wanted as well, one might construct a hash such as the following.
r = /
    (?<=\A|;\s)  # match the beginning of the string or a semi-colon
                 # followed by a whitespace char in a positive lookbehind
    [\p{L} ]+    # match one or more Unicode letters or spaces
    \([^)]+      # match a "(" followed by > 0 characters other than ")"
    /x
h = str.scan(r).
        map { |s| s.split(/ \(/) }.
        each_with_object({}) do |(name, startups),h|
          h[name] = startups.split(/ *\| */).map do |s|
            age, co = s.split(/, +/)
            { age: age.to_i, co: co }
          end
        end
  #=> {"Paul Graham"    =>[{:age=>31, :co=>"Viaweb"}],
  #    "Jan Koum"       =>[{:age=>33, :co=>"WhatsApp"}],
  #    "Brian Acton"    =>[{:age=>37, :co=>"WhatsApp"}],
  #    "Elon Musk"      =>[{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
  #                        {:age=>27, :co=>"PayPal"}],
  #    "Garrett Camp"   =>[{:age=>30, :co=>"Uber"}],
  #    "Travis Kalanick"=>[{:age=>32, :co=>"Uber"}]}
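Since the question ultimately writes comma-separated lines to founders.csv, here is a hedged sketch of a variant that yields one row per founder-company pair instead of a hash. It is not the regex used above: entry_rgx is a hypothetical pattern with two capture groups (the name and the text inside the parentheses), the name class again assumes that ".", "'" and "-" are the only extra characters, and the sample string is abbreviated.

```ruby
# Assumption: a name starts with a letter; group 2 is everything between "(" and ")".
entry_rgx = /(\p{L}[\p{L} .'-]*) \(([^)]+)\)/

str = "Paul Graham (31, Viaweb); Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); " \
      "Rich Barton (29, Expedia | 38, Zillow);"

# With capture groups, String#scan returns one [name, startups] pair per match.
rows = str.scan(entry_rgx).flat_map do |name, startups|
  startups.split(/ *\| */).map do |s|
    age, co = s.split(/, +/, 2)  # limit of 2 keeps any later commas in the company
    [name, age.to_i, co]
  end
end

rows.each { |row| puts row.join(",") }
```

Each row could then be appended to a file in the same "name,age,company" layout that the question's code produces.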
One can then easily perform calculations such as the following.
h.each_with_object(Hash.new { |h,k| h[k] = [] }) do |(name, cos),g|
  cos.each { |co| g[co[:co]] << name }
end
  #=> {"Viaweb"=>["Paul Graham"],
  #    "WhatsApp"=>["Jan Koum", "Brian Acton"],
  #    "Tesla"=>["Elon Musk"],
  #    "SpaceX"=>["Elon Musk"],
  #    "PayPal"=>["Elon Musk"],
  #    "Uber"=>["Garrett Camp", "Travis Kalanick"]}
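Two further calculations of the same kind, offered only as a sketch: the youngest founding age per founder, and the founders with more than one company. The hash below is a shortened, hand-written copy of h so that the snippet runs on its own; youngest and serial are names introduced here.

```ruby
# Hand-written subset of the hash h built above (assumed shape: name => [{age:, co:}, ...]).
h = {
  "Paul Graham"  => [{ age: 31, co: "Viaweb" }],
  "Elon Musk"    => [{ age: 32, co: "Tesla" }, { age: 31, co: "SpaceX" },
                     { age: 27, co: "PayPal" }],
  "Garrett Camp" => [{ age: 30, co: "Uber" }]
}

# Youngest age at which each founder started a company.
youngest = h.transform_values { |cos| cos.map { |c| c[:age] }.min }

# Founders listed with more than one company.
serial = h.select { |_name, cos| cos.size > 1 }.keys

p youngest
p serial
```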
Written conventionally, the regular expression used here is as follows.
/(?<=\A|; )[\p{L} ]+\([^\)]+/
The steps in the calculation of h are as follows.
a = str.scan(r)
  #=> ["Paul Graham (31, Viaweb", "Jan Koum (33, WhatsApp", "Brian Acton (37, WhatsApp",
  #    "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal", "Garrett Camp (30, Uber",
  #    "Travis Kalanick (32, Uber"]
b = a.map { |s| s.split(/ \(/) }
  #=> [["Paul Graham", "31, Viaweb"], ["Jan Koum", "33, WhatsApp"],
  #    ["Brian Acton", "37, WhatsApp"],
  #    ["Elon Musk", "32, Tesla | 31, SpaceX | 27, PayPal"],
  #    ["Garrett Camp", "30, Uber"], ["Travis Kalanick", "32, Uber"]]
h = b.each_with_object({}) do |(name, startups),h|
      h[name] = startups.split(/ *\| */).map do |s|
        age, co = s.split(/, +/)
        { age: age.to_i, co: co }
      end
    end
  #=> <as above>
In computing h from b, at the point when
name = "Elon Musk"
startups = "32, Tesla | 31, SpaceX | 27, PayPal"
h = {"Paul Graham" =>[{:age=>31, :co=>"Viaweb"}],
     "Jan Koum"    =>[{:age=>33, :co=>"WhatsApp"}],
     "Brian Acton" =>[{:age=>37, :co=>"WhatsApp"}]}
the block calculation proceeds as follows.
c = startups.split(/ *\| */)
  #=> ["32, Tesla", "31, SpaceX", "27, PayPal"]
d = c.map do |s|
      age, co = s.split(/, +/)
      { age: age.to_i, co: co }
    end
  #=> [{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
  #    {:age=>27, :co=>"PayPal"}]
h[name] = d
  #=> [{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
  #    {:age=>27, :co=>"PayPal"}]
Now,
h #=> {"Paul Graham"=>[{:age=>31, :co=>"Viaweb"}],
  #    "Jan Koum"   =>[{:age=>33, :co=>"WhatsApp"}],
  #    "Brian Acton"=>[{:age=>37, :co=>"WhatsApp"}],
  #    "Elon Musk"  =>[{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
  #                    {:age=>27, :co=>"PayPal"}]}