This is the input:
7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.
This is the expected output:
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO'),
('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]
I tried this:
re.findall(r'(?=\s(\d+)\s(1\..*?)\s\d+\s1\.)', txt, re.DOTALL)
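But of course this is not the correct solution: the regex has to match "(\d+) 1." at the start of a record, but not the "1 1." that follows "PRub.".
What can I do to make it work?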
How about this:

In [1]: s='7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.'

In [2]: import re

In [3]: re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
Out[3]: 
['1 1. STR1 STR2 3. 12345 4. 0876 9. NO',
 '2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
 '3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
 '4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO']

For your exact output I would do something like this:

In [4]: ns = re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)

In [5]: [tuple(f.split(' ',1)) for f in ns]
Out[5]: 
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
 ('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
 ('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
 ('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]

There may well be a better way to do this, but my Python-fu is not as strong as my regex-fu.
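
One possible way to skip the split step is to capture the record number and the record body as two groups, so that re.findall returns tuples directly. The following is only a sketch, assuming Python 3 and assuming (as the look-ahead in this answer does) that a new record always starts with a number, a space, and "1." followed by either an uppercase letter or the end of the string. The names txt, record and pairs are made up for illustration. Unlike the pattern above, \d+ also allows multi-digit record numbers.

```python
import re

# Input from the question, split over several lines for readability.
txt = (
    "7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO "
    "2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES "
    "10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO "
    "3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES "
    "10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO "
    "4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES "
    "10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1."
)

record = re.compile(
    r"(?<=\s)(\d+)\s"               # group 1: record number, preceded by whitespace
    r"(1\..*?)"                     # group 2: record body, starting at field "1."
    r"(?=\s\d+\s1\.(?:$|\s[A-Z]))"  # stop before the next "N 1." record header
)

# With two capture groups, findall returns a list of (number, body) tuples.
pairs = record.findall(txt)
for number, body in pairs:
    print(number, body)
```

The "1 1." after "PRub." is not treated as a record header because it is followed by a digit (1000) rather than an uppercase letter. If a real record could start with a digit or a lowercase letter, this heuristic would need to change.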

The regex explained:

(?<=\s)   # Positive look-behind: require a leading space but don't include it
\d        # Match a digit (the record number)
.*?       # Match everything up to the next record (lazy)
          # The following positive look-ahead is the key. It matches the start of
          # each new record, i.e.
          #   2 1. S
          #   3 1. S
          #   4 1. Q
          #   0 1.$
          # Look-arounds assert a match but don't consume any characters.
(?=\s\d\s\d[.](?=$|\s[A-Z]))
          # The same look-ahead, broken down piece by piece:
(?=       # positive look-ahead 1
\s        # space
\d        # digit
\s        # space
\d        # digit
[.]       # period
(?=       # positive look-ahead 2
$         # end of string
|         # OR
\s[A-Z]   # space followed by an uppercase letter
)         # close look-ahead 2
)         # close look-ahead 1
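
The commented breakdown above can also be kept in the code itself by compiling the pattern with re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is just a sketch of that idea: the names RECORD, sample and matches are made up, and sample is a shortened string invented for illustration, not the input from the question.

```python
import re

RECORD = re.compile(r"""
    (?<=\s)          # preceded by whitespace (not part of the match)
    \d               # record number (a single digit)
    .*?              # lazily take everything up to the next record
    (?=              # look-ahead 1: the next record header follows
        \s\d\s\d[.]  #   space, digit, space, digit, period
        (?=          #   look-ahead 2: and after that comes either
            $        #     the end of the string
          |          #     or
            \s[A-Z]  #     a space and an uppercase letter
        )
    )
""", re.VERBOSE)

# Shortened, made-up sample in the same shape as the question's input.
sample = 'x 1 1. AA 2. NO PRub. 1 1. 1000 XX 2. NO 2 1. BB 3. 7 0 1.'
matches = RECORD.findall(sample)
print(matches)
```

Note that \d matches a single digit only, so this pattern as written assumes record numbers below 10.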