Nic*_*ddy 4 python python-3.x pandas
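I am trying to pull a txt file containing two data series into pandas. So far I have tried the variations below, which I took from other posts on Stack Overflow. Each of them reads the file in as a single series only.
The data I am using is available here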
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", delim_whitespace=True, header=None)
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, sep="/t")
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, delimiter=r"\s+")
I'm sure I'm doing something obviously wrong, but I can't see it.
Try using sep=r'\s{2,}' as the separator. It means "treat two or more whitespace characters (spaces or tabs) as the delimiter":
In [28]: df = pd.read_csv(url, sep=r'\s{2,}', engine='python', header=None, names=['ID','Name'])
In [29]: df
Out[29]:
ID Name
0 A000 Cholera due to Vibrio cholerae 01, biovar cholerae
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
5 A0102 Typhoid fever with heart involvement
6 A0103 Typhoid pneumonia
7 A0104 Typhoid arthritis
8 A0105 Typhoid osteomyelitis
9 A0109 Typhoid fever with other complications
10 A011 Paratyphoid fever A
11 A012 Paratyphoid fever B
12 A013 Paratyphoid fever C
13 A014 Paratyphoid fever, unspecified
14 A020 Salmonella enteritis
15 A021 Salmonella sepsis
16 A0220 Localized salmonella infection, unspecified
17 A0221 Salmonella meningitis
18 A0222 Salmonella pneumonia
19 A0223 Salmonella arthritis
20 A0224 Salmonella osteomyelitis
21 A0225 Salmonella pyelonephritis
22 A0229 Salmonella with other localized infection
23 A028 Other specified salmonella infections
24 A029 Salmonella infection, unspecified
.. ... ...
671 B188 Other chronic viral hepatitis
672 B189 Chronic viral hepatitis, unspecified
673 B190 Unspecified viral hepatitis with hepatic coma
674 B1910 Unspecified viral hepatitis B without hepatic coma
675 B1911 Unspecified viral hepatitis B with hepatic coma
676 B1920 Unspecified viral hepatitis C without hepatic coma
677 B1921 Unspecified viral hepatitis C with hepatic coma
678 B199 Unspecified viral hepatitis without hepatic coma
679 B20 Human immunodeficiency virus [HIV] disease
680 B250 Cytomegaloviral pneumonitis
681 B251 Cytomegaloviral hepatitis
682 B252 Cytomegaloviral pancreatitis
683 B258 Other cytomegaloviral diseases
684 B259 Cytomegaloviral disease, unspecified
685 B260 Mumps orchitis
686 B261 Mumps meningitis
687 B262 Mumps encephalitis
688 B263 Mumps pancreatitis
689 B2681 Mumps hepatitis
690 B2682 Mumps myocarditis
691 B2683 Mumps nephritis
692 B2684 Mumps polyneuropathy
693 B2685 Mumps arthritis
694 B2689 Other mumps complications
695 B269 Mumps without complication
[696 rows x 2 columns]
Or you can use the read_fwf() method.
Your file is a fixed-width file, so you can use read_fwf; the default parameters are able to infer the column widths here:
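The regex-separator approach can be tried without downloading the file. This is a minimal sketch, assuming the code is padded to eight characters so that at least two spaces separate it from the description. The three sample rows and the variable names `rows` and `sample` are illustrative and not part of the original answer.

```python
import io

import pandas as pd

# Hypothetical sample imitating the layout of the file:
# each code is left-justified and padded to 8 characters.
rows = [
    ("A000", "Cholera due to Vibrio cholerae 01, biovar cholerae"),
    ("A009", "Cholera, unspecified"),
    ("A0100", "Typhoid fever, unspecified"),
]
sample = "\n".join(f"{code:<8}{name}" for code, name in rows)

# Split only on runs of two or more whitespace characters, so the single
# spaces inside the description do not create extra columns.
# A regex separator requires the Python parsing engine.
df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s{2,}",
    engine="python",
    header=None,
    names=["ID", "Name"],
)
print(df)
```

Note that this pattern relies on the gap being at least two characters wide. A row whose code fills the whole field and leaves only one space before the description would not be split, which is one reason to consider read_fwf instead.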
In [106]:
df = pd.read_fwf(r'icd10cm_codes_2017.txt', header=None)
df.head()
Out[106]:
0 1
0 A000 Cholera due to Vibrio cholerae 01, biovar chol...
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
If you already know the names you want for the columns, you can pass them to read_fwf:
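If width inference ever guesses wrong, the widths can be given explicitly. The following is a sketch under the assumption that the code field is eight characters wide (code plus padding) and the description takes the rest of the line. The widths `[8, 100]`, the sample rows, and the code `X123456` are hypothetical, chosen only to show a row with a single space between the two fields.

```python
import io

import pandas as pd

# Hypothetical sample; "X123456" is a made-up seven-character code that
# leaves only one space before its description.
rows = [
    ("A000", "Cholera due to Vibrio cholerae 01, biovar cholerae"),
    ("A0100", "Typhoid fever, unspecified"),
    ("X123456", "Hypothetical seven-character code"),
]
sample = "\n".join(f"{code:<8}{name}" for code, name in rows)

# widths lists the size of each consecutive field; the second width is an
# assumed upper bound on the description length, and shorter lines are fine.
df_fwf = pd.read_fwf(
    io.StringIO(sample),
    widths=[8, 100],
    header=None,
    names=["ID", "Name"],
)
print(df_fwf)
```

Because the split is positional, it does not depend on how many spaces happen to follow the code.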
In [107]:
df = pd.read_fwf(r'C:\Users\alanwo\Downloads\icd10cm_codes_2017.txt', header=None, names=['col1', 'col2'])
df.head()
Out[107]:
col1 col2
0 A000 Cholera due to Vibrio cholerae 01, biovar chol...
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
Or just overwrite the columns attribute after reading:
df.columns = ['col1', 'col2']
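As for why your attempts failed: read_table uses a tab as its default separator, but the file contains only spaces and is fixed-width. Note also that sep="/t" in your second attempt uses a forward slash, so it is not a tab; the tab escape is "\t".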
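The single-series symptom can be reproduced on a small sample. This is an illustrative sketch, assuming space-padded lines like those in the file; the sample rows and variable names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical space-padded sample containing no tab characters.
rows = [
    ("A000", "Cholera due to Vibrio cholerae 01, biovar cholerae"),
    ("A009", "Cholera, unspecified"),
    ("A0100", "Typhoid fever, unspecified"),
]
sample = "\n".join(f"{code:<8}{name}" for code, name in rows)

# read_table splits on tabs by default; with no tabs in the data,
# every line ends up as one field.
one_col = pd.read_table(io.StringIO(sample), header=None)
print(one_col.shape)
```

Each whole line lands in a single column, which matches the behaviour described in the question.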