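I have two DataFrames that I want to merge on a column. However, because of alternate spellings, differing amounts of whitespace, and the presence or absence of diacritics, I would like the merge to succeed as long as the values are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib).
Suppose one DataFrame has the following data: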
from pandas import DataFrame

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
       number
one         1
two         2
three       3
four        4
five        5
df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
       letter
one         a
too         b
three       c
fours       d
five        e
I would then like to get this resulting DataFrame:
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
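One possible way to approach this, sketched with the standard-library `difflib` that the question mentions: map each label in `df2` to its closest label in `df1`, then do an ordinary exact join. The helper name `closest_label` and the `cutoff=0.6` threshold are assumptions made for illustration, not part of the question.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame([[1], [2], [3], [4], [5]],
                   index=['one', 'two', 'three', 'four', 'five'],
                   columns=['number'])
df2 = pd.DataFrame([['a'], ['b'], ['c'], ['d'], ['e']],
                   index=['one', 'too', 'three', 'fours', 'five'],
                   columns=['letter'])


def closest_label(label, candidates, cutoff=0.6):
    """Return the candidate most similar to `label`, or None if none clears `cutoff`.

    Hypothetical helper; the cutoff value is an assumption to tune per dataset.
    """
    matches = difflib.get_close_matches(label, list(candidates), n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Rewrite df2's index with the nearest df1 label, then join on the exact index.
df2_aligned = df2.copy()
df2_aligned.index = [closest_label(label, df1.index) for label in df2.index]
merged = df1.join(df2_aligned)
print(merged)
```

This sketch assumes every `df2` label has exactly one close counterpart in `df1`. With real data, two labels could map to the same target or to none at all, so the mapping is worth inspecting before joining.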
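
I am trying to update a CSV file with some student data supplied by another source, but they have formatted their CSV data slightly differently from ours.
Students need to be matched on three criteria: their name, their class, and finally the first few letters of the location, so for the first few students in Class B, "Dumpt" is actually "Dumpton Park".
When a match is found...
Here is some sample data: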
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,
"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"
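The matching rule described above (same class, same name, and the other file's location being a prefix of ours) could be sketched with pandas roughly as follows. This uses a few rows from the two samples. It also assumes that only empty `Score` and `No` cells should be filled, which is inferred from the sample data rather than stated by the asker.

```python
import io

import pandas as pd

ours_csv = """Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,London,Jemma,x,x,,
"""
theirs_csv = '''"Class","Location","Student","Scorecard","Number"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Sarah","23.5","12"
"Class B","Lond","Jemma","32.2","0"
'''

# Read everything as text so values such as "23.5" and "0" are copied verbatim.
ours = pd.read_csv(io.StringIO(ours_csv), dtype=str)
theirs = pd.read_csv(io.StringIO(theirs_csv), dtype=str)

# Pair rows on class and name first; the location is checked afterwards.
theirs = theirs.rename(columns={'Student': 'Name'})
candidates = ours.reset_index().merge(theirs, on=['Class', 'Name'])

# Keep only pairs where their abbreviated location is a prefix of our full one.
is_prefix = pd.Series(
    [full.startswith(abbr)
     for full, abbr in zip(candidates['Local'], candidates['Location'])],
    index=candidates.index, dtype=bool)
matched = candidates[is_prefix]

# Assumption: fill only cells that are currently empty, leaving existing values.
for target, source in [('Score', 'Scorecard'), ('No', 'Number')]:
    new_values = pd.Series(matched[source].to_numpy(), index=matched['index'])
    ours[target] = ours[target].fillna(new_values)

print(ours.to_csv(index=False))
```

The sketch assumes each of our rows matches at most one of theirs. Duplicate matches would need to be resolved before the `fillna` step.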
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299 …

I basically want to do a find-and-replace using Python.
However, I want to say: if a cell contains something, replace it with what I want.
I know about
str.replace('safsd','something else')
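However, I am not sure how to specify getting rid of everything else in that cell. Do I use *? I am not very familiar with Python, but I know that in the bash shell * refers to everything...
I have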
df['Description']
which can contain 'optiplex 9010 for classes and research', which I want to replace with just 'optiplex 9010'. Or 'macbook air 11 with configurations...etc.', which I want to become just 'macbook air 11'.
My goal is...
if df['Description'].str.contains('macbook air 11')
then df['Description'].str.replace('(not sure what I put in here)', 'macbook air 11')
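Any help or ideas?
Thanks!
**Additional information that may be useful...
I am working with thousands of different user inputs, so the "Description" of what someone purchased varies completely in context, wording, structure, and so on. I could go into Excel manually, filter for everything containing "optiplex 9010", replace it all with the simple description, and then do the same for macbook, etc.
I figure there may be an easier way using pandas/Python with .str.contains and .str.replace.
Hope the extra information helps! Let me know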
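
One way the "if the cell contains X, replace the whole cell with X" goal might be expressed is with a boolean mask from `.str.contains` and an assignment through `.loc`, so no wildcard is needed. The `canonical_names` list below is a hypothetical name introduced for this sketch, and the third sample row is made up to show a non-matching cell.

```python
import pandas as pd

df = pd.DataFrame({'Description': [
    'optiplex 9010 for classes and research',
    'macbook air 11 with configurations...etc.',
    'something unrelated',  # made-up row that should stay unchanged
]})

# Hypothetical list of short descriptions to collapse the longer ones into.
canonical_names = ['optiplex 9010', 'macbook air 11']

for name in canonical_names:
    # True for rows whose text contains the name (literal match, ignoring case).
    contains_name = df['Description'].str.contains(
        name, case=False, regex=False, na=False)
    # Overwrite the whole cell, not just the matching part.
    df.loc[contains_name, 'Description'] = name

print(df)
```

On the question about `*`: `.str.replace` with `regex=True` takes regular expressions, where "anything" is written `.*` and not a bare `*`. A pattern such as `.*macbook air 11.*` would therefore be an alternative to the mask approach.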