Suppose I have a DataFrame with columns a, b, and c, and I want to sort the DataFrame by column b in ascending order and by column c in descending order. How do I do this?
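A minimal sketch using pandas sort_values (the sample values here are made up for illustration); the ascending list lines up position-by-position with the by list:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 2, 1], 'c': [5, 9, 7]})

# sort by b ascending, then by c descending
df = df.sort_values(by=['b', 'c'], ascending=[True, False])
print(df)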
I am trying to read an excel file this way :
newFile = pd.ExcelFile("PATH\FileName.xlsx")
ParsedData = pd.io.parsers.ExcelFile.parse(newFile)
which throws an error saying two arguments are expected. I don't know what the second argument is. What I am trying to achieve here is to convert an Excel file to a DataFrame. Am I doing it the right way, or is there another way to do this using pandas?
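For what it's worth, a hedged sketch of the usual one-call route in recent pandas versions (the path and sheet name are placeholders); read_excel wraps ExcelFile and parse into a single step:

import pandas as pd

# sheet_name selects which sheet to load; omit it to get the first sheet
df = pd.read_excel(r"PATH\FileName.xlsx", sheet_name="Sheet1")
print(df.head())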
I have the following example DataFrame:
a    | b    | c    |
1    | 2    | 4    |
0    | null | null |
null | 3    | 4    |
I want to replace the nulls only in the first two columns, "a" and "b":
a | b | c    |
1 | 2 | 4    |
0 | 0 | null |
0 | 3 | 4    |
Here is the code that creates the example DataFrame:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all the null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)
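A note and a hedged sketch: df2.columns[0:1] slices off just the first column name, and select keeps only the selected columns, which is why c disappears. PySpark's fillna accepts a subset argument that limits the replacement without dropping anything:

# replace nulls with 0 only in columns a and b; c is left untouched
df2 = df2.fillna(0, subset=["a", "b"])
df2.show()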
I want to be able to compute descriptive statistics on data in a pandas DataFrame, but I only care about the duplicated entries. For example, suppose I create the DataFrame:
import pandas as pd
data={'key1':[1,2,3,1,2,3,2,2],'key2':[2,2,1,2,2,4,2,2],'data':[5,6,2,6,1,6,2,8]}
frame=pd.DataFrame(data,columns=['key1','key2','data'])
print(frame)
key1 key2 data
0 1 2 5
1 2 2 6
2 3 1 2
3 1 2 6
4 2 2 1
5 3 4 6
6 2 2 2
7 2 2 8
As you can see, rows 0, 1, 3, 4, 6, and 7 are all duplicates (using 'key1' and 'key2'). However, if I index this DataFrame as follows:
frame[frame.duplicated(['key1','key2'])]
I get:
key1 key2 data
3 1 2 6
4 2 2 1
6 2 2 2
7 2 2 8
(That is, rows 0 and 1 do not show up, because duplicated does not flag the first occurrence of each group as True.)
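As a hedged aside, recent pandas versions let duplicated mark every member of a duplicated group, first occurrences included, via keep=False:

# keep=False flags all rows of each duplicated (key1, key2) group
dupes = frame[frame.duplicated(['key1', 'key2'], keep=False)]
print(dupes)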
That is my first question. My second question concerns how to extract descriptive statistics from this information. Forgetting the missing duplicates for the moment, suppose I want to compute .min() and .max() for the duplicated entries (so that I can get a range). I can use groupby and these methods on the groupby object (here a presumably refers to the duplicated subset selected above), like so:
a.groupby(['key1','key2']).min()
This gives:
           data
key1 key2
1    2    …

Is there a more efficient way to do this? My code reads a text file and extracts all the nouns.
import nltk

with open(fileName) as f:                # open the file
    lines = f.read()                     # read the whole file

sentences = nltk.sent_tokenize(lines)    # tokenize into sentences

nouns = []                               # empty list to hold all nouns
for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if pos in ('NN', 'NNP', 'NNS', 'NNPS'):
            nouns.append(word)
How can I reduce the time complexity of this code? Is there a way to avoid the nested for loops?
Thanks in advance!
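A hedged sketch of one way to cut the overhead: tokenize all sentences first, then tag them in a single batch with nltk.pos_tag_sents rather than calling the tagger once per sentence (fileName is the same placeholder as above):

import nltk

with open(fileName) as f:
    text = f.read()

# tokenize every sentence up front, then tag them all in one batch call
tokenized = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text)]
tagged_sents = nltk.pos_tag_sents(tokenized)

# every Penn Treebank noun tag starts with 'NN' (NN, NNP, NNS, NNPS)
nouns = [word for sent in tagged_sents
              for word, pos in sent
              if pos.startswith('NN')]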
I have a CSV file structured like this:
Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"
I run into two problems when reading this file.
Here is what I tried:
df = (sc.textFile("myFile.csv")
        .map(lambda line: line.split(","))       # split by comma
        .filter(lambda line: len(line) == 2)     # this helped me ignore the first two rows
        .collect())
However, this does not work, because the commas inside the quoted values are read as delimiters, and len(line) returns 4 instead of 2.
I tried another approach:
data = sc.textFile("myFile.csv")
headers = data.take(2) #First two rows to be skipped
My idea was to use a filter rather than reading the headers. However, when I tried to print the headers, I got encoded values:
[\x00A\x00Y\x00 \x00J\x00u\x00l\x00y\x00 \x002\x000\x001\x006\x00]
What is the right way to read this CSV file and skip the first two rows?
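A hedged sketch: drop the first two physical lines by index, then let csv.reader do the splitting so quoted commas survive. (As an aside, the \x00 bytes in the printed header suggest the file is UTF-16-encoded, while textFile assumes UTF-8, so re-saving or decoding the file may also be needed.)

import csv

raw = sc.textFile("myFile.csv")

# zipWithIndex pairs each line with its position; keep everything after
# the first two lines (the title line and the blank row)
lines = (raw.zipWithIndex()
            .filter(lambda pair: pair[1] > 1)
            .map(lambda pair: pair[0]))

# csv.reader consumes an iterable of strings and honors quoting,
# so "1,200" stays one field instead of being split on the comma
parsed = lines.mapPartitions(csv.reader)

header = parsed.first()                          # ['Col1', 'Col2']
rows = parsed.filter(lambda row: row != header)  # drop the column-name row
df = rows.toDF(header)
df.show()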
I have the following example Spark DataFrame:
rdd = sc.parallelize([(1,"19:00:00", "19:30:00", 30), (1,"19:30:00", "19:40:00", 10),(1,"19:40:00", "19:43:00", 3), (2,"20:00:00", "20:10:00", 10), (1,"20:05:00", "20:15:00", 10),(1,"20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
I want to group consecutive rows based on the start and end times. For instance, for the same user_id, …
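A hedged sketch of one common approach to grouping consecutive rows like these, assuming a row continues the previous group when its start_time equals the previous row's end_time: flag group starts with lag, then number the groups with a running sum.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy("start_time")

# a row starts a new group when its start_time does not continue the
# previous row's end_time; a running sum of that flag numbers the groups
sessions = (df
    .withColumn("prev_end", F.lag("end_time").over(w))
    .withColumn("is_new",
                (F.col("prev_end").isNull() |
                 (F.col("prev_end") != F.col("start_time"))).cast("int"))
    .withColumn("group_id", F.sum("is_new").over(w)))

(sessions.groupBy("user_id", "group_id")
    .agg(F.min("start_time").alias("start_time"),
         F.max("end_time").alias("end_time"),
         F.sum("duration").alias("duration"))
    .orderBy("user_id", "group_id")
    .show())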
I read an Excel sheet into a pandas DataFrame this way:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")
The value in the first cell of each column is picked up as the DataFrame's column names. I want to specify my own column names instead. How can I do that?
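Two hedged options (the column names here are made up): pass names when parsing, or overwrite the columns afterwards. Note that header=None would keep the sheet's first row as data, so skiprows=1 discards the original header row.

# option 1: supply names up front and drop the original header row
df = xl.parse("Sheet1", header=None, names=["col_a", "col_b"], skiprows=1)

# option 2: parse as before, then rename
df = xl.parse("Sheet1")
df.columns = ["col_a", "col_b"]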
Here is my code:
import pandas as pd
left = pd.DataFrame({'AID': [1, 2, 3, 4],
                     'D': [2011, 2011, 0, 2011],
                     'R1': [0, 1, 0, 0],
                     'R2': [1, 0, 0, 0]})

right = pd.DataFrame({'AID': [1, 2, 3, 4],
                      'D': [2012, 0, 0, 2012],
                      'R1': [0, 1, 0, 0],
                      'R2': [1, 0, 0, 0]})
result = left.merge(right, how = 'outer')
When I print the result DataFrame, the integer values are now floats:
AID D R1 R2
0 1.0 2011.0 0.0 1.0
1 2.0 2011.0 1.0 0.0
2 3.0 0.0 0.0 0.0
3 4.0 2011.0 0.0 0.0 …
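A hedged explanation: an outer merge can introduce NaN for rows that exist in only one frame, and since NaN is a float, pandas upcasts the integer columns to float64. If filling the gaps with 0 is acceptable, one sketch to get integers back:

result = left.merge(right, how='outer')

# fill the NaN that the outer merge introduced, then cast back to int
result = result.fillna(0).astype(int)
print(result.dtypes)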
I have the following DataFrame:

df = pd.DataFrame(['Male', 'Female', 'Female', 'Unknown', 'Male'], columns=['Gender'])
I want to convert this into a DataFrame with columns 'Male', 'Female', and 'Unknown', where values of 0 and 1 indicate the Gender:
Gender   Male  Female
Male     1     0
Female   0     1
...
To do this, I wrote a function and called it with map:
def isValue(x, value):
    if x == value:
        return 1
    else:
        return 0

for value in df['Gender'].unique():
    df[str(value)] = df['Gender'].map(lambda x: isValue(str(x), str(value)))
This works fine, but is there a better way? Is there a built-in function from sklearn or another package that I can use?
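For what it's worth, pandas ships a one-liner for this: get_dummies builds one indicator column per unique value (0/1 integers, or booleans in newer pandas). sklearn's OneHotEncoder offers the same under a fit/transform API.

import pandas as pd

df = pd.DataFrame(['Male', 'Female', 'Female', 'Unknown', 'Male'], columns=['Gender'])

# one indicator column per category; join keeps the original Gender column
dummies = pd.get_dummies(df['Gender'])
df = df.join(dummies)
print(df)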