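I have an NLP task and I am using scikit-learn. From the tutorials I have read, the text has to be vectorized, and the resulting vectorization model is then used to feed a classification algorithm. Suppose I have some texts that I want to vectorize as follows: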
from sklearn.feature_extraction.text import CountVectorizer
corpus =['''Computer science is the scientific and
practical approach to computation and its applications.'''
#this is another opinion
'''It is the systematic study of the feasibility, structure,
expression, and mechanization of the methodical
procedures that underlie the acquisition,
representation, processing, storage, communication of,
and access to information, whether such information is encoded
as bits in a computer memory or transcribed in genes and
protein structures in a biological cell.'''
#anotherone
'''A computer scientist …

Just a quick question. With pandas, we can create a DataFrame and set the header like this:
import pandas as pd
df = pd.read_csv('/file/path', sep='|', names = ['A','B'])
With PySpark:
text_file = sc.textFile('path/file')
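On the other hand, although I have already read the Spark SQL documentation, I could not find how to set the header and the delimiter, or how to name each column of the dataset the way pandas does. Any idea how to assign a name to each column using PySpark SQL?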
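One possible way to name the columns, sketched under assumptions: the helper name `read_delimited_with_names` is hypothetical, and the code assumes the spark-csv package (`com.databricks.spark.csv`) is available to the Spark session, as on Spark 1.x. It reads the file without a header and then renames the columns with `DataFrame.toDF`, which plays the role of the `names=` argument in pandas.

```python
def read_delimited_with_names(sql_context, path, names, delimiter="|"):
    """Load a delimited text file and assign column names.

    Assumption: `sql_context` is a pyspark.sql.SQLContext and the
    spark-csv data source is on the classpath.
    """
    reader = (
        sql_context.read
        .format("com.databricks.spark.csv")
        .options(header="false", delimiter=delimiter)
    )
    # toDF(*names) returns a new DataFrame with the columns renamed in order.
    return reader.load(path).toDF(*names)


# Usage sketch (needs a running SparkContext `sc`, so it is left commented out):
# from pyspark.sql import SQLContext
# sqlContext = SQLContext(sc)
# df = read_delimited_with_names(sqlContext, 'path/file', ['A', 'B'])
```

On Spark 2.0 and later, the built-in `spark.read.csv(path, sep='|', header=False)` followed by `.toDF('A', 'B')` should do the same without the external package.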
Update:
Following @CafeFeed's suggestion, I tried the following:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_2 = sqlContext.read.format('com.databricks.spark.csv').options(header='false', delimiter='|').load('path')
df_2
However, I got this exception:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-31-ad726583541b> in <module>()
2 sqlContext = SQLContext(sc)
3
----> 4 df_2 = sqlContext.read.format('com.databricks.spark.csv').options(header='false', delimiter='|').load('/Users/user/GitHub/PySpark-Notes/ml-100k/u.user')
5 df_2
/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/sql/readwriter.pyc in load(self, path, format, schema, **options)
119 self.options(**options)
120 if path is not None:
--> 121 return self._df(self._jreader.load(path))
122 else:
123 …

How can I append the name of a pandas column to every value in its rows? For example, suppose I have the following pandas DataFrame:
COL1 COL2 COL3 COL4 ... COL_N
True NO 90.9 2 ... 2018-05-17 20:14:00
True NO 89.11 2 ... 2018-05-17 20:15:32
............
True NO 67.89 1 ... 2018-05-17 20:18:45
How can I turn it into:
COL1 COL2 COL3 COL4 ... COL_N
True (COL1) NO (COL2) 90.9 (COL3) 2 (COL4) ... 2018-05-17 20:14:00 (COL_N)
True (COL1) NO (COL2) 89.11 (COL3) 2 (COL4) ... 2018-05-17 20:15:32 (COL_N)
............
True (COL1) NO (COL2) 67.89 (COL3) 1 (COL4) ... 2018-05-17 20:18:45 (COL_N)
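I want to do this because I am analyzing some patterns within each row. One difficulty is that I am working with a large pandas DataFrame (500x100000). Any idea how, given a pandas DataFrame, to append its column names to each value?

From a sensor, there is a data stream that looks like a series of tuples: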
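A minimal sketch of one way this could be done, assuming it is acceptable for every value to become a string. The function name `tag_values_with_column_names` and the three-column sample frame are made up for illustration; they only mirror the shape of the example above.

```python
import pandas as pd


def tag_values_with_column_names(df):
    """Return a copy of df where each value is 'value (COLUMN_NAME)'."""
    as_text = df.astype(str)  # every value becomes its string form
    # apply() passes one column (a Series) at a time; col.name is its label.
    return as_text.apply(lambda col: col + " (" + str(col.name) + ")")


# Small illustrative frame (hypothetical data shaped like the example above).
df = pd.DataFrame({
    "COL1": [True, True],
    "COL2": ["NO", "NO"],
    "COL3": [90.9, 89.11],
})
tagged = tag_values_with_column_names(df)
print(tagged)
```

The work is done one column at a time with vectorized string concatenation, so no Python-level loop over individual cells is needed. How it behaves at the size mentioned in the question has not been measured here.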
sensor: (-0.560303, -0.627686, 0.467468)
sensor: (-0.561829, -0.626160, 0.466125)
sensor: (-0.556091, -0.623352, 0.471497)
sensor: (-0.558411, -0.625977, 0.468811)
sensor: (-0.557312, -0.626587, 0.468262)
sensor: (-0.557800, -0.625854, 0.465820)
sensor: (-0.563599, -0.624512, 0.464722)
sensor: (-0.555847, -0.623230, 0.467163)
sensor: (-0.557861, -0.621033, 0.468811)
sensor: (-0.555420, -0.625061, 0.470520)
sensor: (-0.559082, -0.626221, 0.475891)
sensor: (-0.559814, -0.625977, 0.466309)
sensor: (-0.561768, -0.624756, 0.467163)
sensor: (-0.551941, -0.628906, 0.469055)
sensor: (-0.556946, -0.626465, 0.471313)
sensor: (-0.558533, -0.626038, 0.469421)
sensor: (-0.557922, -0.625061, 0.467285)
sensor: (-0.562622, -0.623657, 0.469971)
sensor: (-0.554443, -0.625977, 0.465759)
sensor: (-0.559265, -0.626282, …
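
Since the rest of this question is cut off, the following is only a sketch of a first step that most processing of such a stream would need: turning each `sensor: (x, y, z)` line into a tuple of floats. The function name `parse_sensor_line` is hypothetical, and the line format is assumed to be exactly as shown above.

```python
import re

# Matches 'sensor: ( ... )' and captures everything between the parentheses.
_SENSOR_LINE = re.compile(r"^sensor:\s*\(([^)]*)\)\s*$")


def parse_sensor_line(line):
    """Parse 'sensor: (x, y, z)' into a tuple of three floats.

    Returns None for lines that do not match, such as a truncated reading.
    """
    match = _SENSOR_LINE.match(line.strip())
    if match is None:
        return None
    parts = [p.strip() for p in match.group(1).split(",")]
    if len(parts) != 3:
        return None
    try:
        return tuple(float(p) for p in parts)
    except ValueError:
        return None


lines = [
    "sensor: (-0.560303, -0.627686, 0.467468)",
    "sensor: (-0.561829, -0.626160, 0.466125)",
]
readings = [r for r in map(parse_sensor_line, lines) if r is not None]
print(readings)
```

Incomplete lines are skipped rather than raising, which seems a reasonable default for a live stream where the last line may still be arriving.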