I am trying to load a CSV file into a Hive table like so:
CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;
The CSV is comma-delimited and looks like this:
1, "some text, with comma in it", 123, "more text"
This returns corrupted data, since there is a ',' inside the first string.
Is there a way to set a text qualifier, or to make Hive ignore the ',' inside quoted strings?
I cannot change the delimiter of the CSV, because it is pulled from an external source.
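One common way to cope with quoted fields that contain the delimiter is Hive's OpenCSVSerde. A minimal sketch, assuming Hive 0.14 or later where that SerDe is bundled; note that it reads every column as STRING, so cast on read if the INT columns are needed, and mytable_csv is a hypothetical staging table:

-- OpenCSVSerde honours double quotes, so commas inside quoted text survive.
CREATE TABLE mytable_csv (
  num1  STRING,
  text1 STRING,
  num2  STRING,
  text2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/data.csv' OVERWRITE INTO TABLE mytable_csv;

-- Optionally cast back into the typed table from the question:
INSERT OVERWRITE TABLE mytable
SELECT CAST(num1 AS INT), text1, CAST(num2 AS INT), text2 FROM mytable_csv;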
Is there any way we can overwrite an existing local file while copying from HDFS with:
hadoop fs -copyToLocal <HDFS PATH> <local path>
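copyToLocal fails if the local target already exists, so one workaround sketch (the concrete paths below are hypothetical, for illustration only) is to drop the stale local copy first:

# Hypothetical paths, for illustration only.
rm -f /tmp/out/data.csv
hadoop fs -copyToLocal /user/hduser/data.csv /tmp/out/data.csv
# Some newer Hadoop releases also accept a -f (force overwrite) flag on get/copyToLocal;
# check `hadoop fs -help get` for your version before relying on it.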
I want to set up a Hadoop cluster in pseudo-distributed mode. I managed to perform all of the setup steps, including starting the NameNode, DataNode, JobTracker and TaskTracker on my machine.
Then I tried to run some example programs and ran into a java.net.ConnectException: Connection refused error. I went back to the very first steps of running some operations in standalone mode and hit the same problem.
I even triple-checked all the installation steps and have no idea how to fix it. (I am new to Hadoop and a beginner Ubuntu user, so I kindly ask you to "take that into account" when providing any guide or tip.)
This is the error output I keep getting:
hduser@marta-komputer:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
15/02/22 18:23:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/22 18:23:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
java.net.ConnectException: Call From marta-komputer/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at …

To create MapReduce jobs you can use either the old org.apache.hadoop.mapred package or the new org.apache.hadoop.mapreduce package for Mappers, Reducers, Jobs, and so on. The first was marked as deprecated at one point, but that has meanwhile been reverted. Now I wonder whether it is better to use the old mapred package or the new mapreduce package to create a job, and why. Or does it only depend on whether you need things like MultipleTextOutputFormat, which is only available in the old mapred package?
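For comparison, here is a minimal word-count sketch written entirely against the newer org.apache.hadoop.mapreduce API (Job, Mapper, Reducer). It assumes Hadoop 2.x, where Job.getInstance is available, and is only meant to illustrate the new-style wiring, not to claim it is the required choice:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper from the new API: emits (word, 1) for every token in a line.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer from the new API: sums the counts per word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");   // Job factory from the new API
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}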
https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.
When I add this to /etc/profile:
export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py
I can then do the imports listed in the link, with the exception of from hive import ThriftHive, which actually needs to be:
from hive_service import ThriftHive
Next, the port in the example was 10000, which caused the program to hang when I tried it. The default Hive Thrift port is 9083, which stopped the hanging.
So I set it up like this:
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from hive_service import ThriftHive
try:
transport = TSocket.TSocket('<node-with-metastore>', 9083)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ThriftHive.Client(protocol)
transport.open()
client.execute("CREATE TABLE test(c1 int)")
transport.close()
except Thrift.TException, tx:
print '%s' % (tx.message)
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute …Run Code Online (Sandbox Code Playgroud) HDFS/hadoop的默认数据块大小为64MB.磁盘中的块大小通常为4KB.64MB块大小是什么意思? - >这是否意味着从磁盘读取的最小单位是64MB?
如果是,那么这样做有什么好处? - >在HDFS中连续访问大文件很容易吗?
我们可以通过在磁盘中使用原始的4KB块大小来做同样的事情吗?
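As a side note on where that 64MB figure lives: it is a per-file HDFS property, not a disk sector size, and it can be inspected or overridden per upload. A small sketch, assuming Hadoop 2.x where the property is called dfs.blocksize (dfs.block.size in older releases) and using hypothetical paths:

# Show the block size (%o), replication factor (%r) and name (%n) of an existing HDFS file.
hadoop fs -stat "%o %r %n" /user/hduser/somefile
# Upload a file with a 128 MB block size for just this copy, overriding the cluster default.
hadoop fs -D dfs.blocksize=134217728 -put localfile /user/hduser/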
In the shell, I typed gradle cleanJar in the Impatient/part1 directory. The output is below. The error is "class file for org.apache.hadoop.mapred.JobConf not found". Why does the compilation fail?
:clean UP-TO-DATE
:compileJava
Download http://conjars.org/repo/cascading/cascading-core/2.0.1/cascading-core-2.0.1.pom
Download http://conjars.org/repo/cascading/cascading-hadoop/2.0.1/cascading-hadoop-2.0.1.pom
Download http://conjars.org/repo/riffle/riffle/0.1-dev/riffle-0.1-dev.pom
Download http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.pom
Download http://repo1.maven.org/maven2/org/slf4j/slf4j-parent/1.6.1/slf4j-parent-1.6.1.pom
Download http://repo1.maven.org/maven2/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.pom
Download http://conjars.org/repo/thirdparty/jgrapht-jdk1.6/0.8.1/jgrapht-jdk1.6-0.8.1.pom
Download http://repo1.maven.org/maven2/org/codehaus/janino/janino/2.5.16/janino-2.5.16.pom
Download http://conjars.org/repo/cascading/cascading-core/2.0.1/cascading-core-2.0.1.jar
Download http://conjars.org/repo/cascading/cascading-hadoop/2.0.1/cascading-hadoop-2.0.1.jar
Download http://conjars.org/repo/riffle/riffle/0.1-dev/riffle-0.1-dev.jar
Download http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.jar
Download http://repo1.maven.org/maven2/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar
Download http://conjars.org/repo/thirdparty/jgrapht-jdk1.6/0.8.1/jgrapht-jdk1.6-0.8.1.jar
Download http://repo1.maven.org/maven2/org/codehaus/janino/janino/2.5.16/janino-2.5.16.jar
/home/is_admin/lab/cascading/Impatient/part1/src/main/java/impatient/Main.java:50: error: cannot access JobConf
Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );
^
class file for org.apache.hadoop.mapred.JobConf not found
1 error
:compileJava FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for …

The following command works in HiveQL:
insert overwrite directory '/data/home.csv' select * from testtable;
But with Spark SQL I get an error with an org.apache.spark.sql.hive.HiveQl stack trace:
java.lang.RuntimeException: Unsupported language features in query:
insert overwrite directory '/data/home.csv' select * from testtable
Please guide me in writing the export-to-CSV functionality with Spark SQL.
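Since the query parser rejects INSERT OVERWRITE DIRECTORY here, one workaround sketch is to run the query through a HiveContext and write the rows out manually. This assumes Spark 1.3+ built with Hive support; /data/home_csv is a hypothetical output directory, and the naive join does not quote embedded commas:

# Sketch: run the Hive query from Spark and save the rows as comma-separated text.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="export-testtable-csv")
sqlContext = HiveContext(sc)

df = sqlContext.sql("SELECT * FROM testtable")

# Render each Row as one CSV line; None becomes an empty field.
lines = df.rdd.map(lambda row: ",".join("" if c is None else str(c) for c in row))
lines.saveAsTextFile("/data/home_csv")   # hypothetical output dir; one part-* file per partition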
I am looking for a built-in string split function in Hive. For example, given
A|B|C|D|E
I would like a function like split(string input, char delimiter) that returns an array,
so that I get back [A, B, C, D, E].
Does such a built-in split function exist in Hive? I can only see regexp_extract and regexp_replace. I would really like to have indexOf() and split() string functions.
Thanks
Ajay
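For what it's worth, Hive does ship a built-in split(string, pattern) function that returns an array; the pattern is a regular expression, so the pipe has to be escaped. A small sketch against a hypothetical table t with a string column col:

-- split() is built in; the second argument is a regex, hence the escaped pipe.
SELECT split(col, '\\|') FROM t;       -- e.g. ["A","B","C","D","E"]
SELECT split(col, '\\|')[2] FROM t;    -- index into the resulting array (here "C")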
I am new to the Hadoop distributed file system. I have completed a single-node Hadoop installation on my machine, but afterwards, when I try to upload data to HDFS, it gives a Permission Denied error message.
Message from the terminal, with the command:
hduser@ubuntu:/usr/local/hadoop$ hadoop fs -put /usr/local/input-data/ /input
put: /usr/local/input-data (Permission denied)
hduser@ubuntu:/usr/local/hadoop$
After using sudo, and after adding hduser to the sudo users:
hduser@ubuntu:/usr/local/hadoop$ sudo bin/hadoop fs -put /usr/local/input-data/ /inwe
put: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="":hduser:supergroup:rwxr-xr-x
hduser@ubuntu:/usr/local/hadoop$
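Both messages look like permission problems rather than an HDFS malfunction: the first is the local directory /usr/local/input-data not being readable by hduser, and the second appears because sudo runs the command as root, and HDFS then refuses to let root write into a directory owned by hduser. A hedged sketch of one way to resolve it, reusing the paths from the question:

# Give hduser (and everyone else) read access to the local input directory;
# the capital X adds execute only on directories so they can be traversed.
sudo chmod -R a+rX /usr/local/input-data
# Then run the upload as hduser itself, without sudo, so HDFS sees user=hduser.
hadoop fs -put /usr/local/input-data/ /input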