fs.hdfs.impl.disable.cache makes SparkSQL very slow

lee*_*wah 5 hadoop hive hdfs apache-spark-sql

This question is related to another one: Hive/Hadoop intermittent failure: Unable to move source to destination

We found that setting fs.hdfs.impl.disable.cache to true avoids the "Unable to move source ... Filesystem closed" problem.

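However, we also found that SparkSQL queries became very slow: queries that used to finish in a few seconds now take 30 to 40 seconds or more, even when the query is very simple, such as reading a small table.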

Is this normal?

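My understanding is that with fs.hdfs.impl.disable.cache set to true, FileSystem#get() always calls createFileSystem() instead of returning a cached FileSystem. This prevents a single FileSystem object from being shared by multiple clients, which does make sense, because it stops, for example, two callers of FileSystem#get() from closing each other's filesystem.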
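The sharing problem described above can be illustrated with a toy model. This is only a sketch and not Hadoop code: ToyFileSystem, ToyFileSystemFactory, and the hdfs://namenode:8020 URI are hypothetical stand-ins, and the real cache is assumed to behave analogously for two callers that request the same URI.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the cached vs. uncached behaviour. NOT Hadoop code:
// every class and URI here is a hypothetical stand-in.
class ToyFileSystemCacheDemo {

    /** Hypothetical stand-in for a filesystem handle that can be closed. */
    static final class ToyFileSystem {
        private boolean open = true;

        void close() {
            open = false;
        }

        boolean isOpen() {
            return open;
        }
    }

    /** Hypothetical factory that mirrors the two branches of FileSystem#get(). */
    static final class ToyFileSystemFactory {
        private final Map<String, ToyFileSystem> cache = new HashMap<>();
        private final boolean disableCache;

        ToyFileSystemFactory(boolean disableCache) {
            this.disableCache = disableCache;
        }

        ToyFileSystem get(String uri) {
            if (disableCache) {
                // Cache disabled: every caller gets its own fresh instance.
                return new ToyFileSystem();
            }
            // Cache enabled: callers asking for the same URI share one instance.
            return cache.computeIfAbsent(uri, key -> new ToyFileSystem());
        }
    }

    /** Returns true if caller B's handle is still open after caller A closes its own. */
    static boolean survivesOtherCallersClose(boolean disableCache) {
        ToyFileSystemFactory factory = new ToyFileSystemFactory(disableCache);
        ToyFileSystem a = factory.get("hdfs://namenode:8020"); // assumed URI
        ToyFileSystem b = factory.get("hdfs://namenode:8020");
        a.close();
        return b.isOpen();
    }

    public static void main(String[] args) {
        System.out.println("cache enabled:  " + survivesOtherCallersClose(false)); // false
        System.out.println("cache disabled: " + survivesOtherCallersClose(true));  // true
    }
}
```

With the cache enabled, both callers hold the same object, so one caller's close() also closes the handle the other is still using. With the cache disabled, each caller owns its instance.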

(See, for example, this discussion.)

This setting would be expected to slow things down, but probably not by this much.
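As a rough way to see where the extra cost would come from, the sketch below counts how many times a filesystem object gets created for a given number of get() calls. It is a toy model with hypothetical names (ToyInitCountDemo, the hdfs://namenode:8020 URI), not Hadoop code, and it says nothing about how expensive a single initialization actually is on a real cluster.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: counts object creations only. NOT Hadoop code; all names are hypothetical.
class ToyInitCountDemo {

    /** Returns how many instances are created for the given number of get() calls. */
    static int initializationsFor(int calls, boolean disableCache) {
        Map<String, Object> cache = new HashMap<>();
        int[] initializations = {0}; // array so the lambda below can update the counter
        String uri = "hdfs://namenode:8020"; // assumed URI

        for (int i = 0; i < calls; i++) {
            if (disableCache) {
                // Cache disabled: every call pays the creation cost again.
                initializations[0]++;
            } else {
                // Cache enabled: only the first call for this URI creates an instance.
                cache.computeIfAbsent(uri, key -> {
                    initializations[0]++;
                    return new Object();
                });
            }
        }
        return initializations[0];
    }

    public static void main(String[] args) {
        System.out.println("cache enabled:  " + initializationsFor(100, false)); // 1
        System.out.println("cache disabled: " + initializationsFor(100, true));  // 100
    }
}
```

If a single query triggers many FileSystem#get() calls, the per-call setup cost is multiplied accordingly, so a slowdown of this size would suggest that each initialization is itself slow.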

From: hadoop-source-reading

/**
 * Returns the FileSystem for this URI's scheme and authority. The scheme of
 * the URI determines a configuration property name,
 * <tt>fs.<i>scheme</i>.class</tt> whose value names the FileSystem class.
 * The entire URI is passed to the FileSystem instance's initialize method.
 */
public static FileSystem get(URI uri, Configuration conf)
        throws IOException {
    String scheme = uri.getScheme();
    String authority = uri.getAuthority();

    if (scheme == null) { // no scheme: use default FS
        return get(conf);
    }

    if (authority == null) { // no authority
        URI defaultUri = getDefaultUri(conf);
        if (scheme.equals(defaultUri.getScheme()) // if scheme matches
                // default
                && defaultUri.getAuthority() != null) { // & default has
            // authority
            return get(defaultUri, conf); // return default
        }
    }

    String disableCacheName = String.format("fs.%s.impl.disable.cache",
            scheme);
    if (conf.getBoolean(disableCacheName, false)) {
        return createFileSystem(uri, conf);
    }

    return CACHE.get(uri, conf);
}


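Could the slowness point to some other, network-related problem, such as domain name resolution? Any insight into this issue is welcome.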
