pat*_*rit · 36 · scala · apache-spark · apache-spark-sql · spark-dataframe · apache-spark-2.0
I looked at the documentation, and it says the following join types are supported:
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
I also looked at the StackOverflow answer on SQL joins, and the top few answers do not mention some of the joins above, e.g. left_semi and left_anti. What do they mean in Spark?
pat*_*rit · 69
Here is a simple illustrative experiment:
import org.apache.spark.sql._

object SparkSandbox extends App {
  private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._
  spark.sparkContext.setLogLevel("ERROR")

  val left = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value")
  val right = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value")

  println("LEFT")
  left.orderBy("id").show()

  println("RIGHT")
  right.orderBy("id").show()

  val joinTypes = Seq("inner", "outer", "full", "full_outer", "left", "left_outer", "right", "right_outer", "left_semi", "left_anti")

  joinTypes foreach { joinType =>
    println(s"${joinType.toUpperCase()} JOIN")
    left.join(right = right, usingColumns = Seq("id"), joinType = joinType).orderBy("id").show()
  }
}
Output:
LEFT
+---+-----+
| id|value|
+---+-----+
| 1| A1|
| 2| A2|
| 3| A3|
| 4| A4|
+---+-----+
RIGHT
+---+-----+
| id|value|
+---+-----+
| 3| A3|
| 4| A4|
| 4| A4_1|
| 5| A5|
| 6| A6|
+---+-----+
INNER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 3| A3| A3|
| 4| A4| A4_1|
| 4| A4| A4|
+---+-----+-----+
OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
FULL JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
FULL_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
LEFT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4_1|
| 4| A4| A4|
+---+-----+-----+
LEFT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 1| A1| null|
| 2| A2| null|
| 3| A3| A3|
| 4| A4| A4_1|
| 4| A4| A4|
+---+-----+-----+
RIGHT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 3| A3| A3|
| 4| A4| A4_1|
| 4| A4| A4|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
RIGHT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
| 3| A3| A3|
| 4| A4| A4|
| 4| A4| A4_1|
| 5| null| A5|
| 6| null| A6|
+---+-----+-----+
LEFT_SEMI JOIN
+---+-----+
| id|value|
+---+-----+
| 3| A3|
| 4| A4|
+---+-----+
LEFT_ANTI JOIN
+---+-----+
| id|value|
+---+-----+
| 1| A1|
| 2| A2|
+---+-----+
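These last two outputs answer the original question: left_semi keeps only the rows of left that have a matching id in right, without bringing in any columns from right and without duplicating rows when there are multiple matches (note that id 4 appears once, even though right matches it twice); left_anti is its complement, keeping only the rows of left with no match in right. They behave like SQL's EXISTS and NOT EXISTS. A minimal SQL sketch, reusing the left and right DataFrames and the spark session from the experiment above (the view names l and r are illustrative):

left.createOrReplaceTempView("l")
right.createOrReplaceTempView("r")

// left_semi equivalent: EXISTS (Spark plans this as a left semi join)
spark.sql("SELECT * FROM l WHERE EXISTS (SELECT 1 FROM r WHERE r.id = l.id)").show()

// left_anti equivalent: NOT EXISTS (planned as a left anti join)
spark.sql("SELECT * FROM l WHERE NOT EXISTS (SELECT 1 FROM r WHERE r.id = l.id)").show()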
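One detail worth noting in the output above: because the experiment joins with usingColumns = Seq("id"), Spark merges the join key into a single id column. With an explicit join expression, both id columns survive; a short sketch, again assuming the left and right DataFrames from above:

// Explicit join condition: the result keeps both `id` columns,
// one from each side, unlike the usingColumns form used in the loop.
left.join(right, left("id") === right("id"), "inner").show()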
I like Pathikrit's example. Here is a possible transcription in Java, using Spark v2 and dataframes, and including the cross join.
package net.jgp.books.sparkInAction.ch12.lab940AllJoins;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * All joins in a single app, inspired by
 * /sf/ask/3219344341/.
 *
 * Used in Spark in Action 2e, http://jgp.net/sia
 *
 * @author jgp
 */
public class AllJoinsApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    AllJoinsApp app = new AllJoinsApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Processing of invoices")
        .master("local")
        .getOrCreate();

    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("value", DataTypes.StringType, false) });

    List<Row> rows = new ArrayList<Row>();
    rows.add(RowFactory.create(1, "A1"));
    rows.add(RowFactory.create(2, "A2"));
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    Dataset<Row> dfLeft = spark.createDataFrame(rows, schema);
    dfLeft.show();

    rows = new ArrayList<Row>();
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    rows.add(RowFactory.create(4, "A4_1"));
    rows.add(RowFactory.create(5, "A5"));
    rows.add(RowFactory.create(6, "A6"));
    Dataset<Row> dfRight = spark.createDataFrame(rows, schema);
    dfRight.show();

    String[] joinTypes = new String[] {
        "inner", // v2.0.0. default
        "cross", // v2.2.0
        "outer", // v2.0.0
        "full", // v2.1.1
        "full_outer", // v2.1.1
        "left", // v2.1.1
        "left_outer", // v2.0.0
        "right", // v2.1.1
        "right_outer", // v2.0.0
        "left_semi", // v2.0.0, was leftsemi before v2.1.1
        "left_anti" // v2.1.1
    };

    for (String joinType : joinTypes) {
      System.out.println(joinType.toUpperCase() + " JOIN");
      Dataset<Row> df = dfLeft.join(
          dfRight,
          dfLeft.col("id").equalTo(dfRight.col("id")),
          joinType);
      df.orderBy(dfLeft.col("id")).show();
    }
  }
}
I will put this example in the chapter 12 repository of "Spark in Action", 2e.
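One caveat about the "cross" entry in the loop above: since it is still passed an equality condition, it returns the same rows as "inner". For a true Cartesian product there is the crossJoin method (available since Spark 2.1 in both the Scala and Java APIs); a sketch using the Scala DataFrames from the first answer:

// Cartesian product, no join condition: 4 x 5 = 20 rows.
left.crossJoin(right).show(20)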