I'm trying to wrap my head around these two functions in the Spark SQL documentation:
> def union(other: RDD[Row]): RDD[Row]
> Return the union of this RDD and another one.
> def unionAll(otherPlan: SchemaRDD): SchemaRDD
> Combine the tuples of two RDDs with the same schema, keeping duplicates.
This is not the standard behavior of UNION vs. UNION ALL, as documented in this SO question.
My code, borrowed from the Spark SQL documentation, has the two functions returning the same results.
```scala
scala> case class Person(name: String, age: Int)

scala> import org.apache.spark.sql._

scala> val one = sc.parallelize(Array(Person("Alpha", 1), Person("Beta", 2)))

scala> val two = sc.parallelize(Array(Person("Alpha", 2), Person("Gamma", 3)))

scala> val schemaString = "name age"

scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

scala> val peopleSchemaRDD1 = sqlContext.applySchema(one, schema)

scala> val peopleSchemaRDD2 = sqlContext.applySchema(two, schema)

scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,2], [Gamma,3])

scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,2], [Gamma,3])
```
Why would I prefer one over the other?
Solution
In Spark 1.6, the above version of union was removed, so unionAll was all that remained.
In Spark 2.0, unionAll was renamed to union, with unionAll kept for backward compatibility (I guess).
In any case, neither union (Spark 2.0) nor unionAll (Spark 1.6) performs any deduplication.
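For what it's worth, here is a minimal sketch against the Spark 2.x DataFrame API illustrating that point. The data is made up (the two frames deliberately share a duplicate row, unlike the question's data), and `spark` is just the conventional SparkSession handle, not anything from the question:

```scala
import org.apache.spark.sql.SparkSession

object UnionDemo extends App {
  // Assumption: a plain local SparkSession, purely for demonstration.
  val spark = SparkSession.builder().master("local[*]").appName("union-demo").getOrCreate()
  import spark.implicits._

  val one = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
  val two = Seq(("Alpha", 1), ("Gamma", 3)).toDF("name", "age") // [Alpha,1] duplicates a row of `one`

  // Spark 2.0's union keeps duplicates, i.e. it behaves like SQL's UNION ALL:
  one.union(two).show()    // 4 rows; [Alpha,1] appears twice

  // unionAll is deprecated in 2.0 but kept for compatibility; same result:
  one.unionAll(two).show() // the same 4 rows

  // To recover standard SQL UNION semantics, deduplicate explicitly:
  one.union(two).distinct().show() // 3 rows

  spark.stop()
}
```

An explicit distinct() (or dropDuplicates()) after the union is how you get UNION-style deduplication back; neither name does it for you.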