sql – Including null values in an Apache Spark join

I would like to include null values in an Apache Spark join. By default, Spark does not include rows with null keys.

Here is the default Spark behavior:

    val numbersDf = Seq(
      ("123"), ("456"), (null), ("")
    ).toDF("numbers")

    val lettersDf = Seq(
      ("123", "abc"), ("456", "def"), (null, "zzz"), ("", "hhh")
    ).toDF("numbers", "letters")

    val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))

Here is the output of joinedDf.show():

    +-------+-------+
    |numbers|letters|
    +-------+-------+
    |    123|    abc|
    |    456|    def|
    |       |    hhh|
    +-------+-------+

This is the output I would like:

    +-------+-------+
    |numbers|letters|
    +-------+-------+
    |    123|    abc|
    |    456|    def|
    |       |    hhh|
    |   null|    zzz|
    +-------+-------+

Solution

Spark provides a special NULL-safe equality operator:
    numbersDf
      .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
      .drop(lettersDf("numbers"))

    +-------+-------+
    |numbers|letters|
    +-------+-------+
    |    123|    abc|
    |    456|    def|
    |   null|    zzz|
    |       |    hhh|
    +-------+-------+

Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6, it required a Cartesian product (SPARK-11111 – Fast null-safe join).
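The semantics of the null-safe operator can be sketched in plain Python (a simulation for illustration, not Spark code): with ordinary SQL equality, a comparison involving NULL yields NULL, so rows with NULL keys never match; with `<=>`, two NULLs compare as equal.

```python
def eq(a, b):
    # SQL `=`: any comparison with NULL yields NULL (treated as non-match)
    if a is None or b is None:
        return None
    return a == b

def eq_null_safe(a, b):
    # SQL `<=>` / IS NOT DISTINCT FROM: two NULLs compare as equal
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

numbers = ["123", "456", None, ""]
letters = [("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")]

# Inner join with plain equality drops the NULL-keyed row...
plain = [(n, l) for n in numbers for (k, l) in letters if eq(n, k) is True]
# ...while the null-safe version keeps it.
safe = [(n, l) for n in numbers for (k, l) in letters if eq_null_safe(n, k)]

print(plain)  # (None, 'zzz') is missing
print(safe)   # (None, 'zzz') is included
```

This mirrors the two outputs shown above: the plain join produces three rows, the null-safe join four.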

In Spark 2.3.0 or later, you can use Column.eqNullSafe in PySpark:

    numbers_df = sc.parallelize([
        ("123", ), ("456", ), (None, ), ("", )
    ]).toDF(["numbers"])

    letters_df = sc.parallelize([
        ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
    ]).toDF(["numbers", "letters"])

    numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))

    +-------+-------+-------+
    |numbers|numbers|letters|
    +-------+-------+-------+
    |    456|    456|    def|
    |   null|   null|    zzz|
    |       |       |    hhh|
    |    123|    123|    abc|
    +-------+-------+-------+

and %<=>% in SparkR:

    numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
    letters_df <- createDataFrame(data.frame(
      numbers = c("123", "456", NA, ""),
      letters = c("abc", "def", "zzz", "hhh")
    ))

    head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))

      numbers numbers letters
    1     456     456     def
    2    <NA>    <NA>     zzz
    3                     hhh
    4     123     123     abc

With SQL (Spark 2.2.0 or later), you can use IS NOT DISTINCT FROM (assuming the DataFrames have been registered as the temporary views numbers and letters):

    SELECT * FROM numbers JOIN letters
    ON numbers.numbers IS NOT DISTINCT FROM letters.numbers

This can be used with the DataFrame API as well:

    numbersDf.alias("numbers")
      .join(lettersDf.alias("letters"))
      .where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
