I'm using Spark 1.6.1, and I have a DataFrame like this:
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|
......
I want to get a DataFrame like the one below by using the cube operation
(grouped by all the fields, but with only the "os_name", "country", and "app_ver" fields cubed):
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|cnt|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|  9|
|    test_home|scene_enter|        test_home|   null|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test| 35|
|    test_home|scene_enter|        test_home|android|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test| 98|
|    test_home|scene_enter|        test_home|android|     KR|   null|__OTHERS__|  false|   test|   test|   test|101|
|    test_home|scene_enter|        test_home|   null|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test|301|
|    test_home|scene_enter|        test_home|   null|     KR|   null|__OTHERS__|  false|   test|   test|   test|225|
|    test_home|scene_enter|        test_home|android|   null|   null|__OTHERS__|  false|   test|   test|   test|312|
|    test_home|scene_enter|        test_home|   null|   null|   null|__OTHERS__|  false|   test|   test|   test|521|
......
I tried the following, but it looks slow and ugly...
var cubed = df
  .cube($"scene_id", $"action_id", $"classifier", $"country", $"os_name", $"app_ver",
        $"p0value", $"p1value", $"p2value", $"p3value", $"p4value")
  .count
  .where("scene_id IS NOT NULL AND action_id IS NOT NULL AND classifier IS NOT NULL AND " +
         "p0value IS NOT NULL AND p1value IS NOT NULL AND p2value IS NOT NULL AND " +
         "p3value IS NOT NULL AND p4value IS NOT NULL")
Is there a better solution?
Please excuse my bad English.. ^^;
Thanks in advance..
Solution
I believe you cannot avoid the problem completely, but there is a simple trick that reduces its scale. The idea is to replace all the columns that should not be marginalized with a single placeholder; since cube generates a grouping set for every subset of the cubed columns, cubing one placeholder plus a few columns instead of all eleven cuts the number of grouping sets dramatically.
For example, if you have a DataFrame:
val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f")
and you are interested in a cube marginalized by d and e and grouped by a..c, you can define a substitute for a..c as:
import org.apache.spark.sql.functions.struct
import sqlContext.implicits._  // in Spark 1.6 the $ interpolator comes from the SQLContext implicits

// alias here may not work in Spark 1.6
val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")
and the cube:
val cubed = Seq($"d", $"e")

// If there is a problem with aliasing rest, it can be done here instead.
val tmp = df.cube(rest.alias("rest") +: cubed: _*).count
A quick filter and select should handle the rest:
tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)
with a result like:
+---+---+---+----+----+-----+
|  a|  b|  c|   d|   e|count|
+---+---+---+----+----+-----+
|  1|  2|  3|null|   5|    1|
|  1|  2|  3|null|null|    1|
|  1|  2|  3|   4|   5|    1|
|  1|  2|  3|   4|null|    1|
+---+---+---+----+----+-----+
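Mapped back onto the DataFrame from the question, the same trick would look roughly like this. This is only a sketch under a few assumptions: df is the original DataFrame, sqlContext is its Spark 1.6 SQLContext, and the names grouped, cubedCols and result are mine; only os_name, country and app_ver are cubed, everything else is folded into one placeholder struct:

import org.apache.spark.sql.functions.struct
import sqlContext.implicits._

// Fold every column that should only be grouped (never marginalized)
// into a single placeholder struct.
val grouped = struct(
  $"scene_id", $"action_id", $"classifier",
  $"p0value", $"p1value", $"p2value", $"p3value", $"p4value"
).alias("grouped")

// Only these three columns are actually cubed.
val cubedCols = Seq($"os_name", $"country", $"app_ver")

val tmp = df.cube(grouped +: cubedCols: _*).count

// Drop the rows where the placeholder itself was marginalized,
// then expand it back into the original columns.
val result = tmp
  .where($"grouped".isNotNull)
  .select($"grouped.*" +: cubedCols :+ $"count".alias("cnt"): _*)

This cubes 4 expressions instead of 11, and after the filter only the os_name / country / app_ver combinations remain marginalized, which is the output shape asked for above.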