相关问题
这个问题非常相关,但是2岁:In memory OLAP engine in Java
背景
我想在内存中从给定的表格数据集创建一个像表格一样的数据透视表
例如按婚姻状况计算的年龄(行是年龄,列是婚姻状况).
>输入:人员列表,包含年龄和一些布尔属性(例如已婚),
>所需的输出:人数,年龄(行)和已婚(列)
我尝试了什么(Scala)
case class Person(val age:Int,val isMarried:Boolean) ... val people:List[Person] = ... // val peopleByAge = people.groupBy(_.age) //only by age val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status
我设法以天真的方式做到这一点,首先按年龄分组,然后根据婚姻状况进行计数,并输出结果,然后我foldRight汇总
TreeMap(peopleByAge.toSeq: _*).map(x => { val age = x._1 val rows = x._2 val numMarried = rows.count(_.isMarried()) val numNotMarried = rows.length - numMarried (age,numMarried,numNotMarried) }).foldRight(List[FinalResult]())(row,list) => { val cumMarried = row._2+ (if (list.isEmpty) 0 else list.last.cumMarried) val cumNotMarried = row._3 + (if (list.isEmpty) 0 else l.last.cumNotMarried) list :+ new FinalResult(row._1,row._2,row._3,cumMarried,cumNotMarried) }.reverse
我不喜欢上面的代码,它不高效,难以阅读,我确信有更好的方法.
问题
我怎么groupBy“两个”?以及如何对每个子组进行计数,例如
How many people are exactly 30 years old and married?
另一个问题是,我如何做一个总计,回答这个问题:
How many people above 30 are married?
编辑:
谢谢你所有的好答案.
为了澄清,我希望输出包含一个带有以下列的“表”
>年龄(升序)
> Num已婚
> Num未婚
>跑完全婚
>未结婚总数
不仅要回答这些特定的问题,还要制作一份能够回答所有此类问题的报告.
解决方法
这是一个更冗长的选项,但这是以通用方式而不是使用严格的数据类型.你当然可以使用泛型来使这个更好,但我认为你明白了.
/** Creates a new pivot structure by finding correlated values * and performing an operation on these values * * @param accuOp the accumulator function (e.g. sum,max,etc) * @param xCol the "x" axis column * @param yCol the "y" axis column * @param accuCol the column to collect and perform accuOp on * @return a new Pivot instance that has been transformed with the accuOp function */ def doPivot(accuOp: List[String] => String)(xCol: String,yCol: String,accuCol: String) = { // create list of indexes that correlate to x,y,accuCol val colsIdx = List(xCol,yCol,accuCol).map(headers.getOrElse(_,1)) // group by x and y,sending the resulting collection of // accumulated values to the accuOp function for post-processing val data = body.groupBy(row => { (row(colsIdx(0)),row(colsIdx(1))) }).map(g => { (g._1,accuOp(g._2.map(_(colsIdx(2))))) }).toMap // get distinct axis values val xAxis = data.map(g => {g._1._1}).toList.distinct val yAxis = data.map(g => {g._1._2}).toList.distinct // create result matrix val newRows = yAxis.map(y => { xAxis.map(x => { data.getOrElse((x,y),"") }) }) // collect it with axis labels for results Pivot(List((yCol + "/" + xCol) +: xAxis) ::: newRows.zip(yAxis).map(x=> {x._2 +: x._1})) }
我的Pivot类型非常基本:
class Pivot(val rows: List[List[String]]) { val headers = rows.head.zipWithIndex.toMap val body = rows.tail ... }
要测试它,你可以这样做:
val marriedP = Pivot( List( List("Name","Age","Married"),List("Bill","42","TRUE"),List("Heloise","47",List("Thelma","34","FALSE"),List("Bridget",List("Robert",List("Eddie","TRUE") ) ) def accum(values: List[String]) = { values.map(x => {1}).sum.toString } println(marriedP + "\n") println(marriedP.doPivot(accum)("Age","Married","Married"))
产量:
Name Age Married Bill 42 TRUE Heloise 47 TRUE Thelma 34 FALSE Bridget 47 TRUE Robert 42 FALSE Eddie 42 TRUE Married/Age 47 42 34 TRUE 2 2 FALSE 1 1
好处是你可以使用currying传递值的任何函数,就像在传统的excel数据透视表中一样.
更多信息可以在这里找到:https://github.com/vinsonizer/pivotfun