Java / Scala中的内存olap / pivot表中是否有数据结构/库？

相关问题

这个问题非常相关,但是2岁：In memory OLAP engine in Java

背景

我想在内存中从给定的表格数据集创建一个像表格一样的数据透视表

例如按婚姻状况计算的年龄(行是年龄,列是婚姻状况).

>输入：人员列表,包含年龄和一些布尔属性(例如已婚),
>所需的输出：人数,年龄(行)和已婚(列)

我尝试了什么(Scala)

case class Person(val age:Int,val isMarried:Boolean)

...
val people:List[Person] = ... //

val peopleByAge = people.groupBy(_.age)  //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status

我设法以天真的方式做到这一点,首先按年龄分组,然后根据婚姻状况进行计数,并输出结果,然后我foldRight汇总

TreeMap(peopleByAge.toSeq: _*).map(x => {
    val age = x._1
    val rows = x._2
    val numMarried = rows.count(_.isMarried())
    val numNotMarried = rows.length - numMarried
    (age,numMarried,numNotMarried)
}).foldRight(List[FinalResult]())(row,list) => {
     val cumMarried = row._2+ 
        (if (list.isEmpty) 0 else list.last.cumMarried) 
     val cumNotMarried = row._3 + 
        (if (list.isEmpty) 0 else l.last.cumNotMarried) 
     list :+ new FinalResult(row._1,row._2,row._3,cumMarried,cumNotMarried) 
}.reverse

我不喜欢上面的代码,它不高效,难以阅读,我确信有更好的方法.

问题

我怎么groupBy“两个”？以及如何对每个子组进行计数,例如

How many people are exactly 30 years old and married?

另一个问题是,我如何做一个总计,回答这个问题：

How many people above 30 are married?

编辑：

谢谢你所有的好答案.

为了澄清,我希望输出包含一个带有以下列的“表”

>年龄(升序)
> Num已婚
> Num未婚
>跑完全婚
>未结婚总数

不仅要回答这些特定的问题,还要制作一份能够回答所有此类问题的报告.

解决方法

这是一个更冗长的选项,但这是以通用方式而不是使用严格的数据类型.你当然可以使用泛型来使这个更好,但我认为你明白了.

/** Creates a new pivot structure by finding correlated values 
  * and performing an operation on these values
  *
  * @param accuOp the accumulator function (e.g. sum,max,etc)
  * @param xCol the "x" axis column
  * @param yCol the "y" axis column
  * @param accuCol the column to collect and perform accuOp on
  * @return a new Pivot instance that has been transformed with the accuOp function
  */
def doPivot(accuOp: List[String] => String)(xCol: String,yCol: String,accuCol: String) = {
  // create list of indexes that correlate to x,y,accuCol
  val colsIdx = List(xCol,yCol,accuCol).map(headers.getOrElse(_,1))

  // group by x and y,sending the resulting collection of
  // accumulated values to the accuOp function for post-processing
  val data = body.groupBy(row => {
    (row(colsIdx(0)),row(colsIdx(1)))
  }).map(g => {
    (g._1,accuOp(g._2.map(_(colsIdx(2)))))
  }).toMap

  // get distinct axis values
  val xAxis = data.map(g => {g._1._1}).toList.distinct
  val yAxis = data.map(g => {g._1._2}).toList.distinct

  // create result matrix
  val newRows = yAxis.map(y => {
    xAxis.map(x => {
      data.getOrElse((x,y),"")
    })
  })

 // collect it with axis labels for results
 Pivot(List((yCol + "/" + xCol) +: xAxis) :::
   newRows.zip(yAxis).map(x=> {x._2 +: x._1}))
}

我的Pivot类型非常基本：

class Pivot(val rows: List[List[String]]) {

  val headers = rows.head.zipWithIndex.toMap
  val body    = rows.tail
  ...
}

要测试它,你可以这样做：

val marriedP = Pivot(
  List(
    List("Name","Age","Married"),List("Bill","42","TRUE"),List("Heloise","47",List("Thelma","34","FALSE"),List("Bridget",List("Robert",List("Eddie","TRUE")

  )
)

def accum(values: List[String]) = {
    values.map(x => {1}).sum.toString
}
println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age","Married","Married"))

产量：

Name     Age      Married  
Bill     42       TRUE     
Heloise  47       TRUE     
Thelma   34       FALSE    
Bridget  47       TRUE     
Robert   42       FALSE    
Eddie    42       TRUE     

Married/Age  47           42           34           
TRUE         2            2                         
FALSE                     1            1

好处是你可以使用currying传递值的任何函数,就像在传统的excel数据透视表中一样.

更多信息可以在这里找到：https://github.com/vinsonizer/pivotfun

Java / Scala中的内存olap / pivot表中是否有数据结构/库？

解决方法

猜你在找的Java相关文章