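I have an implementation of the k-means algorithm, and I would like to make it faster by using Java 8 streams and multi-core processing.
I have this code in Java 7: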
//Step 2: For each point p:
//find nearest clusters c
//assign the point p to the closest cluster c
for (Point p : points) {
    double minDst = Double.MAX_VALUE;
    int minClusterNr = 1;
    for (Cluster c : clusters) {
        double tmpDst = determineDistance(p, c);
        if (tmpDst < minDst) {
            minDst = tmpDst;
            minClusterNr = c.clusterNumber;
        }
    }
    clusters.get(minClusterNr - 1).points.add(p);
}
//Step 3: For each cluster c
//find the central point of all points p in c
//set c to the center point
ArrayList
I want to use Java 8 with parallel streams to speed up this process.
I experimented a bit and came up with this solution:
points.stream().forEach(p -> {
    minDst = Double.MAX_VALUE; //<- THESE ARE GLOBAL VARIABLES NOW
    minClusterNr = 1; //<- THESE ARE GLOBAL VARIABLES NOW
    clusters.stream().forEach(c -> {
        double tmpDst = determineDistance(p, c);
        if (tmpDst < minDst) {
            minDst = tmpDst;
            minClusterNr = c.clusterNumber;
        }
    });
    clusters.get(minClusterNr - 1).points.add(p);
});
ArrayList
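This solution with streams is considerably faster than the one without. I am wondering whether it already uses multi-core processing. Why else would it suddenly be almost twice as fast?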
without streams: Elapsed time: 202 msec
with streams: Elapsed time: 116 msec
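Would it also be useful to use parallelStream in any of these methods to speed them up further? Right now, all it does is cause ArrayIndexOutOfBoundsException and NullPointerException when I change the stream to stream().parallel().forEach(CODE).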
— Clustering.java —
package algo;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.Random;
import java.util.function.BiFunction;
import graphics.SimpleColorFun;
/**
* An implementation of the k-means-algorithm.
*
— Point.java —
package algo;
public class Point {
    public double x;
    public double y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public Point(double x, double y) {
        this.x = x;
        this.y = y;
    }
}
— Cluster.java —
package algo;
import java.util.ArrayList;
public class Cluster {
public double x;
public double y;
public int clusterNumber;
public ArrayList
— SimpleColorFun.java —
package graphics;
import java.awt.Color;
import java.util.function.BiFunction;
/**
* Simple function for selecting a color for a specific cluster identified by an integer ID.
*
* @author makl,hese
*/
public class SimpleColorFun implements BiFunction
— Main.java — (replace the Stopwatch with some other time-logging mechanism; I took this one from our work environment)
package main;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import javax.imageio.ImageIO;
import algo.Clustering;
import algo.Point;
import eu.lbase.common.util.Stopwatch;
// import persistence.DataHandler;
public class Main {
private static final String OUTPUT_DIR = (new File("./output/withoutStream")).getAbsolutePath() + File.separator;
private static final String OUTPUT_DIR_2 = (new File("./output/withStream")).getAbsolutePath() + File.separator;
public static void main(String[] args) {
Random rng = new Random();
int numPoints = 300;
int seed = 2;
ArrayList
A thread-safe solution might look like this:
points.stream().forEach(p -> {
    Cluster min = clusters.stream()
        .min(Comparator.comparingDouble(c -> determineDistance(p, c))).get();
    // your original code used the clusterNumber to look up the Cluster in
    // the list; I don't know whether this is really necessary
    min = clusters.get(min.clusterNumber - 1);
    // didn't find a better way considering your current code structure
    synchronized (min) {
        min.points.add(p);
    }
});
List
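However, you did not provide enough context to test it. Some questions remain open. For example, you use the clusterNumber of a Cluster instance to look the cluster up again in the clusters list. I don't know whether clusterNumber corresponds to the actual list index of the Cluster instance we already have, in which case the lookup is redundant, or whether it has a different meaning.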
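One way to settle the clusterNumber question is to check it directly. The following is only a minimal sketch: the Cluster class is a stripped-down, hypothetical stand-in for the one in the question, and lookupIsRedundant is a helper name invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ClusterIndexCheckSketch {

    // Hypothetical stand-in: only the field that matters for this check.
    static class Cluster {
        final int clusterNumber;
        Cluster(int clusterNumber) { this.clusterNumber = clusterNumber; }
    }

    // Returns true if every cluster sits at list index clusterNumber - 1.
    // In that case clusters.get(c.clusterNumber - 1) returns c itself,
    // so the extra lookup can be dropped.
    static boolean lookupIsRedundant(List<Cluster> clusters) {
        for (int i = 0; i < clusters.size(); i++) {
            if (clusters.get(i).clusterNumber != i + 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Cluster> clusters = new ArrayList<>();
        for (int n = 1; n <= 3; n++) {
            clusters.add(new Cluster(n));
        }
        System.out.println("lookup redundant: " + lookupIsRedundant(clusters));
    }
}
```

If the check holds for your data, the line min = clusters.get(min.clusterNumber - 1) can simply be removed.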
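I also don't know a better solution than synchronizing on the particular cluster to make the manipulation of its list thread-safe (given your current code structure). This is only needed if you decide to use a parallel stream (i.e. points.parallelStream().forEach(p -> …)); the other operations are not affected.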
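If changing the code structure is an option, one alternative (not the approach shown above, but a different technique) is to let a collector do the grouping, so that no thread ever writes to a shared list. The sketch below makes several assumptions: Point and Cluster are simplified stand-ins for the question's classes, the points field is made reassignable, and squared Euclidean distance stands in for determineDistance, whose real implementation is not shown in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupingAssignSketch {

    // Hypothetical, simplified stand-ins for the question's classes.
    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static class Cluster {
        final double x, y;
        List<Point> points = new ArrayList<>();
        Cluster(double x, double y) { this.x = x; this.y = y; }
    }

    // Assumption: squared Euclidean distance replaces the original determineDistance.
    static double determineDistance(Point p, Cluster c) {
        double dx = p.x - c.x;
        double dy = p.y - c.y;
        return dx * dx + dy * dy;
    }

    // Groups the points by their nearest cluster. The collector merges
    // per-thread partial results itself, so no synchronized block is needed.
    static void assign(List<Point> points, List<Cluster> clusters) {
        Map<Cluster, List<Point>> grouped = points.parallelStream()
            .collect(Collectors.groupingBy(p -> clusters.stream()
                .min(Comparator.comparingDouble((Cluster c) -> determineDistance(p, c)))
                .get()));
        // Back on the calling thread: hand each cluster its points.
        for (Cluster c : clusters) {
            c.points = grouped.getOrDefault(c, new ArrayList<>());
        }
    }

    public static void main(String[] args) {
        List<Cluster> clusters = Arrays.asList(new Cluster(0, 0), new Cluster(10, 10));
        List<Point> points = Arrays.asList(
            new Point(0, 1), new Point(1, 0), new Point(9, 10), new Point(10, 9));
        assign(points, clusters);
        for (Cluster c : clusters) {
            System.out.println("cluster at (" + c.x + ", " + c.y + ") has "
                + c.points.size() + " points");
        }
    }
}
```

The trade-off is that each call builds new lists instead of appending to the existing ones, which fits k-means reasonably well because the assignment is recomputed in every iteration anyway.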
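You now have several streams that you can try both in parallel and sequentially to find out where you actually gain something. Usually, only the outer stream yields a significant benefit, if any…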
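To try this out, a rough comparison of a sequential and a parallel outer stream could look like the sketch below. Everything outside the forEach body is an assumption made for illustration: the simplified Point and Cluster classes, the squared Euclidean determineDistance, the helper timeMillis, and the point and cluster counts. A single timed run like this is only indicative, since JIT warm-up skews the numbers; a benchmark harness such as JMH gives more trustworthy results.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Stream;

class StreamTimingSketch {

    // Hypothetical, simplified stand-ins for the question's classes.
    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static class Cluster {
        final double x, y;
        final List<Point> points = new ArrayList<>();
        Cluster(double x, double y) { this.x = x; this.y = y; }
    }

    // Assumption: squared Euclidean distance replaces the original determineDistance.
    static double determineDistance(Point p, Cluster c) {
        double dx = p.x - c.x;
        double dy = p.y - c.y;
        return dx * dx + dy * dy;
    }

    // The synchronized assignment step; the caller decides whether the
    // outer stream is sequential or parallel.
    static void assign(Stream<Point> points, List<Cluster> clusters) {
        points.forEach(p -> {
            Cluster min = clusters.stream()
                .min(Comparator.comparingDouble((Cluster c) -> determineDistance(p, c)))
                .get();
            synchronized (min) {
                min.points.add(p);
            }
        });
    }

    // Hypothetical helper: wall-clock time of one run, in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        Random rng = new Random(2);
        List<Point> points = new ArrayList<>();
        for (int i = 0; i < 200_000; i++) {
            points.add(new Point(rng.nextDouble(), rng.nextDouble()));
        }
        List<Cluster> clusters = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            clusters.add(new Cluster(rng.nextDouble(), rng.nextDouble()));
        }
        for (boolean parallel : new boolean[] {false, true}) {
            clusters.forEach(c -> c.points.clear());
            long ms = timeMillis(() ->
                assign(parallel ? points.parallelStream() : points.stream(), clusters));
            System.out.println((parallel ? "parallel" : "sequential")
                + " outer stream: " + ms + " msec");
        }
    }
}
```

The inner clusters.stream() is deliberately left sequential here; with only a handful of clusters per point, there is little work to split.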