I am trying to load a data set that is larger than the memory available to H2O.
The H2O blog says: "A note on Bigger Data and GC: we do a user-mode swap-to-disk when the Java heap gets too full, i.e., you're using more Big Data than physical DRAM. We won't die with a GC death-spiral, but we will degrade to out-of-core speeds. We'll go as fast as the disk will allow. I've personally tested loading a 12Gb dataset into a 2Gb (32bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a Logistic Regression."
Here is the R code that connects to H2O 3.6.0.8:
h2o.init(max_mem_size = '60m')  # allotting 60 MB to H2O; R is running on a machine with 8 GB of RAM
which gives:
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

Successfully connected to http://127.0.0.1:54321/

R is connected to the H2O cluster:
    H2O cluster uptime:         2 seconds 561 milliseconds
    H2O cluster version:        3.6.0.8
    H2O cluster name:           H2O_started_from_R_RILITS-HWLTP_tkn816
    H2O cluster total nodes:    1
    H2O cluster total memory:   0.06 GB
    H2O cluster total cores:    4
    H2O cluster allowed cores:  2
    H2O cluster healthy:        TRUE

Note:  As started, H2O is limited to the CRAN default of 2 cpus.
       Shut down and restart H2O as shown below to use all your cpus.
           > h2o.shutdown()
           > h2o.init(nthreads = -1)

IP Address: 127.0.0.1
Port      : 54321
Session ID: _sid_b2e0af0f0c62cd64a8fcdee65b244d75
Key Count : 3
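For completeness, here is the whole setup condensed into one snippet, roughly as I run it (the file name dat.csv is just a placeholder for my local 169 MB csv):

library(h2o)

# Start a local H2O instance with a deliberately tiny 60 MB Java heap
h2o.init(max_mem_size = '60m')

# Confirm what the cluster actually got (memory, cores, version)
h2o.clusterInfo()

# Try to import a csv that is much larger than the 60 MB heap
dat.hex <- h2o.importFile('dat.csv')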
I am trying to load a 169 MB csv into H2O:
dat.hex <- h2o.importFile('dat.csv')
which errors out with:
Error in .h2o.__checkConnectionHealth() :
  H2O connection has been severed. Cannot connect to instance at http://127.0.0.1:54321/
  Failed to connect to 127.0.0.1 port 54321: Connection refused
which suggests an out-of-memory error: the connection is refused because the H2O JVM itself appears to have died, taking the REST endpoint at port 54321 down with it.
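To check that the backend really went down rather than the request merely failing, a minimal sketch (assuming the h2o package's h2o.clusterIsUp() helper) is to wrap the import in tryCatch and probe the connection afterwards:

# Sketch: detect whether the H2O JVM went down during the import
result <- tryCatch(
  h2o.importFile('dat.csv'),
  error = function(e) {
    message("Import failed: ", conditionMessage(e))
    NULL
  }
)

if (!h2o.clusterIsUp()) {
  message("H2O instance is no longer reachable; the JVM has most likely crashed.")
}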
Question: If H2O promises to load a data set larger than its memory capacity (the swap-to-disk mechanism described in the blog quote above), is this the correct way to load the data?