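Currently I'm trying to use a SAX parser, but about three quarters of the way through the file it freezes completely. I've tried allocating more memory and so on, but saw no improvement.
Is there any way to speed this up? Is there a better approach?
I've stripped it down to the bare bones, so I now have the following code, and when run from the command line it still isn't as fast as I'd like.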
Running it with "java -Xms4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error at around 700000 pages.
Main:

    import java.util.ArrayList;

    public class Read {
        public static void main(String[] args) {
            // Holds every parsed page in memory at once
            ArrayList<Page> pages = XMLManager.getPages();
        }
    }
XMLManager:

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.SAXException;

    public class XMLManager {
        public static ArrayList<Page> getPages() {
            ArrayList<Page> pages = null;
            SAXParserFactory factory = SAXParserFactory.newInstance();
            try {
                SAXParser parser = factory.newSAXParser();
                File file = new File("..\\enwiki-20140811-pages-articles.xml");
                PageHandler pageHandler = new PageHandler();
                parser.parse(file, pageHandler);
                pages = pageHandler.getPages();
            } catch (ParserConfigurationException | SAXException | IOException e) {
                e.printStackTrace();
            }
            return pages;
        }
    }
PageHandler:

    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class PageHandler extends DefaultHandler {
        // Matches the first 75 words of the cleaned text
        private static final Pattern FIRST_75_WORDS = Pattern.compile("([\\S]+\\s*){1,75}");

        private ArrayList<Page> pages = new ArrayList<>();
        private Page page;
        private StringBuilder stringBuilder;
        private boolean idSet = false;

        public PageHandler() {
            super();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            stringBuilder = new StringBuilder();
            if (qName.equals("page")) {
                page = new Page();
                idSet = false;
            } else if (qName.equals("redirect")) {
                if (page != null) {
                    page.setRedirecting(true);
                }
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (page != null && !page.isRedirecting()) {
                if (qName.equals("title")) {
                    page.setTitle(stringBuilder.toString());
                } else if (qName.equals("id")) {
                    // Only the first <id> inside a page is the page id
                    if (!idSet) {
                        page.setId(Integer.parseInt(stringBuilder.toString()));
                        idSet = true;
                    }
                } else if (qName.equals("text")) {
                    String articleText = stringBuilder.toString();
                    articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); // remove references
                    articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); // remove links underneath headings
                    articleText = articleText.replaceAll("(?s)==See also==.+", " "); // remove everything after see also
                    articleText = articleText.replaceAll("\\|", " "); // separate multiple links
                    articleText = articleText.replaceAll("\\n", " "); // remove new lines
                    articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); // remove all non alphanumeric except dashes and spaces
                    articleText = articleText.trim().replaceAll(" +", " "); // convert all multiple spaces to 1 space
                    Matcher matcher = FIRST_75_WORDS.matcher(articleText);
                    if (matcher.find()) {
                        page.setSummaryText(matcher.group());
                    } else {
                        page.setSummaryText("None");
                    }
                    page.setText(articleText);
                } else if (qName.equals("page")) {
                    pages.add(page);
                    page = null;
                }
            } else {
                page = null;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            stringBuilder.append(ch, start, length);
        }

        public ArrayList<Page> getPages() {
            return pages;
        }
    }
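Solution
Your parsing code is probably working fine, but the volume of data you are loading is most likely too large to hold in memory in that ArrayList.
You need some sort of pipeline that passes the data on to its actual destination without ever storing all of it in memory at once.
Here is roughly what I have sometimes done in this sort of situation.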
Create an interface for processing a single element:

    public interface PageProcessor {
        void process(Page page);
    }
Provide an implementation of it to the PageHandler through its constructor:

    import java.io.File;
    import java.io.IOException;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.SAXException;

    public class Read {
        public static void main(String[] args) {
            XMLManager.load(new PageProcessor() {
                @Override
                public void process(Page page) {
                    // Obviously you want to do something other than just printing,
                    // but I don't know what that is...
                    System.out.println(page);
                }
            });
        }
    }

    public class XMLManager {
        public static void load(PageProcessor processor) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            try {
                SAXParser parser = factory.newSAXParser();
                File file = new File("pages-articles.xml");
                PageHandler pageHandler = new PageHandler(processor);
                parser.parse(file, pageHandler);
            } catch (ParserConfigurationException | SAXException | IOException e) {
                e.printStackTrace();
            }
        }
    }
Send the data to this processor instead of putting it in the list:

    public class PageHandler extends DefaultHandler {
        private final PageProcessor processor;
        private Page page;
        private StringBuilder stringBuilder;
        private boolean idSet = false;

        public PageHandler(PageProcessor processor) {
            this.processor = processor;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            // Unchanged from your implementation
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Unchanged from your implementation
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (page != null && !page.isRedirecting()) {
                if (qName.equals("title")) {
                    // Elided: the title, id and text branches need no change
                } else if (qName.equals("page")) {
                    // Hand the finished page off instead of adding it to a list
                    processor.process(page);
                    page = null;
                }
            } else {
                page = null;
            }
        }
    }
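Because the handler above leaves several bodies out, a compact end-to-end version may help show the hand-off in action. The following is only a sketch under stated assumptions: the Page class here is a simplified stand-in with two hypothetical fields, the names StreamingPagesDemo, StreamingPageHandler and parse are invented for illustration, and the input is a tiny in-memory document rather than the real dump.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StreamingPagesDemo {

    /** Simplified stand-in for the question's Page class (hypothetical fields). */
    static final class Page {
        String title;
        boolean redirecting;
    }

    /** Same callback shape as the PageProcessor interface in the answer. */
    interface PageProcessor {
        void process(Page page);
    }

    /** Hands each finished page to the processor instead of storing it. */
    static final class StreamingPageHandler extends DefaultHandler {
        private final PageProcessor processor;
        private final StringBuilder text = new StringBuilder();
        private Page page;

        StreamingPageHandler(PageProcessor processor) {
            this.processor = processor;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            text.setLength(0); // reuse one buffer for every element
            if (qName.equals("page")) {
                page = new Page();
            } else if (qName.equals("redirect") && page != null) {
                page.redirecting = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (page == null) {
                return;
            }
            if (qName.equals("title")) {
                page.title = text.toString();
            } else if (qName.equals("page")) {
                if (!page.redirecting) {
                    processor.process(page); // hand off; the handler keeps no reference
                }
                page = null;
            }
        }
    }

    /** Parses an in-memory document; a real run would pass the dump file instead. */
    static void parse(String xml, PageProcessor processor)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new StreamingPageHandler(processor));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki>"
                + "<page><title>Alpha</title></page>"
                + "<page><title>Beta</title><redirect title=\"Alpha\"/></page>"
                + "<page><title>Gamma</title></page>"
                + "</mediawiki>";
        List<String> titles = new ArrayList<>();
        parse(xml, p -> titles.add(p.title));
        System.out.println(titles); // prints [Alpha, Gamma]
    }
}
```

The demo collects titles into a list only so there is something to print. With the full dump, the processor would write each page to its destination (a database, a file, an index) so that memory use stays roughly constant.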
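Of course, you could make your interface handle chunks of multiple records rather than just one, and have the PageHandler collect pages locally in a smaller list, periodically sending that list off for processing and then clearing it.
Or (perhaps better) you could implement the PageProcessor interface as defined here and build the buffering logic into that implementation, so that it sends the data on for further handling in chunks.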
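The second option could look something like the sketch below. It is an illustration rather than a definitive implementation: BufferingPageProcessor, its batchSize parameter, the Consumer-based sink and the flush() method are all names assumed here, and Page is again a simplified stand-in for the question's class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BufferingDemo {

    /** Simplified stand-in for the question's Page class (hypothetical). */
    static final class Page {
        final String title;

        Page(String title) {
            this.title = title;
        }
    }

    /** Same callback shape as the PageProcessor interface in the answer. */
    interface PageProcessor {
        void process(Page page);
    }

    /** Collects pages into fixed-size batches and forwards each full batch. */
    static final class BufferingPageProcessor implements PageProcessor {
        private final int batchSize;
        private final Consumer<List<Page>> sink;
        private List<Page> buffer = new ArrayList<>();

        BufferingPageProcessor(int batchSize, Consumer<List<Page>> sink) {
            if (batchSize < 1) {
                throw new IllegalArgumentException("batchSize must be >= 1");
            }
            this.batchSize = batchSize;
            this.sink = sink;
        }

        @Override
        public void process(Page page) {
            buffer.add(page);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        /** Sends any buffered pages on; call once more after parsing ends. */
        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            List<Page> batch = buffer;
            buffer = new ArrayList<>(); // fresh list, so the sink may keep the batch
            sink.accept(batch);
        }
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BufferingPageProcessor processor =
                new BufferingPageProcessor(2, batch -> batchSizes.add(batch.size()));
        for (String title : new String[] {"A", "B", "C", "D", "E"}) {
            processor.process(new Page(title));
        }
        processor.flush(); // deliver the final partial batch
        System.out.println(batchSizes); // prints [2, 2, 1]
    }
}
```

Keeping the buffering in the processor means the PageHandler stays unchanged. The one thing to remember is the final flush() after parser.parse returns, otherwise the last partial batch is never delivered.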