Reading XML with Pig

I am trying to read data from an XML file with Pig, but my output is incomplete.

Input file:

    <document>
    <url>htp://www.abc.com/</url>
    <category>Sports</category>
    <usercount>120</usercount>
    <reviews>
    <review>good site</review>
    <review>This is Avg site</review>
    <review>Bad site</review>
    </reviews>
    </document>

The code I am using is:

    register 'Desktop/piggybank-0.11.0.jar';
    -- Load each <document>...</document> element as one chararray record
    A = load 'input3' using org.apache.pig.piggybank.storage.XMLLoader('document') as (data:chararray);

    B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(data,'(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>.*?<usercount>([^>]*?)</usercount>.*?<reviews>.*?<review>\\s*([^>]*?)\\s*</review>.*?</reviews>.*?</document>')) as (url:chararray,category:chararray,usercount:int,review:chararray);

The output I get is:

    (htp://www.abc.com/,Sports,120,good site)

This output is incomplete: only the first review is returned. Can someone tell me what I am missing?

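Ha, finally got it working using CROSS. I am using XPath here, but you can use a regular expression instead if you prefer. I find XPath simpler and cleaner than regex, and I think you will see that too. Don't forget to replace testXML.xml with your own XML file.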

XPath approach:

    -- Assumes piggybank.jar has already been registered
    DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
    A = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('document') as (x:chararray);
    B = FOREACH A GENERATE XPath(x,'document/url'),XPath(x,'document/category'),XPath(x,'document/usercount');
    C = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('review') as (review:chararray);
    D = FOREACH C GENERATE XPath(review,'review');
    E = cross B,D;
    dump E;
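
The XPath version above leaves its output columns unnamed, whereas the question asked for named fields (url, category, usercount, review). Below is a minimal sketch of one way to name them. It assumes piggybank.jar is already registered and that the input is the same testXML.xml. The relation F, the AS aliases and the (int) cast are illustrative additions, not part of the original answer.

```pig
-- Sketch only: assumes piggybank.jar has already been registered
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- One record per <document> element
A = LOAD 'testXML.xml' USING org.apache.pig.piggybank.storage.XMLLoader('document') AS (x:chararray);
B = FOREACH A GENERATE
        XPath(x, 'document/url')             AS url,
        XPath(x, 'document/category')        AS category,
        (int) XPath(x, 'document/usercount') AS usercount;  -- XPath returns chararray, so cast

-- One record per <review> element
C = LOAD 'testXML.xml' USING org.apache.pig.piggybank.storage.XMLLoader('review') AS (review:chararray);
D = FOREACH C GENERATE XPath(review, 'review') AS review;

E = CROSS B, D;

-- After CROSS the fields are prefixed with their source relation (B::, D::)
F = FOREACH E GENERATE B::url AS url, B::category AS category,
                       B::usercount AS usercount, D::review AS review;
DESCRIBE F;
DUMP F;
```

Naming the fields in B and D, rather than only at the end, keeps the `B::` and `D::` prefixes readable once the two relations are crossed.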

Regex approach:

    -- The pattern has three capture groups, so the schema names three fields
    A = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('document') as (x:chararray);
    B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>.*?<usercount>([^>]*?)</usercount>.*?</document>')) as (url:chararray,category:chararray,usercount:int);
    C = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('review') as (review:chararray);
    D = FOREACH C GENERATE FLATTEN(REGEX_EXTRACT_ALL(review,'<review>([^>]*?)</review>'));
    E = cross B,D;
    dump E;

Output:

    (htp://www.abc.com/,Sports,120,Bad site)
    (htp://www.abc.com/,Sports,120,This is Avg site)
    (htp://www.abc.com/,Sports,120,good site)

Isn't this what you were expecting?
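
A note on the CROSS step: CROSS builds the Cartesian product of its inputs, so every review row is paired with every document row. That matches the expected output only when the input holds a single <document>, as the sample file does. With several documents, reviews would be paired with documents they do not belong to. The sketch below is one way to check that assumption before relying on CROSS. It assumes piggybank.jar is already registered, and the aliases G and N are illustrative.

```pig
-- Sketch only: assumes piggybank.jar has already been registered
A = LOAD 'testXML.xml' USING org.apache.pig.piggybank.storage.XMLLoader('document') AS (x:chararray);

-- Put all document records into a single group and count them
G = GROUP A ALL;
N = FOREACH G GENERATE COUNT(A) AS n_documents;
DUMP N;
```

If the count is greater than 1, the CROSS-based scripts above will mix reviews between documents, and the reviews would need to be extracted per document instead.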
