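I'm new to parsing XML in R. I'm trying to parse XML into a workable data frame. I've tried several of the XPath functions in the XML package, but I can't seem to arrive at the right answer.
Here is my XML: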
<ResidentialProperty>
  <Listing>
    <StreetAddress>
      <StreetNumber>11111</StreetNumber>
      <StreetName>111th</StreetName>
      <StreetSuffix>Avenue Ct</StreetSuffix>
      <StateOrProvince>WA</StateOrProvince>
    </StreetAddress>
    <MLSInformation>
      <ListingStatus Status="Active"/>
      <StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
    </MLSInformation>
    <GeographicData>
      <Latitude>11.111111</Latitude>
      <Longitude>-111.111111</Longitude>
      <County>Pierce</County>
    </GeographicData>
    <SchoolData>
      <SchoolDistrict>Puyallup</SchoolDistrict>
    </SchoolData>
    <View>Territorial</View>
  </Listing>
  <YearBuilt>1997</YearBuilt>
  <InteriorFeatures>Bath Off Master,Dbl Pane/Storm Windw</InteriorFeatures>
  <Occupant>
    <Name>Vacant</Name>
  </Occupant>
  <WaterFront/>
  <Roof>Composition</Roof>
  <Exterior>Brick,Cement Planked,Wood,Wood Products</Exterior>
</ResidentialProperty>
When I run:
ResidentialProperty <- xmlToDataFrame(nodes=getNodeSet(doc,"//ResidentialProperty"))
the values of the child nodes inside the parent node are squashed together:
11111111thAvenue CtWA2015-07-05T23:48:53.41011.111111-111.111111PiercePuyallupTerritorial
If I move down one node, the same thing happens:
11111111thAvenue CtWA
The values of the child nodes are all pasted together.
I also tried a brute-force approach that sort of works:
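A likely reason for the pasted-together output is that the value of a parent element is the concatenated text of all its descendants. The sketch below illustrates this on a standalone copy of the StreetAddress element; the variable names (address, pasted, separate) are assumptions made for the illustration, not part of the original code.

```r
library(XML)

# Standalone copy of the StreetAddress element from the question,
# written without whitespace between tags (illustrative input only)
xml <- paste0(
  "<StreetAddress>",
  "<StreetNumber>11111</StreetNumber>",
  "<StreetName>111th</StreetName>",
  "<StreetSuffix>Avenue Ct</StreetSuffix>",
  "<StateOrProvince>WA</StateOrProvince>",
  "</StreetAddress>"
)
address <- xmlRoot(xmlParse(xml, asText = TRUE))

# The value of the parent is the text of every descendant, run together
pasted <- xmlValue(address)

# Asking each child element separately keeps the values apart;
# the names are set explicitly from each child's element name
kids <- xmlChildren(address)
separate <- setNames(
  vapply(kids, xmlValue, character(1)),
  vapply(kids, xmlName, character(1))
)
```

Here pasted holds the single squashed string, while separate is a named character vector with one entry per child element.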
StreetAddress    <- xmlToDataFrame(nodes = getNodeSet(doc, "//StreetAddress"))
MLSInformation   <- xmlToDataFrame(nodes = getNodeSet(doc, "//MLSInformation"))
GeographicData   <- xmlToDataFrame(nodes = getNodeSet(doc, "//GeographicData"))
SchoolData       <- xmlToDataFrame(nodes = getNodeSet(doc, "//SchoolData"))
YearBuilt        <- xmlToDataFrame(nodes = getNodeSet(doc, "//YearBuilt"))
InteriorFeatures <- xmlToDataFrame(nodes = getNodeSet(doc, "//InteriorFeatures"))
Occupant         <- xmlToDataFrame(nodes = getNodeSet(doc, "//Occupant"))
Roof             <- xmlToDataFrame(nodes = getNodeSet(doc, "//Roof"))
Exterior         <- xmlToDataFrame(nodes = getNodeSet(doc, "//Exterior"))
df <- cbind(StreetAddress, MLSInformation, GeographicData, SchoolData, YearBuilt,
            InteriorFeatures, Occupant, Roof, Exterior)
But some of the column names are not what I expected:
> colnames(df)
 [1] "StreetNumber"     "StreetName"       "StreetSuffix"     "StateOrProvince"  "ListingStatus"
 [6] "StatusChangeDate" "Latitude"         "Longitude"        "County"           "SchoolDistrict"
[11] "text"             "text"             "Name"             "text"             "text"
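Columns 11, 12, 14, and 15 should be named "YearBuilt", "InteriorFeatures", "Roof", and "Exterior", respectively. (Side note: why does this happen?)
I'm looking for a way to sort each atomic value into the appropriate column of a data frame, with each column named after its node, even for nested child nodes. My data may also change over time, so ideally I'd like a dynamic function that adapts to the data and still produces the expected result.
I'd guess this is a fairly common XML pattern (layers of nested child elements), so I'm surprised not to have found much on the topic, though I may simply be using the wrong search terms. I suspect there's a simple answer. Do you have any suggestions?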
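On the side note, a plausible explanation (assuming xmlToDataFrame names its columns after the children of each node it is given) is that an element such as YearBuilt has exactly one child, a text node, and xmlName() reports text nodes as "text". A minimal sketch on a cut-down document; the variable names are hypothetical:

```r
library(XML)

# Cut-down document with one text-only element and one nested element
xml <- paste0(
  "<ResidentialProperty>",
  "<YearBuilt>1997</YearBuilt>",
  "<Occupant><Name>Vacant</Name></Occupant>",
  "</ResidentialProperty>"
)
doc <- xmlParse(xml, asText = TRUE)

year_built <- getNodeSet(doc, "//YearBuilt")[[1]]
occupant   <- getNodeSet(doc, "//Occupant")[[1]]

# YearBuilt's only child is a text node, which xmlName() calls "text"
year_child_names <- unname(sapply(xmlChildren(year_built), xmlName))

# Occupant's only child is the <Name> element
occupant_child_names <- unname(sapply(xmlChildren(occupant), xmlName))
```

That would match the column names seen above: "Name" for Occupant, and "text" for the elements that contain only text.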
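Assuming xml holds your example string, here is another strategy that also handles residential properties with differing numbers of items: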
library(XML)
library(plyr)

# xml <- '<ResidentialProperty>........'
doc <- xmlParse(xml, asText = TRUE)

df <- do.call(rbind.fill, lapply(doc['//ResidentialProperty'], function(x) {
  # Names of all nodes in document order; text nodes are reported as "text"
  names <- xpathSApply(x, './/.', xmlName)
  # The node just before each text node is the element that contains it
  names <- names[which(names == "text") - 1]
  # The text values, in the same order as the names above
  values <- xpathSApply(x, ".//text()", xmlValue)
  # One row per property, one column per element that holds text
  return(as.data.frame(t(setNames(values, names)), stringsAsFactors = FALSE))
}))

df
#   StreetNumber StreetName StreetSuffix StateOrProvince        StatusChangeDate  Latitude   Longitude County SchoolDistrict        View YearBuilt                     InteriorFeatures   Name        Roof             Exterior
# 1        11111      111th    Avenue Ct              WA 2015-07-05T23:48:53.410 11.111111 -111.111111 Pierce       Puyallup Territorial      1997 Bath Off Master,Dbl Pane/Storm Windw Vacant Composition Brick,Wood Products
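As a variation on the same idea (not the code above, just a sketch under stated assumptions), the leaf elements can be selected directly with the XPath predicate [not(*) and text()], which takes each name and value from the same node. The helper name leaf_row and the two-property sample document are hypothetical, and empty elements such as WaterFront are skipped because they hold no text.

```r
library(XML)
library(plyr)

# Hypothetical sample: two properties with different sets of fields
xml <- paste0(
  "<Properties>",
  "<ResidentialProperty>",
  "<Listing><StreetAddress>",
  "<StreetNumber>11111</StreetNumber><StreetName>111th</StreetName>",
  "</StreetAddress>",
  "<GeographicData><County>Pierce</County></GeographicData>",
  "</Listing>",
  "<YearBuilt>1997</YearBuilt><WaterFront/>",
  "</ResidentialProperty>",
  "<ResidentialProperty>",
  "<Listing><StreetAddress><StreetNumber>22222</StreetNumber></StreetAddress></Listing>",
  "<Roof>Composition</Roof>",
  "</ResidentialProperty>",
  "</Properties>"
)
doc <- xmlParse(xml, asText = TRUE)

# Build a one-row data frame from every leaf element that contains text
leaf_row <- function(node) {
  leaves <- getNodeSet(node, ".//*[not(*) and text()]")
  values <- vapply(leaves, xmlValue, character(1))
  names(values) <- vapply(leaves, xmlName, character(1))
  as.data.frame(t(values), stringsAsFactors = FALSE)
}

rows <- lapply(getNodeSet(doc, "//ResidentialProperty"), leaf_row)

# rbind.fill pads the columns a property lacks with NA
df_leaf <- do.call(rbind.fill, rows)
```

Because the names come from the elements themselves, new fields that appear in the data later become new columns without changes to the code. If two leaves in one property share an element name, the columns would need to be disambiguated first.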