将sdmx-xml文件读入R中的数据帧

前端之家收集整理的这篇文章主要介绍了将sdmx-xml文件读入R中的数据帧前端之家小编觉得挺不错的,现在分享给大家,也给大家做个参考。
我想知道是否有人设法将SDMX- XML文件读入数据帧.我想读的文件https://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml(1mb).
我将文件保存为“pensions_funds.xml”到pwd并尝试使用XML包来读取它:
fileName <- system.file("pensions","pensions_funds.xml",package="XML")
parsed<-xmlTreeParse("pension_funds.xml",getDTD=F)
r<-xmlRoot(parsed)
tmp = xmlSApply(r,function(x) xmlSApply(x,xmlValue))

以上几行基本上遵循http://www.omegahat.org/RSXML/gettingStarted.html中的示例
但我想我首先需要以某种方式忽略标题(我已经粘贴在我正在尝试阅读的文件的前几页下面).所以我认为上面的内容可能会起作用,但它从我错误的节点开始.我想抓住他们的time_period和ref_area索引的obs_values.

第一件事就是找到正确的节点然后从那里开始,但是我怀疑我可能是傻瓜的事,因为我对数据格式知之甚少,而且我不确定XML包可以用于SDMX-XML文件.聪明的人似乎试图这样做
http://opensdmxdevelopers.wikispaces.com/RSDMX
我在这里的主页上找不到这个包
https://r-forge.r-project.org/projects/rsdmx/
(我看不到任何链接/下载部分,但也许我是盲目的)它似乎是早期阶段. rsdmx的存在建议使用xml包来读取sdmx可能并不容易,所以我准备在这个阶段放弃,除非任何人都成功了.其实我主要是对阅读这个文件感兴趣
http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml
但这是一个10mb的文件,所以我开始变小了.

EDIT3
使用Mischa评论中的更改尝试sgibb对大文件的回答
库( “XML”)

url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"

    sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp value to store current REF_AREA
  ## temp value to store current REF_AREA
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA

  ## handler subroutine for Obs tag
  Obs <- function(name,attr) {
    ## found an Obs tag and now fill data.frame
    data[i,"refArea"] <<- refArea
    data[i,"timePeriod"] <<- as.numeric(attr["TIME_PERIOD"])
    data[i,"obsValue"] <<- as.numeric(attr["OBS_VALUE"])
    data[i,"bsItem"] <<- bsItem
    data[i,"bsCountSector"] <<- bsCountSector
    i <<- i + 1
  }

  ## handler subroutine for Series tag
  Series <- function(name,attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- as.character(attr["BS_ITEM"])
    bsCountSector <<- as.numeric(attr["BS_ITEM"])
  }
  return(list(getData=function() {return(data)},Obs=Obs,Series=Series))
}

## run parser
df <- xmlEventParse(file(url),handlers=sdmxHandler())$getData()
Specification mandate value for attribute OBS_VALUE
attributes construct error
Couldn't find end of Start Tag Obs line 15108
Premature end of data in tag Series line 15041
Premature end of data in tag DataSet line 91
Premature end of data in tag CompactData line 2
Error: 1: Specification mandate value for attribute OBS_VALUE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 15108
4: Premature end of data in tag Series line 15041
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)

EDIT2:
sgibb的答案看起来很理想,并且可以在较小的文件上完美运行.我试着运行它

url <- http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml

(10mb文件,原始链接已更正),唯一的修改添加两行:

data[i,"bsItem"] <<- as.character(attr["BS_ITEM"])

data[i,"bsCountSector"] <<- as.numeric(attr["BS_COUNT_SECTOR"])

(这些是识别此较大数据集中的行所需的附加id变量).
它运行了几分钟,然后完成了这个错误

Error: 1: Specification mandate value for attribute TIME_PE
2: attributes construct error
3: Couldn’t find end of Start Tag Obs line 20743
4: Premature end of data in tag Series line 20689
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2

另外:有50个或更多警告(使用警告()查看前50个)

数据的基本格式看起来非常相似,所以我认为这可能有用. 10mb文件的基本格式如下:

<Series FREQ="M" REF_AREA="AT" ADJUSTMENT="N" BS_REP_SECTOR="A" BS_ITEM="A20" MATURITY_ORIG="A" DATA_TYPE="1" COUNT_AREA="U2" BS_COUNT_SECTOR="0000" CURRENCY_TRANS="Z01" BS_SUFFIX="E" TIME_FORMAT="P1M" COLLECTION="E">
        <Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3" OBS_STATUS="A" OBS_CONF="F"/>
        <Obs TIME_PERIOD="1997-10" OBS_VALUE="275.9" OBS_STATUS="A" OBS_CONF="F"/>
        <Obs TIME_PERIOD="1997-11" OBS_VALUE="276.6" OBS_STATUS="A" OBS_CONF="F"/>

EDIT1:

所需的数据格式:

Ref_area    time_period obs_value

At  2006    118    
At  2007    119    
…    
Be  2006    101    
…

这是数据的第一位.

</Header>
    DataSet xsi:schemaLocation="https://www.ecb.europa.eu/vocabulary/stats/icpf/1 https://www.ecb.europa.eu/stats/sdmx/icpf/1/structure/2011-08-11/sdmx-compact.xsd" xmlns="https://www.ecb.europa.eu/vocabulary/stats/icpf/1"> 
<Group DECIMALS="0" TITLE_COMPL="Austria,reporting institutional sector Insurance corporations and pension funds - Closing balance sheet - All financial assets and liabilities - counterpart area World (all entities),counterpart institutional sector Total economy including Rest of the World (all sectors) - Credit (resources/liabilities) - Non-consolidated,Current prices - Euro,Neither seasonally nor working day adjusted - ESA95 TP table Not applicable" UNIT_MULT="9" UNIT="EUR" ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT"/><Series ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT" COLLECTION="E" TIME_FORMAT="P1Y" FREQ="A"><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="112" TIME_PERIOD="2008"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="119" TIME_PERIOD="2009"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="125" TIME_PERIOD="2010"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="127" TIME_PERIOD="2011"/></Series><Group D
RSDMX似乎处于早期开发状态.恕我直言,还没有包.但是您可以使用XML包轻松地自己实现它.我建议使用xmlEventParse(有关详细信息,请参阅?xmlEventParse):

编辑:根据outstanding_amounts.xml的更改要求调整示例
EDIT2:添加download.file

library("XML")

#url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml"
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"

## download xml file to avoid download errors disturbing xmlEventParse
tmp <- tempfile()
download.file(url,tmp) 

sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp value to store current REF_AREA,BS_ITEM and BS_COUNT_SECTOR
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA

  ## handler subroutine for Obs tag
  Obs <- function(name,"bsCountSector"] <<- bsCountSector
    data[i,"timePeriod"] <<- as.Date(paste(attr["TIME_PERIOD"],"-01",sep=""),format="%Y-%m-%d")
    data[i,"obsValue"] <<- as.double(attr["OBS_VALUE"])
    ## update current row
    i <<- i + 1
  }

  ## handler subroutine for Series tag
  Series <- function(name,attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- attr["BS_ITEM"]
    bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
  }

  return(list(getData=function() {return(data)},Series=Series))
}

## run parser
df <- xmlEventParse(tmp,handlers=sdmxHandler())$getData()

head(df)
#  refArea bsItem bsCountSector timePeriod obsValue
#1      DE    A20          2210      12053     39.6
#2      DE    A20          2210      12084     46.1
#3      DE    A20          2210      12112     50.2
#4      DE    A20          2210      12143     52.0
#5      DE    A20          2210      12173     52.3
#6      DE    A20          2210      12204     47.3

猜你在找的XML相关文章