Why can't I import the following dataset from UCI?

Good afternoon,

Suppose we have the following function:

data_preprocessing<-function(link,drop_last_column=TRUE){
  
  link=as.character(link) 
  DT <- data.table::fread(link,fill = TRUE,na.strings = "?") 
  DT=DT[-1,]
  DT=as.data.frame(DT)
  
  if(drop_last_column==TRUE){
    DT=as.data.frame(DT)[,-ncol(DT)]
  }
  
  
  return(DT)
  
}

When I try to import the acute dataset from UCI, I get the following error:

acute=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")
 [100%] Downloaded 7276 bytes...
Error in data.table::fread(link,na.strings = "?") : 
  File is encoded in UTF-16,this encoding is not supported by fread(). Please recode the file to UTF-8.

I also tried:

acute=read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")
Warning messages:
1: In read.table(file = file,header = header,sep = sep,quote = quote,:
  line 1 appears to contain embedded nulls
2: In read.table(file = file,:
  line 2 appears to contain embedded nulls
3: In read.table(file = file,:
  line 3 appears to contain embedded nulls
4: In read.table(file = file,:
  line 4 appears to contain embedded nulls
5: In read.table(file = file,:
  line 5 appears to contain embedded nulls
6: In scan(file = file,what = what,dec = dec,:
  embedded nul(s) found in input

Thanks for your help!

Solution

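Use read.table with the appropriate encoding instead. The fread error message states that the file is encoded in UTF-16, so pass that encoding through fileEncoding.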

data_preprocessing<-function(link,drop_last_column=TRUE){
  
  link=as.character(link) 
  DT <- read.table(link,fileEncoding="UTF-16",fill = TRUE,na.strings = "?") 
  DT=DT[-1,]
  DT=as.data.frame(DT)
  
  if(drop_last_column==TRUE){
    DT=as.data.frame(DT)[,-ncol(DT)]
  }
  
  
  return(DT)
  
}

acute=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")

head(acute)
    V1 V2  V3  V4  V5  V6  V7
2 35,9 no  no yes yes yes yes
3 35,9 no yes  no  no  no  no
4 36,0 no  no yes yes yes yes
5 36,0 no yes  no  no  no  no
6 36,0 no yes  no  no  no  no
7 36,2 no  no yes yes yes yes
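Note that the first column in the output above uses a decimal comma ("35,9"), so it is not a number yet. A minimal sketch of one way to convert such values, assuming the column really holds decimal-comma numbers; temp_raw and temp_num are hypothetical names and the three values are copied from the head(acute) output:

```r
# Sample values copied from the first column of head(acute).
temp_raw <- c("35,9", "36,0", "36,2")

# Replace the decimal comma with a dot, then convert to numeric.
# fixed = TRUE treats "," as a literal string, not a regular expression.
temp_num <- as.numeric(sub(",", ".", temp_raw, fixed = TRUE))

print(temp_num)
```

On the real data frame the same call could be applied to acute$V1. Passing dec = "," to read.table may be an alternative, but that is untested here.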

Edit: To detect the encoding used in the data file automatically, you can use the guess_encoding function from the readr package.

data_preprocessing<-function(link,drop_last_column=TRUE){
  
  link=as.character(link) 
  enc_guess <- readr::guess_encoding(link)
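  # take the single most confident guess (which.max returns one index even on ties)
  enc <- enc_guess$encoding[which.max(enc_guess$confidence)]
  DT <- read.table(link,fileEncoding = enc,fill = TRUE,na.strings = "?") 
  DT=DT[-1,]
  DT=as.data.frame(DT)
  
  if(drop_last_column==TRUE){
    DT=as.data.frame(DT)[,-ncol(DT)]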
  }
  
  
  return(DT)
  
}
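As background, the "embedded nulls" warnings from read.csv in the question are consistent with a UTF-16 file being read as a single-byte encoding: in UTF-16 every ASCII character takes two bytes, one of which is zero. A small sketch of this, assuming the platform's iconv supports "UTF-16LE"; the string is a sample value from the dataset and bytes is a hypothetical name:

```r
# Encode a short ASCII string as UTF-16LE and inspect the raw bytes.
# toRaw = TRUE makes iconv return a list of raw vectors, one per input string.
bytes <- iconv("35,9", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

# Four characters become eight bytes; every second byte is 00.
print(bytes)

# Count the zero bytes that a single-byte reader would see as embedded nulls.
print(sum(bytes == as.raw(0)))
```

This is why declaring the encoding via fileEncoding removes the warnings: R then converts the two-byte units back to characters before parsing.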