INSERT OVERWRITE does not delete all the old data files

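We are running INSERT OVERWRITE on a Hive table. Most of the time it overwrites as expected: the old files are deleted and replaced with new ones. However, the behavior is not consistent. Sometimes not all of the old files are deleted, yet the new files are still created, which leaves the table with inconsistent data.

I have not been able to reproduce this behavior. Has anyone run into a similar issue, or does anyone have pointers on what might be happening?

We are using Hive version 2.1.1.

Below are the ORC table structure and the INSERT OVERWRITE command. fileid is a unique column in the table. The table is about 500 GB in size.

Hive table structure: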

CREATE EXTERNAL TABLE `tier0.file`(
  `filegroup` struct<collection:struct<name:string,code:string,royaltystate:string,enterprisecollectionid:bigint,isactive:boolean,active:boolean,filefamily:string,contentfamily:string,cfwcollectionname:string,droplocation:string,applyembeddestinationsite:boolean,associatedsource:string,excluderestriction:boolean,ownershiptype:string,collectionid:bigint,notes:string,bundlerestrictions:array<struct<bundleid:bigint,bundletype:string>>,pricecodes:array<struct<collectioncode:string,pricecode:string,iptccategory:string>>>,istockcollection:string,events:array<string>,paidassignmentids:array<string>,sisterfiles:array<string>,clonedfiles:array<string>,vcd:array<string>,source:struct<parentsource:string,parentsourceid:bigint,childsource:string,childsourceid:bigint>>,
  `filemanagement` struct<filemanagement:string,destinationsites:array<string>,readyforsale:boolean,readyforpublish:boolean,reviewstatus:string,excludedestinationsites:array<string>,displaystatus:string,inactivedate:string,pulledreason:string,pulledreasonaudit:string,approvaldate:string,futurepulledreason:string,futureinactivedate:string,futureactivedate:string>,
  `primarylanguage` string,
  `audithistory` struct<note:string,notecategory:string>,
  `contents` array<struct<deliverylocation:string,contenttype:string,submission:array<struct<data:struct<mimetype:string,fileinfo:struct<filelocation:string,filesize:bigint,filename:string,checksum:string,checksumtype:string>,submitdate:string,createdate:string,mediaformat:string,offlinehd:boolean,postertime:double,shoottype:string,stripaudio:boolean,timein:string,timeout:string,videoencoding:struct<compression:string,bitdepth:string,bitrate:double,definition:string,framerate:string,framesize:string,scantype:string,wrapper:string,height:int,width:int,interlaced:boolean>,rotation:string,anamorphic:boolean,pixelwidth:int,pixelheight:int,colorprofile:string,samplesperpixel:string,resolution:string,resolutionunit:string,colormode:string,animated:boolean,imageorientation:string,filmformat:string,duration:string,artistname:string,directlicense:boolean,lyrichook:string,albumtitle:string,parenttrackid:string,key:string,timesignature:string,publicdomain:string,lyrics:string,tracktitle:string,tracktype:string,speed:string,genre:string,mood:string,lyricpov:string,instrument:string,vocal:string,transformedmetadata:map<string,string>,iptc:map<string,string>,exif:map<string,string>,xmp:map<string,string>,xmpraw:map<string,string>>,sizeid:int,sizename:string,keyname:string,schemauri:string,extension:string,fileindex:int,suffix:string,readonly:boolean,ismaster:boolean>>,filepack:array<struct<data:struct<mimetype:string,camerashotdate:string,updatedate:string,audithistory:array<struct<note:string,notecategory:string>>,contract:struct<parentsource:string,contractid:bigint,contentprovidername:string,contentprovidertitle:string,vendornumber:bigint,childsourceid:bigint,istockusername:string,istockuserid:bigint,iptccredit:string,signatorycontentprovidername:string,signatoryguid:string,startdate:string,enddate:string>,release:struct<releaseid:string,releaseinformation:string,releasemetadata:array<struct<releasemetadataid:string,aliasid:string,releasetype:string,filelocation:string,name:string,agerange:string,age:string,birthdate:string,gender:string,ethnicity:string,ethnicities:array<string>,talentid:array<string>,usage:array<string>,teamsreleaseid:string>>>,contentmanagement:struct<state:string,messages:array<string>>,contentsource:struct<clientsystemid:string,submittedby:string,ingestionproviderid:int,submissionnotes:string,clientlastmodifieddate:string>,alternateids:array<struct<alternateid:string,alternateidtype:string>>,homeproperty:string,mediatype:int,colorpalettes:struct<rgbmodel:array<struct<red:int,green:int,blue:int,presence:string,x:string,y:string,density:string>>>,transcript:string,hasaudio:boolean,visualcolor:string,era:string,cliptype:string,productiontitle:string,footagespeed:string>>,
  `submitdate` string,
  `licensecharacteristics` struct<filefamily:string,restrictioninstructions:string,riskcategory:string,advancedroyaltybearing:boolean,pricingcode:string,callforimage:boolean,exclusivecontent:boolean,subscriptioneligible:boolean,publicistapprovalrequired:boolean,whollyowned:boolean,royaltybearing:string,bundletags:array<string>,paidassignment:boolean,preferredlicensemodel:string,exclusivity:string,parentbundlecollection:string,restrictions:array<struct<id:string,beginningdate:string,enddate:string,controlledrestrictions:array<string>>>>,
  `fileid` string,
  `updatedate` string,
  `version` int,
  `exclusionrouting` array<string>,
  `inclusionrouting` array<string>,
  `errors` map<string,array<struct<errorcode:string,message:string>>>,
  `dp_schema` string,
  `dp_source` string,
  `dp_source_type` string,
  `dp_proc_time` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3a://bucket/tier0/file/'

INSERT OVERWRITE command:

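-- Keep only the latest row per fileid (highest version, then latest dp_proc_time).
INSERT OVERWRITE TABLE stg.tier0_file
SELECT
  filegroup, filemanagement, primarylanguage, audithistory, contents, submitdate, licensecharacteristics, fileid, updatedate, version, errors, dp_schema, dp_source, dp_source_type, dp_proc_time
FROM (
  SELECT
    -- The inner column lists must expose every column the outer SELECT reads,
    -- plus fileid and version for the window function.
    filegroup, filemanagement, primarylanguage, audithistory, contents, submitdate, licensecharacteristics, fileid, updatedate, version, errors, dp_schema, dp_source, dp_source_type, dp_proc_time,
    ROW_NUMBER() OVER (PARTITION BY fileid ORDER BY version DESC, dp_proc_time DESC) AS rownum
  FROM (
    SELECT
      filegroup, filemanagement, primarylanguage, audithistory, contents, submitdate, licensecharacteristics, fileid, updatedate, version, errors, dp_schema, dp_source, dp_source_type, dp_proc_time
    FROM tier0.file
    UNION ALL
    SELECT
      filegroup, filemanagement, primarylanguage, audithistory, contents, submitdate, licensecharacteristics, fileid, updatedate, version, errors, dp_schema, dp_source, dp_source_type, dp_proc_time
    FROM stg.file
  ) base
) rnk
WHERE rnk.rownum = 1;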

Solution

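I just noticed that you are using Qubole and S3, which makes a big difference. After many files are deleted, you can run into an eventual consistency issue. I have seen this many times in cases where too many files are created in the table location.

Read this answer (Important addition about S3 eventual consistency) for more details about eventual consistency.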
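
To check whether a particular run left stale files behind, one option is to look at which files Hive actually reads for the table and whether fileid is still unique. The following is only a sketch: it assumes the overwritten table is stg.tier0_file, as in the question, and relies on Hive's INPUT__FILE__NAME virtual column.

```sql
-- Files currently backing the table, with the number of rows read from each.
-- Files left over from an earlier run show up here next to the new ones.
SELECT INPUT__FILE__NAME AS data_file, COUNT(*) AS row_count
FROM stg.tier0_file
GROUP BY INPUT__FILE__NAME;

-- fileid is expected to be unique, so any row returned here points to
-- old and new data being read together.
SELECT fileid, COUNT(*) AS copies
FROM stg.tier0_file
GROUP BY fileid
HAVING COUNT(*) > 1;
```

Because the S3 listing itself may be eventually consistent, the first query can return different results if it is repeated shortly after the overwrite.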

How to significantly reduce the probability of eventual consistency issues on Qubole:

  • Use timestamp-prefixed files:
    --use prefixed files
    set hive.qubole.dynpart.use.prefix=true;
    set hive.qubole.dynpart.bulk.delete=true;

    -- Disable AWS S3 direct writes. This should be set on Qubole to make it possible to rewrite from itself
    set hive.allow.move.on.s3=true;
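  • Make sure you are not creating too many small files. Adding distribute by <some key(s) with low cardinality and even distribution> at the end of the query may help if you have keys with low cardinality and even distribution. There are other ways to reduce the number of files: CLUSTERED BY in the table DDL, or increasing the bytes per reducer.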
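
As a rough sketch of the second point, the query below adds DISTRIBUTE BY at the end of the load and raises the bytes-per-reducer setting. The view stg.tier0_file_latest is hypothetical and stands in for the deduplicating SELECT from the question; dp_source is only assumed to be a low-cardinality, evenly distributed key, and the 1 GB value is illustrative rather than a recommendation.

```sql
-- Larger value -> fewer reducers -> fewer output files (illustrative value, ~1 GB).
SET hive.exec.reducers.bytes.per.reducer=1073741824;

INSERT OVERWRITE TABLE stg.tier0_file
SELECT *
FROM stg.tier0_file_latest   -- hypothetical view wrapping the dedup query
-- Rows with the same dp_source go to the same reducer, so the number of
-- non-empty output files is bounded by the number of distinct dp_source values.
DISTRIBUTE BY dp_source;     -- assumption: dp_source has few, evenly sized values
```

With a table of about 500 GB, too few distinct keys would push very large amounts of data through each reducer, so the key has to be chosen with the resulting file sizes in mind.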

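How to get rid of the eventual consistency problem completely:

Each time, create a new table whose location contains the current timestamp in the folder name, load it, drop the old table, and rename the new table. This way the new files are always written to a fresh table location, and eventual consistency after the old location is removed cannot affect the new data. This approach works 100% but is inconvenient. I recommend it as a last resort; usually, prefixing files and reducing the number of files helps.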
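
The steps above could look roughly like this. It is a sketch under assumptions, not a tested procedure: the timestamp is passed in by the caller as a Hive variable named ts (for example, hive --hivevar ts=20240101120000 -f swap.hql, where swap.hql is a hypothetical script name), the S3 prefix is made up to follow the pattern of the location in the question, and stg.tier0_file_latest is a hypothetical view holding the load query.

```sql
-- 1. New table with the same schema, in a location that has never been used.
CREATE EXTERNAL TABLE stg.tier0_file_${hivevar:ts}
LIKE stg.tier0_file
LOCATION 's3a://bucket/tier0/file_${hivevar:ts}/';   -- hypothetical prefix

-- 2. Load into the fresh location, so there are no old files to delete.
INSERT OVERWRITE TABLE stg.tier0_file_${hivevar:ts}
SELECT *
FROM stg.tier0_file_latest;   -- hypothetical view wrapping the dedup query

-- 3. Swap: drop the old table, then give the new one the old name.
--    The table is EXTERNAL, so DROP removes only the metadata.
DROP TABLE stg.tier0_file;
ALTER TABLE stg.tier0_file_${hivevar:ts} RENAME TO stg.tier0_file;
```

The files under the old location stay on S3 after the DROP and have to be cleaned up separately. Readers see a short window between the DROP and the RENAME in which the table does not exist.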
