
Converting Cyrillic HTML tags to Latin with PHP


This is not a question about a regex for stripping HTML tags! Please read the question carefully!

I wrote a script that transliterates Cyrillic to Latin and Latin to Cyrillic.

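Converting Latin to Cyrillic causes a lot of problems with HTML, because the script also transliterates the HTML elements themselves. To work around this, I wrote an algorithm that converts the Cyrillic HTML markup back to Latin while keeping the content inside the tags in Cyrillic.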
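To make the problem concrete, here is a minimal sketch of what a naive Latin-to-Cyrillic pass does to markup. The map `DEMO_LAT_TO_CYR` and the function `naive_lat_to_cyr()` are hypothetical and cover only the letters needed for this one example; they are not the plugin's real `lat()`/`cyr()` tables.

```php
<?php
// Hypothetical, partial Latin-to-Cyrillic map (illustration only).
const DEMO_LAT_TO_CYR = [
    'a' => 'а', 'b' => 'б', 'c' => 'ц', 'd' => 'д', 'e' => 'е',
    'i' => 'и', 'l' => 'л', 'm' => 'м', 'n' => 'н', 'o' => 'о',
    's' => 'с', 't' => 'т', 'u' => 'у', 'v' => 'в', 'M' => 'М',
];

// Transliterates every character, with no awareness of HTML syntax.
function naive_lat_to_cyr(string $html): string
{
    return strtr($html, DEMO_LAT_TO_CYR);
}

// Tag names and attribute names are transliterated along with the text.
echo naive_lat_to_cyr('<div class="content">Mama voli bebu</div>'), PHP_EOL;
// prints: <див цласс="цонтент">Мама воли бебу</див>
```

The output is no longer valid HTML, which is why a repair step such as `fix_cyr_html()` is needed afterwards.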

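The script works well, but I run into memory problems, or the while loop starts spinning indefinitely. Basically, the problem is the speed of the script.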

public function html_tags() {
    $tags = explode(',','!DOCTYPE,a,abbr,acronym,address,applet,area,article,aside,audio,b,base,basefont,bdi,bdo,big,blockquote,body,br,button,canvas,caption,center,cite,code,col,colgroup,data,details,dd,del,dfn,dialog,dir,div,dl,dt,em,embed,fieldset,figcaption,figure,font,footer,form,frame,frameset,h1,h2,h3,h4,h5,h6,head,header,hr,html,i,iframe,img,input,ins,kbd,label,legend,li,link,main,map,mark,meta,meter,nav,noframes,noscript,object,ol,optgroup,option,output,p,param,picture,pre,progress,q,rp,rt,ruby,s,samp,script,section,select,small,source,span,strike,strong,style,sub,summary,sup,svg,table,tbody,td,template,textarea,tfoot,th,thead,time,title,tr,track,tt,u,ul,var,video,wbr');
    $tags = array_map('trim',$tags);
    $tags = array_filter($tags);
    return apply_filters('serbian_transliteration_html_tags',$tags);
}

public function fix_cyr_html($content){
    $content = htmlspecialchars_decode($content);

    $tags = $this->html_tags();
    
    $tags_cyr = $tags_lat = array();
    foreach($tags as $tag){
        $tags_cyr[]='<' . str_replace($this->lat(),$this->cyr(),$tag);
        $tags_cyr[]='</' . str_replace($this->lat(),$this->cyr(),$tag) . '>';
        
        $tags_lat[]= '<' . $tag;
        $tags_lat[]= '</' . $tag . '>';
    }
    
    $tags_cyr = array_merge($tags_cyr,array('&нбсп;','&лт;','&гт;','&ндасх;','&мдасх;','хреф','срц','&лдqуо;','&бдqуо;','&лсqуо;','&рсqуо;','&сцарон;','&Сцарон;','&тилде;'));
    $tags_lat = array_merge($tags_lat,array('&nbsp;','&lt;','&gt;','&ndash;','&mdash;','href','src','&ldquo;','&bdquo;','&lsquo;','&rsquo;','ш','Ш','&tilde;'));
    
    $content = str_replace($tags_cyr,$tags_lat,$content);
    
    $lastPos = 0;
    $positions = [];

    while (($lastPos = mb_strpos($content,'<',$lastPos,'UTF-8')) !== false) {
        $positions[] = $lastPos;
        $lastPos = $lastPos + mb_strlen('<','UTF-8');
    }

    foreach ($positions as $position) {
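        // Only handle a "<" that has a matching ">" somewhere after it
        if(mb_strpos($content,'>',$position,'UTF-8') !== false) {
            // Length of the tag, measured from "<" up to (not including) ">"
            $end   = mb_strpos($content,">",$position,'UTF-8') - $position;
            $tag  = mb_substr($content,$position,$end,'UTF-8');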
            $tag_lat = $this->cyr_to_lat($tag);
            $content = str_replace($tag,$tag_lat,$content);
        }
    }
    
    // Fix open tags
    $content = preg_replace_callback ('/(<[\x{0400}-\x{04FF}0-9a-zA-Z\/\=\"\'_\-\s\.\;\,\!\?\*\:\#\$\%\&\(\)\[\]\+\@\€]+>)/iu',function($m){
        return $this->cyr_to_lat($m[1]);
    },$content);
    
    // Fix closed tags
    $content = preg_replace_callback ('/(<\/[\x{0400}-\x{04FF}0-9a-zA-Z]+>)/iu',function($m){
        return $this->cyr_to_lat($m[1]);
    },$content);
    
    // Fix HTML entities
    $content = preg_replace_callback ('/\&([\x{0400}-\x{04FF}0-9]+)\;/iu',function($m){
        return '&' . $this->cyr_to_lat($m[1]) . ';';
    },$content);
    
    // Fix JavaScript
    $content = preg_replace_callback('/(?=<script(.*?)>)(.*?)(?<=<\/script>)/s',function($m) {
            return $this->cyr_to_lat($m[2]);
    },$content);
    
    // Fix CSS
    $content = preg_replace_callback('/(?=<style(.*?)>)(.*?)(?<=<\/style>)/s',function($m) {
            return $this->cyr_to_lat($m[2]);
    },$content);
    
    // Fix email
    $content = preg_replace_callback ('/(([\x{0400}-\x{04FF}0-9\_\-\.]+)@([\x{0400}-\x{04FF}0-9\_\-\.]+)\.([\x{0400}-\x{04FF}0-9]{3,10}))/iu',function($m){
        return $this->cyr_to_lat($m[1]);
    },$content);

    // Fix URL
    $content = preg_replace_callback ('/(([\x{0400}-\x{04FF}]{4,5}):\/{2}([\x{0400}-\x{04FF}0-9\_\-\.]+)\.([\x{0400}-\x{04FF}0-9]{3,10})(.*?)($|\n|\s|\r|\"\'\.\;\,\:\)\]\>))/iu',function($m){
        return $this->cyr_to_lat($m[1]);
    },$content);
    
    // Fix attributes with doublequote
    $content = preg_replace_callback ('/(title|alt|data-(title|alt))\s?=\s?"(.*?)"/iu',function($m){
        return sprintf('%1$s="%2$s"',$m[1],esc_attr($this->lat_to_cyr($m[3])));
    },$content);
    
    // Fix attributes with single quote
    $content = preg_replace_callback ('/(title|alt|data-(title|alt))\s?=\s?\'(.*?)\'/iu',function($m){
        return sprintf('%1$s=\'%2$s\'',$m[1],esc_attr($this->lat_to_cyr($m[3])));
    },$content);

    return $content;
}

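The main question is whether, and how, I can convert the broken HTML (Cyrillic tag names and attributes) back to Latin while keeping all other text in Cyrillic: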

<див цласс="цонтент">Мама воли бебу</див>   --->   <div class="content">Мама воли бебу</div>
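The transformation in the example above can be sketched with a single `preg_replace_callback` pass that touches only the `<...>` spans. This is an illustration under stated assumptions, not the plugin's implementation: `CYR_TO_LAT` is a hypothetical, partial map standing in for `cyr_to_lat()`, and `latinize_tags()` is a made-up name.

```php
<?php
// Hypothetical, partial Serbian Cyrillic-to-Latin map (illustration only).
const CYR_TO_LAT = [
    'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd',
    'е' => 'e', 'з' => 'z', 'и' => 'i', 'к' => 'k', 'л' => 'l',
    'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r',
    'с' => 's', 'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h',
    'ц' => 'c',
];

function latinize_tags(string $html): string
{
    // Each "<...>" span is matched once; text between tags is never touched.
    $result = preg_replace_callback(
        '/<[^<>]*>/u',
        static function (array $m): string {
            return strtr($m[0], CYR_TO_LAT);
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

echo latinize_tags('<див цласс="цонтент">Мама воли бебу</див>'), PHP_EOL;
// prints: <div class="content">Мама воли бебу</div>
```

Note that this sketch also latinizes Cyrillic attribute values such as `title` or `alt`, so a follow-up step like the attribute fixes in `fix_cyr_html()` would still be needed for those.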

Can this be done with a regular expression, or is there a faster/better solution?
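One possible alternative, offered only as a sketch, is to avoid breaking the markup in the first place: split the document into tag and text segments with `preg_split` and transliterate only the text segments when going from Latin to Cyrillic. The names `TEXT_LAT_TO_CYR` and `cyrillize_text_only()` are hypothetical, and the map covers only the letters used in the example.

```php
<?php
// Hypothetical, partial Latin-to-Cyrillic map (illustration only).
const TEXT_LAT_TO_CYR = [
    'a' => 'а', 'b' => 'б', 'e' => 'е', 'i' => 'и', 'l' => 'л',
    'm' => 'м', 'o' => 'о', 'u' => 'у', 'v' => 'в', 'M' => 'М',
];

function cyrillize_text_only(string $html): string
{
    // Split into alternating text and tag segments, keeping the tags.
    $parts = preg_split('/(<[^<>]*>)/u', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // regex failure: leave the input unchanged
    }

    foreach ($parts as $i => $part) {
        // With one capture group, even indexes are text and odd indexes are tags.
        if ($i % 2 === 0) {
            $parts[$i] = strtr($part, TEXT_LAT_TO_CYR);
        }
    }

    return implode('', $parts);
}

echo cyrillize_text_only('<div class="content">Mama voli bebu</div>'), PHP_EOL;
// prints: <div class="content">Мама воли бебу</div>
```

This walks the input once and never needs a repair pass for tag names, but the contents of `<script>` and `<style>` elements are ordinary text segments here, so they would still need to be excluded separately.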
