
How can I parse a movie script in R for dialogue lines that share a consistent indentation?

'''
                A stray SKATEBOARD clips her, causing her to stumble and
      spill her coffee, as well as the contents of her backpack.

      The young RIDER dashes over to help, trembling when he sees
      who his board has hit.

                             RIDER
                hey -- sorry.

      Cowering in fear, he attempts to scoop up her scattered
      belongings.

                             KAT
                Leave it

      He persists.

                             KAT (continuing)
                I said, leave it!

                             RIDER
                hey -- sorry.
'''

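I am scraping some scripts that I want to use for text analysis. I only want to extract the dialogue, which appears to sit at a fixed indentation. For example, I want the line "hey -- sorry.". I know the spacing is 20, and that is consistent throughout the script. So how can I read in only that line, along with the other lines that have the same spacing?

I am thinking of using read.fwf to read the file as fixed width.

What do you all think?

I am scraping from a URL like this: https://imsdb.com/scripts/10-Things-I-Hate-About-You.html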

Solution

library(tidytext)
library(tidyverse)

text <- c("PADUA HIGH SCHOOL - DAY
          
          Welcome to Padua High School, your typical urban-suburban 
          high school in Portland, Oregon.  Smarties, Skids, Preppies, Granolas. Loners, Lovers, the In and the Out Crowd rub sleep 
          out of their eyes and head for the main building.
          
          PADUA HIGH PARKING LOT - DAY
          
          KAT STRATFORD, eighteen, pretty -- but trying hard not to be 
          -- in a baggy granny dress and glasses, balances a cup of 
          coffee and a backpack as she climbs out of her battered, baby blue '75 Dodge Dart.
          
          A stray SKATEBOARD clips her, causing her to stumble and 
          spill her coffee, as well as the contents of her backpack.
          
          The young RIDER dashes over to help, trembling when he sees 
          who his board has hit.
          
                                 RIDER
                    Hey -- sorry.
          
          Cowering in fear, he attempts to scoop up her scattered 
          belongings.
          
                                 KAT
                    Leave it 
          
          He persists.
          
                                 KAT (continuing)
                    I said, leave it!
          
          She grabs his skateboard and uses it to SHOVE him against a 
          car, skateboard tip to his throat.  He whimpers pitifully 
          and she lets him go.  A path clears for her as she marches 
          through a pack of fearful students and SLAMS open the door, entering school.
          
          INT. GIRLS' ROOM - DAY
          
          BIANCA STRATFORD, a beautiful sophomore, stands facing the 
          mirror, applying lipstick.  Her less extraordinary, but 
          still cute friend, CHASTITY stands next to her.  
          
                                 BIANCA
                    Did you change your hair?
          
                                 CHASTITY 
                    No.
          
                                 BIANCA
                    You might wanna think about it
          
          Leave the girls' room and enter the hallway.
          
          HALLWAY - DAY- CONTINUOUS
          
          Bianca is immediately greeted by an admiring crowd, both 
          boys
          and girls alike.
          
                                 BOY
                            (adoring)
                    Hey, Bianca.
          
                                 GIRL
                    Awesome shoes.
          
          The greetings continue as Chastity remains wordless and 
          unaddressed by her side.  Bianca smiles proudly, acknowledging her fans.
          
          GUIDANCE COUNSELOR'S OFFICE - DAY
          
          CAMERON JAMES, a clean-cut, easy-going senior with an open, farm-boy face, sits facing Miss Perky, an impossibly cheery 
          guidance counselor.")
          
          

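# Speaker names to drop, matched at the start of the (lower-cased) line.
# Note: this also drops any dialogue line that happens to start with one of these words.
names_stopwords <- c("^(rider|kat|chastity|bianca|boy|girl)")

text %>% 
  as_tibble() %>% 
  # One row per script line; unnest_tokens() also lower-cases the text
  unnest_tokens(text, value, token = "lines") %>% 
  # Keep lines containing a run of 15+ whitespace characters
  # (speaker names, parentheticals and dialogue)
  filter(str_detect(text, "\\s{15,}")) %>% 
  mutate(text = str_trim(text)) %>% 
  # Drop the speaker-name lines
  filter(!str_detect(text, names_stopwords)) 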

Output:

# A tibble: 9 x 1
  text                          
  <chr>                         
1 hey -- sorry.                 
2 leave it                      
3 i said, leave it!             
4 did you change your hair?     
5 no.                           
6 you might wanna think about it
7 (adoring)                     
8 hey, bianca.                  
9 awesome shoes.                

You can include more character names in the names_stopwords vector.
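If speaker names and dialogue sit at different indentation depths, as they do in the excerpt in the question, the name list can be avoided by filtering on the exact number of leading spaces. The following is only a minimal base R sketch: the sample vector `script` and the widths 6, 29 and 16 are assumptions taken from that excerpt, so check the widths against the real file before relying on them.

```r
# Hypothetical sample: action at 6 spaces, speaker names at 29, dialogue at 16.
script <- c(
  paste0(strrep(" ", 6),  "He persists."),
  "",
  paste0(strrep(" ", 29), "KAT (continuing)"),
  paste0(strrep(" ", 16), "I said, leave it!")
)

# Count the leading spaces on each line ("^ *" always matches, possibly with length 0).
indent <- attr(regexpr("^ *", script), "match.length")

# Keep only the lines whose indentation equals the assumed dialogue width.
dialogue_indent <- 16
dialogue <- trimws(script[indent == dialogue_indent])
dialogue
```

Because the filter uses equality rather than "at least N spaces", speaker names and parentheticals such as (adoring) are excluded without a stop list, provided they are indented differently from the dialogue.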


You can try the following:

library(magrittr)  # provides the %>% pipe

url <- 'https://imsdb.com/scripts/10-Things-I-Hate-About-You.html'

url %>%
  # Read the webpage line by line
  readLines() %>%
  # Remove '<b>' and '</b>' tags
  gsub('<b>|</b>', '', .) %>%
  # Keep only lines that begin with at least 20 whitespace characters
  grep('^\\s{20,}', ., value = TRUE) %>%
  # Remove leading and trailing whitespace
  trimws() %>%
  # Drop lines made up entirely of capital letters (speaker names);
  # the '.' placeholder keeps the piped text out of the pattern argument
  grep('^([A-Z]+\\s?)+$', ., value = TRUE, invert = TRUE)

#[1] "Hey -- sorry."             "Leave it"                  "KAT (continuing)"
#[4] "I said, leave it!"         "Did you change your hair?" "No."
#...
#...

I have tried to clean it up as much as possible, but it may need further cleaning depending on what you actually want to extract.
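For text analysis it is often useful to keep track of who speaks each line. The following base R sketch pairs every dialogue line with the most recent speaker-name line above it. The sample vector `script`, the indentation widths 29 and 16, and the result name `spoken` are all assumptions for illustration, not part of the answers above.

```r
# Hypothetical sample: speaker names at 29 spaces, dialogue at 16, action at 6.
script <- c(
  paste0(strrep(" ", 29), "RIDER"),
  paste0(strrep(" ", 16), "Hey -- sorry."),
  paste0(strrep(" ", 6),  "He persists."),
  paste0(strrep(" ", 29), "KAT (continuing)"),
  paste0(strrep(" ", 16), "I said, leave it!")
)

indent  <- attr(regexpr("^ *", script), "match.length")
content <- trimws(script)

is_name     <- indent == 29
is_dialogue <- indent == 16

# For every line, the index of the most recent speaker-name line (0 if none yet).
name_idx <- cummax(ifelse(is_name, seq_along(script), 0L))

# Look up the speaker, leaving NA where no name has appeared so far.
speaker <- rep(NA_character_, length(script))
has_name <- name_idx > 0
speaker[has_name] <- content[name_idx[has_name]]

# Strip trailing parentheticals such as "(continuing)".
speaker <- sub("\\s*\\(.*\\)$", "", speaker)

spoken <- data.frame(
  speaker = speaker[is_dialogue],
  line    = content[is_dialogue],
  stringsAsFactors = FALSE
)
spoken
```

A speech that wraps over several lines produces one row per line here, so the rows may need to be pasted together per speaker turn before analysis.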
