How to split Macbeth where a character starts speaking (name followed by a period), given a list of the most common names in Macbeth

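After sending a GET request to Project Gutenberg, I have the entire play Macbeth as a single string

import requests

response = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt')
full_text = response.text
macbeth = full_text[16648:]

I split it into words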

words_raw = macbeth.split()
word_count = len(words_raw)

print("Macbeth contains {} words".format(word_count))
print("Here are some examples:",words_raw[400:460])

Then I strip all punctuation and convert each word to lowercase with lower()

import string
punctuation = string.punctuation

words_cleaned = []

for word in words_raw:
    # remove punctuation
    word = word.strip(punctuation)
    # make lowercase
    word = word.lower()
    words_cleaned.append(word)

print("Cleaned word examples:",words_cleaned[400:460])

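However, I can't strip all punctuation, because I need the period after a name or abbreviated name as the indicator that a character is about to speak.

Excerpt from the course

A speaking character is denoted by a (usually abbreviated) version of their name followed by a . (period) as the first thing on a line. For example, when Macbeth speaks, the line starts with "Macb." You'll need to modify the way you handle punctuation, since you can't just remove all of it.

Slice of the raw data after split()

Names followed by a period (shown in bold in the original post)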

Macbeth contains 17737 words
Here are some examples: ['Gashes', 'cry', 'for', 'helpe', 'King.', 'So', 'well', 'thy', 'words', 'become', 'thee,', 'as', 'wounds,', 'they', 'smack', 'of', 'Honor', 'both:', 'Goe', 'get', 'him', 'Surgeons.', 'Enter', 'Rosse', 'and', 'Angus.', 'Who', 'comes', 'here?', 'Mal.', 'The', 'worthy', 'Thane', 'Lenox.', 'What', 'a', 'haste', 'lookes', 'through', 'his', 'eyes?', 'should', 'he', 'looke,', 'that', 'seemes', 'to', 'speake', 'things', 'strange', 'Rosse.', 'God', 'saue', 'the', 'King', 'King.']

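We know that Malcolm starts speaking when his name appears followed by a period ("Mal." in the output above), and the same goes for Lenox ("Lenox."). Sometimes a character's name is abbreviated; other characters use the full name, immediately followed by a period.

List of the most common names in Macbeth

["duncan", "malcolm", "donalbaine", "macbeth", "banquo", "macduff", "lenox", "rosse", "menteth", "angus", "cathnes", "fleance", "seyward", "seyton", "boy", "lady", "messenger", "wife"]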

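Goal

Identify every name from the list above, along with each character's abbreviated name (if it is shortened)
Find the places where a character starts speaking, indicated by the period, and split there

Here is what I have tried so far

Attempt to isolate the non-alphanumeric characters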

print(len(words_raw))
def extra(string):
    return list(c for c in string if not c.isalnum() and not c.isspace())
weird = extra(macbeth)
weird

discard = []
for char in weird:
    if char != '.':
        discard.append(char)
print(len(weird))
print(len(discard))
print(discard)

revised_macbeth = []

for character in words_raw:
    if not character in discard:
        revised_macbeth.append(character)
print(len(revised_macbeth))
        
        

# for character in words_raw:
#     if not character.isalnum():
#         print("found: \'{}\'".format(character))

Its output

17737
4788
3553
['?','?','-',"'",':',';','(',')','&',

Comparison

print(macbeth)

The Tragedie of Macbeth

Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come,Gray-Malkin

print(revised_macbeth)

['The', 'Tragedie', 'Macbeth', 'Actus', 'Primus.', 'Scoena', 'Prima.', 'Thunder', 'Lightning.', 'three', 'Witches.', '1.', 'When', 'shall', 'we', 'meet', 'againe?', 'In', 'Thunder,', 'Lightning,', 'or', 'in', 'Raine?', '2.', "Hurley-burley's", 'done,', "Battaile's", 'lost,', 'wonne', '3.', 'That', 'will', 'be', 'ere', 'set', 'Sunne', 'Where', 'place?', 'Vpon', 'Heath', 'There', 'with', 'I', 'come,', 'Gray-Malkin', 'All.', 'Padock', 'calls', 'anon:', 'faire', 'is', 'foule,', 'foule', 'faire,', 'Houer', 'fogge', 'filthie', 'ayre.', 'Exeunt.', 'Scena', 'Secunda.', 'Alarum', 'within.', 'King,', 'Malcome,', 'Donalbaine,', 'Lenox,', 'attendants,', 'meeting', 'bleeding', 'Captaine.', 'bloody', 'man', 'that?', 'can', 'report,', 'As', 'seemeth', 'by', 'plight,', 'Reuolt', 'newest', 'state', 'Mal.', 'This', 'Serieant,', 'like', 'good', 'hardie', 'Souldier', 'fought', "'Gainst", 'my', 'Captiuitie:', 'Haile', 'braue', 'friend;', 'Say', 'kNowledge', 'broyle,', 'thou', 'didst', 'leaue', 'it', 'Cap.', 'Doubtfull', 'stood,', 'two', 'spent', 'Swimmers,', 'doe', 'cling', 'together,', 'And', 'choake', 'their', 'Art:', 'me

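Solution

Following up on my comment above:

You might have an easier time if you split into lines first and then into words, since I would expect the abbreviated character names to always be at the start of a line. Also, I noticed that when a new character starts speaking, the line is indented by a few spaces. That could be another clue.

Split into lines: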

macbeth_lines = macbeth.split('\r\n') # Because in your text lines are separated by \r\n
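
If you are not sure which line ending the downloaded text uses, str.splitlines() may be a safer choice than splitting on a fixed separator, since it treats \n, \r\n and \r alike. A minimal sketch, using a short made-up sample string rather than the real download:

```python
# Hypothetical sample standing in for the downloaded play text,
# deliberately mixing \r\n and \n line endings.
sample = (
    "  1. When shall we three meet againe?\r\n"
    "In Thunder, Lightning, or in Raine?\n"
    "  2. When the Hurley-burley's done,"
)

# splitlines() recognises \r\n, \n and \r, so mixed endings are not a problem.
sample_lines = sample.splitlines()

for line in sample_lines:
    print(repr(line))
```

The leading spaces survive the split, so the indentation check used below still works on the resulting lines.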

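Then iterate over each line. If it starts with a space, strip everything except periods from the first word, and strip all punctuation from the other words. If it doesn't start with a space, strip all punctuation from every word. To remove the characters, we'll use str.translate(), which takes a dict mapping the code point of each input character to its replacement. We can build this dict so that every punctuation character maps to None, which deletes it.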

# Create a dictionary for str.translate
strip_chars = {ord(punct): None for punct in string.punctuation}

# And one without the period
strip_chars_no_period = {k: v for k,v in strip_chars.items() if k != 46} # 46 is ord('.')

macbeth_words = []
for line in macbeth_lines:
    line_words = line.split()
    line_proc_words = [] # List to see each line as it's processed
                         # Remove if not needed

    if line.startswith(" ") and line_words:
        # this line starts with a space and is not blank. Maybe it contains a name

        # Don't strip periods from the first word
        first_word = line_words[0].translate(strip_chars_no_period)

        line_proc_words.append(first_word) # Debug line

        # Save the word
        macbeth_words.append(first_word)

        # Remaining words yet to be processed in this line
        remaining_words = line_words[1:]
    else:
        # All words in the line are yet to be processed
        remaining_words = line_words

    # Process remaining words
    for other_word in remaining_words:
        # Strip punctuation
        stripped_word = other_word.translate(strip_chars)

        line_proc_words.append(stripped_word) # Debug line

        # Save to list
        macbeth_words.append(stripped_word)
    
    # Print out the line just to make sure it's correct
    print(' '.join(line_proc_words)) # Debug line

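I added the line_proc_words list so that we can print each line as it is processed. The output of the code above (I only ran it on the first 100 lines) looks like this: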

The Tragedie of Macbeth

Actus Primus Scoena Prima

Thunder and Lightning Enter three Witches

1. When shall we three meet againe
In Thunder Lightning or in Raine
2. When the Hurleyburleys done
When the Battailes lost and wonne

3. That will be ere the set of Sunne

1. Where the place
2. Vpon the Heath

3. There to meet with Macbeth

1. I come GrayMalkin

All. Padock calls anon faire is foule and foule is faire
Houer through the fogge and filthie ayre

Exeunt


Scena Secunda

Alarum within Enter King Malcome Donalbaine Lenox with
attendants meeting a bleeding Captaine

King. What bloody man is that he can report
As seemeth by his plight of the Reuolt
The newest state

Mal. This is the Serieant
Who like a good and hardie Souldier fought
Gainst my Captiuitie Haile braue friend
Say to the King the knowledge of the Broyle
As thou didst leaue it

Cap. Doubtfull it stood
As two spent Swimmers that doe cling together
And choake their Art The mercilesse Macdonwald
Worthie to be a Rebell for to that
The multiplying Villanies of Nature
Doe swarme vpon him from the Westerne Isles
Of Kernes and Gallowgrosses is supplyd
And Fortune on his damned Quarry smiling
Shewd like a Rebells Whore but alls too weake
For braue Macbeth well hee deserues that Name
Disdayning Fortune with his brandisht Steele
Which smoakd with bloody execution
Like Valours Minion carud out his passage
Till hee facd the Slaue
Which neur shooke hands nor bad farwell to him
Till he vnseamd him from the Naue toth Chops
And fixd his Head vpon our Battlements

King. O valiant Cousin worthy Gentleman

Cap. As whence the Sunne gins his reflection
Shipwracking Stormes and direfull Thunders
So from that Spring whence comfort seemd to come
Discomfort swells Marke King of Scotland marke
No sooner Iustice had with Valour armd
Compelld these skipping Kernes to trust their heeles
But the Norweyan Lord surueying vantage
With furbusht Armes and new supplyes of men
Began a fresh assault

King. Dismayd not this our Captaines Macbeth and
Banquoh
Cap. Yes as Sparrowes Eagles
Or the Hare the Lyon
If I say sooth I must report they were
As Cannons ouerchargd with double Cracks
So they doubly redoubled stroakes vpon the Foe
Except they meant to bathe in reeking Wounds
Or memorize another Golgotha
I cannot tell but I am faint
My Gashes cry for helpe

King. So well thy words become thee as thy wounds
They smack of Honor both Goe get him Surgeons
Enter Rosse and Angus

Who comes here
Mal. The worthy Thane of Rosse

Lenox. What a haste lookes through his eyes
So should he looke that seemes to speake things strange

Rosse. God saue the King

King. Whence camst thou worthy Thane
Rosse. From Fiffe great King
Where the Norweyan Banners flowt the Skie
And fanne our people cold
Norway himselfe with terrible numbers
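
Since only the first word of an indented line keeps its period, a trailing period in macbeth_words marks a candidate speaker label. The following is a rough sketch of counting those labels; sample_words is a hypothetical stand-in for the real macbeth_words list, and the approach will also pick up labels such as "1." for the witches.

```python
from collections import Counter

# Hypothetical stand-in for the macbeth_words list built above.
sample_words = [
    "King.", "What", "bloody", "man", "is", "that",
    "Mal.", "This", "is", "the", "Serieant",
    "King.", "O", "valiant", "Cousin",
]

# Keep only words ending in a period, then normalise them.
speaker_labels = [w.rstrip(".").lower() for w in sample_words if w.endswith(".")]

label_counts = Counter(speaker_labels)
print(label_counts.most_common())  # [('king', 2), ('mal', 1)]
```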

You can use collections.defaultdict to group the lines by speaker name. enumerate can be used to get the line number of each line of text a character speaks:

import re
from collections import defaultdict

import requests

# note: the := operator below requires Python 3.8+
r = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
d, l, keywords = defaultdict(list), None, ['Enter', 'Exit', 'Flourish', 'Thunder']
# iterate over the play lines, ignoring empty strings (generated from the split)
for i, a in filter(lambda x: x[-1], enumerate(re.split(r'[\n\r]+', r[r.index('Actus Primus. Scoena Prima.') + 27:]))):
    # check that the line contains character dialog, not stage prompts
    if not re.findall('|'.join(keywords), a):
        # grab the name of the character and append to "d"
        if (n := re.findall(r'^\s+[A-Z](?:\.[A-Z])*[a-z]+\.(?=\s\w+)|^[A-Z](?:\.[A-Z])*[a-z\.]+\.(?=\s\w+)', a)):
            d[(l := re.sub(r'^\s+|\.$', '', n[0]).lower())].append((i, a[len(n[0]) + 1:].lower()))
        elif l:
            # the line might be a continuation of a larger block of character text
            d[l].append((i, a.lower()))

print(list(d.keys()))  # detected characters
print(d['macb'][:10])  # first ten occurrences of Macbeth speaking

Output:

['all', 'king', 'mal', 'cap', 'lenox', 'rosse', 'macb', 'banquo', 'mac', 'banq', 'ang', 'lady', 'mess', 'la', 'fleance', 'porter', 'macd', 'port', 'exeunt', 'ban', 'donal', 'malc', 'don', 'ross', 'seruant', 'murth', 'lords', 'mur', 'len', 'hec', 'lord', 'appar', 'musicke', 'wife', 'son', 'mes', 'doct', 'ro', 'gent', 'lad', 'ment', 'cath', 'ser', 'sey', 'seyw', 'sold', 'syw', 'y.sey']
[(137, 'so foule and faire a day i haue not seene'), (170, 'stay you imperfect speakers, tell me more:'), (171, 'by sinells death, i know i am thane of glamis,'), (172, 'but how, of cawdor? the thane of cawdor liues'), (173, 'a prosperous gentleman: and to be king,'), (174, 'stands not within the prospect of beleefe,'), (175, 'no more then to be cawdor. say from whence'), (176, 'you owe this strange intelligence, or why'), (177, 'vpon this blasted heath you stop our way'), (178, 'with such prophetique greeting?')]
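
The detected keys contain several abbreviations of the same name ('mal' and 'malc', 'ban' and 'banq'). One possible way to map them back to the names list from the question is a unique-prefix match. This is only a sketch: resolve_label is a hypothetical helper, and labels that are ambiguous ('mac', 'sey') or not prefixes of any listed name ('king') are left unresolved on purpose.

```python
# Names list from the question.
names = ["duncan", "malcolm", "donalbaine", "macbeth", "banquo", "macduff",
         "lenox", "rosse", "menteth", "angus", "cathnes", "fleance",
         "seyward", "seyton", "boy", "lady", "messenger", "wife"]


def resolve_label(label, names):
    """Return the single full name that starts with label, else None."""
    matches = [name for name in names if name.startswith(label)]
    # Zero matches or more than one match means we cannot decide.
    return matches[0] if len(matches) == 1 else None


for label in ["mal", "malc", "macb", "mac", "sey", "king"]:
    print(label, "->", resolve_label(label, names))
```

Unresolved labels would still need manual handling, for example a small hand-written mapping such as 'king' to 'duncan'.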

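Edit: common words per character:

To find the most common words for each character, iterate over that character's lines in d, then iterate again over the result of str.split on each line. Note that the result of the previous step contains many stop words. My solution below gives you the option to filter them out: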

from collections import Counter

def common_words(character, filter_stop=False, stop_words=()):
    if filter_stop:
        stop_words = set(filter(None, requests.get("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords").text.split('\n')))
    # keys in d are lowercase, so normalise the requested name before the lookup
    # re.sub needs a replacement argument; '' removes the matched punctuation
    w = [i for _, b in d[character.lower()] for i in re.sub(r'[:.?]+', '', b).split() if i.lower() not in stop_words]
    return Counter(w).most_common(5)

print(common_words('Macb', filter_stop=True))

Output:

[('haue', 39), ('thou', 34), ('thy', 23), ('shall', 21), ('thee', 20)]
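
If you would rather not fetch the stop-word list over the network, the same counting can be sketched with a local set. Everything here is illustrative: sample_lines mimics the (line number, text) pairs stored in d, and the tiny stop_words set is an assumption, not NLTK's list.

```python
import re
from collections import Counter

# Hypothetical sample of (line_number, text) pairs, shaped like the values in d.
sample_lines = [
    (137, "so foule and faire a day i haue not seene"),
    (170, "stay you imperfect speakers, tell me more:"),
]

# Tiny hand-written stop-word set, an assumption for illustration only.
stop_words = {"so", "and", "a", "i", "not", "you", "me"}

# Strip punctuation, split into words, and drop the stop words.
words = [
    w
    for _, text in sample_lines
    for w in re.sub(r"[:.?,;!]+", "", text).split()
    if w not in stop_words
]

print(Counter(words).most_common(3))
```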
