Getting rid of .DS_Store files when creating a corpus from scratch

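I am very new to Python and I need to create a corpus from scratch. I am having trouble with a .DS_Store file. I have tried deleting it by hand, from the terminal, and from Python, but nothing works. As long as .DS_Store is there, I cannot run any NLP computations. Here is my code: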

import nltk
import random
nltk.download('cmudict')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
from nltk.corpus import cmudict
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import string
from nltk import word_tokenize
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpusdir = '/Users/username/nltk_data/corpusfilename'
corp = PlaintextCorpusReader(corpusdir,'.*')
corp.fileids() # gives me 6 fileids,5 existing and one .DS_Store

corp.sents() # error: 'utf-8' codec can't decode byte 0xd5 in position 161: invalid continuation byte

I am on a Mac. Someone suggested using an if statement so that the corpus reads only the .txt files and not .DS_Store, but I don't know how to do that.

Solution

From Wikipedia:

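In the Apple macOS operating system, .DS_Store is a file that stores custom attributes of its containing folder, such as the position of icons or the choice of a background image.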

So a .DS_Store file can turn up in any folder, and it may come back after you delete it.

In this line: corp = PlaintextCorpusReader(corpusdir,'.*') you choose which files go into the corpus.

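The second argument, '.*', is a regular expression that selects which files are used. According to the documentation, this argument can be "a list or regexp specifying the fileids in this corpus."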

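So in your case you can change '.*', which matches everything, to '.*\.txt', which matches any run of characters followed by a literal '.' and then 'txt'. In Python source it is safer to write this as a raw string, r'.*\.txt', so the backslash reaches the regex engine unchanged. Alternatively, if you know the name of every file you need, you can pass a list of file names such as ['file1.txt','file2.txt'].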
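To see what the pattern does before handing it to PlaintextCorpusReader, here is a minimal sketch using only Python's re module. It does not call NLTK; the file names and the helper select_txt are made up for illustration, and it assumes the whole file name has to match the pattern.

```python
import re

# The pattern suggested above, written as a raw string.
TXT_PATTERN = r".*\.txt"


def select_txt(filenames, pattern=TXT_PATTERN):
    """Return only the names that match the pattern in full (hypothetical helper)."""
    return [name for name in filenames if re.fullmatch(pattern, name)]


# Hypothetical directory listing, similar to the one in the question.
names = [".DS_Store", "file1.txt", "file2.txt"]
selected = select_txt(names)
print(selected)  # ['file1.txt', 'file2.txt']
```

The resulting list could also be passed as the second argument instead of the regex, since that argument accepts either form.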

find . -name ".DS_Store" -delete

The command above deletes every .DS_Store file in the current directory and its subdirectories.
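If you would rather do the same cleanup from Python, something along these lines should work. It is only a sketch using the standard pathlib module; the function name remove_ds_store is made up, and the directory you pass in is up to you (for example the corpusdir from the question).

```python
from pathlib import Path


def remove_ds_store(root):
    """Delete every .DS_Store file under root; return the paths that were removed."""
    # Collect the targets first so nothing is deleted while the tree is being walked.
    targets = [p for p in Path(root).rglob(".DS_Store") if p.is_file()]
    for path in targets:
        path.unlink()
    return targets


# Example call (path is an assumption, adjust to your own corpus folder):
# remove_ds_store('/Users/username/nltk_data/corpusfilename')
```

Keep in mind that Finder may create the file again later, so filtering by file name in the corpus reader is the more durable fix.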