
How to do a Python keyword search and word count over all csv files in a directory and write the results to a single csv?

I am new to Python and trying to learn some libraries. I'm not sure how to upload a csv to SO, but this script works with any csv; just replace 'SwitchedProviders_TopicModel'.

My goal is to loop over all the csvs in a file directory - C:\Users\jj\Desktop\autotranscribe - and write the output of my Python script to a single csv.

So let's say, for example, I have these csv files in the folder above -

'1003391793_1003391784_01bc7e411408166f7c5468f0.csv' '1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv' '1003478130_1003478103_8eef05b0820cf0ffe9a9882d.csv'

I would like my Python app (below) to do a word count for every csv in the folder/directory and write the output to a dataframe like this -

csvname                                            pre existing  exclusions  limitations  fourteen
1003391793_1003391784_01bc7e411408166f7c5468f0.csv    1           2           0            1

My script -

import pandas as pd
from collections import defaultdict

def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get lines from the file along with line numbers, for lines containing any string from the list"""
    line_number = 0
    list_of_results = []
    count = defaultdict(int)
    # Open the file in read-only mode (use the file_name argument, not a hardcoded path)
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            line_number += 1
            # For each line, check if it contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    count[string_to_search] += line.count(string_to_search)
                    # If a string is found, append it along with the line number and the line itself
                    list_of_results.append((string_to_search, line_number, line.rstrip()))

    # Return list of tuples (matched string, line number, line) and the count dict
    return list_of_results, dict(count)


matched_lines, count = search_multiple_strings_in_file(
    'SwitchedProviders_TopicModel.csv',
    ['pre existing ', 'exclusions', 'limitations', 'fourteen'])

df = pd.DataFrame.from_dict(count, orient='index').reset_index()
df.columns = ['Word', 'Count']

print(df)

How can I achieve this? I only want counts for the specific words you see in my script, such as 'fourteen', not a counter over all words.

Sample data from one of the csvs - credit to user Umar H

df = pd.read_csv('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv')
print(df.head(10).to_dict())
{'transcript': {0: 'hi thanks for calling ACCA  this is many speaking Could have the pleasure speaking with ',1: 'so ',2: 'hi ',3: 'I have the pleasure speaking with my name is B. as in boy E. V. D. N. ',4: 'thanks yes and I think I have your account pulled up Could you please verify your email ',5: "sure is yeah it's on _ 00 ",6: 'I T. O.com ',7: 'thank you how can I help ',8: 'all right I mean I do have an insurance with you guys I just want to cancel the insurance ',9: 'sure I can help with that what was the reason for cancellation '},'confidence': {0: 0.73,1: 0.18,2: 0.88,3: 0.72,4: 0.83,5: 0.76,6: 0.83,7: 0.98,8: 0.89,9: 0.95},'from': {0: 1.69,1: 1.83,2: 2.06,3: 2.13,4: 2.36,5: 2.98,6: 3.17,7: 3.65,8: 3.78,9: 3.93},'to': {0: 1.83,1: 2.06,2: 2.13,3: 2.36,4: 2.98,5: 3.17,6: 3.65,7: 3.78,8: 3.93,9: 4.14},'speaker': {0: 0,1: 0,2: 0,3: 0,4: 0,5: 0,6: 0,7: 0,8: 0,9: 0},'Negative': {0: 0.0,1: 0.0,2: 0.0,3: 0.0,4: 0.0,5: 0.0,6: 0.0,7: 0.0,8: 0.116,9: 0.0},'Neutral': {0: 0.694,1: 1.0,2: 1.0,3: 0.802,4: 0.603,5: 0.471,6: 1.0,7: 0.366,8: 0.809,9: 0.643},'Positive': {0: 0.306,3: 0.198,4: 0.397,5: 0.529,7: 0.634,8: 0.075,9: 0.357},'compound': {0: 0.765,3: 0.5719,4: 0.7845,5: 0.5423,7: 0.6369,8: -0.1779,9: 0.6124}}

Solution

Steps -

  1. Define the input path
  2. Collect all the CSV files
  3. Count the keywords
  4. Build one result dict, adding each file name with its counter dict.
  5. Finally, convert the result dict to a dataframe and transpose it. (Fill NaN values with 0 if needed.)

from collections import defaultdict
from pathlib import Path

import pandas as pd

inp_dir = Path(r'C:/Users/jj/Desktop/Bulk_Wav_Completed')  # input directory


def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get lines from the file along with line numbers, for lines containing any string from the list"""
    list_of_results = []
    count = defaultdict(int)
    # Open the file in read-only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line_number, line in enumerate(read_obj, start=1):
            # For each line, check if it contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    count[string_to_search] += line.count(string_to_search)
                    # If a string is found, append it along with the line number and the line itself
                    list_of_results.append(
                        (string_to_search, line_number, line.rstrip()))

    # Return list of tuples (matched string, line number, line) and the count dict
    return list_of_results, dict(count)


result = {}
for csv_file in inp_dir.glob('**/*.csv'):
    print(csv_file)  # for debugging
    # note: the original list was missing several commas, which silently
    # concatenated adjacent strings ('partners' 'lemonade', 'bee' '4paws');
    # commas added and duplicates dropped below
    matched_lines, count = search_multiple_strings_in_file(
        csv_file,
        ['nation', 'nation wide', 'trupanion', 'pet plan', 'best', 'embrace',
         'healthy paws', 'pet first', 'pet partners', 'lemon', 'AKC', 'akc',
         'kennel club', 'club', 'american kennel', 'american', 'lemonade',
         'kennel', 'figo', 'companion protect', 'true companion', 'true panion',
         'trusted pals', 'partners', 'partner', 'wagmo', 'vagmo', 'bivvy',
         'bivy', 'bee', '4paws', 'paws', 'pet best', 'pets best'])
    print(count)  # for debugging
    result[csv_file.name] = count
df = pd.DataFrame(result).T.fillna(0).astype(int)

Output -

       exclusions  limitations  pre existing
1.csv           1            3             1
2.csv           1            3             1
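The question asks for the counts to be written out to a single csv, and the answer stops at the transposed dataframe. A minimal sketch of that last step, using hypothetical counts in place of the real `result` dict and an assumed output name `keyword_counts.csv`:

```python
import pandas as pd

# hypothetical per-file counts, standing in for the `result` dict
# built by the loop above
result = {
    '1.csv': {'exclusions': 1, 'limitations': 3, 'pre existing': 1},
    '2.csv': {'exclusions': 1, 'limitations': 3, 'pre existing': 1},
}
df = pd.DataFrame(result).T.fillna(0).astype(int)
df.index.name = 'csvname'
# one output csv with a row per input file and a column per keyword
df.to_csv('keyword_counts.csv')
```

Setting `index.name` before `to_csv` gives the first column the header `csvname`, matching the layout shown in the question.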

Since you have tagged pandas, we can use .str.extractall to search for the words and row numbers.

You could extend the function and add some error handling (e.g. what happens if transcript does not exist in a given csv file).

from pathlib import Path
import pandas as pd

def get_files_to_parse(start_dir: str) -> list:
    files = list(Path(start_dir).glob('*.csv'))
    return files

def search_multiple_files(list_of_paths: list, key_words: list) -> pd.DataFrame:
    dfs = []
    for file in list_of_paths:
        df = pd.read_csv(file)
        word_df = df['transcript'].str.extractall(f"({'|'.join(key_words)})")\
                        .droplevel(1, 0)\
                        .reset_index()\
                        .rename(columns={'index': f"{file.parent}_{file.stem}"})\
                        .set_index(0).T
        dfs.append(word_df)
    return pd.concat(dfs)
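The error handling mentioned above could look like the sketch below. The helper name `count_keywords_safe`, the skip-on-missing-column policy, and returning an empty frame are my assumptions, not part of the original answer; it also condenses the matches to counts rather than row numbers.

```python
import pandas as pd

def count_keywords_safe(df: pd.DataFrame, key_words: list, label: str) -> pd.DataFrame:
    """Count keyword hits in a transcript frame; skip frames without the column."""
    if 'transcript' not in df.columns:
        # a csv without a transcript column: return an empty frame instead of raising
        print(f"skipping {label}: no 'transcript' column")
        return pd.DataFrame()
    matches = df['transcript'].str.extractall(f"({'|'.join(key_words)})")
    counts = matches[0].value_counts().to_frame().T  # one row of keyword counts
    counts.index = [label]
    return counts

# usage with an in-memory frame standing in for pd.read_csv(file)
sample = pd.DataFrame({'transcript': ['pre existing exclusions apply',
                                      'fourteen day limitations']})
print(count_keywords_safe(sample,
                          ['pre existing', 'exclusions', 'limitations', 'fourteen'],
                          'sample.csv'))
```

Each per-file frame produced this way can still be collected into a list and combined with pd.concat, as in the answer above.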
    
    

Usage.

Using your sample dataframe (I added a couple of keywords from your list)

files = get_files_to_parse(r'target\dir\folder')  # raw string: \f and \d are escape sequences otherwise


[WindowsPath('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv'),WindowsPath('1003478130_1003478103_8eef05b0820cf0ffe9a9754c_copy.csv')]

search_multiple_files(files,['pre existing','exclusions','limitations','fourteen'])

(image: screenshot of the resulting dataframe)
