Occurrences of letters in a text file

How to count the occurrences of letters in a text file

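I'm a beginner programmer, as the code probably shows. I'm currently writing a program that reads a text file, counts every letter in it, and records how often each letter of the alphabet occurs. So far I have only written code that counts the number of A's in the file, so I still need the frequencies of the other 25 letters. Without using anything fancy, is there a simple way to automate this instead of repeating the same block of code for every letter?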

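#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    // The file-opening code and the loop body did not survive in this copy of
    // the post. "input.txt" and the loop body below are placeholders.
    ifstream file("input.txt");
    int Counts[26] = {0};   // Counts[0] is for 'A', ..., Counts[25] is for 'Z'

    char ch;
    while (file.get(ch))    // safer than while (!file.eof()): stops on the failed read
    {
        if (ch == 'A')      // only 'A' is counted so far
            ++Counts[0];
    }

    cout << " letter     Frequency" << endl;
    for (char c = 'A'; c <= 'Z'; ++c)
    {
        cout << "    " << c << "    :      " << Counts[c - 'A'] << endl;
    }

    // Return code
    return 0;
}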

Solutions

You can use a std::map where the key is the character and the value is its count.

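Well, if you look at an ASCII table, you will see that 'A' to 'Z' and 'a' to 'z' are each laid out in order: 'A' = 65, 'Z' = 90, 'a' = 97, 'z' = 122. The tricky part is that the two ranges are not adjacent: 'a' sits 7 positions after 'Z', with six other characters (codes 91 to 96) in between.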
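The layout described above is what makes ch - 'A' usable as an array index. A small sketch, assuming an ASCII-compatible character set; letterIndex is a hypothetical helper name used only for this illustration.

```cpp
// Compile-time checks of the values quoted above (ASCII assumed).
static_assert('A' == 65 && 'Z' == 90, "uppercase letters occupy 65..90");
static_assert('a' == 97 && 'z' == 122, "lowercase letters occupy 97..122");
static_assert('a' - 'Z' == 7, "the two ranges are not adjacent");

// Maps a letter of either case to 0..25 and anything else to -1.
// letterIndex is a hypothetical helper, not part of the original post.
constexpr int letterIndex(char ch)
{
    return (ch >= 'A' && ch <= 'Z') ? ch - 'A'
         : (ch >= 'a' && ch <= 'z') ? ch - 'a'
         : -1;
}
```

Characters in the gap, such as '[', fall through both range checks and return -1, which is why two separate comparisons are needed instead of one test from 'A' to 'z'.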

You can create one array for lowercase and one for uppercase, like this:

int lowerCounts[26] = {0}; // = {0} zero-initializes every element, so no memset is needed
int upperCounts[26] = {0};

Then, for each character 'ch' that you read from the file:

if (ch >= 'A' && ch <= 'Z')
   ++upperCounts[ch - 'A'];
else if (ch >= 'a' && ch <= 'z')
   ++lowerCounts[ch - 'a'];

If the count should be case-insensitive, meaning 'a' is treated the same as 'A', then use upperCounts for everything.

Change the if statement above to:

if (ch >= 'A' && ch <= 'Z')
   ++upperCounts[ch - 'A'];
else if (ch >= 'a' && ch <= 'z')
   ++upperCounts[ch - 'a']; // using upperCounts array instead of lowerCounts

Of course, you can then remove every reference to lowerCounts.

To print the counts, do the following:

for (char c = 'A'; c <= 'Z'; ++c)
{
   cout << c << " count = " << upperCounts[c - 'A'] << endl;
}

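You could use a vector, a map, or something else, but I think that at your level this kind of solution is a better fit for your current understanding and skills; you are just learning.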

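This assignment is a good exercise in optimizing I/O.
The file will be read into a block of memory, also known as a buffer.

Let's use arrays for the frequency counts, since indexing an array is about as cheap as a lookup can get.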

#include <iostream>
#include <fstream>

// Declare the size of the buffer.
static const unsigned int BUFFER_SIZE = 1024*1024;

int main()
{
    // Declare the buffer as "static" so that it lives in static storage, not on the stack.
    static char buffer[BUFFER_SIZE];
    unsigned int upperCounts[26] = {0};
    unsigned int lowerCounts[26] = {0};

    /* Use the same file opening as in your original code; "input.txt" is a placeholder name. */
    std::ifstream file("input.txt");

    // read() reports failure on the last, partially filled block, so also
    // keep going while gcount() says that some characters were read.
    while (file.read(buffer, BUFFER_SIZE) || file.gcount() > 0)
    {
        const std::streamsize characters_read = file.gcount();
        for (std::streamsize i = 0; i < characters_read; ++i)
        {
            const char ch = buffer[i];
            if (ch >= 'A' && ch <= 'Z')
            {
                ++upperCounts[ch - 'A'];
            }
            else
            {
                if (ch >= 'a' && ch <= 'z')
                {
                    ++lowerCounts[ch - 'a'];
                }
            }
        }
    }
    /* Insert code to print frequencies */
    return 0;  // Indicate success to the operating system.
}

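In the code above, the read() method reads a block of characters into memory. Reading in blocks is generally faster than reading one character at a time, because every read call carries some fixed overhead. Although the C++ stream facilities may already buffer the input, doing the block read ourselves lets us control the buffer size.

The buffer is then scanned for alphabetic characters and the frequency counts are updated. Scanning memory is much faster than going back to the file for each character.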

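Edit 1: Optimizing the computation
In the code above and in the OP's code, most of the execution time is spent computing the frequencies (because of the comparisons).

We can save more time by counting the frequencies of all characters and moving the letter-specific handling to after the input.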

unsigned int frequencies[256] = {0}; // One counter per possible byte value.

while (file.read(buffer, BUFFER_SIZE) || file.gcount() > 0)
{
    const std::streamsize characters_read = file.gcount();
    for (std::streamsize i = 0; i < characters_read; ++i)
    {
        // Index by the character's value, not by its position in the buffer.
        // The cast keeps the index in 0..255 even where char is signed.
        ++frequencies[static_cast<unsigned char>(buffer[i])];
    }
}

// Now print out the frequencies:
for (char ch = 'A'; ch <= 'Z'; ++ch)
{
    std::cout << ch << ": " << frequencies[ch] << "\n";
}
for (char ch = 'a'; ch <= 'z'; ++ch)
{
    std::cout << ch << ": " << frequencies[ch] << "\n";
}

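In the code above, the input loop is reduced to a single purpose: counting frequencies. It needs no range checks; the range checking happens after the input.

After the input, the frequencies are printed for the alphabetic characters, and only for those.

This example shows that a program can run faster when the most frequently executed part does only generic work. Specialization and detail handling are done after, or outside of, the performance-critical part.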
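The same "count everything first, specialize later" idea can be sketched on an in-memory string. This is an illustration under stated assumptions rather than the answer's exact code: countAllBytes and letterTotal are made-up names, and the case folding in letterTotal assumes ASCII.

```cpp
#include <array>
#include <string>

// Generic hot loop: one counter per byte value, no range checks.
// countAllBytes is an illustrative name.
std::array<unsigned int, 256> countAllBytes(const std::string& data)
{
    std::array<unsigned int, 256> frequencies{};
    for (const char ch : data)
    {
        // unsigned char keeps the index in 0..255 even where char is signed.
        ++frequencies[static_cast<unsigned char>(ch)];
    }
    return frequencies;
}

// Specialization after the loop: combine both cases of one letter.
// 'upper' is expected to be in 'A'..'Z' (ASCII assumed).
unsigned int letterTotal(const std::array<unsigned int, 256>& frequencies,
                         char upper)
{
    const unsigned char lower = static_cast<unsigned char>(upper - 'A' + 'a');
    return frequencies[static_cast<unsigned char>(upper)] + frequencies[lower];
}
```

The counting loop does the same work for every byte, and decisions such as "letters only" or "ignore case" are made once per letter when the results are reported.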