为什么关于大索引类型的函数 huffmandeco 出现八度误差？

如何解决为什么关于大索引类型的函数 huffmandeco 出现八度误差？

我有一个小的 MatLab 脚本，我试着去理解它。它没有太大作用。它只从文件中读取文本并使用霍夫曼函数对其进行编码和解码。但是解码的时候报错：

"错误：内存不足或维度对于 Octave 的索引类型来说太大
错误：从 huffmandeco>dict2tree 第 95 行第 19 列调用“

我不知道为什么，因为我调试了它并没有看到大索引类型。

我添加了根据输入文本计算 p 的部分。

%text is a random input text file in ASCII

%calculate the relative frequency of every Symbol
for i=0:127
    nlet=length(find(text==i));
    p(i+1)=nlet/length(text);
end
symb = 0:127;
dict = huffmandict(symb,p); % Create dictionary
compdata = huffmanenco(fdata,dict); % Encode the data
dsig = huffmandeco(compdata,dict); % Decode the Huffman code

我只能使用八度音程而不是 MatLab。我不知道，是否有意外错误。我在 Win10 上使用 Octave 版本 6.2.0。我尝试了大数据版本，它没有改变任何东西。
也许有人知道这种情况下的错误？

编辑：我再次调试了代码。在 huffmandeco 函数中，我发现了以下函数：

function tree = dict2tree (dict)

  L = length (dict);
  lengths = zeros (1,L);

  ## the depth of the tree is limited by the maximum word length.
  for i = 1:L
    lengths(i) = length (dict{i});
  endfor
  m = max (lengths);

  tree = zeros (1,2^(m+1)-1)-1;

  for i = 1:L
    pointer = 1;
    word    = dict{i};
    for bit = word
      pointer = 2 * pointer + bit;
    endfor
    tree(pointer) = i;
  endfor

endfunction

本例中最大长度m为82，因此函数计算：
树 = 零 (1,2^(82+1)-1)-1.
所以很明显为什么错误称为太大的索引类型。
但是一定有解决方案或者其他错误，因为代码之前已经测试过了。

解决方法

我还没有彻底清除代码以了解原因，但是 huffmandict 并没有像它声称的那样忽略零概率符号。我也没有找到关于 Savannah 的错误报告，但我也没有彻底搜索过。

一种解决方法是将符号列表及其概率限制为仅实际出现的符号。使用 containers.Map 是理想的，但在 Octave 中，您可以使用 unique 的几个输出来做到这一点：

% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols,~,inds] = unique(textstr);
inds = inds.';   % just make it easier to read

对于字符串

textstr = 'Random String Input.';

结果是：

>> symbols
symbols =  .IRSadgimnoprtu
>> inds
inds =
 Columns 1 through 19:
    4    6   11    7   12   10    1    5   15   14    9   11    8    1    3   11   13   16   15
 Column 20:
    2

所以输入字符串中的第一个符号是symbols(4)，第二个是symbols(6)，依此类推。

从那里开始，您只需使用 symbols 和 inds 来创建字典并对信号进行编码/解码。这是一个快速演示脚本：

textstr = 'Random String Input.';
fprintf("Starting string: %s\n",textstr);

% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols,inds] = unique(textstr);
inds = inds.';   % just make it easier to read

% Calculate the frequency of each symbol in table
% max(inds) == numel(symbols)
p = histc(inds,1:max(inds))/numel(inds);

dict = huffmandict(symbols,p);
compdata = huffmanenco(inds,dict);
dsig = huffmandeco(compdata,dict);

fprintf("Decoded string: %s\n",symbols(dsig));

和输出：

Starting string: Random String Input.
Decoded string: Random String Input.

要对原始输入字符串以外的字符串进行编码，您必须将字符映射到符号索引（显然，确保字符串中的所有符号都实际存在于符号表中）：

>> [~,s_idx] = ismember('trogdor',symbols)
s_idx =
   15   14   12    8    7   12   14

>> compdata = huffmanenco(s_idx,dict);
>> dsig = huffmandeco(compdata,dict);
>> fprintf("Decoded string: %s\n",symbols(dsig));
Decoded string: trogdor