How to fix an 8-bit Pearson hash implementation that produces very uneven values
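I'm implementing a Pearson hash in order to create a lightweight dictionary structure for a C project that needs to pair file names with file data - I want the good constant-time lookup properties of a hash table. I'm no math expert, so I looked up good text hashes and Pearson hashing came up, with the claim that it is efficient and has a good distribution. I tested my implementation and found that no matter how I vary the table size or the maximum file name length, the hashing is very inefficient, with for example 18/50 buckets left empty. I trust that Wikipedia isn't lying, and yes, I'm aware I could just download a third-party hash table implementation, but I would dearly like to know why my version isn't working.
In the code below (a function that inserts values into the table), "csstring" is the file name, the string to be hashed, "cLen" is the length of the string, "pData" is a pointer to some data that is inserted into the table, and "pTable" is the table structure. The initial condition cHash = cLen - csstring[0] is something I found experimentally to slightly improve uniformity. I should add that I'm testing the table with entirely random strings (using rand() to generate ASCII values) whose lengths lie within a certain range - this is to easily generate and test a large number of values.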
typedef struct StaticStrTable {
    unsigned int nRepeats;
    unsigned char nBuckets;
    unsigned char nMaxCollisions;
    void** pBuckets;
} StaticStrTable;
static const char cPerm256[256] = {
227,117,238,33,25,165,107,226,132,88,84,68,217,237,228,58,52,147,46,197,191,119,211,218,139,196,153,170,77,175,22,193,83,66,182,151,99,11,144,104,233,166,34,177,14,194,51,30,121,102,49,222,210,199,122,235,72,13,156,38,145,137,78,65,176,94,163,95,59,92,114,243,204,224,43,185,168,244,203,28,124,248,105,10,87,115,161,138,223,108,192,6,186,101,16,39,134,123,200,190,195,178,164,9,251,245,73,162,71,7,239,62,69,209,159,3,45,247,19,174,149,61,57,146,234,189,15,202,89,111,207,31,127,215,198,231,4,181,154,64,125,24,93,152,37,116,160,113,169,255,44,36,70,225,79,250,12,229,230,76,167,118,232,142,212,98,82,252,130,23,29,236,86,240,32,90,67,126,8,133,85,20,63,47,150,135,100,103,173,184,48,143,42,54,129,242,18,187,106,254,53,120,205,155,216,219,172,21,253,5,221,40,27,2,179,74,17,55,183,56,50,110,201,109,249,128,112,75,220,214,140,246,213,136,148,97,35,241,60,188,180,206,80,91,96,157,81,171,141,131,158,1,208,26,41
};
void InsertStaticStrTable(char* csstring,unsigned char cLen,void* pData,StaticStrTable* pTable) {
    unsigned char cHash = cLen - csstring[0];
    for (int i = 0; i < cLen; ++i) cHash ^= cPerm256[cHash ^ csstring[i]];
    unsigned short cTableIndex = cHash % pTable->nBuckets;
    long long* pBucket = pTable->pBuckets[cTableIndex];
    // Inserts data and records how many collisions there are - it may look weird as the way in which I decided to pack the data into the table buffer is very compact and arbitrary
    // It won't affect the hash though,which is the key issue!
    for (int i = 0; i < pTable->nMaxCollisions; ++i) {
        if (i == 1) {
            pTable->nRepeats++;
        }
        long long* pSlotID = pBucket + (i << 1);
        if (pSlotID[0] == 0) {
            pSlotID[0] = csstring;
            pSlotID[1] = pData;
            break;
        }
    }
}
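For comparison, here is a minimal sketch of the textbook 8-bit Pearson update, h = T[h ^ c], which assigns the table lookup to the hash instead of XOR-ing it in as the loop above does. It also stores the table as unsigned char so that entries above 127 and the index stay in the range 0..255. The names pearson_make_table and pearson8 and the small LCG used for shuffling are illustrative assumptions, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Fill T with a permutation of 0..255 using a Fisher-Yates shuffle.
   The small LCG below is only a stand-in for any random source. */
static void pearson_make_table(unsigned char T[256], unsigned int seed)
{
    for (int i = 0; i < 256; ++i)
        T[i] = (unsigned char)i;
    for (int i = 255; i > 0; --i) {
        seed = seed * 1103515245u + 12345u;
        int j = (int)((seed >> 16) % (unsigned int)(i + 1));
        unsigned char tmp = T[i];
        T[i] = T[j];
        T[j] = tmp;
    }
}

/* Textbook 8-bit Pearson hash: h = T[h ^ c] for each input byte c.
   Both operands are unsigned char, so the index is always 0..255. */
static unsigned char pearson8(const unsigned char *s, size_t len,
                              const unsigned char T[256])
{
    assert(s != NULL || len == 0);
    unsigned char h = 0;
    for (size_t i = 0; i < len; ++i)
        h = T[h ^ s[i]];
    return h;
}
```

A bucket index would then be taken as pearson8(...) % nBuckets, as in the function above.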
Solution
FYI (this is not an answer, I just need the formatting): these are just single runs of a simulation, YMMV.
Randomly distributing 50 elements over 50 bins:
kalender_size=50 nperson = 50
E/cell|  Ncell  |   frac   |  Nelem  |   frac   |h/cell|  hops  | Cumhops
------+---------+----------+---------+----------+------+--------+--------
    0:        18 (0.360000)         0 (0.000000)      0        0        0
    1:        18 (0.360000)        18 (0.360000)      1       18       18
    2:        10 (0.200000)        20 (0.400000)      3       30       48
    3:         4 (0.080000)        12 (0.240000)      6       24       72
------+---------+----------+---------+----------+------+--------+--------
    4:        50                   50                   1.440000       72
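A run like the one above can be reproduced with a few lines of C. This is only a sketch under stated assumptions: the function name simulate_occupancy and its parameters are made up for illustration, rand() stands in for an ideal uniform hash, and the small modulo bias of rand() % nbins is ignored.

```c
#include <assert.h>
#include <stdlib.h>

/* Throw nitems items into nbins bins uniformly at random and tally
   how many bins end up holding k items. hist[k] receives that count
   for k < histlen; fuller bins are lumped into hist[histlen - 1].
   Returns 0 on success, -1 if memory allocation fails. */
static int simulate_occupancy(unsigned nbins, unsigned nitems,
                              unsigned *hist, unsigned histlen)
{
    assert(nbins > 0 && histlen > 0);
    unsigned *bins = calloc(nbins, sizeof *bins);
    if (bins == NULL)
        return -1;
    for (unsigned i = 0; i < nitems; ++i)
        bins[(unsigned)rand() % nbins]++;   /* modulo bias ignored */
    for (unsigned k = 0; k < histlen; ++k)
        hist[k] = 0;
    for (unsigned b = 0; b < nbins; ++b) {
        unsigned k = bins[b] < histlen ? bins[b] : histlen - 1;
        hist[k]++;
    }
    free(bins);
    return 0;
}
```

Calling it with nbins = nitems = 50 and printing hist[0], hist[1], ... gives the Ncell column; the exact counts vary from run to run and with the seed passed to srand().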
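Similarly, distributing 365 people over a birthday calendar (ignoring leap days...); note that the run shown below actually used 356 for both values: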
kalender_size=356 nperson = 356
E/cell|  Ncell  |   frac   |  Nelem  |   frac   |h/cell|  hops  | Cumhops
------+---------+----------+---------+----------+------+--------+--------
    0:       129 (0.362360)         0 (0.000000)      0        0        0
    1:       132 (0.370787)       132 (0.370787)      1      132      132
    2:        69 (0.193820)       138 (0.387640)      3      207      339
    3:        19 (0.053371)        57 (0.160112)      6      114      453
    4:         6 (0.016854)        24 (0.067416)     10       60      513
    5:         1 (0.002809)         5 (0.014045)     15       15      528
------+---------+----------+---------+----------+------+--------+--------
    6:       356                  356                   1.483146      528
For N items over N slots, the expected number of empty slots and the expected number of slots holding a single item are equal (asymptotically, for large N). The expected density of both is 1/e.
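The 1/e figure can be checked with a short derivation, assuming each item lands in each of the N slots independently with probability 1/N (that is, an ideal hash). The number of items k in a given slot is then binomial and tends to a Poisson distribution with mean 1:

```latex
P(k) = \binom{N}{k}\left(\frac{1}{N}\right)^{k}\left(1-\frac{1}{N}\right)^{N-k}
\;\xrightarrow{\;N\to\infty\;}\; \frac{e^{-1}}{k!},
\qquad
P(0) = P(1) = \frac{1}{e} \approx 0.368 .
```

For N = 50 the exact value is P(0) = (49/50)^{50} ≈ 0.364, so roughly 18 empty buckets out of 50 is what a uniform hash is expected to produce.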
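The final number (1.483146) is the number of ->next pointer traversals per found element (when using a chained hash table). Any optimal hash function will come out at almost 1.5.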
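The hops figure can be recomputed from the Ncell column. The sketch below assumes that finding every item in a chain of length k costs 1 + 2 + ... + k = k(k+1)/2 traversals, which matches the h/cell column in the tables; the function name mean_hops is hypothetical. In the Poisson limit with mean 1, the expected value of k(k+1)/2 is (2 + 1)/2 = 1.5 per bucket with one item per bucket on average, which is where the 1.5 comes from.

```c
#include <assert.h>
#include <stddef.h>

/* hist[k] = number of buckets holding exactly k items.
   Looking up each item in a chain of length k once costs
   k(k+1)/2 traversals in total; the sum over all buckets is
   divided by the number of items to get the mean per lookup. */
static double mean_hops(const unsigned *hist, size_t histlen)
{
    assert(hist != NULL);
    unsigned long hops = 0, items = 0;
    for (size_t k = 0; k < histlen; ++k) {
        hops  += (unsigned long)hist[k] * k * (k + 1) / 2;
        items += (unsigned long)hist[k] * k;
    }
    return items ? (double)hops / (double)items : 0.0;
}
```

Fed with the Ncell values of the first table, {18, 18, 10, 4}, this gives 72 / 50 = 1.44, the value in the table's last row.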