《数据结构》学习-- Hash(2) --Separate Chaining

本系列是《数据结构与算法分析-C语言描述》（Data Structures and Algorithm Analysis in C，作者Mark Weiss）一书的学习笔记，当我在做cc150需要补某个知识点时，就会把这本书翻出来学习一下，同时分享~

如果你有任何问题和建议，希望能与我分享.

1. 回顾

上一次我们简单介绍了：

Hash表的概念
由输入数据的数据结构构成的Array
优缺点
查找、插入、删除平均常数时间。但不维护Array的顺序。
四大要素
Hash表主体，Hash表大小，Hash Function，冲突解决方案。

实际上，Hash表最难设计的就是Hash Function以及冲突解决方法（Collision Resolution）。关于Hash Function的设计，没有一个统一的方法，常用的方法我们也在第一章介绍过了。
这一章，我们将介绍一种常用的冲突解决方案，即开散列法（Separate Chaining）。

2. Separate Chaining简介

Separate Chaining的思想无比简单。即原本Hash表每个元素是一个输入数据的数据结构类型，现在把每个元素改成一个由该数据结构类型构成的指针链表。这样，当发生冲突时，只要在该指针链表的尾端或首端插入该值即可。

3.Rehash

在详述我们的HashTable实现之前，我们还要引入最后一个概念：rehash。当我们不断往HashTable内插入元素，HashTable就会越来越满，而Find，Insert，Remove的操作都会越来越慢！
事实上，我们定义一个负重参数（load factor） $\lambda$ ，它的值是HashTable已有的元素数除以HashTable表大小。对于每一次不成功的搜索，平均搜索次数为 $\lambda$ （不包括最后的NULL），对于每一次成功的搜索，平均搜索次数为 $1+\lambda/2$ 。因此，HashTable本身的大小对性能的影响并不大，重要的是load factor的大小。Mark Weiss建议，general rule是保证 $\lambda$ 尽可能接近1。
因此，当我们的load factor比较大时，我们需要扩充HashTable，以让load factor尽可能接近1，这个过程就是rehash。另一方面，rehash是非常耗时的工作，因为我们需要遍历所有元素，然后全部重新插入一遍，因此，只有当必须（即load factor大到某一阈值时）时才去执行。

3. Separate Chaining实现

接下来使用C++代码详解Separate Chaining的实现。该代码在Mac OS X 64bit系统，clang++编译器下调试通过，其他平台不能保证。想直接下载文件可以戳[这里]。

3.1 Hash表主体

struct HashNode{
            ElementType elementValue;
            HashNode* next;
        };
        typedef HashNode* HashList;
        struct HashTbl{
            HashList* table;
            int tableSize;
            int content;
        };
        HashTbl* hashTable;

不知道为何CSDN的Markdown对混合了大段注释语句的高亮支持很不好，我把注释放在下面，已经清楚解释了这个代码的功能。

/*our hash table is actually a link to a struct called HashTbl,* the HashTbl consists of a array of HashList,which is a linked list of HashNode
             * and a tableSize,indicating how large is our hash table
             * and a content,indicating how full is our hash table
             * the HashNode is just a node in a linked list,consisting of the content(elementValue) and a linked to next node
             */

3.2 初始化操作

初始化操作很简单，不断把内存空间分配就好了。
注意，我这里在new以后，做了判断NULL的步骤，不过根据StackOverflow讨论，这种分配方式在内存分配失败后是会返回bad_alloc标志，因此应该用try catch来处理，而不是判断NULL。另外，文中也建议使用STL库而不是手动分配内存来管理动态数据。我还在学习中，以后进一步完善。

template<class ElementType>
void HashTable<ElementType>::initialize(HashTbl*& newHashTable,int minSize)
{
    newHashTable=new HashTbl;
    if(newHashTable==NULL){
        printf("new operation Failed! At file: %s,line: %d\n",__FILE__,__LINE__);
        exit(-1);
    }

    //寻找比minSize大的最近的质数。因为质数大小的Hash表性能最好。minSize是用户一开始指定的Hash表大小。
    int tableSize=nextPrime(minSize);

    newHashTable->table=new HashList[tableSize];
    if(newHashTable->table==NULL){
        printf("new operation Failed! At file: %s,__LINE__);
        exit(-1);
    }

    for(int i=0;i<tableSize;i++){
        newHashTable->table[i]=new HashNode;
        if(newHashTable->table[i]==NULL){
            printf("new operation Failed! At file: %s,__LINE__);
            exit(-1);
        }
        newHashTable->table[i]->next=NULL;
    }

    newHashTable->tableSize=tableSize;
    newHashTable->content=0;
}

值得注意的一点是，这里函数声明时，第一个参数是HashTbl*&,即一个指向指针的引用。这样做的目的是，之后我们传递实际的HashTable指针进来时，可以在函数内部修改这个指针。可以参考这个文章。

3.3 Hash Function

template<class ElementType>
int HashTable<ElementType>::hashFunc(ElementType elementValue)
{
    //getElementKey是根据输入数据的数据结构来获得数据Key的函数。如果数据数据是整数，那么Key可以就等于输入数据。如果输入数据是字符串，那么Key可以等于所有字符对应ASCII码之和。等等。
    int key=getElementKey(elementValue);
    return key%(hashTable->tableSize);//using simple module method to get new position
}

3.4 Find

template<class ElementType>
class HashTable<ElementType>::HashNode* HashTable<ElementType>::findInner(HashTbl* _hashTable,ElementType elementValue)
{
    int position=hashFunc(_hashTable,elementValue);
    HashNode* hashNode=_hashTable->table[position];
    while(hashNode->next!=NULL && !isEqual(hashNode->next->elementValue,elementValue)){
        hashNode=hashNode->next;
    }
    return hashNode;
}

template<class ElementType>
bool HashTable<ElementType>::find(ElementType elementValue)
{
    HashNode* hashNode = findInner(hashTable,elementValue);
    return hashNode->next != NULL;
}

这个函数总体是很容易看懂的（前提是你得懂类模板哈哈，不懂的话可以参考：类模板基础和类模板中的结构体）
至于为什么需要一个findInner和一个find呢？find函数是给用户用的，用户只需要简单的传递一个elementValue就能知道find是否成功。而findInner是给HashTable内部实现用的。
另外，这里的isEqual函数是用户自己定义的，因为不同的数据结构判断相等的标准不一样（如int直接等于即可，而string需要用strcmp）。

3.5 Insert

template<class ElementType>
bool HashTable<ElementType>::insertInner(HashTbl*& _hashTable,ElementType elementValue)
{
    //rehash
    if(_hashTable->content>_hashTable->tableSize*10)
    {
        _hashTable=rehash();
    }

    HashNode* insertNode=findInner(_hashTable,elementValue);
    if(insertNode->next==NULL){
        insertNode->next=new HashNode;
        if(insertNode->next==NULL){
            printf("new operation Failed! At file: %s,__FILE__,__LINE__);
            exit(-1);
        }

        insertNode->next->elementValue=elementValue;
        insertNode->next->next=NULL;
        _hashTable->content++;
        return true;
    }
    return false;
}

template<class ElementType>
bool HashTable<ElementType>::insert(ElementType elementValue)
{
    return insertInner(hashTable,elementValue);
}

同样，这段代码也很容易理解。rehash函数的实现见下面。

3.6 Remove

template<class ElementType>
bool HashTable<ElementType>::removeInner(HashTbl* _hashTable,ElementType elementValue)
{
    HashNode* removeNode=findInner(_hashTable,elementValue);
    if(removeNode->next !=NULL){
        HashNode* tobedelete=removeNode->next;
        removeNode->next=removeNode->next->next;
        delete tobedelete;
        _hashTable->content--;
        return true;
    }
    return false;
}

template<class ElementType>
bool HashTable<ElementType>::remove(ElementType elementValue)
{
    return removeInner(hashTable,elementValue);
}

同样比较清楚。

3.7 rehash

template<class ElementType>
class HashTable<ElementType>::HashTbl* HashTable<ElementType>::rehash()
{
    HashTbl* newTable;
    initialize(newTable,hashTable->tableSize*10);

    for(int i=0;i<hashTable->tableSize;i++){
        HashNode* hashNode=hashTable->table[i]->next;
        while(hashNode){
            insertInner(newTable,hashNode->elementValue);
            hashNode=hashNode->next;
        }
    }
    delete hashTable;
    return newTable;
}

rehash函数非常简单，创建一个两倍于原来大小的新HashTable，然后把之前每个HashNode重新插入到新的HashTable中。
另外，这里我设定当load factor为10时，我们把HashTable扩充到原来的10倍。这个10是我随意设定的，没有进行过性能优化。

3.8 nextPrime

template<class ElementType>
bool HashTable<ElementType>::isPrime(int num)
{
    bool result;
    if(num==2)
        result=true;
    else if(num/2*2 == num)
        result=false;
    else{
        int sqrtNum=sqrt(num);
        result=true;
        for(int i=3;i<sqrtNum;i+=2){
            if(num/i*i==num){
                result=false;
                break;
            }
        }
    }
    return result;
}

template<class ElementType>
int HashTable<ElementType>::nextPrime(int num)
{
    int result=num+1;
    while(!isPrime(result))
        result++;
    printf("result:%d\n",result);
    return result;
}

寻找下一个质数。

4. HashTable测试

我们首先定义好使用的输入数据的数据结构，getElementKey函数以及isEqual函数。在这里，我们用最简单的int做测试。

typedef int ElementType;

int getElementKey(ElementType elementValue){
    return elementValue;
}

bool isEqual(ElementType elementValue1,ElementType elementValue2){
    return elementValue1==elementValue2;
}

4.1正确性测试

为了测试我们的HashTable是否正确。我们在一个循环中，插入循环index，然后搜索这个index，确保每一次都能搜索到。然后在一个循环中，删除index，然后搜索这个index，确保每一次都搜索不到。

bool testHashCorrectness(HashTable<ElementType> &hashTable)
{
    int checkTotal=10000000;
    for(int i=0;i<checkTotal;i++){
        hashTable.insert(i);
        if(!hashTable.find(i))
            return false;
    }
    for(int i=0;i<checkTotal;i++){
        hashTable.remove(i);
        if(hashTable.find(i))
            return false;
    }
    return true;
}

int main()
{
    HashTable<int> hashTable(100000,&getElementKey,&isEqual);
    printf("check correctness: %d\n",testHashCorrectness(hashTable));
    return 0;
}

结果为：

check correctness: 1

4.2 性能测试

最后，我们对于使用或不使用rehash函数来做一个性能测试。

int main()
{
    HashTable<int> hashTable(100000,&isEqual);

    clock_t start=clock();
    for(int i=0;i<10000000;i++){
        int r=rand();
        hashTable.insert(r);
    }
    clock_t finish=clock();
    printf("time is %fs\n",(double)(finish-start)/CLOCKS_PER_SEC);

    return 0;
}

当我们不适用rehash函数，测试结果为：

time is 47.611208s

而使用了rehash函数之后，测试结果为：

time is 8.676079s

相信细调load factor之后，性能可以有更大的提升。

5. 总结

这一章我们介绍了HashTable中的Separate Chaining解决Collision的方法。其核心思想就是用一个链表结构替代原先最简单的HashTable中的每一个元素。另外我们还介绍了rehash函数，并引入了load factor概念。 Separate Chaining的优点在于其实现非常简单。并且能非常好的解决冲突问题。 Separate Chaining的缺点在于，每次插入一个元素都需要new一个内存空间，这个操作涉及到核心函数，所以速度会慢。如何解决这个问题呢？请期待下一篇“《数据结构》学习–Hash（3）–Open Addressing”！