在集合中查找重复元素并将其分组的快速算法是什么？

如何解决在集合中查找重复元素并将其分组的快速算法是什么？

您可以使用从代表性元素到其他元素的列表/向量/双端队列的映射。这要求插入容器中的比较相对较少，这意味着您可以遍历结果组，而不必执行任何比较。

此示例始终将第一个代表性元素插入映射的双端队列存储中，因为这从逻辑上简化了通过组的后续迭代，但是，如果此重复证明存在问题，则仅执行一次push_back仅会很容易if (!ins_pair.second)。

typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}

这样，遍历各个组就变得相对简单且便宜：

void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}

I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. and average of 2 objects per group) the above method cost the following number of comparisons and copies:

1630674 comparisons, 443290 copies

(I tried bringing the number of copies down, but only really managed to at the expense of comparisons which seem to be the higher cost operation in your scenario.)

Using a multimap, and retaining the prevIoUs element in the final iteration to detect the group transitions cost this:

1756208 comparisons, 100000 copies

Using a single list and popping the front element and performing a linear search for other group members cost:

1885879088 comparisons, 100000 copies

Yes, that’s ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted and this is going to cost a similar number of comparisons as building an inherently ordered container would in the first place.

I took your posted code and ran the implied algorithm (I had to make some assumptions about the code as there as some assumed deFinitions) over the same test data set as I used before and I counted:

1885879088 comparisons, 420939 copies

i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that that means we using essentially similar algorithms for this case. I can’t see any evidence of an alternative sort order, but it looks like you want a list of the groups which contain more than one equivalent elements. This can be simply achieved in my Iterate function by adding in if (i->size > 1) clause.

I still can’t see any evidence that building a sorted container such as this map of deques isn’t a good (even if not optimal) strategy.

解决方法

假设您有一个元素集合，那么如何才能挑选出重复的元素并将它们放入比较最少的每个组中？最好在C
++中使用，但是算法比语言更重要。对于给定的示例{E1，E2，E3，E4，E4，E2，E6，E4，E3}，我希望提取出{E2，E2}，{E3，E3}，{E4，E4，E4}。您将选择哪种数据结构和算法？还请包括设置数据结构的成本，例如，是否是像std
:: multimap这样的预先排序的数据结构

更新

按照建议使事情更清晰。这里有一个约束： 必须对元素进行比较 以确保它们是重复的。

因此哈希不适用，因为实际上它们将比较从重元素（例如，数据块）转移到轻元素（整数），并减少了一些比较，但并没有消除它们，最后，我们回到了我们原来的问题，当在一个碰撞桶中时。

假设您每个都有一堆潜在的GB重复文件，那么每个人都知道它们具有相同的哈希值。现在，您将发现真实的重复项。

不，这不是一个现实问题（即使MD5足以为现实文件生成唯一的哈希）。但是，只是假装使我们能够 集中精力查找涉及比较最少的数据结构+算法 。

我正在做的是

将其表示为STL std :: list数据结构（其中1）其元素删除比说一个向量便宜（2）其插入更便宜，不需要排序。）
弹出一个元素并将其与其余元素进行比较，如果找到重复元素，则将其从列表中拉出。一旦到达列表的末尾，就会找到一组重复项（如果有）。
重复上述两个步骤，直到列表为空。

在最佳情况下，它需要N-1，但是（N-1）！在最坏的情况下。

有什么更好的选择？

我的代码使用上面说明的方法：

// algorithm to consume the std::list container,// supports: list<path_type>,list< pair<std::string,paths_type::const_iterater>>
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements,//    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrinked list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            list<T>::iterator it     = l.begin(); 
            list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear,memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
};


// type for item-item comparison within a stl container,e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool  result;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
};


};

感谢您提供以下答案，但是在我的示例中，它们似乎误导了它是关于整数的。实际上 ，元素是不可知类型的（不一定是整数，字符串或任何其他POD）
，并且相等谓词是自定义的，也就是说 ，比较可能非常繁琐 。

因此，也许我的问题应该是：使用哪种数据结构+算法需要较少的比较。

根据我的测试，使用像multiset这样的预先排序的容器，multimap并不更好，因为

插入时的排序已经进行了比较，
以下相邻发现再次进行比较，
这些数据结构优先于小于运算而不是相等运算，它们执行的小于2（a

我看不到它如何保存比较。

以下是一些答案所忽略的另一件事，我需要将重复的组彼此区分开，而不仅仅是将它们保存在容器中。

结论

经过所有讨论，似乎有3种方法

我原来的天真方法，如上所述
从类似的线性容器开始std::vector，对其进行排序，然后找到相等的范围
从相关的容器开始，如std::map<Type,vector<duplicates>>，如Charles Bailey所建议的，在建立相关容器的过程中挑选出重复项。

我已经编写了一个示例，以测试下面发布的所有方法。

重复项的数量以及何时分发可能会影响最佳选择。

当方法1掉落在前面时最好，而方法1结束时则最差。排序不会改变分布，而是尾数。
方法3的平均表现最高
方法2永远不是最佳选择

感谢所有参与讨论的人。

一个输出，带有来自以下代码的20个样本项。

用[20 10 6 5 4 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1]测试

和[1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 4 5 6 10 20]

使用std :: vector-> sort（）-> neighbor_find（）：

比较：[‘<’= 139，’==’= 23]

比较：[‘<’= 38，’==’= 23]

使用std :: list-> sort（）->收缩列表：

比较：[‘<’= 50，’==’= 43]

比较：[‘<’= 52，’==’= 43]

使用std :: list->收缩列表：

比较：[‘<’= 0，’==’= 121]

比较：[‘<’= 0，’==’= 43]

使用std :: vector-> std :: map>：

比较：[‘<’= 79，’==’= 0]

比较：[‘<’= 53，’==’= 0]

使用std :: vector-> std :: multiset-> neighbor_find（）：

比较：[‘<’= 79，’==’= 7]

比较：[‘<’= 53，’==’= 7]

码

// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ","
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;

ostream& operator<<(ostream& os,const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        bool reserve_order = false;

        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method%2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const Container::value_type& v,m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    }

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(),m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE,ITR,ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin,itr_end,[]  (VALUE& v1,VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start,itr_range_begin,itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort();////std::sort(m_container.begin(),m_container.end()); 
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group;

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear,memory allocation not freed
                one_group.clear(); 
            }            
        }
    }

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type,vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type,vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v,m_container)
        {
            pair<ITR,bool> mit; 
            mit = local_map.insert(make_pair(v,vector<Type>(1,v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.empty()) local_map.erase(itr);

            itr++;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef set<Type> SET;
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE,ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v,m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin,itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    size_t N = 20;

    Vector().run(20);
    List(true).run(20);
    List(false).run(20);
    Map().run(20);
    Multiset().run(20);
}