来自两个以上字符串的最长公共子字符串

我需要从C中的一组文件名计算最长的公共子串.

确切地说,我有一个std :: std :: strings的列表(或QT等价,也很好)

char const *x[] = {"FirstFileWord.xls","SecondFileBlue.xls","ThirdFileWhite.xls","ForthFileGreen.xls"};
std::list<std::string> files(x,x + sizeof(x) / sizeof(*x));

我需要计算所有字符串的n个不同的最长公共子串,在这种情况下,例如对于n = 2

"File" and ".xls"

如果我可以计算最长的公共子序列,我可以将其删除并再次运行算法以获得第二长的,所以基本上归结为：

是否有一个(引用？)实现来计算std :: list的std :: strings的LCS？

这不是一个好的答案,但是我有一个肮脏的解决方案 – 在QUrls的QList上蛮力,只从中获取最后一个“/”之后的部分.我想用“正确的”代码替换它.

(我发现了http://www.icir.org/christian/libstree/ – 这会有很大的帮助,但是我无法在我的机器上编译它.有人可能会使用它吗？)

QString SubstringMatching::getMatchPattern(QList<QUrl> urls)
    {
    QString a;

    int foundPosition = -1;
    int foundLength = -1;
    for (int i=urls.first().toString().lastIndexOf("/")+1; i<urls.first().toString().length(); i++)
    {
        bool hit=true;
        int xj;
        for (int j=0; j<urls.first().toString().length()-i+1; j++ ) // try to match from position i up to the end of the string :: test character at pos. (i+j)
        {
            if (!hit) break;

            QString firstString = urls.first().toString().right( urls.first().toString().length()-i ).left( j ); // this needs to match all k strings
            //qDebug() << "SEARCH " << firstString;

            for (int k=1; k<urls.length(); k++) // test all other strings,k = test string number
            {
                if (!hit) break;

                //qDebug() << " IN  " << urls.at(k).toString().right(urls.at(k).toString().length() - urls.at(k).toString().lastIndexOf("/")+1);
                //qDebug() << " RES " << urls.at(k).toString().indexOf(firstString,urls.at(k).toString().lastIndexOf("/")+1);
                if (urls.at(k).toString().indexOf(firstString,urls.at(k).toString().lastIndexOf("/")+1)<0) {
                    xj = j;
                    //qDebug() << "HIT LENGTH " << xj-1 << " : " << firstString;
                    hit = false;
                }
            }

        }
        if (hit) xj = urls.first().toString().length()-i+1; // hit up to the end of the string
        if ((xj-2)>foundLength) // have longer match than existing,j=1 is match length
        {
            foundPosition = i; // at the current position
            foundLength = xj-1;
            //qDebug() << "Found at " << i << " length " << foundLength;
        }
    }

    a = urls.first().toString().right( urls.first().toString().length()-foundPosition ).left( foundLength );
    //qDebug() << a;
    return a;
}

解决方法

如果你说后缀树太重量级或其他不切实际,以下
相当简单的蛮力方法可能适合您的应用.

我假设不同的子串应该是非重叠的并从中挑选出来
左到右.

即使有这些假设,也不需要包含一个独特的集合
“一组字符串的N个不同的最长公共子串”.无论N是什么,
可能存在多于N个不同的共同子串,它们都是相同的最大值
长度和其中任何N的选择都是任意的.于是
解决方案找到最长的不同共同点的最多N *集*
所有长度相同的子串都是一组.

算法如下：

> Q是目标长度配额.
>字符串是字符串的问题集.
>结果是一个最初为空的多图,它将长度映射到一组字符串,
结果[l]是长度为l的集合
> N,最初为0,是结果中表示的不同长度的数量
>如果Q为0或字符串为空则返回结果
>找到Strings中最短的成员;保留它的副本S并将其删除
来自Strings.我们继续比较S的子串与那些
字符串因为{Strings,S}的所有常见子串必须是
S的子串
>迭代生成S的所有子串,最长,使用
由偏移量和长度控制的明显嵌套循环.对于每个子串ss
S：

>如果ss不是字符串的常见子字符串,则接下来.
>迭代结果[l]为l> = ss的长度直到结束
结果或直到ss被发现是被检查的子串
结果.在后一种情况下,ss与已经的结果没有区别
在手,所以接下来.
> ss是常见的子串,与已有的子串不同.迭代
结果[1]对于l <1. ss的长度,删除每个结果是a
ss的子串,因为所有这些都比ss短并且不明显
从中. ss现在是一个常见的子串,与已有的子串不同
剩下的所有其他人都与ss不同.
>对于l = ss的长度,检查Results [l]是否存在,即是否存在
有任何结果与ss相同的长度.如果没有,请拨打电话
NewLength条件.
>检查是否N == Q,即我们已达到目标配额的不同
长度.如果NewLength获得并且N == Q,则调用StickOrRaise条件.
>如果获得StickOrRaise,则将ss的长度与l = the进行比较
手中最短结果的长度.如果ss小于l
那么它对我们的配额来说太短了,所以接下来.如果ss长于l
那么所有最短的结果都将被推翻为ss,所以删除
结果[l]和减少N.
>将ss插入到按其长度键入的结果中.
>如果NewLength获得,则增加N.
>放弃有S的子串的内部迭代
ss的相同偏移但是更短,因为它们都不是明显的
来自ss.
>通过ss的长度推进外部迭代中S的偏移量,
到下一个非重叠子字符串的开头.

>返回结果.

这是一个实现解决方案并用它来演示的程序
字符串列表：

#include <list>
#include <map>
#include <string>
#include <iostream>
#include <algorithm>

using namespace std;

// Get a non-const iterator to the shortest string in a list
list<string>::iterator shortest_of(list<string> & strings)
{
    auto where = strings.end();
    size_t min_len = size_t(-1);
    for (auto i = strings.begin(); i != strings.end(); ++i) {
        if (i->size() < min_len) {
            where = i;
            min_len = i->size();
        }
    }
    return where;
}

// Say whether a string is a common substring of a list of strings
bool 
is_common_substring_of(
    string const & candidate,list<string> const & strings)
{
    for (string const & s : strings) {
        if (s.find(candidate) == string::npos) {
            return false;
        }
    }
    return true;
}


/* Get a multimap whose keys are the at-most `quota` greatest 
    lengths of common substrings of the list of strings `strings`,each key 
    multi-mapped to the set of common substrings of that length.
*/
multimap<size_t,string> 
n_longest_common_substring_sets(list<string> & strings,unsigned quota)
{
    size_t nlengths = 0;
    multimap<size_t,string> results;
    if (quota == 0) {
        return results;
    }
    auto shortest_i = shortest_of(strings);
    if (shortest_i == strings.end()) {
        return results;
    }
    string shortest = *shortest_i;
    strings.erase(shortest_i);
    for ( size_t start = 0; start < shortest.size();) {
        size_t skip = 1;
        for (size_t len = shortest.size(); len > 0; --len) {
            string subs = shortest.substr(start,len);
            if (!is_common_substring_of(subs,strings)) {
                continue;
            }
            auto i = results.lower_bound(subs.size());
            for (   ;i != results.end() && 
                    i->second.find(subs) == string::npos; ++i) {}
            if (i != results.end()) {
                continue;
            }
            for (i = results.begin(); 
                    i != results.end() && i->first < subs.size(); ) {
                if (subs.find(i->second) != string::npos) {
                    i = results.erase(i);
                } else {
                    ++i;
                }
            }
            auto hint = results.lower_bound(subs.size());
            bool new_len = hint == results.end() || hint->first != subs.size();
            if (new_len && nlengths == quota) {
                size_t min_len = results.begin()->first;
                if (min_len > subs.size()) {
                    continue;
                }
                results.erase(min_len);
                --nlengths;
            }
            nlengths += new_len;
            results.emplace_hint(hint,subs.size(),subs);
            len = 1;
            skip = subs.size();
        }
        start += skip;
    }
    return results; 
}

// Testing ...

int main()
{
    list<string> strings{
        "OfBitWordFirstFileWordZ.xls","SecondZWordBitWordOfFileBlue.xls","ThirdFileZBitWordWhiteOfWord.xls","WordFourthWordFileBitGreenZOf.xls"};

    auto results = n_longest_common_substring_sets(strings,4);
    for (auto const & val : results) {
        cout << "length: " << val.first 
        << ",substring: " << val.second << endl;
    }
    return 0;
}

输出：

length: 1,substring: Z
length: 2,substring: Of
length: 3,substring: Bit
length: 4,substring: .xls
length: 4,substring: File
length: 4,substring: Word

(用gcc 4.8.1构建)

来自两个以上字符串的最长公共子字符串 – C.

解决方法

相关推荐