Looking for a Python function to find the longest sequentially repeated substring in a string


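I'm doing some coding with DNA sequences and I'm interested in a function that finds sequential repeats (which could indicate places where primers can "slip", AKA do bad stuff).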

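An example of what I'm interested in is as follows:

longest_repeat('ATTTTCCATGATGATG')

which would output the repeat length and coordinates, in this case 9 long and 7:15. The function should pick up the ATGATGATG at the end, and since it is longer than the TTTT repeat and the TGATGA repeat, it would report only ATGATGATG. In case of ties, I'd like it to report all of them, or at least one of them.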

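It would also be nice to set a threshold so that repeats are only reported when they exceed a specific length.

I have some experience with Python, but this particular problem has me stumped, because if I code it inefficiently and put in a 50-character string it could take a very long time. I appreciate any help!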

Solution

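The following works quite efficiently. It returns the longest repeated sequence, its length, its starting index and its ending index. If there are several sequences of maximum length, the result is a list of all of them. The second parameter of longest(s, threshold) is the minimum length a sequence must have to be reported: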

import numpy as np

def b(n): # returns the set of factors of an integer; used by isseq below
    r = np.arange(1,int(n ** 0.5) + 1)
    x = r[np.mod(n,r) == 0]
    # integer division keeps the factors as integers rather than floats
    return set(np.concatenate((x,n // x),axis=None))

def isseq(s): # tests whether a string is a repeated sequence: for each proper factor of len(s), it splits s into equal chunks of that size and checks that they are all equal
    l=[int(p) for p in sorted(b(len(s)))[:-1]]
    for i in l:
        if len(set(s[k*i:i*(k+1)] for k in range(len(s)//i)))==1:
            return True
    return False

def longest(s,threshold): # the main function: returns a list with the longest repeated sequence (or all of them if several tie), using threshold as the minimum length; returns False if none is found
    m=[]
    for i in range(len(s),threshold-1,-1):
        for k in range(len(s)-i+1):
            if isseq(s[k:k+i]):
                m.append([s[k:k+i],i,k,k+i-1])
        if len(m)>0:
            return m
    return False

Examples:

>>> s='ATTTTCCATGATGATGGST'
>>> longest(s,1)
[['ATGATGATG',9,7,15]]

>>> s='ATTTTCCATGATGATGGSTLWELWELWEGFRJGHIJH'
>>> longest(s,1)
[['ATGATGATG',9,7,15],['LWELWELWE',9,19,27]]


>>> longs='ATTTTCCATGATGATGGSTWGTKWKWKWKWKWKWKWKWKWKWKWFRGWLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERFGTFRGFTRUFGFGRFGRGBHJ'
>>> longest(longs,1)
[['LWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWER',64,48,111]]
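
The periodicity test that isseq performs above can also be sketched without NumPy, which may make the idea easier to follow. This is only an illustrative rewrite, not the answer's code: the name is_tandem_repeat is hypothetical, and it tries every block length that divides the string length and compares the string with that block repeated.

```python
def is_tandem_repeat(s):
    """Return True if s is one shorter block repeated two or more times."""
    n = len(s)
    # Only block lengths that divide n (and are shorter than n) can tile s exactly.
    for block in range(1, n // 2 + 1):
        if n % block == 0 and s == s[:block] * (n // block):
            return True
    return False


print(is_tandem_repeat("ATGATGATG"))  # True  ('ATG' repeated 3 times)
print(is_tandem_repeat("TGATGATGG"))  # False (no block tiles the whole string)
```

A single character or an empty string is not treated as a repeat here, which matches isseq returning False for a string of length 1.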

Here is a solution:

def longest_repeat(seq,threshold):
    results = []
    longest = threshold
    
    # starting position
    for i in range(len(seq)):
        
        # pattern period
        for p in range(1,(len(seq)-i)//2+1):
            # skip unnecessary combinations
            if results != [] and results[-1][0] == i and results[-1][3] % p == 0: continue
            
            # max possible number of repetitions
            repetitions = len(seq)//p
            
            # position within the pattern's period
            for k in range(p):
                # get the max repetitions the k-th character in the period can support
                m = 1
                while i+k+m*p < len(seq) and seq[i+k] == seq[i+k+m*p]:
                    m += 1
                repetitions = min(m,repetitions)
                
                # check if we're already below the best result so far 
                if repetitions*p < longest:    break
            
            # save the result if it's good
            if repetitions > 1 and repetitions*p >= longest:
                # overwrite lesser results
                if repetitions*p > longest: results = []
                
                # store the current one (with ample information)
                results += [(i,seq[i:i+p],repetitions,repetitions*p)]
                longest = max(longest,repetitions*p)
    
    return results

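The logic is that you go through every starting position in the sequence (i), check every sensible pattern period (p), and for that combination check whether it yields a repeated substring at least as long as the best one found so far (or as the threshold, if no result has been found yet).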
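
To make the core step concrete, here is a minimal sketch of the question asked for a single (start, period) pair: how many back-to-back copies of the block begin at that position? The helper name count_repetitions is hypothetical. The answer's function reaches the same count character by character so that it can exit early; this version compares whole blocks for readability.

```python
def count_repetitions(seq, start, period):
    """Count consecutive copies of seq[start:start+period] beginning at start."""
    if period <= 0:
        raise ValueError("period must be positive")
    unit = seq[start:start + period]
    if len(unit) < period:
        return 0  # not enough characters left for even one full block
    reps = 0
    pos = start
    # Advance one block at a time while the next block equals the first one.
    while seq[pos:pos + period] == unit:
        reps += 1
        pos += period
    return reps


seq = 'ATTTTCCATGATGATG'
print(count_repetitions(seq, 7, 3))  # 3 -> 'ATG' three times, total length 9
print(count_repetitions(seq, 1, 1))  # 4 -> 'T' four times, total length 4
```

A pair counts as a repeat only when the returned value is at least 2, and its total length is the count times the period.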

The result is a list of tuples of the form (starting index, period string, repetitions, total length). Running the example

import time

threshold = 5
seq = 'ATTTCCATGATGATG'

t = time.time()
results = longest_repeat(seq,threshold)
print("execution time :",time.time()-t)

for t in results:
    print(t)

we get

execution time : 0.00010848045349121094
(6,'ATG',3,9)

From there, getting the full matched string is simple (just do period_string * repetitions).
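
As a small sketch of that last step, the helper below (the name expand_result is hypothetical) turns one result tuple into the full matched string together with its start index and inclusive end index, which is the coordinate style the question asks for.

```python
def expand_result(result):
    """Convert (start, period_string, repetitions, total_length)
    into (matched_string, start, inclusive_end)."""
    start, period_string, repetitions, total_length = result
    matched = period_string * repetitions
    end = start + total_length - 1  # inclusive end index
    return matched, start, end


print(expand_result((6, 'ATG', 3, 9)))  # ('ATGATGATG', 6, 14)
```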

For a random input of 700 characters, execution time is about 6.8 seconds, compared with about 20.2 seconds using @IoaTzimas's answer.
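
A timing comparison like the one above could be reproduced with a small harness along these lines. This is a sketch under stated assumptions: the answer does not say how its random input was generated, so the four-letter DNA alphabet, the fixed seed and the helper name time_on_random_input are all hypothetical choices, and the numbers you get will depend on your machine.

```python
import random
import time


def time_on_random_input(func, length=700, alphabet="ACGT", seed=0):
    """Run func on a reproducible random string; return (seconds, result)."""
    rng = random.Random(seed)  # fixed seed so every function sees the same input
    seq = "".join(rng.choice(alphabet) for _ in range(length))
    start = time.perf_counter()
    result = func(seq)
    elapsed = time.perf_counter() - start
    return elapsed, result


# Stand-in call so the block runs on its own; in practice pass something like
# lambda s: longest_repeat(s, 5) or lambda s: longest(s, 5).
elapsed, result = time_on_random_input(len)
print(result)  # 700
```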