Slow identification of duplicates using LINQ

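I have two lists of users: BigList contains 20,000 user names with email addresses, and SmallList contains 1,500 user names. BigList contains duplicate users, meaning entries that share the same email address; the user names themselves are unique. SmallList contains unique user names. For each duplicated user in BigList (as determined by email address), I need to return the shortest user name, provided that user also exists in SmallList.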

I have solved this with LINQ, but the query takes up to 30 seconds, which is too slow. (The query itself is missing from this copy of the post.)

What can be done to improve the performance of this query? Thanks!

Solution

List<T>.Contains is O(n), so consider using a collection with faster lookups for the list you call Contains on.
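As a minimal sketch of that suggestion: List<T>.Contains scans the list element by element, while HashSet<T>.Contains is a hash lookup. The sample names and the variables smallList, candidates and smallSet below are made up for illustration, and the sketch assumes a compiler that supports top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for SmallList and for the candidate user names.
List<string> smallList = new List<string> { "al", "bob", "carol" };
List<string> candidates = new List<string> { "al", "zed", "carol", "dave" };

// Build the set once; each Contains call is then a hash lookup
// instead of a linear scan over the list.
HashSet<string> smallSet = new HashSet<string>(smallList);

List<string> matches = candidates
    .Where(userName => smallSet.Contains(userName))
    .ToList();

Console.WriteLine(string.Join(", ", matches)); // al, carol
```

With 1,500 names in SmallList, the set is built once and then reused for every lookup, so the cost of the filter no longer grows with the size of SmallList.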

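Why sort 20,000 users if you are going to discard 18,500 of them? Wouldn't it be more efficient to first select the items you need, and only then order them by ascending userName length?

First, I convert BigList into groups of users that have the same email address. From all elements of each group, I keep only the shortest user name. Apparently you are not interested in the email address in the final result.

From the remaining user names, I keep only those that are also in SmallList.

I use the overload of Enumerable.GroupBy that has a resultSelector parameter, so I can shape the result of each group.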

var result = BigList.GroupBy(

    // keySelector: make groups of users with the same EmailAddress:
    user => user.EmailAddress,

    // resultSelector: from each emailAddress and all users that have this emailAddress,
    // make one new object: the shortest UserName
    (emailAddress, usersWithThisEmailAddress) => usersWithThisEmailAddress
        .Select(user => user.UserName)
        .OrderBy(userName => userName.Length)
        .FirstOrDefault())

    // You don't want to keep all UserNames, keep only those that are also in SmallList:
    .Where(userName => SmallList.Contains(userName));
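To see what this query returns, here is a small self-contained sketch. The sample data is invented, (UserName, EmailAddress) tuples stand in for the user type (which the question does not show), and smallSet is a hypothetical HashSet<string> playing the role of SmallList. Because the groups produced by GroupBy are never empty, the sketch uses First() where the query above uses FirstOrDefault(). It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: tuples stand in for the real user type.
var bigList = new List<(string UserName, string EmailAddress)>
{
    ("alexander", "a@example.com"),
    ("alex", "a@example.com"),
    ("bobby", "b@example.com"),
    ("bob", "b@example.com"),
    ("carol", "c@example.com"),
};
var smallSet = new HashSet<string> { "alex", "carol", "bobby" };

List<string> result = bigList
    .GroupBy(
        // keySelector: one group per email address
        user => user.EmailAddress,
        // resultSelector: keep only the shortest user name of each group
        (emailAddress, usersWithThisEmailAddress) => usersWithThisEmailAddress
            .Select(user => user.UserName)
            .OrderBy(userName => userName.Length)
            .First())
    // keep only the names that also occur in the small collection
    .Where(userName => smallSet.Contains(userName))
    .ToList();

Console.WriteLine(string.Join(", ", result)); // alex, carol
```

Note that "bobby" does not appear in the output even though it is in smallSet: the shortest name for b@example.com is "bob", and "bob" is not in the small collection. Whether that is the desired behaviour depends on how the requirement in the question is read.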

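To get the shortest user name in each group, you can order the names by ascending length and take the first one. But if you only use the first element of the sorted sequence, why bother sorting the second, the third, and the fifty-fourth?

The little-known method Enumerable.Aggregate lets you enumerate the sequence only once: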

(emailAddress, usersWithThisEmailAddress) => usersWithThisEmailAddress
    .Select(user => user.UserName)
    .Aggregate((shortestUserName, nextUserName) =>
        (nextUserName.Length < shortestUserName.Length) ? nextUserName : shortestUserName);
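The Aggregate call can also be tried on its own. The sketch below uses an invented array of names and assumes top-level statements (C# 9 or later). It also shows one thing to keep in mind: the seedless overload of Aggregate throws InvalidOperationException on an empty sequence. That is not a problem inside GroupBy, because a group always has at least one element.

```csharp
using System;
using System.Linq;

// Hypothetical sample names.
string[] userNames = { "alexander", "alex", "al", "bo" };

// One pass over the sequence; the strict '<' keeps the earlier name on ties.
string shortest = userNames.Aggregate(
    (shortestUserName, nextUserName) =>
        nextUserName.Length < shortestUserName.Length ? nextUserName : shortestUserName);

Console.WriteLine(shortest); // al

// The overload without a seed throws when the sequence is empty.
bool threwOnEmpty = false;
try
{
    Array.Empty<string>().Aggregate((a, b) => a);
}
catch (InvalidOperationException)
{
    threwOnEmpty = true;
}
Console.WriteLine(threwOnEmpty); // True
```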

Aggregate does the equivalent of the following:

IEnumerable<string> userNames = ...
string shortestUserName = userNames.First();
foreach (string nextUserName in userNames.Skip(1))
{
    shortestUserName = (nextUserName.Length < shortestUserName.Length) ?
        nextUserName : shortestUserName;
}
return shortestUserName;

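In fact, Aggregate is even more efficient than that, because it works directly with GetEnumerator and MoveNext. This requires some knowledge of how enumeration works at the lowest level. Don't worry if you are not familiar with it: you rarely need it, usually only when you want to squeeze out more performance: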

IEnumerable<string> userNames = ...
using (IEnumerator<string> enumerator = userNames.GetEnumerator())
{
    if (enumerator.MoveNext())
    {
        // There is at least one user name in the sequence; it is the shortest so far.
        string shortestUserName = enumerator.Current;

        // While there are more user names, check whether the next one is shorter:
        while (enumerator.MoveNext())
        {
            // There is a next user name. Is it shorter?
            shortestUserName = (enumerator.Current.Length < shortestUserName.Length) ?
                enumerator.Current : shortestUserName;
        }
    }
    // else: there are no elements at all, decide what to do.
}

If you want to squeeze out one last optimization, only assign when the next name is actually shorter:

while (enumerator.MoveNext())
{
    if (enumerator.Current.Length < shortestUserName.Length)
    {
        shortestUserName = enumerator.Current;
    }
}
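Putting the enumerator loop into a reusable method could look like the sketch below. The name TryGetShortest and its Try-pattern signature are hypothetical choices made for this example, not part of LINQ; the empty-sequence case that the code above leaves open is reported through the return value. It assumes C# 9 or later (top-level statements and a using declaration).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns false when the sequence is empty,
// otherwise true with the shortest string in 'shortest'.
static bool TryGetShortest(IEnumerable<string> userNames, out string shortest)
{
    shortest = string.Empty;

    // The using declaration disposes the enumerator when the method returns.
    using IEnumerator<string> enumerator = userNames.GetEnumerator();
    if (!enumerator.MoveNext())
    {
        return false; // no elements at all
    }

    shortest = enumerator.Current;
    while (enumerator.MoveNext())
    {
        if (enumerator.Current.Length < shortest.Length)
        {
            shortest = enumerator.Current;
        }
    }
    return true;
}

bool found = TryGetShortest(new[] { "bobby", "bob", "robert" }, out string name);
Console.WriteLine($"{found}: {name}"); // True: bob

bool foundInEmpty = TryGetShortest(Array.Empty<string>(), out string none);
Console.WriteLine(foundInEmpty); // False
```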
