What is an efficient way to compare two large data sets using Java?

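I am building a Java-based utility for the following requirements:

Fetch 500,000 records from the legacy database using an SQL query.
Fetch 500,000 records from the modernized database using a web service that returns one record per call.
Compare the two data sets.
Generate a comparison report.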

I built the utility using the following approach:

Connection con = establishLegacyDBConnection();  // Connect to the legacy database over JDBC
ResultSet resultset = executeLegacyQuery(con, query);
while (resultset.next()) {
     // Compare one record at a time: each result set row is checked against the web service response
     1. Read the record into map1.
     2. GET the response from the web service based on a value in map1.
     3. Capture the response, parse the JSON, and read the required fields into map2 (both maps have the same set of keys).
     4. For each key in map1, compare the map1 value with the map2 value.
     5. If there are mismatches, write the key-value information to a flat file, "Non-Matching.txt".
     6. If all key-values match, write a message to a flat file, "Matching.txt".
}

The concern now is that this program takes tens of hours to finish. Is there a better approach that would improve performance?

Solution

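Looking at the pseudocode for your current approach, what strikes me is that sending requests to the web server one at a time is probably the bottleneck. If so, you could try this: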

Connection con = establishLegacyDBConnection();  // Connect to the legacy database over JDBC
ResultSet resultset = executeLegacyQuery(con, query);
while (resultset.next()) {
     // For each legacy record, hand the comparison off to a worker thread
     1. Read the record into map1.
     2. Submit a task to an ExecutorService with a bounded work queue and a bounded thread pool.
}

Each task does this:
     1. GET the response from the web service based on a value in map1.
     2. Capture the response, parse the JSON, and read the required fields into map2 (both maps have the same set of keys).
     3. For each key in map1, compare the map1 value with the map2 value.
     4. If there are mismatches, write the key-value information to a flat file, "Non-Matching.txt".
     5. If all key-values match, write a message to a flat file, "Matching.txt".

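In other words, process the records from the legacy query in parallel.

The trick is in the tuning:

Size the thread pool so that the application stays busy, but not so large that it floods the web server with simultaneous requests.
Size the work queue so that the worker threads never run out of work, but the queue does not use too much memory.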
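
As a rough illustration of the bounded pool and bounded queue described above, here is a minimal sketch. The class name, the pool size of 8, and the queue capacity of 100 are assumptions chosen for the example, and the task body is only a placeholder for the real "call the web service and compare" step. It uses `ThreadPoolExecutor.CallerRunsPolicy`, which is one way (not the only one) to slow down the producer when the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedComparePool {

    // Builds a fixed-size pool with a bounded queue. When the queue is full,
    // CallerRunsPolicy makes the submitting thread run the task itself,
    // which throttles the producer instead of buffering without limit.
    static ThreadPoolExecutor newBoundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Submits one task per record, waits for the pool to drain,
    // and returns the number of tasks that completed.
    static int runAll(int recordCount, int threads, int queueCapacity) {
        ThreadPoolExecutor pool = newBoundedPool(threads, queueCapacity);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < recordCount; i++) {
            // Placeholder for "GET from the web service and compare the maps".
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return done.get();
    }

    public static void main(String[] args) {
        // Hypothetical sizes; tune both numbers against the real web service.
        System.out.println(runAll(1000, 8, 100));
    }
}
```

In the real utility the loop over `recordCount` would be the loop over the legacy result set, and both sizes would be adjusted by measuring throughput and the load on the web server.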

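If the bottleneck is on the database side, you could consider parallelizing that side too; for example, by running multiple queries in parallel against "slices" of the legacy data set.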
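
One way to form such slices, assuming the legacy table has a numeric key with a known minimum and maximum (the question does not say so, and the column name `id` below is hypothetical), is to split the key range into contiguous sub-ranges and give each one its own query such as `WHERE id BETWEEN ? AND ?`. A minimal sketch of the range arithmetic only:

```java
import java.util.ArrayList;
import java.util.List;

public class KeyRangeSlicer {

    // Splits the inclusive range [minId, maxId] into at most `parts`
    // contiguous, non-overlapping slices. Each element is {from, to}.
    static List<long[]> slice(long minId, long maxId, int parts) {
        if (parts <= 0 || maxId < minId) {
            throw new IllegalArgumentException("invalid range or part count");
        }
        long total = maxId - minId + 1;
        long base = total / parts;   // minimum size of each slice
        long extra = total % parts;  // the first `extra` slices get one more key
        List<long[]> slices = new ArrayList<>();
        long from = minId;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < extra ? 1 : 0);
            if (size == 0) {
                continue; // more parts than keys: skip empty slices
            }
            long to = from + size - 1;
            slices.add(new long[] {from, to});
            from = to + 1;
        }
        return slices;
    }

    public static void main(String[] args) {
        // Each printed range would parameterize one query run on its own connection.
        for (long[] s : slice(1, 500_000, 4)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

If the keys are sparse or skewed, equal-width ranges will not hold equal numbers of rows, so this is only a starting point.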

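I would also suggest separating the concerns first.

One utility/class that fetches the data from the legacy database and the modernized database
A second utility/class that compares the data
A third utility that writes the data to the flat files

This way you get separation of concerns, and each concern/task can run in a different thread or in multiple threads.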
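
To illustrate the split, the comparison step can be isolated as a pure function that knows nothing about JDBC, HTTP, or files. This is only a sketch: the class and method names are hypothetical, and it assumes both records have already been read into `Map<String, String>` with the same keys, as in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

public class RecordComparator {

    // Returns one entry per key of `legacy` whose value differs in `modern`,
    // formatted as "legacyValue != modernValue". An empty map means a match.
    static Map<String, String> diff(Map<String, String> legacy,
                                    Map<String, String> modern) {
        Map<String, String> mismatches = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : legacy.entrySet()) {
            String other = modern.get(e.getKey());
            // Objects.equals treats two nulls as equal and avoids NPEs.
            if (!Objects.equals(e.getValue(), other)) {
                mismatches.put(e.getKey(), e.getValue() + " != " + other);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        Map<String, String> legacy = Map.of("name", "Ann", "city", "Oslo");
        Map<String, String> modern = Map.of("name", "Ann", "city", "Bergen");
        System.out.println(diff(legacy, modern)); // one mismatch, for "city"
    }
}
```

Because the function has no side effects, it can be called from any worker thread, while the fetching and file-writing classes deal with their own resources.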

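Another issue I can see is that your modernized database returns one record at a time, so making 500K GET requests will slow your application down. It would be better if a single GET request to the modernized database could return multiple records; that would minimize your network calls.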
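
If the web service can be given such a bulk endpoint (the question does not say it has one, so this is an assumption), the client side mainly needs to group the lookup keys into batches. A minimal sketch with a hypothetical helper and an arbitrary batch size:

```java
import java.util.ArrayList;
import java.util.List;

public class KeyBatcher {

    // Splits the keys into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> batches(List<T> keys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            // Copy the view so each batch is independent of the source list.
            result.add(new ArrayList<>(keys.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> keys = List.of(101, 102, 103, 104, 105);
        // Each batch would become one request to the (hypothetical) bulk endpoint.
        for (List<Integer> batch : batches(keys, 2)) {
            System.out.println(batch);
        }
    }
}
```

With a batch size of 100, for example, 500,000 records would need 5,000 requests instead of 500,000.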

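Another issue I can see is that you are doing all of this in a single thread. Create a thread pool executor with multiple threads and submit the comparison jobs to it. That way your comparison logic runs in parallel (several comparisons at the same time), which reduces the execution time.

(The maximum number of threads you can keep in the thread pool depends on the number of CPUs and CPU cores your machine has.)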
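
As a starting point for choosing that number, the core count can be read at runtime. The sketch below also includes a widely used rule of thumb for tasks that spend most of their time waiting on I/O, such as web service calls: cores × (1 + wait time / compute time). The class name and the ratio of 9 are assumptions for illustration; the real ratio has to be measured, and the capacity of the web service may impose a lower limit.

```java
public class PoolSizing {

    // For CPU-bound work, one thread per available core is a common starting point.
    static int cpuBoundThreads() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Rule of thumb for I/O-heavy tasks: cores * (1 + waitTime / computeTime).
    static int ioBoundThreads(int cores, double waitToComputeRatio) {
        return Math.max(1, (int) Math.round(cores * (1 + waitToComputeRatio)));
    }

    public static void main(String[] args) {
        int cores = cpuBoundThreads();
        System.out.println("cores = " + cores);
        // Assumed ratio: each task waits 9x longer on the network than it computes.
        System.out.println("suggested threads = " + ioBoundThreads(cores, 9.0));
    }
}
```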

Give it a try, and good luck.