PHP - Can my script for fetching filenames and finding new files be faster?

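I have FTP access to a single directory that holds all images for all of a supplier's products. Each product has multiple images: size variations of the product and variations in how it is displayed.

There is no "list" (XML, CSV, database...) that would tell me what is new. The only way I see for now is to fetch all filenames and compare them with the filenames in my database.

The last check counted 998,283 files in that directory. Each product has several variations, and there is no documentation on how they are named.

I did an initial fetch of the filenames, matched them against my products, and saved each filename and its date modified (taken from the file) in an "images" database table.

The next step is to check for "new ones".

What I am doing now is: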

// get the file list
foreach ($this->getFilenamesFromFtp() as $key => $image_data) {
  // Extract data from the filename (product code, size, variation number, extension...)
  // so it can be stored in the table and used later as a reference
  // (i.e. I want to use only the large images of a variation, not all sizes)
  $data = self::extractDataFromImage($image_data);
  // If the filename already exists in the DB images table, do nothing;
  // otherwise continue with the insertion
  if (!$this->checkForFilenameInDb($data['filename'])) {
    $export_codes = $this->export->getProductIds();
    // check if the product code is in the export table, i.e. do we really need this image
    if ($this->functions->in_array_r($data['product_code'], $export_codes)) {
      self::insertimageDataInDb($data);
    } // end if
  } // end if filename is not yet in DB
} // end foreach

My getFilenamesFromFtp() method looks like this:

$filenames = array();
$i = 1;
$ftp = $this->getFtpConfiguration();

// set up basic connection
$conn_id = ftp_ssl_connect($ftp['host']);

// login with username and password
$login_result = ftp_login($conn_id, $ftp['username'], $ftp['pass']);

ftp_set_option($conn_id, FTP_USEPASVADDRESS, false);
$mode = ftp_pasv($conn_id, true);
ftp_set_option($conn_id, FTP_TIMEOUT_SEC, 180);

// Login OK?
if ((!$conn_id) || (!$login_result) || (!$mode)) {
  die("FTP connection has Failed !");
}
else {
  // get all filenames and store them in an array
  $files = ftp_nlist($conn_id, ".");
  // count the number of files in the array = the number of files on FTP
  $nofiles = count($files);
  foreach ($files as $filename) {
    // limit used while developing or testing; in production (current mode) it has to run without a limit
    if (self::LIMIT > 0 && $i == self::LIMIT) {
      break;
    }
    else {
      // get the date modified from the file
      $date_modified = ftp_mdtm($conn_id, $filename);

      // build an array of filename and date modified so it can be returned and stored in the DB
      $filenames[] = array(
        "filename" => $filename,
        "date_modified" => $date_modified
      );
    } // end if LIMIT
    $i++;
  } // end foreach
  // close the connection
  ftp_close($conn_id);
  return $filenames;
}

The problem is that the script takes a very long time. The longest phase I have detected so far is building the array in getFilenamesFromFtp():

      $filenames[]= array(
         "filename" => $filename,"date_modified" => $date_modified
      );

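So far this part has been running for 4 hours and still has not finished.

While writing this, I had an idea: drop "date modified" from the initial pass and fetch it only when I am actually going to store the image in the database.

I will update this question as soon as I have made this change and tested it :)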
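
The idea of fetching "date modified" only for images that will actually be stored could look roughly like the sketch below. ftp_mdtm() sends one request to the server per file, so skipping it for known filenames removes most of those round trips. The function name selectNewImages() and its parameters are hypothetical, and the lookup is passed in as a callable so that the example runs without a live FTP connection.

```php
<?php
// Sketch only: selectNewImages() and its parameters are hypothetical names.

/**
 * Return rows for filenames that are not yet known, looking up the
 * modification time only for those.
 *
 * @param iterable $filenames      names from the listing (e.g. ftp_nlist())
 * @param array    $knownFilenames filename => anything, loaded once from the DB
 * @param callable $fetchMtime     function(string): int, e.g. wrapping ftp_mdtm()
 */
function selectNewImages(iterable $filenames, array $knownFilenames, callable $fetchMtime): array
{
    $new = [];
    foreach ($filenames as $filename) {
        if (isset($knownFilenames[$filename])) {
            continue; // already stored: no per-file lookup needed
        }
        $new[] = [
            'filename'      => $filename,
            'date_modified' => $fetchMtime($filename),
        ];
    }
    return $new;
}

// Usage with stand-in data instead of a live FTP connection.
$known = array_flip(['a_1.jpg', 'a_2.jpg']); // keys are the known filenames
$calls = 0;
$fakeMtime = function (string $name) use (&$calls): int {
    $calls++;          // count how many lookups were needed
    return 1700000000; // placeholder timestamp
};
$newImages = selectNewImages(['a_1.jpg', 'a_2.jpg', 'b_1.jpg'], $known, $fakeMtime);
```

With a real connection, the callable could be something like function ($name) use ($conn_id) { return ftp_mdtm($conn_id, $name); }, and $known would be filled from the "images" table with a single query instead of one query per file.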

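Solution

Processing a million filenames takes time; however, I see no reason to store those filenames (and date_modified) in an array. Why not process each filename directly?

Also, instead of fully processing the filenames right away, why not store them in a database table first? The real processing can happen afterwards. Splitting the task in two, retrieval and processing, makes it more flexible. For example, if you want to change how the processing works, you do not need to do a new retrieval.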
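
A minimal sketch of that two-step split, assuming a staging table: the retrieval step only writes raw filenames, and the processing step later finds the ones missing from the images table with a single query. The table name ftp_files and both function names are hypothetical, and an in-memory SQLite database (which needs the pdo_sqlite extension) stands in for the real one.

```php
<?php
// Sketch only: table and function names are hypothetical; SQLite in memory
// stands in for the real database.

/** Retrieval step: replace the staging table contents with the current listing. */
function stageFilenames(PDO $db, iterable $filenames): void
{
    $db->beginTransaction(); // one transaction keeps many inserts fast
    $db->exec('DELETE FROM ftp_files');
    $insert = $db->prepare('INSERT INTO ftp_files (filename) VALUES (?)');
    foreach ($filenames as $filename) {
        $insert->execute([$filename]);
    }
    $db->commit();
}

/** Processing step: staged filenames that have no row in images yet. */
function findUnprocessed(PDO $db): array
{
    $sql = 'SELECT f.filename
            FROM ftp_files f
            LEFT JOIN images i ON i.filename = f.filename
            WHERE i.filename IS NULL
            ORDER BY f.filename';
    return $db->query($sql)->fetchAll(PDO::FETCH_COLUMN);
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE images (filename TEXT PRIMARY KEY)');
$db->exec('CREATE TABLE ftp_files (filename TEXT PRIMARY KEY)');
$db->exec("INSERT INTO images (filename) VALUES ('a_1.jpg')");

stageFilenames($db, ['a_1.jpg', 'b_1.jpg', 'c_1.jpg']);
$unprocessed = findUnprocessed($db);
```

Because the comparison happens inside the database, the per-file existence check from the question is replaced by one set-based query, and the processing rules can change without listing the FTP directory again.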

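If the goal is just to show the new files on a web page:

You can store just the highest file creation/modification time in the database.
For the next batch, fetch that last modification time and compare it with the creation/modification time of every file. This keeps your application very lightweight. You can use filemtime for this.
Then take the highest filemtime of all files in the current iteration, store that highest value in the database, and repeat the steps above.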
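
The steps above can be sketched as a single pass that both selects newer files and computes the next value to store. The function name filesNewerThan() and the array keys are hypothetical; timestamps are plain Unix integers here, and where they come from (filemtime(), ftp_mdtm() or a directory listing) is left open.

```php
<?php
// Sketch only: filesNewerThan() is a hypothetical helper, not an existing API.

/**
 * @param iterable $files    rows of ['filename' => string, 'date_modified' => int]
 * @param int      $lastSeen highest modification time stored by the previous run
 * @return array   ['new' => string[], 'highest' => int]
 */
function filesNewerThan(iterable $files, int $lastSeen): array
{
    $new = [];
    $highest = $lastSeen;
    foreach ($files as $file) {
        if ($file['date_modified'] > $lastSeen) {
            $new[] = $file['filename']; // changed since the previous run
        }
        if ($file['date_modified'] > $highest) {
            $highest = $file['date_modified']; // value to store for the next run
        }
    }
    return ['new' => $new, 'highest' => $highest];
}

// Usage with stand-in data.
$files = [
    ['filename' => 'a_1.jpg', 'date_modified' => 100],
    ['filename' => 'b_1.jpg', 'date_modified' => 200],
    ['filename' => 'c_1.jpg', 'date_modified' => 300],
];
$result = filesNewerThan($files, 150);
```

Only one integer has to be persisted between runs. The trade-off is that a file uploaded with a modification time at or below the stored value would not be detected.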

Suggestion:

foreach ($this->getFilenamesFromFtp() as $key => $image_data) {

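If the snippet above fetches all filenames into an array, you can drop that strategy, as it consumes a lot of memory. Instead, read the files one by one using the directory functions mentioned in the answer, since they maintain an internal pointer on the handle and do not load all files at once. Of course, for nested directories you would need to make the referenced answer iterate recursively.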
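
The answer referred to is not reproduced here, so the following is only a guess at what it describes: a generator around opendir()/readdir() that yields one entry at a time instead of building a full array. The function name iterateDirectory() is hypothetical, and the demo runs against a local temporary directory; whether the same approach pays off over FTP depends on the stream wrapper and is not tested here.

```php
<?php
// Sketch only: iterateDirectory() is a hypothetical helper built on PHP's
// directory functions opendir(), readdir() and closedir().

function iterateDirectory(string $dir): Generator
{
    $handle = opendir($dir);
    if ($handle === false) {
        throw new RuntimeException("Cannot open directory: $dir");
    }
    try {
        // readdir() advances the handle's internal pointer on every call
        while (($entry = readdir($handle)) !== false) {
            if ($entry === '.' || $entry === '..') {
                continue;
            }
            yield $entry; // hand over one name at a time
        }
    } finally {
        closedir($handle);
    }
}

// Demo on a temporary local directory with two stand-in image files.
$dir = sys_get_temp_dir() . '/img_demo_' . uniqid();
mkdir($dir);
touch("$dir/a_1.jpg");
touch("$dir/b_1.jpg");

$seen = [];
foreach (iterateDirectory($dir) as $name) {
    $seen[] = $name; // real code would process each name here instead
}
sort($seen); // readdir() order is not guaranteed

// Clean up the demo directory.
unlink("$dir/a_1.jpg");
unlink("$dir/b_1.jpg");
rmdir($dir);
```

The calling foreach keeps the same shape as in the question, but only the current entry is held in memory at any moment.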