java - JVM crash caused by locking an NFS file followed by a network outage

The following code snippet crashes the JVM if a network outage occurs after the lock has been acquired:

while (true) {

       //needs java.io.RandomAccessFile and java.nio.channels.FileLock
       //file shared over nfs
       String filename = "/home/amit/mount/lock/aLock.txt";
       RandomAccessFile file = new RandomAccessFile(filename,"rws");
       System.out.println("file opened");
       FileLock fileLock = file.getChannel().tryLock();
       if (fileLock != null) {
          System.out.println("lock acquired");
       } else {
          System.out.println("lock not acquired");
       }

       try {
          //wait for 30 sec
          Thread.sleep(30000);
       } catch (InterruptedException e) {
          e.printStackTrace();
       }
       System.out.println("closing filelock");
       if (fileLock != null) {
          //tryLock() returns null when the lock is held elsewhere
          fileLock.close();
       }
       System.out.println("closing file");
       file.close();
    }

Observation: the JVM receives a KILL (9) signal and exits with exit code 137 (128 + 9).
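
The arithmetic behind that exit code can be sketched as follows. This is a minimal illustration that assumes the common shell convention of reporting a signal death as 128 plus the signal number; `ExitCodes` and `signalFromExitCode` are hypothetical names, not part of any library. A process could also call exit(137) on its own, so the result is a hint rather than proof.

```java
public class ExitCodes {
    // Returns the signal number encoded in a shell-style exit status,
    // or -1 if the status does not fit the 128 + signal convention.
    static int signalFromExitCode(int exitCode) {
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    public static void main(String[] args) {
        // 137 - 128 = 9, the number of the KILL signal
        System.out.println(signalFromExitCode(137));
    }
}
```

A supervisor script or parent process in the cluster could apply the same check to tell a signal death apart from an ordinary error exit.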

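Probably something goes wrong in the file descriptor table after the network connection is re-established.
The behavior can be reproduced with the flock(2) system call and the flock(1) shell utility.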

Any suggestions/workarounds?

PS: using Oracle JDK 1.7.0_25 with NFSv4

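Edit:
This lock will be used to identify which process is active in a distributed high-availability cluster.
The exit code is 137.
What do I expect?
A way to detect the problem, close the file, and try to re-acquire the lock.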

Solution

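After the NFS server reboots, all clients that hold any active file locks start the lock reclamation procedure, which lasts no longer than the so-called "grace period" (just a constant). If the reclamation procedure fails during the grace period, the NFS client (usually a kernel-space beast) sends SIGUSR1 to the process that was unable to recover its locks. That is the root of your problem.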

When the lock succeeds on the server side, rpc.lockd on the client system requests another daemon, rpc.statd, to monitor the NFS server that implements the lock. If the server fails and then recovers, rpc.statd will be informed. It then tries to reestablish all active locks. If the NFS server fails and recovers, and rpc.lockd is unable to reestablish a lock, it sends a signal (SIGUSR1) to the process that requested the lock.

http://menehune.opt.wfu.edu/Kokua/More_SGI/007-2478-010/sgi_html/ch07.html

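You may be wondering how to avoid this. Well, there are a few ways, but none of them is ideal:

1. Increase the grace period. AFAIR, on Linux it can be changed via /proc/fs/nfsd/nfsv4leasetime.
2. Create a SIGUSR1 handler in your code and do something clever there. For example, in the signal handler you can set a flag indicating that lock recovery has failed. If this flag is set, your program can wait for the NFS server to become ready (for as long as it takes) and then try to recover the lock itself. Not very effective...
3. Don't use NFS locking any more. If possible, switch to ZooKeeper, as was suggested earlier.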
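
The flag-based handler idea could look roughly like the sketch below. It rests on several assumptions: `sun.misc.Signal` is an unsupported internal JDK API and a given JVM may refuse to hand over USR1; `NfsLockGuard`, `installHandler`, `acquire` and `reacquire` are hypothetical names; and the demo locks a temporary file rather than the real NFS path. Note also that `FileLock.isValid()` only reflects the JVM's own view of the lock, and that a KILL (9) signal, as reported in the question, cannot be caught by any handler.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.util.concurrent.atomic.AtomicBoolean;
import sun.misc.Signal;          // unsupported internal API (assumption)
import sun.misc.SignalHandler;

public class NfsLockGuard {
    // Set by the signal handler, polled by the code that owns the lock.
    static final AtomicBoolean lockLost = new AtomicBoolean(false);

    // Tries to register a SIGUSR1 handler that only sets the flag.
    // Returns false if this JVM or platform refuses the registration.
    static boolean installHandler() {
        try {
            Signal.handle(new Signal("USR1"), new SignalHandler() {
                @Override
                public void handle(Signal sig) {
                    lockLost.set(true);
                }
            });
            return true;
        } catch (IllegalArgumentException e) {
            return false; // unknown signal, or reserved by the JVM
        }
    }

    // Opens the file and tries to lock it, up to maxAttempts times.
    // Returns the held lock, or null if every attempt failed.
    static FileLock acquire(String path, int maxAttempts, long pauseMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            RandomAccessFile file = null;
            try {
                file = new RandomAccessFile(path, "rws");
                FileLock lock = file.getChannel().tryLock();
                if (lock != null) {
                    return lock;
                }
            } catch (IOException | OverlappingFileLockException e) {
                // The share may still be unreachable; fall through and retry.
            }
            closeQuietly(file);
            Thread.sleep(pauseMillis);
        }
        return null;
    }

    // Drops a possibly stale lock, clears the flag, and takes a fresh lock.
    static FileLock reacquire(FileLock stale, String path, int maxAttempts,
                              long pauseMillis) throws InterruptedException {
        if (stale != null) {
            closeQuietly(stale.channel()); // closing the channel releases the lock
        }
        lockLost.set(false);
        return acquire(path, maxAttempts, pauseMillis);
    }

    static void closeQuietly(Closeable c) {
        if (c != null) {
            try { c.close(); } catch (IOException ignored) { }
        }
    }

    public static void main(String[] args) throws Exception {
        File demo = File.createTempFile("aLock", ".txt"); // stand-in for the NFS file
        demo.deleteOnExit();
        System.out.println("handler installed: " + installHandler());
        FileLock lock = acquire(demo.getPath(), 3, 100);
        if (lockLost.get()) { // would be set asynchronously by SIGUSR1
            lock = reacquire(lock, demo.getPath(), 3, 100);
        }
        System.out.println("lock held: " + (lock != null && lock.isValid()));
        if (lock != null) {
            closeQuietly(lock.channel());
        }
    }
}
```

In a long-running process the flag check would sit inside the main loop, so that a lost lock leads to closing the file and re-acquiring instead of carrying on as the active node.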
