
SAX parsing of a 1.2-terabyte file shows a slowing trend


I would like to parse the OSM planet file (planet-200309.xml, ~1.26 TB).

To estimate the minimum time this takes in Java, I wrote a small SAX parsing application:

    final File f = new File("f:/planet-200309.xml");
    SAXParserFactory newInstance = SAXParserFactory.newInstance();
    final long start = System.currentTimeMillis();
    // Wraps the file stream so the number of bytes consumed so far can be queried
    final CountingInputStream cif = new CountingInputStream(new FileInputStream(f));
    // Reporter thread: every 10 seconds, extrapolate the total run time
    // from the average throughput since the start
    Thread t = new Thread() {
        public void run() {
            try {
                while (cif.available() > 0) {
                    Thread.sleep(10000);
                    long stop = System.currentTimeMillis();
                    long seconds = (stop - start) / 1000;
                    long bytesRead = cif.getBytesRead();
                    // Average rate over the whole run so far, not just the last 10 seconds
                    float bytePerSecond = bytesRead / seconds;
                    int expectedSeconds = (int) (f.length() / bytePerSecond);
                    System.out.println("Expected minutes: " + expectedSeconds / 60 + ",bytes per second:" + (int) bytePerSecond + " (reat: " + bytesRead
                            + ",took: " + seconds + ")");
                }
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    };
    t.start();

    // Parse on the main thread; the handler prints every element name except "changeset" and "tag"
    newInstance.newSAXParser().parse(cif, new DefaultHandler() {
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            if (!qName.equals("changeset") && !qName.equals("tag")) {
                System.out.println(qName);
            }
        }
    });

    // Total wall-clock time in seconds
    long stop = System.currentTimeMillis();
    long took = stop - start;
    System.out.println(took / 1000);
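
The snippet does not show where `CountingInputStream` and its `getBytesRead()` method come from. As a minimal sketch, assuming it is a small custom wrapper rather than a library class, something like the following would provide that method. The class body here is hypothetical and only illustrates the idea of counting bytes as they pass through the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the CountingInputStream used in the question.
public class CountingInputStream extends FilterInputStream {
    // AtomicLong so the reporter thread sees up-to-date values
    private final AtomicLong bytesRead = new AtomicLong();

    public CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            bytesRead.incrementAndGet();
        }
        return b;
    }

    // FilterInputStream.read(byte[]) delegates here, so bytes are counted once
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead.addAndGet(n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        if (skipped > 0) {
            bytesRead.addAndGet(skipped);
        }
        return skipped;
    }

    public long getBytesRead() {
        return bytesRead.get();
    }

    // Small demonstration on an in-memory stream
    public static void main(String[] args) throws IOException {
        CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(new byte[100]));
        in.read(new byte[10]);
        System.out.println(in.getBytesRead());
    }
}
```

Note that the count reflects bytes handed to the parser's internal buffer, which can run slightly ahead of what the parser has actually processed.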

I decided to recompute the estimate every 10 seconds. This is my output:

Expected minutes: 174,bytes per second:122728576 (reat: 1227285768,took: 10)
Expected minutes: 173,bytes per second:123918400 (reat: 2478367973,took: 20)
Expected minutes: 172,bytes per second:124213368 (reat: 3726401103,took: 30)
Expected minutes: 175,bytes per second:122289280 (reat: 4891571271,took: 40)
Expected minutes: 186,bytes per second:115111152 (reat: 5755557747,took: 50)
Expected minutes: 197,bytes per second:108455448 (reat: 6507327092,took: 60)
Expected minutes: 212,bytes per second:100975920 (reat: 7068314710,took: 70)
Expected minutes: 224,bytes per second:95568256 (reat: 7645460576,took: 80)
Expected minutes: 236,bytes per second:90757120 (reat: 8168140838,took: 90)
Expected minutes: 237,bytes per second:90276424 (reat: 9027642303,took: 100)
Expected minutes: 240,bytes per second:89072968 (reat: 9798026674,took: 110)
Expected minutes: 250,bytes per second:85678456 (reat: 10281415054,took: 120)
Expected minutes: 257,bytes per second:83305712 (reat: 10829743052,took: 130)
Expected minutes: 256,bytes per second:83664016 (reat: 11712962690,took: 140)
Expected minutes: 250,bytes per second:85531576 (reat: 12829736785,took: 150)

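After the first 10 seconds, the estimate was about 3 hours. Two minutes later, it was about 4 hours.

Interestingly, there seems to be a trend of the SAX parsing slowing down.

I put these values into a chart: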
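
The logged rate is an average since the start, which smooths out changes between samples. A minimal sketch that recomputes the rate for each 10-second interval from the cumulative byte counts in the log above; it assumes the samples are exactly 10 seconds apart (the log only shows whole seconds), and the class and method names are illustrative.

```java
public class IntervalThroughput {
    // Cumulative bytes read, copied from the log output (one sample per 10 s)
    static final long[] BYTES_READ = {
        1227285768L, 2478367973L, 3726401103L, 4891571271L, 5755557747L,
        6507327092L, 7068314710L, 7645460576L, 8168140838L, 9027642303L,
        9798026674L, 10281415054L, 10829743052L, 11712962690L, 12829736785L
    };
    static final int INTERVAL_SECONDS = 10;

    // Bytes per second within sample i's interval only (not averaged since start)
    static long intervalRate(int i) {
        long previous = (i == 0) ? 0L : BYTES_READ[i - 1];
        return (BYTES_READ[i] - previous) / INTERVAL_SECONDS;
    }

    // Bytes per second averaged over the whole run up to sample i
    static long cumulativeRate(int i) {
        return BYTES_READ[i] / ((long) (i + 1) * INTERVAL_SECONDS);
    }

    public static void main(String[] args) {
        for (int i = 0; i < BYTES_READ.length; i++) {
            System.out.println((i + 1) * INTERVAL_SECONDS + " s: interval "
                    + intervalRate(i) + " B/s, cumulative " + cumulativeRate(i) + " B/s");
        }
    }
}
```

Under that assumption, the per-interval rate falls from about 122 MB/s in the first interval to about 56 MB/s between 60 and 70 seconds, then varies rather than declining steadily, so the cumulative figure alone understates how uneven the throughput is.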

[Chart: bytes read per 10 seconds]

What is going wrong here? Why does the rate keep slowing down?
