Finding the largest region in data

How can I find the largest region in a data series?

Suppose I have the following data:

import pandas as pd

data = [44.5,27.0,22.0,23.0,24.0,21.0,20.0,19.0,18.0,16.0,17.0,17.5,54.0,37.0,31.0,30.0,28.0,26.0,27.5,25.0,29.0,36.0,23.5,41.0,19.5,15.0,14.0,12.0,13.0,11.0,33.0,15.5,47.0,32.0,38.0,33.5,34.0,99.0,73.0,76.0,64.0,72.0,68.0,63.0,66.0,70.0,62.0,65.0,59.0,61.0,60.0,82.0,69.0,67.0,68.5,75.0,167.0,120.0,111.0,102.0,100.0,107.0,105.0,102.5,96.0,97.0,101.5,95.0,87.0,89.0,101.0,94.0,93.0,119.0,134.0,140.0,149.0,211.0,219.0,160.0,127.0,115.0,112.0,108.0,90.0,91.0,86.0,85.0,83.0,81.0,77.0,79.0,84.0,132.5,104.0,92.0,78.0,98.0,74.0,80.0,76.5,109.0,88.0,71.0,75.5,84.5,110.0,94.5,106.0,117.0,108.5,103.0,187.0,152.0,138.0,137.0,135.5,146.0,136.0,129.0,130.0,141.0,133.0,131.0,153.0,147.0,142.0,150.0,174.0,157.0,145.0,151.0,132.0,254.0,229.0,222.0,212.0,207.0,230.0,210.0,206.0,201.0,194.0,209.0,199.0,202.0,200.0,225.0,220.0,231.0,218.0,321.0,1018.0,588.0,491.0,456.0,441.0,477.0,427.0,411.0,375.0,377.0,422.0,368.0,359.0,342.0,400.0,373.0,355.0,358.0,363.0,387.0,357.0,350.0,336.0,328.0,348.0,316.0,301.0,305.0,313.0,599.0,535.0,504.0,498.5,485.0,536.0,505.0,468.5,455.0,470.0,516.0,464.0,452.5,436.0,430.0,519.0,473.0,451.0,433.0,495.0,431.0,437.0,467.0,424.0,372.0,452.0,1067.0,804.0,715.0,667.5,632.0,689.0,624.0,575.5,569.0,555.0,605.0,546.5,522.0,511.0,603.5,532.0,512.5,512.0,543.0,499.0,472.0,463.0,500.0,457.0,435.0,461.0,773.0,705.5,680.0,644.0,639.0,668.0,620.5,581.0,584.0,667.0,597.5,590.5,568.0,559.0,577.0,571.0,566.0,610.0,585.5,575.0,537.0,548.0,586.0,520.0,778.0,703.0,648.0,607.0,633.0,578.0,552.0,534.0,523.0,566.5,525.0,595.0,516.5,518.0,560.0,531.0,513.0,498.0,530.0,550.0,545.0,602.0,799.0,918.0,644.5,585.0,496.0,489.0,526.0,480.0,469.5,466.0,440.0,434.0,415.0,404.5,412.0,449.0,416.0,408.0,443.0,495.5,445.0,395.0,404.0,381.0,373.5,394.0,380.0,418.0,397.0,386.0,369.0,384.5,383.0,385.0,420.0,359.5,362.0,348.5,339.0,324.5,329.0,315.0,312.0,356.0,310.0,319.0,302.0,296.0,293.0,289.0,297.5,407.0,304.0,335.0,297.0,293.5,308.0,285.5,290.0,283.0,326.0,300.5,294.0,285.0,281.0,291.0,277.0,
306.0,292.0,280.5,279.0,292.5,365.0,303.0,287.0,298.0,274.0,303.5,282.0,275.0,271.0,273.0,280.0,276.0,299.0,295.0,288.0,439.0,379.0,378.0,353.0,396.0,384.0,366.0,406.0,389.0,388.5,378.5,460.5,443.5,524.5,506.0,503.0,508.0,571.5,687.5,739.5,1058.0,1998.0,1973.0,916.5,459.5,358.5,262.0,239.0,212.5,203.0,214.5,191.0,186.0,176.0,182.5,185.0,170.0,163.0,161.0,162.5,156.0,156.5,164.0,159.0,157.5,148.0,144.0,150.5,154.0,260.5,267.0,168.0,151.5,113.0,121.0,72.5,126.0,52.0,48.0,46.0,44.0,43.0,45.0,42.0,53.0,35.0,39.0,40.0,69.5,31.5,55.0,28.5,13.5,10.0,9.5,9.0,8.0,12.5,13.0]

If we plot it, we can see (qualitatively) a largest region roughly between 750 and 1100.


If we smooth the data, we can see this largest region more clearly:

pd.Series(data).ewm(span=100).mean().plot()


My question is: what techniques/algorithms can be used to identify this interval (e.g. (800, 1200))? I have many datasets like this with different shapes, but each contains 1 or 2 largest "regions".
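One technique that fits this shape of problem is simple thresholding of the smoothed series: take the first and last indices where the smoothed values exceed a fraction of their maximum. A minimal sketch (the `span` and `frac` values are illustrative assumptions that would need tuning per dataset, not values from this post):

```python
import numpy as np
import pandas as pd

def largest_region(values, span=100, frac=0.5):
    """Return (start, end) indices where the smoothed series
    exceeds `frac` of its own maximum."""
    smooth = pd.Series(values).ewm(span=span).mean().to_numpy()
    above = np.flatnonzero(smooth >= frac * smooth.max())
    return int(above[0]), int(above[-1])

# toy example: a single bump centered near index 500
toy = np.exp(-((np.arange(1000) - 500) / 100.0) ** 2)
start, end = largest_region(toy, span=10)  # interval bracketing index 500
```

With two "regions" this only works if each crosses the threshold; you would then split `above` wherever consecutive indices are not adjacent and keep the longest run.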

Any ideas? Thanks.

Solution

Here is the mountain-climber solution I mentioned in the comments. I saved the data you posted to a numpy file: https://drive.google.com/file/d/192jp5LvEE0Dc8QVMVmzzuHSehl2_bBLF/view?usp=sharing

Plot after mean filtering and mountain climbing:


The bounds after thresholding are based on the value at which the climb begins.

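The `meanFilter` helper in the code below is a plain running mean; the same operation can be expressed with numpy's convolution. A minimal sketch (without the zero-padding the answer's code uses to keep index positions aligned):

```python
import numpy as np

def mean_filter(vals, size):
    """Running mean over a window of `size` samples (valid region only)."""
    kernel = np.ones(size) / size
    return np.convolve(vals, kernel, mode="valid")

vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = mean_filter(vals, 3)  # mean of each 3-sample window
```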

import numpy as np
import matplotlib.pyplot as plt

# returns direction of gradient
# 1 if positive, -1 if negative, 0 if flat
def getDirection(one,two):
    dx = two - one;
    if dx == 0:
        return 0;
    if dx > 0:
        return 1;
    return -1;

# detects and returns peaks and valleys
def mountainClimber(vals,minClimb):
    # init trackers
    last_valley = vals[0];
    last_peak = vals[0];
    last_val = vals[0];
    last_dir = getDirection(vals[0],vals[1]);

    # get climbing
    peak_valley = []; # index,height,climb (positive for peaks,negative for valleys)
    for a in range(1,len(vals)):
        # get current direction
        sign = getDirection(last_val,vals[a]);
        last_val = vals[a];

        # if not equal,check gradient
        if sign != 0:
            if sign != last_dir:
                # change in gradient,record peak or valley
                # peak
                if last_dir > 0:
                    last_peak = vals[a];
                    climb = last_peak - last_valley;
                    climb = round(climb,2);
                    peak_valley.append([a,vals[a],climb]);
                else:
                    # valley
                    last_valley = vals[a];
                    climb = last_valley - last_peak;
                    climb = round(climb,2);
                    peak_valley.append([a,vals[a],climb]);

                # change direction
                last_dir = sign;

    # filter out very small climbs
    filtered_pv = [];
    for dot in peak_valley:
        if abs(dot[2]) > minClimb:
            filtered_pv.append(dot);
    return filtered_pv;

# run an mean filter over the graph values
def meanFilter(vals,size):
    fil = [];
    filtered_vals = [];
    for val in vals:
        fil.append(val);

        # drop the oldest value once the window exceeds `size`
        if len(fil) > size:
            fil = fil[1:];
        if len(fil) == size:
            filtered_vals.append(sum(fil) / size);
        else:
            # pad to maintain index positions
            filtered_vals.append(0);
    return filtered_vals;

# load from file
data = np.load("data.npy");

# filter and round values
mean_filter_size = 150;
filtered_vals = meanFilter(data,mean_filter_size);

# get peaks and valleys
pv = mountainClimber(filtered_vals,0);

# filter for the largest climb
biggest_climb = -1;
top_index = None;
for pv_index,feature in enumerate(pv):
    # unpack
    _,_,climb = feature;

    # check climb
    if climb > biggest_climb:
        biggest_climb = climb;
        top_index = pv_index;

# the valley preceding the biggest climb gives the start index and the threshold
start = pv[top_index - 1][0];
threshold = pv[top_index - 1][1];

# look through and find the first spot where the graph drops below threshold
end = None;
for index in range(start + 1,len(data)):
    if data[index] < threshold:
        end = index;
        break;

# draw the bounding lines
markers_x = [start,end];
markers_y = [data[start],data[end]];

# draw plot
x = [a for a in range(len(data))];
fig = plt.figure();
ax = plt.axes();
ax.plot(x,data);
ax.plot(markers_x,markers_y,'or');
plt.show();
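The mountain-climber scan above is essentially a sign-change search over the first difference. The same "biggest climb" can be found on toy data with plain numpy; this is a sketch of the idea, not the author's code, and it ignores flat runs:

```python
import numpy as np

def biggest_climb(vals):
    """Largest rise between a turning point and the next one,
    found via sign changes in the first difference."""
    d = np.sign(np.diff(vals))
    # indices where the direction changes (peaks and valleys)
    turns = [0]
    turns += [i + 1 for i in range(len(d) - 1) if d[i + 1] != d[i] and d[i + 1] != 0]
    turns.append(len(vals) - 1)
    # rise (or fall) between consecutive turning points
    climbs = [vals[b] - vals[a] for a, b in zip(turns, turns[1:])]
    return max(climbs)

toy = [1.0, 3.0, 2.0, 8.0, 4.0, 5.0]  # biggest climb is 2.0 -> 8.0
```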

