How to choose an appropriate sample size of nodes from a graph
I have a network whose nodes carry an attribute labelled 0 or 1. I want to find out how the distance between nodes with the same attribute differs from the distance between nodes with different attributes. Since it is computationally hard to find the distances between all combinations of nodes, I want to work with a sample of nodes. How would I choose the sample size? I am working in Python with networkx.
Solution
You haven't provided many details, so I'll invent some data and make some assumptions in the hope that it's useful.
First, import the packages and sample a dataset:
import random
import networkx as nx
# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)
# set labels to either 0 or 1
for i,attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0
Next, calculate the shortest paths between random pairs of nodes:
results = []
# I had to use 100,000 pairs to get the CI small enough below
for _ in range(100000):
    a,b = random.sample(list(G.nodes),2)
    try:
        n = nx.algorithms.shortest_path_length(G,a,b)
    except nx.NetworkXNoPath:
        # no path between nodes found
        n = -1
    results.append((a,b,n))
Finally, here is some code to summarise the results and print them out:
from collections import Counter
from scipy import stats

# somewhere to accumulate counts of both 0, both 1, and different labels
c_0 = Counter()
c_1 = Counter()
c_d = Counter()
# accumulate distances into the above counters
node_data = {i: a['label'] for i,a in G.nodes.data()}
cc = { (0,0): c_0, (0,1): c_d, (1,0): c_d, (1,1): c_1 }
for a,b,n in results:
    cc[node_data[a],node_data[b]][n] += 1

# code to display the results nicely
def show(c,title):
    s = sum(c.values())
    print(f'{title},n={s}')
    for k,n in sorted(c.items()):
        # calculate some sort of CI over monte carlo error
        lo,hi = stats.beta.ppf([0.025,0.975],1 + n,1 + s - n)
        print(f'{k:5}: {n:5} = {n/s:6.2%} [{lo:6.2%},{hi:6.2%}]')

show(c_0,'both 0')
show(c_1,'both 1')
show(c_d,'different')
The above prints:
both 0,n=63930
-1: 60806 = 95.11% [94.94%,95.28%]
1: 107 = 0.17% [ 0.14%,0.20%]
2: 753 = 1.18% [ 1.10%,1.26%]
3: 1137 = 1.78% [ 1.68%,1.88%]
4: 584 = 0.91% [ 0.84%,0.99%]
5: 334 = 0.52% [ 0.47%,0.58%]
6: 154 = 0.24% [ 0.21%,0.28%]
7: 50 = 0.08% [ 0.06%,0.10%]
8: 3 = 0.00% [ 0.00%,0.01%]
9: 2 = 0.00% [ 0.00%,0.01%]
both 1,n=3978
-1: 3837 = 96.46% [95.83%,96.99%]
1: 6 = 0.15% [ 0.07%,0.33%]
2: 34 = 0.85% [ 0.61%,1.19%]
3: 34 = 0.85% [ 0.61%,1.19%]
4: 31 = 0.78% [ 0.55%,1.10%]
5: 30 = 0.75% [ 0.53%,1.07%]
6: 6 = 0.15% [ 0.07%,0.33%]
I've cut off the section for differing labels to save space. The proportions in square brackets are the 95% CI of the Monte Carlo error. Using more iterations above reduces this error, while obviously needing more CPU time.
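To pick a sample size up front rather than by trial and error, you can invert the usual normal-approximation formula for the confidence interval on a proportion. A minimal sketch, where the target half-width `eps` and the worst-case proportion `p=0.5` are assumptions you would tune to your needs:

```python
import math

def pairs_needed(eps, p=0.5, z=1.96):
    """Number of sampled pairs so that the 95% CI half-width on an
    estimated proportion is at most eps (normal approximation)."""
    return math.ceil(p * (1 - p) * (z / eps) ** 2)

# worst case p=0.5, half-width of one percentage point
print(pairs_needed(0.01))   # 9604
# tighter half-width of half a percentage point
print(pairs_needed(0.005))  # 38416
```

This is why the answer above needed on the order of 100,000 pairs: the rarer `both 1` bucket only receives a small fraction of the samples, so the total must be much larger than the per-bucket target.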
This is more or less an extension of my discussion with Sam Mason; I just want to give you some timing numbers, because, as discussed, retrieving all distances is feasible and may even be faster. Based on the code from Sam Mason's answer, I tested two variants: retrieving all distances for 1000 nodes is much faster than sampling 100,000 pairs. The main advantage is that all of the retrieved distances are used.
import random
import networkx as nx
import time

# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)
# set labels to either 0 or 1
for i,attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

def timing(f):
    def wrap(*args,**kwargs):
        time1 = time.time()
        ret = f(*args,**kwargs)
        time2 = time.time()
        print('{:s} function took {:.3f} ms'.format(f.__name__,(time2-time1)*1000.0))
        return ret
    return wrap

@timing
def get_sample_distance():
    results = []
    # I had to use 100,000 pairs to get the CI small enough below
    for _ in range(100000):
        a,b = random.sample(list(G.nodes),2)
        try:
            n = nx.algorithms.shortest_path_length(G,a,b)
        except nx.NetworkXNoPath:
            # no path between nodes found
            n = -1
        results.append((a,b,n))

@timing
def get_all_distances():
    # shortest_path_length(G) returns a lazy generator, so it must be
    # consumed here for the timing to measure the actual work
    all_distances = dict(nx.shortest_path_length(G))

get_sample_distance()
# get_sample_distance function took 2338.038 ms
get_all_distances()
# get_all_distances function took 304.247 ms
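Since `nx.shortest_path_length(G)` yields `(source, {target: distance})` pairs, the all-pairs variant can feed the same counters from Sam Mason's answer directly. A sketch on a smaller stand-in graph (the 200-node size and `seed` value are assumptions for reproducibility, not from the original answers):

```python
import random
from collections import Counter
import networkx as nx

G = nx.generators.scale_free_graph(200, seed=42)
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

node_data = {i: a['label'] for i, a in G.nodes.data()}
c_0, c_1, c_d = Counter(), Counter(), Counter()
cc = {(0, 0): c_0, (0, 1): c_d, (1, 0): c_d, (1, 1): c_1}

# the generator yields only reachable targets per source,
# so no NetworkXNoPath handling is needed here
for a, dists in nx.shortest_path_length(G):
    for b, n in dists.items():
        if a == b:
            continue  # skip the zero-length self-distance
        cc[node_data[a], node_data[b]][n] += 1

total = sum(sum(c.values()) for c in (c_0, c_1, c_d))
print(total)
```

Note one difference from the sampling approach: unreachable pairs never appear in the generator's output, so the `-1` bucket disappears and the denominators count only connected pairs.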