Python group numbers into ranges

I referred Unsupervised clustering with unknown number of clusters as suggested by @schwobaseggl and changed the code a bit as per my need. Here is the new code:

import numpy
import scipy.cluster.hierarchy as hcluster

temp_data = [31,68,74,46,47,83,29,11,9,52,1272,1288,1297,1285,1294,1251,1265,1257,1280,1265,1292,1297,1271,1273,1253,1273,1291,1251,1295,1298,1264,1281,1294,1280,1250,1279,1298,1290,1294,1299,1266,1260,1298,1292,1280,1259,1266,1276,1253,1252,1271,1280,1284,1266,1254,1259,1291,1268,1253,1298,1288,1271,1298,1300,1274,1294,1263,1298,1270,1254,1266,1269,1283,1285,1286,1276,1257,1266,1272,1298,1261,1251,1272,1260,1291,1269,1260,1294,1287,1256,1253,1284,1269,1287,1292,1269,1272,1275,1250,1289,56,35,19,80,47,22,92,8,10,24,87,76,60,63,64,0,1295,1268,1280,1281,1277,1300,1278,1273,1250,1296,1266,1269,1282,1281,1272,1260,1292,1272,1253,1255,1299,1269,1268,1294,1250,1299,1292,1254,1281,1289,1259,1290,1271,1280,1272,1300,1258,1290,1289,1300,1299,1261,1300,1276,1290,1299,1280,1267,1283,1282,1269,1260,1285,1252,1250,1263,1297,1300,1292,1266,1260,1263,1292,1296,1289,1297,1251,1261,1250,1294,1278,1284,1291,1281,1269,1261,1257,1267,1265,1288,1291,1257,1296,1251,1260,1272,1294,1285,1269,1283,1297,1287,1253,1292,1299,1295,1286,1288,1283,1290,20,73,81,6,49,88,96,61,49,94,57,16,61,16,17,19,1280,1257,1259,1277,1257,1262,1263,1280,1292,1250,1287,1272,1258,1253,1285,1285,1257,1291,1273,1260,1267,1250,1280,1281,1263,1269,1292,1250,1282,1263,1274,1288,1296,1266,1291,1271,1273,1281,1261,1289,1269,1287,1296,1283,1280,1298,1259,1270,1259,1289,1269,1284,1295,1297,1256,1300,1281,1296,1284,1288,1285,1296,1277,1251,1279,1295,1281,1264,1280,1263,69,8,30,45,89,61,80,45,9,18,19,11,1255,1299,1296,1293,1287,1250,1265,1291,1281,1250,1286,1286,1251,1287,1266,1288,1254,1260,1260,1254,1267,1299,1273,1250,1300,1250,1279,1255,1293,1292,1278,1277,1252,1299,1278,1258,1268,1274,1285,1258,1279,1270,1278,1286,1278,1253,1267,1300,1295,1298,1285,1288,1274,1272,1252,1256,1283,1289,1251,1258,1253,1257,1297,1269,1292,1253,1273,1281,1251,1280,1253,1274,1275,1287,1296,1298,1296,1291,1284,1261,1267,1290,1273,1281,1263,1270,1264,1269,1278,1284,67,8,40,59,97,64,45,72,45,90,94,7,33,58,97,97,1252,1297,1265,1278,1272,1252,1258,1261,1287,1260,1260,1258,1280,1263,1256,1296,1269,1270,1296,1282,696,678,665,700,700,691,689,688,650,663,662,698,655,660,662,684,690,657,653,663,670,691,687,675,694,670,676,659,661,664,664,689,683,675,687,691,676,659,689,657,659,656,654,679,669,687,666,662,691,1260,1276,1252,1295,1257,1277,1281,1257,1295,1269,1265,1290,1266,1269,1286,1254,1260,1265,1290,1294,1286,1279,1254,1256,1276,1285,1282,1251,1282,1261,1253,56,74,85,94,18,83,38,80,8,4,78,43,7,79,68,78,1275,1250,1268,1297,1284,1255,1294,1262,1250,1252,680,693,677,676,670,653,670,661,658,695,665,671,656,686,662,691,675,658,671,650,667,653,652,686,667,682,694,654,689,682,667,658,651,652,692,652,655,651,650,698,655,650,679,672,697,696,696,683,1277,1264,1274,1260,1285,1285,1283,1259,1260,1288,1281,1284,1281,1257,1285,1295,1273,1264,1283,1284,1300,1299,1257,1297,1254,1257,1270,1257,1295,34,5,73,42,27,36,91,85,19,50,34,21,73,38,18,73]

ndata = [[td, td] for td in temp_data]
data = numpy.array[ndata]

# clustering
thresh = [11.0/100.0] * [max[temp_data] - min[temp_data]]  #Threshold 11% of the total range of data

clusters = hcluster.fclusterdata[data, thresh, criterion="distance"]

total_clusters = max[clusters]

clustered_index = []
for i in range[total_clusters]:
    clustered_index.append[[]]

for i in range[len[clusters]]:
    clustered_index[clusters[i] - 1].append[i]

clustered_range = []
for x in clustered_index:
    clustered_index_x = [temp_data[y] for y in x]
    clustered_range.append[[min[clustered_index_x] , max[clustered_index_x]]]

print clustered_range

i have chosen threshold value [thres] as 11% of the total range of data

so the output for this dataset is:

[[0, 97], [1250, 1300], [650, 700]]

Your use of the dictionary seems to be a way to allow the numbers to arrive out-of-order. Rather than sort them, you seem to be trying to use a dictionary [and associated hashing] to maximize efficiency.

This doesn't really work out perfectly, since you wind up doing sequential searches for a given value.

Hashing a low:high range [a dictionary key:value pair] to avoid a search doesn't help much. Only they key gets hashed. It does help in the case where you're extending a range at the low end. But for extending a range at the high end, you have to resort to searches of the dictionary values.

What you're really creating is a collection of "partitions". Each partition is bounded by a low and high value. Rather than a dictionary, you can also use a trivial tuple of [low, high]. To be most Pythonic, the [low,high] pair includes the low, but does not include the high. It's a "half-open interval".

Here's a version using a simple set of tuples, relying on hashes instead of bisection.

A binary search [using the bisect module] may perform well, also. It would slightly simplify adjacent range merging, since the two ranges would actually be adjacent. However, this leads to a cost in restructuring the sequence.

You start with an empty set of partitions, the first number creates a trivial partition of just that number.

Each next number can lead to one of three things.

  1. It extends an existing partition on the low end or high end. The number is adjacent to exactly one range.

  2. It creates a new partition. The number is not adjacent to any range.

  3. It "bridges" two adjacent partitions, combining them into one. The number is adjacent to two ranges.

It's something like this.

def make_ranges[numberlist]:
    ranges=set[]
    for number in numberlist:
        print ranges, number
        adjacent_to = set[ r for r in ranges if r[0]-1 == number or r[1] == number ]
        if len[adjacent_to] == 0:
            # Add a new partition.
            r = [number,number+1]
            assert r[0] 

Chủ Đề