Python group numbers into ranges

I referred Unsupervised clustering with unknown number of clusters as suggested by @schwobaseggl and changed the code a bit as per my need. Here is the new code:

import numpy
import scipy.cluster.hierarchy as hcluster

temp_data = [31,68,74,46,47,83,29,11,9,52,1272,1288,1297,1285,1294,1251,1265,1257,1280,1265,1292,1297,1271,1273,1253,1273,1291,1251,1295,1298,1264,1281,1294,1280,1250,1279,1298,1290,1294,1299,1266,1260,1298,1292,1280,1259,1266,1276,1253,1252,1271,1280,1284,1266,1254,1259,1291,1268,1253,1298,1288,1271,1298,1300,1274,1294,1263,1298,1270,1254,1266,1269,1283,1285,1286,1276,1257,1266,1272,1298,1261,1251,1272,1260,1291,1269,1260,1294,1287,1256,1253,1284,1269,1287,1292,1269,1272,1275,1250,1289,56,35,19,80,47,22,92,8,10,24,87,76,60,63,64,0,1295,1268,1280,1281,1277,1300,1278,1273,1250,1296,1266,1269,1282,1281,1272,1260,1292,1272,1253,1255,1299,1269,1268,1294,1250,1299,1292,1254,1281,1289,1259,1290,1271,1280,1272,1300,1258,1290,1289,1300,1299,1261,1300,1276,1290,1299,1280,1267,1283,1282,1269,1260,1285,1252,1250,1263,1297,1300,1292,1266,1260,1263,1292,1296,1289,1297,1251,1261,1250,1294,1278,1284,1291,1281,1269,1261,1257,1267,1265,1288,1291,1257,1296,1251,1260,1272,1294,1285,1269,1283,1297,1287,1253,1292,1299,1295,1286,1288,1283,1290,20,73,81,6,49,88,96,61,49,94,57,16,61,16,17,19,1280,1257,1259,1277,1257,1262,1263,1280,1292,1250,1287,1272,1258,1253,1285,1285,1257,1291,1273,1260,1267,1250,1280,1281,1263,1269,1292,1250,1282,1263,1274,1288,1296,1266,1291,1271,1273,1281,1261,1289,1269,1287,1296,1283,1280,1298,1259,1270,1259,1289,1269,1284,1295,1297,1256,1300,1281,1296,1284,1288,1285,1296,1277,1251,1279,1295,1281,1264,1280,1263,69,8,30,45,89,61,80,45,9,18,19,11,1255,1299,1296,1293,1287,1250,1265,1291,1281,1250,1286,1286,1251,1287,1266,1288,1254,1260,1260,1254,1267,1299,1273,1250,1300,1250,1279,1255,1293,1292,1278,1277,1252,1299,1278,1258,1268,1274,1285,1258,1279,1270,1278,1286,1278,1253,1267,1300,1295,1298,1285,1288,1274,1272,1252,1256,1283,1289,1251,1258,1253,1257,1297,1269,1292,1253,1273,1281,1251,1280,1253,1274,1275,1287,1296,1298,1296,1291,1284,1261,1267,1290,1273,1281,1263,1270,1264,1269,1278,1284,67,8,40,59,97,64,45,72,45,90,94,7,33,58,97,97,1252,1297,1265,1278,1272,1252,1258,1261,1287,1260,1260,1258,1280,1263,1256,1296,1269,1270,1296,1282,696,678,665,700,700,691,689,688,650,663,662,698,655,660,662,684,690,657,653,663,670,691,687,675,694,670,676,659,661,664,664,689,683,675,687,691,676,659,689,657,659,656,654,679,669,687,666,662,691,1260,1276,1252,1295,1257,1277,1281,1257,1295,1269,1265,1290,1266,1269,1286,1254,1260,1265,1290,1294,1286,1279,1254,1256,1276,1285,1282,1251,1282,1261,1253,56,74,85,94,18,83,38,80,8,4,78,43,7,79,68,78,1275,1250,1268,1297,1284,1255,1294,1262,1250,1252,680,693,677,676,670,653,670,661,658,695,665,671,656,686,662,691,675,658,671,650,667,653,652,686,667,682,694,654,689,682,667,658,651,652,692,652,655,651,650,698,655,650,679,672,697,696,696,683,1277,1264,1274,1260,1285,1285,1283,1259,1260,1288,1281,1284,1281,1257,1285,1295,1273,1264,1283,1284,1300,1299,1257,1297,1254,1257,1270,1257,1295,34,5,73,42,27,36,91,85,19,50,34,21,73,38,18,73]

ndata = [[td, td] for td in temp_data]
data = numpy.array(ndata)

# clustering
thresh = (11.0/100.0) * (max(temp_data) - min(temp_data))  #Threshold 11% of the total range of data

clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

total_clusters = max(clusters)

clustered_index = []
for i in range(total_clusters):
    clustered_index.append([])

for i in range(len(clusters)):
    clustered_index[clusters[i] - 1].append(i)

clustered_range = []
for x in clustered_index:
    clustered_index_x = [temp_data[y] for y in x]
    clustered_range.append((min(clustered_index_x) , max(clustered_index_x)))

print clustered_range

i have chosen threshold value (thres) as 11% of the total range of data

so the output for this dataset is:

[(0, 97), (1250, 1300), (650, 700)]

Your use of the dictionary seems to be a way to allow the numbers to arrive out-of-order. Rather than sort them, you seem to be trying to use a dictionary (and associated hashing) to maximize efficiency.

This doesn't really work out perfectly, since you wind up doing sequential searches for a given value.

Hashing a low:high range (a dictionary key:value pair) to avoid a search doesn't help much. Only they key gets hashed. It does help in the case where you're extending a range at the low end. But for extending a range at the high end, you have to resort to searches of the dictionary values.

What you're really creating is a collection of "partitions". Each partition is bounded by a low and high value. Rather than a dictionary, you can also use a trivial tuple of (low, high). To be most Pythonic, the (low,high) pair includes the low, but does not include the high. It's a "half-open interval".

Here's a version using a simple set of tuples, relying on hashes instead of bisection.

A binary search (using the bisect module) may perform well, also. It would slightly simplify adjacent range merging, since the two ranges would actually be adjacent. However, this leads to a cost in restructuring the sequence.

You start with an empty set of partitions, the first number creates a trivial partition of just that number.

Each next number can lead to one of three things.

  1. It extends an existing partition on the low end or high end. The number is adjacent to exactly one range.

  2. It creates a new partition. The number is not adjacent to any range.

  3. It "bridges" two adjacent partitions, combining them into one. The number is adjacent to two ranges.

It's something like this.

def make_ranges(numberlist):
    ranges=set()
    for number in numberlist:
        print ranges, number
        adjacent_to = set( r for r in ranges if r[0]-1 == number or r[1] == number )
        if len(adjacent_to) == 0:
            # Add a new partition.
            r = (number,number+1)
            assert r[0] <= number < r[1] # Trivial, right?
            ranges.add( r )
        elif len(adjacent_to) == 1:
            # Extend this partition, build a new range.
            ranges.difference_update( adjacent_to )
            r= adjacent_to.pop()
            if r[0]-1 == number: # Extend low end
                r= (number,r[1])
            else:
                r= (r[0],number+1) # Extend high end
            assert r[0] <= number < r[1] 
            ranges.add( r )
        elif len(adjacent_to) == 2:
            # Merge two adjacent partitions.
            ranges.difference_update( adjacent_to )
            r0= adjacent_to.pop()
            r1= adjacent_to.pop()
            r = ( min(r0[0],r1[0]), max(r0[1],r1[1]) )
            assert r[0] <= number < r[1]
            ranges.add( r )
    return ranges

This is a pretty radical rethinking of your algorithm, so it's not a proper code review.

Assembling a string with separators is simpler in Python. For your example of ","-separated strings, always think of doing it this way.

final = ",".join( details )

Once you have that, you're simply making a sequence of details. In this case, each detail is either "x-y" or "x" as the two forms that a range can take.

So it has to be something like

details = [ format_a_range(r) for r in ranges ]

In this example, I've shown the format as an in-line def. It can be done as an if-else expression, also.

Given your ranges, the overall function is this.

def formatpagelist (numberlist):
    ranges= make_ranges( numberlist )
    def format_a_range( low, high ):
        if low == high-1: return "{0}".format( low )
        return "{0}-{1}".format( low, high-1 )
    strings = [ format_a_range(r) for r in sorted(ranges) ]
    return ",".join( strings )