For finding trending topics, I use the Standard score in combination with a moving average:
z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]
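As a minimal sketch, the formula above could be computed in Python like this. The function name `z_score`, the use of the population standard deviation, and returning 0.0 for a completely flat history are assumptions made for illustration, not part of the original setup.

```python
from statistics import mean, pstdev

def z_score(current, history):
    """Standard score of `current` relative to a list of historic window counts."""
    if len(history) < 2:
        raise ValueError("need at least two historic windows")
    sigma = pstdev(history)  # assumption: population standard deviation
    if sigma == 0:
        # assumption: a perfectly flat history is treated as "no measurable trend"
        return 0.0
    return (current - mean(history)) / sigma
```

For example, `z_score(30, [10, 20])` compares 30 current hits against two historic windows with mean 15 and standard deviation 5.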
So far, I have been doing it as follows:
Whatever the time is, for the historic trends I simply go back 24h. Assuming we have January 12, 3:45pm now:
current_trend = hits [Jan 11, 3:45 - Jan 12, 3:45]
historic_trends = hits [Jan 10, 3:45 - Jan 11, 3:45] + hits [Jan 9, 3:45 - Jan 10, 3:45] + hits [Jan 8, 3:45 - Jan 9, 3:45] + ...
But is this really adequate? Wouldn't it be better if I always started at 0:00 (midnight)? For example, like this for the same moment (3:45pm):
current_trend = hits [Jan 11, 0:00 - Jan 12, 0:00]
historic_trends = hits [Jan 10, 0:00 - Jan 11, 0:00] + hits [Jan 9, 0:00 - Jan 10, 0:00] + hits [Jan 8, 0:00 - Jan 9, 0:00] + ...
I'm sure the results would be different. But which approach will give you better results?
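To make the two windowing schemes concrete, here is a rough Python sketch that builds the 24-hour windows either rolling back from the current time or aligned to midnight. The helper names `daily_windows` and `count_hits`, the half-open interval convention, and the year used in the example are assumptions for illustration only.

```python
from datetime import datetime, timedelta

def daily_windows(now, n, align_to_midnight=False):
    """Return n (start, end) 24-hour windows, newest first."""
    end = now
    if align_to_midnight:
        # snap the end of the newest window back to the most recent 0:00
        end = now.replace(hour=0, minute=0, second=0, microsecond=0)
    day = timedelta(days=1)
    return [(end - (i + 1) * day, end - i * day) for i in range(n)]

def count_hits(hit_times, window):
    """Count hits with start <= t < end (half-open interval, an assumption)."""
    start, end = window
    return sum(1 for t in hit_times if start <= t < end)

now = datetime(2009, 1, 12, 15, 45)  # January 12, 3:45pm; the year is arbitrary
rolling = daily_windows(now, 3)
aligned = daily_windows(now, 3, align_to_midnight=True)
```

Here `rolling[0]` covers Jan 11, 3:45pm to Jan 12, 3:45pm, while `aligned[0]` covers Jan 11, 0:00 to Jan 12, 0:00, so in the aligned variant the hits since midnight today fall into no window at all.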
I hope you've understood my question and you can help me. :) Thanks in advance!
I think that the problem you may be seeing with your current implementation is that topics that were hot 23 hours ago are influencing your rankings right now. The problem I see with your new proposed implementation is that you're wiping the slate clean at midnight, so topics that were hot late last night won't seem hot early the next morning (but they should).
I suggest you look into implementing a Digg-style algorithm (sorry for linking to Digg) where the hotness of a topic decays with age. You could do this by counting the hits per hour for each of the last 24 one-hour periods, then dividing each period's score by how many hours ago that period took place. Add up the 24 weighted periods to get the score.
hotness = (score24 / 24) + (score23 / 23) + ... + (score2 / 2) + score1
Where score24 is the number of "hits" that a topic got in the one-hour period that occurred 24 hours ago (maybe not the hits exactly, but the normalized score for that hour).
This way topics that were hot 24 hours ago will still be counted in your algorithm, but not as heavily as topics that were hot an hour ago.
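A minimal Python sketch of this weighting, assuming the hourly scores are already available as a list ordered from most recent to oldest (the function name `hotness` and that ordering are my own choices for illustration):

```python
def hotness(hourly_scores):
    """Sum of hourly scores, each divided by how many hours ago it occurred.

    hourly_scores[0] is the most recent hour (score1),
    hourly_scores[23] is the hour 24 hours ago (score24).
    """
    return sum(
        score / hours_ago
        for hours_ago, score in enumerate(hourly_scores, start=1)
    )

# Same 24 hits, but at different ages:
recent = hotness([24] + [0] * 23)  # all hits in the last hour
stale = hotness([0] * 23 + [24])   # all hits 24 hours ago
```

With these inputs `recent` is 24 times larger than `stale`, which is the decay effect described above.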
But your example concerning my algorithm and the time periods is very good. So do you recommend the first approach (simply going 24h back instead of starting at 0:00)? - caw 2009-06-17 17:15