How might I rank terms by their frequency of use on Twitter over a period of time?

-4

I'm trying to create an app that collects data from Twitter: I want to take a list of words and determine the frequency with which they appear over a given time frame.

How should I go about accomplishing this?

2012-04-04 19:41
by Antarr Byrd

This will probably help you, http://stackoverflow.com/questions/1280942/which-c-net-twitter-api-do-you-recomen - Orn Kristjansson 2012-04-04 19:42

Why would you vote this down? - Antarr Byrd 2012-04-04 20:39

@orn I was referring to the actual Twitter APIs. I need to develop my own library. - Antarr Byrd 2012-04-04 20:40

Why vote down? Hover over the downvote button: "This question does not show any research effort." The very first result on Google for "twitter api" is, surprise, surprise, the Twitter API. Read through the docs and tutorials there and come back with a concrete question if you get stuck. - Kevin 2012-04-04 20:52

You don't have many choices when it comes to offerings that Twitter supports directly.

You can use the Twitter Search API but it has the following limitations:

The current index includes between six-nine days of tweets.

You cannot use the Search API to find Tweets older than about a week.

That said, if searching within that window is acceptable for you, there are a few parameters that you can use to filter tweets by time:

until - Returns tweets created before a given date
since_id - Returns tweets with an id greater than (that is, newer than) a given tweet id
max_id - Returns tweets with an id less than or equal to (that is, no newer than) a given tweet id

Because tweet ids increase over time, it's better to find the ids of the tweets that delimit the range you want to search and pass those as since_id and max_id.

Note that for the keywords, you would use the q parameter.

Also note that you'll have to page the results through the use of the page and rpp (results per page) parameters.
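As a rough illustration, here is a minimal Python sketch of how a request URL could be assembled from the parameters mentioned above. The base URL, the function name build_search_url and the YYYY-MM-DD date format for until are assumptions on my part; check them against the current Search API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumption: JSON endpoint of the Search API; verify against the docs.
SEARCH_URL = "http://search.twitter.com/search.json"


def build_search_url(term, page=1, rpp=100, until=None, since_id=None, max_id=None):
    """Build a Search API URL for one page of results (hypothetical helper)."""
    params = {"q": term, "page": page, "rpp": rpp}
    # Only send the time filters that the caller actually supplied.
    if until is not None:
        params["until"] = until        # assumed format: YYYY-MM-DD
    if since_id is not None:
        params["since_id"] = since_id  # lower bound on tweet id
    if max_id is not None:
        params["max_id"] = max_id      # upper bound on tweet id
    return SEARCH_URL + "?" + urlencode(params)
```

You would call this once per term and per page, incrementing page until a request comes back with no results.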

You could also use third-party services to archive tweets, but the risk here is that these services might not be around as long as you need them to.

If you have the capacity, I recommend using the Streaming API to have a continuous feed of tweets pushed to your application, which you would then store for future processing.

Basically, you make and keep an open connection with Twitter which then feeds tweets to you. Note that this feed is rate-limited and quality-controlled. However, it's a good way to keep the data that you want for as long as you want from the moment you turn the switch on in your application.
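To give an idea of the storing side, here is a minimal Python sketch. It assumes the stream arrives as newline-delimited JSON with blank keep-alive lines in between, and that anything carrying a text field is a tweet. The function name store_stream is hypothetical, lines can be any iterable of strings (for example, the lines of an open HTTP response), and the plain list stands in for whatever database you end up using.

```python
import json


def store_stream(lines, store):
    """Append every tweet found in an iterable of JSON lines to `store`."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank keep-alive line, nothing to store
        try:
            message = json.loads(line)
        except ValueError:
            continue  # malformed or partial chunk, skip it
        # Assumption: tweets carry a "text" field; other messages
        # (such as deletion notices) do not and are ignored here.
        if isinstance(message, dict) and "text" in message:
            store.append(message)
    return store
```

Keeping the whole JSON document, rather than only the text, means you can still filter on other properties later.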

Once you've settled how you're going to get the results, getting the frequency is easy. Assuming you are storing the results, I'd recommend using a document-oriented database (something like Elasticsearch or RavenDB). These are better suited to handling the JSON format that Tweet Entities are returned in, and they give you much better mechanisms for querying and manipulating that data in the future.

In both of the mentioned solutions, you can get the counts of the total number of items as well as how many items fit a certain search term (and you can additionally filter on properties of the JSON document, if you want).
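The same two counts (total items, and items matching a term) can be sketched in plain Python over tweets you have already stored. This is only an in-memory stand-in for what the database would do for you. The function names, the simple regex tokenizer and the assumption that each tweet is a dict with a text field are mine, and the tweets are assumed to be already filtered to the time frame you care about.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_']+", text.lower())


def rank_terms(tweets, terms):
    """Return (term, matching_tweets, share_of_all_tweets), most frequent first."""
    total = len(tweets)
    counts = {term: 0 for term in terms}
    for tweet in tweets:
        words = set(tokenize(tweet["text"]))  # count each tweet once per term
        for term in terms:
            if term.lower() in words:
                counts[term] += 1
    rows = [
        (term, count, count / total if total else 0.0)
        for term, count in counts.items()
    ]
    # Highest count first; ties keep the order of the input list.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Running this once per time window (per day, per hour, and so on) gives you a ranking for each period.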

If you want to get term frequency/inverse document frequency, then I believe that Elasticsearch will allow you to access those index statistics directly (I'm not sure about RavenDB). Alternatively, you could build a document store yourself with Lucene.NET if you want to get really bare-bones. That's much more work to implement, but it puts you much closer to the stats you want.
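For reference, here is a minimal Python sketch of one common TF-IDF variant, computed by hand over a small set of tweet texts: term frequency is the term's share of the words in a document, and inverse document frequency is log(N / df). This is not necessarily the formula that Elasticsearch or Lucene use internally, and the function names and tokenizer are assumptions for the sake of the example.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_']+", text.lower())


def tf_idf(documents, term):
    """Return one TF-IDF score per document for a single term."""
    term = term.lower()
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for words in tokenized if term in words)
    if df == 0:
        return [0.0] * len(documents)
    idf = math.log(len(documents) / df)
    scores = []
    for words in tokenized:
        # Term frequency: share of this document's words that are the term.
        tf = words.count(term) / len(words) if words else 0.0
        scores.append(tf * idf)
    return scores
```

A term that appears in every tweet gets an IDF of zero under this variant, which is the point: it says nothing about any particular tweet.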

2012-04-04 20:41
by casperOne

Thanks for the input, it's a great help. Just one question: why would you prefer a document-oriented database over a relational database in this situation? - Antarr Byrd 2012-04-04 20:54

@atbyrd Most of the document-oriented databases out there nowadays are schemaless and use JSON as their wire format, which reduces the friction in taking the responses from Twitter and storing them. You also wouldn't be affected by the schema changes that Twitter makes from time to time, and you'd have better ways to query individual documents and perform proper text searches (with the correct analyzers, stemming, etc., something that is difficult to do with a relational database) - casperOne 2012-04-04 20:58