I'm trying to create an app that collects data from Twitter: I want to take a list of words and determine the frequency with which they appear over a given time frame.
How should I go about accomplishing this?
You don't have many choices when it comes to offerings that Twitter supports directly.
You can use the Twitter Search API but it has the following limitations:
- The current index includes between six and nine days of tweets.
- You cannot use the Search API to find Tweets older than about a week.
That said, if it's alright for you to search only within this range, then you have a limited number of parameters that you can use to filter tweets by time:
- until: Will return tweets up to a certain date
- since_id: Gives you tweets that occur since a certain tweet id
- max_id: Gives you tweets up to a certain tweet id
Because tweet ids increase in ascending order, it's better to try and have ids of tweets that delimit the range that you want to search.
Note that for the keywords, you would use the q parameter.
Also note that you'll have to page the results through the use of the page and rpp (results per page) parameters.
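As a rough sketch of how those parameters fit together, here is a small Python helper that builds one search URL per word. The parameter names (q, until, since_id, max_id, page, rpp) are the ones mentioned above; the base URL, the function name, and the default of 50 results per page are placeholders I made up, so check them against the current Search API documentation before using this.

```python
from urllib.parse import urlencode

# Placeholder base URL (an assumption): replace it with the real search endpoint.
SEARCH_URL = "https://example.invalid/search.json"


def build_search_url(term, since_id=None, max_id=None, until=None, page=1, rpp=50):
    """Build a search URL for one keyword, bounded by tweet ids and/or a date."""
    params = {"q": term, "page": page, "rpp": rpp}
    # Only send the time filters that were actually supplied.
    if since_id is not None:
        params["since_id"] = since_id
    if max_id is not None:
        params["max_id"] = max_id
    if until is not None:
        params["until"] = until
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # One URL per word and per page; the ids delimit the range to search.
    for word in ["python", "java"]:
        print(build_search_url(word, since_id=100, max_id=200, page=1))
```

You would request each URL in turn, incrementing page until no more results come back, and count the tweets returned for each word.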
You could also use third-party services to archive tweets, but the risk here is that these services might not be around as long as you need them to.
If you have the capacity, I recommend using the Streaming API to get a continuous feed of tweets delivered to your application, which you would then store for future processing.
Basically, you make and keep an open connection with Twitter which then feeds tweets to you. Note that this feed is rate-limited and quality-controlled. However, it's a good way to keep the data that you want for as long as you want from the moment you turn the switch on in your application.
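The storing side of that can be sketched independently of the connection itself. The Python function below assumes the stream arrives as one JSON object per line and that a tweet carries a "text" field; both are assumptions about the feed, and store_stream is a name I invented. Opening the authenticated connection is left out; anything that yields lines will work as input.

```python
import json


def store_stream(lines, sink):
    """Write every tweet found in `lines` to `sink`, one JSON document per line.

    `lines` is any iterable of strings (for example an open HTTP response);
    `sink` is any object with a write() method (for example an open file).
    Returns the number of tweets stored.
    """
    stored = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        try:
            message = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        # Assumption: only messages with a "text" field are tweets.
        if not isinstance(message, dict) or "text" not in message:
            continue
        sink.write(json.dumps(message) + "\n")
        stored += 1
    return stored
```

In a real application the sink would more likely be the database you settle on than a flat file, but the loop stays the same.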
Once you've cleared up how you're going to get the results, getting the frequency is easy. Assuming you are storing the results, I'd recommend using a document-oriented database (something like Elasticsearch or RavenDB). They are better suited to handling the JSON format that Tweet Entities are returned in, and they give you much better mechanisms by which to query and manipulate that data in the future.
In both of the mentioned solutions, you can get the counts of the total number of items as well as how many items fit a certain search term (and you can additionally filter on properties of the JSON document, if you want).
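If you'd rather see the counting logic spelled out, here is a minimal in-memory sketch in Python. It assumes each stored tweet is a dict with a "text" string and a "created_at" value already parsed into a datetime; those field names and the term_frequencies function are illustrative choices, not part of either database's API.

```python
import re
from collections import Counter
from datetime import datetime


def term_frequencies(tweets, terms, start, end):
    """Count how often each term occurs in tweets created in [start, end)."""
    wanted = {term.lower() for term in terms}
    counts = Counter({term: 0 for term in wanted})
    for tweet in tweets:
        if not (start <= tweet["created_at"] < end):
            continue  # outside the requested time frame
        # Split the text into lowercase word tokens and count the wanted ones.
        for token in re.findall(r"\w+", tweet["text"].lower()):
            if token in wanted:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"text": "Python is great, python rocks", "created_at": datetime(2012, 1, 2)},
        {"text": "I like Java", "created_at": datetime(2012, 1, 3)},
        {"text": "python again", "created_at": datetime(2012, 2, 1)},
    ]
    print(term_frequencies(sample, ["python", "java"],
                           datetime(2012, 1, 1), datetime(2012, 1, 31)))
```

This counts every occurrence of a word; if you want the number of tweets that mention a word at least once, count each token once per tweet instead.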
If you want to get term frequency/inverse document frequency, then I believe that Elasticsearch will allow you to access those statistics of the index directly (not sure about RavenDB). Alternatively, you could build a document store yourself with Lucene.NET if you want to get really bare-bones; it's much more work to implement, but you are much closer to the stats you want to get.
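For reference, this is what the textbook tf-idf calculation looks like when done by hand in Python. It is the plain formula (term frequency times the log of total documents over documents containing the term), not the scoring that Lucene or Elasticsearch apply internally, which typically uses smoothed variants; the function names are my own.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tf_idf(term, doc, corpus):
    """Textbook tf-idf of `term` in `doc`, relative to the list `corpus`."""
    term = term.lower()
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    # Term frequency: share of the document's tokens that are the term.
    tf = tokens.count(term) / len(tokens)
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for other in corpus if term in tokenize(other))
    if df == 0:
        return 0.0
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(len(corpus) / df)
    return tf * idf
```

A term that appears in every document scores zero, which is the point: it tells you nothing about any one tweet.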