INTRO
To get started, you’ll need access to PHP 5.x on a web server. I’m currently working on an Apache server with the default installation of PHP 5.3. This should be available on most hosting services, especially those setups featuring open source software (as opposed to Microsoft’s .NET framework). In addition, I’m using a Postgres database on the back end to store the information I’m scraping and extracting (you can just as easily use MySQL). If you want to run this code on your local machine, download WAMP, MAMP, XAMPP, or another flavor of server/language/database package.
TWITTER API, OAuth, & PHP twitteroauth Library
First, familiarize yourself with the Twitter Developer Website. If you want to skip right to the API, check out the REST API v1.1 documentation. To test a search, go to the Twitter Search page and type in a search term; try typing #BigData in the query field to search for the BigData hashtag. You’ll be presented with a GUI version of the results. If you want to do the same thing programmatically and get the data back in JSON format, you’ll need to use the REST API search query, and you must be authenticated to do this. To create credentials for the search query, you must create an OAuth profile; visit https://dev.twitter.com/docs/auth/tokens-devtwittercom to retrieve your ACCESS TOKEN and ACCESS SECRET. Luckily, we can use the PHP twitteroauth library to connect to Twitter’s API and start writing code (here’s an example of the code you’ll need: https://dev.twitter.com/docs/auth/oauth/single-user-with-examples#php). At this point you’ll need to set up your OAuth profile with Twitter, download the PHP twitteroauth library, add your TOKEN and SECRET where the library expects them, and make sure all the files are in the appropriate place on your web server.
PERFORMING A SEARCH & RETRIEVING JSON DATA
I’m assuming you have set up the OAuth profile on Twitter and that you’ve downloaded the PHP OAuth library (the code below loads the tmhOAuth files). I like to create an “app_tokens.php” file that assigns my CONSUMER_KEY, CONSUMER_SECRET, USER_TOKEN, and USER_SECRET values to variables; this way I can include it anywhere I need it.
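As a rough sketch, an “app_tokens.php” file along these lines would do. The variable names match the ones the setup code expects; the values are placeholders I made up, not real credentials, and you would paste in the strings from your own Twitter OAuth profile.

```php
<?php
// app_tokens.php: a minimal sketch. All four values below are placeholders
// (assumptions for illustration); replace them with your own credentials.
$consumer_key    = 'YOUR_CONSUMER_KEY';
$consumer_secret = 'YOUR_CONSUMER_SECRET';
$user_token      = 'YOUR_ACCESS_TOKEN';
$user_secret     = 'YOUR_ACCESS_SECRET';
```

Keeping this file out of version control (and, if your host allows it, outside the public web root) makes it harder for the secrets to leak.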
<?php
/***********************************
 SET THE PHP TIMEZONE TO MATCH TWITTER'S TIMEZONE
***********************************/
date_default_timezone_set('UTC');

/***********************************
 INCLUDE THE OAuth LIBRARY FILES (tmhOAuth)
***********************************/
require './tmhOAuth/tmhOAuth.php';
require './tmhOAuth/tmhUtilities.php';

/***********************************
 INCLUDE MY TWITTER SECRETS
***********************************/
require './app_tokens.php';

/***********************************
 USE tmhOAuth TO SET UP A TWITTER REQUEST OBJECT
***********************************/
$tmhOAuth = new tmhOAuth(array(
    'consumer_key'    => $consumer_key,
    'consumer_secret' => $consumer_secret,
    'user_token'      => $user_token,
    'user_secret'     => $user_secret
));
?>
Now that we have our authorization credentials, we are ready to use tmhOAuth as the middleman to send a request to Twitter’s API. Let’s say we want to perform the same search we did above, but this time we don’t want a GUI version of the data; instead we want JSON data back so that we can easily add it to a database. We need to find out which endpoint the Twitter API expects and which values to pass it; for our example, the Twitter API search query is simply https://api.twitter.com/1.1/search/tweets.json and we can pass it several different parameters. We’ll start with the most basic and use the q query parameter. We want to pass it the value “#BigData”, but we need to convert the pound sign (#) to its URL-encoded form, %23. Our code then looks like this:
<?php
$tmhOAuth->request(
    'GET',
    'https://api.twitter.com/1.1/search/tweets.json',
    array(
        'q'                => '%23BigData',
        'include_entities' => false,
        'count'            => '5',
        'result_type'      => 'mixed'
    )
);
?>
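The %23 in the q value is simply the percent-encoded form of #. As a small aside, here is a sketch of letting PHP do that conversion with its built-in rawurlencode() and http_build_query() functions instead of typing it by hand. The $params array is a made-up example, and I’m not assuming anything about how a particular OAuth library treats its parameters; if yours encodes values itself, check its documentation so the value isn’t encoded twice.

```php
<?php
// Percent-encode a single value: '#' becomes '%23'.
$encoded = rawurlencode('#BigData');
echo $encoded, "\n"; // prints %23BigData

// Build a whole query string from an array of parameters (hypothetical values).
$params = array(
    'q'           => '#BigData',
    'count'       => 5,
    'result_type' => 'mixed',
);
$query = http_build_query($params);
echo $query, "\n"; // prints q=%23BigData&count=5&result_type=mixed
```

Either way, the search in the request above is for the hashtag #BigData.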
This request uses the REST API v1.1 and returns JSON data. We are passing the search a parameter of 'q' => '%23BigData', which translates to searching for the hashtag “#BigData” (without the quotes). We are also passing the 'include_entities', 'count', and 'result_type' parameters (for more info on the other parameters, see the documentation). Lastly, we need to get the response back from Twitter and output it; if we have an error, we need to output that too. Based on the library’s examples, I know I need the following code:
<?php
// HTTP response code
$response_code = $tmhOAuth->response['code'];

// JSON conversion
$response_data = json_decode($tmhOAuth->response['response'], true);

if ($response_code != 200) {
    print "Error: $response_code";
}

echo "<pre>";
print_r($response_data);
echo "</pre>";
?>
The above code reads two pieces of data from the Twitter API response: the response code and the response data. The response code tells us whether the request failed. The response data holds the JSON data that we received from the query. The first result of my JSON data (yours won’t contain the same information, but it will have a similar structure) looks like this:
[0] => Array
(
    [metadata] => Array
    (
        [result_type] => popular
        [iso_language_code] => en
    )
    [created_at] => Sun Jun 23 18:10:41 +0000 2013
    [id] => 348865709624922112
    [id_str] => 348865709624922112
    [text] => In Kazakhstan #bigdata had long been in dictionary, define as "Any data too big for copy to DVDs and fit into 1 Lada."
    [source] => web
    [truncated] =>
    [in_reply_to_status_id] =>
    [in_reply_to_status_id_str] =>
    [in_reply_to_user_id] =>
    [in_reply_to_user_id_str] =>
    [in_reply_to_screen_name] =>
    [user] => Array
    (
        [id] => 539296619
        [id_str] => 539296619
        [name] => Big Data Borat
        [screen_name] => BigDataBorat
        [location] => Алматы
        [description] => Learnings of Big Data for Make Nation of Kazakhstan #1 Leading Data Scientist Nation
        [url] =>
        [entities] => Array
        (
            [description] => Array
            (
                [urls] => Array
                (
                )
            )
        )
        [protected] =>
        [followers_count] => 9384
        [friends_count] => 42
        [listed_count] => 294
        [created_at] => Wed Mar 28 19:01:05 +0000 2012
        [favourites_count] => 0
        [utc_offset] =>
        [time_zone] =>
        [geo_enabled] =>
        [verified] =>
        [statuses_count] => 442
        [lang] => en
        [contributors_enabled] =>
        [is_translator] =>
        [profile_background_color] => C0DEED
        [profile_background_image_url] => http://a0.twimg.com/images/themes/theme1/bg.png
        [profile_background_image_url_https] => https://si0.twimg.com/images/themes/theme1/bg.png
        [profile_background_tile] =>
        [profile_image_url] => http://a0.twimg.com/profile_images/1979623485/borat_normal.jpg
        [profile_image_url_https] => https://si0.twimg.com/profile_images/1979623485/borat_normal.jpg
        [profile_link_color] => 0084B4
        [profile_sidebar_border_color] => C0DEED
        [profile_sidebar_fill_color] => DDEEF6
        [profile_text_color] => 333333
        [profile_use_background_image] => 1
        [default_profile] => 1
        [default_profile_image] =>
        [following] =>
        [follow_request_sent] =>
        [notifications] =>
    )
    [geo] =>
    [coordinates] =>
    [place] =>
    [contributors] =>
    [retweet_count] => 106
    [favorite_count] => 20
    [favorited] =>
    [retweeted] =>
    [lang] => en
)
If you look at the JSON data above, you’ll see a key titled “text” and the value assigned to it; this is the content of the tweet, and you can clearly see that it contains the hashtag #bigdata. So we now know the code works and we can programmatically query Twitter. When you examine the Twitter API you will find that we can make 450 requests every 15 minutes; this will of course not get us ALL the tweets using the hashtag “#bigdata”, but at 30 results per request (that is, with count set to 30 rather than the 5 used above) it gives us a useful sample of 450 × 30 = 13,500 tweets every 15 minutes.
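To make that concrete, here is a minimal sketch of pulling the “text” value out of each result, followed by the request-budget arithmetic. It runs on a hard-coded sample rather than a live response: the JSON string is a trimmed-down stand-in I wrote for illustration, it assumes the results sit in a top-level “statuses” list (which is how I read the v1.1 search documentation), and the function name tweet_texts() is hypothetical.

```php
<?php
// Sample response body: a trimmed-down, made-up stand-in for what the
// search endpoint returns (assumption: results sit under "statuses").
$sample_json = '{"statuses":[
    {"id_str":"348865709624922112","text":"In Kazakhstan #bigdata had long been in dictionary","retweet_count":106},
    {"id_str":"1","text":"Another #BigData tweet","retweet_count":0}
]}';

// Hypothetical helper: return the "text" of every tweet in a decoded response.
function tweet_texts($response_data)
{
    $texts = array();
    if (!is_array($response_data) || !isset($response_data['statuses'])) {
        return $texts; // nothing usable, e.g. an error payload
    }
    foreach ($response_data['statuses'] as $tweet) {
        if (isset($tweet['text'])) {
            $texts[] = $tweet['text'];
        }
    }
    return $texts;
}

$response_data = json_decode($sample_json, true);
$texts = tweet_texts($response_data);
foreach ($texts as $text) {
    echo $text, "\n";
}

// Request budget, using the figures from the text above.
$requests_per_window = 450; // requests allowed per 15-minute window
$results_per_request = 30;  // the 'count' value sent with each request
$tweets_per_window   = $requests_per_window * $results_per_request;
$seconds_per_request = (15 * 60) / $requests_per_window;

echo $tweets_per_window, " tweets per window\n";
echo $seconds_per_request, " seconds between requests\n";
```

In a real script, the loop over $texts is where you would insert each tweet into your database, and the seconds-per-request figure gives a rough pause to put between calls so you stay inside the limit.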
Cheers.