How to Classify Tweets with Google’s Prediction API

Posted by Brian Porter on January 27, 2011

OK, I know this should be easy, but I just wanted to try it out as a first step. My goal was to classify some tweets with Google’s Prediction API. Luckily, Google provides sample code in Java with a training file for language detection (English, French, and Spanish). I have been seeing a lot of tweets from @loic today from #Davos, mostly in English with some in French, so I have some good test data.

For those of you wondering how this works: it is basically the same way your mail client identifies spam (think Thunderbird or Apple Mail). You mark the first few hundred spam messages yourself. After that, the client flags the messages it classifies as spam based on the training you gave it. If you keep updating the training by marking spam that it misses, or unmarking messages falsely identified as spam, the classification improves over time.
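To make the "train first, then classify" idea concrete, here is a minimal sketch in Java. It is a toy word-count classifier, not the algorithm the Prediction API or any mail client actually uses; the class name `ToyLanguageClassifier` and its methods are made up for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: this is NOT how the Prediction API works internally.
class ToyLanguageClassifier {
    // label -> (word -> how often the word appeared in training text for that label)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();

    // Record every word of one labeled training example.
    void train(String label, String text) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Return the label whose training words overlap most with the text.
    // Returns null if nothing has been trained; ties are broken arbitrarily.
    String classify(String text) {
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> entry : wordCounts.entrySet()) {
            int score = 0;
            for (String word : tokenize(text)) {
                score += entry.getValue().getOrDefault(word, 0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Lower-case the text and split it on anything that is not a letter.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
    }

    public static void main(String[] args) {
        ToyLanguageClassifier classifier = new ToyLanguageClassifier();
        classifier.train("English", "the cat is on the table and the dog is in the garden");
        classifier.train("French", "le chat est sur la table et le chien est dans le jardin");
        System.out.println(classifier.classify("le chien est sur le jardin")); // prints "French"
    }
}
```

Adding more labeled examples with `train` is the equivalent of marking more messages as spam: the counts shift, and later calls to `classify` reflect the extra training.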

But you want to try it yourselves, right?

First, sign up for Google Storage and the Google Prediction API if you haven’t already.

Go to the Java Sample page from Google: http://samples.google-api-java-client.googlecode.com/hg/prediction-json-clientlogin-sample/instructions.html?r=default and first download the training data.  You will then need to create a “Bucket” in Google Storage, and upload the training file.  Remember what the Bucket’s name is.
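If you later want to build your own training file, the sketch below formats labeled examples as rows. It assumes the training file is plain CSV with one example per line, the label in the first column and the text in the second; open the downloaded file to confirm before relying on this. The class `TrainingRows` and the quoting rules (standard CSV doubling of embedded quotes) are assumptions for illustration, not something taken from Google's sample.

```java
// Hypothetical helper: assumes a "label","text" CSV layout for the training file.
class TrainingRows {
    // Format one labeled example as a single CSV row.
    static String toRow(String label, String text) {
        return quote(label) + "," + quote(text);
    }

    // Wrap a value in double quotes, flatten line breaks to spaces,
    // and double any embedded quotes (the usual CSV convention).
    private static String quote(String value) {
        String oneLine = value.replaceAll("[\\r\\n]+", " ");
        return "\"" + oneLine.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toRow("English", "Hello world"));
    }
}
```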

Then go back and download the sample (use Mercurial to check it out).

I first updated the “src/com/google/api/client/sample/prediction/ClientLoginCredentials.java” file, as noted in the instructions, with my Google username and password, and then the training file. I was able to compile and run it right away; after a few minutes the training was finished, and the sample texts were properly classified!

Next I wanted to integrate Twitter, so I added the Twitter4J library by following its instructions for Maven integration:

<repositories>
   <repository>
      <id>twitter4j.org</id>
      <name>twitter4j.org Repository</name>
      <url>http://twitter4j.org/maven2</url>
      <releases>
         <enabled>true</enabled>
      </releases>
      <snapshots>
         <enabled>true</enabled>
      </snapshots>
   </repository>
</repositories>
<dependencies>
   <dependency>
      <groupId>org.twitter4j</groupId>
      <artifactId>twitter4j-core</artifactId>
      <version>[2.1,)</version>
   </dependency>
</dependencies>

I then added a new method to the PredictionSample.java file for fetching a list of tweets and classifying each one:

// Requires imports: java.io.IOException, java.util.List, and from twitter4j:
// Query, QueryResult, Tweet, Twitter, TwitterException, TwitterFactory.
private static void twitterCheck(HttpTransport transport, String twitterUser)
        throws TwitterException, IOException {
    int french = 0;
    int english = 0;
    int spanish = 0;
    // The factory instance is reusable and thread safe.
    Twitter twitter = new TwitterFactory().getInstance();
    // Search for tweets that mention the given user.
    Query query = new Query("@" + twitterUser);
    // Results per page; the Search API caps this at 100.
    query.setRpp(100);
    QueryResult result = twitter.search(query);
    List<Tweet> tweets = result.getTweets();
    System.out.println("hits for " + query.getQuery() + ": " + tweets.size());
    for (Tweet tweet : tweets) {
        // Print the sender and the language Twitter assigned to the tweet.
        System.out.print(tweet.getFromUser() + " (" + tweet.getIsoLanguageCode() + ") : ");
        // predict() prints its result and returns the predicted language.
        String language = predict(transport, tweet.getText());
        if ("English".equals(language)) {
            english++;
        } else if ("French".equals(language)) {
            french++;
        } else if ("Spanish".equals(language)) {
            spanish++;
        }
    }
    System.out.println("Total: \t" + tweets.size());
    System.out.println("French:\t" + french);
    System.out.println("English:\t" + english);
    System.out.println("Spanish:\t" + spanish);
}
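The three hard-coded counters work for this training file, but they would need editing for any other set of labels. As an alternative sketch (not part of the original sample; `LanguageTally` is a made-up name), the predicted labels can be tallied in a map so that new labels are counted automatically:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts predicted labels without knowing them in advance.
class LanguageTally {
    // Count how often each label occurs; TreeMap keeps labels sorted for printing.
    static Map<String, Integer> tally(List<String> predictedLanguages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String language : predictedLanguages) {
            if (language == null) {
                continue; // skip tweets for which no prediction came back
            }
            counts.merge(language, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> predictions = Arrays.asList("English", "French", "English", "Spanish");
        for (Map.Entry<String, Integer> entry : tally(predictions).entrySet()) {
            System.out.println(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}
```

In the method above, this would mean collecting each returned language into a list inside the loop and calling `tally` once at the end.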

You will note that I also modified the predict method to return the language and to print the text with its language classification on one line. I also printed the language that Twitter thinks the tweet is in.

The results were, well, OK: I’d guess above 80% correct. Here are some of the results:

gertbaudoncq (en) : 	Predicted language: English	Text: @loic nowadays even Belgium starts to look more and more like antimatter http://bit.ly/gTtmy6
joshspear (en) : 	Predicted language: English	Text: @loic ask my question: if anything could wrong colliding this stuff!
Vinrob (fi) : 	Predicted language: French	Text: @loic Sympa ça Loïc dis moi 😉
olsenks (en) : 	Predicted language: English	Text: RT @Bill_Gross: @loic When the Sarkozy translator says, "everything was going fine and DANDY," what word did Sarkozy really use? #WEF
jeremister (en) : 	Predicted language: English	Text: ouch “@loic: Composition of universe. This explains everything. (we are the .4% of the universe!) #Davos #wef http://t.co/XnsC1qJ”
joshspear (en) : 	Predicted language: English	Text: @loic and you think tech pioneers and .com people are nerdy? Not even close compared to these dudes!
HollisCGuerra (en) : 	Predicted language: English	Text: Haha! RT @alexcalic If @scobleizer and @loic don't attend conference, does it even exist?
Peaceful200906 (en) : 	Predicted language: English	Text: RT @Bill_Gross: @loic When the Sarkozy translator says, "everything was going fine and DANDY," what word did Sarkozy really use? #WEF
Total: 	100
French:	20
English:	69
Spanish:	11
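One rough way to put a number on "OK" is to check how often the predicted language agrees with the ISO code Twitter reports. The sketch below does that under stated assumptions: the class `AgreementCheck` and the code-to-label mapping are made up for illustration, and agreement is only a proxy for accuracy, since Twitter's own guess can be wrong (the Vinrob tweet above is tagged "fi" but reads as French).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: compares Twitter's ISO language code with the predicted label.
class AgreementCheck {
    // ISO 639-1 codes for the three languages in the training file.
    private static final Map<String, String> ISO_TO_LABEL = new HashMap<>();
    static {
        ISO_TO_LABEL.put("en", "English");
        ISO_TO_LABEL.put("fr", "French");
        ISO_TO_LABEL.put("es", "Spanish");
    }

    // True when Twitter's ISO code maps to the same label the classifier predicted.
    static boolean agrees(String isoCode, String predicted) {
        return predicted != null && predicted.equals(ISO_TO_LABEL.get(isoCode));
    }

    // Fraction of {isoCode, predictedLabel} pairs that agree; 0.0 for empty input.
    static double agreementRate(String[][] pairs) {
        if (pairs.length == 0) {
            return 0.0;
        }
        int matches = 0;
        for (String[] pair : pairs) {
            if (agrees(pair[0], pair[1])) {
                matches++;
            }
        }
        return (double) matches / pairs.length;
    }

    public static void main(String[] args) {
        // The eight sample lines listed above: seven en/English, one fi/French.
        String[][] sample = {
            {"en", "English"}, {"en", "English"}, {"fi", "French"}, {"en", "English"},
            {"en", "English"}, {"en", "English"}, {"en", "English"}, {"en", "English"}
        };
        System.out.println(agreementRate(sample)); // prints 0.875
    }
}
```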

The real question for me is… will this work for sentiment? I have been meaning to try that for years; maybe now that it is this easy, I will. I just need to gather the training data…

Source: PredictionSample.java