OK, I know this should be easy, but I wanted to try it out as a first step. My goal was to classify some tweets with Google’s Prediction API. Luckily, Google provides sample code in Java with a training file for language detection (English, French, and Spanish). I have been seeing a lot of tweets from @loic today from #Davos, mostly in English with some in French, so I have some good test data.

For those of you wondering how this works: it is basically the same way your mail client identifies spam (think Thunderbird or Apple Mail). You mark the first few hundred spam messages yourself. After that, the client flags the messages it classifies as spam based on the training you gave it. If you keep updating the training by marking spam that it misses, or unmarking messages it falsely identified as spam, the classification improves over time.
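
To make the train-then-classify idea concrete, here is a minimal sketch in Java. It is not how the Prediction API works internally (Google does not say which model it uses); it is a toy classifier that counts character trigrams per label and picks the label with the best smoothed score, ignoring how common each label is. The class name ToyLanguageClassifier and the sample sentences are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy classifier; an illustration only, not the Prediction API.
final class ToyLanguageClassifier {
    // label -> (character trigram -> how often it was seen in training)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // label -> total number of trigrams seen in training
    private final Map<String, Integer> totals = new HashMap<>();
    // every distinct trigram seen for any label
    private final Set<String> vocabulary = new HashSet<>();

    /** Adds one labeled example to the training data. */
    void train(String label, String text) {
        Map<String, Integer> labelCounts = counts.computeIfAbsent(label, k -> new HashMap<>());
        for (String gram : trigrams(text)) {
            labelCounts.merge(gram, 1, Integer::sum);
            totals.merge(label, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    /** Returns the best-scoring trained label, or null if nothing was trained. */
    String classify(String text) {
        List<String> grams = trigrams(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            // Add-one smoothing: unseen trigrams get a small non-zero probability.
            double denominator = totals.getOrDefault(entry.getKey(), 0) + vocabulary.size() + 1;
            double score = 0.0;
            for (String gram : grams) {
                int seen = entry.getValue().getOrDefault(gram, 0);
                score += Math.log((seen + 1) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Splits lower-cased, space-padded text into overlapping 3-character pieces. */
    private static List<String> trigrams(String text) {
        String padded = " " + text.toLowerCase(Locale.ROOT) + " ";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        ToyLanguageClassifier classifier = new ToyLanguageClassifier();
        classifier.train("English", "the quick brown fox jumps over the lazy dog and this is what it is");
        classifier.train("French", "le renard brun saute par-dessus le chien paresseux et c'est la vie");
        classifier.train("Spanish", "el zorro salta sobre el perro perezoso y esta es la vida");
        System.out.println(classifier.classify("this is the dog"));
    }
}
```

Calling train again with corrected examples is the equivalent of marking or unmarking spam: the counts shift, and later classifications shift with them.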

But you want to try it yourself, right?

First, sign up for Google Storage and the Google Prediction API if you haven’t already.

Go to the Java sample page from Google: http://samples.google-api-java-client.googlecode.com/hg/prediction-json-clientlogin-sample/instructions.html?r=default and first download the training data. You will then need to create a “Bucket” in Google Storage and upload the training file. Remember what the Bucket’s name is.

Then go back and download the sample (use Mercurial to check it out).

I first updated the “src/com/google/api/client/sample/prediction/ClientLoginCredentials.java” file, as noted, with my Google username and password, and then the training file location. I was able to compile and run it right away, and after a few minutes the training was finished and the sample texts were properly classified!

Next I wanted to integrate Twitter, so I added the Twitter4J library by following its instructions for Maven integration:

   <repositories>
      <repository>
         <id>twitter4j.org</id>
         <name>twitter4j.org Repository</name>
         <url>http://twitter4j.org/maven2</url>
         <releases>
            <enabled>true</enabled>
         </releases>
         <snapshots>
            <enabled>true</enabled>
         </snapshots>
      </repository>
   </repositories>
   <dependencies>
      <dependency>
         <groupId>org.twitter4j</groupId>
         <artifactId>twitter4j-core</artifactId>
         <version>[2.1,)</version>
      </dependency>
   </dependencies>

I then added a new method to the PredictionSample.java file for getting a list of tweets:


  // Needs java.io.IOException, java.util.List, and from twitter4j:
  // Twitter, TwitterFactory, Query, QueryResult, Tweet, TwitterException.
  private static void twitterCheck(HttpTransport transport, String twitterUser) throws TwitterException, IOException {
      int french = 0;
      int english = 0;
      int spanish = 0;
      // The factory instance is re-usable and thread safe.
      Twitter twitter = new TwitterFactory().getInstance();
      // Search for tweets that mention the given user.
      Query query = new Query("@" + twitterUser);
      // The Search API returns at most 100 results per page, so a larger value (I first tried 250) has no effect.
      query.setRpp(100);
      QueryResult result = twitter.search(query);
      List<Tweet> tweets = result.getTweets();
      System.out.println("hits for " + query.getQuery() + ":" + tweets.size());
      for (Tweet tweet : tweets) {
          // Print the sender and the language code Twitter assigned; predict() prints the rest of the line.
          System.out.print(tweet.getFromUser() + " (" + tweet.getIsoLanguageCode() + ") : ");
          String language = predict(transport, tweet.getText());
          if ("English".equals(language)) {
              english++;
          } else if ("French".equals(language)) {
              french++;
          } else if ("Spanish".equals(language)) {
              spanish++;
          }
      }
      System.out.println("Total: \t" + tweets.size());
      System.out.println("French:\t" + french);
      System.out.println("English:\t" + english);
      System.out.println("Spanish:\t" + spanish);
  }

You will note that I also modified the predict method to return the language and to print the text with its language classification on one line. I also output the language that Twitter thinks the tweet is in.
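
As a rough illustration of those two pieces, here is a small self-contained sketch. It assumes the one-line format seen in the results further down (user, Twitter's language code, predicted label, text), and it counts labels in a map rather than in one counter per language, so a fourth language would need no new variable. PredictionReport, formatLine, and tally are hypothetical names of mine, not part of the Google sample or Twitter4J.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of the Google sample or Twitter4J.
final class PredictionReport {
    /** Builds one output line: user, Twitter's ISO code, predicted label, tweet text. */
    static String formatLine(String user, String isoCode, String predicted, String text) {
        return user + " (" + isoCode + ") : \tPredicted language: " + predicted + "\tText: " + text;
    }

    /** Counts how often each predicted label occurs, keeping first-seen order. */
    static Map<String, Integer> tally(List<String> predictedLabels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : predictedLabels) {
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(formatLine("joshspear", "en", "English", "@loic ask my question"));
        Map<String, Integer> counts = tally(Arrays.asList("English", "French", "English"));
        counts.forEach((label, n) -> System.out.println(label + ":\t" + n));
    }
}
```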

The results were, well, OK: I would guess above 80% correct. Here are some of the results:

gertbaudoncq (en) : 	Predicted language: English	Text: @loic nowadays even Belgium starts to look more and more like antimatter http://bit.ly/gTtmy6
joshspear (en) : 	Predicted language: English	Text: @loic ask my question: if anything could wrong colliding this stuff!
Vinrob (fi) : 	Predicted language: French	Text: @loic Sympa ça Loïc dis moi ;)
olsenks (en) : 	Predicted language: English	Text: RT @Bill_Gross: @loic When the Sarkozy translator says, "everything was going fine and DANDY," what word did Sarkozy really use? #WEF
jeremister (en) : 	Predicted language: English	Text: ouch “@loic: Composition of universe. This explains everything. (we are the .4% of the universe!) #Davos #wef http://t.co/XnsC1qJ”
joshspear (en) : 	Predicted language: English	Text: @loic and you think tech pioneers and .com people are nerdy? Not even close compared to these dudes!
HollisCGuerra (en) : 	Predicted language: English	Text: Haha! RT @alexcalic If @scobleizer and @loic don't attend conference, does it even exist?
Peaceful200906 (en) : 	Predicted language: English	Text: RT @Bill_Gross: @loic When the Sarkozy translator says, "everything was going fine and DANDY," what word did Sarkozy really use? #WEF
Total: 	100
French:	20
English:	69
Spanish:	11
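
Instead of guessing at the hit rate, one could compare each prediction with the language code Twitter reports. The sketch below does that under a few assumptions of mine: the codes "en", "fr", and "es" map to the three trained labels, tweets with any other code (such as the "fi" above) are skipped, and the result measures agreement with Twitter, which is not the same thing as accuracy since Twitter's code can be wrong too. LanguageAgreement is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; measures agreement with Twitter's code, not true accuracy.
final class LanguageAgreement {
    // Assumed mapping from Twitter's ISO language codes to the trained labels.
    private static final Map<String, String> ISO_TO_LABEL = new HashMap<>();
    static {
        ISO_TO_LABEL.put("en", "English");
        ISO_TO_LABEL.put("fr", "French");
        ISO_TO_LABEL.put("es", "Spanish");
    }

    /**
     * Each pair is {isoCode, predictedLabel}. Returns the share of comparable
     * pairs that agree, or NaN if no pair has a code with a trained label.
     */
    static double agreement(List<String[]> pairs) {
        int comparable = 0;
        int matches = 0;
        for (String[] pair : pairs) {
            String expected = ISO_TO_LABEL.get(pair[0]);
            if (expected == null) {
                continue; // e.g. "fi": no trained label to compare against
            }
            comparable++;
            if (expected.equals(pair[1])) {
                matches++;
            }
        }
        return comparable == 0 ? Double.NaN : (double) matches / comparable;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"en", "English"},
                new String[] {"fi", "French"},
                new String[] {"en", "Spanish"});
        System.out.println("Agreement: " + agreement(pairs));
    }
}
```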

The real question for me is: will this work for sentiment? I have been meaning to try that for years, and maybe now that it is this easy I will. I just need to gather the training data…

Source: PredictionSample.java