Monday, June 15, 2009

GSoC: Hackystat, June 8-15

Now, with 100% more ritualistic mapping!

I spent a lot of this week on Vision. I wasted a lot of time wondering how in the world I was going to make it easy for users to run learning algorithms, and whether I should make it server-based or have them download their own personal little learner... and then I realized that they won't be interacting with the learning algorithms at all.

Here's what (I think) will happen.

The user will download the Twitter and Ohloh sensors, much the way they download the Eclipse and other sensors. Those sensors will do an initial dump of sensor data to the sensorbase. This initial dump will include followers, following, contributors, etc. After that, the sensors will basically be sleeper threads that wait patiently until something at Twitter or Ohloh changes, at which point they'll transmogrify the change into XML (just like they did for the initial dump) and send it on to the sensorbase.
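To make the "wait for a change, then turn it into XML" step concrete, here is a minimal sketch. It assumes the sensor keeps the last follower list it saw and compares it with a fresh one. The class name, the event strings, and the XML element names are all made up for illustration (they are not the real sensorbase schema), and the actual Twitter and sensorbase calls are left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from Hackystat or Twitter.
final class FollowerChangeSketch {

  /** Compares two follower snapshots and lists what changed, in sorted order. */
  static List<String> diff(Set<String> before, Set<String> after) {
    List<String> events = new ArrayList<String>();
    for (String name : new TreeSet<String>(after)) {
      if (!before.contains(name)) {
        events.add("FollowerAdded:" + name);
      }
    }
    for (String name : new TreeSet<String>(before)) {
      if (!after.contains(name)) {
        events.add("FollowerRemoved:" + name);
      }
    }
    return events;
  }

  /** Escapes the characters that are unsafe inside XML text. */
  static String escape(String text) {
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  /** Wraps one change event in an illustrative (assumed) XML element. */
  static String toXml(String owner, String event) {
    return "<SensorData>"
        + "<Tool>Twitter</Tool>"
        + "<Owner>" + escape(owner) + "</Owner>"
        + "<Resource>" + escape(event) + "</Resource>"
        + "</SensorData>";
  }

  public static void main(String[] args) {
    Set<String> before = new TreeSet<String>(Arrays.asList("alice", "bob"));
    Set<String> after = new TreeSet<String>(Arrays.asList("bob", "carol"));
    // A real sensor would run this from a loop that sleeps between polls.
    for (String event : diff(before, after)) {
      System.out.println(toXml("user@example.com", event));
    }
  }
}
```

The initial dump is the same thing with an empty "before" set, so one code path could probably cover both cases.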

That's pretty straightforward. The Facebook sensor is a little different because it doesn't get to live on the user's computer. It has to be hosted. Right now I've set up Joyent hosting for it, as it's free. However, it's only free for a year, and you have to have over 50 users or it either stops being free or they delete your program wholesale. (I can't remember which. Either is kind of lame.) Other than the fact that it has to be hosted somewhere, though, it's basically the same deal.

All right, so after I have all that pretty data going to the sensorbase, I will somehow download all of it. (I need to email the dev list about that.)

After downloading all of it, I will run as many different algorithms as I can think of to generate models. This will probably be done on Zack's supercomputer, as I shudder to think how long it would take on my laptop.

Once I have those models, I will build an interface for them: the user enters their Hackystat user name, the interface builds a graph from the nodes and edges connected to that user, and then it makes predictions based on whatever the model says about the graph.
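As a rough idea of what "build a graph from the nodes and edges connected to that user" could look like, here is a small sketch. It assumes edges are undirected (Twitter follows are really directed, so this is a simplification) and that "connected to" means "within a few hops". The class and method names are hypothetical, and the prediction step is not shown.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of pulling one user's neighborhood out of a larger graph.
final class SocialGraphSketch {
  private final Map<String, Set<String>> neighbors = new HashMap<String, Set<String>>();

  /** Adds an undirected edge between two nodes. */
  void addEdge(String a, String b) {
    link(a, b);
    link(b, a);
  }

  private void link(String from, String to) {
    Set<String> set = neighbors.get(from);
    if (set == null) {
      set = new TreeSet<String>();
      neighbors.put(from, set);
    }
    set.add(to);
  }

  /** Returns every node within maxHops of start (including start), in breadth-first order. */
  Set<String> neighborhood(String start, int maxHops) {
    Set<String> seen = new LinkedHashSet<String>();
    Map<String, Integer> depth = new HashMap<String, Integer>();
    Deque<String> queue = new ArrayDeque<String>();
    seen.add(start);
    depth.put(start, 0);
    queue.add(start);
    while (!queue.isEmpty()) {
      String node = queue.remove();
      int hops = depth.get(node);
      Set<String> next = neighbors.get(node);
      if (hops == maxHops || next == null) {
        continue; // do not expand past the hop limit or from unknown nodes
      }
      for (String other : next) {
        if (seen.add(other)) {
          depth.put(other, hops + 1);
          queue.add(other);
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    SocialGraphSketch graph = new SocialGraphSketch();
    graph.addEdge("q", "r");
    graph.addEdge("r", "s");
    graph.addEdge("s", "t");
    System.out.println(graph.neighborhood("q", 2));
  }
}
```

The resulting node set is what would get handed to whichever model is doing the predicting.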

That was a lot of words, so I have also ritualistically mapped the components for the more visual learners in our audience.

This is by no means set in stone, folks. I've kept the information gathered somewhat minimal--it can be expanded pretty easily. The other thing that can be changed is the location of the Social Network Query Program. To me, it seems somewhat like the analysis services that Hackystat already has on the servers, so it seems like it fits there. However, it could also easily be made as something to download and run locally. I open this for debate.

Hackystat as Administrator!

I started getting the Hackystat services running locally this week. (I was hoping to be sending data to them, too, but I'll get to that in a minute.) I had a LOT of trouble with the DailyProjectData service...for no perceivable reason.

Here's what happened:

I downloaded it from the DPD project site, and had it all set up to run. My sensorbase was humming away, my properties files were all in order. So I ran it and BAM! Null Pointer Exception.

Null pointer exception? Really? Specifically:

Exception in thread "main" java.lang.NullPointerException
at org.hackystat.dailyprojectdata.server.Server.disableRestletLogging(
at org.hackystat.dailyprojectdata.server.Server.newInstance(

I dug into the source code, and the line that appears to be throwing it is actually library code.

It's the line highlighted in red in this method:

/**
 * Disable all loggers from com.noelios and org.restlet.
 */
private static void disableRestletLogging() {
  LogManager logManager = LogManager.getLogManager();
  for (Enumeration e = logManager.getLoggerNames(); e.hasMoreElements();) {
    String logName = e.nextElement().toString();
    if (logName.startsWith("com.noelios") ||
        logName.startsWith("org.restlet")) {
      // ...
    }
  }
}

Granted, I'm not exactly sure what good identifying the source of the problem was going to do me, since I am not supposed to compile it from the sources (for which I am somewhat grateful).
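For what it's worth, here is one guess at how a loop like that can blow up. LogManager.getLogger(name) returns null when it can't find a logger with that name, and a logger can go away between listing the names and looking one up. The part of the method that actually uses the logger isn't shown above, so whether this is what really happened in DPD is purely an assumption. This is a null-safe sketch of the same idea (the class and method names are mine, not Hackystat's):

```java
import java.util.Collections;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

// Hypothetical sketch, not the Hackystat source.
final class RestletLoggingSketch {

  /** Turns off every registered logger whose name starts with one of the prefixes. */
  static int disableLoggers(String... prefixes) {
    LogManager logManager = LogManager.getLogManager();
    int disabled = 0;
    for (String logName : Collections.list(logManager.getLoggerNames())) {
      for (String prefix : prefixes) {
        if (logName.startsWith(prefix)) {
          // getLogger may return null if the logger is no longer registered.
          Logger logger = logManager.getLogger(logName);
          if (logger != null) {
            logger.setLevel(Level.OFF);
            disabled++;
          }
          break; // one matching prefix is enough for this logger
        }
      }
    }
    return disabled;
  }

  public static void main(String[] args) {
    // Hold a reference so the demo logger stays registered.
    Logger demo = Logger.getLogger("org.restlet.demo");
    int count = disableLoggers("com.noelios", "org.restlet");
    System.out.println("Disabled " + count + " logger(s); demo level is " + demo.getLevel());
  }
}
```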

Anyway, I asked Aaron about it. I checked to make sure I was using the right versions of the necessary libraries. (I seemed to be.) Finally I noticed somewhere a suggestion to always use the DPD jar that is distributed with the services package.

I tried this, and it worked like a charm, right out of the box. Why does it seem that so much of programming is filled with these elusive problems that vanish for seemingly no reason at all?

After I started using only the ones that were distributed in a nice package together, I had no more trouble. Everything else ran like melting goat cheese on roasted nectarines, in other words, great.

Twitter App

The Twitter Sensor is in progress. I'm having some trouble with the library I decided to use, in that it doesn't seem to work at all. I've only been tinkering with it for a few days, though, so it's possible I've just missed some crucial detail. This is unfortunate, because from the look of things, this should code up tickety-boo. However, I can't even get their examples to run. Splendid!

Data Storage

Two weeks ago I discussed a Social Media SDT. Aaron thinks (and I agree) that this is the best way to handle the data I'm going to be collecting. My only beef with it is the resource field. All of the sensor data has to be associated either with a Hackystat user or a Hackystat project. While that is not a huge deal, it is somewhat restrictive.

Side note on the SDT: I feel like the best documentation of the SDT stuff is actually in the REST API. However, there seems to be some distinction between SD and SDT that I do not get and find tremendously confusing. I have reread and reread that section and still have no clue.

I am also not exactly sure how to fit the less dynamic attributes into the SDT. One solution is to structure all of the attributes like relationships, e.g., "User Q is interested in herpetology" as an example of a Facebook interest structured as a relationship. That would make herpetology a node in the graph and "is interested in" an edge. I feel like this would add unnecessary complexity, but it could also make it interesting. I'm not sure how to toy with that until it works.
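Here is a tiny sketch of what "attributes as relationships" might look like in code, assuming each fact is stored as a (subject, relation, object) edge. Everything in it (the class names, the relation strings) is hypothetical and has nothing to do with how the SDT actually stores data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: every attribute becomes a labeled edge between two nodes.
final class RelationshipSketch {

  /** One labeled edge, e.g. ("User Q", "is interested in", "herpetology"). */
  static final class Edge {
    final String subject;
    final String relation;
    final String object;

    Edge(String subject, String relation, String object) {
      this.subject = subject;
      this.relation = relation;
      this.object = object;
    }
  }

  private final List<Edge> edges = new ArrayList<Edge>();

  void relate(String subject, String relation, String object) {
    edges.add(new Edge(subject, relation, object));
  }

  /** Lists the objects a subject is linked to by the given relation, in insertion order. */
  List<String> objectsOf(String subject, String relation) {
    List<String> result = new ArrayList<String>();
    for (Edge edge : edges) {
      if (edge.subject.equals(subject) && edge.relation.equals(relation)) {
        result.add(edge.object);
      }
    }
    return result;
  }

  /** Every distinct node, whether it is a person, a project, or an interest. */
  Set<String> nodes() {
    Set<String> result = new TreeSet<String>();
    for (Edge edge : edges) {
      result.add(edge.subject);
      result.add(edge.object);
    }
    return result;
  }

  public static void main(String[] args) {
    RelationshipSketch graph = new RelationshipSketch();
    graph.relate("User Q", "is interested in", "herpetology");
    graph.relate("User Q", "follows", "User R");
    System.out.println(graph.objectsOf("User Q", "is interested in"));
    System.out.println(graph.nodes());
  }
}
```

The extra complexity shows up right away: interests and people end up in the same node set, so anything walking the graph has to pay attention to the relation label to tell them apart.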

The other solution, the one I am currently leaning towards, is ditching the idea of trying to back the data so closely with the Hackystat system. Instead, I would download as much of the data as I can get my grubby little paws on and reconfigure it into a format that allows me to have whatever kind of objects I need (as opposed to simply people and projects). I could potentially use the nifty database thing that Philip sent across the list a few days ago.

Guidance on this would be good. I will probably pose this idea to the list. I just need to pare it down so it's not long, rambly, and incoherent. You know, so that it's actually clear that I'm posing a question.

Google Project Setup

So I started this (it's easy!) but then I realized that I have NO IDEA what I should name this crazy contraption. I thought "hackystat-social-media-extravaganza", and then thought it might not be professional or descriptive enough (and then realized I probably shouldn't be allowed in public). The name can't be changed later, so it should probably be good from the start. "hackystat-social-network-predictive-widget"? I have no idea. Does Hackystat have component naming conventions that I missed somewhere? Suggestions?

All right, that about wraps up this week's work.

For the coming week!
Twitter app!
More decisions about storage and hosting
Maybe a Facebook app, too!

Remember, guys and dolls: the open sesame is printed also on the fire.