Monday, June 22, 2009

GSoC: Hackystat, June 15-22

This week:

So, after all of the ritualistic mapping from last week, Philip gave me some feedback, and I have been applying ratiocination to it all week. This ended with me basically scrapping the earlier setup, along with a couple of the earlier ideas.

Ideas that were ritualistically dumped:

Social Media SDT

Why did this hit the scrap heap? Because I had a problem--storing relational data--and a tool with which to solve it: the hammer that is the sensorbase. This led to me beating the problem with the sensorbase hammer in spite of the fact that my problem was not a nail. The sensorbase is too restrictive for what I need to be able to do, so I am tossing it in favor of Neo4J, which I have fallen in love with this week.

That said, I will still probably be sending some sensor data from Twitter to the sensorbase, but only the most atomic stuff. This would be very similar to a build event, only instead of the result of the build and a timestamp, it would be a "tweeted" event and a timestamp. Potentially I will be able to indicate "code-related" and "not code-related". However, that will involve text mining, which is going to take a little while to work up.
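To give a sense of what that atomic event might look like, here is a rough Java sketch. The field names mirror standard Hackystat sensor data fields, but the "tweeted" SDT name and the final SensorShell hand-off are assumptions on my part:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of a single atomic Twitter event for the sensorbase.
// Field names mirror standard Hackystat sensor data fields; the SDT
// name and the eventual SensorShell call are hypothetical.
public class TweetEventSketch {
  public static Map<String, String> makeTweetEvent(String screenName, String timestamp) {
    Map<String, String> data = new HashMap<String, String>();
    data.put("Tool", "Twitter");
    data.put("SensorDataType", "SocialMedia");   // hypothetical SDT name
    data.put("Timestamp", timestamp);            // e.g. "2009-06-22T10:15:00"
    data.put("Resource", "http://twitter.com/" + screenName);
    data.put("Type", "tweeted");                 // the atomic event itself
    // In the real sensor this map would be handed off to the Hackystat
    // SensorShell and flushed to the sensorbase.
    return data;
  }
}
```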

Sensors on the User's Computer

Originally, having the Twitter sensor run from the user's computer seemed like a really good way to deal with the secure storage of user authentication details and to save me from having to set up my own server. However, after I decided that storing the bulk of the social networking data in the sensorbase was not a good solution, it became necessary for me to run a server to store things anyway, and so the "I don't want to deal with running a server" argument became moot.

That left the issue of secure storage of user authentication details. However, at least in the instance of the Twitter app, this is actually a moot point. I can get all of the information that I need from a Twitter account that is friended by the user I am collecting data on.

How the Twitter App Works

Note: The Twitter app is not quite ready to go live yet, for reasons I will get to in a moment. If you're in a hurry, jump to The Database.

The Twitter sensor will live on my server.

In order for the Twitter sensor to gather information on a user, that user has to follow the Hackystat Twitter Sensor's Twitter account, which is HackystatSocNet.

Once they have followed that account, I can access information about their followers and status updates. Some of that I will store directly in my database; other bits I will send to the sensorbase.
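In rough Java terms, the polling loop would look something like this. The TwitterClient interface here is entirely hypothetical, a stand-in for whatever Twitter library I end up using; the point is the flow, not the API names:

```java
import java.util.List;

// Sketch of the server-side polling loop. TwitterClient is a made-up
// interface standing in for a real Twitter library.
public class TwitterSensorSketch {

  interface TwitterClient {
    List<String> getFollowers(String account);   // screen names of followers
    List<String[]> getTimeline(String user);     // {text, timestamp} pairs
  }

  public static void poll(TwitterClient client) {
    // Everyone following HackystatSocNet has opted in to data collection.
    for (String follower : client.getFollowers("HackystatSocNet")) {
      // Follower/friend links go straight into my database...
      for (String[] tweet : client.getTimeline(follower)) {
        // ...while each status update becomes an atomic "tweeted" event
        // for the sensorbase, as sketched earlier.
        System.out.println(follower + " tweeted at " + tweet[1]);
      }
    }
  }
}
```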

It's actually a much simpler setup than I was initially expecting. So, once I iron out The Database, it should get online very quickly.

Google Project

I named my project! I have dubbed it Hackystat SocNet (pronounced "sock net")! Why Hackystat SocNet? It's from Social Networking. It's reasonably descriptive, and further, it means I can have a sock puppet as a logo. Given that I love sock puppets so much that I spent my 21st birthday making them, I think this is fitting. For your viewing pleasure, I have included a rough draft of a logo. (I have a pretty digitized version of this somewhere but can't locate it.)



For some more ideas about how versatile a sock puppet logo can be, imagine a data mining sock puppet accessorized with a hard hat and a pick. Or, check out Alton Brown's bevy of yeast sock puppets--how fun to represent social networking with socks!

At any rate, I have started a Google Code project (not that there's any code there yet).

It is located at http://code.google.com/p/hackystat-analysis-socnet/. I will likely end up following the Hackystat model and having a separate project for each sensor, one for the data mining part of the application, and one for the final prediction application; this project will be the aggregator of those.

Next week:
The Database

This ended up being the obstacle to getting the Twitter App live this week. I realized somewhat suddenly that I didn't have any good, concrete plan for how to store the data, much less a consistent interface for accessing and storing it.

I have decided to use Neo4J, which has turned out to be extremely intuitive and nice to use. I am a fan.

I am somewhat struggling with what to store as nodes, what to use as relationships, and what to use as attributes. It all seemed so clear cut in my initial proposal! Most of the things I had planned as attributes might be more useful as nodes themselves. Here is what I am totally sure I will have (a rough Neo4J sketch follows these lists):

Nodes:

Coder (Implements Human)
Person (Implements Human)
Project
Employer
School

Relationships:

Is Friends with (human to human)
Follows (human to human)
Is Following (human to human)
Worked at (human to employer)
Worked with (human to human)
Contributed to (human to project)
Owns (human to project)
Went to School with (human to human)
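To make this concrete, here is roughly what creating a couple of nodes and a relationship looks like with Neo4J's embedded Java API. The class names are from the 1.x-era API, so they may not exactly match the release I end up on, and the node properties are just illustrative:

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

// Sketch of the SocNet graph using Neo4J's embedded Java API.
public class SocNetGraphSketch {

  // Relationship types from the list above.
  enum RelTypes implements RelationshipType {
    IS_FRIENDS_WITH, FOLLOWS, IS_FOLLOWING, WORKED_AT, WORKED_WITH,
    CONTRIBUTED_TO, OWNS, WENT_TO_SCHOOL_WITH
  }

  public static void main(String[] args) {
    GraphDatabaseService graphDb = new EmbeddedGraphDatabase("var/socnet");
    Transaction tx = graphDb.beginTx();
    try {
      Node coder = graphDb.createNode();
      coder.setProperty("type", "Coder");   // implements Human
      coder.setProperty("name", "Ada");     // illustrative value
      Node project = graphDb.createNode();
      project.setProperty("type", "Project");
      project.setProperty("name", "hackystat-analysis-socnet");
      // Relationships are first-class citizens, not join tables.
      coder.createRelationshipTo(project, RelTypes.CONTRIBUTED_TO);
      tx.success();
    } finally {
      tx.finish();
    }
    graphDb.shutdown();
  }
}
```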

I am less sure how to handle interests. Should interests (as listed on Facebook) be nodes? The approach of making all of these attributes into nodes can be taken quite far, and seems like it might make retrieval easier. For instance, instead of storing birth date as an attribute of a Human object, it could be a node. Every Human object born in 1984 would have an edge connecting its Human node to the 1984 node. Then, retrieving all Human objects born in 1984 would just be a matter of pulling all the nodes connected to the 1984 node by a "Born In" relationship. This could be taken scarily far, down to having nodes for days or times. I'm not sure this would be useful or efficient, but it is what I am thinking about. Thoughts?
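As a concrete example of the "Born In" idea (the relationship type and the "name" property here are hypothetical), pulling everyone born in 1984 becomes a single hop from the shared year node:

```java
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

// Hypothetical "Born In" modeling: every Human born in 1984 has an edge
// into the shared 1984 node, so retrieval is a one-hop traversal.
public class BornInSketch {
  public static void printPeopleBornIn(Node year1984) {
    for (Relationship r : year1984.getRelationships(
        DynamicRelationshipType.withName("BORN_IN"), Direction.INCOMING)) {
      System.out.println(r.getStartNode().getProperty("name"));
    }
  }
}
```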

I am also struggling with how to store the time series of Hackystat data. I want to store six or seven standard telemetry series (Build, DevTime, etc.) for each Hackystat project that is present in the database. Neo4J does not allow arbitrary objects as property values, which makes storing time series more difficult. However, it can store arrays of primitive types and strings, so I am considering arrays of arrays... We'll see.
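One workable version of that idea: store each telemetry stream as a pair of parallel primitive arrays on the project node, one for timestamps and one for values. The property naming scheme here is my own invention, not a Hackystat or Neo4J convention:

```java
import org.neo4j.graphdb.Node;

// Store a telemetry stream as parallel primitive arrays, which Neo4J
// can persist as property values even though arbitrary objects are out.
public class TelemetrySketch {
  public static void storeSeries(Node projectNode, String seriesName,
      long[] timestamps, double[] values) {
    projectNode.setProperty(seriesName + ".timestamps", timestamps);
    projectNode.setProperty(seriesName + ".values", values);
  }
}
```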

This is the biggest project for this week.

The Server

I cannot even describe how much setting up a server is not my area of expertise. I haven't even begun to research how you set up a server. This will be this week's other big project.

Milestone:

Things that I know I will have ready for my first milestone:

1. Graph database
2. Twitter App
3. Hackystat data accessing applet

I am planning to get the graph database ready this week, as all of the other pieces depend on it (the data gathering is pretty simple and straightforward--the rough part is transmitting and storing it). The Twitter app can be ready within two days of having the database up, so we'll call that next Wednesday. The other pressing order of business is the applet to get data from Hackystat, which will probably take four or five days, bringing the timeline to just after the 4th of July weekend.

I am hesitant to commit to having the Facebook and Ohloh apps ready for the milestone, as I cannot guarantee that I will have them ready to look at by July 6th. I will leave it open as an option, though, as it's possible that once the database is up and running and I have a consistent interface for dealing with it, then everything will go much more quickly. However, I would like to leave some wiggle room if it does not.
