Tuesday, June 30, 2009

GSOC: Hackystat June 22-30

A slightly belated blog entry, for those of you who were waiting with bated breath.

This Previous Week:

Was not as productive as I hoped. I had taken a week off in early June to study like a madwoman for a Chemistry CLEP test, but after failing the practice test hardcore, decided that a little more time was necessary so that I could actually graduate. So, much of last week was spent living like a hobo, perched on the midden heap that was a corner of my sofa, reading the entire chemistry textbook from MIT. Fortunately, I passed, so now I get to graduate.

Here's what I did get squeezed in:

Neo Transalpine Database

This is mostly finished! I think I redesigned it at least six times, but at the very least, the bones are there. The big thing left to do for it is make an easy way to retrieve information, but since that is supposedly what Neo4j shines at (traversals and so forth), I think we should be good.

Later this evening I will post a diagram of the design. I decided to go the way of making everything possible a node instead of an attribute. Potentially, this will come back to chew on my butt, but for the time being it made for some beautifully simple implementation code.

My one concern with that path is that I ended up with quite a few classes that did not have much substance. All of the classes extend the abstract SocialMediaNode class, which houses the underlying node and has accessors and mutators for the only property I am requiring all nodes to have, that is, a name. Some of my classes (for instance, the Coder class), have quite a few more properties than just a name. However, there are probably an equal number (perhaps slightly more) that are just sort of wrappers for the abstract superclass, consisting only of a constructor that passes a value to the superconstructor. I did this so it would be easier to distinguish between types of nodes and so the database creation code would be more straightforward, but I do not know that it was the correct decision.
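
To make the tradeoff concrete, here's a minimal sketch of the hierarchy described above. The underlying Neo4j node is stubbed as a property map so the example stands alone, and the extra Coder property is invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the SocialMediaNode hierarchy. The underlying Neo4j node is
// stubbed as a property map here so the example stands alone; the real
// class would wrap an actual Neo4j node object instead.
abstract class SocialMediaNode {
    private final Map<String, Object> underlyingNode = new HashMap<String, Object>();

    protected SocialMediaNode(String name) {
        setName(name);
    }

    // Accessor and mutator for the one property every node must have.
    public final String getName() {
        return (String) underlyingNode.get("name");
    }

    public final void setName(String name) {
        underlyingNode.put("name", name);
    }
}

// A "thin wrapper" subclass: nothing but a constructor passing to super.
class School extends SocialMediaNode {
    public School(String name) {
        super(name);
    }
}

// A richer subclass like Coder carries extra properties of its own.
// (The hackystatAccount property is invented for illustration.)
class Coder extends SocialMediaNode {
    private final String hackystatAccount;

    public Coder(String name, String hackystatAccount) {
        super(name);
        this.hackystatAccount = hackystatAccount;
    }

    public String getHackystatAccount() {
        return hackystatAccount;
    }
}
```

The thin wrappers buy type safety (an `instanceof School` check distinguishes node kinds in database-creation code) at the cost of a pile of nearly empty classes; the alternative would be one class with a "type" property.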

The Facebook App

Yeah, I know I said that I wasn't going to work on this this week. But, after a long discussion with Zack about the server issue, I decided that the Facebook application was going to require the most specific things from the server (it's actually the only thing that needs to be hosted) and would therefore be the most pressing.

Now, my original plan was to build the app using Joyent as my host and then transfer it to another host later, but this is not going to work. Joyent does not provide free hosting for Facebook applications in Java, only Facebook applications in PHP and (I think) Ruby. Now, I have written in Ruby, but it's been almost three years, and at this point in the summer I think it would be unwise to take on something else. I searched for other free hosting, and have as yet found none. Lame.

So I started setting up one of Zack's desktops as a host: I registered with DynDNS and installed a DynDNS updater, and Glassfish is installed and configured. Whee.

Other things from this week

The structure of this whole thing changed again. At last update, I was planning to host the application that let users release to me for mining their Hackystat data. However, I decided that it was probably most secure to make that application something they download to their own computer, as then it can access sensorshell.properties and I don't have to securely store authentication information. This also makes this application significantly simpler to write, which is pleasing.
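
As a sketch of why this simplifies things: the downloaded app can just read the user's existing configuration file instead of asking for (and securely storing) a password. The file path and property key below are assumptions about the standard Hackystat layout; check the real sensorshell.properties for the exact names.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Minimal sketch: the downloadable app reads the user's credentials
// straight from their local sensorshell.properties, so no authentication
// information ever has to be stored on my server.
class SensorShellCredentials {
    public static Properties load(File propsFile) throws IOException {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(propsFile);
        try {
            props.load(in);
        } finally {
            in.close();
        }
        return props;
    }

    // Assumed conventional location of the file on the user's machine.
    public static File defaultLocation() {
        return new File(System.getProperty("user.home"),
            ".hackystat/sensorshell/sensorshell.properties");
    }
}
```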

Next Week:

I am a trifle behind (I think I had planned to have the database ready to roll by today), but it should be ready in the bat of an eye. (I would love to say tomorrow, but if I say tomorrow, overnight I will decide that I need to completely refashion it again, and that would automatically kill the database contentedness.) Besides, after last week I've forgotten what fun and leisure are like, so I foresee just oodles of productivity.

  • Okay, so database.
  • Twitter application. Do we still want this to be sending updates to the sensorbase?
  • After those two, the downloadable "Let Rachel Access Your Hackystat Data for Mining" application.
  • Also, database commenting. Right now it is naked, and it isn't super hard to understand, but then again, I just wrote it. So commenting it before I forget what I was thinking is a good idea.
  • And the server stuff. Philip was so kind as to make a number of screencasts for me, and so I really ought to watch them.
  • I suppose I should also actually upload my code to my very lonely Google project site.

Vision

I admit that I am a little concerned about where this whole thing is going. My original plan was to have stuff collecting data by now, so that I could do the data mining of it for a graduate independent study in July.

However, I'm not sure that was ever a reasonable goal, particularly considering, you know, wanting to really test things before I unleash them on the computers of unsuspecting users. While I feel that I can certainly have everything in place, all of the components built, by the end of the summer, I'm not sure that I will have models ready to be queried yet. We laid out our goals for Hackystat, but what are Hackystat's goals for us?

Other things that I have been wondering about: let's say, for instance, that we decide we do want to store some Twitter instances in the Sensorbase. At what point is the decision made, "Yeah, okay, this is not going to blow things up/eat Tokyo/create infinite lolcats and can send data to the sensorbase for reals"?

Technical Difficulties

Eclipse is driving me crazy with its slowness, its utter refusal to clean build when I tell it to, and the otherwise batspit insane errors it's been giving me. It is also making me into the whiniest coder ever, which I am sure everyone who follows my Twitter is tired of. (But yes, I do accept cheese with my whine!)

Philip suggested that the slowness might come from the Eclipse sensor and its communications with Hawaii. This seemed reasonable, so I took his advice and uninstalled the plugin. This helped somewhat, but I'm not sure it helps enough to justify the dearth of data.

It also (sadly) did not fix the other bizarre errors I've been getting... like today, when Eclipse spent an hour insisting that I couldn't implement an interface because it "wasn't an interface". It WAS an interface. Usually when one gets errors of this nature, a clean build fixes it... but several clean builds later... same error. This did not make sense to me. I added in a couple of syntax errors, which it caught, and then I removed them, which it didn't catch, even three builds later. Unfortunately, I don't remember what series of steps I took to finally make it come to its senses and recognize my interface as an interface. I'm going to reinstall the wretched thing tomorrow and see if that helps. Or I may just beat my face against my monitor until it is satisfied by the ritualistic dumping of my blood and decides to behave again.

In Other News:

Additionally, this week, I have gained 10 hours of chemistry credit and killed three tomato plants by watering them with milk.

Monday, June 22, 2009

GSoC: Hackystat, June 15-22

This week:

So, after all of the ritualistic mapping from last week, Philip gave me some feedback, and I have been applying ratiocination to it all week. This ended with me basically scrapping the earlier setup, and a couple of the earlier ideas.

Ideas that hit the scrap heap and were ritualistically dumped:

Social Media SDT

Why did this hit the scrap heap? Because I had a problem--storing relational data--and a tool by which to solve it, the hammer that is the sensorbase. This led to me beating the problem with the sensorbase hammer in spite of the fact that my problem was not a nail. It was too restrictive for what I need to be able to do, so I am tossing it in favor of Neo4J, which I have fallen in love with this week.

That said, I will still probably be sending some sensor data from Twitter to the sensorbase, but only the most atomic stuff. This would be very similar to a build event, only instead of the result of the build and a timestamp, it would be a "tweeted" event and a timestamp. Potentially I will be able to indicate "code-related" and "not code-related". However, that will involve text mining, which is going to take a little while to work up.
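
A hedged sketch of what such an atomic record might look like as key/value pairs, shaped like a build event. The field names here are my assumptions for illustration, not the real Hackystat SDT fields:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the atomic "tweeted" record: just a result and a timestamp,
// analogous to a build event. Key names are assumptions, not the real
// Hackystat sensor-data field names.
class TweetEvent {
    public static Map<String, String> toSensorData(String owner, long timestampMillis) {
        Map<String, String> data = new HashMap<String, String>();
        data.put("SensorDataType", "SocialMedia");
        data.put("Tool", "Twitter");
        data.put("Owner", owner);
        data.put("Timestamp", String.valueOf(timestampMillis));
        data.put("Result", "tweeted"); // later, perhaps "code-related" / "not code-related"
        return data;
    }
}
```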

Sensors on the User's Computer

Originally, having the Twitter sensor running from the user's computer seemed like a really good way to deal with the secure storage of user authentication details and to avoid me having to set up my own server. However, after I decided that storing the bulk of the social networking data in the sensorbase was not a good solution, it became necessary for me to start running a server to store things, and so the "I don't want to deal with running a server" argument became moot.

That left the issue of secure storage of user authentication details. However, at least in the instance of the Twitter app, this is actually a moot point. I can get all of the information that I need from a Twitter account that is friended by the user I am collecting data on.

How the Twitter App Works

Note: The Twitter app is not quite ready to go live yet, for reasons I will get to in a moment. If you're in a hurry, jump to The Database.

The Twitter sensor will live on my server.

In order for the Twitter sensor to gather information on a user, they have to follow the Hackystat Twitter Sensor's Twitter account, which is HackystatSocNet.

Once they have followed that account, then I can access information about their followers and status updates. Some of that I will store directly in my database--other bits of it I will send to the sensorbase.

It's actually a much simpler setup than I was initially expecting. So, once I iron out The Database, it should get online very quickly.

Google Project

I named my project! I have dubbed it Hackystat SocNet (pronounced "sock net")! Why Hackystat SocNet? It's from Social Networking. It's reasonably descriptive, and further, it means I can have a sock puppet as a logo. Being that I love sock puppets so much that I spent my 21st birthday making them, I think this is fitting. For your viewing pleasure, I have included a rough draft of a logo. (I have a pretty digitized version of this somewhere but can't locate it.)



For some more ideas about how versatile a sock puppet logo can be, imagine a data mining sock puppet accessorized with a hard hat and a pick. Or, check out Alton Brown's bevy of yeast sock puppets--how fun to represent social networking with socks!

At any rate, I have started a Google project (not that there's any code there yet).

It is located here: http://code.google.com/p/hackystat-analysis-socnet/ . I will likely end up following the Hackystat model and having a separate project for each sensor, for the data mining part of the application, and for the final prediction application. But this will be the aggregator of those.

Next week:
The Database

This ended up being the obstacle to getting the Twitter App live this week. I realized somewhat suddenly that I didn't have any good or concrete plans on how to store the data, much less a consistent interface for accessing and storing the data.

I have decided to use Neo4J, which has turned out to be extremely intuitive and nice to use. I am a fan.

I am somewhat struggling with what to store as nodes, what to use as relationships, and what to use as attributes. It all seemed so clear cut in my initial proposal! Most of the things I had planned as attributes might be more useful as nodes themselves. Here is what I am totally sure I will have:

Nodes:

Coder (Implements Human)
Person (Implements Human)
Project
Employer
School

Relationships:

Is Friends with (human to human)
Follows (human to human)
Is Following (human to human)
Worked at (human to employer)
Worked with (human to human)
Contributed to (human to project)
Owns (human to project)
Went to School with (human to human)

I am less sure how to handle interests. Should interests (listed on facebook) be nodes? I mean, the approach of making all of these attributes nodes can be taken quite far, and seems like it might make retrieval easier. For instance, instead of storing birth date as an attribute for a Human object, it could be a node. Every Human object born in 1984 would have an edge connecting that Human node to the 1984 node. Then, retrieving all Human objects born in 1984 would only be a matter of pulling all the nodes connected to the 1984 node by a "Born In" relationship. This could be taken scarily far, to the length of having nodes for days or times. I'm not sure that this would be useful or efficient, but it is what I am thinking about. Thoughts?
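
To illustrate the retrieval idea with a toy in-memory stand-in (plain Java maps rather than the real Neo4j API): if birth year is a node, then "everyone born in 1984" is just the set of Human nodes reachable from the 1984 node by a "Born In" edge.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the attribute-as-node design: each year node keeps its
// incident "Born In" edges, so retrieval is one lookup plus edge traversal.
// In Neo4j this would be a relationship traversal from the year node.
class YearNodeDemo {
    // year node -> humans connected to it by a "Born In" edge
    private final Map<Integer, List<String>> bornInEdges =
        new HashMap<Integer, List<String>>();

    public void addHuman(String name, int birthYear) {
        List<String> humans = bornInEdges.get(birthYear);
        if (humans == null) {
            humans = new ArrayList<String>();
            bornInEdges.put(birthYear, humans);
        }
        humans.add(name);
    }

    // "All Human objects born in <year>" = the nodes on the year's edges.
    public List<String> humansBornIn(int year) {
        List<String> humans = bornInEdges.get(year);
        return (humans == null) ? new ArrayList<String>() : humans;
    }
}
```

The same shape works for interests ("is interested in" edges to an interest node); the open question is how fine-grained to go before the node count stops paying for itself.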

I am also struggling with how to store the time series of Hackystat data. I want to store, like, six or seven standard telemetry series (Build, DevTime, etc.) for each Hackystat project that is present in the database. Neo4J does not allow arbitrary objects for attributes, which makes storing time series more difficult. However, it can store arrays of primitive types and strings, so I am considering arrays of arrays... We'll see.
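
One way that could work within those constraints, sketched with a plain map standing in for Neo4j's node properties: each telemetry series becomes two parallel primitive arrays, one of timestamps and one of values. The property-naming scheme is made up for illustration.

```java
import java.util.Map;

// Sketch: a telemetry series stored as two parallel primitive arrays,
// the only array shapes Neo4j properties will accept. The map stands in
// for node.setProperty / node.getProperty.
class TelemetrySeries {
    public static void store(Map<String, Object> nodeProperties,
                             String seriesName, long[] days, double[] values) {
        if (days.length != values.length) {
            throw new IllegalArgumentException("days and values must be parallel arrays");
        }
        nodeProperties.put(seriesName + ".days", days);
        nodeProperties.put(seriesName + ".values", values);
    }

    public static double valueOn(Map<String, Object> nodeProperties,
                                 String seriesName, long day) {
        long[] days = (long[]) nodeProperties.get(seriesName + ".days");
        double[] values = (double[]) nodeProperties.get(seriesName + ".values");
        for (int i = 0; i < days.length; i++) {
            if (days[i] == day) {
                return values[i];
            }
        }
        throw new IllegalArgumentException("no data point for day " + day);
    }
}
```

Parallel arrays sidestep the arbitrary-object restriction at the cost of keeping the two arrays in sync by hand, which is why the store method checks the lengths.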

This is the biggest project for this week.

The Server

I cannot even describe how much setting up a server is not my area of expertise. I haven't even begun to research how you set up a server. This will be this week's other big project.

Milestone:

Things that I know I will have ready for my first milestone:

1. Relational database
2. Twitter App
3. Hackystat data accessing applet

I am planning to get the relational database ready this week, as all of the others depend on that (the data gathering is pretty simple and straightforward--the rough part is transmitting and storing it). The Twitter app can be ready within two days of having the database up, so we'll call that next Wednesday. The other pressing order of business is the applet to get data from Hackystat, which will probably take four or five days, bringing us in the timeline to after the 4th of July weekend.

I am hesitant to commit to having the Facebook and Ohloh apps ready for the milestone, as I cannot guarantee that I will have them ready to look at by July 6th. I will leave it open as an option, though, as it's possible that once the database is up and running and I have a consistent interface for dealing with it, then everything will go much more quickly. However, I would like to leave some wiggle room if it does not.

Monday, June 15, 2009

GSoC: Hackystat, June 8-15

Now, with 100% more ritualistic mapping!

This week I spent lots of time on Vision. I wasted a lot of time wondering how in the world I was going to make it easy for the users to run learning algorithms, and whether I should server-base it or have them download their own personal little learner... and then I realized that they won't be interacting with the learning algorithms at all.

Here's what (I think) will happen.

The user will download the Twitter and Ohloh sensors, much the way they download Eclipse and other sensors. Those sensors will do an initial dump of sensor data to the sensorbase. This initial dump will include followers, following, contributors, etc. Then, those sensors will basically be sleeper threads that will wait patiently until something at Twitter or Ohloh changes, at which point they'll transmogrify that into XML (just like before) and send it on to the sensorbase.
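
The sleeper-thread idea might be skeletonized like this; the two abstract methods are stubs standing in for the real Twitter/Ohloh queries and the real SensorShell send, and the XML wrapper is purely illustrative.

```java
import java.util.List;

// Skeleton of a "sleeper" sensor: it periodically wakes, asks the service
// what changed since the last check, transmogrifies each change into XML,
// and ships it to the sensorbase. fetchChangesSince and sendToSensorbase
// are stubs for the real service and SensorShell calls.
abstract class SleeperSensor {
    private long lastCheckMillis = 0L;

    /** What changed at Twitter/Ohloh since the given time? (stub) */
    protected abstract List<String> fetchChangesSince(long sinceMillis);

    /** Ship one XML record to the sensorbase. (stub) */
    protected abstract void sendToSensorbase(String xmlRecord);

    /** One wake-up: forward anything new. Returns the number of records sent. */
    public int pollOnce(long nowMillis) {
        List<String> changes = fetchChangesSince(lastCheckMillis);
        for (String change : changes) {
            sendToSensorbase("<SensorData>" + change + "</SensorData>");
        }
        lastCheckMillis = nowMillis;
        return changes.size();
    }
}
```

In practice pollOnce would be scheduled on a timer (e.g. java.util.concurrent.ScheduledExecutorService), with the very first pass doubling as the initial dump.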

That's pretty straightforward. The Facebook sensor is a little different because it doesn't get to live on the user's computer. It has to be hosted. Right now I've set up Joyent hosting for it, as it's free. However, it's only free for a year, and you have to have over 50 users or it either stops being free or they delete your program wholesale. (I can't remember which. Either is kind of lame.) However, other than the fact that it has to be hosted somewhere, it's basically the same deal.

All right, so after I have all that pretty data going to the sensorbase, I will somehow download all of it. (I need to email the dev list about that.)

After downloading all of it, I will run as many different algorithms as I can think of to generate models. This will probably actually be done on Zack's super computer, as I shudder to think how long it would take on my laptop.

Once I have those models, I will build an interface for them so that the user can enter their hackystat user name, it will build a graph from the nodes and edges connected to that user, and then make predictions based on whatever the model says about the graph.

That was a lot of words, so I have also ritualistically mapped the components for the more visual learners in our audience. (Click on this to make it larger and legible.)


This is by no means set in stone, folks. I've kept the information gathered somewhat minimal--it can be expanded pretty easily. The other thing that can be changed is the location of the Social Network Query Program. To me, it seems somewhat like the analysis services that Hackystat already has on the servers, so it seems like it fits there. However, it could also easily be made as something to download and run locally. I open this for debate.

Hackystat as Administrator!

I started getting the Hackystat services running locally this week. (I was hoping to be sending data to them, too, but I'll get to that in a minute.) I had a LOT of trouble with the DailyProjectData service...for no perceivable reason.

Here's what happened:

I downloaded it from the DPD project site, and had it all set up to run. My sensorbase was humming away, my properties files were all in order. So I ran it and BAM! Null Pointer Exception.

Null pointer exception? Really? Specifically:

Exception in thread "main" java.lang.NullPointerException
at org.hackystat.dailyprojectdata.server.Server.disableRestletLogging(Server.java:104)
at org.hackystat.dailyprojectdata.server.Server.newInstance(Server.java:91)

I dug into the source code, and the line that appears to be throwing it is actually library code.

It's the line marked with the comment in this method:

/**
 * Disable all loggers from com.noelios and org.restlet.
 */
private static void disableRestletLogging() {
  LogManager logManager = LogManager.getLogManager();
  for (Enumeration<String> e = logManager.getLoggerNames(); e.hasMoreElements();) {
    String logName = e.nextElement();
    if (logName.startsWith("com.noelios") ||
        logName.startsWith("org.restlet")) {
      // Server.java:104. Note that LogManager only holds its loggers weakly,
      // so getLogger() can return null here -- which would produce exactly
      // this NullPointerException.
      logManager.getLogger(logName).setLevel(Level.OFF);
    }
  }
}

Granted, I'm not exactly sure what identifying the source of the problem was going to do for me, since I am not supposed to compile it from the sources (for which I am somewhat grateful).

Anyway, I asked Aaron about it. I checked to make sure I was using the right versions of the necessary libraries (I seemed to be). Finally I noticed somewhere a suggestion to always use the DPD jar that is distributed with the services package.

I tried this, and it worked like a charm, right out of the box. Why does it seem that so much of programming is filled with these elusive problems that vanish for seemingly no reason at all?

After I started using only the ones that were distributed in a nice package together, I had no more trouble. Everything else ran like melting goat cheese on roasted nectarines, in other words, great.

Twitter App

The Twitter Sensor is in progress. I'm having some trouble with the library I decided to use, in that it doesn't seem to work at all. I've only been tinkering with it for a few days, though, so it's possible I've just missed some crucial detail. This is unfortunate, because from the look of things, this should code up tickety-boo. However, I can't even get their examples to run. Splendid!

Data Storage

Two weeks ago I discussed a Social Media SDT. Aaron thinks (and I agree) that this is the best way to handle the data I'm going to be collecting. My only beef with it is the resource field. All of the sensor data has to be associated either with a Hackystat user or a Hackystat project. While that is not a huge deal, it is somewhat restrictive.

Side note on the SDT: I feel like the best documentation of the SDT stuff is actually in the REST API. However, there seems to be some distinction between SD and SDT that I do not get and find tremendously confusing. I have reread and reread that section and still have no clue.

I am also not exactly sure how to fit the less dynamic attributes into the SDT. One solution is to structure all of the attributes like relationships, i.e., "User Q is interested in herpetology", as an example of a Facebook interest structured as a relationship. That would make Herpetology a node in the graph and "is interested in" an edge. I feel like this would add unnecessary additional complexity, but it could also make things interesting. I'll just have to toy with it until it works.

The other solution, one that I am currently leaning towards, is ditching the idea of trying to back the data so closely with the Hackystat system. Instead, I would download as much of the data as I can get my grubby little paws on and reconfigure it into a format that allows me to have whatever kind of objects I need to have (as opposed to simply people and projects.) I could potentially use the nifty database thing that Philip sent across the list a few days ago.

Guidance on this is good. I will probably pose this idea to the list. I just need to pare it down so it's not long, rambly, and incoherent. You know, so that it's actually clear that I'm posing a question.

Google Project Setup

So I started this (it's easy!) but I realize that I have NO IDEA what I should name this crazy contraption. I thought "hackystat-social-media-extravaganza", and then thought it might not be professional or descriptive enough (and then realized I probably shouldn't be allowed in public). It can't be changed, so it should probably be good from the start. "hackystat-social-network-predictive-widget". I have no idea. Does Hackystat have component naming conventions that I missed somewhere? Suggestions?

All right, that about wraps up this week's work.

For the coming week!
  • Twitter app!
  • More decisions about storage and hosting
  • Maybe a Facebook app, too!

Remember, guys and dolls: the open sesame is printed also on the fire.

Eclipse just bit it hardcore. More hardcore than I have ever experienced an IDE crash before.

Sunday, June 14, 2009

The dailyprojectdata service is running!!! Clearly, collard greens solve all the world's problems.

You know, even if I can't get the dailyprojectdata service to run locally, I am so in love with life right now. Tummy full of greens. yay!

Thursday, June 11, 2009

Null pointer exception in the daily project data thingy i was attempting to run locally. Line 104 in Server.

Okay, the sensorbase is running locally... Now what?

Bless google for helping me with my ignorance.

WTF is an SMTP server?

Monday, June 8, 2009

GSoC: Hackystat June 1-8

I have been almost completely unproductive this week. It was the last week of my intersession class and I was studying for a CLEP exam, so that took up basically all of my time.

This week will hopefully be much more productive.

In addition to working on what I had planned from last week, it looks like a paper of interest to me just went across the list, so I will be investigating that.

Monday, June 1, 2009

GSoC: Hackystat, May 25-June 1

This week, by and large, has focused on Hackystat, Developer Style! This has involved a lot of reading and planning. I have been anxious to start pounding out code, but after the reading I have done this week, anything beyond simple system testing seemed a little premature.

This week, I have:
1. Watched developer screencasts, including the new hackystat-developer-example.

I sometimes feel that Hackystat is likely to spoil us with its user-friendliness from an open source development standpoint. I feel confident that most open source projects don't have handy little video tutorials. However, I am not at all complaining! They are immensely helpful.

I really like the ivy integration a lot. I didn't have an iota of trouble with the developer example.

2. Contemplated issues of hosting, data retrieval, and storage.

The bulk of my mental energy went to this problem this week. There are several issues involved here. The first: what data am I allowed to mine, and is there a way I can get it that isn't on a per-user/per-project basis? Aaron says there is a way to do it without having to know a user's name and password. (I was contemplating an approach in which users interested in including their data in the mining project would download my app, which would use the info in their sensorshell.properties file to access their data, but this has the distinct disadvantage of requiring effort from the user, which may significantly limit the amount of data I have access to.)

So that's the data retrieval issue. Now for the hosting question.

The Facebook application has to be hosted. Joyent is offering free hosting, but only for a year, and only for applications that have more than 50 users. So that might be pushing it for our purposes. I would potentially like to host it on the Hackystat server. Aaron has asked for a more detailed specification of what the application will do, at which point we will be discussing it with The Powers that Be.

Storage is related to hosting.

It occurred to me while I was trying to pin down a design that I would definitely need to be storing all of this information somewhere. Aaron suggested creating a generic SocialMedia SensorDataType to do this.

3. Read boatloads of Facebook developer documentation

Which, to my disappointment, is not quite as easy to work with as the Hackystat documentation, or as pleasant and well-organized as the Java API. (Like I said, I'm getting spoiled.)

I am still deciding what language to use for the Facebook app. FB seems to lean towards PHP. I've had some experience with Ruby on Rails, so I'm thinking about using it instead. The initial Facebook application is not a terribly complex creature, in my head. Basically, it asks your permission to access your FB profile information, friends, etc, and then it takes that information, bundles it into the SDT, and sends that to Hackystat. I suppose it's really very little different than other development sensors, other than that the events are related to friends and interests.

4. (Somewhat unrelated) Solved the environment variable issue that was causing me such grief.

See this entry.

5. Started thinking about a potential SensorDataType for social networking data.

I'm having some difficulty coming up with something generic enough to cover Ohloh, Twitter, and Facebook relationships and attributes. Not to mention that Aaron suggested adding SVN or mailing-list relationships (an idea that I freaking LOVE, but am not sure how to implement). The SDT needs to be generic enough to be easily expanded on--I can come up with many different additions to this idea, so I want this to be really easy to extend. For me, the most natural way to represent this data is two objects and a relationship. Initially, I thought that didn't exactly jibe with the key/value setup of the SDT. However, now, I think it could work reasonably well. Something along the lines of:

SDT            Key             Example Values
Social Media   Object1         “Person”, “File”, “Bug”, “Project”, etc.
               Object1ID       Some unique integer id
               Object2         “Person”, “File”, “Bug”, “Project”, etc.
               Object2ID       Some unique integer id
               Relationship    “Friend”, “Follows”, “Contributes to”, “Edited”, etc.
               RelationshipID  Some unique integer id
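
Read as code, one instance of that SDT is just a small key/value map encoding "object1 --relationship--> object2". This is a sketch following the table; the actual Hackystat calls for submitting it are omitted, and the IDs are whatever unique integers the collector assigns.

```java
import java.util.HashMap;
import java.util.Map;

// One Social Media SDT instance as a key/value map, following the table
// above: two objects plus the relationship between them.
class SocialMediaRecord {
    public static Map<String, String> create(String object1, int object1Id,
                                             String object2, int object2Id,
                                             String relationship, int relationshipId) {
        Map<String, String> kv = new HashMap<String, String>();
        kv.put("Object1", object1);
        kv.put("Object1ID", String.valueOf(object1Id));
        kv.put("Object2", object2);
        kv.put("Object2ID", String.valueOf(object2Id));
        kv.put("Relationship", relationship);
        kv.put("RelationshipID", String.valueOf(relationshipId));
        return kv;
    }
}
```

Because everything is a pair-plus-edge, adding the SVN or mailing-list relationships later would just mean new values for Object1/Object2 and Relationship, not a new schema.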

Vision

I was informed that I am working on a vision document, which was news to me, particularly after being told that I already had such a thing? I suppose that might be the original proposal.

What I had not included in my original proposal (in which I was mostly excited about the immediate data-mining prospects), but that I now definitely see as something worthwhile, is making sure that the setup I'm working on is 1. easily extensible and 2. easy to query for future projects as they arise. Aaron's idea of making the Social Media SDT fits in nicely with that.

Questions that I have:

What kinds of questions should be directed to the dev list?

Direction for the coming week:
  • Start a Google Code project, even if I don't actually have any code yet.
  • Run a couple of toy Facebook applications on Joyent hosting to get a better sense of the requirements of my Facebook app, and what hosting it will require.
  • Develop Social Media SDT
  • Read Twitter API
  • Play some more with hackystat as a user (Hackystat, user style!)
  • Get hackystat services to run locally (Hackystat, Admin style!)
You know, they say that minutes of planning save hours of coding. Hoping that turns out to be right.

Hours spent this week: 17