Sunday, August 23, 2009

GSoC: Hackystat August 17-23 and A Summer in Review

Whew. Well. Where to start?


The Last Week

This week was something of a trial. It seemed like every time I fixed a problem, that solution would cause a cascade of new problems. But, I FINALLY have the Hackystat sensor up and running. It's available as a featured download on the project page.

The biggest issues that I ran into were a result of the solution that I applied to the XML problem from last week. Because I had my own version of the Telemetry XML objects, which were identical to the Hackystat Telemetry objects but in a different package, getting things from the TelemetryClient into a form that I could send to my database was a real trial. It produced a lot of angst and some decidedly unattractive code. However, my decidedly nasty code could be ditched permanently if the Hackystat schemas were given namespaces and prefixes. That would be awesome!

The other problem that I had relating to the TelemetryClient was really strange. getChart (the one that doesn't take extra parameters) didn't work. So, I had to dig up the default parameters for all of the charts that I needed and use the getChart method that did take extra parameters. Thanks to Shaoxuan for all his help debugging that problem!

I've saved all the default parameters for the charts I'm using as String constants, so they can be accessed easily.

Another note: it was significantly non-obvious to me that the Telemetry service runs on a different port on the server, and that one would need to specify this. I'm sure it's in the documentation somewhere, but where is it?

Finally, I realized at some point yesterday that I hadn't covered a couple of the cases for the Hackystat client, such as computer crashes and so on, so I had to add those in. In other nasty surprises, as I suspected, the lack of Findbugs and PMD errors was too good to be true. So I'm not quite ready to be pulled into the Hackystat server umbrella.

What I didn't accomplish this week, that I had really hoped to accomplish, was the visualization tool. It's still next on the roster, but it doesn't get to be under the GSoC hat, I suppose.


The Summer as a Whole

This project became so much more gargantuan than I had intended! Looking back on the time line that I wrote at the beginning of this whole thing makes me laugh. Knowing what I know now, I might have come up with something considerably less insane!

It ended up being sort of a trial by fire! I certainly didn't expect to be standing at the end of the summer, having put together my own server! (The sensorbase was a good template to work off of, definitely!) The whole thing just ended up being so much larger and more complicated than I anticipated.

That, maybe more than anything, is what I have learned from this experience. Actual use by real people in the real world complicates even simple tasks. Really, until this summer I've been working on mostly toy projects. Nothing huge, and nothing really meant to be used by people. When making something that is supposed to be used, there are so many more things that one has to consider than I am accustomed to considering; the end result, then, is that even simple things (like pulling data from one server and sending it to another server) can spiral endlessly into complication.

Stopping that complication from overwhelming the project (and the coder!) is a really important thing, I think. It's somewhat difficult when one is sort of programming in an echo chamber of one's own mind. Because the GSoCers were all working on individual projects, it was somewhat more difficult for us to bounce ideas off one another. I feel that the current release of my project would be of significantly higher quality if I had spent more time communicating my ideas and having the designs vetted as I went. I say this because I was describing a bug to an unrelated third party, and he was like, "WTF were you thinking, designing that this way?" Coming from a place of such inexperience with real software development, it's hard to know if you're making a good design decision. How do you choose the right way when you don't know the wrong way? So I know now to take better advantage of my available resources. Perhaps--gasp, the horror!--request a code review? Spend more time talking design with my mentor? All of these things!

Beyond the new understanding of how much time things actually take, and how important collaboration is to the end product, I also learned a metric boatload of new skills and tools. JAXB, for instance, is still, to my mind, the coolest thing since the internet. Java Property files are also really awesome. Not to mention suddenly understanding how the internet really works. I mean, sure, I've done some very low level networking stuff. I've implemented go-back-n, etc. But one day after getting the SocNet server up and running I abruptly realized that when you call for a web page, it's a GET request! Some server somewhere returns a representation of the HTML to you! That was a very exciting moment for me.

What I Want to Do Now

I would like to take a little break from the SocNet project, now that GSoC is over. But not a break from Hackystat! I would like to get a NetBeans sensor up and running. I know one was tried in the past, but that was several years ago and I would like to give it another shot. Eclipse and I parted ways mid-summer, so I've been missing out on some of the Hackystat data that I would really like to have had. I think I might have figured out a way to jury-rig such a sensor, at least, using the hints capability.

After that, back to SocNet! The next two big chunks of it that need doing are the Ohloh sensor and the visualizations/analysis tool. I may do the Ohloh sensor first--the Hackystat sensor taught me that anything run from a user's computer (i.e., one that I do not have direct control over) is much more difficult and painful. After this week, I would gladly accept a slightly less painful task!

I would also like to start work on the sensor discussed over the list, which crawls a repository to determine how familiar a coder is with a particular concept. It was suggested as a Master's thesis--perhaps I can make it mine?


A Word of Thanks

Now that the summer has drawn to an end, I would like to thank all of the Hackystat Hackers who helped me through my first GSoC! Special thanks to Aaron, Philip, and Shaoxuan, without whom I might never have survived.

Tuesday, August 18, 2009

GSoC: Hackystat August 10-August 17

This week

Hackystat app, now and forevermore.

REST API support for the Hackystat App is up and running.

Had a lot of trouble with XML. I wanted to write a complex type that contained TelemetryStreams as elements. Something along these lines:

<xs:element name="XMLContributesToRelationship">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Type" minOccurs="1" maxOccurs="1"/>
      <xs:element ref="ID" minOccurs="1" maxOccurs="1"/>
      <xs:element ref="StartTime" minOccurs="1" maxOccurs="1"/>
      <xs:element ref="EndTime" minOccurs="0" maxOccurs="1"/>
      <xs:element ref="XMLNode" minOccurs="2" maxOccurs="2"/>
      <xs:element ref="TelemetryStream" minOccurs="9" maxOccurs="9"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

But I couldn't figure out how to include the telemetry definitions in the file. There were a lot of namespace problems. The full rundown is available on the dev list, but I shall repeat the solution I arrived at.

My final solution:

1. Give telemetry.resource a namespace and a prefix. Append the prefix where necessary to elements and complex type definitions in telemetry.resource.
2. Give my schema a namespace and a prefix. Append the prefix where necessary to elements and complex type definitions in my schema.
3. Import telemetry.resource.
4. Drop the element declaration of TelemetryStream in my schema.

I'm not sure which of these are necessary and which are superstitious fluff, but at least I got it working. I know you can import things without a namespace, but I couldn't make that fly with this.

However, this solution to my server-side problem caused a client-side issue that I am still working with. Now I have to choose whether to use my telemetry schema (the same as Hackystat's, but with a namespace and prefixes) for the client-side stuff, or Hackystat's. I was coding merrily along until I realized I had imported half of the telemetry stuff from the Hackystat library and half from my jaxb folder.

However, once I clear that up and do some testing, then we have Hackystat Application LAUNCH! Exciting, n'est-ce pas?

Then: visualizations and analysis tool hardcore.

Also: test coverage, documentation, continuous integration

In other words, it's going to be a busy week. I'll keep you posted.

Wednesday, August 12, 2009

GSoC: Hackystat August 3-August 10

Sorry for the delay in posting--I landed myself a nice bronchial infection and have spent most of the last week coughing like a sea lion barks. It's awesome! However, it will probably also contribute to my brevity today, which I imagine many of you in the audience will appreciate.

This week:

Hackystat app. (Still. Possibly forever more.)

Working on the hackystat app makes me feel like I might be the only person who has ever tried to access Telemetry data who was not intimately familiar with the workings of the system. Much time has gone towards trying to find a constant or list or SOMETHING that includes the names of all of the Telemetry charts. The test cases for the Telemetry chart stuff don't seem to use them--they just use hard-coded strings, which makes me suspect that there is no such set of constants. For those with commit access--man, would that be handy! Judging by the test cases and the list of telemetry stuff in the project browser, I decided that the names must be the same as the list in the project browser.

I will be storing these charts in SocNet:
Build
Churn
CodeIssue
Commit
Coverage
CyclomaticComplexity
DevTime
Issue
UnitTest
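
A minimal sketch of how these names could live in one place instead of being scattered around as hard-coded strings. The enum name TelemetryChart and its methods are hypothetical (they are not part of the Hackystat libraries), and the nine names are simply the ones listed above, which are themselves my guess from the project browser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical enum (not a Hackystat class) holding the chart names listed above. */
enum TelemetryChart {
  BUILD("Build"),
  CHURN("Churn"),
  CODE_ISSUE("CodeIssue"),
  COMMIT("Commit"),
  COVERAGE("Coverage"),
  CYCLOMATIC_COMPLEXITY("CyclomaticComplexity"),
  DEV_TIME("DevTime"),
  ISSUE("Issue"),
  UNIT_TEST("UnitTest");

  private final String chartName;

  TelemetryChart(String chartName) {
    this.chartName = chartName;
  }

  /** The name as it would be passed to the telemetry service (assumed). */
  String chartName() {
    return chartName;
  }

  /** Returns every chart name in declaration order. */
  static List<String> allChartNames() {
    List<String> names = new ArrayList<>();
    for (TelemetryChart chart : values()) {
      names.add(chart.chartName());
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(allChartNames());
  }
}
```

With something like this, a typo in a chart name becomes a compile error rather than a mysterious empty result.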

If anyone has a favorite chart they would like to see stored, speak now (or soon) or forever hold your peace. (Just kidding. But do speak up, because knowing would be good.)

The app is mostly finished (if my assumptions about the names were correct)--now I'm implementing its REST API support.


Visualizations:

TouchGraph is out, because it has virtually no documentation, and the code is a relatively old version. (They don't know when they will be releasing the new one.) I haven't been able to figure out how to use it, so I have moved on to other options.

Jung, which I mentioned last week, has better documentation than TouchGraph by a long shot. However, I am working most seriously with Giny (http://csbi.sourceforge.net/). Giny is a LOT easier to work with than Jung, and implements a bunch of handy graphing algorithms that will make rudimentary analysis that much easier.

What I can't decide is how to host the visualizations. It would be easiest (from my perspective), to run them on the user's computer. However, it would probably be best to do a project browser style thing, visualization via web browser. My concern is that I will not be able to manage that in two weeks.


Library problems:

I am running into trouble using the hackystat client libraries. For instance, with the Telemetry client most recently added to my system, I pulled the ivy retrieve target from the telemetry system build file. Somewhat lazy, I know, but why duplicate work? The problem is that the target only works if you've compiled and built from source hackystat-utilities, hackystat-sensorbase-uh, hackystat-sensor-shell, and hackystat-dailyprojectdata. Which is fine if you've done it, but not great if you haven't. I think this is because the individual projects don't have modules in ivy-roundup or in my module repository.

Since I don't want individual users to have to compile the entire hackystat system from sources just to be able to use my stuff, it would be awesome if the sensorshell jar and the telemetry jar were added to ivy roundup. If they can't be added to ivy roundup, is it cool if I add them to my module repository?


Next week:

My plan from last week was sort of a general overview for the next two weeks, so it still stands. So I'll be gluing the hackystat sensor to the socnet server and working on visualizations with Giny.

Something I'd like to do that MIGHT not be such a big deal would be to start one-way hashing the passwords.
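
A rough sketch of what one-way hashing could look like, assuming a per-user random salt and the PBKDF2 implementation that ships with the JDK. The class name PasswordHasher, the iteration count, and the key length are my own illustrative choices, not anything SocNet currently does.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.security.spec.InvalidKeySpecException;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

/** Hypothetical helper: stores a salted one-way hash instead of the password itself. */
class PasswordHasher {
  private static final int ITERATIONS = 10000; // illustrative value, tune for your hardware
  private static final int KEY_BITS = 256;

  /** Generates a fresh random salt to store next to the hash. */
  static byte[] newSalt() {
    byte[] salt = new byte[16];
    new SecureRandom().nextBytes(salt);
    return salt;
  }

  /** Derives the hash for a password and salt using PBKDF2 with HMAC-SHA256. */
  static byte[] hash(char[] password, byte[] salt) {
    try {
      PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
      SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
      return factory.generateSecret(spec).getEncoded();
    } catch (NoSuchAlgorithmException | InvalidKeySpecException e) {
      throw new IllegalStateException("PBKDF2 is not available on this JVM", e);
    }
  }

  /** Re-hashes the attempt with the stored salt and compares the two hashes. */
  static boolean matches(char[] attempt, byte[] salt, byte[] storedHash) {
    return MessageDigest.isEqual(hash(attempt, salt), storedHash);
  }

  public static void main(String[] args) {
    byte[] salt = newSalt();
    byte[] stored = hash("s3cret".toCharArray(), salt);
    System.out.println(matches("s3cret".toCharArray(), salt, stored));
    System.out.println(matches("wrong".toCharArray(), salt, stored));
  }
}
```

The server would keep only the salt and the hash, so a leaked database would not hand out anyone's actual password.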

Tuesday, August 4, 2009

GSoC: Hackystat July 27-August 3

This week:

Has not been as productive as I needed it to be. Mostly, it has been consumed with Ivy frustrations. I will be trucking along and realize that I need another library, and have to stop and futz with the Ivy stuff until it works. This week, that also involved updating Ant, since apparently the version of Ant running on this release of Ubuntu is two years old. In a couple of the cases I was having difficulty figuring out which jar I would need. For instance, I needed the SensorBaseClient class, but I didn't feel like it was a good idea to have the hackystat client dependent on the whole sensorbase. So, I looked at how the eclipse sensor did it, and saw that the eclipse sensor pulls the sensorshell jar, and assumed that it was all wrapped up in the sensorshell jar. I copied that little bit from the sensorshell build file and put it in the build file for my project, ran it.... Break. Fail. Lose. This did not make sense to me. So I downloaded the sensorshell and tried to build it. Fail, but because of Ant.

In the end, I downloaded the new releases and built everything from the source, so all of the necessary jars would be in my cache. I didn't experience any problems building the system from source, though I do have a question about the ant -f jar.build.xml publish-all command. Does it only build the projects that are immediately dependent upon the project you are building? Or does it cascade? I mean, does it follow the chain rule?

Like, if project x depends on project y which depends on project z, and you invoke the publish-all on project z, does it only build project y, or does it also build project x? My working understanding is that it only builds project y. It would be super nice if it also built project x.
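
To make the question concrete, here is a small sketch of the difference between direct and transitive dependents for the x, y, z example. It says nothing about what publish-all actually does; the class DependentsDemo and its methods are hypothetical and only illustrate the two possible behaviors.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of "only the next project" versus "cascade". */
class DependentsDemo {

  /** Example graph from the text: x depends on y, and y depends on z. */
  static Map<String, List<String>> exampleGraph() {
    Map<String, List<String>> dependsOn = new HashMap<>();
    dependsOn.put("x", Arrays.asList("y"));
    dependsOn.put("y", Arrays.asList("z"));
    dependsOn.put("z", Collections.<String>emptyList());
    return dependsOn;
  }

  /** Inverts "depends on" edges into "is depended on by" edges. */
  static Map<String, List<String>> invert(Map<String, List<String>> dependsOn) {
    Map<String, List<String>> dependents = new HashMap<>();
    for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
      for (String dependency : entry.getValue()) {
        dependents.computeIfAbsent(dependency, k -> new ArrayList<>()).add(entry.getKey());
      }
    }
    return dependents;
  }

  /** Projects that depend immediately on the given project. */
  static Set<String> directDependents(Map<String, List<String>> dependsOn, String project) {
    List<String> direct = invert(dependsOn).getOrDefault(project, Collections.emptyList());
    return new LinkedHashSet<>(direct);
  }

  /** Projects that depend on the given project through any chain (breadth-first walk). */
  static Set<String> transitiveDependents(Map<String, List<String>> dependsOn, String project) {
    Map<String, List<String>> dependents = invert(dependsOn);
    Set<String> seen = new LinkedHashSet<>();
    Deque<String> queue = new ArrayDeque<>();
    queue.add(project);
    while (!queue.isEmpty()) {
      String current = queue.remove();
      for (String dependent : dependents.getOrDefault(current, Collections.emptyList())) {
        if (seen.add(dependent)) {
          queue.add(dependent);
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(directDependents(exampleGraph(), "z"));     // only y
    System.out.println(transitiveDependents(exampleGraph(), "z")); // y, then x
  }
}
```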

So, while I love Ivy a lot for installation purposes, and for downloading and organizing purposes, writing something to use Ivy as you go is an enormous pain, particularly if something is not already in the roundup.

Other library related concerns. The sensorshell jar seems to contain pretty much the entire sensorbase. Why? It seems like I tried to avoid a sensorbase dependency and ended up with one anyway.

Anyway, here's a preview of the app. Note that where it currently says, "Item 1.... " etc on the list will actually be a list of the user's projects in the SensorBase. This is just the preview of the GUI.



[Screenshot: the GUI preview, also with tool tip texts.]

Not that this will run right now. Have two more libraries to ivy-ify.


Visualizations:

These are the network visualization libraries I'm experimenting with.

http://sourceforge.net/projects/touchgraph/

http://jung.sourceforge.net/

I am really excited about touchgraph.


Splitting the Project:

This is going to have to wait until I've finished writing the code, more or less. It's enough trouble to update one set of build files. I don't look forward to having to update 4 or so.


Sprinting to the finish!

The firm pencil's down date is in just two weeks' time. Things that need to get done before then:

1. meaningful registration process, by which a user can link the email address they used to register with the sensorbase to all of their various socnet stuff, so that I can limit who accesses the data in an appropriate way.

2. useful visualizations and initial analysis tools
Other than displaying the network and allowing the user to navigate through it in a sort of physical manipulate-y way (touchgraph!), this will probably also include some summarization of the other information in the network. I don't, however, want to simply mirror what the dailyprojectdata and telemetry services already do. Still thinking on what that will be.

Then, once that is done, splitting the project into smaller, manageable chunks, and making sure my documentation is nice, etc.

Also, on a personal note, I graduated from college this week.

Tuesday, July 28, 2009

GSoC: Hackystat July 20-27

This week:

Client authentication is FINALLY working! I cannot describe the satisfaction. Currently, the PUT authentication is significantly more meaningful than the GET authentication. I am still waffling on how I should address retrieving information from the database. In some cases, my apps need to access it freely (which is no problem and is mercifully working beautifully), but other users may need to have permissions set up so that they can only access nodes in the database that are connected to their nodes. This will be more difficult. On the other hand, I'm not sure that it's the best solution. It really does depend a lot on who is going to be using it. It might be good to set up levels of access. The default, for a user, is for them to be able to access all of the information that is related to their user node. Then, permissions can be added for them to access the information (read only, naturally) of other users. I don't really know how to go about doing that--I suspect I can just change the user schema to include a permissions list. Have I mentioned that I really like JAXB?
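
Here is a minimal in-memory sketch of the levels-of-access idea: a user can always read and write their own data, and read-only grants to other users' data can be added on top. The class AccessPolicy and its method names are hypothetical, and a real version would live in the user schema and the database rather than in a map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical permissions list: own data by default, read-only grants for others. */
class AccessPolicy {
  // Maps a reader to the set of users whose data they were granted read access to.
  private final Map<String, Set<String>> readGrants = new HashMap<>();

  /** Adds a read-only grant so that reader may see owner's data. */
  void grantRead(String reader, String owner) {
    readGrants.computeIfAbsent(reader, k -> new HashSet<>()).add(owner);
  }

  /** Reading is allowed for your own data or for data you were granted. */
  boolean canRead(String reader, String owner) {
    if (reader.equals(owner)) {
      return true;
    }
    return readGrants.getOrDefault(reader, Collections.emptySet()).contains(owner);
  }

  /** Writing is only ever allowed on your own data. */
  boolean canWrite(String writer, String owner) {
    return writer.equals(owner);
  }

  public static void main(String[] args) {
    AccessPolicy policy = new AccessPolicy();
    policy.grantRead("alice", "bob");
    System.out.println(policy.canRead("alice", "bob"));  // granted
    System.out.println(policy.canWrite("alice", "bob")); // never granted
  }
}
```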

Continuous Integration:

I surprised myself by being much further along on this than I had initially thought. All of that pain a couple of weeks ago with Ivy was well worth the time and trouble! I do have much in the way of findbugs, pmd, and checkstyle errors, but not nearly as many as I was anticipating (somewhere in the range of 100.) I suspect this comes from having cannibalized portions of the code from the sensorbase. I'm considering setting the project up as an Eclipse project again, just so I can use the checkstyle plugin, but Eclipse and I get into a fist fight basically every time I try to use it. (I've been using NetBeans, which is a balm to my soul). Hopefully this week I'll get that under control. I am also considering breaking the project out into separate projects once the hackystat app is up this week, so that the clients have separate build files, etc.

Hackystat App

Is going slowly. Man, has it ever been a long time since I've done GUI work. The list consensus on which Telemetry to store... wasn't, although it did bring up an interesting research question: what exactly do the telemetry streams a user singles out as important say about that user? This is such an interesting thing to me that I am going to store some standard telemetry streams, and then also store a list of which streams the user thinks are important.

Issues I'm still having:

JUnit tests. Somehow, my tests are not independent and this is causing them to fail. It has something to do with the initialization of the database.

This week!

Getting down to the wire, somewhat. I'm going to start focusing on some visualizations of the data. Basic things, like just showing the network. I don't know how to make it easy to see the relationship data (telemetry streams, for instance, are attributes of the relationship between a hackystat user and a project). So that's going to be a hurdle. Okay, truthfully, displaying it at all is going to be a hurdle. There are a couple of network graphics libraries out there that are open source, so I may look there. There's also Improvise, which I know can display stuff like this...though it's more oriented towards building the visualizations by hand as opposed to programmatically, but since it can do it, it's possible that I can use their visualization engine. I <3 open source software.

Going to split the project up this week, as I feel that that's more in keeping with the way hackystat is built, and may be easier to upkeep. The CI stuff goes hand in hand with that--I want to have that all wrapped up by the end of the week.

I'm aiming at having the Hackystat app ready to go live tomorrow evening or Wednesday around noon my time.

Vision:

I have decided where I want to go with this as a tool set. I really like the idea of having an analysis tool for this that is, as Philip says, something of a hypothesis generator for groups of coders, such as within a class or company.

Unfortunately, in my experience data mining algorithms require being tweaked halfway to hell before they work, so I'm guessing the initial version of this will not be terribly awesome. I already have a number of standard mining algorithms implemented in a library that Zack and I have been intending to open source for a while, once it was cleaned up and documented. Unfortunately the part that works best, the spectral clustering, is based on a very fast, very ancient Fortran library (I don't write in Fortran), that has been seg faulting mysteriously for about five months now. I have had neither the time nor the skill to repair this.

However, the non-spectral clusterings work splendidly, so in theory, that could be plugged in. The graph specific stuff is likely to be more useful, like the SRPTs and the other thing that I have since completely forgotten the name of.

Miscellaneous:

There are some things that I have just come to REALLY love this summer. Most of them are exceptionally nifty tools to which I had never been exposed during previous experiences. I LOVE the properties files. How clever is that? It just tickles me to death. It makes me want to find more things to hide in them, though I suspect after a while it drives your user crazy.
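
For anyone who hasn't met them, here is a tiny sketch of what reading settings from a properties file looks like with java.util.Properties. The keys socnet.host and socnet.port, the default values, and the class name ServerSettings are all made up for illustration, and the text is parsed from a string here only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical settings reader; the keys and defaults are illustrative only. */
class ServerSettings {

  /** Parses properties-file text (normally this would be read from a file). */
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new IllegalStateException("Could not read properties", e);
    }
    return props;
  }

  /** Returns the configured host, falling back to a default when the key is missing. */
  static String host(Properties props) {
    return props.getProperty("socnet.host", "localhost");
  }

  /** Returns the configured port, falling back to a default when the key is missing. */
  static int port(Properties props) {
    return Integer.parseInt(props.getProperty("socnet.port", "9999"));
  }

  public static void main(String[] args) {
    Properties props = parse("# comment lines are ignored\nsocnet.host=example.org\n");
    System.out.println(host(props) + ":" + port(props));
  }
}
```

The nice part is that a user can change the host or port without touching (or recompiling) any code.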

JAXB. There is nothing nicer than having your code write code for you. <3

Ivy. I am so glad that I tried to build the system before the ivy integration was finished. I feel that I have a much greater appreciation for how truly awesome ivy is.

Tuesday, July 21, 2009

GSoC: Hackystat July 13-20

Client Authentication is Made of Lose and Fail.

Okay, so it's maybe not that bad, but I'm definitely having much more difficulty with it than I had anticipated. In some ways I think it would be significantly easier if I had put it in from the beginning, as in having to go back through and make my client and resource code work with the authentication, I managed to break things pretty badly. The biggest difficulty that I have conquered so far was the Mailer. I'm using gmail as my smtp server, and I could not for the life of me get it to authenticate properly. As far as I can tell, the original Mailer code for the server does NO authentication. How is that even possible? Anyway, I tried a couple of different ways to add in the authentication for the mailer, but it was a variation of the following code from GaryM at the VelocityReviews forum thread on gmail as an smtp server that finally got it to work.

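import java.security.Security;
import java.util.Properties;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class GoogleTest {

  private static final String SMTP_HOST_NAME = "smtp.gmail.com";
  private static final String SMTP_PORT = "465";
  private static final String emailMsgTxt = "Test Message Contents";
  private static final String emailSubjectTxt = "A test from gmail";
  private static final String emailFromAddress = "";
  private static final String SSL_FACTORY = "javax.net.ssl.SSLSocketFactory";
  private static final String[] sendTo = { "" };

  public static void main(String args[]) throws Exception {
    // Register the Sun JSSE provider, as in the quoted forum code.
    Security.addProvider(new com.sun.net.ssl.internal.ssl.Provider());

    new GoogleTest().sendSSLMessage(sendTo, emailSubjectTxt,
        emailMsgTxt, emailFromAddress);
    System.out.println("Successfully sent mail to all users");
  }

  public void sendSSLMessage(String recipients[], String subject,
      String message, String from) throws MessagingException {
    boolean debug = true;

    // Talk to gmail over SSL on port 465 and require SMTP authentication.
    Properties props = new Properties();
    props.put("mail.smtp.host", SMTP_HOST_NAME);
    props.put("mail.smtp.auth", "true");
    props.put("mail.debug", "true");
    props.put("mail.smtp.port", SMTP_PORT);
    props.put("mail.smtp.socketFactory.port", SMTP_PORT);
    props.put("mail.smtp.socketFactory.class", SSL_FACTORY);
    props.put("mail.smtp.socketFactory.fallback", "false");

    // The Authenticator hands the account credentials to the SMTP server.
    Session session = Session.getDefaultInstance(props,
        new javax.mail.Authenticator() {
          @Override
          protected PasswordAuthentication getPasswordAuthentication() {
            return new PasswordAuthentication("xxxxxx", "xxxxxx");
          }
        });

    // The excerpt stops above; the rest is a minimal completion that I am
    // assuming from the method signature, using standard JavaMail calls.
    session.setDebug(debug);
    Message msg = new MimeMessage(session);
    msg.setFrom(new InternetAddress(from));
    InternetAddress[] addressTo = new InternetAddress[recipients.length];
    for (int i = 0; i < recipients.length; i++) {
      addressTo[i] = new InternetAddress(recipients[i]);
    }
    msg.setRecipients(Message.RecipientType.TO, addressTo);
    msg.setSubject(subject);
    msg.setContent(message, "text/plain");
    Transport.send(msg);
  }
}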

The code highlighted in red I initially left out, because I didn't know what it did. Leaving it out causes the whole thing to fail, so obviously it's important. The code highlighted in green is what is missing from the SensorBase Mailer that makes me think that the SensorBase mailer doesn't authenticate before sending emails.

So, the mailer was a somewhat frustrating goose chase. Now, the goose I am chasing is an "Undefined user null" error. Not an exception, mind you. I think it's a status message that's being set somewhere in code that I don't have direct access to (my best guess is somewhere in the restlet router or guard stuff.)

So, currently, the client authentication has stalled at the retrieving data part. You can register new users all the livelong day, and it will send them emails with their login credentials. However, if you try to get information as an authenticated user, it finds you in the database, sees that you're properly authenticated, returns that you are a user and are okay to access the data... and then fails.

Hackystat Client

The hackystat client will allow the users to specify which project(s) and time frames they want to allow SocNet to access, through a simple little GUI. (Only contiguous time periods can be selected for a project, so not like, two weeks here and then another two weeks later). The thing left to decide is... which telemetry data to use? What are the most common/most useful analyses?


This week:

Finishing the client authentication is the biggest deal, followed by finishing the hackystat client. I'll be sending an email to the list to get input on which telemetry analyses are the most useful, as well.

Additionally, I'm going to take a crack at the continuous integration stuff Philip has encouraged us to work on, so that will be a couple of new and exciting toolsets. I'm particularly excited about checkstyle.

Sunday, July 12, 2009

GSoC: Hackystat July 6-13

This Week:

Server and Database
The server and the database are up and open for business! (Well, in that you can run them. They are not open for business in the public server sense). It is Ivy integrated and mostly lovely to behold. It does not currently support the full API listed on the project wiki, but we're getting there. It implements all those necessary for clients to store information in the graph.

The server does not implement client authentication yet, but it's on the roster of things to do this week. It is also possible that there are one too many layers of abstraction between the resources and the graph, but that can be addressed in future revisions.

Problems encountered:

JAXB ugliness.

I had a really lovely interface created for my XMLRelationship objects that included two instances of a named complex type, XMLNode startNode and XMLNode endNode. However, when I tried to marshall it, the marshaller threw an exception because the node was missing an @XmlRootElement annotation. Adding that annotation works just fine, but apparently it wasn't being generated. Google provided this answer: http://weblogs.java.net/blog/kohsuke/archive/2006/03/why_does_jaxb_p.html

Apparently the JAXB compiler won't add the @XmlRootElement annotation unless it can prove that the type isn't going to be used by anything outside of that file.

Neither of the fixes proposed in the above blog worked for me, so I had to refactor the XMLNode to be an anonymous type. This makes the interfaces a lot uglier, as instead of having two separately named instances of XMLNode, you have a list of two XMLNodes, called XMLNode. And instead of being able to call

relationship.setStartNode(beginNode);

I have to do

List<XMLNode> nodes = relationship.getXMLNode();
nodes.add(beginNode);

The bad plurals hurt my soul. I've been looking into how to configure JAXB to generate custom types in hopes that I can clean it up later that way, but it's not a super high priority. Just something that would make my soul hurt a little less.

Not all of the REST API is supported by the server yet. The server code is also a little shy on the comments.

Time Traveling Exceptions.

So, in testing one of my get methods, I attempted to retrieve the user "Eliza Doolittle" from the database, using this uri

"{host}/nodes/IS_USER/Eliza Doolittle"

If you look closely you can probably guess what the problem was. Yes, folks, HTTP calls do not like them some spaces, not at all. So it transferred as "Eliza+Doolittle", and asked the database for the node of that name, which didn't exist. Normally, this would have just thrown a NodeNotFoundException and that would have been handled appropriately, but in this case it threw the exception and, through a series of unfortunate events, completely obscured what was happening. I spent about four hours trying to figure out how something could be not null in the method passing it and then null when it is received... But, fortunately, there was no break in the space time continuum and I eventually got it worked out. Moral of the story is that spaces are a no-no for URIs until I implement + removal in the server.
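
A small sketch of the encoding round trip using only the JDK, to show where the "+" comes from and how the server side could turn it back into a space. The class name UriSpaces and the helper names are hypothetical; the /nodes/IS_USER/ path is the one from my URI above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical helpers showing how a space in a node name travels over HTTP. */
class UriSpaces {

  /** Form-style encoding: a space becomes '+'. */
  static String formEncode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  /** Reverses form-style encoding: '+' becomes a space again. */
  static String formDecode(String value) {
    try {
      return URLDecoder.decode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  /** Builds the request path and lets java.net.URI percent-encode the space. */
  static String encodedPath(String nodeName) {
    try {
      return new URI(null, null, "/nodes/IS_USER/" + nodeName, null).getRawPath();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Not a valid path: " + nodeName, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(formEncode("Eliza Doolittle"));  // Eliza+Doolittle
    System.out.println(formDecode("Eliza+Doolittle"));  // Eliza Doolittle
    System.out.println(encodedPath("Eliza Doolittle")); // /nodes/IS_USER/Eliza%20Doolittle
  }
}
```

Decoding the name on the server before the database lookup would be one way to do the "+ removal"; whether that fits the restlet resource code is something I still have to check.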

Neo4J Ivy Integration

Neo was a poor choice for my first attempt at Ivy Integration. Neo4j has no consistent naming convention for their directories and libs in their releases, so Ivy kept trying to download a file that didn't exist. I couldn't for the life of me figure out where Ivy was getting that name from... Still haven't, actually. I solved the problem by instructing Ivy to rename Neo to the name it wanted before it started looking for it in the cache. The rest of the ivy integration was MUCH smoother. However, I would really like to know how one generates the xml files from xslt files.

Twitter App

The twitter app is also up! It sleeps for 15 minutes if there's an unidentified exception, and an hour between each round of polling twitter for changes and sending the information to the database. I am particularly proud of the caching. I was initially very frustrated because it's highly parallel but not quite identical between getting the followers of the Twitter Client account and getting the followers of a particular user. I arrived at a solution that was efficient in its code reuse and linear time instead of quadratic time. I was pleased.
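
To illustrate the linear-versus-quadratic point, here is a sketch of comparing a cached follower list against a freshly polled one using hash sets, so each name is looked up in roughly constant time instead of scanning the other list. This is an illustration of the general technique, not the actual SocNet code, and the class FollowerDiff and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: finds who followed or unfollowed since the last poll. */
class FollowerDiff {

  /** Names present in the new poll but not in the cache. */
  static Set<String> added(Collection<String> cached, Collection<String> current) {
    Set<String> result = new LinkedHashSet<>(current);
    // Wrapping in a HashSet keeps each membership check at expected O(1).
    result.removeAll(new HashSet<>(cached));
    return result;
  }

  /** Names present in the cache but missing from the new poll. */
  static Set<String> removed(Collection<String> cached, Collection<String> current) {
    return added(current, cached);
  }

  public static void main(String[] args) {
    Collection<String> cached = Arrays.asList("ann", "bob", "cat");
    Collection<String> current = Arrays.asList("bob", "cat", "dan");
    System.out.println("new followers: " + added(cached, current));
    System.out.println("lost followers: " + removed(cached, current));
  }
}
```

Because the same method works for any pair of name collections, it can be reused for both the Twitter Client account's followers and an individual user's followers.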

Documentation

Gasp! There is actual documentation up at the socnet google project page! There are directions for installing and building from sources! There are directions for how to start sending your Twitter data to the server! It's full of awesome and wow.

This next week:

Client Authentication in the server. This is necessary so that people can't DOS my server, or fill it with a million instances of RickAstley objects.
Cleaning up server documentation
Hackystat data grabber
Getting Ohloh API key

I'm putting Facebook on the back burner for the time being. Currently, they are being sued about their developer data access policy. Not that this is likely to be resolved soon enough to help me, but I think I can do a fair amount with the hackystat and ohloh stuff.


Check out the shinier project page : http://code.google.com/p/hackystat-analysis-socnet/


Friday, July 10, 2009

GSoC: Hackystat -- The SocNet Server Goes PUT

PUT works!

And on the first run, too.

GSoC: Hackystat SocNet Server and the Ivy Integration

Today, I tried to build just the server. However, because of the new directory structure, it wouldn't build without also building the twitter sensors and the social media graph. I tried to exclude both from the build script, but for some reason the includes/excludes did not work properly in any way, shape, or form. After beating my head against that for a while, I decided to bite the bullet and integrate with Ivy.

ZOMG, what an ordeal. It was somewhat easier because there were already examples of the stuff that Philip had integrated. However, one of the libraries that I needed (neo4j) has all of the naming consistency of a teeter-totter in a hurricane. So in order to get that to download and install properly, there was an exceptional amount of bother. As it is, the neo4j xml files have significantly more hardcoding in them than I am comfortable with. It took FOUR HOURS to get the bloody thing working. The second library was much easier, in part because they followed a consistent naming scheme, and in part because by that time I had some clue what I was doing. The XML was beginning to develop meaning, the mysteries of ivy were becoming somewhat clear...

Just trying to figure out what needed to be in the XML files was difficult, even with the examples from IvyRoundup. I notice that most of those were generated files, which appear to somehow have been generated with an XSLT file. I would very much like to know how that works, as I suspect that would make the job much easier.

For the time being, the ivy modules are stored on my google project page. I would eventually like to move them to IvyRoundup, but as I will have to get permission to do so first, I figured this was a good interim solution.

Thursday, July 9, 2009

GSoC: Hackystat and The Twitter App

Twitter FTW
The Twitter App is up!

Well, mostly, anyway. What it is missing is a more specialized wrapper for the generic wrapper for the REST API to the SocNet server. Right now I have the stubs of such a thing written, and it should be done in the early afternoon tomorrow.

Let's talk for a moment about the wrappers.

The generic wrapper for the REST API is going to throw exceptions if anything goes wrong (a different exception for each kind of thing that could go wrong). You know, it's a very thin veneer on the HTTP calls. The more specialized wrapper will do some nice things, such as check to see if an object is already in the database before trying to add it. So that will add some latency and complexity, but it makes using it a little safer and more user friendly.
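To make the two layers concrete, here is a minimal sketch in Java. Every name in it (GenericSocNetClient, NodeAlreadyExistsException, SafeSocNetClient, addIfAbsent) is hypothetical, and an in-memory set stands in for the real HTTP calls, so treat it as an illustration of the idea rather than the actual SocNet wrapper code.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the thin, exception-throwing REST wrapper. */
class GenericSocNetClient {
  /** Thrown when a PUT targets a node that already exists. */
  static class NodeAlreadyExistsException extends Exception {
    NodeAlreadyExistsException(String name) {
      super("Node already exists: " + name);
    }
  }

  // In-memory set standing in for the server's database (an assumption).
  private final Set<String> nodes = new HashSet<>();

  /** Would be a GET against the server in a real wrapper. */
  boolean exists(String name) {
    return nodes.contains(name);
  }

  /** Would be a PUT against the server in a real wrapper. */
  void put(String name) throws NodeAlreadyExistsException {
    if (!nodes.add(name)) {
      throw new NodeAlreadyExistsException(name);
    }
  }
}

/** Hypothetical specialized wrapper: checks before adding. */
class SafeSocNetClient {
  private final GenericSocNetClient generic;

  SafeSocNetClient(GenericSocNetClient generic) {
    this.generic = generic;
  }

  /** Returns true if the node was added, false if it was already there. */
  boolean addIfAbsent(String name) {
    if (generic.exists(name)) {
      return false; // costs one extra round trip, but avoids the exception
    }
    try {
      generic.put(name);
      return true;
    } catch (GenericSocNetClient.NodeAlreadyExistsException e) {
      // Someone else added it between the check and the PUT.
      return false;
    }
  }
}
```

The extra exists() call is where the added latency comes from; the payoff is that callers of the specialized wrapper never have to handle the "already exists" exception themselves.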

Things I have learned from the Twitter App:
Caching is non-trivial. In fact, it's bloody difficult.
There are a million things that can go wrong at any point when communicating over a network.

Here's how the app works, on a high level.

Upon initialization, the twitter client asks the SocNet server for the twitter accounts already listed in the database. All of these twitter accounts should be following the Twitter client Twitter account (HackystatSocNet, I think). Then, it asks the server for the followers and friends of each of the twitter accounts in the database. It uses these lists to initialize the cache. The first time around, all these lists will be empty.

Then, it gets the list of the users following the HackystatSocNet account from Twitter, compares the list received from Twitter to the cache, and adds or deletes users from the database and the cache as necessary.

Once it's done that for all of the users, it sleeps for an hour.
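The compare-and-update step described above can be sketched as two set differences. This is a minimal sketch under assumptions, not the app's actual code: the class and method names are hypothetical, user names are plain strings, and the calls to the SocNet server are left as a comment.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of one polling round for a single account. */
class FollowerCacheSync {
  /** Followers Twitter reports that the cache does not know about yet. */
  static Set<String> toAdd(Set<String> cached, Set<String> fromTwitter) {
    Set<String> added = new HashSet<>(fromTwitter);
    added.removeAll(cached);
    return added;
  }

  /** Followers in the cache that Twitter no longer reports. */
  static Set<String> toRemove(Set<String> cached, Set<String> fromTwitter) {
    Set<String> removed = new HashSet<>(cached);
    removed.removeAll(fromTwitter);
    return removed;
  }

  /** Brings the cache in line with Twitter; returns the number of changes. */
  static int sync(Set<String> cached, Set<String> fromTwitter) {
    Set<String> added = toAdd(cached, fromTwitter);
    Set<String> removed = toRemove(cached, fromTwitter);
    // The real app would also send these additions and deletions
    // to the SocNet server at this point.
    cached.addAll(added);
    cached.removeAll(removed);
    return added.size() + removed.size();
  }
}
```

With hash sets on both sides, each difference takes expected linear time in the size of the lists, whereas comparing two plain lists element by element would be quadratic. The same sync method can serve both the HackystatSocNet account's followers and each individual user's followers, which is one way to get the code reuse.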

The caching ended up being much more painful to implement than I had anticipated. It took me a lot of rewriting to find an implementation that I was really happy with, but I like what I've got now.

It is, however, somewhat minimal in the "Catching Things If Exceptions Hit The Fan" department.

Next: REST Wrappers, and Server Work

I do not think I will have client authentication up on the server until next week. This week's server will probably be somewhat bare bones--handling of the REST calls but not as many of the niceties as the Sensorbase code has at this moment.

Tuesday, July 7, 2009

Reading up on JUnit and listening to Hank Williams.

Monday, July 6, 2009

GSoC: Hackystat, June 30-July 6

Last Week:

The database ended up in a much different (and significantly pared down) form than I had anticipated. Instead of the multitude of node wrappers that I had last week (one for each kind of object in the database), now there is only one kind, and nodes are distinguished based on their relationships. For instance, all "People" nodes are connected to the "People" subreference node. For clarity, here is a picture:



Subreference nodes are in blue, the reference node (database entry point) is green. Relationships are labeled in all caps. This is mostly an implementation issue, but I thought I would touch on it. Having lots of different node types (Plain Old Java Objects wrapping the underlying nodes) is only useful if they are each storing lots of different information. Because as planned, mine weren't (I brought most of the properties out to make them into nodes), it made more sense to kill the proliferation of wrapper nodes. It certainly cleaned up the implementation a lot.
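As a rough illustration of the single-node-type idea, here is a small in-memory sketch. It deliberately does not use the Neo4j API; GraphNode, SubreferenceDemo, and the relationship names IS_A and SUBREFERENCE are all made up for the example, and the real labels are the ones in the diagram.

```java
import java.util.ArrayList;
import java.util.List;

/** The single node type; a node's kind comes from its relationships. */
class GraphNode {
  final String name;
  // Outgoing edges as parallel lists: edge i is (types[i], targets[i]).
  private final List<String> types = new ArrayList<>();
  private final List<GraphNode> targets = new ArrayList<>();

  GraphNode(String name) {
    this.name = name;
  }

  void relateTo(String type, GraphNode target) {
    types.add(type);
    targets.add(target);
  }

  /** True if this node has an edge of the given type to the target. */
  boolean isRelatedTo(String type, GraphNode target) {
    for (int i = 0; i < types.size(); i++) {
      if (types.get(i).equals(type) && targets.get(i) == target) {
        return true;
      }
    }
    return false;
  }
}

class SubreferenceDemo {
  // Hypothetical relationship name, chosen only for this sketch.
  static final String IS_A = "IS_A";

  /** Creates a plain node and marks it as a person via the subreference node. */
  static GraphNode newPerson(String name, GraphNode peopleSubref) {
    GraphNode person = new GraphNode(name);
    person.relateTo(IS_A, peopleSubref);
    return person;
  }
}
```

Asking "is this node a person?" then becomes a relationship check against the "People" subreference node instead of an instanceof test on a wrapper class.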

All right, so that is up at http://code.google.com/p/hackystat-analysis-socnet/
It's shiny and commented, too.

The Server

Also this week, the Server work began. I watched the screencast on the design of the sensorbase, and the one on building it from sources...

I built the Sensorbase from source this week. This was actually very easy and pleasant. (I was pleasantly surprised!) Yay, Ivy integration! There was an issue with JavaMail and JAXB, but that was ironed out very easily. JavaMail only comes with Java 6 Enterprise Edition, or something like that.

I edited the SensorBase code so that when you ping it, it responds with a Hello World. This did not require much change, actually--I just removed a lot of code from the Server class, added a HelloPing resource, and borrowed code from the Client class to test and make sure it was working. I will put the changed code up on my google project page, though it's not terribly exciting.

Now, Philip says that developing an API really informs the design of a server, so I have the first draft of that. I based it somewhat off the Sensorbase API. I would very much appreciate feedback on the API. I think I hit on most of what was needed, but could totally have left things out. This will go in my google project's wiki as soon as I learn wiki markup for tables. Currently, it's in the downloads section.

GET {host}/nodetypes Returns a list of all of the node types in the graph
GET {host}/nodetypes/{nodetypename} Returns a representation of the named node type
PUT {host}/nodetypes/{nodetypename} Create a representation of the named node type (admin)
DELETE {host}/nodetypes/{nodetypename} Delete the named node type (admin)
GET {host}/relationshiptypes Returns a list of all the relationship types in the graph
GET {host}/relationshiptypes/{relationshiptypename} Returns a representation of the named relationship
PUT {host}/relationshiptypes/{relationshiptypename} Create a representation of the named relationship type (admin)
DELETE {host}/relationshiptypes/{relationshiptypename} Delete the named relationship type (admin)
GET {host}/clients Returns a list of all of the clients using the server
PUT {host}/clients/{client} Returns a representation of the named client
POST {host}/clients/{client} Updates the representation of the named client
DELETE {host}/clients/{client} Deletes the named client
GET {host}/nodes Returns a list of all the nodes in the graph
GET {host}/nodes/{nodetype} Returns a list of all the nodes of that type in the graph
GET {host}/nodes/{node}/ Returns a list of all of the nodes connected to the named node
GET {host}/nodes/{node}/{relationshiptype}/{relationshipdirection} Returns a list of all of the nodes connected to that node by the specified relationship and relationship direction
GET {host}/nodes/{node}/{nodetype} Returns a list of all of the nodes of the specified type that are connected to the named node
GET {host}/nodes/{node}/{relationshiptype}/{relationshipdirection}/nodes?startTime={tstamp}&endTime={tstamp} Returns a list of all nodes that were connected to the named node by the specified relationship and relationship direction during the specified time period
GET {host}/nodes/{node}/{nodetype}/nodes?startTime={tstamp}&endTime={tstamp} Returns a list of all of the nodes of the specified type that were connected to the named node by a relationship between the start time and the end time
GET {host}/nodes/{node}/relationships?startTime={tstamp}&endTime={tstamp} Returns a list of all of the relationships that were connected to the named node during the specified time interval
PUT {host}/nodes/{node} Creates a representation of the named node
POST {host}/nodes/{node} Updates the representation of the named node
DELETE {host}/nodes/{node} Delete the named node
GET {host}/people Returns a list of all of the people nodes in the graph
GET {host}/people/users Returns a list of all of the people nodes in the graph who are users of the system (ie, those who have added the facebook/twitter apps and are submitting data)
GET {host}/people/nonusers Returns a list of all of the people nodes in the graph who are not users of the system (ie, those that are, for instance, friends of users, but are not users themselves.)
GET {host}/people/users/{user} Returns a representation of the named user
PUT {host}/people/users/{user} Creates a representation of the named user (admin)
POST {host}/people/users/{user} Updates the representation of the named user
DELETE {host}/people/users/{user} Deletes the named user
GET {host}/people/nonusers/{nonuser} Returns a representation of the named nonuser
PUT {host}/people/nonusers/{nonuser} Creates a representation of the named nonuser (admin)
POST {host}/people/nonusers/{nonuser} Updates the representation of the named nonuser
DELETE {host}/people/nonusers/{nonuser} Deletes the named nonuser
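To show how a client might address one of the time-bounded calls in this draft, here is a small sketch that builds the URI for the relationships query. The class name SocNetUris, the host and port in the usage note, and the timestamp format are assumptions for illustration; the draft does not pin any of them down. It also assumes node names are already URL-safe.

```java
import java.net.URI;

/** Hypothetical helper for building URIs from the draft API table. */
class SocNetUris {
  /**
   * Builds GET {host}/nodes/{node}/relationships?startTime=..&endTime=..
   * Assumes host has no trailing slash and node needs no escaping.
   */
  static URI relationshipsBetween(String host, String node,
                                  String startTime, String endTime) {
    return URI.create(host + "/nodes/" + node + "/relationships"
        + "?startTime=" + startTime + "&endTime=" + endTime);
  }
}
```

For example, relationshipsBetween("http://localhost:9876", "rachel", "2009-07-01T00:00:00", "2009-07-06T00:00:00") yields a URI whose path is /nodes/rachel/relationships, with the two timestamps in the query string.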



This Coming Week:

The SocNet Server is top priority. I will be modeling it after the Sensorbase, with a couple of changes. I intend for resources to be loaded from a config file, for one.

Authentication is somewhat of an issue: each client sending data will have a username and password, but what about getting data? There is no simple way for me to ensure that a user who has added the facebook app is the same one who is requesting data. So that is currently an issue that I am not sure how to resolve.

To get a better idea of what I am talking about, here is another picture:



The issue is that there is (currently) no overarching user registration plan. Perhaps such a thing should be added.

Sunday, July 5, 2009

Am writing a REST API specification for the socnet server.

Saturday, July 4, 2009

GSOC 2009: Hackystat and the Hello World Server Ping, and other adventures in Servers

Today I made the Sensorbase server code respond to a ping with "Hello, World!"

Basically, this just required adding another resource, in this case, HelloPingResource. I modeled it fairly closely after the PingResource, but I didn't use the client to check to make sure that it had worked. I swiped code from the SensorBaseClient isHost() method to both perform the ping and receive the response, but put it in a separate test class for the HelloPing. I also wrote a JUnit test (my first!), which ran (I think) when I ran ant -f junit.build.xml, but there was so much junk in that output that I couldn't find my little message in all of it.

I've also got the beginnings of an API framework for this server critter that I am building.

Friday, July 3, 2009

Built the sensorbase from source on three different computers today. The only difficulty I ran into is that in some versions of Ubuntu, the installed Java is not Java Enterprise Edition, and therefore doesn't include javax or javamail. However, those things were not difficult to remedy. I have also found the part of the server that responds to a ping!

Tuesday, June 30, 2009

GSOC: Hackystat June 22-30

A slightly belated blog entry, for those of you who were waiting with bated breath.

This Previous Week:

Was not as productive as I hoped. I had taken a week off in early June to study like a madwoman for a Chemistry CLEP test, but after failing the practice test hardcore, decided that a little more time was necessary so that I could actually graduate. So, much of last week was spent living like a hobo, perched on the midden heap that was a corner of my sofa, reading the entire chemistry textbook from MIT. Fortunately, I passed, so now I get to graduate.

Here's what I did get squeezed in:

Neo Transalpine Database

This is mostly finished! I think I redesigned it at least six times, but at the very least, the bones are there. The big thing left to do for it is make an easy way to retrieve information, but since this supposedly is what Neo shines at (traversals and so forth), I think we should be good.

Later this evening I will post a diagram of the design. I decided to go the way of making everything possible a node instead of an attribute. Potentially, this will come back to chew on my butt, but for the time being it made for some really beautifully simple implementation.

My one concern with that path is that I ended up with quite a few classes that did not have much substance. All of the classes extend the abstract SocialMediaNode class, which houses the underlying node and has accessors and mutators for the only property I am requiring all nodes to have, that is, a name. Some of my classes (for instance, the Coder class), have quite a few more properties than just a name. However, there are probably an equal number (perhaps slightly more) that are just sort of wrappers for the abstract superclass, consisting only of a constructor that passes a value to the superconstructor. I did this so it would be easier to distinguish between types of nodes and so the database creation code would be more straightforward, but I do not know that it was the correct decision.

The Facebook App

Yeah, I know I said that I wasn't going to work on this this week. But, after a long discussion with Zack about the server issue, I decided that the Facebook application was going to require the most specific things from the server (it's actually the only thing that needs to be hosted) and would therefore be the most pressing.

Now, my original plan was to build the app using Joyent as my host and then transfer it to another host later, but this is not going to work. Joyent does not provide free hosting for Facebook applications in Java, only Facebook applications in PHP and (I think) Ruby. Now, I have written in Ruby, but it's been almost three years and at this point in the summer I think it would be unwise to take on something else. I searched for other free hosting, and have as yet found none. Lame.

So, I started setting up one of Zack's desktops as a host. So, I registered with DynDNS, and installed a dyndns updater. Glassfish is installed and configured. Whee.

Other things from this week

The structure of this whole thing changed again. At last update, I was planning to host the application that lets users release their Hackystat data to me for mining. However, I decided that it was probably most secure to make that application something they download to their own computer, as then it can access sensorshell.properties and I don't have to securely store authentication information. This also makes this application significantly simpler to write, which is pleasing.

Next Week:

I am a trifle behind (I think I had planned to have the database ready to roll by today), but it should be ready in the bat of an eye span. (I would love to say tomorrow, but if I say tomorrow, overnight I will decide that I need to completely refashion it again and that would automatically kill the database contentedness.) Besides, after last week I've forgotten what fun and leisure are like, so I foresee just oodles of productivity.

  • Okay, so database.
  • Twitter application. Do we still want this to be sending updates to the sensorbase?
  • After those two, the downloadable "Let Rachel Access Your Hackystat Data for Mining" application.
  • Also, database commenting. Right now it is naked, and it isn't super hard to understand, but then again, I just wrote it. So commenting it before I forget what I was thinking is a good idea.
  • And the server stuff. Philip was so kind as to make a number of screencasts for me, and so I really ought to watch them.
  • I suppose I should also actually upload my code to my very lonely Google project site.

Vision

I admit that I am a little concerned about where this whole thing is going. My original plan was to have stuff collecting data by now, so that I could do the data mining of it for a graduate independent study in July.

However, I'm not sure that was ever a reasonable goal, particularly considering, you know, wanting to really test things before I unleash them on the computers of unsuspecting users. While I feel that I can certainly have everything in place, all of the components built, by the end of the summer, I'm not sure that I will have models ready to be queried yet. We laid out our goals for Hackystat, but what are Hackystat's goals for us?

Other things that I have been wondering about: let's say, for instance, that we decide we do want to store some Twitter instances in the Sensorbase. At what point is the decision made, "Yeah, okay, this is not going to blow things up/eat Tokyo/create infinite lolcats and can send data to the sensorbase for reals"?

Technical Difficulties

Eclipse is driving me crazy with its slowness, its utter refusal to clean build when I tell it to, and the otherwise batspit insane errors it's been giving me. It is also making me into the whiniest coder ever, which I am sure everyone who follows my twitter is tired of. (But yes, I do accept cheese with my whine!)

Philip suggested that the slowness might come from the eclipse sensor and its communications with Hawaii. This seemed reasonable, so I took his advice and uninstalled the plugin. This helped somewhat, but I'm not sure it helps enough to justify the dearth of data.

It also (sadly) did not fix the other bizarre errors I've been getting... Like, today, in which Eclipse spent an hour insisting that I couldn't implement an interface because it "wasn't an interface". It WAS an interface. Usually when one gets errors of this nature, a clean build fixes it... But several clean builds later... same error. This did not make sense to me. I added in a couple of syntax errors, which it caught, and then I removed them, which it didn't catch, even three builds later. Unfortunately, I don't remember what series of steps I took to finally make it come to its senses and recognize my interface as an interface. I'm going to reinstall the wretched, illiberal thing tomorrow and see if that helps. Or I may just beat my face against my monitor until it is satisfied by the ritualistic dumping of my blood and decides to behave again.

In Other News:

Additionally, this week, I have gained 10 hours of chemistry credit and killed three tomato plants by watering them with milk.

Monday, June 22, 2009

GSoC: Hackystat, June 15-22

This week:

So, after all of the ritualistic mapping from last week, Philip gave me some feedback, and I have been applying rationcinactivity to it all week. This ended with me basically scrapping the earlier setup, and a couple of the earlier ideas.

Ideas that were ritualistically dumped:

Social Media SDT

Why did this hit the scrap heap? Because I had a problem--storing relational data--and a tool by which to solve it, the hammer that is the sensorbase. This led to me beating the problem with the sensorbase hammer in spite of the fact that my problem was not a nail. It was too restrictive for what I need to be able to do, so I am tossing it in favor of Neo4J, which I have fallen in love with this week.

That said, I will still probably be sending some sensor data from Twitter to the sensorbase, but only the most atomic stuff. This would be very similar to a build event, only instead of the result of the build and a timestamp, it would be a "tweeted" and a timestamp. Potentially I will be able to indicate "code-related" and "not code-related". However, that will involve text mining, which is going to take a little while to work up.

Sensors on the User's Computer

Originally, having the Twitter sensor running from the user's computer seemed like a really good way to deal with the secure storage of user authentication details and to avoid me having to set up my own server. However, after I decided that storing the bulk of the social networking data in the sensorbase was not a good solution, it became necessary for me to start running a server to store things, and so the "I don't want to deal with running a server" argument became moot.

That left the issue of secure storage of user authentication details. However, at least in the instance of the Twitter app, this is actually a moot point. I can get all of the information that I need from a Twitter account that is friended by the user I am collecting data on.

How the Twitter App Works

Note: The Twitter app is not quite ready to go live yet, for reasons I will get to in a moment. If you're in a hurry, jump to The Database.

The Twitter sensor will live on my server.

In order for the Twitter sensor to gather information on a user, they have to follow the Hackystat Twitter Sensor's Twitter account, which is HackystatSocNet.

Once they have followed that account, then I can access information about their followers and status updates. Some of that I will store directly in my database--other bits of it I will send to the sensorbase.

It's actually a much simpler setup than I was initially expecting. So, once I iron out The Database, it should get online very quickly.

Google Project

I named my project! I have dubbed it Hackystat SocNet (pronounced "sock net")! Why Hackystat SocNet? It's from Social Networking. It's reasonably descriptive, and further, it means I can have a sock puppet as a logo. Being that I love sock puppets so much that I spent my 21st birthday making them, I think this is fitting. For your viewing pleasure, I have included a rough draft of a logo. (I have a pretty digitized version of this somewhere but can't locate it.)



For some more ideas about how versatile a sock puppet logo can be, imagine a data mining sock puppet accessorized with a hard hat and a pick. Or, check out Alton Brown's bevy of yeast sock puppets--how fun to represent social networking with socks!

At any rate, I have started a google project (not that there's any code there yet.)

It is located here http://code.google.com/p/hackystat-analysis-socnet/ . I will likely end up doing the hackystat model and have a separate project for each sensor, for the data mining part of the application, and then for the final prediction application. But this will be the aggregator of those.

Next week:
The Database

This ended up being the obstacle to getting the Twitter App live this week. I realized somewhat suddenly that I didn't have any good or concrete plans on how to store the data, much less a consistent interface for accessing and storing the data.

I have decided to use Neo4J, which has turned out to be extremely intuitive and nice to use. I am a fan.

I am somewhat struggling with what to store as nodes, what to use as relationships, and what to use as attributes. It all seemed so clear cut in my initial proposal! Most of the things I had planned as attributes might be more useful as nodes themselves. Here is what I am totally sure I will have:

Nodes:

Coder (Implements Human)
Person (Implements Human)
Project
Employer
School

Relationships:

Is Friends with (human to human)
Follows (human to human)
Is Following (human to human)
Worked at (human to employer)
Worked with (human to human)
Contributed to (human to project)
Owns (human to project)
Went to School with (human to human)

I am less sure how to handle interests. Should interests (listed on facebook) be nodes? I mean, the approach of making all of these attributes nodes can be taken quite far, and seems like it might make retrieval easier. For instance, instead of storing birth date as an attribute for a Human object, it could be a node. Every Human object born in 1984 would have an edge connecting that Human node to the 1984 node. Then, retrieving all Human objects born in 1984 would only be a matter of pulling all the nodes connected to the 1984 node by a "Born In" relationship. This could be taken scarily far, to the length of having nodes for days or times. I'm not sure that this would be useful or efficient, but it is what I am thinking about. Thoughts?
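To see why the "everything is a node" approach makes this kind of retrieval cheap, here is a tiny in-memory sketch. It does not use Neo4j at all; BornInIndex and its methods are hypothetical names, and a map from year to names stands in for the year node and its incoming "Born In" edges.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for year nodes with incoming "Born In" edges. */
class BornInIndex {
  // year node -> names of the Human nodes connected to it
  private final Map<Integer, Set<String>> bornIn = new HashMap<>();

  /** Adds a "Born In" edge from the named human to the year node. */
  void connect(String human, int year) {
    bornIn.computeIfAbsent(year, y -> new TreeSet<>()).add(human);
  }

  /** Everyone connected to the given year node; empty if the node is absent. */
  Set<String> humansBornIn(int year) {
    Set<String> humans = bornIn.get(year);
    if (humans == null) {
      return Collections.<String>emptySet();
    }
    return Collections.unmodifiableSet(humans);
  }
}
```

With birth year stored as an attribute instead, answering the same question means visiting every Human node and checking its property; with a year node, it is one lookup followed by a walk over that node's edges. The cost is more nodes and edges to maintain, which is exactly the trade-off in question.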

I am also struggling with how to store the time series of Hackystat data. I want to store like, six or seven standard telemetry series (Build, DevTime, etc) for each hackystat project that is present in the database. Neo4J does not allow arbitrary objects for attributes, which makes storing timeseries more difficult. However, it can store arrays of primitive types and strings, so I am considering arrays of arrays... We'll see.
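One alternative to arrays of arrays is a pair of parallel primitive arrays per series, one for timestamps and one for values. The sketch below is only an illustration: a plain Map stands in for a node's property container, and the class name and the ".timestamps" / ".values" key suffixes are invented for the example.

```java
import java.util.Map;

/** Hypothetical sketch: one telemetry series as two parallel arrays. */
class TelemetrySeriesProperties {
  /** Stores a series under two property keys derived from its name. */
  static void store(Map<String, Object> props, String series,
                    long[] timestamps, double[] values) {
    if (timestamps.length != values.length) {
      throw new IllegalArgumentException("arrays must be the same length");
    }
    // Copy so later changes by the caller don't alter the stored series.
    props.put(series + ".timestamps", timestamps.clone());
    props.put(series + ".values", values.clone());
  }

  /** Returns the value recorded at the timestamp, or NaN if there is none. */
  static double valueAt(Map<String, Object> props, String series,
                        long timestamp) {
    long[] ts = (long[]) props.get(series + ".timestamps");
    double[] vs = (double[]) props.get(series + ".values");
    if (ts == null || vs == null) {
      return Double.NaN;
    }
    for (int i = 0; i < ts.length; i++) {
      if (ts[i] == timestamp) {
        return vs[i];
      }
    }
    return Double.NaN;
  }
}
```

Each property value here is a flat array of a primitive type, so nothing relies on nesting arrays inside arrays. Six or seven series per project would then mean twelve to fourteen properties on the project node.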

This is the biggest project for this week.

The Server

I cannot even describe how much setting up a server is not my area of expertise. I haven't even begun to research how you set up a server. This will be this week's other big project.

Milestone:

Things that I know I will have ready for my first milestone:

1. Relational database
2. Twitter App
3. Hackystat data accessing applet

I am planning to get the relational database ready this week, as all of the others depend on that (the data gathering is pretty simple and straightforward--the rough part is transmitting and storing it.) The Twitter app can be ready within two days of having the database up, so we'll call that next Wednesday. The other pressing order of business is the applet to get data from Hackystat, which will probably take four or five days, bringing us in the timeline to after the 4th of July weekend.

I am hesitant to commit to having the Facebook and Ohloh apps ready for the milestone, as I cannot guarantee that I will have them ready to look at by July 6th. I will leave it open as an option, though, as it's possible that once the database is up and running and I have a consistent interface for dealing with it, then everything will go much more quickly. However, I would like to leave some wiggle room if it does not.

Monday, June 15, 2009

GSoC: Hackystat, June 8-15

Now, with 100% more ritualistic mapping!

This week I spent lots of time on Vision. I wasted a lot of time wondering how in the world I was going to make it easy for the users to run learning algorithms, and if I should server-base it or have them download their own personal little learner... and then I realized that they won't be interacting with the learning algorithms at all.

Here's what (I think) will happen.

The user will download the Twitter and Ohloh sensors, much the way they download Eclipse and other sensors. Those sensors will do an initial dump of sensor data to the sensor base. This initial dump will include followers, following, contributors, etc. Then, those sensors will basically be sleeper threads that will wait patiently until something at Twitter or Ohloh changes, at which point it'll transmogrify that into XML (just like it did before) and send that on to the sensorbase.

That's pretty straightforward. The Facebook sensor is a little different because it doesn't get to live on the user's computer. It has to be hosted. Right now I've set up Joyent hosting for it, as it's free. However, it's only free for a year, and you have to have over 50 users or it either stops being free or they delete your program wholesale. (I can't remember which. Either is kind of lame.) However, other than the fact that it has to be hosted somewhere, it's basically the same deal.

All right, so after I have all that pretty data going to the sensor base, I will somehow download all of it. (I need to email the dev. list about that.)

After downloading all of it, I will run as many different algorithms as I can think of to generate models. This will probably actually be done on Zack's super computer, as I shudder to think how long it would take on my laptop.

Once I have those models, I will build an interface for them so that the user can enter their hackystat user name, it will build a graph from the nodes and edges connected to that user, and then make predictions based on whatever the model says about the graph.

That was a lot of words, so I have also ritualistically mapped the components for the more visual learners in our audience. (Click on this to make it larger and legible.)


This is by no means set in stone, folks. I've kept the information gathered somewhat minimal--it can be expanded pretty easily. The other thing that can be changed is the location of the Social Network Query Program. To me, it seems somewhat like the analysis services that Hackystat already has on the servers, so it seems like it fits there. However, it could also easily be made as something to download and run locally. I open this for debate.

Hackystat as Administrator!

I started getting the Hackystat services running locally this week. (I was hoping to be sending data to them, too, but I'll get to that in a minute.) I had a LOT of trouble with the DailyProjectData service...for no perceivable reason.

Here's what happened:

I downloaded it from the DPD project site, and had it all set up to run. My sensorbase was humming away, my properties files were all in order. So I ran it and BAM! Null Pointer Exception.

Null pointer exception? Really? Specifically:

Exception in thread "main" java.lang.NullPointerException
at org.hackystat.dailyprojectdata.server.Server.disableRestletLogging(Server.java:104)
at org.hackystat.dailyprojectdata.server.Server.newInstance(Server.java:91)

I dug into the source code, and the line that appears to be throwing it is actually library code.

It's the line highlighted in red in this method:

/**
 * Disable all loggers from com.noelios and org.restlet.
 */
private static void disableRestletLogging() {
  LogManager logManager = LogManager.getLogManager();
  for (Enumeration e = logManager.getLoggerNames(); e.hasMoreElements() ;) {
    String logName = e.nextElement().toString();
    if (logName.startsWith("com.noelios") ||
        logName.startsWith("org.restlet")) {
      logManager.getLogger(logName).setLevel(Level.OFF);
    }
  }
}
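A hedged aside: LogManager.getLogger is documented to return null when no matching logger is found, so one way a method like this can hit a NullPointerException is by calling setLevel on that result without checking it. Whether that is what happened here is only a guess. The sketch below is a defensive variant under that assumption, not the Hackystat fix; the class name RestletLoggingGuard and the int return value are additions for illustration.

```java
import java.util.Enumeration;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

/** Hypothetical null-safe variant of the method quoted above. */
class RestletLoggingGuard {
  /**
   * Turns off all loggers from com.noelios and org.restlet.
   * Returns how many loggers were actually disabled.
   */
  static int disableRestletLogging() {
    LogManager logManager = LogManager.getLogManager();
    int disabled = 0;
    for (Enumeration<String> e = logManager.getLoggerNames();
         e.hasMoreElements();) {
      String logName = e.nextElement();
      if (logName.startsWith("com.noelios")
          || logName.startsWith("org.restlet")) {
        Logger logger = logManager.getLogger(logName);
        if (logger != null) { // getLogger may return null
          logger.setLevel(Level.OFF);
          disabled++;
        }
      }
    }
    return disabled;
  }
}
```

The only behavioral change from the quoted method is the null check, which skips a name whose logger can no longer be looked up instead of throwing.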

Granted, I'm not exactly sure what identifying the source of the problem was going to do for me, since I am not supposed to compile it from the sources (for which I am somewhat grateful).

Anyway, I asked Aaron about it. I checked to make sure I was using the right versions of the necessary libraries (I seemed to be.) Finally I noticed somewhere a suggestion to always use the DPD jar that is distributed with the services package.

I tried this, and it worked like a charm, right out of the box. Why does it seem that so much of programming is filled with these elusive problems that vanish for seemingly no reason at all?

After I started using only the ones that were distributed in a nice package together, I had no more trouble. Everything else ran like melting goat cheese on roasted nectarines, in other words, great.

Twitter App

The Twitter Sensor is in progress. I'm having some trouble with the library I decided to use, in that it doesn't seem to work at all. I've only been tinkering with it for a few days, though, so it's possible I've just missed some crucial detail. This is unfortunate, because from the look of things, this should code up tickety-boo. However, I can't even get their examples to run. Splendid!

Data Storage

Two weeks ago I discussed a Social Media SDT. Aaron thinks (and I agree) that this is the best way to handle the data I'm going to be collecting. My only beef with it is the resource field. All of the sensor data has to be associated either with a Hackystat user or a Hackystat project. While that is not a huge deal, it is somewhat restrictive.

Side note on the SDT: I feel like the best documentation of the SDT stuff is actually in the REST API. However, there seems to be some distinction between SD and SDT that I do not get and find tremendously confusing. I have reread and reread that section and still have no clue.

I am also not exactly sure how to fit the less dynamic attributes into the SDT. One solution is to structure all of the attributes like relationships. For example, "User Q is interested in herpetology" is a Facebook interest structured as a relationship. That would make herpetology a node in the graph and "is interested in" an edge. I feel like this would add unnecessary complexity, but it could also make it interesting. I'm not sure how to toy with that until it works.
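To make the "attributes as relationships" idea a bit more concrete, here is a rough sketch in plain Java. It is not code from the project and has nothing to do with the real Hackystat API; the names `SocialGraph`, `Edge`, `addEdge`, and `neighbors` are all hypothetical, made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: every attribute is stored as a labeled edge between two nodes. */
final class SocialGraph {

  /** A directed, labeled edge such as (User Q) -[is interested in]-> (herpetology). */
  static final class Edge {
    final String from;
    final String label;
    final String to;

    Edge(String from, String label, String to) {
      this.from = from;
      this.label = label;
      this.to = to;
    }
  }

  // Outgoing edges keyed by source node; insertion order is kept for readable output.
  private final Map<String, List<Edge>> outgoing = new LinkedHashMap<>();

  /** Adds an edge; the nodes are created implicitly by being mentioned. */
  void addEdge(String from, String label, String to) {
    outgoing.computeIfAbsent(from, key -> new ArrayList<>()).add(new Edge(from, label, to));
  }

  /** Returns the targets reachable from a node over edges with the given label. */
  List<String> neighbors(String from, String label) {
    List<String> result = new ArrayList<>();
    for (Edge edge : outgoing.getOrDefault(from, Collections.<Edge>emptyList())) {
      if (edge.label.equals(label)) {
        result.add(edge.to);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    SocialGraph graph = new SocialGraph();
    // An interest and a friendship look exactly the same to the graph.
    graph.addEdge("User Q", "is interested in", "herpetology");
    graph.addEdge("User Q", "friend", "User R");
    System.out.println(graph.neighbors("User Q", "is interested in"));
  }
}
```

The upside of this shape is that interests, friendships, and anything else all go through the same `addEdge` call. The downside is the one mentioned above: every static attribute turns into an extra node, so the graph grows quickly.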

The other solution, one that I am currently leaning towards, is ditching the idea of trying to back the data so closely with the Hackystat system. Instead, I would download as much of the data as I can get my grubby little paws on and reconfigure it into a format that allows me to have whatever kind of objects I need to have (as opposed to simply people and projects.) I could potentially use the nifty database thing that Philip sent across the list a few days ago.

Guidance on this would be good. I will probably pose this idea to the list. I just need to pare it down so it's not long, rambly, and incoherent. You know, so that it's actually clear that I'm posing a question.

Google Project Setup

So I started this (it's easy!) but I realize that I have NO IDEA what I should name this crazy contraption. I thought "hackystat-social-media-extravaganza", and then thought it might not be professional or descriptive enough (and then realized I probably shouldn't be allowed in public). It can't be changed, so it should probably be good from the start. "hackystat-social-network-predictive-widget". I have no idea. Does Hackystat have component naming conventions that I missed somewhere? Suggestions?

All right, that about wraps up this week's work.

For the coming week!
Twitter app!
More decisions about storage and hosting
Maybe a Facebook app, too!

Remember, guys and dolls: the open sesame is printed also on the fire.

Eclipse just bit it hardcore. More hardcore than I have ever experienced an IDE crash before.

Sunday, June 14, 2009

The dailyprojectdata service is running!!! Clearly, collard greens solve all the world's problems.

You know, even if I can't get the dailyprojectdata service to run locally, I am so in love with life right now. Tummy full of greens. yay!

Thursday, June 11, 2009

Null pointer exception in the daily project data thingy I was attempting to run locally. Line 104 in Server.

Okay, the sensorbase is running locally... Now what?

Bless google for helping me with my ignorance.

Wtf is an SMTP server?

Monday, June 8, 2009

GSoC: Hackystat June 1-8

I have been almost completely unproductive this week. It was the last week of my intersession class and I was studying for a CLEP exam, so that took up basically all of my time.

This week will hopefully be much more productive.

In addition to picking up what I was supposed to be working on last week, it looks like a paper of interest to me just went across the list, so I will be investigating that.

Monday, June 1, 2009

GSoC: Hackystat, May 25-June 1

This week, by and large, has focused on Hackystat, Developer Style! This has involved a lot of reading and planning. I have been anxious to start pounding out code, but after the reading I have done this week, anything beyond simple system testing seemed a little premature.

This week, I have:
1. Watched developer screencasts, including the new hackystat-developer-example.

I sometimes feel that Hackystat is likely to spoil us with its user-friendliness from an open source development standpoint. I feel confident that most open source projects don't have handy little video tutorials. However, I am not at all complaining! They are immensely helpful.

I really like the ivy integration a lot. I didn't have an iota of trouble with the developer example.

2. Contemplated issues of hosting, data retrieval, and storage.

The bulk of my mental energy went to this problem this week. There are several issues involved here. The first: what data am I allowed to access to mine, and is there a way I can get it that isn't on a per-user/per-project basis? Aaron says there is a way to do it without having to know a user's name and password. (I was contemplating an approach in which users interested in including their data in the mining project would download my app, which would use the info in their sensorshellproperties file to access their data, but this has the distinct disadvantage of requiring effort from the user, which may significantly limit the amount of data I have access to.)

So that's the data retrieval issue. Now for the hosting question.

The Facebook application has to be hosted. Joyent is offering free hosting, but only for a year, and only for applications that have more than 50 users. So that might be pushing it for our purposes. I would potentially like to host it on the Hackystat server. Aaron has asked for a more detailed specification of what the application will do, at which point we will be discussing it with The Powers that Be.

Storage is related to hosting.

It occurred to me while I was trying to pin down a design that I would definitely need to be storing all of this information somewhere. Aaron suggested creating a generic SocialMedia SensorDataType to do this.

3. Read boatloads of Facebook developer documentation

Which, to my disappointment, is not quite as easy to work with as the Hackystat documentation, or as pleasant and well-organized as the Java API. (Like I said, I'm getting spoiled.)

I am still deciding what language to use for the Facebook app. FB seems to lean towards PHP. I've had some experience with Ruby on Rails, so I'm thinking about using it instead. The initial Facebook application is not a terribly complex creature, in my head. Basically, it asks your permission to access your FB profile information, friends, etc, and then it takes that information, bundles it into the SDT, and sends that to Hackystat. I suppose it's really very little different than other development sensors, other than that the events are related to friends and interests.

4. (Somewhat unrelated) Solved the environment variable issue that was causing me such grief.

See this entry.

5. Started thinking about a potential SensorDataType for social networking data.

I'm having some difficulty coming up with something generic enough to cover Ohloh, Twitter, and Facebook relationships and attributes. Not to mention that Aaron suggested adding SVN or mailing list relationships (an idea that I freaking LOVE, but am not sure how to implement). The SDT needs to be generic enough to be easily expanded on. I can come up with many different additions to this idea, so I want this to be really easy to extend. For me, the most natural way to represent this data is two objects and a relationship. Initially, I thought that didn't exactly jibe with the key/value setup of the SDT. However, now, I think it could work reasonably well. Something along the lines of:

SDT            Key              Example Values
Social Media   Object1          “Person”, “File”, “Bug”, “Project”, etc.
               Object1ID        Some unique integer id
               Object2          “Person”, “File”, “Bug”, “Project”, etc.
               Object2ID        Some unique integer id
               Relationship     “Friend”, “Follows”, “Contributes to”, “Edited”, etc.
               RelationshipID   Some unique integer id
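
To check that two objects and a relationship really do flatten into key/value pairs, here is a minimal sketch in plain Java. It doesn't touch the actual Hackystat sensor API; the class name `SocialMediaEvent` and the method `toProperties` are hypothetical, the ids in the example are made up, and only the key names come from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one relationship between two objects, flattened to key/value pairs. */
final class SocialMediaEvent {
  private final String object1Type;
  private final long object1Id;
  private final String object2Type;
  private final long object2Id;
  private final String relationship;
  private final long relationshipId;

  SocialMediaEvent(String object1Type, long object1Id,
                   String object2Type, long object2Id,
                   String relationship, long relationshipId) {
    this.object1Type = object1Type;
    this.object1Id = object1Id;
    this.object2Type = object2Type;
    this.object2Id = object2Id;
    this.relationship = relationship;
    this.relationshipId = relationshipId;
  }

  /** Flattens the event into string properties, using the keys proposed in the table. */
  Map<String, String> toProperties() {
    Map<String, String> props = new LinkedHashMap<>();
    props.put("Object1", object1Type);
    props.put("Object1ID", Long.toString(object1Id));
    props.put("Object2", object2Type);
    props.put("Object2ID", Long.toString(object2Id));
    props.put("Relationship", relationship);
    props.put("RelationshipID", Long.toString(relationshipId));
    return props;
  }

  public static void main(String[] args) {
    // Person 17 follows Person 42; relationship instance 1.
    SocialMediaEvent follows = new SocialMediaEvent("Person", 17, "Person", 42, "Follows", 1);
    follows.toProperties().forEach((key, value) -> System.out.println(key + " = " + value));
  }
}
```

Adding a new source (say, a mailing list) would then only mean new values for the type and relationship strings, not new keys, which is the kind of easy extension I'm after.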

Vision

I was informed that I am working on a vision document, which was news to me, particularly after being told that I already had such a thing. I suppose that might be the original proposal.

What I had not included in my original proposal (in which I was mostly excited about the immediate data-mining prospects), but that I now definitely see as something worthwhile, is making sure that the setup I'm working on is 1. easily extensible and 2. easy to query for future projects as they arise. Aaron's idea of making the social media sdt really fits in nicely with that.

Questions that I have:

What kinds of questions should be directed to the dev list?

Direction for the coming week:
  • Start a google code project, even if I don't actually have any code yet.
  • Run a couple of toy Facebook applications on Joyent hosting to get a better sense of the requirements of my Facebook app, and what hosting it will require.
  • Develop Social Media SDT
  • Read Twitter API
  • Play some more with hackystat as a user (Hackystat, user style!)
  • Get hackystat services to run locally (Hackystat, Admin style!)
You know, they say that minutes of planning save hours of coding. Hoping that turns out to be right.

Hours spent this week: 17