Sunday, August 23, 2009

GSoC: Hackystat August 17-23 and A Summer in Review

Whew. Well. Where to start?


The Last Week

This week was something of a trial. It seemed like every time I fixed a problem, that solution would cause a cascade of new problems. But, I FINALLY have the Hackystat sensor up and running. It's available as a featured download on the project page.

The biggest issues that I ran into were a result of the solution that I applied to the XML problem from last week. Because I had my own version of the Telemetry XML objects, which were identical to the Hackystat Telemetry objects but in a different package, getting things from the TelemetryClient into a form that I could send to my database was a real trial. It produced a lot of angst and some decidedly unattractive code. However, my decidedly nasty code could be ditched permanently if the Hackystat schemas were given namespaces and prefixes. That would be awesome!

The other problem that I had relating to the TelemetryClient was really strange. getChart (the one that doesn't take extra parameters) didn't work. So, I had to dig up the default parameters for all of the charts that I needed and use the getChart method that did take extra parameters. Thanks to Shaoxuan for all his help debugging that problem!

I've saved all the default parameters for the charts I'm using as String constants, so they can be accessed easily.

Another note: it was significantly non-obvious to me that the Telemetry service runs on a different port on the server, and that one would need to specify this. I'm sure it's in the documentation somewhere, but where is it?
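A minimal sketch of the port issue, assuming each service lives on the same machine under its own port and path. The class and method names are hypothetical, and the ports 9876 and 9878 used in main are assumptions for illustration only; the real values come from the server's configuration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: derives the URL of one service from the URL of another
// by keeping the scheme and host and swapping the port and path.
final class ServiceHosts {

    static String forService(String baseUrl, int port, String serviceName) {
        try {
            URI base = new URI(baseUrl);
            // 7-argument constructor: scheme, userInfo, host, port, path, query, fragment.
            return new URI(base.getScheme(), null, base.getHost(), port,
                    "/" + serviceName + "/", null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Bad base URL: " + baseUrl, e);
        }
    }

    public static void main(String[] args) {
        // Assumed example values, not taken from the Hackystat documentation.
        String sensorbase = "http://localhost:9876/sensorbase/";
        System.out.println(forService(sensorbase, 9878, "telemetry"));
    }
}
```

Keeping the port as an explicit parameter makes it harder to forget that the Telemetry client needs a different host string than the SensorBase client.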

Finally, I realized at some point yesterday that I hadn't covered a couple of the cases for the Hackystat client, such as computer crashes and so on, so I had to add those in. In other nasty surprises, as I suspected, the lack of Findbugs and PMD errors was too good to be true. So I'm not quite ready to be pulled into the Hackystat server umbrella.

What I didn't accomplish this week, that I had really hoped to accomplish, was the visualization tool. It's still next on the roster, but it doesn't get to be under the GSoC hat, I suppose.


The Summer as a Whole

This project became so much more gargantuan than I had intended! Looking back on the time line that I wrote at the beginning of this whole thing makes me laugh. Knowing what I know now, I might have come up with something considerably less insane!

It ended up being sort of a trial by fire! I certainly didn't expect to be standing at the end of the summer, having put together my own server! (The sensorbase was a good template to work off of, definitely!) The whole thing just ended up being so much larger and more complicated than I anticipated.

That, maybe more than anything, is what I have learned from this experience. Actual use by real people in the real world complicates even simple tasks. Really, until this summer I've been working on mostly toy projects. Nothing huge, and nothing really meant to be used by people. When making something that is supposed to be used, there are so many more things that one has to consider than I am accustomed to considering; the end result, then, is that even simple things (like pulling data from one server and sending it to another server) can spiral endlessly into complication.

Stopping that complication from overwhelming the project (and the coder!) is a really important thing, I think. It's somewhat difficult when one is sort of programming in an echo chamber of one's own mind. Because the GSoCers were all working on individual projects, it was somewhat more difficult for us to bounce ideas off one another. I feel that the current release of my project would be of significantly higher quality if I had spent more time communicating my ideas and having the designs vetted as I went. I say this because I was describing a bug to an unrelated third party, and he was like, "WTF were you thinking, designing that this way?" Coming from a place of such inexperience with real software development, it's hard to know if you're making a good design decision. How do you choose the right way when you don't know the wrong way? So I know now to take better advantage of my available resources. Perhaps--gasp, the horror!--request a code review? Spend more time talking design with my mentor? All of these things!

Beyond the new understanding of how much time things actually take, and how important collaboration is to the end product, I also learned a metric boatload of new skills and tools. JAXB, for instance, is still, to my mind, the coolest thing since the internet. Java Property files are also really awesome. Not to mention suddenly understanding how the internet really works. I mean, sure, I've done some very low level networking stuff. I've implemented go-back-n, etc. But one day after getting the SocNet server up and running I abruptly realized that when you call for a web page, it's a GET request! Some server somewhere returns a representation of the HTML to you! That was a very exciting moment for me.

What I Want to Do Now

I would like to take a little break from the SocNet project, now that GSoC is over. But not a break from Hackystat! I would like to get a NetBeans sensor up and running. I know one was tried in the past, but that was several years ago and I would like to give it another shot. Eclipse and I parted ways mid-summer, so I've been missing out on some of the Hackystat data that I would really like to have had. I think I might have figured out a way to jury-rig such a sensor, at least, using the hints capability.

After that, back to SocNet! The next two big chunks of it that need doing are the Ohloh sensor and the visualizations/analysis tool. I may do the Ohloh sensor first--the Hackystat sensor taught me that anything run from a user's computer (i.e., one that I do not have direct control over) is much more difficult and painful. After this week, I would gladly accept a slightly less painful task!

I would also like to start work on the sensor discussed over the list, which crawls a repository to determine how familiar a coder is with a particular concept. It was suggested as a Master's thesis--perhaps I can make it mine?


A Word of Thanks

Now that the summer has drawn to an end, I would like to thank all of the Hackystat Hackers who helped me through my first GSoC! Special thanks to Aaron, Philip, and Shaoxuan, without whom I might never have survived.

Tuesday, August 18, 2009

GSoC: Hackystat August 10-August 17

This week

Hackystat app, now and forevermore.

REST API support for the Hackystat App is up and running.

Had a lot of trouble with XML. I wanted to write a complex type that contained TelemetryStreams as elements. Something along these lines:

<xs:element name="XMLContributesToRelationship">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Type" minOccurs="1" maxOccurs="1"/>
      <xs:element ref="ID" minOccurs="1" maxOccurs="1"/>
      <xs:element ref="StartTime" minOccurs="1" maxOccurs="1"/>
      <xs:element ref="EndTime" minOccurs="0" maxOccurs="1"/>
      <xs:element ref="XMLNode" minOccurs="2" maxOccurs="2"/>
      <xs:element ref="TelemetryStream" minOccurs="9" maxOccurs="9"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

But I couldn't figure out how to include the telemetry definitions in the file. There were a lot of namespace problems. The full rundown is available on the dev list, but I shall repeat the solution I arrived at.

My final solution:

1. Give telemetry.resource a namespace and a prefix. Append the prefix where necessary to elements and complex type definitions in telemetry.resource.
2. Give my schema a namespace and a prefix. Append the prefix where necessary to elements and complex type definitions in my schema.
3. Import telemetry.resource.
4. Drop the element declaration of TelemetryStream in my schema.

I'm not sure which of these are necessary and which are superstitious fluff, but at least I got it working. I know you can import things without a namespace, but I couldn't make that fly with this.
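The four steps can be tried out in miniature with the JDK's built-in schema validator. This is only a sketch under stated assumptions: the two schemas below are tiny stand-ins (not the real telemetry.resource or SocNet schemas), and the namespace URIs urn:example:telemetry and urn:example:socnet, the prefixes, and the class name are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

final class NamespaceImportSketch {

    // Step 1: the stand-in telemetry schema gets its own target namespace.
    static final String TELEMETRY_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:telemetry' elementFormDefault='qualified'>"
        + "<xs:element name='TelemetryStream' type='xs:string'/>"
        + "</xs:schema>";

    // Steps 2-4: the stand-in SocNet schema has its own namespace and prefix,
    // imports the telemetry namespace, and does not declare TelemetryStream itself.
    static final String SOCNET_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " xmlns:s='urn:example:socnet' xmlns:t='urn:example:telemetry'"
        + " targetNamespace='urn:example:socnet' elementFormDefault='qualified'>"
        + "<xs:import namespace='urn:example:telemetry'/>"
        + "<xs:element name='ID' type='xs:string'/>"
        + "<xs:element name='XMLContributesToRelationship'><xs:complexType><xs:sequence>"
        + "<xs:element ref='s:ID'/>"
        + "<xs:element ref='t:TelemetryStream' maxOccurs='9'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Returns true if the document validates against the two schemas combined. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            // The imported schema is listed first so the t: reference can be resolved.
            Schema schema = factory.newSchema(new Source[] {
                new StreamSource(new StringReader(TELEMETRY_XSD)),
                new StreamSource(new StringReader(SOCNET_XSD)) });
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String doc = "<s:XMLContributesToRelationship xmlns:s='urn:example:socnet'"
            + " xmlns:t='urn:example:telemetry'><s:ID>42</s:ID>"
            + "<t:TelemetryStream>DevTime</t:TelemetryStream>"
            + "</s:XMLContributesToRelationship>";
        System.out.println(isValid(doc));
    }
}
```

A document that uses an unprefixed TelemetryStream element is rejected, which is the same mismatch that shows up when half the classes come from one package and half from another.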

However, this solution to my server-side problem caused a client-side issue that I am still working with. Now I have to choose whether to use my telemetry schema (the same as Hackystat's, but with namespace and prefixes) for the client-side stuff, or Hackystat's. I was coding merrily along until I realized I had imported half of the telemetry stuff from the Hackystat library and half from my JAXB folder.

However, once I clear that up and do some testing, then we have Hackystat Application LAUNCH! Exciting, n'est-ce pas?

Then: visualizations and analysis tool hardcore.

Also: test coverage, documentation, continuous integration

In other words, it's going to be a busy week. I'll keep you posted.

Wednesday, August 12, 2009

GSoC: Hackystat August 3-August 10

Sorry for the delay in posting--I landed myself a nice bronchial infection and have spent most of the last week coughing like a sea lion barks. It's awesome! However, it will probably also contribute to my brevity today, which I imagine many of you in the audience will appreciate.

This week:

Hackystat app. (Still. Possibly forever more.)

Working on the hackystat app makes me feel like I might be the only person who has ever tried to access Telemetry data who was not intimately familiar with the workings of the system. Much time has gone towards trying to find a constant or list or SOMETHING that includes the names of all of the Telemetry charts. The test cases for the Telemetry chart stuff don't seem to use them--they just use hard-coded strings, which makes me suspect that there is no such set of constants. For those with commit access--man, would that be handy! Judging by the test cases and the list of telemetry stuff in the project browser, I decided that the names must be the same as the list in the project browser.

I will be storing these charts in SocNet:
Build
Churn
CodeIssue
Commit
Coverage
CyclomaticComplexity
DevTime
Issue
UnitTest
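Since no shared set of constants turned up, one option is to keep the names in a single place on the client side. A minimal sketch, assuming the chart names really are the strings listed above; the enum and its method names are hypothetical and not part of any Hackystat library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enum holding the chart names from the list above,
// so the strings are spelled out exactly once.
enum TelemetryChart {
    BUILD("Build"),
    CHURN("Churn"),
    CODE_ISSUE("CodeIssue"),
    COMMIT("Commit"),
    COVERAGE("Coverage"),
    CYCLOMATIC_COMPLEXITY("CyclomaticComplexity"),
    DEV_TIME("DevTime"),
    ISSUE("Issue"),
    UNIT_TEST("UnitTest");

    private final String chartName;

    TelemetryChart(String chartName) {
        this.chartName = chartName;
    }

    /** The name to pass to the telemetry service (assumed to match the project browser). */
    String chartName() {
        return chartName;
    }

    /** All chart names, in declaration order. */
    static List<String> allNames() {
        List<String> names = new ArrayList<>();
        for (TelemetryChart chart : values()) {
            names.add(chart.chartName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(allNames());
    }
}
```

If the assumption about the names turns out to be wrong, only the enum needs to change.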

If anyone has a favorite chart they would like to see stored, speak now (or soon) or forever hold your peace. (Just kidding. But do speak up, because knowing would be good.)

The app is mostly finished (if my assumptions about the names were correct)--now I'm implementing its REST API support.


Visualizations:

TouchGraph is out, because it has virtually no documentation, and the code is a relatively old version. (They don't know when they will be releasing the new one.) I haven't been able to figure out how to use it, so I have moved on to other options.

Jung, which I mentioned last week, has better documentation than TouchGraph by a long shot. However, I am working most seriously with Giny (http://csbi.sourceforge.net/). Giny is a LOT easier to work with than Jung, and implements a bunch of handy graphing algorithms that will make rudimentary analysis that much easier.

What I can't decide is how to host the visualizations. It would be easiest (from my perspective), to run them on the user's computer. However, it would probably be best to do a project browser style thing, visualization via web browser. My concern is that I will not be able to manage that in two weeks.


Library problems:

I am running into trouble using the hackystat client libraries. For instance, with the Telemetry client most recently added to my system, I pulled the ivy retrieve target from the telemetry system build file. Somewhat lazy, I know, but why duplicate work? The problem is that the target only works if you've compiled and built from source hackystat-utilities, hackystat-sensorbase-uh, hackystat-sensor-shell, and hackystat-dailyprojectdata. Which is fine if you've done it, but not great if you haven't. I think this is because the individual projects don't have modules in ivy-roundup or in my module repository.

Since I don't want individual users to have to compile the entire hackystat system from sources just to be able to use my stuff, it would be awesome if the sensorshell jar and the telemetry jar were added to ivy roundup. If they can't be added to ivy roundup, is it cool if I add them to my module repository?


Next week:

My plan from last week was sort of a general overview for the next two weeks, so it still stands. So I'll be gluing the hackystat sensor to the socnet server and working on visualizations with Giny.

Something I'd like to do that MIGHT not be such a big deal would be to start one-way hashing the passwords.
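A minimal sketch of what one-way hashing could look like using only the JDK, assuming a salted PBKDF2 hash is acceptable. The class name, iteration count, and key length are assumptions chosen for illustration, not settings taken from SocNet or the SensorBase.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hypothetical helper: store the salt and the hash, never the password itself.
final class PasswordHasher {
    private static final int ITERATIONS = 100_000; // assumed work factor
    private static final int KEY_BITS = 256;

    /** A fresh random salt, to be stored next to the hash. */
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    /** Derives a 32-byte hash from the password and salt. */
    static byte[] hash(char[] password, byte[] salt) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec).getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 not available", e);
        }
    }

    /** Re-hashes the attempt with the stored salt and compares in constant time. */
    static boolean matches(char[] attempt, byte[] salt, byte[] storedHash) {
        return MessageDigest.isEqual(hash(attempt, salt), storedHash);
    }

    public static void main(String[] args) {
        byte[] salt = newSalt();
        byte[] stored = hash("s3cret".toCharArray(), salt);
        System.out.println(matches("s3cret".toCharArray(), salt, stored));
    }
}
```

Because the hash cannot be reversed, a scheme like this means the server can check a password but can no longer mail the original back to the user.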

Tuesday, August 4, 2009

GSoC: Hackystat July 27-August 3

This week:

Has not been as productive as I needed it to be. Mostly, it has been consumed with Ivy frustrations. I will be trucking along and realize that I need another library, and have to stop and futz with the Ivy stuff until it works. This week, that also involved updating Ant, since apparently the version of Ant running on this release of Ubuntu is two years old. In a couple of the cases I was having difficulty figuring out which jar I would need. For instance, I needed the SensorBaseClient class, but I didn't feel like it was a good idea to have the hackystat client dependent on the whole sensorbase. So, I looked at how the eclipse sensor did it, and saw that the eclipse sensor pulls the sensorshell jar, and assumed that it was all wrapped up in the sensorshell jar. I copied that little bit from the sensorshell build file and put it in the build file for my project, ran it.... Break. Fail. Lose. This did not make sense to me. So I downloaded the sensorshell and tried to build it. Fail, but because of Ant.

In the end, I downloaded the new releases and built everything from the source, so all of the necessary jars would be in my cache. I didn't experience any problems building the system from source, though I do have a question about the ant -f jar.build.xml publish-all command. Does it only build the projects that are immediately dependent upon the project you are building? Or does it cascade? I mean, does it follow the chain rule?

Like, if project x depends on project y, which depends on project z, and you invoke publish-all on project z, does it only build project y, or does it also build project x? My working understanding is that it only builds project y. It would be super nice if it also built project x.
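To make the question concrete, here is a small sketch of the two possible behaviours on the x, y, z example. It does not say what publish-all actually does; the class and method names are hypothetical, and the dependency map is just the example from the paragraph above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class Dependents {

    /** The example graph: x depends on y, y depends on z. */
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("x", Arrays.asList("y"));
        dependsOn.put("y", Arrays.asList("z"));
        dependsOn.put("z", Collections.<String>emptyList());
        return dependsOn;
    }

    /** Projects that list the changed project as a direct dependency. */
    static Set<String> directDependents(Map<String, List<String>> dependsOn, String changed) {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
            if (entry.getValue().contains(changed)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** The cascading version: dependents of dependents, found breadth-first. */
    static Set<String> transitiveDependents(Map<String, List<String>> dependsOn, String changed) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(changed);
        while (!queue.isEmpty()) {
            for (String dependent : directDependents(dependsOn, queue.remove())) {
                if (result.add(dependent)) {
                    queue.add(dependent);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(directDependents(exampleGraph(), "z"));
        System.out.println(transitiveDependents(exampleGraph(), "z"));
    }
}
```

With the example graph, the direct version finds only y for a change in z, while the cascading version finds y and then x.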

So, while I love Ivy a lot for installation purposes, and for downloading and organizing purposes, writing something to use Ivy as you go is an enormous pain, particularly if something is not already in the roundup.

Other library related concerns. The sensorshell jar seems to contain pretty much the entire sensorbase. Why? It seems like I tried to avoid a sensorbase dependency and ended up with one anyway.

Anyway, here's a preview of the app. Note that where the list currently says "Item 1...." and so on, it will actually show the user's projects in the SensorBase. This is just the preview of the GUI.

[Screenshot: preview of the Hackystat app GUI]

Also with tool tip texts:

[Screenshot: the GUI showing a tool tip]

Not that this will run right now. Have two more libraries to ivy-ify.


Visualizations:

These are the network visualization libraries I'm experimenting with.

http://sourceforge.net/projects/touchgraph/

http://jung.sourceforge.net/

I am really excited about touchgraph.


Splitting the Project:

This is going to have to wait until I've finished writing the code, more or less. It's enough trouble to update one set of build files. I don't look forward to having to update 4 or so.


Sprinting to the finish!

The firm pencils-down date is in just two weeks' time. Things that need to get done before then:

1. meaningful registration process, by which a user can link the email address they used to register with the sensorbase to all of their various socnet stuff, so that I can limit who accesses the data in an appropriate way.

2. useful visualizations and initial analysis tools
Other than displaying the network and allowing the user to navigate through it in a sort of physical manipulate-y way (touchgraph!), this will probably also include some summarization of the other information in the network. I don't, however, want to simply mirror what the dailyprojectdata and telemetry services already do. Still thinking on what that will be.

Then, once that is done, splitting the project into smaller, manageable chunks, and making sure my documentation is nice, etc.

Also, on a personal note, I graduated from college this week.

Tuesday, July 28, 2009

GSoC: Hackystat July 20-27

This week:

Client authentication is FINALLY working! I cannot describe the satisfaction. Currently, the PUT authentication is significantly more meaningful than the GET authentication. I am still waffling on how I should address retrieving information from the database. In some cases, my apps need to access it freely (which is no problem and is mercifully working beautifully), but other users may need to have permissions set up so that they can only access nodes in the database that are connected to their nodes. This will be more difficult. On the other hand, I'm not sure that it's the best solution. It really does depend a lot on who is going to be using it. It might be good to set up levels of access. The default, for a user, is to be able to access all of the information that is related to their user node. Then, permissions can be added for them to access the information (read only, naturally) of other users. I don't really know how to go about doing that--I suspect I can just change the user schema to include a permissions list. Have I mentioned that I really like JAXB?
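The levels-of-access idea can be sketched in a few lines. This is an assumption-laden illustration, not the SocNet implementation: the class, its methods, and the use of plain user names as keys are all hypothetical, and a real version would hang the permissions list off the user schema.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical policy: owners can read and write their own data;
// other users can only read, and only if the owner has granted it.
final class AccessPolicy {
    // owner -> users who have been granted read-only access to the owner's nodes
    private final Map<String, Set<String>> readers = new HashMap<>();

    void grantRead(String owner, String reader) {
        readers.computeIfAbsent(owner, key -> new HashSet<>()).add(reader);
    }

    boolean canRead(String requester, String owner) {
        return requester.equals(owner)
                || readers.getOrDefault(owner, Collections.<String>emptySet()).contains(requester);
    }

    boolean canWrite(String requester, String owner) {
        return requester.equals(owner); // granted permissions are read only
    }

    public static void main(String[] args) {
        AccessPolicy policy = new AccessPolicy();
        policy.grantRead("alice", "bob");
        System.out.println(policy.canRead("bob", "alice"));
        System.out.println(policy.canWrite("bob", "alice"));
    }
}
```

Keeping the check in one place means a resource only has to ask canRead or canWrite before touching the graph.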

Continuous Integration:

I surprised myself by being much further along on this than I had initially thought. All of that pain a couple of weeks ago with Ivy was well worth the time and trouble! I do have much in the way of findbugs, pmd, and checkstyle errors, but not nearly as many as I was anticipating (somewhere in the range of 100.) I suspect this comes from having cannibalized portions of the code from the sensorbase. I'm considering setting the project up as an Eclipse project again, just so I can use the checkstyle plugin, but Eclipse and I get into a fist fight basically every time I try to use it. (I've been using NetBeans, which is a balm to my soul). Hopefully this week I'll get that under control. I am also considering breaking the project out into separate projects once the hackystat app is up this week, so that the clients have separate build files, etc.

Hackystat App

Is going slowly. Man, has it ever been a long time since I've done GUI work. The list consensus on which Telemetry to store... wasn't, although it did bring up an interesting research question: what exactly do the telemetry streams a user singles out as important say about that user? This is such an interesting thing to me that I am going to store some standard telemetry streams, and then also store a list of which streams the user thinks are important.

Issues I'm still having:

JUnit tests. Somehow, my tests are not independent and this is causing them to fail. It has something to do with the initialization of the database.

This week!

Getting down to the wire, somewhat. I'm going to start focusing on some visualizations of the data. Basic things, like just showing the network. I don't know how to make it easy to see the relationship data (telemetry streams, for instance, are attributes of the relationship between a hackystat user and a project). So that's going to be a hurdle. Okay, truthfully, displaying it at all is going to be a hurdle. There are a couple of network graphics libraries out there that are open source, so I may look there. There's also Improvise, which I know can display stuff like this...though it's more oriented towards building the visualizations by hand as opposed to programmatically, but since it can do it, it's possible that I can use their visualization engine. I <3 open source software.

Going to split the project up this week, as I feel that that's more in keeping with the way hackystat is built, and may be easier to upkeep. The CI stuff goes hand in hand with that--I want to have that all wrapped up by the end of the week.

I'm aiming at having the Hackystat app ready to go live tomorrow evening or Wednesday around noon my time.

Vision:

I have decided where I want to go with this as a tool set. I really like the idea of having an analysis tool for this that is, as Philip says, something of a hypothesis generator for groups of coders, such as within a class or company.

Unfortunately, in my experience data mining algorithms require being tweaked halfway to hell before they work, so I'm guessing the initial version of this will not be terribly awesome. I already have a number of standard mining algorithms implemented in a library that Zack and I have been intending to open source for a while, once it was cleaned up and documented. Unfortunately the part that works best, the spectral clustering, is based on a very fast, very ancient Fortran library (I don't write in Fortran) that has been seg faulting mysteriously for about five months now. I have had neither the time nor the skill to repair this.

However, the non-spectral clusterings work splendidly, so in theory, that could be plugged in. The graph specific stuff is likely to be more useful, like the SRPTs and the other thing that I have since completely forgotten the name of.

Miscellaneous:

There are some things that I have just come to REALLY love this summer. Most of them are exceptionally nifty tools to which I had never been exposed during previous experiences. I LOVE the properties files. How clever is that? It just tickles me to death. It makes me want to find more things to hide in them, though I suspect after a while it drives your user crazy.

JAXB. There is nothing nicer than having your code write code for you. <3

Ivy. I am so glad that I tried to build the system before the ivy integration was finished. I feel that I have a much greater appreciation for how truly awesome ivy is.

Tuesday, July 21, 2009

GSoC: Hackystat July 13-20

Client Authentication is Made of Lose and Fail.

Okay, so it's maybe not that bad, but I'm definitely having much more difficulty with it than I had anticipated. In some ways I think it would have been significantly easier if I had put it in from the beginning: in going back through to make my client and resource code work with the authentication, I managed to break things pretty badly. The biggest difficulty that I have conquered so far was the Mailer. I'm using gmail as my smtp server, and I could not for the life of me get it to authenticate properly. As far as I can tell, the original Mailer code for the server does NO authentication. How is that even possible? Anyway, I tried a couple of different ways to add in the authentication for the mailer, but it was a variation of the following code from GaryM at the VelocityReviews forum thread on gmail as an smtp server that finally got it to work.

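import java.security.Security;
import java.util.Properties;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class GoogleTest {

    private static final String SMTP_HOST_NAME = "smtp.gmail.com";
    private static final String SMTP_PORT = "465";
    private static final String emailMsgTxt = "Test Message Contents";
    private static final String emailSubjectTxt = "A test from gmail";
    private static final String emailFromAddress = "";
    private static final String SSL_FACTORY = "javax.net.ssl.SSLSocketFactory";
    private static final String[] sendTo = { "" };

    public static void main(String[] args) throws Exception {
        // Registers Sun's JSSE provider explicitly, as in the forum snippet.
        // JDKs register it by default, and newer ones may not ship this class,
        // in which case this line has to go.
        Security.addProvider(new com.sun.net.ssl.internal.ssl.Provider());

        new GoogleTest().sendSSLMessage(sendTo, emailSubjectTxt,
                emailMsgTxt, emailFromAddress);
        System.out.println("Successfully sent mail to all users");
    }

    public void sendSSLMessage(String[] recipients, String subject,
            String message, String from) throws MessagingException {
        boolean debug = true;

        // Talk to Gmail over SSL on port 465 and require SMTP authentication.
        Properties props = new Properties();
        props.put("mail.smtp.host", SMTP_HOST_NAME);
        props.put("mail.smtp.auth", "true");
        props.put("mail.debug", "true");
        props.put("mail.smtp.port", SMTP_PORT);
        props.put("mail.smtp.socketFactory.port", SMTP_PORT);
        props.put("mail.smtp.socketFactory.class", SSL_FACTORY);
        props.put("mail.smtp.socketFactory.fallback", "false");

        // The Authenticator hands over the account credentials when the server asks.
        Session session = Session.getDefaultInstance(props,
                new javax.mail.Authenticator() {
                    @Override
                    protected PasswordAuthentication getPasswordAuthentication() {
                        return new PasswordAuthentication("xxxxxx", "xxxxxx");
                    }
                });
        session.setDebug(debug);

        // The snippet as pasted stopped after creating the session. The rest is a
        // plain JavaMail send added as an assumption to make the example complete.
        Message msg = new MimeMessage(session);
        msg.setFrom(new InternetAddress(from));
        InternetAddress[] addressTo = new InternetAddress[recipients.length];
        for (int i = 0; i < recipients.length; i++) {
            addressTo[i] = new InternetAddress(recipients[i]);
        }
        msg.setRecipients(Message.RecipientType.TO, addressTo);
        msg.setSubject(subject);
        msg.setContent(message, "text/plain");
        Transport.send(msg);
    }
}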

The code highlighted in red I initially left out, because I didn't know what it did. Leaving it out causes the whole thing to fail, so obviously it's important. The code highlighted in green is what is missing from the SensorBase Mailer that makes me think that the SensorBase mailer doesn't authenticate before sending emails.

So, the mailer was a somewhat frustrating goose chase. Now, the goose I am chasing is an "Undefined user null" error. Not an exception, mind you. I think it's a status message that's being set somewhere in code that I don't have direct access to (my best guess is somewhere in the restlet router or guard stuff.)

So, currently, the client authentication has stalled at the retrieving data part. You can register new users all the live long day, and it will send them emails with their login credentials. However, if you try to get information as an authenticated user, it finds you in the database, sees that you're properly authenticated, returns that you are a user and are okay to access the data... and then fails.

Hackystat Client

The hackystat client will allow the users to specify which project(s) and time frames they want to allow SocNet to access, through a simple little GUI. (Only contiguous time periods can be selected for a project, so not like, two weeks here and then another two weeks later). The thing left to decide is... which telemetry data to use? What are the most common/most useful analyses?


This week:

Finishing the client authentication is the biggest deal, followed by finishing the hackystat client. I'll be sending an email to the list to get input on which telemetry analyses are the most useful, as well.

Additionally, I'm going to take a crack at the continuous integration stuff Philip has encouraged us to work on, so that will be a couple of new and exciting toolsets. I'm particularly excited about checkstyle.

Sunday, July 12, 2009

GSoC: Hackystat July 6-13

This Week:

Server and Database
The server and the database are up and open for business! (Well, in that you can run them. They are not open for business in the public server sense). It is Ivy integrated and mostly lovely to behold. It does not currently support the full API listed on the project wiki, but we're getting there. It implements all those necessary for clients to store information in the graph.

The server does not implement client authentication yet, but it's on the roster of things to do this week. It is also possible that there are one too many layers of abstraction between the resources and the graph, but that can be addressed in future revisions.

Problems encountered:

JAXB ugliness.

I had a really lovely interface created for my XMLRelationship objects that included two instances of a named complex type, XMLNode startNode and XMLNode endNode. However, when I tried to marshal it, the marshaller threw an exception because the node was missing an @XmlRootElement annotation. Adding that annotation by hand works just fine, but apparently it wasn't being generated. Google provided this answer: http://weblogs.java.net/blog/kohsuke/archive/2006/03/why_does_jaxb_p.html

Apparently the JAXB compiler won't add the @XmlRootElement annotation unless it can statically prove that the type isn't going to be used by anything outside of that file.

Neither of the fixes proposed in the above blog worked for me, so I had to refactor the XMLNode to be an anonymous type. This makes the interfaces a lot more ugly, as instead of having two separately named instances of XMLNode, you have a list of two XMLNodes, called XMLNode. And instead of being able to call

relationship.setStartNode(beginNode);

I have to do

List<XMLNode> nodes = relationship.getXMLNode();
nodes.add(beginNode);

The bad plurals hurt my soul. I've been looking into how to configure JAXB to generate custom types in hopes that I can clean it up later that way, but it's not a super high priority. Just something that would make my soul hurt a little less.

Not all of the REST API is supported by the server yet. The server code is also a little shy on the comments.

Time Traveling Exceptions.

So, in testing one of my get methods, I attempted to retrieve the user "Eliza Doolittle" from the database, using this uri

"{host}/nodes/IS_USER/Eliza Doolittle"

If you look closely you can probably guess what the problem was. Yes, folks, http calls do not like them some spaces, not at all. So it transferred as "Eliza+Doolittle", and asked the database for the node of that name, which didn't exist. Normally, this would have just thrown a NodeNotFoundException and that would have been handled appropriately, but in this case it threw the exception and, through a series of unfortunate events, completely obscured what was happening. I spent about four hours trying to figure out how something could be not null in the method passing it and then null when it is received... But, fortunately, there was no break in the space time continuum and I eventually got it worked out. Moral of the story is that spaces are a no-no for URIs until I implement + removal in the server.
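A minimal sketch of handling the space on both ends with the JDK alone, assuming the client builds the path itself and the server receives the raw segment. The class and method names are hypothetical; the path is the one from the example above with the host left off.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

final class NodeUris {

    /** Client side: percent-encode characters that are illegal in a path (a space becomes %20). */
    static String encodePath(String rawPath) {
        try {
            // 4-argument constructor: scheme, host, path, fragment. It quotes illegal characters.
            return new URI(null, null, rawPath, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Bad path: " + rawPath, e);
        }
    }

    /**
     * Server side: form-style decoding, which turns both "%20" and "+" back into a space.
     * Note that a name containing a literal plus sign would also be changed.
     */
    static String decodeSegment(String segment) {
        try {
            return URLDecoder.decode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodePath("/nodes/IS_USER/Eliza Doolittle"));
        System.out.println(decodeSegment("Eliza+Doolittle"));
    }
}
```

Decoding on the server before the database lookup would let the normal NodeNotFoundException path handle names that really do not exist.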

Neo4J Ivy Integration

Neo was a poor choice for my first attempt at Ivy Integration. Neo4j has no consistent naming convention for their directories and libs in their releases, so Ivy kept trying to download a file that didn't exist. I couldn't for the life of me figure out where Ivy was getting that name from... Still haven't, actually. I solved the problem by instructing Ivy to rename Neo to the name it wanted before it started looking for it in the cache. The rest of the ivy integration was MUCH smoother. However, I would really like to know how one generates the xml files from xslt files.

Twitter App

The twitter app is also up! It sleeps for 15 minutes if there's an unidentified exception, and an hour between each round of polling twitter for changes and sending the information to the database. I am particularly proud of the caching. I was initially very frustrated because it's highly parallel but not quite identical between getting the followers of the Twitter Client account and getting the followers of a particular user. I arrived at a solution that was efficient in its code reuse and linear time instead of quadratic time. I was pleased.
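The post does not spell out the caching approach, so the following is only one way the linear-time comparison could work, assuming followers are identified by numeric ids. The class and method names are hypothetical, and this swaps in a plain hash-set lookup for whatever the Twitter app actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical cache: remembers the last follower list and reports only what is new.
final class FollowerCache {
    private Set<Long> cached = new HashSet<>();

    /**
     * Returns the ids in latest that were not in the cache, then replaces the cache.
     * Each lookup is a hash probe, so the whole pass is linear in the list size
     * rather than the quadratic cost of comparing every old id with every new id.
     */
    List<Long> update(Collection<Long> latest) {
        List<Long> added = new ArrayList<>();
        for (Long id : latest) {
            if (!cached.contains(id)) {
                added.add(id);
            }
        }
        cached = new HashSet<>(latest);
        return added;
    }

    public static void main(String[] args) {
        FollowerCache cache = new FollowerCache();
        System.out.println(cache.update(Arrays.asList(1L, 2L)));
        System.out.println(cache.update(Arrays.asList(2L, 3L)));
    }
}
```

The same class could serve both the Twitter Client account and an individual user, with one cache instance per account, which is one way to get the code reuse mentioned above.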

Documentation

Gasp! There is actual documentation up at the socnet google project page! There are directions for installing and building from sources! There are directions for how to start sending your Twitter data to the server! It's full of awesome and wow.

This next week:

Client Authentication in the server. This is necessary so that people can't DoS my server, or fill it with a million instances of RickAstley objects.
Cleaning up server documentation
Hackystat data grabber
Getting Ohloh API key

I'm putting Facebook on the back burner for the time being. Currently, they are being sued about their developer data access policy. Not that this is likely to be resolved soon enough to help me, but I think I can do a fair amount with the hackystat and ohloh stuff.


Check out the shinier project page : http://code.google.com/p/hackystat-analysis-socnet/