Sergio's blog

About style guides && code conventions

A few weeks ago, I was moved to a new project at work. My tasks now focus on maintaining an Ariba-based application that is extended with several Java modules. In plain words: understanding and bugfixing other people’s code.

With this premise, once I joined my new team, I started to investigate whether they had some kind of style guide for maintaining such a monstrous project. It was a bit disappointing to see (once again) how this computing world works. No style guide. No code conventions. Because of this, I started to talk with my teammates about the benefits of establishing a few guides and conventions to improve maintenance times. They agreed, so I got to work. While I thought this task would be easy (or at least fast), I was positively surprised to see how complex it can be to define a good style guide that helps the team follow clear conventions.

Besides some obvious points, such as the following (see the short sketch after the list):

  • Maximum line length
  • Brace position
  • Blank lines
  • White space
  • Indentation
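
As a toy illustration (a hedged sketch of the kind of rules I mean, not an actual guide), here is how a small Scala fragment might look once those points are fixed: lines kept short, opening braces on the same line, one blank line between members, spaces around operators, and two-space indentation.

// Hypothetical example illustrating the formatting points above.
class TweetCounter(tweets: Seq[String]) {

  def count: Int = tweets.size

  def longest: Option[String] =
    if (tweets.isEmpty) None
    else Some(tweets.maxBy(_.length))
}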

Beyond those, I was puzzled by some other (more philosophical) questions:

Where is the limit?

Should I recommend using one control structure over another?

Should I write about best practices, such as working with small, modular methods?

Is the style guide responsible for recommending clear names for variables and methods?

So…

What is the real scope for any style guide?

In my opinion, the perfect style guide is one that lets programmers feel comfortable using it (this is not a prison!) while also establishing some common criteria that reduce the time needed to understand other people’s code and to resolve merge conflicts in version control. There is no reason to drive developers crazy. Just make a few guidelines for formatting aspects. Code conventions are not the solution to other, deeper problems…


Share your code. Blog your thoughts.

These days I have been talking a lot with a colleague about the benefits of sharing your code on GitHub and blogging your everyday experiences and thoughts on computing. Since there is no need to reinvent the wheel, here is a really nice post by Garann Means about this topic that came my way tonight. I hope you enjoy the read.

Working with large tweet collections

These days I’ve been facing the first challenges of working on my Master’s Thesis. As I’ve explained in other posts, to achieve the Thesis’s main goal I have to create a system that can work with large collections of tweets in order to analyze them.

I already had a large collection of Spanish tweets (about a million and a half) which I had collected in July. This collection was stored in a CSV file, which I thought would be the easiest way to process them later, because other alternatives such as XML or JSON seemed to imply creating a tree structure (with more than a million nodes) in memory in order to process them.

Of course, this was my first mistake. Working with CSV presents its own potential disadvantages because of the kind of content I’m going to analyze. My tweet collection resulted in an invalid CSV file, with comma characters embedded inside the values and several characters I would have had to escape. In fact, although CSV may seem simple, there are many complex cases to handle, and even specialized libraries fail with some kinds of file. That was my experience with OpenCSV, which performs really well for many CSV tasks but, in my particular case, couldn’t parse my file successfully.
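
To make the problem concrete, here is a toy illustration (the record and field layout are invented for this sketch; my real file differed) of how free-form tweet text defeats naive CSV handling:

// A hypothetical 4-field record: the tweet text itself contains commas,
// so a naive split yields too many fields. Quoting the field properly
// then forces you into the escaping mess described above.
val line = "sergio,Oviedo,CET,Hey guaje, how are you, ok?"
val fields = line.split(",")
println(fields.length) // 6 fields instead of the expected 4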

So I had to start thinking about one of the other alternatives: XML. OK, at first I thought about a huge tree structure in memory… but obviously there had to be another way. XML is present in almost every kind of project, and its use with large files must already be covered. One Google search later, I remembered SAX. Apparently, problem solved.

SAX works without loading the entire XML tree structure into memory; instead, it processes each element of the XML document sequentially as events. This key feature lets me handle each tweet without exhausting memory.
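
As a minimal sketch of the idea (the element names match my Tweet class below, but this TweetHandler is my own illustration, not the final Thesis code), a SAX handler receives start/end-element events one at a time, so only the current tweet needs to live in memory:

import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

// Collects the text of each <text> element and counts finished <tweet>s.
class TweetHandler extends DefaultHandler {
  private val buffer = new StringBuilder
  private var count = 0

  override def startElement(uri: String, localName: String,
                            qName: String, attrs: Attributes): Unit =
    if (qName == "text") buffer.clear()

  override def characters(ch: Array[Char], start: Int, length: Int): Unit =
    buffer.appendAll(ch, start, length)

  override def endElement(uri: String, localName: String, qName: String): Unit =
    if (qName == "tweet") count += 1 // a whole tweet has been read: process it here

  def tweets: Int = count
}

// Usage: parse a (hypothetical) tweets.xml file without building a tree.
val handler = new TweetHandler
SAXParserFactory.newInstance().newSAXParser()
  .parse(new java.io.File("tweets.xml"), handler)
println(s"Processed ${handler.tweets} tweets")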

In addition, XML offers me a better way of structuring the tweet collection, avoiding some of the problems presented above.

Due to this change of strategy, I had to start thinking about technologies that make it easier to manipulate XML files. The July collection was gathered with a small Java system that I had developed for the last subject of the Master’s first year.

IMHO, working with XML in Java is not easy, and I remembered some interesting points about Scala’s support for this kind of file. So I started to read about Scala and, in the end, it is the language I’ve chosen for the tasks of collecting and analyzing the set of tweets.

Some of the benefits of using Scala (which may not apply to your particular case) are:

Native XML support

Working with and serializing an object as XML in Scala is absurdly simple:

/**
 * Project: falcon
 * Package: org.falcon.model
 *
 * Author: Sergio Álvarez
 * Date: 09/2013
 */
class Tweet(username: String, location: String, timezone: String,
  latitude: String, longitude: String, text: String) {
  
  // XML literal: the compiler builds a scala.xml.Elem, interpolating
  // (and escaping) each {expression} into the resulting tree
  def toXML =
    <tweet>
      <username>
        {username}
      </username>
      <location>
        {location}
      </location>
      <timezone>
        {timezone}
      </timezone>
      <latitude>
        {latitude}
      </latitude>
      <longitude>
        {longitude}
      </longitude>
      <text>
        {text}
      </text>
    </tweet>
}

Scala’s support for XML literals lets us code an object’s XML representation easily, as well as its later deserialization.
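
For instance, reading a Tweet back with Scala’s XPath-like selectors is just as short. This fromXML helper is a hedged sketch of my own (not project code yet), assuming the XML shape produced by toXML above:

import scala.xml.Node

// Hypothetical companion: rebuild a Tweet from the XML written by toXML.
// trim is needed because the literals above put whitespace around values.
object Tweet {
  def fromXML(node: Node): Tweet =
    new Tweet(
      (node \ "username").text.trim,
      (node \ "location").text.trim,
      (node \ "timezone").text.trim,
      (node \ "latitude").text.trim,
      (node \ "longitude").text.trim,
      (node \ "text").text.trim)
}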

Compiles to JVM bytecode

This is especially interesting because it allows Java developers to keep using the libraries they are already familiar with from their Java programs.
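
A trivial sketch of what this means in practice: plain Java standard-library classes can be called from Scala with no glue code at all.

import java.text.SimpleDateFormat
import java.util.Date

// java.text and java.util used directly from Scala.
val formatter = new SimpleDateFormat("yyyy-MM-dd")
println(formatter.format(new Date()))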

Actually, Scala has other good points for Java developers: although it is a functional language, it allows imperative programming, and its object-oriented philosophy is really similar to Java’s. This lets Java developers start writing Scala code quickly (which does not imply that they can write high-quality Scala code quickly…).

Summary

Just to briefly summarize what this post covered:

  • CSV is not a good choice for storing large collections of tweets with complex content
  • SAX makes it possible to work with large XML files efficiently, without loading them fully into memory
  • Scala’s support for XML is a key feature for choosing it over Java

You can see the application of the previous ideas on my GitHub profile.

Article about Teamwork

Yesterday I read an interesting article about teamwork posted on CodeBetter.com by Marcus Hammarberg. It shows some improvements to the Scrum methodology that make it more practical and agile.

In my experience, (bad) Scrum is tedious. Really tedious. At work, the daily standup usually takes more than 30-40 minutes for a team of 8 developers. Tasks are not well defined. And there is no time for improving the team (by increasing our knowledge).

I really love Agile methodologies, but in my opinion it is necessary to spend some time thinking about whether the team is really prepared to use them. Sometimes it might be worthwhile to hire a professional coach to train the team in Agile principles. It is essential to make people totally convinced about the process and to make them feel like an important part of the team’s lifecycle.

I hope you enjoy the read.

Setting up the environment

These past weeks I have been spending my time doing different tasks for the Thesis and also starting my new journey with Mac OS X.

First of all, I read the latest articles from the recommended reading list and, as expected, there is no new idea about how to face the challenge. The studies reveal two main strategies for achieving the goal:

  • By finding Location Indicative Words
  • By integrating other systems and studying the user’s Twitter history

Both concepts will be my starting point for implementing the first prototype.

Second, and here is where I want to focus this post, I have been doing the initial tasks of starting and planning the project. Because the project is going to follow Agile rules, I’ve created an organization in Trello to hold the different to-do boards. One of Trello’s nicest features is that it gives us an online tool where my professor and I can both keep track of the Thesis’s progress and the pending/doing/done tasks.

It is likely that in the future I’ll make the organization public so everyone can see how the Thesis is growing and how close the deadline is 🙂

As I said previously, I started to work with Mac OS X a few days ago. This mattered because I had to search for new tools and programs to replace my usual software on Windows. Thanks to this search, I found ShareLaTeX, probably one of my best discoveries on the Internet in recent months.

ShareLaTeX is a fully online LaTeX editor that also allows collaborative editing (similar to Google Docs). It saved me from installing one of the large TeX distributions on my laptop…

I also installed Node.js with its incredibly simple installer. I’m seriously thinking about building parts of my Thesis with this technology (adding frameworks such as express.js, knockout.js or socket.io, depending on requirements).

Finally, I will use Sublime Text 2 for coding and GitHub as the Git server for source control (that is: a public repository for the code. It would be nice if we built a small community around the project. Let’s try it!).

First thoughts

Since the last post, I’ve been reading some interesting articles about a small area within text analysis / data mining: Content-Based Geolocation.
Once I decided to focus my Master’s Thesis on Text Analysis, I had a meeting with my professor and another colleague who is also interested in developing his project around Text Analysis.

At that meeting, the professor (@PFCdgayo) proposed several topics related to his field of interest, as well as a really nice set of articles to support understanding them. After reading all of their abstracts and exchanging ideas with him, I picked the topic of Content-Based Geolocation. There are some important reasons behind this choice:

  • Only 7% of global tweets are tagged with latitude and longitude (around 60% are from the USA)
  • Recent studies report promising accuracy (around 60%)
  • Possibilities for integrating several systems (Foursquare, Facebook check-in, Google+ check-in, etc.)

After reading roughly 80% of the proposed articles, I have seen that there are two main branches of attack on the problem. On one hand, some researchers have focused their efforts on developing algorithms that find discriminative words tied to a single place. For instance, “guaje” is a Spanish word used especially by Spaniards from Asturias, so there is a high probability that a tweet containing that word was written in that region. Accordingly, some of these algorithms are developed specifically to find this kind of word and to determine the tweet’s (or any text’s) location from that evidence.
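
To make the intuition concrete, here is a toy sketch (the counts and the add-one smoothing are invented for illustration; the actual papers use far more careful estimators) of scoring regions for a word from per-region counts:

// Hypothetical word counts per region, for illustration only.
val countsByRegion: Map[String, Map[String, Int]] = Map(
  "Asturias" -> Map("guaje" -> 120, "playa" -> 40),
  "Madrid"   -> Map("guaje" -> 3,   "playa" -> 25))

// Naive estimate of P(region | word) from raw counts, with add-one
// smoothing so unseen (region, word) pairs don't get probability zero.
def regionGivenWord(word: String): Map[String, Double] = {
  val raw = countsByRegion.map { case (region, counts) =>
    region -> (counts.getOrElse(word, 0) + 1).toDouble
  }
  val total = raw.values.sum
  raw.map { case (region, c) => region -> c / total }
}

println(regionGivenWord("guaje")) // "guaje" points strongly to Asturias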

On the other hand, other researchers have focused on integrating different location systems in order to infer the geolocation of any tweet. In this type of study, the user’s history is especially important. The system builds a history of the user’s locations based on their Twitter posts tagged by other location services such as Foursquare. The results rest on the assumption that a user whose posts are mostly tagged from one place is probably actually from that place.

For my Master’s Thesis I want to start from the idea of mixing both approaches, in order to obtain higher accuracy and reduce dependencies.

Choosing Final Project

Yesterday, I finally decided to write to one of my professors to ask whether he would be interested in supervising my Final Project.
It was not easy to decide which path to take, but in the end I have chosen Text Analysis and Data Mining.

There are two main reasons why I have chosen this option:

  1. Social Networks
  2. Business

Social Networks

Thanks to them, we have tons of information floating around us. People write about everything on their social networks, and we need tools able to analyze and sort it. There are hundreds of different applications for this technology.

Business

As I said previously, this kind of tool can be used in many different ways. But keep in mind: students would be really happy if you showed them a tool that auto-summarizes their lessons… 😉

Anyway, I also want to take this opportunity to start working with a dynamic language such as Ruby or Python (or, why not, JavaScript on the server side…). Because of this, I have started a Ruby course at Codecademy. Ruby is a fantastic programming language with a really nice syntax and thousands of wonderful surprises… (I’m really enjoying learning it). I also have a book written by Matz that I bought two years ago but, you know, books are great for gaining deeper knowledge of a language; I need some action first 🙂