First thoughts

by Sergio Álvarez

Since the last post, I’ve been reading some interesting articles about a small area within text analysis / data mining: content-based geolocation.
Once I decided to focus my Master’s thesis on text analysis, I had a meeting with my professor and a classmate who is also interested in centering his project on text analysis.

At that meeting, the professor (@PFCdgayo) proposed several topics related to his field of interest, along with a really nice set of articles to help us understand them. After reading all the abstracts and exchanging ideas with him, I chose the topic of content-based geolocation. There are some important reasons behind this choice:

  • Only 7% of global tweets are tagged with latitude and longitude (about 60% are from the USA)
  • Recent studies report promising accuracy (about 60%)
  • Possibilities for integrating several systems (Foursquare, Facebook check-in, Google+ check-in, etc.)

After reading roughly 80% of the proposed articles, I have seen that there are two main approaches to the problem. On the one hand, some researchers have focused their efforts on developing algorithms that find discriminative words associated with one place. For instance, “guaje” is a Spanish word used especially by Spaniards from Asturias, so there is a high probability that a tweet containing that word was written from that region. Accordingly, some of these algorithms are designed specifically to find this kind of word and to determine the location of the tweet (or any text) based on it.
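To make the first approach more concrete, here is a minimal sketch of the idea, not the method of any particular paper. It assumes a small set of tweets already labeled with a region; the function names, the toy tweets and the `min_share` threshold are all hypothetical and chosen only for illustration. A word counts as "discriminative" when most of the tweets containing it come from a single region.

```python
from collections import Counter, defaultdict

# Toy labeled data, invented for illustration only.
TOY_TWEETS = [
    ("ese guaje no para quieto", "Asturias"),
    ("el guaje ya tiene hambre", "Asturias"),
    ("vamos a tomar algo", "Asturias"),
    ("vamos a tomar algo", "Madrid"),
    ("hoy hace calor en el centro", "Madrid"),
]


def train_word_region_counts(labeled_tweets):
    """Count, for each word, in how many tweets from each region it appears."""
    counts = defaultdict(Counter)
    for text, region in labeled_tweets:
        for word in set(text.lower().split()):
            counts[word][region] += 1
    return counts


def predict_region(text, counts, min_share=0.8):
    """Vote for a region using only words concentrated in a single region.

    min_share is an assumed threshold: a word votes only if at least that
    fraction of the tweets containing it come from its top region.
    """
    votes = Counter()
    for word in set(text.lower().split()):
        region_counts = counts.get(word)
        if not region_counts:
            continue  # word never seen in the labeled data
        region, top = region_counts.most_common(1)[0]
        share = top / sum(region_counts.values())
        if share >= min_share:
            votes[region] += share
    return votes.most_common(1)[0][0] if votes else None


counts = train_word_region_counts(TOY_TWEETS)
print(predict_region("mira ese guaje", counts))
print(predict_region("vamos a tomar algo", counts))
```

With this toy data, “guaje” only appears in tweets from Asturias, so it votes for that region, while a phrase seen equally in both regions gives no prediction at all. A real system would need far more data and some smoothing for rare words.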

On the other hand, other researchers have focused on integrating different location systems in order to infer the geolocation of any tweet. In this type of study, the user’s history is especially important. The system builds a history of the user’s locations from his posts on Twitter that were tagged by other location services, such as Foursquare. Their results rest on the assumption that a user who has most of his posts tagged from one place is probably from that place.
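The history-based idea can be sketched in a few lines as well. This is only an illustration under my own assumptions: the user’s history is taken to be a plain list of place names already extracted from tagged posts, and the function name and both thresholds (`min_posts`, `min_share`) are hypothetical, not values from the articles.

```python
from collections import Counter


def infer_home_location(tagged_places, min_posts=3, min_share=0.5):
    """Guess a user's home from the places attached to his tagged posts.

    Returns (place, share) for the most frequent place, or None when the
    history is too short or no single place dominates it.
    """
    if len(tagged_places) < min_posts:
        return None
    place, count = Counter(tagged_places).most_common(1)[0]
    share = count / len(tagged_places)
    return (place, share) if share >= min_share else None


# Hypothetical history of one user's tagged posts.
history = ["Oviedo", "Oviedo", "Gijón", "Oviedo", "Madrid"]
print(infer_home_location(history))
```

Here three of the five tagged posts come from Oviedo, so the sketch returns that place together with its share of the history, which can serve as a rough confidence value.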

For my Master’s thesis, I want to start from the idea of combining both approaches, in order to obtain higher accuracy and reduce dependencies.