Reinventing the wheel

June 11, 2009 in en, google, gsoc, open source, projects

It’s been a bit more than a week since my last post, and again quite a lot has happened. I had several tasks from last week:
  • get in touch with gentoo-infra(structure) team
  • improve the way build logs are handled
  • look at possible ways to create tinderbox chroot environments
  • make it possible to test all versions of dependencies

I managed to get acquainted with the gentoo-infra team a bit and got a few answers too. Remember last time when I was talking about security issues of the pickle module when used over an untrusted connection? Apparently that’s not an issue anymore, since we’ll be using an encrypted connection in the final version. We can consider the route between Matchbox and the Tinderboxen a friendly environment.

A few people suggested that I look into similar Gentoo projects, namely catalyst and AutotuA. The main feature of catalyst is release engineering, not testing per se, but it can also create tinderboxes (effectively chroot environments). Perhaps some of its ideas could be used for my project, but so far it seems that catalyst is not the one and only for me. AutotuA is much more similar to collagen (did I mention that’s the name of my project yet?). There is a master server (a web application accepting jobs) and slaves (processing the jobs). It made quite a few interesting design decisions (such as keeping jobs in a git repository), and some of them I will at least reuse. Integration would be possible, but for now I have a feeling that it would be just as complicated as writing my own master/slaves, because AutotuA is a generic system for jobs and their processing, not specific to package compilation and testing. I’ll keep both projects in mind during my future endeavours.

As far as build log handling goes, my last POC (proof of concept) code simply grabbed stdout/stderr of the whole install process. It also used the higher-level interfaces for installing packages in Gentoo; I switched to lower-level APIs because I need to do a few things the higher-level APIs did not offer, most of them related to dependency handling. The best way to explain what I have to do is with an example, but first a little introduction to Gentoo package installation. Package "recipes" called ebuilds reside in the so-called portage tree. Most packages have more than one ebuild, because older and newer versions are always supported simultaneously. Each of these package versions has its own set of dependencies, that is, other packages that need to be installed for the package to compile/run. These dependencies look something like this:

=dev-libs/glib-2*
samba? ( >=net-fs/samba-3.0.0 )

This means that the package needs any 2.x version of the glib library, and if the samba feature (USE flag) is enabled, then samba version 3.0.0 or higher is also required. My task is to verify that a package can be compiled with ALL allowed versions of ALL its dependencies. Now the promised example.

Let’s assume that we want to install the package mc (Midnight Commander). There are currently two versions of app-misc/mc in portage: 4.6.2_pre1 and 4.6.1-r4. The list of their dependencies is quite long, but to show the principle I’ll use just one dependency, namely sys-libs/ncurses. Version 4.6.2_pre1 of mc depends on sys-libs/ncurses and version 4.6.1-r4 depends on >=sys-libs/ncurses-5.2-r5. There are currently two versions of sys-libs/ncurses in portage: 5.7 and 5.6-r2. Based on these dependencies it should be possible to install mc (both versions) with either ncurses-5.7 or ncurses-5.6-r2. From this point on there is a ping-pong of installing ncurses-5.6-r2, then mc-4.6.1-r4/4.6.2_pre1, then uninstalling them all, installing ncurses-5.7 and installing mc-4.6.1-r4/4.6.2_pre1 again. If mc-4.6.2_pre1 fails to compile with ncurses-5.6-r2, we will know that its ebuild needs to be changed to depend on >=sys-libs/ncurses-5.7. All this has to be repeated for every dependency of every version of every package in the portage tree. There are currently 26623 ebuilds in the portage tree. Now imagine that some of them will have to be compiled 30-50 times to test all dependency versions. Good thing we will have dedicated tinderboxes for compiling all those ebuilds.
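To make the combinatorics above a bit more concrete, here is a minimal sketch of the test loop for the mc/ncurses example. The hard-coded version lists and the emerge/unmerge/record callables are illustrative placeholders only, standing in for the real install, uninstall and reporting machinery; none of this is actual collagen or portage code.

from itertools import product

# Illustrative data only; collagen would read this from the portage tree.
PACKAGE_VERSIONS = ["app-misc/mc-4.6.1-r4", "app-misc/mc-4.6.2_pre1"]
DEPENDENCY_VERSIONS = {"sys-libs/ncurses": ["5.6-r2", "5.7"]}

def test_dependency_matrix(emerge, unmerge, record):
    for dep, dep_versions in DEPENDENCY_VERSIONS.items():
        for dep_version, pkg in product(dep_versions, PACKAGE_VERSIONS):
            emerge("=%s-%s" % (dep, dep_version))   # pin this dependency version
            ok, build_log = emerge(pkg)             # try to build the package against it
            record(pkg, dep, dep_version, ok, build_log)
            unmerge(pkg, dep)                       # clean slate for the next combination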

One more thing for now. Gentoo has a project management website based on Redmine for all GSoC students at soc.gentooexperimental.org. From now on I will aggregate all documentation for my project there. This blog will go on with fewer technical details, and I will link to the documentation where needed.

First commit for GSoC

June 1, 2009 in en, google, gsoc, open source, projects, software

Recently I finished all of my duties as a student for this term and could therefore spend the weekend catching up on GSoC (since I am one week behind schedule). In the end it turned out to be a pretty productive weekend.

I’ll summarize the basic architecture without any images (I’ll probably create them later this week, when everything settles down). There are two core packages:

  • Matchbox
  • Tinderbox

Matchbox is the master server that knows what still needs to be compiled and collects all the information. There is always only one Matchbox; there can, however, be more Tinderboxes. These machines connect to Matchbox and ask for the next package to emerge (compile). After emerging a package they collect information about the files in the package, USE flags, the emerge environment and error logs from the compile phase. This information is then sent back to Matchbox, and the Tinderbox asks for another package to emerge. Repeat while true.
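In pseudocode, a Tinderbox boils down to a loop like the one below. This is just an illustrative sketch: ask_matchbox() and emerge_and_collect() are made-up placeholders for the real networking and build code, and the command objects are the ones introduced further down.

# Hypothetical sketch of the Tinderbox main loop described above.
def tinderbox_loop():
    while True:
        job = ask_matchbox(GetNextPackage())   # ask Matchbox what to build next
        if job is None:                        # Matchbox has nothing left for us
            break
        # compile the package and gather the file list, USE flags,
        # emerge environment and error logs
        info = emerge_and_collect(job.package_name, job.version, job.use_flags)
        ask_matchbox(AddPackageInfo(info))     # send the results back to Matchbox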

The first thing I did was create a basic data model for storing data about compiled packages: what USE flags were used, error logs and stuff like that. A lot of things are not in the model yet, for example information about the tinderboxes, but for now this will do. The UML diagram is in the following picture:

This model should allow efficient storage of the data and a lot of flexibility to boot. There can be more versions of the same package (of course), and packages can also change category (which happens quite often). We can also collect different data sets based on USE flags.
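To give an idea of the kind of model I mean, here is a purely illustrative sketch; the class and field names are made up for this post, not the actual collagen schema:

class PackageCategory(object):
    """A category such as app-misc; packages can move between categories."""
    def __init__(self, name):
        self.name = name

class Package(object):
    """A package such as mc, currently belonging to one category."""
    def __init__(self, name, category):
        self.name = name
        self.category = category

class PackageVersion(object):
    """One version of a package, e.g. mc-4.6.1-r4."""
    def __init__(self, package, version):
        self.package = package
        self.version = version

class BuildResult(object):
    """One compile attempt of one package version."""
    def __init__(self, package_version, use_flags, error_log):
        self.package_version = package_version
        self.use_flags = use_flags    # USE flag set active during this build
        self.error_log = error_log    # compile-phase output if the build failed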

With the basic data model in place it was time for some serious prototyping :-) Naturally I decided to split the implementation into two parts, one for each core module (more to come later). Matchbox is a simple listening server waiting for incoming connections. I wanted to simplify network communication for myself, so I used the Python module pickle. This module can create a string representation of classes/functions and basic data types, which allowed me to use objects directly as network messages. The objects representing the Matchbox command set:

class MatchboxCommand(object):
    """Base class for commands a Tinderbox sends to Matchbox."""
    pass

class GetNextPackage(MatchboxCommand):
    """Ask Matchbox for the next package to emerge."""
    pass

class AddPackageInfo(MatchboxCommand):
    """Send collected information about an emerged package back to Matchbox."""
    def __init__(self, package_info):
        self.package_info = package_info

On the other side, a Tinderbox understands these replies from Matchbox (for now):

class MatchboxReply(object):
    """Base class for replies Matchbox sends back to a Tinderbox."""
    pass

class GetNextPackageReply(MatchboxReply):
    """Tells a Tinderbox which package, version and USE flags to build next."""
    def __init__(self, package_name, version, use_flags):
        self.package_name = package_name
        self.version = version
        self.use_flags = use_flags

Communication (simplified) goes something like this:

Tinderbox:

msg = GetNextPackage()
msg_pickled = pickle.dumps(msg)
sock.sendall(msg_pickled)

Matchbox:

data = sock.recv(4096)                 # read the pickled request (no framing yet)
command = pickle.loads(data)
if type(command) is GetNextPackage:
    package = get_next_package_to_emerge()
    # assuming the returned package object carries its name, version and USE flags
    msg = GetNextPackageReply(package.name, package.version, package.use_flags)
    msg_pickled = pickle.dumps(msg)
    sock.sendall(msg_pickled)
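One thing the simplified snippet glosses over is that a single recv() call is not guaranteed to return the whole pickled message. Any real implementation will need some framing; a minimal sketch using a length prefix could look like this (send_msg/recv_msg/recv_exactly are illustrative names, not part of the current code):

import pickle
import struct

def send_msg(sock, obj):
    payload = pickle.dumps(obj)
    # prefix the pickled payload with its length, so the receiver knows
    # how many bytes belong to this message
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return pickle.loads(recv_exactly(sock, length))

def recv_exactly(sock, count):
    """Keep calling recv() until exactly `count` bytes have arrived."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise RuntimeError("connection closed mid-message")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)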

There is one BIG caveat to this kind of communication: it is very easily tampered with. This is directly from the pickle documentation:

Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

We will have to decide whether to reimplement this part or trust the Gentoo infrastructure. So what do we have for now?
  • Basic communication between Matchbox/Tinderbox
  • Compiling works, with the file list/emerge environment/stdout/stderr/etc. being sent back to Matchbox

There is still much more ahead of us:

  • package selection on Matchbox side
  • block resolution on Tinderbox
  • rest of services (web interface, client, etc)
Since GSoC students didn’t get git repositories on Gentoo servers just yet, you can see the code in gentoo-collagen@github. So long and thanks for all the fish (for now).

Accepted for GSoC 2009

April 27, 2009 in en, google, gsoc, projects

A few weeks ago I mentioned that I applied for GSoC 2009 as a student. Things have cleared up a bit and I can now say that I’ve been accepted (YAY!). Soon I’ll start working to improve the quality of my beloved Linux distribution. And how better to do that than to scratch my own itch? I am now going to quote Eric S. Raymond:
Every good work of software starts by scratching a developer’s personal itch.

This quote is from one of the most interesting books written on software engineering, specifically with open source in mind. C’mon! I know you know the book! Yes, you guessed right, it’s "The Cathedral and the Bazaar".

Now the obvious question is… what’s my itch? I’ve been using Gentoo happily for over 4 years now and it’s getting better and better. One thing is still missing though. When emerging (that is, installing) a new application I never know how much space it will occupy once it’s on my hard drive; I only know the download size. I say that’s not enough! I want at least a ballpark figure for the size before I try to install some work of the devil. This is not exactly the focus of my project, "Tree-wide collision checking and files database", but it could easily become a byproduct of the solution to my GSoC task. I will most probably keep blogging about my work on the GSoC project. This will make it easier to sort out various thoughts and to create progress reports in the future. Oh, just so that I won’t forget: my mentor is Andrey Kislyuk, apparently a bioinformatics PhD student interested in privacy and security. I’d better get to know him, he seems like an interesting person :-).

Applying for Gentoo project in GSoC

April 11, 2009 in en, google, gsoc, open source, projects

I don’t know if you’ve heard of Google Summer of Code, but most probably you have. The basic description is:
Google Summer of Code (GSoC) is a global program that offers student developers stipends to write code for various open source software projects…Through Google Summer of Code, accepted student applicants are paired with a mentor or mentors from the participating projects, thus gaining exposure to real-world software development scenarios and the opportunity for employment in areas related to their academic pursuits. In turn, the participating projects are able to more easily identify and bring in new developers. Best of all, more source code is created and released for the use and benefit of all.

It sure is a great idea and I was always intrigued to participate; this year I finally decided to try it out. Gentoo, my distribution of choice, was chosen as one of the projects to mentor students. I read the ideas page and some of the ideas seemed pretty interesting. One of them was "Tree-wide collision checking and files database". What is the idea? Basically, Gentoo is a source-based distribution, so no one knows in advance what files will get installed with a certain package. This can be a problem for the quality assurance (QA) team. Emerge checks for file collisions before installing, so you will never screw up your system, but installing some less known packages can give you a headache. Implementing this idea should help fix that situation and could most probably be used for other purposes as well. One of them could be approximating the installed size of a package. This is common in binary distributions, because they know the package size, but Gentoo has a lot of small gotchas in this department. One of them is USE flags, a great way to customize your distribution but a nightmare for this sort of thing. Good thing emerge is such a great package manager :-). I will not repeat things I’ve said elsewhere, so if you want to read more, you can read my application.

If I get accepted (I should know on April 20th), there is a long way to go from where I am to a great project for Gentoo. Hopefully I’ll be able to keep up and help out a bit.

Power to the masses

March 2, 2009 in en, open source, projects, software

A few months back I had a rant about participation in open source. Things have moved a bit since then. I made a few more commits to the mob branch of the gstfs repository on repo.or.cz. More importantly, Bob Copeland got in touch with me and one more developer who had voiced his interest in the project, and offered to hand gstfs over to us. Understandably, he doesn’t have as much free time to spend on such a project as an average university student :-).

During the weekend I created a project page, a new code repository with my patches included, developer and user mailing lists and a few issue tickets. I just hope I will be able to keep up the work on the project at least a bit better than on this blog… Fingers crossed.

Kflickr hidden bugs and developer unfriendliness

January 13, 2009 in en, linux, open source, problem, projects, software

First of all… all hail our new overlord. And by overlord I mean the year 2009. I hope you all will have a great time. I know I will :-). I didn’t write for some time because I was travelling and then celebrating the holidays with my family and friends. All in all I didn’t have much time to keep things here up to date, not to mention doing anything resembling work. That’s changing NOW!

I recently bought a new camera (a lovely Nikon D90) and also decided I need to back up my photos to more than two places. After a few failed HDDs I realized you can never have enough backups. So what options was I considering?

  • Google’s Picasa: $20/year for 10 GB of storage space
  • Flickr: $25/year for unlimited storage and better sharing/privacy settings, presentation options, etc.

I didn’t consider other services because…well because I didn’t.

Now the issue was… how to upload all of my photos (several gigabytes)? Flickr has a client for Windows/MacOS, but not for Linux (the original client appears to work through Wine, though). Kflickr to the rescue! I started uploading photos in no time. But I wouldn’t be writing this blog entry if everything had gone according to plan, now would I?

Everything seemed to work; the photos were on the web. I could see them, organize them, tag them… you name it. Then I wanted to download the original file of a certain photo (for a reason I don’t remember). How great was my surprise when the file was <1 MB in size, while the original I had was ~3 MB. Something rotten here. The files were obviously recompressed with lower JPEG quality settings before being uploaded. Not all of them were this way, though; it seemed to have something to do with the license I used for the files. The power is in the source, Luke, so there I was. I wanted to investigate the problem and maybe fix it. Unfortunately, opening the Kflickr project files with KDevelop and trying to debug didn’t work. For some reason gdb was ignoring my breakpoints, as if the application had been compiled without debugging information, even though it was compiled with -g3 (all debugging info). So far I have been unable to properly diagnose the original bug, but I wrote to the author of Kflickr asking for information. Now let’s wait.

Opensource participation

October 24, 2008 in en, open source, projects, software

In my previous post I mentioned a project to transparently convert media files uploaded to my mp3 player. At first I wanted to create my own project from scratch, but then I searched around on the net and the FUSE wiki and found the GStreamer filesystem, or GSTFS for short.

GSTFS uses the GStreamer multimedia framework to handle media conversions and FUSE to create a virtual filesystem. Thanks to GStreamer, simple changes on the command line let you do almost any task concerning media files: converting music files and videos, resizing pictures and more.

I started playing with GSTFS, trying to convert my music collection to lower-bitrate mp3s. A simple ‘cp -R * music/ music_converted/‘ should have worked, but it didn’t. Why? Well, GSTFS shows not-yet-converted files as 0-sized files, and cp optimizes copying by not actually copying empty files; it doesn’t even try to read them. That meant running cp twice, since the second run would see the actual sizes. Even then there is a problem with expiration of the file cache, so if your music collection is more than a few files, you are out of luck.

And here we come to the great advantage of open source (at least for me). The source code of GSTFS is available, so I fixed this small bug and sent a few-line patch to the original author, Bob Copeland. I also asked if he could perhaps create a public repository of the sources on repo.or.cz. Interestingly enough, I was apparently not the only one to ask for it. And so, lo and behold, the Git repository of GSTFS is online. You can now find my work on improving GSTFS in the mob branch of the repository. Hopefully I will be able to contribute more to this great idea and my code will actually make it into the main branch :)

Sound quality is relative

October 16, 2008 in en, open source, projects, software

We all know that, right? Right. Who am I kidding? Most people don’t notice the difference between a 96 kbps, four-times re-encoded mp3 and FLAC. Of course it depends on your setup: headphones, mp3 players, all of it makes your experience better (or worse).

Since I bought my first Koss Porta Pro headphones I have realized that there are headphones and HEADPHONES. With some you don’t even hear the important parts, while others make you realize how much noise there is in the source :). And so my music collection is now mostly FLAC, high-quality Ogg or VBR mp3. Normally I don’t care about the size of music files; storage is not that expensive these days. But I have an old Cowon U3 music player (I still cannot find anything better) with only 2 GB of memory, and that’s where size comes into play. What I usually do is convert music to lower bitrates (usually 160 kbps VBR) before transferring it to the player. But I don’t keep those converted versions around, since I don’t have that much space lying around :). So I waste time choosing files to transfer, then converting them and finally copying them to the player.

Manually converting and then transferring files is kind of a bummer, though. Then I realized… I can actually program, right?! So how about making a FUSE virtual filesystem on top of the vfat filesystem on the player? This virtual filesystem would convert music files being copied to it into a specified format in the background. Processor speeds are fast enough to do this more or less in real time these days, so why not?

How will this affect my workflow? Compare:

Now:
  • Copy files to temporary directory
  • Convert big files to lower bitrates
  • Copy files to mp3 player

Virtual FS:
  • Copy files to mp3 player

So far this is just an idea. I don’t know of any other project doing the same thing (there are a few dealing with general data compression, but none specific to media files).

Expect more to come (just don’t expect deadlines :) ).

*EDIT* As it happens, there are already FUSE projects that do exactly what I had in mind. I guess I should check the page more often. The projects are GSTFS and MP3FS, of which the first one seems more promising and flexible.

Picasa Album Downloader roadmap

August 19, 2008 in en, google, java, projects, python

In my first post about the Picasa Album Downloader Java applet I promised more in-depth technical information about the project.

The project idea came when a few of my less computer-savvy friends wanted to download all the photos from my Picasa Web Album. So far there have been a few different ways to do that:

  • install Picasa and use it,
  • install some other software to download photos,
  • go photo-by-photo and download them one-by-one.

None of those methods is very user friendly. Why isn’t there a “Download album as zip archive” link on Picasa? I have a few theories, but that’s probably for another blog post :)

The question is: how to enable users to download Picasa albums easily? Apparently I was not the only one with the idea of creating a web service that builds a zip file for users to download. Fortunately, Google provides APIs for most of its services in a few languages. More precisely, you can access Picasa easily using:

  • .NET
  • Java
  • PHP
  • Python

Since I started learning Python step by step a few months ago, I thought about using it for the job. Then I realized that I would need hosting for the web service. There are not too many free Python hosting services, and those that are free usually have some restrictions.

Google even provides hosting using its own App Engine, with support for Python in particular. I created a simple prototype Python script that was able to download a selected album from a given user to a chosen output directory. It ran just fine when I was testing it, but stopped working when run inside dev_appserver.py. The reason? Hotlinking.

Picasa Web Albums checks the referer header, and if it doesn’t match one of the Google domains, Picasa blocks access to the images. Since App Engine doesn’t allow clearing the referer header, this effectively rules out using full-scale images from Picasa in App Engine. So Python is out. What else is there?

I don’t have much experience with .NET, and I also don’t think it would be suitable for a web application that is supposed to be free. I already had some experience with PHP, and for a project like this one it would probably do the job just fine. There was a problem though… the Google Data APIs need at least PHP 5.1.4 to work, but the hosting services I had at my disposal had lower versions installed.

Status? Python is out, .NET is out, PHP is out; Java it is. And here we are: the result is a Java applet that enables users to download a full Picasa album without installing any software. There is also a project page at Google Code. The first version took about one day to code. I released it under GPLv3, so if you want to contribute, you are welcome to do so. If you find any bugs or have ideas on how to make the applet better, let me know.

Picasa Album Download

August 14, 2008 in en, google, java, projects, python

Picasa Web Albums is a great service. As far as I can tell it has very few disadvantages compared to competing websites, although I have never used Flickr or similar services, so I am not really one to judge.

There is one thing about Picasa Web Albums that quite a few people have asked me about:

Can I download a whole album from Picasa at once, without having to click through all the photos one by one?

Well, I used to tell people to install Picasa on their computer, but less tech-savvy users had problems with this approach. Some companies also have restrictions on installing software on their networks. No wonder, with the amount of trojans, malware and similar things on the Internet these days. Getting rid of them can take forever…

I found quite a few projects dealing with downloading from Picasa. All of them required installing some application (or at least downloading one). The perfect solution? A web service.

As an aspiring software engineer (pun intended) I set out on a quest to solve this problem once and for all. The goals:

  • Download a complete Picasa Web Album to the computer without having to install anything first
  • Multiplatform (Windows, Linux, MacOS X, …) support; ideally a solution requiring only a browser

Simple, right? Well, yes and no. I will publish the technical details and solutions I tried in some other post (edit: I already did). Now, without further ado, I present to you: