Boehm Garbage Collector

November 10th, 2007

Toying with the Nokia 770 has been teaching me more than I ever wanted to know about C, gcc, and the other things we do to write software for handhelds. I had known about the existence of garbage collectors for C and C++ but had always assumed they worked similar to GObject. A GObject programmer must call g_object_ref() and g_object_unref() to manipulate the reference count of each object on the heap. When the count reaches zero the space is reclaimed. A Boehm GC, rather, can function as a drop-in replacement for malloc() and free(). To my amazement the free() procedure isn’t required at all and can be replaced with a noop. The word “automagically” doesn’t even annoy me in this case since the Boehm GC is quite a trick. It can determine when an object on the heap is no longer required by comparing places where pointers usually live with it’s internal book. For example, a function that calls malloc usually sets a pointer on the stack to the allocated memory. When the stack shrinks the reference is gone and the dead memory can be automatically freed.

The main benefit, for me anyway, is that I can write consing functions in C. Any linked list implementation has a foreach method. Foreach is easy because it doesn’t have to keep the return value of its lambda argument. Map, filter, and almost every other good thing that comes from functional programming requires consing up a new list. Allocating the right amount of memory for each return value is doable. The difficult part is the bookeeping required to later free all the unreferenced lists. A functional program might map, filter, then reduce in a single line as it takes the GC for granted. Manual allocation is a better fit for building fewer, less disposable structures. I’m not sure I’ll use map that much with C but it would be a good way to test-drive the Boehm GC.

There has to be a catch. I’m still wide eyed that the Boehm GC even works. It might be a sign that Mozilla and X, the two biggest hogs on my system, use it.

Why directories?

May 8th, 2007

I had this crazy idea the other day that I can’t get out of my head. It’s sort of Copernican revolutionist and perhaps plain wrong so I don’t blame you if you disagree.

I was thinking it might be for the best if we could scrap directories in the filesystem. A regular file would then be linked to children, along with it’s stream. In other words, merge directories and files into one type. I suppose Dennis Ritchie thought about these things and implemented directories as a separate type from regular files because they ARE a separate type. Regular files are streams of data. Directories are a list of files, both regular and not. I should thank my lucky stars that the two are polymorphic enough to be listed along side each other when I type ls.

From a user perspective though it is sort of bothersome. I remember an old version of Word asking me if I wanted to link to a picture or include the picture data in file. Now, if Word files were implemented as directories we wouldn’t need to let the application worry about such things (assuming FAT had symlinks). I could cd mywordfile.doc and inspect the contents to see for myself whether or not it contained images or links to images. Database implementations almost always use a directory per database; why not every other program? Usually I want to create a rich (as in treelike) document, not a stream, but the application saves my data as a stream. Shouldn’t the OS handle this?

Positives of blurring regular files and directories

  • One less first class data-type. Our filesystems now have at least three; regular files, directories, and sym-links. Hardlinks are not really first class. What I am proposing is to scrap the concept of directories and hardlink files in a hierarchical manner. I would never want to lose symlinks.
  • No more non-standard formats for embedded images, etc.
  • No more tar. One less step to do before or after transporting data.
  • Less multipart-mixed sections in email because we could attach one file that contained the others.
  • Less XML on disk, more on the wire. Perhaps if it were less of a chore to transmit a directory people would use them more often. We would see configuration files, hierarchical in nature like MS registry files, that use the OS instead of XML.

Negatives and Unknowns

  • We can already accomplish this by using a convention like index.html, where the directory is made to contain data of it’s own as if it where a regular file.
  • The transport problem is largely solved by tar, and I do mean largely. tar—help is 244 lines. Maybe email clients should open tar files instead of expecting us to include all the components of a document via multipart-mixed.
  • Possibly lots of small files eating up blocks.
  • Large number of open file handles.
  • How would file permissions work? Same as they do now?
  • Dennis Ritchie was probably right with our current implementation. The best way to avoid a leaky abstraction is to avoid abstraction. Directories and regular files are different enough that they warrant exposing different types to the user.
Finding a file by filename takes a while (far too long) with the GNU find(1) command. Here's one solution. I've added this to my crontab to update a list of local files every morning.

# Get list of files for grepping at my leisure
33 03	* * *	root	find / -mount > /root/files
Now look at the time I'm saving. Here I count the C files on the local disk using both a cached, and a non-cached list of files.

$ time sudo find / -mount -name *.c | wc
    265     265   15190

real    1m17.288s
user    0m1.168s
sys     0m2.772s

$ time sudo grep "\.c$" /root/files | wc
    265     265   15190

real    0m0.309s
user    0m0.280s
sys     0m0.020s
Of course you need to set logical permissions on the /root/files file. And note that on my system this is a 35MB file with almost 500,000 lines so be careful not to open it in MS Word :) *** Update 05-08-2007 slocate is better!