May 4th, 2009
Schemes of Dynamic Dispatch under Study
The performance of three dynamic dispatch schemes was examined. I also pontificate on the merits of my favorite schemes.
Virtual-table: Objects contain a pointer to a function table. Each method has a fixed offset within the table.
Function-table: There is one table for each method. Objects contain an integer identifying the object's type, which is used as an index into the table.
Switch-statement: Objects contain a type-integer like they do in the function-table scheme. The switch executes the correct function depending on the integer key.
Virtual-table dispatch is used in classed OO languages because the vtable corresponds to a class. It is a data-centric approach: the dispatch mechanism is associated with the data type rather than with any specific function type. A method name can be overloaded with different function types across different struct definitions of the vtable. Extending a copy of the table corresponds well to single inheritance.
The latter two schemes are more similar to each other than they are to the first. They are function-centric because the dispatch mechanism is associated with a function type rather than a data type. An object passed to these dispatchers contains an integer type-id that serves as an index in a dispatch table or a key in a switch statement. Objects in this scheme are unclassed, though nothing prohibits associating the type integer with a class-like structure. That class-like structure is, in fact, one interpretation of the dispatch table.
- Function-table at a static address.
- Function-table at a dynamic address.
- Function-table with fix-up subtraction.
- Switch-statement with inline method definitions.
- Switch-statement with few, ordered keys.
- Switch-statement with few, dispersed keys.
- Switch-statement with many, dispersed keys.
- Direct call (for comparison).
The essence of each scheme is in the call instruction. '-S -fverbose-asm' is kind enough to show the names of static addresses.
Virtual table
| call *16(%edx)
The constant 16 corresponds to the offset of the method in the table. %edx holds the address of the table.
Static function table
| call *func_table-3996(,%edx,4)
func_table is the static address of the function table. %edx contains the integer corresponding to the object's type. The constant 4 is the size of a pointer on my machine. Finally, 3996 is the index offset multiplied by 4. The index offset would be 0 for a table beginning at type 0, but a subtype might occupy a sub-range of indexes. Rather than build a sparsely populated table for this subtype, a smaller table can be used with a fix-up subtraction computed at compile time.
Dynamic function table
| call *(%edx,%eax,4)
%edx contains the address of the function table. %eax again contains the object's type-id. Fix-up subtractions would obviously need to be computed at run-time. I didn't test this combined with a fix-up subtraction because I consider it pretty useless anyway. I included this test for completeness since a virtual-table is inherently dynamic. Only the method offset is fixed in the virtual-table.
Switch statement
| call compiler black magic
It should be noted that this scheme semantically performs two direct calls for each dynamic method call. The implementation could be placed in the switch to eliminate function call overhead but it may be beneficial in some cases to have separate implementations and treat the switch statement like an interface declaration. In-lining removes the double-call overhead and can be counted on within a compilation unit.
Most popular benchmark algorithms do not depend on dynamic dispatch. I just did the naive thing and called a function recursively. It can be tricky to fool gcc into making a wheel-spinner that does no real work. For example, it is capable of rearranging the statements of a direct call that is not tail recursive into one that is, then unrolling it, then placing it in-line in the caller.
Timing points are in the code because instrumenting the executable for profiling caused the execution time to double. The key to using gprof might be to turn off instrumentation of the function calls under test and just time the top-level. However, measuring with the real-time clock gives results that are consistent enough to trust.
Times are in microseconds.
Scheme                              -O0      -O1      -O2      -O3     -O3*
Classed virtual:                1404442  1506692  1505832  1505480  1379984
Function table:                 1501014  1506174  1505468  1506294  1384040
Function table with subtraction:1559695  1506311  1505384  1505536  1380380
Function table dynamic:         1607794  1506651  1506625  1506019  1505872
Switch inline code:             1823262  1311783  1376533  1476872  1439144
Function switch:                2612194  2588107  1982367  1476428  1438757
Function switch large:          3633052  3026327  2518238  2517948  2248245
Function switch dispersed:      3003073  2549603  2046317  1647105  1506101
Direct call:                    1348344  1368460  1366636   516581   509054

*-O3 -fomit-frame-pointer -mtune=native
These represent a single run. I computed no averages and made no attempt at statistics. However, on repeated runs the times were consistent to within two milliseconds, which is significantly smaller than the differences between dispatch schemes.
Conclusion, extrapolation, even some guessing and recommendations
Small switch statements do better than large ones, and dispersion of keys has only a small negative effect. For this reason it may be beneficial to reserve switch statements for dispersed keys while using table dispatch for grouped keys. The switch can actually best the table for a small number of keys but then its advantage depends on compiler in-lining or placing implementation code inside the switch. Even if the compiler succeeds in translating the switch to a jump table the double call elimination still depends on in-lining. A macro definition for explicit table dispatch is short and has guaranteed pretty-good performance. It could easily be converted to a switch, or even a switch macro, for the absolute best performance.
One technique might be to treat switch statement/function-table dispatch as one would an interface. Methods like "show", "display", "draw" and other things that usually go in base classes as stubs are good candidates for an interface.
Methods that need to be overloaded for each type definitely benefit from the virtual-table since the virtual-table can contain different types for the method name across different class structures. The advantage of the virtual-table is overloading and inheritance, but if a function is overloaded at the call site then the object type is probably known, and if its type is known then it should be a direct call anyway. Virtual-tables may also have a performance advantage for types that are dynamically created since the table is found through a pointer regardless. The performance of type-indexed table dispatch theoretically relies on having a static table address but my data doesn't show any difference in performance with optimizations turned on. Only -O0 shows any significant difference between method-indexed virtual-tables and different schemes of type-indexed tables.
Large switch statements with dispersed keys perform poorly, but this case should not often arise in practice since type-id integers can be assigned sequentially. It is notable that the widest range in performance occurs in switch statements because they give the compiler the most room for optimization. These numbers show that a switch on type-id is the fastest possible dispatch scheme if implementation code is placed in the switch rather than in a separate method. But that's messy. In-lining in -O3 is capable of removing some of the function call overhead, and the resulting call-switch-call sequence performs slightly better than a function-pointer table. The best number here is 200,000,000 calls in 1376533 microseconds for a switch with explicit in-line code compiled with -O2. -O3 brings the performance down to 1476872 microseconds, which is nearly equivalent to the switch that calls a function for each case. The performance is identical in -O3 because of gcc in-lining. With some different tuning parameters the virtual-table and function-table are just as fast.
Personally, I prefer the interface style that jives with the type-indexed schemes. They are interchangeable, don't need a class, and perform as well as virtual-tables. The style lends itself more to multiple interface inheritance, similar to Haskell type-classes with monomorphism turned off. If a type (which is an integer index) has interfaces A and B (which are table or switch dispatched functions), then it may implement C (which is dispatched the same way).
This excellent style of multiple inheritance can also be done in C++ with one pure virtual base class for each set of functions that comprise an interface. But because C++ uses vtables, each object must keep a pointer for every interface from which it derives. I can't help but think a type-id implementation would make more sense in this case too. Then each object would have exactly one type-id and every pure virtual base class would be a table of implementations for those type-ids. Maybe I'm ignoring practical issues like linking and horribly sparse tables, but what about the practical issue of keeping objects small? Virtual tables yield big objects and small tables. Type-ids yield big tables and small objects. I ran this same test with C++ and the performance matched the vtable here. Automatic casting to the interface reference didn't appear to affect performance.