F2 40 01 2A (some notes on SIMD, instruction sets and everything)

I’ve followed in the footsteps of Måns and got myself a Gdium too. Since it’s no fun owning a less widespread computer architecture and not writing anything for it, I’ve tried SIMDifying the easiest operations one can do on it: vector sum, vector subtraction and vector scalar product. And I have a decoder that uses those operations extensively, so why not benchmark it a bit?
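For reference, the three operations are simple enough to write down in plain C. This is only a sketch of what the scalar versions might look like, assuming 16-bit samples and a 32-bit accumulator; the function names are made up for illustration and are not taken from the decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference versions (names and 16-bit element
 * type are assumptions, not the decoder's actual code). */

/* v1[i] += v2[i]; the cast truncates back to 16 bits. */
void vector_add(int16_t *v1, const int16_t *v2, size_t len)
{
    for (size_t i = 0; i < len; i++)
        v1[i] = (int16_t)(v1[i] + v2[i]);
}

/* v1[i] -= v2[i] */
void vector_sub(int16_t *v1, const int16_t *v2, size_t len)
{
    for (size_t i = 0; i < len; i++)
        v1[i] = (int16_t)(v1[i] - v2[i]);
}

/* Sum of v1[i] * v2[i]; assumes the total fits in 32 bits. */
int32_t scalar_product(const int16_t *v1, const int16_t *v2, size_t len)
{
    int32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += v1[i] * v2[i];
    return sum;
}
```

Every iteration is independent (or a plain reduction in the last case), which is what makes these loops easy targets: a 64-bit SIMD register handles four such 16-bit elements per instruction, a 128-bit one handles eight.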

The test sample was the first 26 seconds of a Monkey’s Audio file with insane compression, since this mode uses the longest filters and benefits from SIMD the most (and is slow enough even for short samples ;). In all cases I’m the one who wrote the SIMD code, so it’s fair 🙂

PowerPC (Freescale 7447A, 1.42 GHz): 25 seconds without SIMD, 6 seconds with SIMD
MIPS (Loongson 2F, 900 MHz): 37 seconds without SIMD, 7 seconds with SIMD
ARM (Cortex-A8, 600 MHz): 138 seconds without SIMD, 22 seconds with SIMD
x86 (Intel Atom N270, 800 MHz): 50 seconds without SIMD, 9 seconds with SIMD
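To compare the architectures it helps to turn each pair of timings into a speedup factor. The helper below is a trivial sketch and assumes that the first number in each pair is the time without SIMD and the second the time with SIMD; the function name is hypothetical.

```c
/* Speedup factor: how many times faster the SIMD run is.
 * Assumes plain_seconds is the non-SIMD time and simd_seconds
 * the SIMD time (an interpretation of the figures above). */
double speedup(double plain_seconds, double simd_seconds)
{
    return plain_seconds / simd_seconds;
}
```

Under that assumption the factors come out at roughly 4.2 for PowerPC (25/6), 5.3 for Loongson (37/7), 6.3 for ARM (138/22) and 5.6 for Atom (50/9). With SIMD every machine decodes the 26-second sample faster than real time; without it, only the PowerPC does.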

Mind you, the SIMD instructions in Loongson are custom for that CPU and modelled after MMX (64-bit registers, actually reusing the FPU regs, similar names), but at least they are done in RISC fashion, i.e. you can store the result in some other register.

Out of interest I’ve also looked at the binary representation of SIMD instructions. On x86 the principle is to prefix the SIMD instruction (usually with the 0x66 “opcode for CPU with half of current bits” byte), so SSE7 instructions will look like instructions for a 1-bit FPU on some Intel 4004 predecessor and will take 8-16 bytes to represent.
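A concrete case of the prefix trick: PADDW on MMX registers is encoded as 0F FD /r, and the SSE2 form on XMM registers is the same bytes with 0x66 in front. The sketch below just stores both encodings and checks that they differ only by that leading byte; the array and function names are made up for illustration, and the check is deliberately naive (it knows nothing about other prefixes or REX).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* paddw mm0, mm1   (MMX):  0F FD C1 */
const uint8_t paddw_mmx[]  = { 0x0F, 0xFD, 0xC1 };
/* paddw xmm0, xmm1 (SSE2): 66 0F FD C1 */
const uint8_t paddw_sse2[] = { 0x66, 0x0F, 0xFD, 0xC1 };

/* Returns 1 if `prefixed` is exactly `plain` with a single 0x66
 * byte in front of it, 0 otherwise. */
int differs_only_by_66(const uint8_t *plain, size_t plain_len,
                       const uint8_t *prefixed, size_t prefixed_len)
{
    if (prefixed_len != plain_len + 1 || prefixed[0] != 0x66)
        return 0;
    return memcmp(plain, prefixed + 1, plain_len) == 0;
}
```

Calling `differs_only_by_66(paddw_mmx, sizeof paddw_mmx, paddw_sse2, sizeof paddw_sse2)` returns 1: the byte that normally switches operand size is what turns the 64-bit MMX instruction into its 128-bit sibling.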

Other architectures use a simple 32-bit word for any instruction. NEON (on ARM) and AltiVec (PowerPC) use some opcodes in the general instruction space, while Loongson 2 SIMD instructions are custom calls to the second coprocessor.
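With fixed 32-bit words, classifying an instruction is a matter of looking at a few bit fields. Both PowerPC and MIPS keep a 6-bit primary opcode in the top bits of the word: AltiVec arithmetic instructions sit under PowerPC primary opcode 4, and MIPS coprocessor 2 instructions under opcode 0x12. The sketch below only checks that field; the function names are hypothetical, ARM is left out, and a real disassembler would of course look at the remaining bits as well.

```c
#include <stdint.h>

/* Top 6 bits of a 32-bit PowerPC or MIPS instruction word. */
unsigned primary_opcode(uint32_t insn)
{
    return insn >> 26;
}

/* PowerPC: AltiVec arithmetic lives under primary opcode 4.
 * (AltiVec loads/stores use opcode 31 and are not caught here.) */
int ppc_is_altivec_arith(uint32_t insn)
{
    return primary_opcode(insn) == 4;
}

/* MIPS: coprocessor 2 instructions use primary opcode 0x12. */
int mips_is_cop2(uint32_t insn)
{
    return primary_opcode(insn) == 0x12;
}
```

For example, the word 0x10000000 (vaddubm v0,v0,v0 on PowerPC) has primary opcode 4, and any MIPS word starting at 0x48000000 falls into the coprocessor 2 space.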

Talking about instruction sets, I cannot omit the fact that IDA 5.2 sucks at disassembling PowerPC code (not only AltiVec but some of the core instructions too) and objdump sucks at disassembling the Mac OS X format (it ignores the internal structure and disassembles it as a raw file). That looks like the reason why we don’t have Apple Intermediate Codec RE’d yet.

P.S. I would gladly get AVR32, Blackfin, ColdFire and other exotic CPUs. Alpha or SPARC would be good too, but that is just unrealistic, I think.

6 Responses to “F2 40 01 2A (some notes on SIMD, instruction sets and everything)”

  1. conrad says:

    Not completely fair: Cortex-A8 lacks hardware division, which APE uses in its range coder 😉
    Though I don’t know just how much that matters for insane compression; for whatever APE files I have it was over 1/3 of the overall time and the DSP functions didn’t even show up in the profile.

  2. Which objdump are you using to try disassembling the Mac OS X format? On Fedora and other distributions, the objdump command comes from elfutils and only supports ELF files; on Gentoo if you don’t enable the multitarget USE flag in binutils, it also will only support ELF files (but should support Mach-O fine when using multitarget).

  3. Kostya says:

    I’ve simply compiled binutils on MacOSX.

  4. diego_not says:

    you dont need to disassemble PowerPC code. ArcSoft MediaImpressions 2 (for windows) is capable of importing .mov files encoded with “ICOD” codec. there’s a trial version you can download. you might be interested in this file: PlugIn_Import\MOVImport.dll

  5. Mans says:

    @kostya: ARM/NEON instructions use the VFP coprocessor instruction space.

    @conrad: With increasing compression levels, the DSP functions quickly overtake the range coder in cpu time.

  6. Zino says:

    I can probably dig up a Sparc and possibly some Alpha, but I suspect that the Alphas that were left over got thrown out earlier this year. Get in touch at zinoatlysatorliuse and we’ll take a look at what we can do.