Jan 25 2020 Programming with RISC-V Vector Instructions

Perhaps the most interesting part of the open RISC-V instruction set architecture (ISA) is the vector extension (RISC-V "V"). In contrast to the average single-instruction multiple-data (SIMD) instruction set, RISC-V vector instructions are vector length agnostic (VLA). Thus, a RISC-V "V" CPU is flexible in choosing a vector register size while RISC-V "V" binary code is portable between different CPU implementations.

This article compares the two main styles of vector ISAs, discusses a string processing example that is implemented using RISC-V "V" draft version 0.8 (current as of early 2020) vector instructions, and details how to set up a RISC-V "V" development environment under Linux.

SIMD Challenges

With a vector length specific (VLS) SIMD instruction set, the main problem is picking the right vector register size. Of course, there is a trade-off between the amount of data-level parallelism and hardware costs. Due to Moore's law, vector register sizes can be increased over time without making the CPU chip more expensive. Also, some users are interested in powerful CPUs with wider vector registers while the average user is fine with average-sized registers. Thus, there is no one right vector register size. This shows, for example, with x86, where the answer is to provide one VLS ISA after the other, such as MMX (64 bit registers), SSE (128 bit), AVX (256 bit) and AVX-512 (512 bit).

Because of backward compatibility, each CPU that adds a new VLS ISA also has to support all existing ones. This wastes opcode space and increases the complexity of the CPU's instruction decoder. Of course, this also increases the complexity for the programmer, who then has to remember (or look up all the time) the syntactic and functional differences between all the VLS ISAs.

That means that while VLS code written for smaller vector registers runs on newer CPUs, it can't make use of the wider vector registers. Thus, existing code has to be reimplemented again and again to make use of new VLS ISAs. Similarly, code written for high-end CPUs doesn't run on mid-range CPUs (because it requires the VLS ISA with wider vector registers). Thus, one either has to target some older (hopefully widely available) VLS ISA or has to provide multiple implementations for different VLS ISAs.

The Solution: Agnosticism

The solution to all this is to design a variable length vector instruction set. In that way the instructions are then agnostic to the vector register size of a concrete CPU implementation. Thus, the binary code is portable between low, middle and high-end CPUs, and automatically makes use of wider registers in newer CPUs.

The RISC-V vector extension "V" implements such a vector instruction set. As of early 2020, the RISC-V "V" specification is at version 0.8 and has draft status.

RISC-V "V" adds 32 vector registers, where the first register can be used as mask register and up to 8 registers can be grouped together. The operands of a vector instruction such as vadd.vv are single vector registers or vector register groups.

Since vector registers are of variable length, RISC-V "V" code has to indicate the maximum vector length it wants to work with, e.g.:

vsetvli t0, a2, e8

Meaning that a vector length (vl) of up to a2 8 bit wide (e8) elements is requested while the instruction returns the resulting length in register t0. Thus, if the a2 register is set to - say - 4096, on a CPU with a vector register length (VLEN) of 128 bits, the following vector instructions work on 16 element wide vectors and t0 is thus set to 16, while on a CPU with 512 bit registers the vectors are configured to be 64 elements wide and t0 is set to 64.
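
As a rough mental model, the effect of vsetvli can be sketched in C. This is only a simplified sketch: it assumes that the granted length is simply the minimum of the requested length and the maximum the hardware supports for the configured element width. The function name model_vsetvli and its parameters are hypothetical and chosen just for this illustration; they aren't part of any RISC-V API.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of vsetvli: returns the vector length (vl) that is
 * granted for a requested number of elements (avl).
 *   vlen_bits: vector register width of the (hypothetical) CPU
 *   sew_bits:  configured element width, e.g. 8 for e8
 *   lmul:      number of grouped registers, 1 without grouping
 * Assumption: vl = min(avl, VLMAX). */
size_t model_vsetvli(size_t avl, size_t vlen_bits, size_t sew_bits,
                     size_t lmul)
{
    assert(sew_bits > 0 && vlen_bits % sew_bits == 0);
    size_t vlmax = lmul * vlen_bits / sew_bits; /* elements per operand */
    return avl < vlmax ? avl : vlmax;
}
```

With that model, model_vsetvli(4096, 128, 8, 1) yields 16 and model_vsetvli(4096, 512, 8, 1) yields 64, which matches the two CPUs discussed above.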

This approach also simplifies loops that iterate over an input array in vector length chunks. For example (where a1 contains the address of an array of a2 times 4 bytes):

.Loop:                        # local symbol name because of .L prefix
    vsetvli t0, a2, e32       # configure vectors of 32 bit elements

    vlw.v   v4, (a1)          # Load t0 elements into v4,
                              # starting at the address stored in a1

    ...                       # work with that chunk

    slli    t1, t0, 2         # shift-left logical, i.e. times 4
    add     a1, a1, t1        # increment src by read elements
    sub     a2, a2, t0        # decrement n
    bnez    a2, .Loop         # branch to loop head if not equal to zero

    ...                       # continue

In cases where a2 isn't a multiple of the maximum vector length, the last iteration sets the vector length to a smaller value and the following vector instructions ignore the unused trailing elements. This implicit masking mechanism is orthogonal to the optional mask operand that is supported by most RISC-V vector instructions.
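
To make the control flow of that loop more tangible, here is a small C model of it. It is only a sketch under stated assumptions: the helper name stripmine_increment is hypothetical, the vlmax parameter stands in for whatever length the CPU grants, and the inner for loop stands in for a single vector instruction (here: add 1 to each element).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adds 1 to each of the n elements of a, in chunks of at most vlmax
 * elements, mirroring the structure of the assembly loop.
 * Returns the number of loop iterations. */
size_t stripmine_increment(uint32_t *a, size_t n, size_t vlmax)
{
    assert(vlmax > 0);
    size_t iterations = 0;
    do {
        size_t vl = n < vlmax ? n : vlmax;  /* vsetvli t0, a2, e32 */
        for (size_t i = 0; i < vl; ++i)     /* one vector instruction */
            a[i] += 1;
        a += vl;                            /* slli + add a1, a1, t1 */
        n -= vl;                            /* sub a2, a2, t0 */
        ++iterations;
    } while (n != 0);                       /* bnez a2, .Loop */
    return iterations;
}
```

For 10 elements and a maximum vector length of 4, the model runs three iterations with lengths 4, 4 and 2, i.e. the shorter last chunk needs no separate code path.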

In contrast to that, with a vector length specific ISA, the main loop usually has to be followed by some finalization code block to explicitly deal with the last elements that don't fill a complete register, e.g.:

const unsigned char *p = inp;
size_t l = n / (VECTOR_LENGTH * ELEMENT_BYTES);
for (size_t i = 0; i < l; ++i, p += VECTOR_LENGTH * ELEMENT_BYTES) {
    ... // load p into a vector register
    ... // execute some vector instructions
}
// deal with the remaining bytes that don't fill a complete register,
// e.g. by setting up a mask or by working on single elements
for (size_t i = l * VECTOR_LENGTH * ELEMENT_BYTES; i < n;
     i += ELEMENT_BYTES, p += ELEMENT_BYTES) {
    ... // work on the next element located at p
}

Example

To illustrate RISC-V "V" with a real example, this section shows how to implement a vectorized function that converts a string of binary coded decimals (BCD) into an ASCII string. Why BCD to ASCII conversion? The task is complex enough that many different kinds of vector instructions are used. On the other hand, it's simple enough to fit into a small article and doesn't require domain specific knowledge. It also demonstrates some perhaps not entirely obvious ways in which vector instructions are used for string processing, where those instructions could be assumed to only be useful for calculations.

With BCD, a byte (8 bits) is divided into two nibbles (4 bits) such that each nibble stores a (hexa-)decimal digit. Note that 4 bits encode exactly 2⁴ = 16 values; thus, when using a nibble just for storing one of the 10 decimal digits, it's not a very efficient encoding.

For the purpose of our example, the exercise is to write vector code that efficiently converts a BCD string such as { 0x12, 0x34, ..., 0xcd, 0xef } to a corresponding ASCII string (e.g. { '1', '2', '3', '4', ..., 'c', 'd', 'e', 'f' }). At a high level, a solution involves separating the nibbles into single bytes and then converting each byte to the matching ASCII value.
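
For reference, a plain scalar version of that conversion might look like the following C sketch. The name bcd2ascii_ref is a hypothetical one chosen for this illustration; the function isn't part of the vectorized example, but something like it is handy for checking vector code against known-good output.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: reads n BCD bytes from src and writes 2*n ASCII
 * characters to dst (without a terminating NUL byte). */
void bcd2ascii_ref(void *dst, const void *src, size_t n)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *s = src;
    char *d = dst;

    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; ++i) {
        d[2 * i]     = digits[s[i] >> 4];    /* high nibble comes first */
        d[2 * i + 1] = digits[s[i] & 0x0f];  /* then the low nibble */
    }
}
```

Both steps of the high-level solution are visible here: the shift and the mask separate the nibbles, and the digits array converts each nibble into its ASCII character.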

The complete example source code is available in my github repository.

Shuffling Nibbles

Our function has the following signature:

void bcd2ascii(void* dst, void const * src, size_t n);

Meaning that n input bytes are read from src and the conversion writes 2*n bytes into the dst output buffer. Under the RISC-V calling conventions, dst is passed in register a0, src in register a1 and n in register a2.

.Loop:                        # local symbol name because of .L prefix
    vsetvli a3, a2, e16, m8   # switch to 16 bit element size,
                              # 4 groups of 8 registers
    # --> a3 = min(a2, 8*vlenb/2)
    vlbu.v v16, (a1)          # Load a3 unsigned bytes,
                              # one byte per 16 bit element, zero-extend,
                              # starting at addr stored in a1
    # --> v16 = | 0, a1[vlenb/2-1], ..., 0, a1[1], 0, a1[0] |, ...,
    #     v23 = | 0, a1[a3-1],       ...,  0, a1[7*vlenb/2] |
    # --> v16 = | ... 00mn 00kl 00ij 00gh |

    add a1, a1, a3            # increment src by read elements
    sub a2, a2, a3            # decrement n

The main loop starts with configuring a vector element size of 16 bit (e16), grouping 8 registers together (m8) and requesting a vector length that equals the number of remaining source bytes or the CPU maximum, whichever is smaller. With this grouping, each register group is accessed by using a vector register with a number that is divisible by 8. That means v0 identifies the group consisting of v0, v1, ..., v7, v8 identifies v8, ..., v15, etc.

The vl*.v load instruction comes in different variants. Here, the vlbu.v variant zero-extends each input byte into a 16 bit element, which is useful in our example because this directly leaves room for shuffling the nibbles. In other words, it's a widening load and thus saves a separate widening operation such as vwaddu.vx.

That means on CPUs with 256 bit vector registers, this code loads up to 128 input bytes into the v16 register group.

Note that register content in the comments is enclosed in | | and written right to left, starting with the least significant element. Arbitrary nibbles are sometimes denoted by placeholder variables such as g, h, ....

The actual nibble shuffling:

vsll.vi v24, v16, 8       # shift-left-logical each element by 8 bits
# --> v24 = | ... mn00 kl00 ij00 gh00 |

vsrl.vi v16, v16, 4       # shift-right-logical each element by 4 bits
# --> v16 = | ... 000m 000k 000i 000g |

slli a3, a3, 1            # shift left logical by immediate,
                          # i.e. to double the number of vector elements
vsetvli t4, a3, e8, m8    # switch to 8 bit element size,
                          # 4 groups of 8 registers

vand.vx v24, v24, t2      # and each element with 0x0f,
                          # i.e. zero-out the high nibbles
# --> v24 = | ... 0n 00 0l 00 0j 00 0h 00 |
vor.vv  v16, v16, v24     # or each element
# --> v16 = | ... 0n 0m 0l 0k 0j 0i 0h 0g |

So far the example shows most of the syntactic conventions of the "V" ISA. Vector instructions start with v, and a suffix such as .vi, .vx or .vv describes the source operand types, i.e. vector-immediate, vector-scalar and vector-vector.

The bit-shift instructions don't cross element boundaries. Thus, just vector group v24 has to be zero-masked and not v16. The mask is located in register t2 which is set before the loop start.

Switching the vector register configuration to 8 bit elements (e8) at this point allows using 0xf as the mask value instead of the larger 0xf00. Thus, it fits into the immediate operand of the load immediate instruction such that one additional instruction is saved (i.e. addi t2,zero,15). It even fits into the immediate operand of the compressed load immediate instruction (i.e. c.li), which encodes into just two bytes instead of the regular four.

The final clean result of separated digits is located in vector group v16.
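
The shuffle can also be followed on a single 16 bit element in plain C. The following is just a sketch: the function name shuffle_nibbles is hypothetical, and the byte-wise and with 0x0f is modelled with the 16 bit mask 0x0f00, which is equivalent here because the low byte of the shifted value is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Models the nibble shuffle for one 16 bit element that holds the
 * zero-extended input byte 0x00gh. Returns 0x0h0g, i.e. in little-endian
 * byte order first the byte 0x0g and then the byte 0x0h. */
uint16_t shuffle_nibbles(uint8_t bcd)
{
    uint16_t e  = bcd;                 /* vlbu.v:   00gh */
    uint16_t hi = (uint16_t)(e << 8);  /* vsll.vi:  gh00 */
    uint16_t lo = (uint16_t)(e >> 4);  /* vsrl.vi:  000g */
    hi &= 0x0f00;                      /* vand.vx:  0h00 */
    assert((lo & hi) == 0);            /* the halves never overlap */
    return (uint16_t)(lo | hi);        /* vor.vv:   0h0g */
}
```

For the input byte 0x3a this returns 0x0a03: read as two bytes in memory order, that is 0x03 followed by 0x0a, so the high nibble ends up first, as required for the output string.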

Converting Bytes

The actual conversion is done in one instruction:

vrgather.vv v24, v8, v16
# --> v24[i] = (v16[i] >= VLMAX) ? 0 : v8[v16[i]]

Here, vector group v8 is used as table to look up the ASCII values. That means the v8 lookup table maps the integers {0, 1, 2, ..., 0xd, 0xe, 0xf } to the ASCII characters { '0', '1', '2', ..., 'd', 'e', 'f' }.
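
The gather semantics from the comment above can be written down as a small C model for 8 bit elements. This is a sketch only: model_vrgather and its parameter names are hypothetical, and the model simply requires that the destination buffer is distinct from both source buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise model of a gather with 8 bit elements:
 * vd[i] = index[i] >= vlmax ? 0 : table[index[i]]
 * table must provide at least vlmax readable elements. */
void model_vrgather(uint8_t *vd, const uint8_t *table,
                    const uint8_t *index, size_t vl, size_t vlmax)
{
    assert(vd != table && vd != index);  /* model needs distinct buffers */
    for (size_t i = 0; i < vl; ++i)
        vd[i] = index[i] >= vlmax ? 0 : table[index[i]];
}
```

With the 16 entry table "0123456789abcdef" and the indices { 1, 15, 10 }, the model produces '1', 'f' and 'a'; an out-of-range index produces 0.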

Of course, this lookup table has to be constructed before the loop is entered:

li a6, 16                 # load immediate (pseudo instruction)
vsetvli t0, a6, e8, m8    # switch to 8 bit element size,
                          # i.e. 4 groups of 8 registers

vid.v v8                  # store Vector Element Indices,
                          # i.e. v8 = | 15, ..., 2, 1, 0 |
vmsgtu.vi v0, v8, 9       # set mask-bit if greater than unsigned immediate
# --> v0 = | 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |

li a7, 48                 # load immediate, i.e. '0'
vadd.vx v8, v8, a7        # add that scalar to each element

addi a7, a7, -9           # add immediate, i.e. set to 39 == 'a'-'0'-10,
                          # i.e. to arrive at 'a', 'b', ...
vadd.vx v8, v8, a7, v0.t  # masked add for the additional offset

Configuring a grouping of 8 registers for a vector of 16 elements might look like overkill because 128 bit vector registers are sufficient and should be widely available. On the other hand, there might be a CPU with "V" support that just implements - say - 64 bit vector registers where we would need to group 2 registers. Since a grouping thus may be needed it really doesn't hurt to configure the maximum here.

The v0.t syntax is just a marker that v0 is used as mask. Note that masks always just consist of one vector register, even if register groups are configured. With the current "V" 0.8 draft, the v0 register is the only valid choice for a mask operand.

Similar to before, the value 39 is constructed with addi instead of directly loading it with the pseudo-instruction li into another register because -9 fits into the immediate operand of the compressed c.addi instruction.
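
The table construction can be replayed step by step in C, with one loop per vector instruction. Again, this is a sketch: build_hex_table and TABLE_SIZE are hypothetical names, and the mask register is modelled as a plain byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TABLE_SIZE = 16 };

/* Fills table with the ASCII characters '0'..'9', 'a'..'f'. */
void build_hex_table(uint8_t table[TABLE_SIZE])
{
    uint8_t mask[TABLE_SIZE];
    uint8_t offset = 48;                        /* li a7, 48, i.e. '0' */

    for (size_t i = 0; i < TABLE_SIZE; ++i)
        table[i] = (uint8_t)i;                  /* vid.v v8 */
    for (size_t i = 0; i < TABLE_SIZE; ++i)
        mask[i] = table[i] > 9;                 /* vmsgtu.vi v0, v8, 9 */
    for (size_t i = 0; i < TABLE_SIZE; ++i)
        table[i] = (uint8_t)(table[i] + offset); /* vadd.vx v8, v8, a7 */

    offset = (uint8_t)(offset - 9);             /* addi a7, a7, -9 */
    assert(offset == 'a' - '0' - 10);           /* i.e. 39 */

    for (size_t i = 0; i < TABLE_SIZE; ++i)
        if (mask[i])                            /* v0.t: masked add */
            table[i] = (uint8_t)(table[i] + offset);
}
```

After the first add, the entries 10 to 15 hold the characters directly following '9' in the ASCII table; the masked second add moves just those six entries up to 'a' through 'f'.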

Storing the Result

vsb.v v24, (a0)           # write result to dst
# --> a0[0] = v24[0], a0[1] = v24[1], ..., a0[vlenb-1] = v24[vlenb-1], ...,
#     a0[vlenb*7] = v31[0],           ..., a0[t4-1] = v31[vlenb-1]
# --> a0[0..t4-1] = [ 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n' ]
add  a0, a0, a3           # increment dst by the number of written bytes
bnez a2, .Loop            # branch to loop head if not equal to zero
ret

The loop and the function are left once the complete input buffer is processed. Note that while the syntax of most RISC-V instructions follows the destination-source order, store instructions have this order inverted.

Concluding Remarks

The RISC-V "V" vector extension ISA is sufficiently diverse as it contains useful bit and byte-shuffling instructions, instructions that allow the masking of elements and instructions implementing operations that are useful for string processing such as element gathering and widening.

The available instructions in combination with the vector length agnostic (VLA) design lead to compact code. For example, each iteration of the main loop just executes 14 instructions and there is no extra code necessary to deal with trailing bytes.

The resulting throughput is excellent, i.e. the resulting binary code automatically utilizes the complete vector register size on each CPU, be it low or high-end. In addition, the grouping of vector registers increases the throughput since there are many registers available. For example, on a CPU with 128 bit vector registers, the loop has a throughput of 9 digits per instruction.
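
The figure of 9 digits per instruction follows from simple arithmetic, which the following C sketch spells out. The function names are hypothetical, the instruction count of 14 is the one stated above for one loop iteration, and the result is a static count per executed instruction, not a statement about cycles.

```c
#include <assert.h>
#include <stddef.h>

enum { LOOP_INSTRUCTIONS = 14 };  /* per iteration of the main loop */

/* Digits written by one full loop iteration: e16 with m8 yields
 * 8 * VLEN / 16 elements, i.e. input bytes, and each input byte is
 * converted into two digits. */
size_t digits_per_iteration(size_t vlen_bits)
{
    assert(vlen_bits % 16 == 0);
    size_t input_bytes = 8 * vlen_bits / 16;
    return 2 * input_bytes;
}

/* Digits per executed instruction, rounded down. */
size_t digits_per_instruction(size_t vlen_bits)
{
    return digits_per_iteration(vlen_bits) / LOOP_INSTRUCTIONS;
}
```

For 128 bit registers this gives 128 digits per iteration and thus 128 / 14, i.e. a bit more than 9 digits per instruction. Since the instruction count per iteration stays the same, the figure scales with the register width.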

Since each regular RISC-V instruction encodes into 4 bytes, the density of the assembled binary code is also good. For example, the presented bcd2ascii function has a size of 96 bytes. When enabling the "C" compressed instructions extension during assembling (such that certain instructions can be replaced by compressed 2 byte versions), the size drops by about 20 per cent to 76 bytes. That is fine, especially given that most instructions of that function are vector ones and there are no compressed variants of the vector instructions.

This can be contrasted with x86-64, where for example the SSSE3 shuffle instruction encodes into 5 bytes and some moves encode into 7 bytes. Plus of course, the vector length is fixed to 128 bit when using SSSE3 as the lowest common denominator SIMD ISA.

Emitting compressed RISC-V instructions is mostly transparent to the assembly programmer; one just has to set an assembler command line option. But of course, since compressed instructions implement compromises (otherwise, why wouldn't all instructions be compressed?!), the programmer has to take care to write instructions in a way such that they are compressible, where possible. For example, some compressed instructions only work on a register subset (e.g. s0..s1, a0..a5), one source operand is implicit, there are fewer bits for an immediate operand, there is just one variant that sign-extends the immediate etc.

See also my github repository that contains the complete example code.

Getting Started

Since, as of early 2020, the "V" vector extension still has draft status and version 0.8 was released only recently, support for it isn't widely available. That means there is no hardware with a RISC-V "V" CPU available, and some well-known RISC-V emulators such as QEMU either don't support the "V" extension or just support an older version of it. Similarly, support for "V" version 0.8 in the standard development toolchain (binutils, gcc) is available, but not yet upstreamed. Meaning that one has to hunt down repositories, identify the right branches and compile those with the right flags, instead of just being able to use distro packages.

Another pitfall is that the "V" extension (similar to the "F" and "D" floating point extensions) has to be enabled in the running system by setting a field in a status register. Since that status register can only be accessed in machine or supervisor mode, one also needs kernel support for the "V" extension.

This section details how to build the different components required for a RISC-V "V" 0.8 toolchain and an emulator.

Spike

The Spike RISC-V emulator does have "V" version 0.8 support. As of early 2020, there is one other emulator with "V" 0.8 support but it isn't open source.

Building Spike is straightforward:

sudo dnf install dtc  # i.e. device-tree-compiler
git clone https://github.com/riscv/riscv-isa-sim.git --depth 1
cd riscv-isa-sim
mkdir build
cd build
../configure --prefix=$HOME/local/riscvv08/spike
make
make install

Of course, the --depth 1 switch is optional; it just saves some disk space.

Make sure to have a fresh clone that has "V" support fixed.

By default, Spike enables the RV64IMAFDC ISAs, but this default can be changed at runtime (or even at configure time), for example by calling spike like this:

spike --isa=RV64IMAFDCV ...
spike --isa=RV64gcV     ...    # equivalent

For executing user-space programs such as our example, spike needs the Proxy-Kernel (pk).

GNU Toolchain

Technically, binutils with "V" extension support is sufficient to assemble our example. However, building the Proxy-Kernel requires the full GNU toolchain.

git clone https://github.com/riscv/riscv-gnu-toolchain.git --branch rvv-0.8.x \
          --single-branch --depth 1 riscv-gnu-toolchain_rvv-0.8.x
cd riscv-gnu-toolchain_rvv-0.8.x
git submodule update --init --recursive --depth 1 riscv-binutils riscv-gcc \
                        riscv-glibc riscv-dejagnu riscv-newlib riscv-gdb
mkdir build
cd build
../configure --prefix=$HOME/local/riscvv08/gnu --enable-multilib
make
make install

The explicit submodule update is done like this to skip the optional QEMU module. Besides QEMU not supporting the "V" extension, it would also require a deeper clone, take up some disk space and waste some compile time.

Note that the make install step is superfluous because the previous make call already installs everything.

Proxy-Kernel

The RISC-V Proxy-Kernel (pk) implements enough to get a user-space program in Spike running, i.e. including setting up some status registers in machine-mode, switching to user-mode and implementing some syscalls. That means that calling the write syscall to write to stdout then just works in Spike and the text is printed to the console.

The pk needs to be cross-compiled with the GNU Toolchain (see previous Section).

git clone --depth 1 https://github.com/riscv/riscv-pk.git
cd riscv-pk
mkdir build
cd build
PATH=$HOME/local/riscvv08/gnu/bin:$PATH ../configure --prefix=$HOME/local/riscvv08/pk \
                                                     --host=riscv64-unknown-elf
PATH=$HOME/local/riscvv08/gnu/bin:$PATH make
PATH=$HOME/local/riscvv08/gnu/bin:$PATH make install

Again make sure to get a recent pk clone with fixed "V" support.

Binutils

If you already have the GNU Toolchain you can skip this (as it already contains the binutils with "V" support). This is just relevant if you have obtained the Proxy-Kernel with "V" support in binary form and want to skip building the GNU Toolchain.

git clone https://github.com/riscv/riscv-binutils-gdb.git --branch rvv-0.8.x \
          --single-branch --depth 1 riscv-binutils-gdb_rvv-0.8.x
cd riscv-binutils-gdb_rvv-0.8.x
mkdir build
cd build
../configure --prefix=$HOME/local/riscvv08/binutils --target riscv64-unknown-elf \
             --enable-multilib
make
make install

Assembling

Finally, to actually execute our example, a small test program is needed that calls the bcd2ascii() function with some sample input and prints the results. If the complete GNU toolchain is available the simplest thing is to write that part in C, e.g.:

#include <stddef.h>
#include <stdio.h>

void bcd2ascii(void* dst, const void* src, size_t n);

static const unsigned char inp[] = {
    0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef,
    0xfe, 0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10,
    0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef,
    0xfe, 0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10
};

int main(void)
{
    char out[sizeof inp * 2 + 1] = {0};
    // expected output:
    // out = { '0', '1', '2', '3', ... }

    bcd2ascii(out, inp, sizeof inp);
    puts(out);
    return 0;
}

Everything can then be cross-assembled, cross-compiled and linked with:

~/local/riscvv08/gnu/bin/riscv64-unknown-elf-as -march=rv64gcv -o bcd2ascii.o bcd2ascii.s
~/local/riscvv08/gnu/bin/riscv64-unknown-elf-gcc -Wall  main_bcd2a.c -o bcd2a bcd2ascii.o

Supplying just -march=rv64gv disables the use of compressed instructions.

Alternatively, with just the cross binutils and without a C cross compiler, we need an assembly test program such as:

    .text                     # Start text section
    .balign 4                 # align 4 byte instructions by 4 bytes
    .global _start            # global
_start:
                              # check if vector extension is enabled
                              # user-mode doesn't have privileges to
                              # read mstatus/sstatus/misa CSRs
                              # thus, unclear how to check for V support
    li    t1, 0x1800000       # disable this check for now
    #csrr  t1, mstatus        # control and status register, i.e. read the
                              # mstatus register
    li    t2, 0b11            # load immediate mask
    slli  t2, t2, 23          # shift left logical immediate by 23 bits
                              # because "V" draft 0.8 defines the vector
                              # context status field VS as mstatus[24:23]
                              # (0b00 -> off, 0b01 -> initial, 0b10 -> clean,
                              #  0b11 -> dirty)
    and   t3, t1, t2
    beqz  t3, v_disabled_error

                              # Prepare calling bcd2ascii()
    addi  sp, sp, -68         # grow stack by 64+4 bytes, some additional
                              # space but keep it 4 byte aligned
    mv    a0, sp              # store output on stack
    lui   a1, %hi(inp)        # load start address of
    addi  a1, a1, %lo(inp)    # the input string
    li    a2, 32              # load immediate: sizeof inp
    call  bcd2ascii           # we don't need to save/restore our
                              # return address because we don't return ...
    li    t0, 0xa             # load immediate: newline
    sb    t0, 64(sp)          # store byte
                              # i.e. terminate output string with '\n'
    li    a0, 1               # stdout
    mv    a1, sp              # read output located on the stack
    li    a2, 65              # i.e. 64+1 characters
    li    a7, 64              # write syscall number
    ecall                     # call write(2)

    li    a0, 0               # set exit status to zero
exit:
    li    a7, 93              # exit syscall number
    ecall                     # call exit(2)
1:
    j     1b                  # loop forever in case exit failed ...

v_disabled_error:
    li    a0, 2               # stderr
    lui   a1, %hi(err_msg)    # load error message start address
    addi  a1, a1, %lo(err_msg)
    lui   a2, %hi(err_msg_size)     # load error message size
    addi  a2, a2, %lo(err_msg_size)
    li    a7, 64              # write syscall number
    ecall                     # call write(2)
    li    a0, 1               # load immediate exit argument
    j     exit


    .section .rodata          # Start read-only data section
    .balign 4                 # align to 4 bytes
inp:
    .byte 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef
    .byte 0xfe, 0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10
    .byte 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef
    .byte 0xfe, 0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10
err_msg:
    .string "ERROR: RISC-V 'V' vector extension is disabled!\n"
    .set err_msg_size, . - err_msg
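
The bit manipulation of the VS check in the start code above may be easier to follow in C. This is a sketch that relies on the field position given in the comments there, i.e. VS as mstatus[24:23] in draft 0.8; the names mstatus_vs, vector_enabled and the enum are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* States of the vector context status field VS. */
enum vs_state { VS_OFF = 0, VS_INITIAL = 1, VS_CLEAN = 2, VS_DIRTY = 3 };

/* Extracts the VS field, assuming it is located at mstatus[24:23]. */
enum vs_state mstatus_vs(uint64_t mstatus)
{
    unsigned vs = (unsigned)((mstatus >> 23) & 0x3u);
    assert(vs <= VS_DIRTY);
    return (enum vs_state)vs;
}

/* Non-zero if the vector extension is enabled, i.e. VS isn't off. */
int vector_enabled(uint64_t mstatus)
{
    return mstatus_vs(mstatus) != VS_OFF;
}
```

The placeholder value 0x1800000 that the start code loads instead of reading the register has both VS bits set, so the check always passes there.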

Cross-assembling and linking everything:

~/local/riscvv08/binutils/riscv64-unknown-elf/bin/as -march=rv64gcv -o bcd2ascii.o bcd2ascii.s
~/local/riscvv08/binutils/riscv64-unknown-elf/bin/as -march=rv64gcv -o start_bcd2a.o start_bcd2a.s
~/local/riscvv08/binutils/riscv64-unknown-elf/bin/ld start_bcd2a.o bcd2ascii.o -o bcd2a

Of course, my repository also contains a makefile to simplify building the example.

Emulating

Example emulating session:

$ ~/local/riscvv08/spike/bin/spike --isa=RV64gcV \
        ~/local/riscvv08/pk/riscv64-unknown-elf/bin/pk bcd2a
bbl loader
0123456789abcdeffedcba98765432100123456789abcdeffedcba9876543210

Spike also has an interactive mode that allows stepping through the instructions, inspecting registers etc. For example:

$ ~/local/riscvv08/spike/bin/spike -d --isa=RV64gcV \
        ~/local/riscvv08/pk/riscv64-unknown-elf/bin/pk bcd2a
: until pc 0 100e2
bbl loader
: vreg 0 8
VLEN=128 bits; ELEN=32 bits
v8  : [3]: 0x00000000  [2]: 0x020ae6a0  [1]: 0x00000000  [0]: 0x020ae630
:
core   0: 0x00000000000100e2 (0x5208a457) vid.v   v8
: vreg 0 8
VLEN=128 bits; ELEN=32 bits
v8  : [3]: 0x0f0e0d0c  [2]: 0x0b0a0908  [1]: 0x07060504  [0]: 0x03020100
: q

In comparison with GDB, the interactive prompt is a bit spartan and doesn't really report syntax errors in the interactive commands, but it's sufficient. The help can be displayed with h, <ENTER> steps to the next instruction and q quits.

The address 100e2 in the above example session comes from the disassembled bcd2a executable (i.e. using objdump).