FT600 / FT601 Performance: Throughput and roundtrip time

Experiments

My setup for the experiments is a custom FT601Q PCB with an Intel MAX 10 FPGA, connected through a normal 60 cm USB cable into a USB hub and from there through a 2 m USB extension cord to a Windows 11 PC. This puts roughly 3 m of cable plus a hub between the PC and the PCB, which should make for a sufficiently challenging setup. Signals are logged with a logic analyzer.

Code for the software is posted with the experiments below; also check out the official “Data Streamer Demo App”.

Experiment setup

One-way throughput PC->FTDI

Writing data from PC to FTDI is quite simple on the FPGA side, because we can just set RD_N and OE_N low and the FTDI will shuffle data out as fast as it can. On the software side, we can use blocking or non-blocking API calls.

Code:

// h is the open FT_HANDLE; buffer/BUFFER_SIZE hold the payload, written receives the byte count
for (int i = 0; i < 1000; ++i) {
    // Blocking write to the OUT pipe 0x02; returns once the transfer has completed
    FT_STATUS status = FT_WritePipe(h, 0x02, (unsigned char*)buffer, BUFFER_SIZE, &written, nullptr);
    if (status != FT_OK || written != BUFFER_SIZE) {
        std::cerr << "error FT_WritePipe\n";
        return false;
    }
}
Full Gist

Comparison of throughput using blocking and non-blocking API calls for different transfer sizes.

Waveforms using blocking API calls

1024 bytes / transfer. 3 MB/s
1024 x 8 bytes / transfer. 25 MB/s
1024 x 32 bytes / transfer. 102 MB/s
1024 x 64 bytes / transfer. 146 MB/s
1024 x 1024 bytes / transfer. 327 MB/s

Waveforms using non-blocking API calls (quad-buffering)

1024 x 64 bytes / transfer. 205 MB/s
1024 x 1024 bytes / transfer. 366 MB/s

Some waveforms for the different transfer setups are behind the spoilers above. For the smaller transfer sizes - where transfer size refers to the buffer size passed to an individual call to FT_WritePipe() - we see relatively long gaps (~250 µs) between chunks of data arriving at the FPGA. On top of that, after every 4 KB, there is a small pause (see below) in the data output, likely because the FIFO inside of the FTDI is 4 KB in size.

We can optimize the throughput by reducing the impact of the ~250 µs that every API call seems to take at minimum: with the non-blocking API, further transfers can be submitted to the queue before the previous one has finished. Additionally, it helps to choose a large buffer size for each API call. The “saturated” waveform at full capacity is shown below, where the FTDI outputs data for 10.24 µs (1024 cycles -> 4 KB) and then needs around 460 ns before the next burst starts. This leads to a maximum throughput of $ \frac{4~\mathrm{KB}}{10.24~\mathrm{µs} + 460~\mathrm{ns}} = 374~\mathrm{MB/s} $.

Saturated traffic pattern for PC->FTDI.
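
To take advantage of this, transfers can be queued using the overlapped variant of the D3XX API. The following is only a minimal sketch of such a queued write loop, not the code used for the measurements above; the function name streamOut, QUEUE_DEPTH and the buffer contents are illustrative choices.

// Sketch of queued (non-blocking) writes using the D3XX overlapped API.
// Assumes an already opened FT_HANDLE h, the OUT pipe 0x02 and
// totalTransfers >= QUEUE_DEPTH.
#include <windows.h>
#include <vector>
#include "FTD3XX.h"

constexpr ULONG BUFFER_SIZE = 1024 * 1024;
constexpr int   QUEUE_DEPTH = 4;   // "quad-buffering"

bool streamOut(FT_HANDLE h, int totalTransfers) {
    std::vector<std::vector<unsigned char>> buffers(
        QUEUE_DEPTH, std::vector<unsigned char>(BUFFER_SIZE, 0xAA));
    OVERLAPPED ov[QUEUE_DEPTH] = {};

    // One OVERLAPPED structure per queued transfer
    for (int i = 0; i < QUEUE_DEPTH; ++i)
        FT_InitializeOverlapped(h, &ov[i]);

    // Prime the queue: submit QUEUE_DEPTH writes without waiting for completion
    for (int i = 0; i < QUEUE_DEPTH; ++i) {
        ULONG written = 0;
        FT_STATUS st = FT_WritePipe(h, 0x02, buffers[i].data(), BUFFER_SIZE, &written, &ov[i]);
        if (st != FT_IO_PENDING && st != FT_OK)
            return false;
    }

    // Steady state: wait for the oldest transfer, then reuse its slot for the next one
    for (int i = QUEUE_DEPTH; i < totalTransfers; ++i) {
        int slot = i % QUEUE_DEPTH;
        ULONG written = 0;
        if (FT_GetOverlappedResult(h, &ov[slot], &written, TRUE) != FT_OK || written != BUFFER_SIZE)
            return false;
        FT_STATUS st = FT_WritePipe(h, 0x02, buffers[slot].data(), BUFFER_SIZE, &written, &ov[slot]);
        if (st != FT_IO_PENDING && st != FT_OK)
            return false;
    }

    // Drain the transfers still in flight and release the OVERLAPPED structures
    for (int i = 0; i < QUEUE_DEPTH; ++i) {
        ULONG written = 0;
        FT_GetOverlappedResult(h, &ov[i], &written, TRUE);
        FT_ReleaseOverlapped(h, &ov[i]);
    }
    return true;
}

The important part is that a new FT_WritePipe() is already queued while the previous one is still in flight, so the ~250 µs per-call overhead is hidden behind the ongoing transfer.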

One-way throughput FTDI->PC

One-way throughput from FTDI to PC shows behavior similar to the other direction, i.e. performance increases with larger buffers and with the non-blocking API. However, the saturated traffic pattern (at least in my case) is slightly less regular, and the time between consecutive 4 KB fillings of the FIFO is slightly longer than above, leading to a slightly lower maximum throughput of 346 MB/s.

Saturated traffic pattern for FTDI->PC. Slightly more irregular traffic and longer pauses between FIFO fills reduce the overall throughput.
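
For completeness, the blocking read loop on the host side is essentially the mirror image of the write loop above. This is only a sketch; the IN pipe ID 0x82 is an assumption that depends on the chip configuration.

std::vector<unsigned char> buffer(BUFFER_SIZE);
ULONG received = 0;
for (int i = 0; i < 1000; ++i) {
    // Blocking bulk read from the FT60x IN pipe (pipe ID assumed to be 0x82)
    FT_STATUS status = FT_ReadPipe(h, 0x82, buffer.data(), BUFFER_SIZE, &received, nullptr);
    if (status != FT_OK || received != BUFFER_SIZE) {
        std::cerr << "error FT_ReadPipe\n";
        return false;
    }
}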

Two-way throughput

For two-way throughput, I’m not listing values here because the results depend heavily on how often the bus direction is switched, how many cycles are allocated to turn the bus around, and how large the chunks of data in each direction are.

Roundtrip time PC->FTDI->PC

This test first writes 20 bytes from PC to FTDI, then receives 20 bytes from the FTDI. There are two variants of this on the software side. The first uses the naive approach of FT_WritePipe() followed by FT_ReadPipe(), such that the read request is only submitted once the write request is done. The second variant pre-loads an overlapped FT_ReadPipe() before dispatching the write, such that the read pipe is already open once the write completes.

Code:

// GetTicks() and freq are the high-resolution timer helpers (see the full gist)
for (int i = 0; i < 100; ++i) {
    int64_t ticks_start = GetTicks();
    if (!testOverlapped(h))      // variant with the pre-loaded overlapped FT_ReadPipe()
        return 3;
    int64_t ticks_stop = GetTicks();

    timeOverlapped += (ticks_stop - ticks_start) / (1.0 * freq.QuadPart);

    ticks_start = GetTicks();
    if (!testNormal(h))          // naive FT_WritePipe() followed by FT_ReadPipe()
        return 4;
    ticks_stop = GetTicks();

    timeNormal += (ticks_stop - ticks_start) / (1.0 * freq.QuadPart);
}
FT_Close(h);

// Average over the 100 iterations
timeOverlapped /= 100;
timeNormal /= 100;

std::cout << "Average overlapped time: " << timeOverlapped << " s. Average normal time: " << timeNormal << " s.\n";
Full Gist
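
The two variants themselves could look roughly like the sketch below. This is not the gist code; the pipe IDs 0x02 (OUT) and 0x82 (IN) and the 20-byte buffers follow the description above but are otherwise assumptions.

constexpr ULONG RT_SIZE = 20;

// Naive variant: the read request is only submitted after the write has completed
bool testNormal(FT_HANDLE h) {
    unsigned char tx[RT_SIZE] = {}, rx[RT_SIZE] = {};
    ULONG written = 0, received = 0;
    if (FT_WritePipe(h, 0x02, tx, RT_SIZE, &written, nullptr) != FT_OK)
        return false;
    return FT_ReadPipe(h, 0x82, rx, RT_SIZE, &received, nullptr) == FT_OK;
}

// Overlapped variant: queue the read first, then write, then wait for the read result
bool testOverlapped(FT_HANDLE h) {
    unsigned char tx[RT_SIZE] = {}, rx[RT_SIZE] = {};
    ULONG written = 0, received = 0;
    OVERLAPPED ov = {};
    if (FT_InitializeOverlapped(h, &ov) != FT_OK)
        return false;
    FT_STATUS st = FT_ReadPipe(h, 0x82, rx, RT_SIZE, &received, &ov);
    if (st != FT_IO_PENDING && st != FT_OK) {
        FT_ReleaseOverlapped(h, &ov);
        return false;
    }
    if (FT_WritePipe(h, 0x02, tx, RT_SIZE, &written, nullptr) != FT_OK) {
        FT_ReleaseOverlapped(h, &ov);
        return false;
    }
    bool ok = FT_GetOverlappedResult(h, &ov, &received, TRUE) == FT_OK && received == RT_SIZE;
    FT_ReleaseOverlapped(h, &ov);
    return ok;
}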

Result:

Average overlapped time: 0.000517638 s. Average normal time: 0.000525985 s.

We reach around 500 µs roundtrip time, independent of which variant we use. This is likely because each driver call takes a couple hundred µs on its own, which is why the overlapped approach brings no benefit here. Note, however, that it does make a big difference in terms of throughput.

Waveform in overlapped mode. TXE_N goes low early because the ReadPipe is queued before the data is actually sent.

Waveform in normal mode.

Roundtrip time FTDI->PC->FTDI

This experiment covers scenarios where data is acquired on the FPGA and sent to the PC via the FT60X, some or most of the processing is performed on the PC, and, depending on the processing, results are sent back to the FTDI. So here we first receive a block of 1024 bytes via FT_ReadPipe() and afterwards send a response via FT_WritePipe(), measuring the time until that response arrives at the FPGA. Since the read completes before the write starts in this case, there is no need for the overlapped variant.
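
A minimal sketch of the host side of this sequence is shown below; the pipe IDs and the size of the response are assumptions, and the time until the response arrives is measured on the FPGA side with the logic analyzer.

constexpr ULONG BLOCK_SIZE = 1024;
constexpr ULONG RESP_SIZE  = 1024;   // response size: an illustrative choice

bool roundtripFromFpga(FT_HANDLE h) {
    unsigned char rx[BLOCK_SIZE], tx[RESP_SIZE] = {};
    ULONG received = 0, written = 0;

    // Wait (blocking) for the 1024-byte block that the FPGA pushes into the FIFO
    if (FT_ReadPipe(h, 0x82, rx, BLOCK_SIZE, &received, nullptr) != FT_OK || received != BLOCK_SIZE)
        return false;

    // ... process rx on the PC ...

    // Send the response back; the FPGA timestamps its arrival
    return FT_WritePipe(h, 0x02, tx, RESP_SIZE, &written, nullptr) == FT_OK && written == RESP_SIZE;
}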

Here, we achieve around 275 µs from the moment the FPGA starts sending out the 1024 bytes until the answer arrives back at the FPGA, which seems like a decently fast roundtrip time for a non-realtime system.
