Decreasing latency

I’m also seeing high latency for read/write ops. It seems to average about 1.15 ms per read and 1.21 ms per write (see the Python cProfile lines below; there were 3033 reads and 1014 writes in a roughly 5-second run).

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3033    0.059    0.000    3.450    0.001 remote_object.py:71(get_value)
     1014    0.018    0.000    1.237    0.001 remote_object.py:75(set_value)

For every loop, we write 2 values (axis{0,1}.vel_setpoint) and read 6 values (axis{0,1}.{Iq_setpoint, pos_estimate, pll_vel}). That limits our maximum update rate to about 100 Hz, which makes it hard to have any kind of tight feedback loop in software.
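
For reference, one loop iteration looks roughly like the sketch below; the full attribute paths are assumed from the ~0.4.x firmware tree (the post abbreviates them) and may differ on other versions.

# Sketch of one loop iteration as described above. Full attribute paths are
# assumptions based on ~0.4.x firmware and may differ on other versions.
import odrive

odrv = odrive.find_any()

def loop_once(vel0, vel1):
    # 2 writes
    odrv.axis0.controller.vel_setpoint = vel0
    odrv.axis1.controller.vel_setpoint = vel1
    # 6 reads
    state = []
    for axis in (odrv.axis0, odrv.axis1):
        state.append(axis.motor.current_control.Iq_setpoint)
        state.append(axis.encoder.pos_estimate)
        state.append(axis.encoder.pll_vel)
    return state

# Each access above is a blocking USB round trip:
# 2 * 1.21 ms + 6 * 1.15 ms ≈ 9.3 ms per iteration, i.e. a bit over 100 Hz.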

A “set all” and “read all” command might be a better way to do the API.

This is on an NVIDIA TX2 board with six 64-bit ARM cores.

I upgraded the ODrive firmware from 0.4.2 to 0.4.6, and the read/write times got worse by about 1.5x. Now 1.77 ms for read, 1.93 ms for write.

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1941    0.059    0.000    3.439    0.002 remote_object.py:71(get_value)
     650    0.018    0.000    1.256    0.002 remote_object.py:75(set_value)

So we recently discussed this issue on the ODrive Discord. Here is sort of a summary:

Full Speed USB has a frame period of 1 ms. Interrupt transfers and isochronous transfers are initiated once per frame, with guaranteed latency. However, we use bulk transfers, which use up the “unallocated” bandwidth when the bus is otherwise idle. So the transfers may be initiated at any time, but there are no guarantees.
We found (at least on the machines we tested) that the bulk transfer had the lowest latency, because of this “unscheduled” attribute. Specifically: “Data are transferred in the same manner as in Interrupt Transfers, but have no defined polling rate. Bulk Transfers take up all the bandwidth that is available after the other transfers have finished. If the bus is very busy, then a Bulk Transfer may be delayed.”

So if there is no other traffic on the bus, we get better than 1 ms latency; but if there is other heavy traffic on the bus, or if the host doesn’t really bother being timely about the bulk transfers, then there will be more latency.
I am pretty sure that the latency on the ODrive side is extremely short, on the order of 10-100 microseconds.

On my Windows 10 machine I benchmarked a sustained full cycle time of 250 microseconds for one request out and one response back (2 packets). So around 4000 full request-responses per second.

So if you get much worse latency than this, I would check:

  • Is there other traffic on the bus?
  • Does your platform not bother to be realtime about bulk transfers?

In the future we may upgrade to USB High Speed (USB 2.0). This has 125-microsecond micro-frames, which would let us guarantee 125-microsecond latency with interrupt transfers.

I went ahead and benchmarked this using a timing script that was provided by @moorage. Results as follows:

That’s ~0.3 ms consistent performance, on a Windows 10 desktop with an i5.

@Wetmelon glad to see you got good results. We’re running Ubuntu 16.04 on an NVIDIA TX2, and the ODrive is the only thing hooked up to USB. I wonder if it’s an Ubuntu thing or an Ubuntu+arm64 thing…

Code provided here:
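
(The linked script isn’t reproduced in this thread. As an illustration only, not the actual script, a minimal round-trip timing loop along the same lines could look like this:)

# Illustrative sketch only: time blocking read round trips to the ODrive.
import time
import odrive

odrv = odrive.find_any()
N = 1000

start = time.perf_counter()
for _ in range(N):
    _ = odrv.vbus_voltage          # one USB request/response per read
elapsed = time.perf_counter() - start

print("mean round-trip time: %.1f us" % (elapsed / N * 1e6))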

I would say trying to reduce the USB packet latency is a workaround at best.
The root problem here is that for a single variable write we do a complete round-trip on USB.

This can be fixed by implementing the following two things (in combination):

  • make remote_endpoint_operation asynchronous or “fire-and-forget”
  • transmit multiple commands in the same USB frame. As a temporary solution this can be achieved by connecting to the ODrive via the COM-port interface (see the sketch below).
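
For the COM-port route, something like the sketch below could batch several commands into one serial write so they go out back-to-back. This is only a sketch: it assumes pyserial, a /dev/ttyACM0 device node, and that the firmware’s ASCII protocol accepts ‘v’ (velocity setpoint) and ‘f’ (feedback request) commands; check the ASCII protocol docs for your firmware version before relying on it.

# Hedged sketch: batch ASCII-protocol commands into a single serial write.
# Assumes pyserial and the 'v'/'f' ASCII commands; verify against your firmware.
import serial

ser = serial.Serial('/dev/ttyACM0', 115200, timeout=0.01)

def loop_once(vel0, vel1):
    # Two velocity setpoints plus one feedback request, sent in one write
    cmd = 'v 0 {:.3f}\nv 1 {:.3f}\nf 0\n'.format(vel0, vel1)
    ser.write(cmd.encode('ascii'))
    return ser.readline()   # reply to 'f 0', e.g. b'<pos> <vel>\n'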

So I went ahead and implemented a simple futures mechanism using some of the classes included in fibre. There were two aims at this point:

  1. Make property writes fire-and-forget. In my case this roughly halved the time spent in a property write (from 390μs to ~180μs), as would be expected.
  2. Make property reads optionally return a future instead of waiting for the actual value to be retrieved from the ODrive. To measure whether this had any effect on latency, I stacked a bunch of read commands and stored their returned futures in an array, and then went on and read them one by one. The resulting latency was slightly increased, from 390μs to 420μs (probably due to the futures mechanism overhead).

(partial) test code:

import sys
import inspect
from timeit import Timer

import odrive
import unittest
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')

odrv = None
REPEAT = 200
SAMPLES = 64


def time_read_current_limit(samples=SAMPLES):
    futures = []
    for _ in range(samples):
        futures.append(odrv.axis0.motor.config.current_lim_future) # just some property
    for f in futures:
        f.get_value()


class TestTiming(unittest.TestCase):

    def setUp(self):
        global odrv
        odrv = odrive.find_any()

    def runTest(self):
        fset = [f for name, f in inspect.getmembers(
            sys.modules[__name__]) if inspect.isfunction(f)]
        for f in fset:
            fn = f.__name__
            if fn.startswith("time"):
                logging.info("Testing "+fn)
                per_call = Timer(f).timeit(number=REPEAT) / REPEAT / SAMPLES
                logging.info(per_call)

As seen above, the futures functionality is currently implemented as property aliases; to get a property to return a future, you append the string “_future” to its name, e.g. odrv0.vbus_voltage_future will give you a Future instance which you can query for the value using get_value(). I’m not sure if this is the ideal way to do this, but that’s what I ended up with.
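
For reference, plain usage of the alias scheme described above (both property names appear elsewhere in this thread) looks like:

# Issue several reads up front, then collect the values later.
f1 = odrv0.vbus_voltage_future                      # returns a Future immediately
f2 = odrv0.axis0.motor.config.current_lim_future    # nested properties work the same way
# ... do other work or queue more requests ...
vbus = f1.get_value()                               # blocks until the reply arrives
lim = f2.get_value()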

If anyone wishes to take a look, the code is here: https://github.com/yconst/ODrive/tree/futures

What’s missing here is obviously a mechanism to defer sending packets and queue them up so that they can be sent in one go over USB. Perhaps that would give a measurable drop in latency.

The tests mentioned above were performed on a MacBook Pro with the ODrive connected directly to the computer’s USB port. Tests with the Raspberry Pi are pending.

I would love to hear your ideas.

I really like the concept! Now code that doesn’t need the response right away can be much more efficient. I also see how it could be a very neat way to interface to an underlying batching mechanism that uses the USB bandwidth better with larger packets, if/when we add that.

I don’t think I have the bandwidth (pun intended) to have a look at what you have made at the moment, but I’ve added it to the roadmap to take a look at later.

Thanks for your work!

@madcowswe thanks for the reply. I’ve also tried creating new endpoints for dual motor control etc., to have more than one command sent in one go. I eventually found that the most limiting factor seems to be packet size, which seems to top out at 64 bits of payload in my experiments, i.e. transferring anything more than two int32s seems to trigger two USB transactions, and this takes almost double the time.

I am not well versed enough in USB to understand all the details, but it definitely seems to me that it would make sense to increase the maximum packet size, if possible.

Looking forward to your thoughts when you have the time.

If I remember correctly the packets can be pretty big. The longest packet I recorded coming from the ODrive is when it sends configuration data:

5b7b226e616d65223a22222c226964223a302c2274797065223a226a736f = 60 bytes + 4 sequence number bytes.

Since an int32 is 4 bytes, you should be able to send up to 15 values in a packet (assuming you don’t send a CRC or the endpoint address).
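
A quick sanity check on that arithmetic (plain Python, nothing ODrive-specific):

import struct

# 15 little-endian int32 values occupy 60 bytes, leaving 4 bytes of a
# 64-byte packet for the sequence number mentioned above.
payload = struct.pack('<15i', *range(15))
assert len(payload) == 60
assert len(payload) + 4 == 64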

Skim through my post about working with the packets in Java - Odrive and Java

@Riewert interesting. Have you done any timing measurements sending/receiving packets?

About 200 us - 400 us per packet one-way, but I had quite a few incidental delays that bumped the average for a round trip up to 1.5 ms ± 0.5 ms. Probably due to thread priority or related issues on my PC; the ODrive was the only connected USB device aside from my mouse.

Yes, the underlying USB Full Speed link uses 64-byte packets. The current implementation sends each transfer in its own packet; an obvious optimisation would be to bundle multiple requests/transfers.

Hey, I’ve been working on Doggo, but unfortunately the ODrive was communicating too slowly for us as well. We were unable to close the loop and do torque control; USB was too slow for us since we would only be able to communicate at ~250 Hz. Previously, for each loop we had to give a current command for each axis and request encoder positions and velocities for each axis (and we have 4 ODrives onboard).

After digging into the firmware a little bit, we added an endpoint function that takes in both current commands packaged into a struct and returns all four encoder positions/velocities as a struct. This allowed us to use more of the large 64-byte packet. The other thing we did was thread all the USB requests. We were unsure how much this would help, but with the threading and packaging of the commands we saw a huge speed-up, with communication rates >1500 Hz. This resulted in a visible improvement in our closed-loop torque control from an external board. If you would like to see our fork of ODrive it is here. I still haven’t tested it with the Python communication library yet, since we have been using a C++ library to communicate with the ODrives.
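
The post doesn’t show the struct layouts; as a rough illustration only (field counts and types are assumptions, not the actual Doggo fork), a per-ODrive command/feedback pair fits easily inside one 64-byte packet:

import struct

# Assumed layouts, per ODrive (see the Doggo fork for the real definitions):
#   command:  2 x float32 current setpoints (axis0, axis1)   ->  8 bytes
#   feedback: 2 x (float32 position, float32 velocity)       -> 16 bytes
CMD_FMT = '<2f'
FB_FMT = '<4f'

cmd = struct.pack(CMD_FMT, 1.5, -1.5)
print(len(cmd), struct.calcsize(FB_FMT))   # 8 16 -- both well under 64 bytes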

That is really nice. I was looking into this exact same issue/feature! Will use your code, very useful, thanks for sharing.

If you get time, I’d love if you could add the USB handling code in a pull request back to the main repository.

Can you explain this a little more clearly? Also, looking at your code, did you succeed in reducing the number of messages necessary per function call from 6 total to 2? I did something similar, putting everything into a single function call, but it is still pretty slow, so I’m guessing a lot of the speed-up is from your C++ library? - Odrive and Java

This looks great, and is a good inspiration for others who need to do the same.

One day we will hopefully have a transfer-coalescing feature and a subscription feature, which should achieve the same thing in a more dynamic way. Also, of course, a concurrent-communication feature to allow talking to multiple ODrives in parallel to reduce overall latency.
One day we’ll get there :wink:

Yes, we needed 6 messages before, but 4 of them required responses from the ODrive, which takes a long time. With the way the function call works, there are actually 3 required messages, but 2 of them can be fire-and-forget, which definitely helps. This means the function call only requires one response from the ODrive. You are correct, though, that the bulk of the speedup comes from the threading of the requests. I was planning on submitting a pull request and getting the new endpoints mainlined into the ODrive repo.

The C++ library was developed by another member of our team, but it’s still kind of messy and a work in progress. It uses libusb, and it works well enough for our purposes so far, but it can definitely be improved. All the threading was on the computer side with this C++ library, and the testing repo/C++ library can be found here. I would like to implement a thread pool so that threads are not created and destroyed every loop, but that’s something I’ll do in the near future.
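
In Python terms (the Doggo code itself is C++), a persistent pool along those lines might look like the sketch below; the serial-number strings are placeholders and find_any’s serial_number filter is an assumption about the odrive Python package.

# Sketch: keep a fixed worker pool so the per-ODrive USB round trips overlap
# and threads are reused rather than created/destroyed every loop.
import odrive
from concurrent.futures import ThreadPoolExecutor

SERIALS = ['SERIAL0', 'SERIAL1', 'SERIAL2', 'SERIAL3']   # placeholder serial numbers
odrives = [odrive.find_any(serial_number=s) for s in SERIALS]
pool = ThreadPoolExecutor(max_workers=len(odrives))

def query(odrv):
    # Stand-in for the combined command/feedback endpoint described above.
    return odrv.axis0.encoder.pos_estimate

def loop_once():
    futures = [pool.submit(query, o) for o in odrives]
    return [f.result() for f in futures]   # four round trips run concurrently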

I don’t mean to revive an old thread if there is a newer one out there (I couldn’t find one), but has anyone done any latency testing for the other protocols (CAN, I2C, SPI, and UART)?

On a similar note, is there any documentation on max bus speeds, or should I just follow the STM32F4 reference manual?
