Multiple problems after upgrading to fw-0.5.4

I upgraded my odrive (v3.6, 56V version) firmware from 0.5.1 (which had some problem talking to AS5047P encoders via SPI) to 0.5.4 (which seems to deal with them just fine.)

However I am now facing a litany of problems. In fact, almost every time I run a test I seem to hit a brand new, totally different problem, which has made systematically investigating anything rather hard.

Firstly, trying to set encoder mode to ENCODER_MODE_INCREMENTAL causes a constant MOTOR_ERROR_DRV_FAULT (on both axis0 and axis1, even though I’m only using axis0).
This isn’t a showstopper for me, since I don’t want to use incremental mode, and after encoder.config.mode = 257 everything works fine. But 0.5.1 didn’t have this issue.
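
For reference, 257 is the numeric value of the AMS absolute-SPI encoder mode. The sketch below maps the raw numbers to names. The values are as recalled from odrive.enums in the 0.5.x tools and should be treated as assumptions to check against your installed package; the EncoderMode class and describe_encoder_mode helper are hypothetical names used only for illustration.

```python
from enum import IntEnum


class EncoderMode(IntEnum):
    # Assumed values, recalled from odrive.enums (fw 0.5.x); verify locally.
    INCREMENTAL = 0
    HALL = 1
    SINCOS = 2
    SPI_ABS_CUI = 256
    SPI_ABS_AMS = 257   # AS5047P / AS5048A family
    SPI_ABS_AEAT = 258


def describe_encoder_mode(value: int) -> str:
    """Return a readable name for a raw encoder.config.mode value."""
    try:
        return EncoderMode(value).name
    except ValueError:
        return f"unknown mode {value}"


print(describe_encoder_mode(257))
```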

Secondly, during calibration, the motor movement is very juddery. This is new behaviour since in 0.5.1 it moved smoothly.

  • Calibration has a similar success rate to 0.5.1, though roughly 1 in 4 attempts still result in ENCODER_ERROR_CPR_POLEPAIRS_MISMATCH; I have never been quite sure why.
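
A rough sketch of the check behind ENCODER_ERROR_CPR_POLEPAIRS_MISMATCH may help explain why judder matters here. As far as I understand the firmware, the offset calibration sweeps a fixed electrical angle and compares the encoder counts actually seen with the counts expected from cpr and pole_pairs. The 16π scan distance and 2% tolerance below are assumptions recalled from the 0.5.x defaults (calib_scan_distance and calib_range), and the 7 pole pairs in the example are hypothetical. The cpr of 16384 follows from the AS5047P's 14-bit resolution.

```python
import math


def expected_encoder_delta(cpr, pole_pairs, scan_distance_elec_rad=16 * math.pi):
    """Encoder counts expected while sweeping the given electrical angle."""
    elec_rad_per_count = pole_pairs * 2 * math.pi / cpr
    return scan_distance_elec_rad / elec_rad_per_count


def cpr_polepairs_ok(measured_delta, cpr, pole_pairs,
                     scan_distance_elec_rad=16 * math.pi, tolerance=0.02):
    """True if the measured count change is within tolerance of the expected one."""
    expected = expected_encoder_delta(cpr, pole_pairs, scan_distance_elec_rad)
    return abs(abs(measured_delta) - expected) / expected <= tolerance


# Hypothetical motor with 7 pole pairs and a 14-bit encoder (cpr = 16384).
expected = expected_encoder_delta(16384, 7)
print(f"expected counts over the sweep: {expected:.1f}")
print("smooth sweep passes:", cpr_polepairs_ok(18700, 16384, 7))
print("sweep that slipped passes:", cpr_polepairs_ok(18000, 16384, 7))
```

Under these assumptions, a rotor that skips or stalls for even a few percent of the sweep would be enough to trip the error, which is at least consistent with intermittent failures on a juddery calibration.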

Thirdly, in closed loop torque control, setting the torque to a low value (e.g. 0.7) causes the comms between the odrive and the PC to drop out. I get this message in odrivetool (which then itself becomes unresponsive):

Oh no odrv0 disappeared
In [11]: 11:39:31.425452100 [USB] Transfer on EP 0x83 still in progress. This is gonna be messy.
Exception in callback <CFunctionType object at 0x000001FE1221E798>(<ctypes.winty...001FE1221BEC8>)
handle: <Handle <CFunctionType object at 0x000001FE1221E798>(<ctypes.winty...001FE1221BEC8>)>
Traceback (most recent call last):
  File "c:\python36\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
OSError: exception: access violation writing 0x0000000000000024

Fourthly, if instead of letting the motor spin up to any kind of speed I actually hold it firm and fight it, the comms don’t drop out but I see a very strange behaviour. Basically I am able to ‘stick’ the motor such that I can move my hand away and it stays motionless. Then if I tap it, it suddenly experiences the torque again and begins accelerating until I grab it again.

  • when the motor is stuck this way I can set requested_state to Idle and back again, and the motor stays motionless.
  • My working theory is that this is something to do with the encoder offset being off somehow, maybe as a result of the juddering during calibration.

And finally, if I hold the motor still while it’s trying to turn (without doing the sticking trick) and steadily increase the torque, I can get it to about 1.3 before the odrive bombs out with this error:

system: no error
axis0
  axis: no error
  motor: Error(s):
    MOTOR_ERROR_UNKNOWN_TORQUE
    MOTOR_ERROR_UNKNOWN_VOLTAGE_COMMAND
  sensorless_estimator: no error
  encoder: Error(s):
    ENCODER_ERROR_ABS_SPI_COM_FAIL
  controller: Error(s):
    CONTROLLER_ERROR_INVALID_ESTIMATE

At this torque I should be well within the current_lim, which I set to 45. And although it says ENCODER_ERROR_ABS_SPI_COM_FAIL, I have never seen spi_error_rate visibly move from 0 on the liveplotter.
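
To put the "well within the current_lim" claim into numbers, here is a minimal sketch of the torque-to-current conversion. It assumes the torque setpoint is in Nm and that torque_constant follows the 8.27 / KV rule given in the ODrive documentation. The KV of 150 is a hypothetical value, since the motor's KV is not stated in this thread.

```python
def torque_constant_from_kv(kv_rpm_per_volt):
    """Torque constant in Nm/A, using the 8.27 / KV rule of thumb."""
    return 8.27 / kv_rpm_per_volt


def current_for_torque(torque_nm, kv_rpm_per_volt):
    """Motor current (A) needed to produce the requested torque."""
    return torque_nm / torque_constant_from_kv(kv_rpm_per_volt)


KV = 150            # hypothetical motor KV, not taken from the thread
CURRENT_LIM = 45    # current_lim quoted above

for torque in (0.7, 1.3):
    amps = current_for_torque(torque, KV)
    print(f"{torque} Nm -> {amps:.1f} A (limit {CURRENT_LIM} A)")
```

If torque_constant on the board is set to something other than 8.27 / KV, the real current for a given setpoint will differ accordingly, so it is worth checking that value before trusting the margin.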

I do not know how to approach dealing with these issues.

No one?

This is a brand new Odrive running the latest firmware and it’s unusable. I must say I would have expected the creators at least to have some interest in why their flagship product is a brick.

I apologise for my frustration. I understand that the culture of open source is “it’s on you to be clever enough to use the product” with a healthy dose of “you get what you pay for”. But we do pay for Odrives and it’s my understanding that with the Odrive Pro you’re looking at going in a more closed source direction anyway. When the product doesn’t work as advertised, and instead gives obscure and introverted error messages, some support would really be helpful to the likes of me.

Anyway.

I’ve been staring at this for several days now and so far this is what I have come up with:

  • the comms bombout seems to be down to noise. Tying the power supply earth to a spare GND on the Odrive seems to make the error less common, though it has happened sporadically since.

  • increasing calibration current improves the judder. This is confusing to me since when the same setup ran fw-0.5.1 it calibrated fine at the old level. Has the behaviour of calibration changed somehow? (I can hear by the whistling that it’s now taking current measurements during the process, when it didn’t used to.)

  • in any case I can get the calibration_current up to 17 before calibration starts failing with MOTOR_ERROR_PHASE_RESISTANCE_OUT_OF_RANGE. This is enough to improve the judder a lot but not eliminate it completely.

  • I’ve disassembled the motor and cleaned and inspected it, and I can’t visually see anything wrong in there. The resistance is the same between any two phases and it’s about 0.3 ohms.

  • the bizarre cog-sticking behaviour has improved to merely having deadspots where the torque isn’t as strong. I’m putting this down to the reduced judder meaning a reduced encoder offset error. The deadspots themselves (and just a general strong cogging feeling as you turn the shaft through its rotation) are very perceptible and strong.

  • the bombouts that happened when I held the motor still against high torque have stopped (or at least, “retreated” a bit.) I’m sort of assuming this is also due to better calibration.

  • however, I can get exactly the same error message by setting a high input_torque and letting the thing spin freely.

  • I am assuming the top two errors (MOTOR_ERROR_UNKNOWN_TORQUE and MOTOR_ERROR_UNKNOWN_VOLTAGE_COMMAND) are downstream of the real problem, which is the encoder suddenly dropping out.

  • liveplotting spi_error_rate shows that it stays at 0, not appearing to move at all, before suddenly spiking out of nowhere. If this was due to high currents or something I would have expected to see a steadily rising amount of noise that finally tips it over the edge, but nope: the encoder spi is healthy one moment and just gone the next. Anyone recognise these symptoms?

  • I found a page online that implied the MOTOR_ERROR_DRV_FAULT is caused by a problem with the DRV8301, which doesn’t fit my symptoms at all (why would the DRV8301 care which encoder mode we’re using?) But while INCREMENTAL_MODE is forbidden to me I can’t investigate whether the other problems go away when not using spi.
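
On the MOTOR_ERROR_PHASE_RESISTANCE_OUT_OF_RANGE point in the list above: the ODrive documentation asks for resistance_calib_max_voltage to be greater than calibration_current times the phase resistance, so raising the calibration current eventually runs out of voltage headroom. A minimal sketch of that arithmetic follows. It assumes a wye-wound motor, so the phase resistance is half of the 0.3 ohm phase-to-phase reading (itself only a rough multimeter figure), and the 2.0 V limit is a hypothetical example value rather than one read from this board.

```python
def required_calib_voltage(calibration_current_a, phase_resistance_ohm):
    """Voltage needed to push the calibration current through one phase."""
    return calibration_current_a * phase_resistance_ohm


def calib_voltage_headroom(calibration_current_a, phase_resistance_ohm,
                           resistance_calib_max_voltage):
    """Positive when the configured limit leaves room, negative when it does not."""
    needed = required_calib_voltage(calibration_current_a, phase_resistance_ohm)
    return resistance_calib_max_voltage - needed


LINE_TO_LINE_OHM = 0.3                 # multimeter reading from the post
PHASE_OHM = LINE_TO_LINE_OHM / 2       # assumes a wye-wound motor
MAX_VOLTAGE = 2.0                      # hypothetical resistance_calib_max_voltage

for current in (10, 17, 20):
    headroom = calib_voltage_headroom(current, PHASE_OHM, MAX_VOLTAGE)
    print(f"{current} A needs {required_calib_voltage(current, PHASE_OHM):.2f} V, "
          f"headroom {headroom:+.2f} V")
```

If the headroom comes out negative for the values actually configured, increasing resistance_calib_max_voltage (while keeping it well below half the bus voltage) is the adjustment the documentation suggests.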

So now I’m wondering how to improve the calibration_cycle to remove the rest of the judder (while still being confused about why it’s suddenly different), how to improve the deadspots/cogging if better calibration doesn’t prove to be enough on its own, and how to diagnose the encoder comms suddenly giving out like it’s doing.

That sounds like it’s doing something to the SPI. Save and reboot?

All the other issues sound like SPI noise. Are you using the ferrite rings from the shop?

I don’t know what this means.

Some users have reported issues with 0.5.4 increasing the emitted electrical noise, we’re not sure what may have caused that - one theory is the GPIO and SPI pins got set to VERY_HIGH slew rate.

Thanks for replying.

Re: MOTOR_ERROR_DRV_FAULT:

The MOTOR_ERROR_DRV_FAULT was constant and persisted across power cycles. In order to test everything again in INCREMENTAL_MODE I ended up removing the wires from MOSI, MISO and CLK. After that the error went away.

This is still a regression though; with fw-0.5.1 I could wire all the encoder pins up properly and toggle modes through the Odrive tool when I wanted to check something. Having to instead constantly rewire things is annoying by comparison.

I would be interested to know why this happens. Do you know what’s going on, or what it was that you changed to cause this?

Re: spi noise:

Yes, I am using the ferrite cores (and they did help solve a similar problem when I installed them, back when I was using fw-0.5.1) but they are apparently still not enough. I’m going to try making a few changes to the rig to sort out the grounding properly.

I do however have a few questions about that diagnosis:

  • how does noise on the spi wires cause the usb comms to bomb out?
  • if there’s enough noise to sabotage the calibration process, why don’t I see any on the liveplotter/why don’t I get an SPI_COM_FAIL at that point?
  • is it possible that spi noise could be causing very noisy current measurements? I’m seeing extremely noisy current, which often jumps over my 45A limit when it really, really shouldn’t.

Re: whistling and the changed calibration process:

Yes, by ‘whistling’ I mean the emitted electrical sound you can hear when in closed loop control. I took from a previous conversation with you that that noise comes from the current measurement being taken.

However, in fw-0.5.1 I didn’t hear this noise during the calib cycle, only during CLOSED_LOOP_CONTROL mode. In fw-0.5.4 I hear it during both.

So my question is what changes have been made to the calib cycle, and could they explain the other differences I’m seeing (juddery behaviour when moving, my needing to increase the calib current when previously 10A was enough, etc.)?

(BTW - I’ve bounced around through the source code a bit, and I can’t see the point where that sound gets ‘turned on’ and ‘turned off’. It’s obviously not on all the time, so something must be doing it, but I can’t see where it is. Can you/anyone answer that?)

If I had to guess, it is because the pins are somehow configured differently between the firmwares.

Well, I was thinking more just general noise issues, not specifically SPI coupled noise. Can you put together a little drawing of your power and comms wiring?

This is more likely because you have a poor encoder offset measurement, or a motor that is hard to control (trapezoidal bEMF, etc). SPI noise can translate into bad encoder position estimate which translates to bad current control due to the FOC algorithm’s reliance on accurate phase angle measurement.
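
The dependence on phase angle can be illustrated with a short idealized sketch. If the controller's idea of the electrical angle is wrong by some error, a commanded torque-producing current is split into a useful part (cosine of the error) and a wasted part (sine of the error). The 7 pole pairs and the 10 A command are hypothetical values, and the model ignores reluctance torque and any position dependence of the error.

```python
import math


def current_split(iq_command_a, mech_offset_error_deg, pole_pairs):
    """Return (torque-producing, non-torque-producing) current for an angle error.

    The mechanical offset error is scaled by pole_pairs to get the
    electrical angle error that the FOC transform actually sees.
    """
    elec_error_rad = math.radians(mech_offset_error_deg * pole_pairs)
    useful = iq_command_a * math.cos(elec_error_rad)
    wasted = iq_command_a * math.sin(elec_error_rad)
    return useful, wasted


POLE_PAIRS = 7  # hypothetical

for mech_err_deg in (0.0, 2.0, 6.0, 90.0 / POLE_PAIRS):
    useful, wasted = current_split(10.0, mech_err_deg, POLE_PAIRS)
    print(f"{mech_err_deg:5.2f} deg mechanical error: "
          f"{useful:6.2f} A useful, {wasted:6.2f} A wasted")
```

With 7 pole pairs, a mechanical error of under 13 degrees is already a 90 degree electrical error, where the commanded current produces no torque at all. That would be compatible with a motor that can be "stuck" in place, though it does not by itself confirm the diagnosis.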

Hmm, not sure exactly. We always took current measurements, but we never run in current-control mode during calibration (it’s in “Lockin” mode). With that said, I’m not sure if the lock-in mode changed between versions.

It’s specifically when the FOC algorithm is running with current feedback (closed loop), as current measurement noise gets coupled to the controller. In voltage-only FOC (motor measurement, encoder offset search, gimbal mode), we don’t use the current feedback.

I unfortunately do not know enough about microelectronics to make anything of this response. Does this relate to what internal pullup or pulldown resistors are configured on the pin, or am I way off the mark?

I can, but it’s pretty standard and I don’t think it would illuminate anything. I’m going to try changing the rig anyway - I think that shaft voltage from the motor might be responsible for the noise - so we’ll see what works and what doesn’t after I’ve done that.

This is what I was thinking. I’m just not sure how to make it better. Again I have to hope that if I reduce the noise I’ll see some improvement. The calibration changes and problems are an open question.

Interesting. I’m not working with gimbal mode - but does that mean that people who do aren’t experiencing any noise?

No, it only means that they don’t see the current measurement noise - the electrical noise that is giving you trouble generally comes from switching or shaft voltage etc.