So the feedback will be the position of the AprilTag relative to the camera attached to my 2D gantry. It will not be the encoder on the ODrive. I am having trouble wrapping my head around how to implement an accurate control algorithm.

I was looking for any advice you might have for this. Should I implement my own PID controller where the feedback input is tag position and the output is motor velocity? Or should I just calculate the distance and give a position command to the ODrive controller?

Happy to answer any questions you may have! Thanks so much, this looks like a fantastic community!

What you describe is called “Visual Servoing”, and it is becoming quite a popular subject in robotics.
Normally, this is applied as a ‘correction’ offset to the position command, by an external loop. But for high-performance (fast/agile/animal-like) robots, you may want to bypass all the joint-based PID loops and control torque on each joint at a high rate from a model-predictive controller (MPC), which is aware of the physics between joints, as well as the effect of the load on each joint’s inertia, etc.
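
To make the ‘correction offset from an external loop’ idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name `outer_loop_step`, the gain and the step limit are made up, the tag offset is assumed to be already converted to metres, and the inner position loop is assumed to track its setpoint perfectly. Sending the setpoint to the drive is left out on purpose.

```python
def outer_loop_step(tag_offset_m, setpoint_m, gain=0.3, max_step_m=0.005):
    """One iteration of a slow outer loop.

    tag_offset_m: tag position minus desired tag position, as measured
                  by the camera (hypothetical, already scaled to metres).
    setpoint_m:   position setpoint currently held by the inner loop.
    Returns the corrected setpoint.
    """
    step = gain * tag_offset_m
    # Limit the correction per camera frame so one noisy detection
    # cannot command a large jump.
    step = max(-max_step_m, min(max_step_m, step))
    return setpoint_m + step


def simulate(tag_pos_m=0.10, iterations=200):
    """Toy simulation: the axis is assumed to reach each setpoint exactly."""
    axis_pos_m = 0.0
    for _ in range(iterations):
        offset = tag_pos_m - axis_pos_m      # what the camera would report
        axis_pos_m = outer_loop_step(offset, axis_pos_m)
    return axis_pos_m
```

The outer loop runs at the camera frame rate, while the drive's own encoder-based loops keep running much faster underneath. A gain below 1 and a step limit are a common way to keep camera latency and noise from destabilising the axis, but the right values depend on your setup.
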
The relation between ‘forces’ at the end-effector and ‘torques’ at each joint of a serial manipulator (a standard robot arm) is given by the ‘Jacobian’ of the kinematic chain of coordinate transforms which describes the robot: the joint torques are the transpose of the Jacobian times the end-effector force.
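
As an illustration of the Jacobian relation, here is a small sketch for a two-link planar arm. The arm geometry, the link lengths and the function names are assumptions chosen only for the example (a Cartesian gantry like yours has a much simpler, essentially diagonal Jacobian).

```python
import math


def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Jacobian of the end-effector (x, y) with respect to joint angles (q1, q2)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [
        [-l1 * s1 - l2 * s12, -l2 * s12],
        [l1 * c1 + l2 * c12, l2 * c12],
    ]


def joint_torques(jacobian, force_xy):
    """Map an end-effector force to joint torques: tau = J^T * F."""
    fx, fy = force_xy
    tau1 = jacobian[0][0] * fx + jacobian[1][0] * fy
    tau2 = jacobian[0][1] * fx + jacobian[1][1] * fy
    return tau1, tau2
```

With the arm stretched straight along x (both angles zero, unit link lengths), a 1 N force in +y needs 2 N·m at the first joint and 1 N·m at the second, which matches the lever arms you would expect.
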

Of course, for a synchronous motor (including any brushless DC motor) to produce a controllable torque, you still need an encoder.
Even in current mode (aka torque mode) you need an encoder (preferably a fast, high-resolution encoder) to know how to translate a 1-dimensional torque demand into the 2-dimensional current demand needed for a 3-phase synchronous motor.
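
Here is a rough sketch of why the rotor angle is needed for that translation. It assumes a non-salient motor driven with zero d-axis current, a known torque constant, and the usual inverse Park transform; the function and parameter names are made up for the example and this is not meant to describe any particular drive's firmware.

```python
import math


def torque_to_alpha_beta(torque_nm, kt_nm_per_a, rotor_angle_rad, pole_pairs):
    """Turn a scalar torque demand into a 2-D stator current demand.

    Assumptions (not from the original post): i_d = 0, torque = kt * i_q.
    """
    theta_e = pole_pairs * rotor_angle_rad   # electrical angle from the encoder
    i_d = 0.0
    i_q = torque_nm / kt_nm_per_a
    # Inverse Park transform: rotate (d, q) into the stationary (alpha, beta) frame.
    i_alpha = i_d * math.cos(theta_e) - i_q * math.sin(theta_e)
    i_beta = i_d * math.sin(theta_e) + i_q * math.cos(theta_e)
    return i_alpha, i_beta
```

The magnitude of the current vector depends only on the torque demand, but its direction rotates with the rotor. Without a good angle measurement, the same current magnitude can produce less torque, no torque, or torque in the wrong direction.
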

What you may do is use a proportional navigation (PN) control law, where the input is the line-of-sight rate, i.e. the angular rate at which your goal point moves across the camera's view. For example, if your object is moving 100 pixels per second to the left, your camera has a 1000-pixel horizontal resolution, and your lens has a 100-degree field of view, then your object has a CCW relative rotation rate of (100 pixel/s) * (100 deg / 1000 pixel) = 10 deg/s.
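
The arithmetic above as a tiny helper, in case it is useful. The function name is made up, and it assumes the angle per pixel is constant across the image, which is only an approximation for a real lens (especially a wide one).

```python
def los_rate_deg_per_s(pixel_rate_px_per_s, image_width_px, fov_deg):
    """Approximate line-of-sight rate from the tag's pixel velocity.

    Assumes a uniform angular scale of fov_deg / image_width_px per pixel.
    """
    deg_per_px = fov_deg / image_width_px
    return pixel_rate_px_per_s * deg_per_px
```

For the numbers in the example, `los_rate_deg_per_s(100, 1000, 100)` gives about 10 deg/s.
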

In the PN law, N = 3 is the navigation constant (a dimensionless gain); the rest of the quantities are vectors.

Feed this into the PN law and you can compute the lateral acceleration (latax) to be commanded to your moving platform.
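
A minimal 2-D sketch of that computation follows. Since the equation itself is not reproduced in this post, the code assumes the common textbook vector form a = N * V_r x Omega, with Omega = (R x V_r) / (R . R), where R is the target position relative to the platform and V_r the target velocity relative to the platform. The function name and argument layout are made up for the example.

```python
def pn_acceleration(rel_pos, rel_vel, n=3.0):
    """Proportional navigation in the plane.

    rel_pos: (x, y) of the target relative to the platform.
    rel_vel: (vx, vy) of the target relative to the platform.
    Returns the commanded acceleration (ax, ay) for the platform.
    """
    rx, ry = rel_pos
    vx, vy = rel_vel
    # Line-of-sight rotation rate (z component of R x V_r, divided by |R|^2).
    omega = (rx * vy - ry * vx) / (rx * rx + ry * ry)
    # V_r x (omega * z_hat), scaled by the navigation constant.
    ax = n * (vy * omega)
    ay = n * (-vx * omega)
    return ax, ay
```

For a target 1 m ahead in x, closing at 1 m/s and drifting at 0.1 m/s in +y, this commands roughly (0.03, 0.3) m/s², i.e. mostly a sideways acceleration in the direction the target is drifting. When the line of sight is not rotating, the command is zero, which is the collision-course condition PN tries to hold.
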

This can achieve leading interception instead of pure pursuit homing; this is very popular in missile guidance.