If I can modus tollens this modus ponens, it feels to me that

Indeed, even for 100 questions [...] this would come up as significant less than 50% of the time

is evidence that the noise level is low, and the skill difference is small.

E.g., taking the top 20 forecasters in Metaculus' last Quarterly Cup, we see average score differences of ~0.05 (equivalent to your highest noise level), and that's among the very top forecasters we had in that tournament!