I don't consider it "that" strange. If this were a funded professional venture, I wouldn't expect users to be compiling our regression tests for us; I'd expect us to release new versions on a less frequent schedule, backed by a large set of generated test cases. User-created test cases are definitely valuable, but in terms of coverage it's a pretty poor strategy to wait for something to go wrong and only then create a case for it.
(Note I did say that creating a set of "ALL ENCOMPASSING" tests would be a massive effort, as opposed to just a few specific cases.)
Given that, it's all we have for now, so I agree we should catalog all the ones we get in.
For example, what we should have is a test library covering every imaginable combination of PROC, arguments, USES, stack frame, etc. (within reason, obviously). The problem is that to verify the accuracy of this you need a baseline for comparison, which we don't have yet and won't until this is all 100% stable.
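A combinatorial test library like that is straightforward to generate mechanically. Here's a minimal sketch of the idea; the axes of variation and the MASM-style PROC syntax in the generated strings are purely illustrative, not the project's actual test format:

```python
from itertools import product

# Illustrative axes only -- a real library would cover far more
# dimensions (locals, stack alignment, prologue options, etc.).
ARG_COUNTS = [0, 1, 4, 8]
USES_CLAUSES = ["", "USES rbx rsi", "USES rbx rsi rdi r12"]
FRAME_OPTS = ["FRAME", ""]

def gen_case(idx, nargs, uses, frame):
    """Emit one minimal MASM-style PROC exercising a single combination."""
    parts = [f"case{idx} PROC", frame, uses]
    header = " ".join(p for p in parts if p)
    args = ", ".join(f"arg{i}:QWORD" for i in range(nargs))
    if args:
        header += (", " if uses else " ") + args
    return f"{header}\n    ret\ncase{idx} ENDP\n"

# One generated source snippet per combination of the axes above.
cases = [gen_case(i, *combo)
         for i, combo in enumerate(product(ARG_COUNTS, USES_CLAUSES, FRAME_OPTS))]
print(len(cases))  # 4 * 3 * 2 = 24 combinations
```

Each generated snippet would then be assembled and its output checked, which is where the baseline problem below comes in.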
I think we're reaching a point now where I feel comfortable that we've added enough "new" stuff, so what I'd really like is to leave it alone to stabilize and release only bug fixes for a while. Once that's done, I'll look at generating a full set of new regression tests that verify stack setup, alignment, locals, prologue/epilogue generation, USES, etc., as that is really where all of these issues have originated.
For the PROC-related stuff you really want to do the comparison at the binary level, and that's difficult when you don't have something you know to be correct to compare against. So I'm thinking that by letting it stabilize we can reach a point, albeit verified manually, where we're sufficiently satisfied with what it generates to use that output as the baseline for future comparisons.
We went through a phase with a lot of AVX/AVX2 and EVEX encoding fixes, but those seem to have settled down now, and we've got a test suite for them.