About the Role
This role focuses on making software evaluation reproducible and trustworthy. You will:
• Build standardized, reproducible test environments using Docker images to replicate known issues and validate expected outputs.
• Review and strengthen unit test coverage to assess correctness and stability of target repositories.
• Validate task and test set completeness for workflows aligned with SWE-Bench and Terminal-Bench.
• Produce high-quality task documentation (e.g., task.yaml and README) that emphasizes reproducibility and standardized processes.



