Repo-based Code Annotator

Remote, repo-focused role building reproducible Docker test environments, reviewing unit tests, validating SWE-Bench/Terminal-Bench workflows, and writing clear task documentation. Compensation: USD $80–$120 per day, subject to skills and experience.


About the Role

This role focuses on making software evaluation reproducible and trustworthy. You will:

• Build standardized, reproducible test environments using Docker images to replicate known issues and validate expected outputs.
• Review and strengthen unit test coverage to assess correctness and stability of target repositories.
• Validate task and test set completeness for workflows aligned with SWE-Bench and Terminal-Bench.
• Produce high-quality task documentation (e.g., task.yaml and README) that emphasizes reproducibility and standardized processes.

Key Responsibilities

• Author Dockerfiles and compose minimal, efficient images that reliably reproduce target repo states (see the sketch after this list).
• Analyze existing test suites; add or refine pytest-based tests to increase signal without introducing flakiness.
• Implement and verify controls for randomness and external I/O (mocking, seeding, fixtures).
• Ensure Git/GitHub workflows produce clean, reproducible pull requests with clear diffs and instructions.
• Maintain task harnesses and automation scripts in Python (CLI tools, fixtures, adapters).
• Validate benchmark-aligned workflows (SWE-Bench/Terminal-Bench) for repeatability and fairness.
• Document assumptions, environment variables, entrypoints, and verification steps for each task.
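
As a rough illustration of the reproducibility bar for those images, here is a minimal Dockerfile sketch. The repo URL, commit hash, and version pins are hypothetical placeholders, not a real assignment:

    # Minimal sketch of a reproducible task image; the repo URL, commit
    # hash, and dependency pins are hypothetical placeholders.
    FROM python:3.11-slim

    # git is needed only to fetch the target repo at an exact commit.
    RUN apt-get update && apt-get install -y --no-install-recommends git \
        && rm -rf /var/lib/apt/lists/*

    # Pin the repository to a specific commit so every build sees the same code.
    RUN git clone https://github.com/example/target-repo.git /app \
        && git -C /app checkout 0123abcd

    WORKDIR /app

    # Pin test dependencies to exact versions for deterministic builds.
    RUN pip install --no-cache-dir pytest==8.2.0 -r requirements.txt

    # Default entrypoint runs the test suite; individual tasks may override it.
    CMD ["pytest", "-q"]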

Required Skills

• Proficient in the Linux command line and shell scripting; comfortable with grep, sed, awk, curl, and jq.
• Strong Python programming skills for harnesses, tests, and automation.
• Proficient with Docker: writing Dockerfiles, building images, and ensuring reproducible builds.
• Familiar with pytest and unit testing practices, including mocking and controlling randomness (see the sketch after this list).
• Comfortable with Git/GitHub workflows; able to submit clean, reproducible pull requests.
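
As a brief sketch of what "mocking and controlling randomness" can look like in pytest, consider the following self-contained test file. The fetch_config and summarize helpers are hypothetical stand-ins for code in a target repo:

    import random
    import sys
    import urllib.request

    import pytest

    def fetch_config(url: str) -> bytes:
        # Hypothetical stand-in for external I/O that tests must not perform.
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def summarize(url: str) -> int:
        # Depends on fetch_config, which the test below patches out.
        return len(fetch_config(url))

    @pytest.fixture(autouse=True)
    def fixed_seed():
        # Seed the global RNG before every test so "random" behavior repeats.
        random.seed(1234)

    def test_sampling_is_deterministic():
        # An independent RNG with the same seed yields the same sequence.
        rng = random.Random(1234)
        expected = [rng.randint(0, 9) for _ in range(3)]
        assert [random.randint(0, 9) for _ in range(3)] == expected

    def test_no_real_network(monkeypatch):
        # Patch this module's fetch_config so summarize never hits the network.
        monkeypatch.setattr(sys.modules[__name__], "fetch_config",
                            lambda url: b"stub")
        assert summarize("https://example.invalid/config") == 4

The autouse fixture keeps seeding out of individual tests, and monkeypatch restores the original function automatically after each test.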

Professional Background

• Background in Computer Science, Software Engineering, Artificial Intelligence, or related fields; or experience in Software Development, Test Engineering, DevOps, or Data Engineering.
• Preference for contributors to open-source projects, especially in automated testing, CI/CD, or containerization.

Bonus Points

• Experience with high-performance languages (Go or Rust).
• Familiarity with sandbox and orchestration tech (Docker Compose, Podman).
• Ability to design datasets/tasks that discourage “task cheating.”
• Understanding of benchmark design principles: fairness, repeatability, and scalability.
• Experience with CI/CD or automated testing systems and cross-disciplinary work.

Compensation

USD $80–$120 per day, dependent on demonstrated skills, experience, and the complexity of assigned tasks.

Work Arrangement

• Remote opportunity.
• Collaborate asynchronously via GitHub issues/PRs and documented workflows.
• Outcome-oriented: emphasis on reproducibility, clarity, and measurable test improvements.

Tools & Technologies

• Linux, Shell (bash/sh), grep/sed/awk/curl/jq
• Python, pytest
• Docker, Dockerfiles
• Git, GitHub
• Optional: Go, Rust, Docker Compose, Podman

Quality & Standards

• Reproducibility: deterministic environments with documented versions and seeds.
• Clarity: task.yaml/README must include setup, usage, and validation steps (an illustrative task.yaml sketch follows this list).
• Maintainability: modular harnesses, minimal external dependencies, clear test fixtures.
• Integrity: benchmark-aligned tasks that are resistant to shortcuts and ensure fair evaluation.
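
As an illustrative sketch only (the field names are hypothetical; real schemas follow whichever harness a task targets), a task.yaml along these lines would meet the clarity bar:

    # Hypothetical task.yaml sketch; actual field names depend on the harness.
    id: example-repo-issue-42          # placeholder task identifier
    image: example/task-env:1.0        # pinned Docker image tag
    setup: |
      pip install -r requirements.txt
    run: pytest tests/test_issue_42.py -q
    validation:
      expected_exit_code: 0            # task passes when the pinned tests pass
    notes: |
      Seeds are fixed in conftest.py; no network access is needed at test time.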

Repo-based Code Annotator — FAQ

  • Q: What does “repo-based code annotation” entail in this role?

    You will prepare repositories for reliable automated evaluation by building Dockerized environments, refining unit tests, and writing documentation that makes tasks repeatable and verifiable. The focus is on correctness, coverage, and clear instructions.

  • Q: How do SWE-Bench and Terminal-Bench relate to day-to-day work?

    You will align workflows and test sets with the requirements of SWE-Bench and Terminal-Bench, ensuring tasks are complete, fair, and reproducible, with deterministic behavior and well-defined validation steps.

  • Q: Which programming and tooling skills are essential?

    Python (for harnesses/tests), Linux shell tools (grep/sed/awk/curl/jq), Docker (Dockerfiles, reproducible builds), pytest (fixtures, mocking, seeding), and Git/GitHub for clean PRs.

  • Q: Is experience with Go or Rust required?

    No. Go and Rust are a plus for performance-oriented tasks, but they are not mandatory for success in this role.

  • Q: How is success measured?

    By reproducible environments, improved test coverage and stability, clear documentation, and workflows that meet benchmark standards without flakiness or ambiguity.

  • Q: Is the position fully remote and asynchronous?

    Yes. Work is remote and collaboration occurs via GitHub and documented processes, with an emphasis on asynchronous communication.

  • Q: What level of seniority is expected?

    A practitioner comfortable with Python testing, Docker, and Linux tooling. You should be able to independently produce reproducible environments and clear documentation without requiring advanced research-level expertise.

  • Q: What compensation can I expect?

    The daily rate is USD $80–$120, dependent on your proven skills, relevant experience, and the complexity of work assigned.

  • Q: Do I need open-source contributions to be considered?

    Open-source contributions are preferred—especially in testing, CI/CD, or containerization—but they are not strictly required.

  • Q: What kind of documentation is expected?

    task.yaml and README files that document environment setup, dependencies, execution steps, validation procedures, and any assumptions or constraints. The goal is repeatability and clarity.

230+ Domains Covered
120K+ PhDs, Specialists, and Experts Onboarded
50+ Countries Represented

Industry-Leading Compensation

We believe exceptional intelligence deserves exceptional pay. Our platform consistently offers rates above the industry average, rewarding experts for their true value and real impact on frontier AI. Here, your expertise isn’t just appreciated—it's properly compensated.

Work Remotely, Work Freely

No office. No commute. No constraints. Our fully remote workflow gives experts complete flexibility to work at their own pace, from any country, any time zone. You focus on meaningful tasks—we handle the rest.

Respect at the Core of Everything

AI trainers are the heart of our company. We treat every expert with trust, humanity, and genuine appreciation. From personalized support to transparent communication, we build long-term relationships rooted in respect and care.

Ready to shape the future of code annotation?

Apply below.

I'M INTERESTED