Repo-based Code Annotator (Remote)

Contract role for a Repo-based Code Annotator to build reproducible Docker-based environments, strengthen unit test coverage, validate benchmark-aligned task sets, and deliver clear, standardized documentation. Daily rate: USD $80–$120 depending on skills and experience.

About the Role

You will create reliable, reproducible task environments and testing workflows that ensure target repositories behave as expected. The work centers on containerized testbeds, unit testing, and documentation that enable consistent validation across SWE-Bench and Terminal-Bench workflows.

Key Responsibilities

• Build reproducible, standardized test environments using Docker images to replicate known issues or achieve expected outputs per defined procedures (see the Dockerfile sketch below).
• Review and improve unit test coverage and effectiveness to evaluate the correctness and stability of target codebases.
• Validate the completeness and soundness of test sets, ensuring workflows align precisely with SWE-Bench and Terminal-Bench requirements.
• Write high-quality task documentation (task.yaml/README) with an emphasis on reproducibility, determinism, and standardized processes.
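As a rough illustration of the first responsibility, here is a minimal sketch of a pinned, reproducible Dockerfile. The base image, repository URL, and commit hash are placeholders, not project standards.

```dockerfile
# Illustrative only: base image, versions, and repo URL are placeholders,
# not a prescribed project standard.
FROM python:3.11-slim

# Avoid cache-dependent and interactive behavior so rebuilds stay comparable.
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1

RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

# Check out a fixed commit rather than a branch tip, so the environment
# reproduces the same codebase on every build.
RUN git clone https://github.com/example/target-repo.git . \
    && git checkout 0123abc

# Install exact dependency versions from the repository's pinned requirements.
RUN pip install -r requirements.txt

CMD ["pytest", "-q"]
```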

Required Skills

• Proficient with the Linux command line and shell scripting; comfortable with tools such as grep, sed, awk, curl, and jq.
• Expert-level Python for building task harnesses, writing unit tests, and developing automation tooling.
• Strong Docker skills, including crafting Dockerfiles and building reproducible environments.
• Familiarity with testing frameworks (e.g., pytest), including structured unit tests, mocking, and controlling randomness (a sketch follows this list).
• Competent with Git/GitHub workflows and able to submit high-quality, reproducible pull requests.
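The following is a minimal, self-contained sketch of that pytest style, with seeded randomness and a mocked external call. The functions sample_ids and fetch_version are hypothetical stand-ins for real project code.

```python
# Minimal sketch of deterministic pytest style; sample_ids and
# fetch_version stand in for real project code.
import random
import sys

import pytest


def sample_ids(n: int, pool: int = 100) -> list[int]:
    # Stand-in for project logic that consumes randomness.
    return random.sample(range(pool), n)


def fetch_version() -> str:
    # Stand-in for a call that would normally hit the network.
    raise RuntimeError("network disabled in tests")


@pytest.fixture(autouse=True)
def fixed_seed():
    # Pin the RNG before every test so runs are repeatable.
    random.seed(1234)


def test_sampling_is_deterministic():
    # With the seed pinned, two identical draws must agree.
    first = sample_ids(5)
    random.seed(1234)
    assert sample_ids(5) == first


def test_external_call_is_mocked(monkeypatch):
    # Replace the network-touching function so the test stays hermetic.
    monkeypatch.setattr(sys.modules[__name__], "fetch_version", lambda: "1.2.3")
    assert fetch_version() == "1.2.3"
```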

Preferred Background

• Background in Computer Science, Software Engineering, Artificial Intelligence, or related fields; or relevant experience in Software Development, Test Engineering, DevOps, or Data Engineering.
• Demonstrated contributions to open-source projects, especially in automated testing, CI/CD, or containerization.

Bonus Points

• Proficiency in Go or Rust for high-performance tooling.
• Familiarity with sandbox and orchestration technologies (e.g., Docker Compose, Podman).
• Ability to design datasets/tasks that mitigate or prevent task cheating.
• Understanding of scientific benchmark design principles (fairness, repeatability, scalability).
• Experience with automated testing systems or CI/CD and a cross-disciplinary perspective.

Compensation & Engagement

• Rate: USD $80–$120 per day, commensurate with skills and experience.
• Engagement: Contract-based, remote.
• Scope: Task-oriented assignments focused on reproducibility, testing, and documentation quality.

Tools & Tech Stack

• Linux, Shell (grep, sed, awk, curl, jq); short examples follow this list
• Python (pytest, mocks, fixtures, determinism controls)
• Docker (Dockerfiles, image builds, reproducible pipelines)
• Git/GitHub (branches, PRs, code review)
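To give a flavor of the day-to-day shell work, here are two illustrative one-liners. The API URL, JSON fields, and log format are assumptions, not project specifics.

```sh
# Illustrative only: the API URL and JSON fields are placeholders.
# Fetch release metadata and extract the tag of the newest release.
curl -s https://api.example.com/releases | jq -r '.[0].tag'

# Count failing tests in a pytest log (log format assumed, not prescribed).
grep -c '^FAILED' pytest.log
```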

Success Metrics

• Reproducible builds and deterministic test runs.
• Improved and meaningful unit test coverage tied to project goals.
• Accurate alignment with SWE-Bench and Terminal-Bench workflows.
• Clear, standardized documentation (task.yaml/README) enabling others to reproduce results (an illustrative layout follows this list).
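For orientation only, a task.yaml might be laid out roughly as below. The field names are invented for illustration and are not the actual SWE-Bench or Terminal-Bench schema, which the benchmarks' own templates define.

```yaml
# Illustrative layout only; the real field names come from the
# benchmark's task templates, not from this sketch.
task_id: example-repo-issue-123        # hypothetical identifier
description: >
  Reproduce issue 123 in the pinned container and verify the fix
  with the repository's unit tests.
environment:
  image: example/target-repo:0123abc   # pinned image tag
steps:
  - pytest -q tests/test_issue_123.py  # deterministic verification command
expected: all tests pass on every run
```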

Frequently Asked Questions

Q: What does a typical assignment involve?
A: You will containerize a repository with Docker, implement or refine unit tests (pytest), validate workflows against SWE-Bench/Terminal-Bench expectations, and produce clear task.yaml/README documentation that guarantees reproducibility.

Q: How is compensation structured?
A: The role is contract-based with a daily rate of USD $80–$120, depending on skills and experience. The exact rate is set after evaluating your technical fit against the role's requirements.

Q: What level of experience is required?
A: You should be comfortable working independently with Linux, Python, Docker, pytest, and Git/GitHub. Prior open-source contributions in testing, CI/CD, or containerization are preferred but not strictly required.

Q: Do I need prior experience with SWE-Bench or Terminal-Bench?
A: Familiarity is helpful but not mandatory. You must be able to follow benchmark-aligned procedures precisely and design tests and environments that are deterministic and reproducible.

Q: Which tools and frameworks will I use most often?
A: Linux shell tooling (grep, sed, awk, curl, jq), Python (pytest), Docker (Dockerfiles), and Git/GitHub. Experience with Go/Rust, Docker Compose, or Podman is a plus.

Q: Is the role fully remote?
A: Yes. The work is remote and suited to asynchronous collaboration, provided deliverables meet reproducibility and quality standards.

Q: What does success look like in this role?
A: Reliable Docker builds, deterministic tests, accurate alignment with benchmark workflows, and concise documentation that enables others to reproduce results without ambiguity.

Q: Are there opportunities for follow-on work?
A: Possibly. Availability of additional assignments depends on project needs and the quality, timeliness, and reproducibility of your deliverables.

230+ Domains Covered
120K+ PhDs, Specialists, and Experts Onboarded
50+ Countries Represented

Industry-Leading Compensation

We believe exceptional intelligence deserves exceptional pay. Our platform consistently offers rates above the industry average, rewarding experts for their true value and real impact on frontier AI. Here, your expertise isn’t just appreciated—it's properly compensated.

Work Remotely, Work Freely

No office. No commute. No constraints. Our fully remote workflow gives experts complete flexibility to work at their own pace, from any country, any time zone. You focus on meaningful tasks—we handle the rest.

Respect at the Core of Everything

AI trainers are the heart of our company. We treat every expert with trust, humanity, and genuine appreciation. From personalized support to transparent communication, we build long-term relationships rooted in respect and care.

Ready to shape the future of code annotation?

Apply below.

I'M INTERESTED