2 openings · in-person · 3+ years · Closes 30 May 2026

Agentic AI Engineer

Build agentic AI systems that turn messy, real-world information into structured knowledge that production chat applications can rely on.

Location

Pune, India

CTC

INR 18 LPA

Start

Immediate

Experience

3+ years

About the role

We are hiring a senior engineer to lead the design and build of the agentic AI systems we ship to our customers. The work starts with messy raw material: internal documents, runtime data, and the tribal knowledge that lives only in the heads of subject matter experts. Your job is to turn that material into a typed, source-cited knowledge layer that production chat applications can query. You will own the ingestion pipelines, the agentic skills that maintain the knowledge base, the MCP servers that expose it to other systems, and the evaluation harnesses that grade every output the system produces.

This is a hands-on role. You will write most of the code in Claude Code or OpenAI Codex, design the prompts and the tool surfaces yourself, decide how memory is structured across long-running sessions, and stress-test the system until it stops fabricating. You will work in person at our Pune office alongside the founders and the rest of the technical team.

What we will test you on

The interview process is calibrated to five specific skill areas. For each one, we tell you upfront what the area is, what we are looking for, and a concrete example of the kind of question or situation you will be asked to reason about.

Context engineering

How you decide what goes into the prompt, where it goes, what gets cached, what gets summarized, and what stays out entirely. We want to see strong instincts about prompt structure under real cost and latency constraints.

Example

You have a 50-page operations manual and a chat application that needs to answer questions from it accurately. Walk us through the choices you would make about where to place the manual in the prompt, what to cache, and what to retrieve on demand.

Memory engineering for agents

How you decide what an agent should persist across sessions, what it should recompute from scratch, and what it should retrieve on demand. We are looking for someone who has thought hard about the line between working memory, persistent memory, and retrieved memory.

Example

An agent holds multi-day conversations with a user about their evolving project. Some facts must persist across days. Others should be recomputed every time. Some should be looked up only when needed. Walk us through how you would design this and where you would draw each line.

Tool design and MCP

How you build MCP servers, how you write tool descriptions that an agent will actually use the way you intended, and how you debug an agent that keeps picking the wrong tool. We want to see that you treat tool design as a first-class engineering problem.

Example

Your MCP server exposes both `list_documents` and `search_documents`. The agent keeps calling `list_documents` and filtering the results in its own head instead of using `search_documents`. Walk us through how you would diagnose and fix this without changing the agent's prompt.

Working in Claude Code or OpenAI Codex

Your actual day-to-day loop when you start a task in these tools. What you delegate to the AI. What you keep manual. What feature of the tool you have built a personal workflow around. We are looking for someone who has shipped at least one production system using Claude Code or Codex.

Example

Show us your workflow. Walk us through one recent session in detail. Tell us about something you built using Claude Code or Codex that is now running in production, and be specific about which parts the AI wrote and which parts you wrote.

Inference engineering

How you pick the right model for a given task, how you manage cost across a large run, and how you decide when prompt caching pays off versus when it does not. We want to see that you reason about model selection on cost and capability grounds, not on familiarity.

Example

You have a long ingestion pipeline with hundreds of LLM calls. Some calls need the most expensive model. Others can run on a much cheaper one. Walk us through how you would decide which is which and how you would set up the cost envelope.

What you will do

·Take messy real-world artifacts (documents, runtime data, expert knowledge captured from conversations) and drive them through agentic ingestion pipelines into typed, source-cited knowledge structures.
·Author Claude Code skills (essentially sub-agents) for ingestion, query, conflict detection, lint, and coverage. These small skills are the building blocks the rest of the system runs on, and writing good ones is most of the job.
·Stand up MCP servers that let domain experts query the system in natural language and get back structured answers with citations to the source material.
·Design and run the evaluation harnesses for every system you ship. You treat evaluation as part of the deliverable, not as an afterthought.
·Make per-task model selection decisions. Use the cheapest model that meets the bar; reserve the expensive ones for the cases where they actually change the answer.
·Refuse to let the system fabricate. When evidence is missing, the system says so. When sources contradict each other, both sides are quoted with attribution and the human curator decides.

What we expect

·First-principles fluency in the five skill areas we test on (context engineering, memory engineering, tool and MCP design, working in Claude Code or Codex, inference engineering). You should be able to defend specific choices, not just describe the landscape.
·Working knowledge of Claude Code or OpenAI Codex as your daily driver. You have shipped at least one production system using one of them.
·Comfort designing typed schemas and structured outputs that automated linting and coverage checks can rely on.
·A habit of improving input quality at the source rather than quietly compensating for bad inputs with smarter prompts.
·Precision in how you communicate. We are a small team and we cannot afford long ambiguous threads about what was actually decided.
·Willingness to learn an unfamiliar domain in days rather than weeks. Many of our customer engagements drop you into a domain you have never worked in before.

Highlights

Stack

Frontier LLMs (Claude Sonnet, Haiku, Opus; OpenAI flagship; Gemini 2.5 Pro), MCP servers, typed knowledge graphs, supervised agentic skills.

Methodology

Supervised ingestion, citation on every claim, conflicts surfaced and not silently resolved, evaluation harnesses shipped alongside the code.

Success bar

Every claim cites its source. Contradictions are quoted on both sides with attribution. The system refuses to make up an answer when the evidence is missing.

Cadence

Real customer engagements with real deadlines, working in person with the founders. Not a research role.

What we do not care about

·Your CGPA, your college, or the brand name of your last employer.
·Whether you have shipped this exact stack before. We care that you can reason about it from first principles.
·Polished, conventional answers to hard questions. We are happy to hear you think out loud.

How the process works

1. Apply
Sign in with Google, share a few basic details, upload your resume, and tell us about one or two things you have built with Claude Code or Codex.

2. Written reflection
Five short subjective questions, about 15 to 20 minutes. We block paste and record typing patterns because we want to see how you think when nobody is helping you.

3. Asynchronous video interview
Behavioral and technical questions, about 25 to 30 minutes total. You record on your own time. Camera and microphone are on.

4. Live build session
A short hands-on session on hire.vizuara.ai where you build something with an AI assistant. We score your prompts, your decisions, and the artifact you produce. About 30 minutes.

5. Final round
A 60-minute conversation with the founders at our Pune office.
