[ 01 ]
Full-Stack AI Engineer

Munim Ahmad

Building systems that reason, remember, and respond. Open-source author. Based in Lahore, Pakistan.

Specialized in RAG pipelines, LLM optimization, and production ML deployment.

3+ Years Building

Shipping production AI systems end-to-end.

Selected Work
[ 02 ]

Selected Projects

Context

Exact-match caches miss paraphrased prompts, so repetitive intent still triggers full-price LLM calls even when user meaning is identical.

Method

Wrap completion calls once, compute local prompt embeddings with ONNX, then perform cosine similarity lookup before forwarding misses to the provider.
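
The flow described above can be sketched in a few lines. This is a minimal illustration, not Recallm's actual API: `SemanticCache`, `toy_embed`, and the `threshold` value are hypothetical names chosen for this example, and the bag-of-words `toy_embed` is only a stand-in for a real ONNX embedding model (it matches near-duplicate wording, not true paraphrases).

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Wraps a completion function and answers similar prompts from cache."""

    def __init__(
        self,
        complete: Callable[[str], str],
        embed: Callable[[str], Counter] = toy_embed,
        threshold: float = 0.8,  # assumed value; tune per workload
    ) -> None:
        self.complete = complete
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def __call__(self, prompt: str) -> str:
        query = self.embed(prompt)
        best_score = 0.0
        best_response: Optional[str] = None
        # Linear scan for the most similar cached prompt.
        for cached_embedding, cached_response in self.entries:
            score = cosine(query, cached_embedding)
            if score > best_score:
                best_score, best_response = score, cached_response
        if best_response is not None and best_score >= self.threshold:
            return best_response  # cache hit: the provider is not called
        response = self.complete(prompt)  # cache miss: forward to provider
        self.entries.append((query, response))
        return response
```

Wrapping happens once (`cached = SemanticCache(call_provider)`), after which `cached(prompt)` is used in place of the original call. A production version would likely replace the linear scan with an indexed vector lookup and add eviction.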

Outcomes
  • Sub-10ms lookup overhead on CPU.
  • In-process architecture with no external orchestration layer.
  • 40–70% savings potential for repetitive support/FAQ workloads.
Stack
Python, ONNX Runtime, FastEmbed, Redis, Prometheus
Figure: Recallm semantic cache decision workflow (intercept → similarity check → cache hit/miss routing), from the public Recallm architecture.
Context

Research and support workflows require cited answers from long PDF corpora without context switching to browser-based assistants.

Method

Electron + React shell with retrieval pipelines over chunked documents, multi-LLM compatibility, and local-first interaction patterns.
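
As a rough illustration of citation-oriented retrieval over chunked documents, the sketch below splits page text into overlapping word windows, keeps the page number with each chunk, and returns the best-matching chunks with a citation string. It rests on assumptions: the page text is already extracted from the PDF, all names (`Chunk`, `chunk_pages`, `retrieve`, `cite`) are hypothetical, and the term-overlap scoring is a toy substitute for the embedding search a vector store such as ChromaDB would provide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    source: str  # document name, used in the citation
    page: int    # 1-based page number the chunk came from
    text: str


def chunk_pages(source: str, pages: List[str], size: int = 40, overlap: int = 10) -> List[Chunk]:
    """Split each page into overlapping word windows, remembering the page."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks: List[Chunk] = []
    for page_no, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), step):
            chunks.append(Chunk(source, page_no, " ".join(words[start:start + size])))
            if start + size >= len(words):
                break  # this window already reached the end of the page
    return chunks


def retrieve(question: str, chunks: List[Chunk], k: int = 2) -> List[Chunk]:
    """Rank chunks by shared terms with the question (toy lexical scoring)."""
    terms = set(question.lower().split())
    scored = [(len(terms & set(c.text.lower().split())), c) for c in chunks]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def cite(chunk: Chunk) -> str:
    """Format a citation that can be attached to a generated answer."""
    return f"[{chunk.source}, p. {chunk.page}]"
```

The retrieved chunks would then be passed to the LLM as context, with `cite(chunk)` appended to each passage so the answer can point back to a page.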

Outcomes
  • Cross-platform desktop distribution with installable releases.
  • Citation-oriented answers over user-provided PDFs.
  • Designed for iterative reading and follow-up questioning.
Stack
Electron, React, TypeScript, LangChain, ChromaDB
About
[ 03 ]

About & Approach

I build AI systems that are genuinely useful — not demos. My work sits at the intersection of machine learning infrastructure and full-stack engineering, with a particular obsession for making LLMs faster, cheaper, and more reliable in production.

At Endshift I shipped a RAG pipeline handling real traffic (500+ queries daily), optimizing Mistral-7B inference with semantic caching to cut latency and cost. On the side, I built Recallm, an open-source Python library for LLM semantic caching.

I'm wrapping up my Computer Science degree at UCP in Lahore and actively looking for AI engineering roles — especially involving LLM evaluation, RLHF, or production-scale inference systems.

Skills & Technologies

AI / ML

  • PyTorch
  • Hugging Face
  • LangChain
  • LlamaIndex
  • FastEmbed + ONNX
  • ChromaDB / FAISS
  • LoRA / QLoRA

Backend Systems

  • .NET (C#)
  • Node.js / Express
  • FastAPI / Flask
  • REST + GraphQL APIs
  • WebSockets
  • PostgreSQL + Redis
  • Okta / Auth0 / JWT

Frontend & Apps

  • React
  • Next.js
  • TypeScript
  • Electron
  • Tailwind CSS
  • Jest

Cloud & DevOps

  • AWS (EC2, S3, Lambda, ECS)
  • Docker
  • Kubernetes
  • GitHub Actions
  • Prometheus + Grafana
  • RabbitMQ / Celery
Experience
[ 04 ]

Where I've Been

May 2025 – Oct 2025

ML Engineer

Endshift — Lahore, Pakistan

Architected production RAG pipeline (500+ queries daily, 98% uptime). Reduced Mistral-7B inference latency by 65% through batching, quantization, and optimization. Built .NET backend with Okta SSO, deployed on AWS EC2 with Docker.

2022 – Present

Open Source Author

Recallm — recallm.dev

Designed and published Python semantic cache library for LLMs with embedding-based similarity matching. Production-focused architecture with Redis/in-memory storage, Prometheus metrics, and async support.

2022 – July 2026

BSc Computer Science

University of Central Punjab — Lahore

Final semester. Focus areas include machine learning, distributed systems, and software engineering.

Contact
[ 05 ]

Let's build
something real.

Open to formal correspondence regarding AI engineering opportunities, infrastructure collaboration, or product research.
Typical response window: < 24h.

GitHub: github.com/munimx
LinkedIn: linkedin.com/in/munimahmad
Location: Lahore, Pakistan
Ref: CV / Resume (PDF)