[ 01 ]

Full-Stack AI Engineer

Munim Ahmad

Building systems that reason, remember, and respond. Open-source author. Based in Lahore, Pakistan.

Specialized in RAG pipelines, LLM optimization, and production ML deployment.

Key Metric

Years Building

Shipping production AI systems end-to-end.

Selected Work

[ 02 ]

Selected Projects

Context

Exact-match caches miss paraphrased prompts, so repetitive intent still triggers full-price LLM calls even when user meaning is identical.

Method

Wrap completion calls once, compute local prompt embeddings with ONNX, then perform cosine similarity lookup before forwarding misses to the provider.
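
The method above can be sketched roughly as follows. This is not Recallm's actual API: `SemanticCache`, `toy_embed`, and `fake_llm` are hypothetical names invented for illustration, and the bag-of-words embedding is only a stand-in for the local ONNX embedding model, so it catches casing and punctuation variants but not true paraphrases. The 0.8 threshold is likewise an assumed value.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (assumption): L2-normalised bag of lowercase word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two already-normalised sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(key, 0.0) for key, value in a.items())


class SemanticCache:
    """Hypothetical wrapper: similarity lookup first, provider call only on a miss."""

    def __init__(
        self,
        llm_call: Callable[[str], str],
        embed: Callable[[str], Vector] = toy_embed,
        threshold: float = 0.8,  # assumed cut-off, not a documented default
    ) -> None:
        self.llm_call = llm_call
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        query = self.embed(prompt)
        best_score = 0.0
        best_answer: Optional[str] = None
        # Linear scan over stored embeddings; a real cache would use an index.
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            self.hits += 1
            return best_answer
        # Miss: forward to the provider and remember the result.
        self.misses += 1
        answer = self.llm_call(prompt)
        self.entries.append((query, answer))
        return answer


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Placeholder for a paid provider call; records each prompt it receives."""
    calls.append(prompt)
    return f"answer #{len(calls)}"


cache = SemanticCache(fake_llm)
first = cache.complete("How do I reset my password?")
second = cache.complete("how do I reset my password")   # served from the cache
third = cache.complete("What is your refund policy?")   # forwarded to fake_llm
```

In this run only two of the three prompts reach `fake_llm`; swapping `toy_embed` for a real sentence-embedding model is what would let paraphrased prompts hit as well.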

Outcomes

Sub-10ms lookup overhead on CPU.
In-process architecture with no external orchestration layer.
40–70% savings potential for repetitive support/FAQ workloads.

Stack

Python
ONNX Runtime
FastEmbed
Redis
Prometheus

Visit recallm.dev ↗
GitHub repository ↗

Figure: Recallm semantic cache decision workflow (intercept → similarity check → cache hit/miss routing), from the public Recallm architecture.

Context

Research and support workflows require cited answers from long PDF corpora without context switching to browser-based assistants.

Method

Electron + React shell with retrieval pipelines over chunked documents, multi-LLM compatibility, and local-first interaction patterns.
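
As a rough illustration of retrieval over chunked documents with citations, the sketch below splits page text into overlapping word windows, ranks chunks against a question, and formats a page-level citation. It is written in Python with the standard library only and is an assumption throughout: the app itself is built on Electron, LangChain, and ChromaDB, and the names `Chunk`, `chunk_pages`, `retrieve`, and `cite`, the window sizes, and the word-overlap scoring are all hypothetical stand-ins for a real embedding-based vector store.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    source: str  # file name the text came from
    page: int    # 1-based page number, kept so answers can be cited
    text: str


def chunk_pages(source: str, pages: List[str], size: int = 40, overlap: int = 10) -> List[Chunk]:
    """Split each page into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks: List[Chunk] = []
    for page_no, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, max(len(words), 1), step):
            window = words[start:start + size]
            if window:
                chunks.append(Chunk(source, page_no, " ".join(window)))
            if start + size >= len(words):
                break  # this window already reached the end of the page
    return chunks


def _bag(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (stand-in for embedding similarity)."""
    va, vb = _bag(a), _bag(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[Chunk], k: int = 2) -> List[Chunk]:
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: similarity(query, c.text), reverse=True)[:k]


def cite(chunk: Chunk) -> str:
    """Format a page-level citation for a retrieved chunk."""
    return f"[{chunk.source}, p. {chunk.page}]"


pages = [
    "Semantic caching stores embeddings of earlier prompts and reuses answers.",
    "Quantization reduces model weights to lower precision to speed up inference.",
]
chunks = chunk_pages("notes.pdf", pages)
top = retrieve("How does quantization speed up inference?", chunks, k=1)
citation = cite(top[0])
```

Keeping the page number on every chunk is the design choice that makes citation possible: whatever the retriever returns can be traced back to a location in the user's PDF.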

Outcomes

Cross-platform desktop distribution with installable releases.
Citation-oriented answers over user-provided PDFs.
Designed for iterative reading and follow-up questioning.

Stack

Electron
React
TypeScript
LangChain
ChromaDB

GitHub repository ↗
Release downloads ↗

About

[ 03 ]

About & Approach

I build AI systems that are genuinely useful — not demos. My work sits at the intersection of machine learning infrastructure and full-stack engineering, with a particular obsession for making LLMs faster, cheaper, and more reliable in production.

At Endshift I shipped a RAG pipeline handling real traffic — optimizing Mistral-7B inference with semantic caching, cutting latency and cost. On the side, I built Recallm, an open-source Python library for LLM semantic caching.

I'm wrapping up my Computer Science degree at UCP in Lahore and actively looking for AI engineering roles — especially involving LLM evaluation, RLHF, or production-scale inference systems.

Skills & Technologies

AI / ML

PyTorch
Hugging Face
LangChain
LlamaIndex
FastEmbed + ONNX
ChromaDB / FAISS
LoRA / QLoRA

Backend Systems

.NET (C#)
Node.js / Express
FastAPI / Flask
REST + GraphQL APIs
WebSockets
PostgreSQL + Redis
Okta / Auth0 / JWT

Frontend & Apps

React
Next.js
TypeScript
Electron
Tailwind CSS
Jest

Cloud & DevOps

AWS (EC2, S3, Lambda, ECS)
Docker
Kubernetes
GitHub Actions
Prometheus + Grafana
RabbitMQ / Celery

Experience

[ 04 ]

Where I've Been

May 2025 – Oct 2025

ML Engineer

Endshift — Lahore, Pakistan

Architected production RAG pipeline (500+ queries daily, 98% uptime). Reduced Mistral-7B inference latency by 65% through batching, quantization, and optimization. Built .NET backend with Okta SSO, deployed on AWS EC2 with Docker.

2022 – Present

Open Source Author

Recallm — recallm.dev

Designed and published Python semantic cache library for LLMs with embedding-based similarity matching. Production-focused architecture with Redis/in-memory storage, Prometheus metrics, and async support.

2022 – July 2026