Token-Efficient Question Answering with Adaptive RAG via Uncertainty Estimation
A novel RAG pipeline that selectively triggers retrieval based on an LLM's uncertainty estimation, reducing token usage while preserving accuracy. An inference-time controller computes entropy-based thresholds to decide when external context is needed and when parametric knowledge suffices. Benchmarked on SQuAD, Natural Questions, and TriviaQA against standard RAG baselines.
Open paper →

Probing and Lightweight Fine-Tuning of DNA Foundation Models
Evaluates pretrained DNA foundation models (DNABERT-2, Nucleotide Transformer, HyenaDNA) on three genomic classification tasks: human non-TATA promoters, human enhancers (Cohn), and Drosophila enhancers (Stark). Compares probing against fine-tuning strategies (last-4-layer unfreeze, progressive unfreeze, LoRA) across three pooling functions.
Open paper →

A Duolingo-style app that teaches industry-specific language through interactive learning: Wordle-style games, news feeds from the GNews API, and chat-style flashcards, backed by 1,000+ jargon terms. 100+ users, actively beta-testing with 30 healthcare professionals in Ethiopia.
Visit trybizlang.com →

- Automated story and regression tests in Java with Selenium across ServiceNow's internal lists, forms, and streams. Merged 5 PRs.
- Engineered an AI-driven extension in TypeScript + Node using the GitHub and OpenAI APIs that auto-fills build requests and summarizes PRs — streamlining the merge process.
- Deployed a Mattermost bot for live defect and story updates, built on JavaScript, JSON, and a Jenkins pipeline.
- Engineered scalable infrastructure for evaluating agentic router prompts across multiple LLMs (GPT-4o, Claude, Gemini, LLaMA) — high-throughput prompt tuning and reproducible benchmarking.
- Built an evaluation pipeline in Python + PyTorch measuring routing accuracy, tool-invocation success, and task-completion rates across tens of thousands of model outputs.
- Designed AutoChat and Judge LLM systems to simulate and evaluate full multi-turn conversations for automated conversational QA.
- Collaborated with linguists to deploy labeled datasets into production workflows; merged 6 PRs and surfaced insights to agentic-adjacent teams.
- Use the pre-trained CLIP vision-language model to generate image embeddings from the MuFaSAA dataset (165 robots, 1,200+ entries) and identify algorithms that reduce computational time by 87%.
- Tune SVR and transformer hyperparameters with Optuna to predict social and functional expectations of robots based on design features and metaphors, reaching R² > 0.7 on warmth, discomfort, and competency.
- Collaborate with a PhD mentor on Vision-Language Models that predict explanatory metaphors for robot designs — aimed at enhancing user understanding and engagement with robotic technologies.