Applied Model Scientist - Specialist LLMs, SLMs and Domain Evaluation

Celebal Technologies · Pune City, Maharashtra, India

Full-time · Senior · Posted 8 days ago

Hiring Profile
Applied Model Scientist
Specialist LLMs, SLMs and Domain Evaluation
A practical hiring specification for building hands-on capability in fine-tuning, base model selection, domain adaptation, specialist model evaluation, and enterprise deployment.

Executive Summary
The team does not need a generic GenAI engineer or prompt engineer. It needs a hands-on Applied Model Scientist who can work close to the model layer: selecting base models, fine-tuning or adapting them, designing domain-specific evaluations, and proving when a smaller or specialist model outperforms, matches, or complements a large general-purpose model.
Energy domain experience can be useful context, but it should not be the core hiring constraint. The stronger requirement is that the candidate understands model internals, model repositories, fine-tuning methods, specialist evaluation design, inference trade-offs, and deployment across cloud, local, and on-prem environments.
The ideal person should be able to move from research paper to working prototype to benchmark report. They should know how to test models rigorously, explain failure modes, write model cards, and advise when to use RAG, fine-tuning, small language models, large models, graph models, time-series models, or physics-informed models.

We are hiring a hands-on Applied Model Scientist who can select, fine-tune, evaluate, and deploy specialist LLMs and SLMs for narrow domain tasks, and prove through rigorous benchmarks when a smaller or adapted model wins on accuracy, latency, cost, privacy, or deployment fit.

What This Person Should Actually Do
Responsibility area: What they need to do
Base model selection: Compare Phi, Llama, Mistral, Qwen, Gemma, DeepSeek, domain models, embedding models, multimodal models, and time-series models for specific tasks.
Fine-tuning: Know when to use supervised fine-tuning, LoRA, QLoRA, DPO or other preference tuning, continued pretraining, distillation, adapters, or simply RAG.
Domain adaptation: Convert raw domain data into training sets, instruction datasets, evaluation datasets, tool-use traces, and synthetic examples.
Evaluation science: Build domain-specific evals that prove whether a small or specialist model beats a large generic model.
Model internals: Understand tokenization, context length, attention, quantization, embeddings, adapters, decoding, hallucination behavior, failure modes, and inference constraints.
Model repositories: Be comfortable with Hugging Face, Ollama, LM Studio, Azure AI Foundry, AWS Bedrock, SageMaker, NVIDIA NIM, vLLM, TensorRT-LLM, llama.cpp, and on-prem deployment.
Hands-on experiments: Run controlled bake-offs, for example a GPT-4-class model versus Llama 8B versus Phi versus Qwen versus a domain-tuned model.
Specialist model thinking: Understand that energy and industry models may be language models, graph models, time-series models, physics models, or multimodal models, not only chatbots.
Model cards and publication quality: Produce publishable model cards, eval reports, benchmark methodology, ablation notes, and reproducible repos.
Enterprise deployment: Know how to deploy securely with cost, latency, privacy, GPU/CPU footprint, monitoring, rollback, and governance in mind.

Must-Have Technical Skills
Skill area: Must-have depth
Transformer fundamentals: Understand attention, embeddings, tokenization, positional encoding, context windows, KV cache, decoding strategies, and limitations.
Fine-tuning: Hands-on experience with SFT, LoRA, QLoRA, PEFT, adapters, and DPO or other preference tuning.
Base model comparison: Experience comparing Llama, Mistral, Qwen, Phi, Gemma, DeepSeek, Mixtral, domain models, or similar open models.
Evaluation design: Ability to create task-specific benchmark sets, golden datasets, rubrics, human eval workflows, and automated LLM-as-judge pipelines.
Data preparation: Ability to create instruction datasets, synthetic datasets, negative examples, domain ontologies, and evaluation splits.
Inference stack: Experience with vLLM, llama.cpp, Ollama, LM Studio, TensorRT-LLM, Triton, NVIDIA NIM, or similar.
Cloud AI platforms: Practical exposure to Azure AI Foundry, AWS Bedrock or SageMaker, the NVIDIA stack, Databricks Mosaic AI, or equivalent.
Hugging Face ecosystem: Model cards, datasets, transformers, PEFT, accelerate, bitsandbytes, evaluate, TRL, safetensors.
MLOps / LLMOps: Experiment tracking, model registry, versioning, deployment, monitoring, rollback, cost and latency profiling.
Python and PyTorch: Strong hands-on coding ability. Should be able to train, fine-tune, evaluate, and deploy models, not just call APIs.

Strongly Preferred Skills
Skill: Why it matters
Published models or datasets on Hugging Face: Shows real hands-on credibility.
Papers, blogs, or technical reports on domain-specific evals: Shows they can prove model performance, not just claim it.
Experience with synthetic data generation: Critical for specialist models where labeled data is limited.
Experience with small models: Important for SLM economics, edge, on-prem, and agent workers.
Quantization experience: Needed for smaller form factors: 4-bit, 8-bit, AWQ, GPTQ, GGUF.
Multimodal exposure: Industry models often involve documents, diagrams, images, sensor data, maps, and time series.
Time-series or graph ML exposure: Highly relevant for energy, supply chain, grid, manufacturing, and industrial use cases.
Domain evaluation experience: Especially healthcare, finance, energy, legal, manufacturing, agriculture, or scientific AI.
Agentic systems experience: Tool use, routing, planner-worker architectures, model cascades, and agent traces.