Antonio Lobo-Santos | Research Engineer

Research & Engineering Focus

LLM Evaluation & AI Safety

Designing evaluation pipelines for LLM robustness, moral-capability analysis and failure-mode discovery. Experience with synthetic data generation, custom metrics, adversarial prompting, value-alignment tests and computational experiment execution on shared infrastructure.

Efficient LLM Serving

Building low-latency LLM serving pipelines with vLLM, quantisation and open-weight models for multi-user research environments and real-time HRI experiments.

Robotics & Human-Robot Interaction

Integrating LLM-side interaction modules into ROS2-based socially assistive robotics systems, with structured dialogue, ASR/TTS coordination and real-time constraints for lab-based HRI studies.

ML Systems & HPC

Training and evaluating transformer-based and BERT-style models using shared HPC, Slurm/HTCondor workflows and multi-GPU servers, with reproducible experiment configuration and analysis pipelines.

Reinforcement Learning & Optimisation

Research-level exposure to reinforcement learning, including from-scratch PPO and no-critic GRPO implementations for continuous-control experiments, plus coursework honours in Reinforcement Learning.

Production Software & Data Platforms

Designing production Python/TypeScript microservices, APIs, CI/CD pipelines, Kubernetes/MicroK8s deployments and Kafka/NiFi data-processing systems for real-time monitoring platforms.

Professional Profile

I am a research engineer and ML systems builder working across LLM evaluation, efficient model serving, AI safety, robotics/HRI and production data platforms. I enjoy problems where scientific uncertainty meets engineering constraints: systems that need rigorous modelling, careful evaluation and reliable software.

At CSIC-IIIA, I design evaluation workflows for moral capabilities, robustness and model behaviour in text classifiers and LLMs. My work includes synthetic data generation, custom metric design, BERT-style classifier fine-tuning, open-weight LLM evaluation and computational experiments on shared HPC and multi-GPU infrastructure.

I also build efficient LLM-serving pipelines for real-time human-robot interaction, using vLLM, quantisation and structured dialogue components integrated with ROS2-based robotic systems. The goal is practical AI behaviour under real interaction constraints, not just offline benchmark performance.

Before AI research, I worked full-time as a software engineer at Axión, building production microservices, on-prem Kubernetes/MicroK8s infrastructure, GitLab CI/CD, Kafka/NiFi ETL pipelines and real-time monitoring systems for public-service and private infrastructure.

At Oxford, my current work focuses on game-theoretic results for learning stability and equilibrium in multi-agent systems, with applications to multi-agent coordination and learning dynamics.

LLM Evaluation · vLLM Serving · ROS2 HRI · HPC Experiments · Kubernetes/Data Platforms

Selected Systems Built

System 01

Real-time LLM serving for human-robot interaction

Built the LLM-side serving and integration stack for ROS2-based HRI lab studies: vLLM serving, quantised open-weight models, structured dialogue components, ASR/TTS coordination and real-time inference constraints.

vLLM · Quantisation · Open-weight LLMs · ROS2 · LangChain/LangGraph · Pydantic · ASR/TTS · Multi-GPU servers

System 02

Moral-capability and robustness evaluation pipeline

Designed evaluation workflows for moral-capability analysis in text classifiers and LLMs, grounded in moral psychology theories such as Moral Foundations Theory and Schwartz's theory of values. Built synthetic datasets, custom metrics and experiment pipelines for classifier and LLM evaluation.

Current research; manuscript in preparation.

Python · PyTorch · Transformers · BERT-style classifiers · LLM evaluation · Synthetic data · HTCondor · Shared HPC

System 03

Production monitoring and data-platform infrastructure

Main developer on production monitoring systems for public-service and private infrastructure. Built Python/FastAPI and TypeScript/NestJS microservices, APIs and real-time monitoring components; operated Kafka/NiFi ETL workflows; deployed GitLab CI/CD; and migrated web applications to an on-prem high-availability MicroK8s/Kubernetes platform.

Python · FastAPI · TypeScript · NestJS · Kafka · Apache NiFi · Elasticsearch/Kibana · WebSockets · GraphQL · Docker · MicroK8s · GitLab CI/CD

System 04

Deep open-set recognition framework

Co-designed and developed a modular PyTorch Lightning framework for open-set recognition experiments, with configurable backbones, OSR strategies, Hydra experiment configuration, standardised metrics and visualisation tools.

PyTorch Lightning · Hydra · ResNet · EfficientNet · ViT · DINOv2 · AUROC · OSCR · ROC/t-SNE visualisation

System 05

Flood decision-support full-stack system

Integrated the components of a flood decision-support project into a full-stack application combining data integration, ML models, APIs and interactive dashboards for flood risk analysis and decision support.

Python · FastAPI · Streamlit · ML pipelines · Data integration · Risk visualisation

System 06

ShadowPals - computer-vision interaction system

Developed the mathematical parametrisation layer for an interactive shadow-puppetry learning platform using computer vision, MediaPipe hand landmarks and real-time feedback.

Computer vision · MediaPipe · Parametric modelling · FastAPI · React · Docker

Research & Engineering Experience

Academic Visitor - Multi-Agent Systems & Game Theory

University of Oxford

March 2026 – July 2026 · Oxford, UK

Studying game-theoretic conditions for learning stability and equilibrium in multi-agent systems, with applications to multi-agent coordination and learning dynamics.

Research Intern → AI Research Engineer - LLMs, Robotics & Evaluation

CSIC-IIIA, Spanish National Research Council

Nov 2024 – Feb 2026 · Barcelona, Spain

Designed moral-capability evaluation workflows for text classifiers and LLMs, grounded in moral psychology theories including Moral Foundations Theory and Schwartz's theory of values.
Created synthetic datasets, custom metrics and experiment pipelines for robustness, value-alignment and behavioural failure-mode analysis.
Fine-tuned BERT-style classifiers with representation-learning techniques and custom losses for moral text classification.
Ran computationally intensive experiments on shared HPC, HTCondor and local multi-GPU servers for classifiers and open-weight LLMs.
Built efficient LLM-serving pipelines using vLLM, quantisation and recent open-weight models for real-time HRI lab studies.
Integrated LLM-side dialogue components into ROS2-based socially assistive robotics systems, coordinating with ASR/TTS and robotics modules.
Contributed to software development for ROS2 experimental workflows, including real-time interaction constraints and structured dialogue interfaces.

Lecturer in Probabilistic Graphical Models

Pompeu Fabra University

Jan 2025 – April 2025 · Barcelona, Spain

Designed and delivered lectures on probabilistic graphical models for AI, covering theoretical foundations and practical applications, with emphasis on efficient Bayesian inference in structured probabilistic models.

Undergraduate Researcher - LLMs & Mathematical Knowledge Graphs

University of Seville

Sept 2023 – July 2024 · Seville, Spain

Built and evaluated a LangChain/RAG pipeline for extracting mathematical knowledge from LaTeX sources into OWL knowledge graphs. The work became my bachelor's thesis, was selected as a university finalist and later resulted in a peer-reviewed journal publication.

Software Engineer - Platform, Microservices & Real-Time Monitoring

Axión

July 2021 – July 2023 · Seville, Spain

Full-time while completing dual B.Sc. degrees in Computer Engineering and Mathematics.

Built and maintained Python/FastAPI and TypeScript/NestJS microservices and APIs for production monitoring platforms.
Deployed and operated an on-prem high-availability MicroK8s/Kubernetes platform, migrating company web applications to the new deployment model with GitLab CI/CD.
Operated Apache Kafka and Apache NiFi ETL workflows for real-time data-processing and monitoring systems.
Developed real-time monitoring features using WebSockets, GraphQL, Elasticsearch/Kibana and selected AngularJS integrations.
Worked on production deployments for public-service and private infrastructure monitoring, including transport, port, public-space and communications-related systems.
Delivered training sessions to technical and non-technical public-administration users, translating operator feedback into implementation priorities.
Coordinated a small engineering team in the second year, bridging communication between management and junior developers, reviewing work, assigning tasks, shaping architecture and mentoring two interns.

Selected Projects

MariChatmen - written Andalûh language model

Experimental Qwen3.5 adaptation for written Andalusian Spanish, covering tokeniser expansion, Andalûh adaptive pretraining, SFT, ORPO, persona tuning and release-gate evaluation.

LLM fine-tuning · Dialectal AI · Hugging Face

Moral-capability evaluation for LLMs and classifiers

Synthetic data, custom metrics, moral psychology, BERT-style classifiers, LLM evaluation, HPC/HTCondor and multi-GPU experiments.

Current research · AI safety evaluation

Real-time LLM serving for HRI

vLLM, quantised open-weight models, ROS2, structured dialogue, ASR/TTS coordination and real-time lab-study constraints.

vLLM · ROS2 · HRI

Deep Open-Set Recognition framework

PyTorch Lightning/Hydra framework with ResNet, EfficientNet, ViT and DINOv2 backbones, OSR metrics and visualisation.

PPO/GRPO continuous-control RL framework

From-scratch course project implementing PPO and no-critic GRPO variants, configurable experiments, logging, checkpoints and performance comparison.

Flood-IDSS

Full-stack flood decision-support application integrating data sources, ML models, backend APIs and interactive dashboards.

Mathematical Knowledge Graphs with LLMs

LLM/RAG pipeline for extracting mathematical knowledge from LaTeX into OWL knowledge graphs; resulted in peer-reviewed publication.

AB Data Challenge - Finalist

Transformer-based time-series anomaly detection for utility data; designed training code and deployed experiments on distributed hardware.

Time series · Transformers · Finalist

ShadowPals

Computer-vision learning platform using MediaPipe landmarks and mathematical hand-pose parametrisation.

Earlier Projects & Awards

App Inventor Against Cyberbullying

Mobile application recognised as MIT App Inventor "App of the Month" for community cyberbullying awareness.

Smarthuerto

Smart agriculture app that won the EC2CE Grow-Lab contest and was featured by MIT App Inventor.

Publications & Talks

Publications

ACM/IEEE HRI 2026 Demo

EMY: Supporting Autism Therapy with a Socially Assistive Robot

Sara Cooper, Antonio Lobo-Santos, et al.

HRI '26: 21st ACM/IEEE International Conference on Human-Robot Interaction

Contributed to LLM-supported interaction and system integration for a socially assistive robotics system.

Modelling, 2025

Enhancing Mathematical Knowledge Graphs with Large Language Models

A. Lobo-Santos, J. Borrego-Díaz

Peer-reviewed journal article derived from my bachelor's thesis

Built and evaluated an LLM-assisted pipeline for extracting mathematical knowledge from LaTeX into queryable knowledge graphs.

ECAI 2025 Workshop

Trustworthy AI Through Dual-Role Reasoning

Workshop contribution

Safety-oriented reasoning and evaluation work for LLM systems

Explored dual-role reasoning as a safety-oriented approach to evaluating and improving reasoning behaviour in LLM systems.

Current Research

Moral-capability evaluation for classifiers and LLMs

Designing evaluation workflows grounded in moral psychology, with synthetic data generation, custom metrics and HPC/multi-GPU experiments.

NLP pragmatics and formal grammars

Researching NLP problems involving pragmatics and classical context-free grammar formalisms.

Invited Talks

October 2025 · Madrid, Spain

Invited Speaker: LLM-Tools for Coding

La Moncloa (Spanish Government)

Delivered an invited seminar on the effective and responsible use of LLM tools for coding, focused on best practices, capabilities and safety considerations for software engineering workflows.

November 2025 · Barcelona, Spain

Seminar: LLM-Tools for Coding

CSIC-IIIA

Conducted a technical seminar for researchers at IIIA on advanced usage of LLM tools for coding, with emphasis on prompting strategies, assistant workflows and safe implementation practices.

Earlier Publications

Sept 2023

Building the "mapamático" (map-matic)

Antonio Lobo-Santos, Pablo Martín Berná, Juan Núñez Valdés

Epsilon Journal of Mathematics Education, Number 114

Mar 2023

Infinity: Some Curiosities and its Teaching in High School

Antonio Lobo-Santos, Pablo Martín Berná, Juan Núñez Valdés

Números Journal, Volume 113

Education & Training

M.Sc. in Artificial Intelligence

UPC–UB–URV

2024 – June 2026 · Barcelona, Spain

Focus: reinforcement learning, deep learning, statistical modelling, ML systems and AI research. Honours in Machine Learning, Deep Learning, Reinforcement Learning and Complex Networks. Selected projects include from-scratch PPO/GRPO implementations, open-set recognition, flood decision support and computer-vision interaction systems.

B.Sc. in Mathematics

University of Seville

2019 – 2024 · Seville, Spain

Strong training in probability theory, statistics, optimisation, mathematical modelling, and rigorous analysis. Developed a deep theoretical foundation for advanced machine learning and data science.

B.Sc. in Computer Engineering

University of Seville

2019 – 2024 · Seville, Spain

Graduated with 9 honour distinctions, including Algorithms and Complexity, Computer Architecture, Computer Networks, and Intelligent Systems.

Bachelor's Thesis: Large Language Models and Knowledge Graphs to Model Mathematical Knowledge

Developed and evaluated an LLM-based pipeline for extracting mathematical knowledge from LaTeX into a structured knowledge graph.

Certifications

AI Alignment Course

BlueDot Impact • November 2024

Technical Skills

Core areas

LLM evaluation · AI safety evaluation · Efficient LLM serving · Robotics/HRI · ML systems · Reinforcement learning · Production backend engineering · Data platforms

ML / AI

PyTorch · Transformers · BERT-style classifiers · Representation learning · RAG · LangChain · LangGraph · DSPy · Model evaluation · Synthetic data · Custom metrics

LLM serving / inference

vLLM · Quantisation · Open-weight LLMs · Batching · GPU server deployment · Structured outputs · Pydantic schemas

RL / optimisation

PPO · GRPO · SAC · TD3 · Reward modelling · Continuous-control experiments · Multi-agent learning dynamics

Robotics / HRI

ROS2 · Socially assistive robots · Real-time interaction pipelines · ASR/TTS integration · Structured dialogue systems

Infrastructure / systems

Docker · Kubernetes · MicroK8s · GitLab CI/CD · Linux · Slurm · HTCondor · Multi-GPU servers · Observability

Backend / data

Python · FastAPI · TypeScript · NestJS · GraphQL · WebSockets · PostgreSQL · ArangoDB · Kafka · Apache NiFi · Elasticsearch/Kibana · RabbitMQ

Languages

Working languages for research, engineering and collaboration.
