Hadi Abdine

42 rue Notre Dame des Victoires, 75002 Paris, France
hadi[underscore]abdine[at]outlook[dot]com

I am Hadi Abdine, an engineer at IFM Paris (MBZUAI Institute of Foundation Models). I hold a Ph.D. in computer science, data and AI from Institut Polytechnique de Paris. I carried out my doctoral studies at LIX-École Polytechnique with the DaSciM team under the supervision of Prof. Michalis Vazirgiannis. My current research focuses on natural language processing, pretrained language models and their applications. Before joining LIX as a Ph.D. candidate, I graduated with a Master's degree in data science from Institut Polytechnique de Paris, an engineering degree in data science from Telecom Paris and an engineering degree in computer science and telecommunication from the Lebanese University, Faculty of Engineering 1.

Dedicated to the field of Natural Language Processing (NLP) and the advancements facilitated by large language models, I am deeply passionate about the intersection of technology and linguistics. My research focuses on diverse NLP applications using transformer-based language models and LLMs. This involves semantic, political, legal and bioinformatics applications (e.g. generating free-text descriptions of protein function from 3D structures and amino acid sequences).

News

September 2025: Our paper Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment has been accepted at NeurIPS 2025!
September 2025: Our paper Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts has been accepted at ArabicNLP 2025!
July 2025: New preprint: Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts.
June 2025: New preprint: Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning.
May 2025: New preprint: Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment.
May 2025: New models: we release Nile-Chat, a family of open instruction-tuned models for Egyptian dialect, developed to handle both scripts commonly used in Egypt: Arabic script and Latin-based Arabizi.
January 2025: Our paper Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect received the Best Paper Award at LoResLM@COLING 2025!
October 2024: New preprint: Graph Linearization Methods for Reasoning on Graphs with Large Language Models.
September 2024: New preprint: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect. Atlas-Chat models and data are available on HuggingFace!
September 2024: Prot2Text is now available on HuggingFace!
April 2024: I have successfully defended my Ph.D. thesis!
March 2024: New preprint: Neural Graph Generator: Feature-Conditioned Graph Generation using Latent Diffusion Models.
February 2024: Our paper GreekBART: The First Pretrained Greek Sequence-to-Sequence Model has been accepted at LREC-COLING 2024.
January 2024: Prot2Text models, datasets and code are publicly available on GitHub!
December 2023: Our paper Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers has been accepted at AAAI 2024!
December 2023: Our paper Word sense induction with agglomerative clustering and mutual information maximization has been published in AI Open Journal!
October 2023: Our paper Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers has been accepted as a spotlight at DGM4H and AI4Science at NeurIPS 2023!
May 2022: Our paper, Evaluation of Word Embeddings from Large-Scale French Web Content, is accepted to appear at CNIA 2022, Saint-Etienne, France.
May 2022: Our paper, Political Communities on Twitter: Case Study for the 2022 French Presidential Election, is accepted to appear at PoliticalNLP 2022, Marseille, France.
April 2022: I was invited to give a keynote talk entitled “JuriBERT: A Language Model Adaptation for French Legal Text” at the e-Juris workshop at the 'palais des juridictions administratives lyonnaises', organised by Isabelle SAYN. [Slides]
October 2021: JuriBERT models are now available to download on our web app.
September 2021: Our paper, JuriBERT: A Masked-Language Model Adaptation for French Legal Text, is accepted to appear at NLLP 2021, Punta Cana, Dominican Republic.
January 2021: I developed the web app nlp.polytechnique.fr to showcase the NLP work of the DaSciM team, including BARThez, BERTweetFR and the French Word2Vec.

Experience

MBZUAI Institute of Foundation Models - Paris

Natural Language Processing Researcher / Engineer

As an AI/NLP researcher, I am primarily responsible for the training and fine-tuning of large and small language models (LLMs), with a particular focus on Arabic dialects. My work involves designing and implementing data collection pipelines, curating high-quality datasets for both pretraining and instruction fine-tuning, and developing robust evaluation frameworks to assess model performance across a range of NLP tasks such as summarization, translation, and sentiment analysis. I led the development of dialect-specific models including Nile-Chat for Egyptian Arabic and Atlas-Chat for Moroccan Darija, optimizing them for dialogue and generative capabilities. My role also includes experimenting with and applying advanced techniques such as LoRA, Direct Preference Optimization (DPO), and multilingual alignment to enhance the performance and adaptability of these models.

June 2024 - Now

École Polytechnique

Teaching

ALTEGRAD: practical NLP lab sessions in the Data Science Master's program at École Polytechnique and in the Master M2 MVA, covering advanced NLP and deep learning models from RNNs and 1D-CNNs to transformers, using PyTorch and Keras. (2020 to 2024)
Introduction to Text Mining and NLP (INF582): practical NLP lab sessions in the École Polytechnique Engineering Program, covering text mining basics and deep learning in NLP. (2021 to 2025)
Advanced Deep Learning (INF581A): practical lab sessions in the École Polytechnique Engineering Program, covering transformers, transfer learning and large language models (LLMs). (2024)
Text Mining and Deep Learning for NLP: practical lab sessions in the AAI and DSSP programs of École Polytechnique Executive Education. (2020 to 2025)
Data Mining: Deep Learning for NLP practical lab sessions in the SPEIT program, Shanghai Jiao Tong University. (18 March 2022)

LIX - École Polytechnique

Natural Language Processing Researcher / Engineer

Distributed word representations are widely used in natural language processing to achieve high performance across many tasks. In this project, we crawled a huge French corpus and used it to train static French word embeddings (Word2Vec). These embeddings achieved the highest performance on natural language understanding tasks among all static French word embeddings. This work was published at CNIA 2022 [PDF]. All the resources and code are published here.

May 2019 - November 2019

AZM center for Biomedical Research

Biomedical Engineering Intern

The main objective of this internship was to design and develop ECG monitoring software on a Raspberry Pi 3.

July 2018 - September 2018

CodenDot

Android Application Development Intern

The main objective of this internship was to develop a drawing library, a face detection tool, and the FaceVerter tool for the social app “Docomix” in Java.

July 2017 - September 2017

Education

École Polytechnique

Ph.D. in computer science

The era of transformer-based language models has ushered in a new paradigm in Natural Language Processing (NLP), enabling remarkable performance across a wide range of tasks in both Natural Language Understanding (NLU) and Natural Language Generation (NLG). This dissertation delves into the transformative potential of transformer-based language models when applied to specialized domains and languages. It comprises four distinct research endeavors, each contributing to the overarching goal of enhancing language understanding and generation in specialized contexts.

December 2020 - April 2024

Institut Polytechnique de Paris

Master M2 in Data Science

September 2019 - November 2020

Telecom Paris

Engineering Degree in Data Science

Eiffel Excellence Scholarship (08/2018 – 08/2020)

August 2018 - October 2020

Lebanese University, Faculty of Engineering

Engineering Degree in Computer Science and Telecommunication

September 2014 - July 2018

Publications

Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing. 2025. Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts. arXiv preprint. [PDF][Models][Demo]
Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, Michalis Vazirgiannis. 2025. Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning. arXiv preprint. [PDF]
Xiao Fei, Michail Chatzianastasis, Sarah Almeida Carneiro, Hadi Abdine, Lawrence P. Petalidis, Michalis Vazirgiannis. 2025. Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment. arXiv preprint. [PDF]
Christos Xypolopoulos, Guokan Shang, Xiao Fei, Giannis Nikolentzos, Hadi Abdine, Iakovos Evdaimon, Michail Chatzianastasis, Giorgos Stamou, Michalis Vazirgiannis. 2024. Graph Linearization Methods for Reasoning on Graphs with Large Language Models. arXiv preprint. [PDF]
Iakovos Evdaimon, Giannis Nikolentzos, Christos Xypolopoulos, Ahmed Kammoun, Michail Chatzianastasis, Hadi Abdine, Michalis Vazirgiannis. 2024. Neural Graph Generator: Feature-Conditioned Graph Generation using Latent Diffusion Models. arXiv preprint. [PDF]
Guokan Shang, Hadi Abdine, Yousef Khoubrane, Amr Mohamed, Yassine Abbahaddou, Sofiane Ennadir, Imane Momayiz, Xuguang Ren, Eric Moulines, Preslav Nakov, Michalis Vazirgiannis, Eric Xing. 2024. Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect. Published at LoResLM@COLING 2025. [PDF][Models][Demo]
Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, Michalis Vazirgiannis. 2023. Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers. Published at AAAI 2024, Spotlight at DGM4H NeurIPS 2023 and AI4Science NeurIPS 2023. [PDF][Code & Dataset][Demo]
Hadi Abdine, Moussa Kamal Eddine, Davide Buscaldi, Michalis Vazirgiannis. 2023. Word sense induction with agglomerative clustering and mutual information maximization. In AI Open, Volume 4, Pages 193-201.[PDF][Code]
Iakovos Evdaimon, Hadi Abdine, Christos Xypolopoulos, Stamatis Outsios, Michalis Vazirgiannis, Giorgos Stamou. 2023. GreekBART: The First Pretrained Greek Sequence-to-Sequence Model. Published at LREC-COLING 2024. [PDF][Code][Dataset]
Hadi Abdine, Christos Xypolopoulos, Moussa Kamal Eddine, Michalis Vazirgiannis. 2022. Evaluation of Word Embeddings from Large-Scale French Web Content. In Conférence Nationale en Intelligence Artificielle 2022, Saint-Etienne, France. [PDF][Code]
Hadi Abdine, Yanzhu Guo, Virgile Renard, Michalis Vazirgiannis. 2022. Political Communities on Twitter: Case Study for the 2022 French Presidential Election. In Proceedings of the Political Natural Language Processing Workshop 2022, Marseille, France.[PDF][Slides]
Stella Douka, Hadi Abdine, Michalis Vazirgiannis, Rajaa El Hamdani, and David Restrepo Amariles. 2021. JuriBERT: A Masked-Language Model Adaptation for French Legal Text. In Proceedings of the Natural Legal Language Processing Workshop 2021, pages 95–101, Punta Cana, Dominican Republic. Association for Computational Linguistics.[PDF]

Skills

Programming Languages & Tools

Workflow

Training and evaluation of language models and deep learning models
Data crawling and pre-processing
Web Development