Project RBZ

Taha Tobaili
Knowledge Media Institute

What is Arabizi?

Arabizi is a blend of Araby and Englizi, the Arabic words for Arabic and English. It is an informal written language in which native Arabic speakers transcribe their dialectal mother tongue using Latin letters and numerals instead of Arabic script. For example, the expression يلا حبيبي, which means come on, darling, is written yalla 7abibi.

What makes it a challenge for NLP?

Processing Arabizi is considered a complex task because:

  1. Like Arabic, it is morphologically rich: some words may have around 100 inflections, such as the sentiment word 7ob, meaning love (Lebanese Arabizi variants of the Arabic word love).
  2. The way it is written is specific to each region of the Arab world, because each region has its own spoken Arabic dialect.
  3. It lacks a unified orthography, so a single word may be spelled in many different ways.
  4. It is found in multilingual streams of text, and users tend to mix it with other languages: Hi, Kifak Cava? There you go, the famous trilingual Lebanese greeting!

Different Arabic Dialects for the words: Smart and Dumb.

Standard Arabic | Egyptian Arabizi       | Levant Arabizi         | North African Arabizi
Thaki (Smart)   | lamma7, fahlawi, gamed | falteh, fo2is, 7arbou2 | kafiz, 5afif, saji
Ablah (Dumb)    | 3abit, daye3, bati5a   | mastoul, 5ales, ta2e2  | mklej, mjmek, 7abes

Non-unified Orthography

There is no right or wrong way of spelling Arabizi. As long as the message is conveyed, nobody cares how it is written:
Marhabtein (Greeting): Mar7abtein, Mar7abtayn, Marhabten, Mar7btain, Marhabtan, Mrhabtayn, etc...
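To make the spelling problem concrete, here is a minimal sketch that groups the variants above under one key. It is only an illustration, not a method from this project: the function names `skeleton` and `group_variants` are hypothetical, and the rule (fold 7 to h, then drop the Latin vowels and y) is a crude assumption that happens to work for this word and will over-merge or under-merge many others.

```python
from collections import defaultdict

# Assumption for this illustration only: treat 7 and h as the same sound.
DIGIT_FOLD = str.maketrans({"7": "h"})
# Assumption: vowels (and y) are the letters most often varied or omitted.
VOWELS = set("aeiouy")


def skeleton(word: str) -> str:
    """Return a crude consonant skeleton used as a grouping key."""
    folded = word.lower().translate(DIGIT_FOLD)
    return "".join(ch for ch in folded if ch not in VOWELS)


def group_variants(words):
    """Group spellings that share the same consonant skeleton."""
    groups = defaultdict(list)
    for word in words:
        groups[skeleton(word)].append(word)
    return dict(groups)


variants = ["Mar7abtein", "Mar7abtayn", "Marhabten",
            "Mar7btain", "Marhabtan", "Mrhabtayn"]
groups = group_variants(variants)
print(groups)  # all six spellings share the key "mrhbtn"
```

Folding 7 into h is convenient for grouping, but it throws away exactly the distinction that matters when transliterating to Arabic script.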

State of the Art

Researchers in Arabic NLP perceive Arabizi as a form of Arabic, so they try their utmost to transliterate Arabizi to Arabic script. In many cases this results in broken Arabic, because there is no orthographic consistency and different letter representations are used interchangeably.
7abibi (darling) could be written as 7abibi, habibi, or even hbb. Humans interpret all three easily, but a transliterator might confuse the 7 and the h, since they map to two distinct Arabic letters (ح and ه), producing حبيبي, هبيبي, or هبب.
Similarly, the k in kalb could be mapped to ك or ق, forming either dog (كلب) or heart (قلب). You don't want to transliterate 7abib kalbi to حبيب كلبي, the love of my dog... right?
Therefore, accurate transliteration requires lots of training and large parallel corpora.
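The ambiguity can be sketched in a few lines. The snippet below enumerates every Arabic spelling that a naive character-by-character mapping would allow. The table `CANDIDATES` and the function `transliterations` are hypothetical and deliberately tiny; they only cover the letters needed for the examples above and are not the mapping used by any real transliterator.

```python
from itertools import product

# Hypothetical mapping: each Latin symbol lists the Arabic letters it might
# stand for. An empty string means the symbol may be an unwritten short vowel.
CANDIDATES = {
    "7": ["ح"],
    "h": ["ه", "ح"],
    "k": ["ك", "ق"],
    "l": ["ل"],
    "b": ["ب"],
    "a": ["", "ا"],
    "i": ["", "ي"],
}


def transliterations(word: str) -> list[str]:
    """Enumerate all Arabic spellings allowed by the toy mapping."""
    # Symbols missing from the table are passed through unchanged.
    options = [CANDIDATES.get(ch, [ch]) for ch in word.lower()]
    return sorted({"".join(combo) for combo in product(*options)})


print(transliterations("kalb"))         # includes both كلب (dog) and قلب (heart)
print(len(transliterations("habibi")))  # candidates multiply with every ambiguous symbol
```

Even with seven symbols the candidate list grows multiplicatively, which is why choosing the right spelling needs context learned from data rather than a lookup table.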

There are online tools that transliterate Arabizi to Arabic (Yamli and Google Input Tools). However, they are designed to help people produce formal Arabic text using an English keyboard, so they offer a list of possible options for every input word. These tools are not designed to transliterate whole chunks of text, and automatic transliteration of whole text produces broken Arabic.

Samples of what a transliterated Arabizi text would look like:

Hahaha balad felten sar l byeswa wl ma byeswa hamel sleh w 3amel zalame ya 3ayb choum
هاهاها بلد فالتن سار بيسوا وال ما بيسوا حمل صله دبليو عامل ظلام يا عيب شم

Bystehal amir aktar min hek
بيستحل امر أكتر من هيك

Mnee7 ma 2alo 3meleh 3 tawouk 3a zaw2ak toum zyedeh...ta7eyeh min el aleb lal aleb lal khota el amneyeh
منيح ما قالوا عمله 3 توك عا ذوقك تم زيده...تحيه من ال الاب لل الاب لل خطة ال امنيه

Allah yer7amon Aslan 5abar sarlo achhor mano jdid halae tzkarto tenchruh
لله يرحمون أصلا خبر سارلو اشهر مانو جديد هل تذكرته تنشره

Eno plz ba2a t7arro abeel walaw
إنه من فضلك بقى تحرره ابيل ولو

Resources

In this project we provide Arabizi resources, free of charge (free is the best price a student can get), to the NLP community in general and to the Arabic NLP community in particular, to process and analyse Arabizi data.

Arabizi Identification in Twitter Data

In this work we provide statistical insights about the usage of Arabizi on Twitter across Lebanon and Egypt and we create a classifier that identifies Arabizi from multilingual Twitter streams.
Please cite our paper in your research.
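To give a feel for the identification task, here is a toy heuristic. It is not the classifier from the paper: it only flags tokens that mix Latin letters with the digits 2, 3, 5 and 7, which appear as letters in the examples on this page. The names `looks_like_arabizi` and `arabizi_ratio` are hypothetical. The rule misses Arabizi words written without digits (yalla) and wrongly flags English shorthand such as 2morrow, which is why a trained classifier is needed.

```python
import re

# Assumption: digits that stand in for Arabic letters in the examples above.
ARABIZI_DIGITS = "2357"
# A token must contain at least one Latin letter and at least one such digit,
# and nothing else.
TOKEN = re.compile(
    rf"^(?=.*[a-z])(?=.*[{ARABIZI_DIGITS}])[a-z{ARABIZI_DIGITS}]+$"
)


def looks_like_arabizi(token: str) -> bool:
    """Return True if the token mixes Latin letters with Arabizi digits."""
    return bool(TOKEN.match(token.lower()))


def arabizi_ratio(text: str) -> float:
    """Fraction of alphanumeric tokens in the text flagged by the heuristic."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if not tokens:
        return 0.0
    return sum(looks_like_arabizi(t) for t in tokens) / len(tokens)


print(looks_like_arabizi("7abibi"))   # True
print(looks_like_arabizi("3"))        # False: a bare number, as in "3 tawouk"
print(arabizi_ratio("yalla 7abibi"))  # 0.5
```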


The following are the datasets that we have collected and pre-processed (cleaned) as part of this work and manually annotated as Arabizi vs Not Arabizi:

Twitter Legal Rights of Redistribution

Did you forget to cite our paper? Don't forget to cite our paper!

Coming Soon

We are developing resources and larger datasets to process and analyse Arabizi in several dialects that will be released soon!

Contact Us

If you have any questions about this work, or wish to collaborate or contribute, we are happy to listen and to assist wherever possible.
Please send an email to the author: taha.tobaili@open.ac.uk