Trying Out HuggingFace

HuggingFace and Making Posts from Notebooks
Categories: jupyter, huggingface

Published: August 15, 2020

Updated Blog Posting Method

After going through the pain of converting a Notebook to a markdown file and then editing that markdown file to look nice (in my last post), I saw that there was a better way to handle that process: fastpages. The process was slightly rocky, but I think I finally have things more or less figured out, including linking it to a domain under my name!

As a first test of uploading a notebook as a blog post, I am going to toy with the Hugging Face models. Interesting name for a company/group, with lots of Alien vibes. I saw this super cool tweet:

> twitter: https://twitter.com/huggingface/status/1293240692924452864?s=20

Per the instructions here, I made a virtual environment to try out some transformers:

pyenv virtualenv 3.8 hface
pyenv activate hface
pip install jupyter
pip install --upgrade pip
pip install torch
pip install transformers

And then I tested with:

python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"

This gave what I think is a correct sentiment score: a NEGATIVE label with a score close to 1.

from transformers import pipeline
pipeline('sentiment-analysis')('jog off')
[{'label': 'NEGATIVE', 'score': 0.905813992023468}]
pipeline('sentiment-analysis')('exactly')
[{'label': 'POSITIVE', 'score': 0.9990326166152954}]
pipeline('sentiment-analysis')('I saw this super cool tweet')
[{'label': 'POSITIVE', 'score': 0.998775064945221}]

Very cool!
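As an aside, calling pipeline('sentiment-analysis') fresh for every input reloads the model each time. A minimal sketch of building the classifier once and batching inputs instead (the list form returns one result dict per input, matching the outputs above):

# Build the pipeline once so the model is only loaded a single time
classifier = pipeline('sentiment-analysis')

# Pipelines also accept a list of strings, returning one dict per input
texts = ['jog off', 'exactly', 'I saw this super cool tweet']
for text, result in zip(texts, classifier(texts)):
    print(f"{text!r} -> {result['label']} ({result['score']:.3f})")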

Trying Out Pipelines

I attempted to run a zero-shot classifier, but got an error ("Unknown task zero-shot-classification, available tasks are ['feature-extraction', 'sentiment-analysis', 'ner', 'question-answering', 'fill-mask', 'summarization', 'translation_en_to_fr', 'translation_en_to_de', 'translation_en_to_ro', 'text-generation']"). I guess this is because it is a new feature that hasn’t quite made it into a released version yet:

classifier = pipeline('zero-shot-classification')
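For reference, once the feature makes it into a release, the usage (going by the tweet above) should look something like the sketch below; the sequence and candidate labels here are just made-up examples:

# Hypothetical usage on a newer transformers release that includes the task
zero_shot = pipeline('zero-shot-classification')
zero_shot(
    'I am going to toy with the Hugging Face models',
    candidate_labels=['machine learning', 'cooking', 'politics'],
)
# Should return a dict with the sequence plus labels and scores, sorted by score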

Instead, I will play around with some of the other pipelines.

en_to_de_translate = pipeline('translation_en_to_de')

/home/simon/.pyenv/versions/3.8.3/envs/hface/lib/python3.8/site-packages/transformers/modeling_auto.py:796: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
  warnings.warn(
Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
en_to_de_translate("no")
[{'translation_text': 'nein, nein, nein, nein!'}]

Checks out. Let’s see if some other stuff accords with my degrading knowledge of German:

en_to_de_translate("this is my room")
[{'translation_text': 'das ist mein Raum.'}]

I probably would have used Zimmer instead of Raum, since Raum is more “space” than “room” to me.

en_to_de_translate("monkey, hippo, porcupine, dog, cat, rabbit")
[{'translation_text': 'Affen, Hippo, Pfauen, Hunde, Katzen, Kaninchen, Hunde, Katzen, Kaninchen.'}]

It looks like it uses the plural for the nouns. Hippo didn’t translate to anything different; apparently Flusspferd (“river horse”) is favored by Leo. I like Stachelschwein (“spike pig”) better for porcupine (which apparently live in Texas now!?), and furthermore Pfauen appears to actually mean peacocks. I’m not sure why dog (Hund), cat (Katze), and rabbit (Kaninchen) are repeated, but those look good.
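To check whether the repetition comes from handing the model one long comma-separated sequence, a quick sketch that translates each noun as its own input instead (pipelines accept a list of strings):

# Translating each word separately, which might keep the model from
# free-running over the comma-separated list
animals = ['monkey', 'hippo', 'porcupine', 'dog', 'cat', 'rabbit']
for animal, result in zip(animals, en_to_de_translate(animals)):
    print(f"{animal} -> {result['translation_text']}")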

So it’s not perfect, but something that took less than a minute to set up can out-translate my 4-ish years of German classes that I haven’t brushed up on in like a decade. Ouch.
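As an aside, the FutureWarning above recommends the newer model classes. A sketch of constructing the same pipeline explicitly with them (t5-base is the default checkpoint named in the warning output):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# T5 is an encoder-decoder model, so AutoModelForSeq2SeqLM is the
# suggested replacement for the deprecated AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base')
en_to_de_translate = pipeline('translation_en_to_de', model=model, tokenizer=tokenizer)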

Named Entity Recognition

Finally, to cap off this short test post, let’s try out the named entity recognition task. They provide an example of the classifier in their docs as well as a short list of what the different abbreviations mean:

* O, Outside of a named entity
* B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
* I-MIS, Miscellaneous entity
* B-PER, Beginning of a person’s name right after another person’s name
* I-PER, Person’s name
* B-ORG, Beginning of an organisation right after another organisation
* I-ORG, Organisation
* B-LOC, Beginning of a location right after another location
* I-LOC, Location

ner = pipeline('ner')

First using their example:

sequence = ("Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very "
    + "close to the Manhattan Bridge which is visible from the window.")
from pprint import pprint
for entry in ner(sequence):
    pprint(entry)
{'entity': 'I-ORG', 'index': 1, 'score': 0.9995632767677307, 'word': 'Hu'}
{'entity': 'I-ORG', 'index': 2, 'score': 0.9915938973426819, 'word': '##gging'}
{'entity': 'I-ORG', 'index': 3, 'score': 0.9982671737670898, 'word': 'Face'}
{'entity': 'I-ORG', 'index': 4, 'score': 0.9994403719902039, 'word': 'Inc'}
{'entity': 'I-LOC', 'index': 11, 'score': 0.9994346499443054, 'word': 'New'}
{'entity': 'I-LOC', 'index': 12, 'score': 0.9993270635604858, 'word': 'York'}
{'entity': 'I-LOC', 'index': 13, 'score': 0.9993864893913269, 'word': 'City'}
{'entity': 'I-LOC', 'index': 19, 'score': 0.9825621843338013, 'word': 'D'}
{'entity': 'I-LOC', 'index': 20, 'score': 0.9369831085205078, 'word': '##UM'}
{'entity': 'I-LOC', 'index': 21, 'score': 0.8987104296684265, 'word': '##BO'}
{'entity': 'I-LOC',
 'index': 29,
 'score': 0.9758240580558777,
 'word': 'Manhattan'}
{'entity': 'I-LOC', 'index': 30, 'score': 0.9902493953704834, 'word': 'Bridge'}

Impressive, especially how it recognizes DUMBO as a location. (Side note: I actually visited that area on my first trip to NYC last year.)
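The ## pieces are wordpiece sub-tokens, so getting whole entities back out takes a little stitching. Here's a minimal sketch that merges the output above by hand (I believe newer transformers versions also offer a grouped_entities option on the ner pipeline that does this for you):

def group_entities(tokens):
    """Stitch consecutive wordpieces with the same tag back into whole entities."""
    groups = []
    prev_index = None
    for token in tokens:
        word = token['word']
        if word.startswith('##') and groups:
            # '##' marks a wordpiece continuation of the previous token
            groups[-1]['word'] += word[2:]
        elif (groups and token['entity'] == groups[-1]['entity']
                and token['index'] == prev_index + 1):
            # adjacent token with the same tag: same entity, new word
            groups[-1]['word'] += ' ' + word
        else:
            groups.append({'entity': token['entity'], 'word': word})
        prev_index = token['index']
    return groups

group_entities(ner(sequence))
# Expected, given the tokens printed above:
# [{'entity': 'I-ORG', 'word': 'Hugging Face Inc'},
#  {'entity': 'I-LOC', 'word': 'New York City'},
#  {'entity': 'I-LOC', 'word': 'DUMBO'},
#  {'entity': 'I-LOC', 'word': 'Manhattan Bridge'}]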

Looking forward to trying out these transformers more in the future!