
Exploring Visual Question Answering using Python

Introduction


Visual Question Answering (VQA) is a subfield of artificial intelligence that aims to answer natural-language questions about an image.

For example, if you give a VQA model a picture of a cat and ask questions about it, the model will try to answer them correctly.

In this tutorial, we will use a free model available on Hugging Face to perform VQA with Python. Don't be afraid if all this seems intimidating; we will guide you step by step.

π—’π˜ƒπ—²π—Ώπ˜ƒπ—Άπ—²π˜„

To get a basic idea of the model's capabilities, you can try the free hosted inference widget on the model's Hugging Face page.



Now that we have a basic understanding of the model's capabilities, let's start coding.

π—¦π—²π˜π˜π—Άπ—»π—΄ 𝗨𝗽 π˜π—΅π—² π—˜π—»π˜ƒπ—Άπ—Ώπ—Όπ—»π—Ίπ—²π—»π˜

Before coding, there are a few libraries you have to install.
To do that, open the Command Prompt on Windows (by typing cmd in the Start menu) or the terminal on Linux/macOS.
Now run the following commands:

pip install transformers
pip install requests
pip install pillow
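
Note: the Transformers library needs a deep-learning backend (such as PyTorch) to actually run the model. If you don't already have it installed, add it too:

pip install torch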

π—œπ—Ίπ—½π—Ήπ—²π—Ίπ—²π—»π˜π—Άπ—»π—΄ VQA

Β 

We are going to code in Python, so open your favourite text editor.

Note: The full code is provided at the end.
First, we need to import the necessary libraries. Let's do that!

from transformers import ViltProcessor, ViltForQuestionAnswering
import requests
from PIL import Image

Now we have to load our VQA model with Hugging Face's from_pretrained function. The model we are going to use is dandelin/vilt-b32-finetuned-vqa. We can do that like this:

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
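
Optional: if you have a GPU, you can move the model onto it to speed things up. This is just a sketch assuming you installed PyTorch; the rest of the tutorial works fine on the CPU.

import torch  # assumption: PyTorch is installed as the backend

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# If you do this, remember to move the inputs to the same device later, e.g.
# encoding = processor(image, text, return_tensors="pt").to(device)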

Now let's fetch the image. Set the image's URL and use Image.open() to open it.

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image
image = Image.open(requests.get(url, stream=True).raw)
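
If you would rather use a picture from your own computer, you can skip the URL and open the file directly. The file name below is just a placeholder; replace it with your own path.

image = Image.open("my_picture.jpg").convert("RGB")  # placeholder path; convert() ensures 3 colour channels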

Now let's define the question you want to ask:
text = 'Which animal is this?'

Now we have to encode the image and the question into the format the model expects, and feed them to the model.

encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()

We can now print the predicted answer:
print("Predicted answer:", model.config.id2label[idx])

And that's it: we have a working VQA pipeline in our hands.

Conclusion

You can put your favourite picture into this model, and ask it questions. But remember that this model is not perfect.
In fact, it is light-years away from perfection. So use it with caution.

Here is the whole code, for the copy-pasters :)

from transformers import ViltProcessor, ViltForQuestionAnswering
import requests
from PIL import Image

# Load the processor and the pretrained VQA model
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Fetch a sample image and define the question
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = 'Which animal is this?'

# Encode the inputs, run the model, and pick the highest-scoring answer
encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])

If you have any questions, feel free to ask them in the comments section.
Note that the model and the example code are not ours; they are available on Hugging Face.

BibTeX entry and citation info
@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}