Multimodal Information Extraction From Videos