Computer vision techniques applied to images opportunistically captured from body-worn cameras or mobile phones offer tremendous potential for vision-based context awareness. In this paper, we evaluate the potential to recognise the modes of locomotion and transportation of mobile users by analysing single images captured by body-worn cameras. We evaluate this with the publicly available Sussex-Huawei Locomotion and Transportation Dataset, which includes 8 transportation and locomotion modes recorded over 7 months by 3 users. We present a baseline performance obtained through crowdsourcing using Amazon Mechanical Turk: humans inferred the correct modes of transportation from images with an F1-score of 52%. The performance obtained by five state-of-the-art deep neural networks (VGG16, VGG19, ResNet50, MobileNet and DenseNet169) on the same task was always above 71.3% F1-score. We characterise the effect of partitioning the training data to fine-tune different numbers of blocks of the deep networks and provide recommendations for mobile implementations.

Index Terms—Activity recognition, Body-worn camera, Computer vision, Deep learning, Crowdsourcing, Mechanical Turk.
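As a minimal illustration of the metric reported above, the macro-averaged F1-score over an 8-class problem can be sketched as follows. The class names and the toy predictions are hypothetical placeholders, not data from the paper; the averaging convention (unweighted mean of per-class F1, with absent classes scoring 0) is one common choice and may differ from the one the authors used.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        # Per-class counts: true positives, false positives, false negatives
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# Hypothetical label set standing in for the 8 transportation/locomotion modes
labels = ["still", "walk", "run", "bike", "car", "bus", "train", "subway"]
y_true = ["walk", "car", "bus", "walk", "train"]
y_pred = ["walk", "car", "car", "walk", "train"]
print(macro_f1(y_true, y_pred, labels))
```

In practice a library implementation (e.g. scikit-learn's `f1_score` with `average="macro"`) would be used; the sketch only makes the definition behind the 52% and 71.3% figures concrete.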