Department of Computer Science and Engineering – DISI
Luigi Di Stefano is a Professor at the Department of Computer Science and Engineering (DISI) of the University of Bologna, where he founded and leads the Computer Vision Laboratory (CVLab). His research interests focus on computer vision, machine learning and deep learning. In these fields, he has coordinated many academic research projects funded by public national and European grants as well as by private companies, and he is the author of more than 150 papers in renowned international journals and conferences and of several patents. He has given invited lectures in workshops and PhD schools, has served on the Thesis Defense Committees of several PhD candidates both in Italy and abroad, and serves regularly as a reviewer for the main international journals and conferences. He has been a member of the Board of Directors of Datalogic SpA as an Independent Director and a scientific consultant for Pirelli Tyres in the area of computer vision. In 2011-2012 he was Scientific Supervisor of VIALAB (Vision for Industrial Applications Laboratory), a research and technology transfer laboratory focused on computer vision located in Bologna. In January 2020 he co-founded the start-up eyecan.ai (https://www.eyecan.ai/).
Deep Scene Perception without Labeled Data
This talk will present recent research carried out at CVLab, University of Bologna, in the field of deep learning for scene perception. The leitmotif of our work is avoiding reliance on supervision from labeled data, which, indeed, I am led to posit does not even exist when it comes to training models for key perception tasks like depth prediction. First, I will address depth estimation from both stereo and monocular views. Here, the main contribution of our research concerns deploying self-supervision to pursue domain adaptation of deep CNNs pre-trained on computer-generated imagery. This led us to develop the first-ever online adaptive stereo network, i.e. a CNN that deploys an efficient continual learning paradigm to keep up with domain changes in real time. As for depth-from-mono, based on the intuition that effective monocular depth cues arise from semantic knowledge, I will show how joint learning of depth and per-pixel class labels can improve depth prediction significantly. I will delve further into cross-task learning by presenting our novel AT/DT framework, which allows learned representations to be transferred across different tasks and domains so as to enable, e.g., predicting depth in a target domain by leveraging semantic labels only (or vice versa). I will then present our latest results on the first CNN architecture for comprehensive scene perception from monocular videos: ΩNet (CVPR 2020) can predict depth, semantic labels, optical flow, per-pixel motion probabilities and motion masks, based on a novel training protocol relying on self-supervision and knowledge distillation. Finally, I will address perception from point clouds and present our rotation-equivariant local 3D descriptor, which is based on Spherical CNNs and learned end-to-end from raw data without any explicit supervision.
Notably, this approach lends itself to extracting a canonical orientation from the learned rotation-equivariant representation, which in turn allows for rotation-invariant descriptor matching. To conclude the talk, I will briefly show some unpublished results from an ongoing project carried out in cooperation with a major company.