Tutorial at VCIP 2022

Date:

VCIP 2022, Suzhou, China


Title

Vision Transformer: More is different

Brief Description

Big data contains a tremendous amount of dark knowledge. The community has realized that effectively exploring and using such knowledge is essential to achieving superior intelligence. How can we effectively distill the dark knowledge from ultra-large-scale data? One possible answer is: "through Transformers". Transformers have proven their prowess at extracting and harnessing dark knowledge from data. This is because more is truly different when it comes to Transformers.

This tutorial will introduce the structural design, training methods, and applications of vision transformers. We will start with the development of neural networks, tracing their theoretical foundations from CNNs to vision transformers. Then, we will discuss the structural design of vision transformers, including plain vision transformers and vision transformers with inductive biases, followed by a discussion of how to train these models in supervised, self-supervised, and multi-modal settings. Next, we will present applications of vision transformers to both low-level and high-level tasks, which have redefined the state of the art in computer vision. Finally, we will discuss the open challenges facing current vision transformers and outline future directions for their development.
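As a taste of the structural design the tutorial covers, the two core building blocks of a plain vision transformer (ViT) are patch embedding and self-attention over patch tokens. The sketch below is a minimal, illustrative numpy version; all shapes, weight initializations, and function names are ours, not taken from any particular implementation, and it omits positional embeddings, multi-head attention, and the MLP/residual structure of a full transformer block.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an image of shape (H, W, C) into flattened non-overlapping patches."""
    H, W, C = image.shape
    P = patch_size
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return patches  # shape: (num_patches, P*P*C)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over patch tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over tokens
    return weights @ v

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
tokens = patchify(image, patch_size=16)                   # 196 tokens of dim 768
E = rng.standard_normal((768, 64)) * 0.02                 # linear patch embedding
x = tokens @ E                                            # (196, 64) token features
Wq, Wk, Wv = (rng.standard_normal((64, 64)) * 0.02 for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (196, 64)
```

Because attention mixes all 196 tokens globally, the plain ViT has no built-in locality or scale invariance; the hierarchical and inductive-bias designs discussed in the tutorial reintroduce these properties.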


Speaker

Qiming Zhang (Student Member, IEEE) is currently pursuing the Ph.D. degree with the School of Computer Science, The University of Sydney, under the supervision of Prof. Dacheng Tao and Dr. Jing Zhang. He received the B.Sc. degree from Zhejiang University in 2017 and the M.Phil. degree from The University of Sydney in 2019. His research interests include transfer learning, recognition, and vision transformers in computer vision.

Yufei Xu (Student Member, IEEE) received the B.E. degree from the University of Science and Technology of China (USTC), Hefei, China, in 2019. He is currently pursuing the Ph.D. degree with The University of Sydney, under the supervision of Prof. Dacheng Tao and Dr. Jing Zhang. His research interests include computer vision and deep learning.

Jing Zhang (Member, IEEE) is currently a Research Fellow at the School of Computer Science, The University of Sydney. He has published more than 60 papers in prestigious conferences and journals, such as CVPR, ICCV, ECCV, NeurIPS, ICLR, ACM Multimedia, AAAI, IJCAI, International Journal of Computer Vision (IJCV), IEEE Transactions on Image Processing (T-IP), IEEE Transactions on Neural Networks and Learning Systems (T-NNLS), and IEEE Transactions on Cybernetics (T-CYB). His research interests include computer vision and deep learning. He is a Senior Program Committee Member of the AAAI Conference on Artificial Intelligence and the International Joint Conference on Artificial Intelligence. He serves as a reviewer for many journals and conferences, such as CVPR, ICCV, ECCV, NeurIPS, ICLR, T-PAMI, IJCV, and T-IP.

Dacheng Tao (Fellow, IEEE) is currently the Inaugural Director of the JD Explore Academy and a Senior Vice President of JD.com, Inc. He mainly applies statistics and mathematics to artificial intelligence and data science. His research is detailed in one monograph and over 200 publications in prestigious journals and proceedings at leading conferences. He is a fellow of the Australian Academy of Science, AAAS, and ACM. He received the 2015 Australian Scopus-Eureka Prize, the 2018 IEEE ICDM Research Contributions Award, and the 2021 IEEE Computer Society McCluskey Technical Achievement Award.


Outline

  • Introduction
    • History of neural networks
    • Theoretical foundation (deeper, wider, bigger, multiple modalities)
    • The scope of this tutorial
  • Vision Transformer Architecture
    • Preliminary
    • Plain vision transformers
      • ViT structure
      • Deeper transformer
      • Patch embedding
      • Normalization
    • Hierarchical vision transformers
      • Locality
      • Spatial redundancy reduction
      • Scale Invariance
      • Summary
    • Lightweight and efficient vision transformers
      • Lightweight
      • Linear complexity
  • Vision Transformer Pretraining
    • Supervised pretraining
    • Self-supervised pretraining
      • Contrastive learning
      • Masked image modeling
    • Multi-modality pretraining
      • Contrastive-based approaches
      • Generative-based approaches
    • Summary
  • Vision Transformer Applications
    • Low-level tasks
      • Image restoration and super-resolution
      • Optical flow and depth
      • Image matting
    • High-level tasks
      • Object detection
      • Object tracking
      • Semantic segmentation
      • Pose estimation
    • Remote sensing tasks
    • Summary
  • Conclusion
    • Open challenges
    • Discussion
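Among the pretraining strategies in the outline, contrastive learning is the easiest to sketch in a few lines: two augmented views of the same image should produce nearby embeddings, while views of different images should be pushed apart. Below is an illustrative numpy version of an InfoNCE-style loss; the function name, batch construction, and temperature value are our own choices for the sketch, not a reference implementation of any specific method covered in the tutorial.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE over a batch: for each row of z1, the matching row of z2 is the
    positive and all other rows are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # L2-normalize embeddings
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # positives lie on the diagonal

rng = np.random.default_rng(0)
base = rng.standard_normal((8, 32))
z1 = base + 0.01 * rng.standard_normal((8, 32))           # two augmented "views"
z2 = base + 0.01 * rng.standard_normal((8, 32))
loss = info_nce_loss(z1, z2)
print(float(loss))
```

Masked image modeling, the other self-supervised strategy in the outline, instead masks a large fraction of patch tokens and trains the transformer to reconstruct them, so no negative pairs are needed.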

Expected Audiences

Graduate students, engineers, and researchers interested in or working on image processing, computer vision, and deep learning.


Relevant Recent Publications

  1. Zhang, Jingwei, Tongliang Liu, and Dacheng Tao. “An information-theoretic view for deep learning.” arXiv preprint arXiv:1804.09060 (2018).
  2. Xu, Chang, Dacheng Tao, and Chao Xu. “Multi-view intact space learning.” IEEE transactions on pattern analysis and machine intelligence 37.12 (2015): 2531-2544.
  3. Liu, Tongliang, Dacheng Tao, Mingli Song, and Stephen J. Maybank. “Algorithm-dependent generalization bounds for multi-task learning.” IEEE transactions on pattern analysis and machine intelligence 39, no. 2 (2016): 227-241.
  4. Xu, Yufei, Qiming Zhang, Jing Zhang, and Dacheng Tao. “Vitae: Vision transformer advanced by exploring intrinsic inductive bias.” Advances in Neural Information Processing Systems 34 (2021): 28522-28535.
  5. Zhang, Qiming, Yufei Xu, Jing Zhang, and Dacheng Tao. “Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond.” arXiv preprint arXiv:2202.10108 (2022).
  6. Zhang, Qiming, Yufei Xu, Jing Zhang, and Dacheng Tao. “VSA: Learning Varied-Size Window Attention in Vision Transformers.” ECCV, 2022.
  7. Zhang, Jiangning, Xiangtai Li, Yabiao Wang, Chengjie Wang, Yibo Yang, Yong Liu, and Dacheng Tao. “EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm.” arXiv preprint arXiv:2206.09325 (2022).
  8. He, Haoyu, Jing Liu, Zizheng Pan, Jianfei Cai, Jing Zhang, Dacheng Tao, and Bohan Zhuang. “Pruning self-attentions into convolutional layers in single path.” arXiv preprint arXiv:2111.11802 (2021).
  9. Han, Kai, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, and Dacheng Tao. “A survey on vision transformer.” IEEE transactions on pattern analysis and machine intelligence (2022).
  10. He, Lingshen, Yiming Dong, Yisen Wang, Dacheng Tao, and Zhouchen Lin. “Gauge equivariant transformer.” Advances in Neural Information Processing Systems 34 (2021): 27331-27343.
  11. Tang, Yehui, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. “Patch slimming for efficient vision transformers.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12165-12174. 2022.
  12. Chen, Xianing, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. “DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12052-12062. 2022.
  13. Yang, Kaiwen, Tianyi Zhou, Xinmei Tian, and Dacheng Tao. “Identity-Disentangled Adversarial Augmentation for Self-supervised Learning.” In International Conference on Machine Learning, pp. 25364-25381. PMLR, 2022.
  14. Xu, Yufei, Qiming Zhang, Jing Zhang, and Dacheng Tao. “RegionCL: Can Simple Region Swapping Contribute to Contrastive Learning?.” ECCV, 2022.
  15. Wang, Zhen, Liu Liu, Yajing Kong, Jiaxian Guo, and Dacheng Tao. “Online Continual Learning with Contrastive Vision Transformer.” ECCV, 2022.
  16. Hou, Zhi, Baosheng Yu, and Dacheng Tao. “BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7256-7266. 2022.
  17. Hou, Zhi, Baosheng Yu, Chaoyue Wang, Yibing Zhan, and Dacheng Tao. “BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning.” arXiv preprint arXiv:2204.01254 (2022).
  18. Liang, Yaqian, Shanshan Zhao, Baosheng Yu, Jing Zhang, and Fazhi He. “MeshMAE: Masked Autoencoders for 3D Mesh Data Analysis.” ECCV, 2022.
  19. Li, Wei, Qiming Zhang, Jing Zhang, Zhen Huang, Xinmei Tian, and Dacheng Tao. “Toward Real-world Single Image Deraining: A New Benchmark and Beyond.” arXiv preprint arXiv:2206.05514 (2022).
  20. Xu, Haofei, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. “GMFlow: Learning Optical Flow via Global Matching.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8121-8130. 2022.
  21. Zhao, Haimei, Jing Zhang, Sen Zhang, and Dacheng Tao. “JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes.” ECCV, 2022.
  22. Chen, Zhe, Jing Zhang, Yufei Xu, and Dacheng Tao. “Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection.” arXiv preprint arXiv:2207.06603 (2022).
  23. Shi, Dingfeng, Yujie Zhong, Qiong Cao, Jing Zhang, Lin Ma, Jia Li, and Dacheng Tao. “ReAct: Temporal Action Detection with Relational Queries.” arXiv preprint arXiv:2207.07097 (2022).
  24. Ye, Maoyuan, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, and Dacheng Tao. “DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer.” arXiv preprint arXiv:2207.04491 (2022).
  25. Du, Bo, Jian Ye, Jing Zhang, Juhua Liu, and Dacheng Tao. “I3CL: Intra-and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection.” International Journal of Computer Vision (2022): 1-17.
  26. Chen, Chen, Zhe Chen, Jing Zhang, and Dacheng Tao. “Sasa: Semantics-augmented set abstraction for point-based 3d object detection.” In AAAI Conference on Artificial Intelligence, vol. 1. 2022.
  27. Wang, Yikai, TengQi Ye, Lele Cao, Wenbing Huang, Fuchun Sun, Fengxiang He, and Dacheng Tao. “Bridged Transformer for Vision and Point Cloud 3D Object Detection.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12114-12123. 2022.
  28. Chen, Zhe, Jing Zhang, and Dacheng Tao. “Recurrent glimpse-based decoder for detection with transformer.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5260-5269. 2022.
  29. Wang, Wen, Yang Cao, Jing Zhang, Fengxiang He, Zheng-Jun Zha, Yonggang Wen, and Dacheng Tao. “Exploring sequence feature alignment for domain adaptive detection transformers.” In Proceedings of the 29th ACM International Conference on Multimedia, pp. 1730-1738. 2021.
  30. Wang, Wen, Yang Cao, Jing Zhang, and Dacheng Tao. “Fp-detr: Detection transformer advanced by fully pre-training.” In International Conference on Learning Representations. 2021.
  31. Wang, Wen, Jing Zhang, Yang Cao, Yongliang Shen, and Dacheng Tao. “Towards Data-Efficient Detection Transformers.” ECCV, 2022.
  32. Xu, Yang, Hao Wang, De Xie, Cheng Deng, and Dacheng Tao. “Object-Agnostic Transformers for Video Referring Segmentation.” IEEE Transactions on Image Processing, 31, 2839-2849, 2022.
  33. He, Haoyu, Jianfei Cai, Zizheng Pan, Jing Liu, Jing Zhang, Dacheng Tao, and Bohan Zhuang. “Dynamic Focus-aware Positional Queries for Semantic Segmentation.” arXiv preprint arXiv:2204.01244 (2022).
  34. Xu, Yangyang, Xiangtai Li, Haobo Yuan, Yibo Yang, Jing Zhang, Yunhai Tong, Lefei Zhang, and Dacheng Tao. “Multi-Task Learning with Multi-query Transformer for Dense Prediction.” arXiv preprint arXiv:2205.14354 (2022).
  35. Yuan, Haobo, Xiangtai Li, Yibo Yang, Guangliang Cheng, Jing Zhang, Yunhai Tong, Lefei Zhang, and Dacheng Tao. “PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation.” ECCV, 2022.
  36. Lan, Meng, Jing Zhang, Fengxiang He, and Lefei Zhang. “Siamese network with interactive transformer for video object segmentation.” In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, pp. 1228-1236. 2022.
  37. Ma, Sihan, Jizhizi Li, Jing Zhang, He Zhang, and Dacheng Tao. “Rethinking Portrait Matting with Privacy Preserving.” arXiv preprint arXiv:2203.16828 (2022).
  38. Li, Jizhizi, Jing Zhang, and Dacheng Tao. “Referring Image Matting.” arXiv preprint arXiv:2206.05149 (2022).
  39. Xu, Yufei, Jing Zhang, Qiming Zhang, and Dacheng Tao. “ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation.” arXiv preprint arXiv:2204.12484 (2022).
  40. Zhang, Xu, Wen Wang, Zhe Chen, Jing Zhang, and Dacheng Tao. “PromptPose: Language Prompt Helps Animal Pose Estimation.” arXiv preprint arXiv:2206.11752 (2022).
  41. Wang, Di, Jing Zhang, Bo Du, Gui-Song Xia, and Dacheng Tao. “An Empirical Study of Remote Sensing Pretraining.” IEEE Transactions on Geoscience and Remote Sensing (2022).
  42. Wang, Di, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, and Liangpei Zhang. “Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model.” arXiv preprint arXiv:2208.03987 (2022).