Foundations of Computer Vision (2024)

Original link: https://visionbook.mit.edu

Foundations of Computer Vision by Torralba, Isola, and Freeman (The MIT Press, 2024) offers a foundational understanding of computer vision that blends image processing and machine learning perspectives. Aimed at undergraduate and graduate students, the book balances intuitive visualizations with concise explanations. It grew out of a writing journey of more than a decade and reflects the field's evolution, including the deep learning revolution. The book emphasizes unifying themes and multiple perspectives, covering topics from image formation and the foundations of learning to signal processing, linear filters, multiscale representations, and neural networks (CNNs, RNNs, and transformers). It also delves into statistical and generative models, representation learning, geometric tools for 3D reconstruction, sequence processing, and scene understanding. While the book is not a survey of the latest applications, it stresses fundamental concepts. The authors acknowledge related books and many contributors, and companion slides are available for instructors.

A Hacker News thread discusses the freely available book Foundations of Computer Vision (2024). A key takeaway is the importance of hard work and perseverance in graduate school, where intelligence alone is not enough. Posters emphasize the need for a strong work ethic, soft skills, networking, self-direction, and adaptability to succeed in a PhD program and in research. The discussion also highlights the continued relevance of classic computer vision techniques alongside newer ML-based methods: older methods are still used in many commercial computer vision applications, especially when GPU acceleration is limited. Finally, the importance of proper cameras, optics, and lighting for real-world machine vision systems is raised, with industrial applications as examples.

Original Text

The print version was published by The MIT Press, Cambridge, Massachusetts, and London, England.

Preface

Dedicated to all the pixels.

About this Book

This book covers foundational topics within computer vision, with an image processing and machine learning perspective. We want to build the reader’s intuition and so we include many visualizations. The audience is undergraduate and graduate students who are entering the field, but we hope experienced practitioners will find the book valuable as well.

Our initial goal was to write a large book that provided a good coverage of the field. Unfortunately, the field of computer vision is just too large for that. So, we decided to write a small book instead, limiting each chapter to no more than five pages. Such a goal forced us to really focus on the important concepts necessary to understand each topic. Writing a short book was perfect because we did not have time to write a long book and you did not have time to read it. Unfortunately, we have failed at that goal, too.

Writing this Book

To appreciate the path we took to write this book, let's look at some data first. Figure 1 shows the number of pages written as a function of time since we first mentioned the idea to MIT Press on November 24, 2010.

Figure 1: Evolution of the number of pages written as a function of time.

Starting to write this book was like entering a cave: we had no idea what we were getting into.

Writing this book has not been a linear process. As the plot shows, the evolution of the manuscript length is non-monotonic, with a period when the book shrank before growing again. Lots of things have happened since we started thinking about this book in November 2010; yes, it has taken us more than 10 years to write this book. If we had known on the first day all the work involved in writing a book like this one, there is no way we would have started. However, from today's vantage point, with most of the work behind us, we feel happy we started this journey. We learned a lot by writing and working out the many examples we show in this book, and we hope you will too by reading and reproducing the examples yourself.

When we started writing the book, the field was moving ahead steadily, unaware of the revolution that would unfold less than two years later. Fortunately, the deep learning revolution of 2012 made the foundations of the field more solid, providing tools to build working implementations of many of the ideas that had been introduced since the field began. In the first years after 2012, some of the early ideas were forgotten amid the popularity of the new approaches, but over time many of them returned. We find it interesting to look back at the process of writing this book from the perspective of the changes that were happening in the field. Figure 1 also shows some important events in the field of artificial intelligence (AI) that took place while we were writing this book.

Structure of the Book

Computer vision has undergone a revolution over the last decade. It may seem like the methods we use now bear little relationship to the methods of 10 years ago. But that’s not the case. The names have changed, yes, and some ideas are genuinely new, but the methods of today in fact have deep roots in the history of computer vision and AI. Throughout this book we will emphasize the unifying themes behind the concepts we present. Some chapters revisit concepts presented earlier from different perspectives.

One of the central metaphors of vision is that of multiple views. There is a true physical scene out there and we view it from different angles, with different sensors, and at different times. Through the collection of views we come to understand the underlying reality. This book also presents a collection of views, and our goal will be to identify the underlying foundations.

The book is organized into multiple parts of a few chapters each, with each part devoted to a coherent topic within computer vision. It is preferable to read the parts in order, as most chapters assume familiarity with the topics covered before them. The parts are as follows:

Part I discusses some motivational topics to introduce the problem of vision and to place it in its societal context. We also introduce a simple vision system that lets us present concepts that will be useful throughout the book and refresh some of the basic mathematical tools.

Part II covers the image formation process.

Part III covers the foundations of learning using vision examples to introduce concepts of broad applicability.

Part IV provides an introduction to signal and image processing, which is foundational to computer vision.

Part V describes a collection of useful linear filters (Gaussian kernels, binomial filters, image derivatives, Laplacian filter, and temporal filters) and some of their applications.

Part VI describes multiscale image representations.

Part VII describes neural networks for vision, including convolutional neural networks, recurrent neural networks, and transformers. These chapters focus on the main principles without describing specific architectures in detail.

Part VIII introduces statistical models of images and graphical models.

Part IX focuses on two powerful modeling approaches in the age of neural nets: generative modeling and representation learning. Generative image models are statistical image models that create synthetic images that follow the rules of natural image formation and proper geometry. Representation learning seeks to find useful abstract representations of images, such as vector embeddings.

Part X is composed of brief chapters that discuss some of the challenges that arise from building learning-based vision systems.

Part XI introduces geometry tools and their use in computer vision to reconstruct the 3D world structure from 2D images.

Part XII focuses on processing sequences and how to measure motion.

Part XIII deals with scene understanding and object detection.

Part XIV is a collection of chapters with advice for junior researchers on giving effective presentations, writing papers, and the mindset of an effective researcher.

Part XV returns to the simple visual system and applies some of the techniques presented in the book to solve the toy problem introduced in Part I.

What Do We Not Cover?

This should be a long section, but we will keep it short. We do not provide a review of the current state of the art in computer vision; we focus instead on the foundational concepts. We do not cover in depth the many applications of computer vision, such as shape analysis, object tracking, person pose analysis, or face recognition. Many of those topics are better studied by reading the latest publications from computer vision conferences and specialized monographs.

Acknowledgments

We thank our teachers, students, and colleagues all over the world who have taught us so much and have brought us so much joy in conversations about research. This book also builds on many computer vision courses taught around the world that helped us decide which topics should be included. We thank everyone who made their slides and syllabi available. A lot of the material in this book was created while preparing the MIT course “Advances in Computer Vision.”

We thank our colleagues who gave us comments on the book: Ted Adelson, David Brainard, Fredo Durand, David Fouhey, Agata Lapedriza, Pietro Perona, Olga Russakovsky, Rick Szeliski, Greg Wornell, Jose María Llauradó, and Alyosha Efros. A special thanks goes to David Fouhey and Rick Szeliski for all the help and advice they provided. We also thank Rob Fergus and Yusuf Aytar for early contributions to this manuscript. Many colleagues and students helped with proofreading the book and with some of the experiments. Special thanks to Manel Baradad, Sarah Schwettmann, Krishna Murthy Jatavallabhula, Wei-Chiu Ma, Kabir Swain, Adrian Rodriguez Muñoz, Tongzhou Wang, Jacob Huh, Yen-Chen Lin, Pratyusha Sharma, Joanna Materzynska, and Shuang Li. Thanks to Manel Baradad for his help on the experiments in Chapter 55, A Simple Vision System—Revisited; to Krishna Murthy Jatavallabhula for helping with the code for Chapter 44, Multiview Geometry and Structure from Motion; and to Aina Torralba for help designing the book cover and several figures.

Antonio Torralba thanks Juan, Idoia, Ade, Sergio, Aina, Alberto, and Agata for all their support over many years.

Phillip Isola thanks Pam, John, Justine, Anna, DeDe, and Daryl for being a wonderful source of support along this journey.

William Freeman thanks Franny, Roz, Taylor, Maddie, Michael, and Joseph for their love and support.

How to Cite This Book

If you would like to cite this book, please use the following BibTeX entry:
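The entry itself did not survive the page extraction. Below is a minimal reconstruction assembled only from details stated on this page (authors, title, publisher, place, and year); the citation key, the field layout, and the middle initial in the third author's name are our own assumptions, so please check https://visionbook.mit.edu for the authors' exact entry.

  @book{torralba2024foundations,
    author    = {Antonio Torralba and Phillip Isola and William T. Freeman},
    title     = {Foundations of Computer Vision},
    publisher = {The MIT Press},
    address   = {Cambridge, Massachusetts},
    year      = {2024}
  }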

Slides that accompany this book are available for download here.

Related Books

[1] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach, 2nd ed., Pitman, 2012.
[2] R. Szeliski, Computer Vision: Algorithms and Applications, 2nd ed., Springer, 2022.
[3] B. K. P. Horn, Robot Vision, MIT Press, Cambridge, MA, 1986.
[4] D. Marr, Vision, MIT Press, Cambridge, MA, 2010.
[5] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, Cambridge, UK, 2004.
[6] J. J. Koenderink, Solid Shape, MIT Press, Cambridge, MA, 1990.
[7] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, Cambridge, MA, 1993.
[8] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision, Prentice Hall PTR, USA, 1998.
[9] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003.
[10] C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.
[11] K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, Cambridge, MA, 2022.
[12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, MA, 2016.
[13] S. J. D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012.
[14] S. E. Palmer, Vision Science: Photons to Phenomenology, MIT Press, Cambridge, MA, 1999.
[15] G. Granlund and H. Knutsson, Signal Processing for Computer Vision, Springer, New York, NY, 1995.
[16] S. Ullman, High-Level Vision, MIT Press, Cambridge, MA, 2000.
[17] M. Minnaert, Light and Color in the Outdoors, Springer, New York, 2012.
