Week 5 Lecture - Object Recognition
5a01 - Why Object Recognition Is Difficult
- Segmentation
- Lighting
- Deformation
- Affordances
- Viewpoint
5b01 - Ways To Achieve Viewpoint Invariance
It's hard to appreciate how difficult it is. It's the main difficulty in making computers perceive. We are still lacking generally accepted solutions
5b02 - Some ways to achieve viewpoint invariance
- Use redundant invariant features
- Put a box around the object and use normalized pixels
- Lecture 5c: use replicated features with pooling: "convolutional neural nets"
- Use a hierarchy of parts that have explicit poses relative to the camera (will be described later in course)
5b03 - The Invariant Feature Approach
- Extract a large, redundant set of features that are invariant under transformations
- e.g. a pair of roughly parallel lines with a red dot between them
- this is what baby herring gulls use to know where to peck for food
- With enough invariant features, there is only one way to assemble them into an object
- We don't need to represent the relationships between the features directly because they are captured by other features.
- For recognition, we must avoid forming features from parts of different objects
5b04 - The Judicious Normalization Approach
- Put a box around the object and use it as a coordinate frame for a set of normalized pixels.
- This solves the dimension hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.
- The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch ...
- Choosing the box is difficult because of: segmentation errors, occlusion, unusual orientations
- We need to recognize the shape to get the box right!
We recognize the letter before doing mental rotation to decide if it's a mirror image:
5b05 - The Brute Force Normalization Approach
- When training the recognizer, use well-segmented, upright images to fit the correct box.
- At test time try all possible boxes in a range of positions and scales.
- Widely used for detecting upright things like faces and house numbers in unsegmented images.
- It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.
5c01 - Convolutional Neural Networks For Hand-Written Digit Recognition
CNNs originated in 1980s. It was possible on computers then, and they worked really well.
5c02 - The replicated feature approach
(currently the dominant approach for neural networks)
Use many different copies of the same feature detector with different positions.
- We could replicate across scale and orientation, but that's tricky and expensive.
- Replication greatly reduces the number of free parameters to be learned.
Al
5c03 - Backpropagation with weight constraints
- it's easy to modify the backpropagation algorithm
5c04 - What does replicating the feature detectors achieve?
- many people claim replicating translation invariance
- that's not true
- they achieve equivariance, not invariance
- equivariant activities: replicated features do not make the neural activities invariant to translation. the activities are equivariant.
- [images here]
- Invariant knowledge:
"We are achieving equivariance in the activities and invariance in the weights."
5c04 - Pooling the outputs of replicated feature detectors
Get a small amount of translational invariance at each level of the net by averaging four neighboring replicated detectors to give a single output to the next level.
- reduces the number of inputs to the next layer of feature extraction, which allows us to have many more different feature maps
- taking the max of four works slightly better (than averaging? works better for what?)
Problem: after several levels of pooling, we lose info about positions of things
- consequence: makes it impossible to use precise spatial relationships
5c - Question
5:53
Consider the following convolutional neural network. The input is a 3 x 3 image and each hidden unit has a 2 x 2 weight filter that connects to a localized region of this image. This is shown in the following diagram (note that this represents one filter map and some lines are dashed just to avoid clutter):
In the image shown to the network, black pixels have a value of 1 while white pixels have a value of 0, so only two pixels in this image have a value of 1. Suppose we wanted to pool the output of each hidden unit using max pooling. If and max-pooling is defined as , then what will be the value of in this example?
- 2
- 0
- 1
- 3
5c - Question Work
I was not expecting that pooling would reuse pixels. If is the maximum of any output , then it should be #1: 2.
5c - Question Answer
Correct. Max pooling takes the output of each hidden unit in the map and picks the maximum value among these. In this case, the maximum value is 2 and it comes from hidden unit . The neurons on the layer above therefore only see that some hidden unit produced an output of 2. They do not know which hidden unit among the 4 caused this, and therefore they lose the ability to distinguish where interesting things happen in the image. They only know that something interesting happened somewhere.
In general, we don't pool over the entire image, but instead we pool over regions. For example, we might pool over to create two outputs. This lets the network build up progressively more invariant features with each successive layer.
5c05 - Le Net
5:55
Yann LeCun et al developed a good recognizer for handwritten digits
- Many hidden layers
- Many maps of replicated units in each later
- Pooling of the outputs to nearby replicated units (?)
- A wide net that can
- It was a clever way of training a complete system, not just a recognizer for individual characters.
- You put in pixels at one end and get out zip codes at the other.
- They used a method called maximum margin today for training.
- Used for reading ~10% of the checks in North America (there's a great practical value)
- Look at impressive demos of LENET at http://yann.lecun.com
- look at all of these
- shows how well it copes with overlaps of digits, variations in size
5c06 - The architecture of LeNet5
Input is pixels
Seq of feature maps followed by subsampling
In C1 feature map, there are 6 different maps, each of which are 28x28. They contain small features that look at 3x3 pixels. The weights are constrained together. So per map there is only 9 features.
What they call subsampling is now called pooling. You pool the outputs of neighbored feature replicators, which gives you a smaller map, which gives you a smaller map to input to the next layer which is discovering more complicated features.
As you go up hierarchy, you get features which are more complicated, but are more invariant to position.
5c07 - The 82 errors made by LeNet5
- It's better than 99% correct.
- There might be digits that LeNet got right and most people would get wrong.
5c08 - Priors and Prejudice
9:43
We can put in prior knowledge by design of the network
- connectivity
- weight constraints
- neuron activation functions
Less intrusive than hand-designed features
- still prejudices network towards particular solution
Alternative to putting in prior knowledge that gives network "a freer hand"
- give a whole lot more training data
- steel mill fortran simulator - Hofman&Tresp, 1993. They had real data and simulated data
- They ran simulated data and added it to real data
- if you generate a lot of synthetic data, it may take much longer.
This approach allows discovering clever ways of using multilayer network we did not think of (?)
5c09 - The brute force approach
16:01
5c10 - The errors made by the Ciresan et. al. net
tbd How to detect a significant drop in the error rate
tbd
5d01 - Convolutional Neural Networks For Object Recognition
TBD
5d02 - From hand-wriHen digits to 3-D objects
TBD
5d03 - The ILSVRC-2012 competition on ImageNet
TBD
5d04 - Examples from the test set (with the network’s guesses)
TBD
5d05 - Error rates on the ILSVRC-2012 competition
TBD
5d06 - A neural network for ImageNet
TBD
5d07 - Tricks that significantly improve generalization
TBD
5d08 - The hardware required for Alex’s net
TBD
5d09 - Finding roads in high-resolution images
Week 5 Quiz
Week 5 Quiz Prob 1 - Dimension Hopping
Hinton appears to be making a specific point about dimension hopping. He doesn't just mean that information might have been transcribed incorrectly by humans using forms. I believe that he means that when trying to detect a specific signal in data set examples, the signature may or may not be there.
Positive examples:
- Determining whether a wave has high frequency or low frequency. The input is a set of time values along with their corresponding vertical displacements.
- Determining whether a given image shows a bike or a car. The bike or car might appear anywhere in the image. The input is the whole set of pixels for the image.
Negative examples:
- Estimating the risk that a patient will develop heart disease given their age, weight, blood pressure, and cholesterol level.
- "Does it make sense for any of these inputs to have their values switched with each other? Would several hundred thousand schools in one neighbourhood make sense?"
- Estimating the market price of a house given the number of rooms in the house, the number of schools and the average income of the surrounding neighbourhood, and the average sale price of the surrounding houses.
- "Does it make sense for any of these inputs to have their values switched with each other? Would several hundred thousand schools in one neighbourhood make sense?"
Week 5 Quiz Prob 2 - CNN for 3
Week 5 Assignment
See assignment2.
Week 5 Vocab
weight filter [matrix, applied to CNN, used in 5c question]: TBD
filter map [used in 5c question]: TBD
convolutional neural network: TBD
Week 5 FAQ
- Question 5c talks about a technique of pooling called max pooling, where one takes the maximum number of activated pixels from any of the hidden units in a given region. Lecture 5 Quiz question 3 also mentioned sum pooling. What are some other techniques for max pooling, along with their common uses?
- Week 5 Quiz Question 4 said, "More inputs means more parameters, and thus increases the risk of overfitting." What is the difference between an input and a parameter, and what is the relationship between parameters and the risk of overfitting?
Week 5 Other
Papers
Links
- Another Artificial Intelligence Gitbook With CNNs
- Octave CheatSheet by obstschale
- A Programmer's Guide to Octave
People
Some people in the deep learning field:
- Geoff Hinton, Google - backpropagation, boltzmann machines
- Yann Lecun, Facebook - convolution
- Yoshua Bengio, U. Montreal - stacked auto-encoders
- Andrew Ng, Baidu - GPU utilization
- Alex Krizhevsky - dropout