[Capturing an Image in a Numeric
Array | The Detection of Lines and Boundaries | Descriptions
of Texture | Matching with a Model | Conflicts with
Reality]
Computers, with
the aid of appropriate camera equipment, can easily view objects just as we do.
However, they do not interpret images and objects in the same way that we can.
For example, when we see a sunny beach, we can make out palm trees, specks of sand,
the ocean, birds, and surfers hitting the waves. However, in that same scene all a
computer "sees" is a two-dimensional grid of pixels (short for "picture element"; one
of thousands of points on a computer screen from which digital images are
formed) (Herndon, 121) with varying colors and degrees of brightness based on
numerical values. The goal
of artificially intelligent machines that use computerized
vision is to be able to recognize objects from any angle, even when
the objects themselves are slightly distorted.
Image formation is the most technically developed stage of machine vision. A
camera records the amount of light reflected into it from the surfaces of objects in
a three-dimensional scene. The information is then transmitted through
a converter that changes the analog signals into digital information that the
computer can interpret. The digital samples represent positions on
a range of brightness, or intensity, values called a gray scale. These numbers
are formed into a two-dimensional grid called a gray-level array. Each
value in the array or grid makes up one pixel of the digitized image. AI-vision systems
commonly use gray scales with values that range from 255 (lightest) to zero (darkest).
Color vision systems, however, take three separate measurements for the intensities
of red, green, and blue (RGB). This digital RGB system is what the digital TVs that
will become the television standard are based on (Herndon, 19).
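
The digitizing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `digitize`, the idea that the converter delivers brightness samples between 0.0 and 1.0, and the sample readings themselves are hypothetical and do not come from any particular camera or system.

```python
def digitize(samples, width, levels=256):
    """Turn analog brightness samples (assumed to lie in 0.0..1.0) into
    integer gray levels and arrange them as rows of `width` pixels."""
    top = levels - 1  # 255 for an 8-bit gray scale
    gray = [min(top, max(0, round(s * top))) for s in samples]
    return [gray[i:i + width] for i in range(0, len(gray), width)]

# Hypothetical sensor readings for a 3-by-2 image (0.0 = darkest, 1.0 = lightest).
readings = [0.0, 0.4, 1.0, 0.2, 0.8, 1.0]
gray_array = digitize(readings, width=3)
print(gray_array)  # [[0, 102, 255], [51, 204, 255]]

# A color system keeps three such measurements per pixel instead of one.
rgb_pixel = (255, 165, 0)  # hypothetical (red, green, blue) intensities
```

Each inner list is one row of the gray-level array, and each number is one pixel.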
Once the computer has received the stream of numbers representing the varying light
intensities reflected from a scene in the three-dimensional world, it has to make use of
the numbers to understand what they mean. The first step is edge detection, in which
the computer makes outlines of objects or parts of objects. In order to do
this, the computer searches for sudden changes in brightness values that are associated
with edges such as those that result from surface creases, object boundaries, or changes
in color. However, these edges can sometimes be masked by noise (minor variations in intensity caused by
surface texture or imperfections such as scratches, and by electronic fluctuations
inherent in the digitizing process). In order to screen out noise, the computer must
erase or reduce these insignificant values by a process known as smoothing. In
smoothing, the value of each pixel in the grid or array is replaced by an average of
itself and its neighbors. The more pixels that are averaged, the
smoother the picture is. Once the image has been smoothed and its edges traced, an
object looks like an outline of itself (Herndon, 19).
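
Smoothing and edge detection, as described above, can be sketched as follows. This is a minimal illustration, not the method of any particular vision system: the function names `smooth` and `find_edges`, the 3-by-3 averaging window, the threshold of 50, and the tiny test image are all assumptions chosen for the example.

```python
def smooth(image):
    """Replace each pixel with the average of itself and its neighbors
    (a 3x3 window, clipped at the image borders)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(window) // len(window))
        out.append(row)
    return out

def find_edges(image, threshold=50):
    """Mark a pixel with 1 when brightness jumps by at least `threshold`
    between it and its right-hand or lower neighbor."""
    h, w = len(image), len(image[0])
    marks = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = abs(image[y][x + 1] - image[y][x]) if x + 1 < w else 0
            dy = abs(image[y + 1][x] - image[y][x]) if y + 1 < h else 0
            if max(dx, dy) >= threshold:
                marks[y][x] = 1
    return marks

# Hypothetical scene: dark region on the left, bright region on the right,
# plus one noisy pixel (a "scratch") inside the dark region.
image = [[10, 10, 10, 10, 200, 200, 200, 200] for _ in range(5)]
image[2][1] = 70

raw = find_edges(image)            # edges found without smoothing
clean = find_edges(smooth(image))  # edges found after smoothing
```

In this toy image the noisy pixel at row 2, column 1 is flagged as an edge before smoothing but not after, while the dark-to-bright boundary is still detected. The averaging also spreads that boundary over a few neighboring columns, which is the cost of smoothing.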
One way for computer-vision systems to pick out an object is by performing
texture analysis. Since a particular texture is represented in the gray-level array
as a particular pattern of brightness values, sudden changes in texture could indicate
changes in the physical surface. For example, if you were to look at an aerial
photograph of a highway in the middle of the desert, the texture of the road would stand
out from the textures of the desert, thereby indicating a change in surface.
This is something that we do so naturally and easily, yet it is much more complex in the
computing world of bits and bytes. Two ways of describing the texture of
a surface are structural analysis and statistical analysis. In structural
analysis, the system looks for features and the relationships among them. For
example, on a pineapple there are specific woody sections of thorns that are
arranged in diagonal rows, which would be identified as specific features arranged in
a certain pattern. Using edge detection, a computer can discern the scales that make
up the skin of a fish, and can thus identify a fish
by its scales. Statistical analysis is used when specific features, such as
thorns, are not easily discernible or are not visible. In this method, the
computer focuses on the relationship between a single pixel and its neighbors, analyzing
the probability that a pixel's intensity is similar to that of its neighbors. As a
result, the computer can determine whether a texture is rough, whether it has
contrasting colors, how regular its features are (if any), and to what degree
each of those qualities is present. Thus a computer can make out the hairs of a fur coat, the
rough texture of a rock, or the smooth texture of silk (Herndon, 23).
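
The statistical approach can be illustrated with a very simple measure: the fraction of neighboring pixel pairs whose gray levels are close to each other. This is a sketch under assumptions, not a description of any real system; the function name `neighbor_similarity`, the tolerance of 10 gray levels, and the two sample patches are hypothetical.

```python
def neighbor_similarity(image, tolerance=10):
    """Estimate the probability that a pixel's intensity is similar to a
    neighbor's: the fraction of horizontally or vertically adjacent pairs
    whose gray levels differ by at most `tolerance`."""
    h, w = len(image), len(image[0])
    similar = total = 0
    for y in range(h):
        for x in range(w):
            # Compare with the right-hand and lower neighbors only,
            # so that each adjacent pair is counted exactly once.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < h and nx < w:
                    total += 1
                    if abs(image[y][x] - image[ny][nx]) <= tolerance:
                        similar += 1
    return similar / total

# Hypothetical 3x3 patches of gray levels.
silk = [[120, 122, 121], [119, 121, 123], [120, 120, 122]]
rock = [[30, 200, 60], [180, 40, 220], [50, 210, 35]]

silk_score = neighbor_similarity(silk)  # 1.0: every neighbor is similar
rock_score = neighbor_similarity(rock)  # 0.0: every neighbor differs sharply
```

A score near 1 suggests a smooth texture and a score near 0 a rough or high-contrast one.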
In order to recognize an object, the computer must match its outline,
shape, color, or texture with models stored in its memory. The most basic way
of doing this is to find the stored outline that best matches the
outline of the object. However, the object must be viewed from an angle
similar to that of the stored outline. The cone-like shape of a strawberry
seen from the side is completely different from its circular shape seen from the top.
Feature extraction, in which objects are further classified by texture, color,
and shape, allows finer distinctions between objects. Although a computer could
tell the difference between a melon and a banana by shape alone, it would need to
analyze texture and color to tell the difference between a melon and an orange. A
more complex system would break objects into separate components, and be able to
distinguish between very similar objects. For example, if the computer were to analyze two
different humans, it could compare the length of their legs or the width of
their arms. To go even further, a computer could also compare how the
components relate to one another. The computer could differentiate
between one person with his hands on his waist and another with his hands on his
head (Herndon, 27).
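
The matching idea above can be sketched as follows: compare an observed outline with each stored model, and fall back on color when shape alone cannot decide. Everything here is a simplifying assumption made for illustration: the names `match_score`, `color_distance`, and `recognize`, the 5-by-5 binary outlines, and the model colors are hypothetical.

```python
ROUND = [[0, 1, 1, 1, 0],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1],
         [0, 1, 1, 1, 0]]
LONG = [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]

# Stored models: an outline plus an average (red, green, blue) color.
models = {
    "melon":  {"outline": ROUND, "color": (150, 200, 120)},
    "orange": {"outline": ROUND, "color": (255, 165, 0)},
    "banana": {"outline": LONG,  "color": (255, 225, 50)},
}

def match_score(outline, model):
    """Fraction of grid cells on which two same-sized outlines agree."""
    pairs = [(a, b) for row_o, row_m in zip(outline, model)
             for a, b in zip(row_o, row_m)]
    return sum(a == b for a, b in pairs) / len(pairs)

def color_distance(c1, c2):
    """Squared distance between two RGB colors (smaller means closer)."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2))

def recognize(outline, color, models):
    """Pick the model whose outline agrees best; break ties by color."""
    return max(models, key=lambda name: (
        match_score(outline, models[name]["outline"]),
        -color_distance(color, models[name]["color"])))

# A slightly distorted round object (one cell missing) with an orange hue.
observed = [row[:] for row in ROUND]
observed[3][4] = 0
result = recognize(observed, (250, 160, 10), models)  # "orange"
```

Shape alone separates the round objects from the banana, but the melon and the orange have the same outline score, so the color comparison makes the final choice.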
Although there are computer-vision systems used in a number of
different fields, most systems can only perform under very constrained conditions.
Optical character readers (OCR) can recognize letters printed in most typewriter
fonts, but cannot recognize various forms of handwriting. Robots that are involved
in manufacturing can discern certain parts from others, but only from a specific
perspective; if the robot's camera were turned 20 degrees to the left, it would not be able to
recognize any of the parts it recognized from its original position. Even this is
a minor problem, since an object's appearance can be altered in many other ways. We as
humans understand that an orange that is peeled is still an orange, or that a tree with
wilted leaves during winter can be the same tree with bright green leaves in
spring. Too much or too little illumination can also distort the perception of an
object and its boundaries. We might be able to tell that it's the neighbor's dog
Rover wandering the streets at sunset, but a computerized vision system might not
be able to make heads or tails of it, because the lack of light makes it harder to
discern textures and colors (Herndon, 29).
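
The effect of poor lighting can be shown with a toy example. It assumes a detector that looks for a fixed jump in brightness between neighboring pixels; the function names `boundary_found` and `dim`, the threshold of 50, and the sample values are hypothetical.

```python
def boundary_found(row, threshold=50):
    """True if any two adjacent gray levels differ by at least `threshold`."""
    return any(abs(b - a) >= threshold for a, b in zip(row, row[1:]))

def dim(row, factor):
    """Simulate weaker illumination by scaling every gray level down."""
    return [int(v * factor) for v in row]

daylight = [10, 10, 200, 200]  # dark background next to a bright object
sunset = dim(daylight, 0.2)    # the same scene with one fifth of the light

seen_by_day = boundary_found(daylight)  # True: the jump is well above 50
seen_at_dusk = boundary_found(sunset)   # False: the jump falls below 50
```

The object has not changed, but under the dimmer light the brightness jump at its boundary no longer reaches the fixed threshold, so this simple detector misses it.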