E-mail spam, also known as junk e-mail, is a subset of spam that involves nearly identical messages sent to numerous recipients by e-mail. A common synonym for spam is unsolicited bulk e-mail. Definitions of spam usually include the aspects that email is unsolicited and sent in bulk.
E-mail spam has grown steadily, even exponentially, since the early 1990s to several billion messages a day. Spam has frustrated, confused, and annoyed e-mail users. The total volume of spam (over 100 billion e-mails per day as of April 2008) has leveled off slightly in recent years and is no longer growing exponentially. The amount received by most e-mail users has decreased, mostly because of better filtering. About 80% of all spam is sent by fewer than 200 spammers. Botnets, networks of virus-infected computers, are used to send about 80% of spam.
E-mail addresses are collected from chat rooms, web sites, newsgroups, and viruses which harvest users' address books, and are sold to other spammers. Much spam is sent to invalid e-mail addresses. ISPs have attempted to recover the cost of spam through lawsuits against spammers, although they have been mostly unsuccessful in collecting damages despite winning in court.
From the beginning of the Internet, the sending of junk e-mail has been prohibited, enforced by the Terms of Service/Acceptable Use Policy of internet service providers (ISPs) and by peer pressure. Even with a thousand users junk e-mail for advertising is not tenable, and with a million users it is not only impractical, but also expensive. It is estimated that spam cost businesses on the order of $100 billion in 2007. As the scale of the spam problem has grown, ISPs and the public have turned to government for relief, which has failed to materialize.
Spam has several definitions, varying by the source.
Many spam e-mails contain URLs to a web site or web sites. According to a Commtouch report in June 2004, "only five countries are hosting 99.68% of the global spammer web sites", of which the foremost is China, hosting 73.58% of all web sites referred to within spam.
According to information compiled by Spam-Filter-Review.com, e-mail spam for 2006 can be broken down as follows.
|E-mail Spam by Category|
Rolex watches and Viagra-type drugs are two common products advertised in spam e-mail.
Advance fee fraud spam such as the Nigerian "419" scam may be sent by a single individual from a cyber cafe in a developing country. Organized "spam gangs" operating from Russia or Eastern Europe share many features in common with other forms of organized crime, including turf battles and revenge killings.
Spam is also a medium for fraudsters to trick users into entering personal information on fake Web sites, using e-mail forged to look like it is from a bank or other organization such as PayPal. This is known as phishing. Spear-phishing is targeted phishing that uses known information about the recipient, for example by making the message look like it comes from the recipient's employer.
Spam is growing, with no signs of abating. The amount of spam users see in their mailboxes is just the tip of the iceberg, since spammers' lists often contain a large percentage of invalid addresses and many spam filters simply delete or reject "obvious spam".
According to Steve Ballmer, Microsoft founder Bill Gates receives four million e-mails per year, most of them spam. At the same time Jef Poskanzer, owner of the domain name acme.com, was receiving over one million spam emails per day.
In terms of volume of spam: According to Sophos, the major sources of spam in the fourth quarter of 2008 (October to December) were:
When grouped by continents, spam comes mostly from:
In terms of number of IP addresses: The Spamhaus Project (which measures spam sources in terms of number of IP addresses used for spamming, rather than volume of spam sent) ranks the top three as the United States, China, and Russia, followed by Japan, Canada, and South Korea.
In terms of networks: As of 5 June 2007, the three networks hosting the most spammers are Verizon, AT&T, and VSNL International. Verizon inherited many of these spam sources from its acquisition of MCI, specifically through the UUNet subsidiary of MCI, which Verizon subsequently renamed Verizon Business.
Some popular methods for filtering and refusing spam include e-mail filtering based on the content of the e-mail, DNS-based blackhole lists (DNSBL), greylisting, spam traps, enforcing the technical requirements of e-mail (SMTP), and checksumming systems to detect bulk e-mail. Each method has strengths and weaknesses, and each is controversial because of its weaknesses.
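The checksum approach to detecting bulk e-mail can be illustrated with a minimal sketch. It assumes a deliberately simple scheme in which identical bodies, after light normalization, hash to the same value; the function names, the normalization rule, and the threshold are hypothetical choices made for illustration and do not describe any particular deployed system.

```python
import hashlib
import re
from collections import Counter


def body_checksum(body: str) -> str:
    """Hash a message body after light normalization (hypothetical scheme)."""
    # Lowercase and collapse whitespace so trivial reformatting
    # does not change the checksum.
    normalized = re.sub(r"\s+", " ", body.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_bulk_checksums(bodies, threshold=3):
    """Return the checksums that occur at least `threshold` times."""
    counts = Counter(body_checksum(b) for b in bodies)
    return {digest for digest, seen in counts.items() if seen >= threshold}


# Three near-identical copies and one unrelated message.
messages = [
    "Buy cheap watches now!",
    "buy  cheap watches NOW!",
    "Buy cheap watches now!\n",
    "Lunch on Friday?",
]
bulk = find_bulk_checksums(messages, threshold=3)
```

An exact hash like this one is defeated by any change that survives normalization, such as a random word appended to each copy, which is one example of the weaknesses such methods have.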
Anti-spam techniques should not be applied to abuse e-mail addresses, although they commonly are. When they are, a spam message that someone reports to a host is caught in the host's spam filter, and the host remains unaware that its network is being exploited by spammers.
In order to send spam, spammers need to obtain the e-mail addresses of the intended recipients. To this end, both spammers themselves and list merchants gather huge lists of potential e-mail addresses. Since spam is, by definition, unsolicited, this address harvesting is done without the consent (and sometimes against the expressed will) of the address owners. As a consequence, spammers' address lists are inaccurate. A single spam run may target tens of millions of possible addresses, many of which are invalid, malformed, or undeliverable.
If spam is "bounced" (returned to the sender by programs that eliminate spam), or if the recipient clicks on an unsubscribe link, the e-mail address may be marked as "valid", which the spammer interprets as "send me more".
A common practice of spammers is to create accounts on free web mail services, such as Hotmail, to send spam or to receive e-mailed responses from potential customers. Because of the amount of mail sent by spammers, they require several e-mail accounts, and use web bots to automate the creation of these accounts.
In an effort to cut down on this abuse, many of these services have adopted a system called the captcha: users attempting to create a new account are presented with a graphic of a word, rendered in a strange font on a difficult-to-read background. Humans are able to read these graphics, and are required to enter the word to complete the application for a new account, while computers are unable to get accurate readings of the words using standard techniques. Blind users of captchas typically get an audio sample.
Early on, spammers discovered that if they sent large quantities of spam directly from their ISP accounts, recipients would complain and ISPs would shut their accounts down. Thus, one of the basic techniques of sending spam has become to send it from someone else's computer and network connection. By doing this, spammers protect themselves in several ways: they hide their tracks, get others' systems to do most of the work of delivering messages, and direct the efforts of investigators towards the other systems rather than the spammers themselves. Increasing broadband usage has given rise to a great number of computers that are online as long as they are turned on, and whose owners do not always take steps to protect them from malware. A botnet consisting of several hundred compromised machines can effortlessly churn out millions of messages per day. This also complicates the tracing of spammers.
Many spam-filtering techniques work by searching for patterns in the headers or bodies of messages. For instance, a user may decide that all e-mail they receive with the word "Viagra" in the subject line is spam, and instruct their mail program to automatically delete all such messages. To defeat such filters, the spammer may intentionally misspell commonly filtered words or insert other characters. The principle of this method is to leave the word readable to humans (who can easily recognize the intended word despite such misspellings), but unlikely to be recognized by a literal computer program. This is only somewhat effective, because modern filter patterns have been designed to recognize blacklisted terms in their various misspelled forms. Other filters target the obfuscation methods themselves, such as the non-standard use of punctuation or numerals in unusual places (for example, within a word).
(Using most common variations, it is possible to spell "Viagra" in over 1.3 × 10^21 ways.)
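The difference between a literal keyword match and a pattern that tolerates misspellings can be sketched as follows. The substitution table `LOOKALIKES`, the helper names, and the limit of two separator characters between letters are all assumptions made for this illustration; real filters use far larger tables, which is how the number of possible spellings grows so large.

```python
import math
import re

# Hypothetical substitution table: characters a spammer might
# swap in for each letter. Real tables are much larger.
LOOKALIKES = {
    "a": "a@4",
    "g": "g9",
    "i": "i1!|",
}


def tolerant_pattern(word: str) -> "re.Pattern":
    """Build a regex that matches `word` despite look-alike
    characters and inserted punctuation."""
    parts = []
    for ch in word.lower():
        choices = LOOKALIKES.get(ch, ch)
        parts.append("[" + re.escape(choices) + "]")
    # Allow up to two non-alphanumeric characters between letters.
    return re.compile(r"[\W_]{0,2}".join(parts), re.IGNORECASE)


def count_variants(word: str) -> int:
    """Count the spellings this table produces by character
    substitution alone (separators not counted)."""
    return math.prod(len(LOOKALIKES.get(ch, ch)) for ch in word.lower())


pattern = tolerant_pattern("viagra")
subjects = ["V1agra for less", "V.i.a.g.r.a today", "V-1-@-g-r-a", "Meeting agenda"]

literal_hits = [s for s in subjects if "viagra" in s.lower()]
tolerant_hits = [s for s in subjects if pattern.search(s)]
variants = count_variants("viagra")
```

In this sketch the literal test catches none of the obfuscated subjects, while the tolerant pattern catches all three. Even this three-entry table yields 72 spellings by substitution alone. Allowing separators also invites false positives: the phrase "via grand" would match.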
HTML-based e-mail gives the spammer more tools to obfuscate text. Inserting HTML comments between letters can foil some filters, as can including text made invisible by setting the font color to white on a white background, or shrinking the font size to the smallest fine print.
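A filter can counter the comment trick by matching against the text a reader would see rather than the raw HTML. The following is a minimal sketch using Python's standard `html.parser` module; the class and function names are hypothetical. It handles only the comment-insertion case and does nothing about text hidden by color or font size.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes only; tags and comments are ignored
    by the default handlers."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html_body: str) -> str:
    """Return the text content of an HTML fragment, with tags
    and comments removed."""
    parser = VisibleTextExtractor()
    parser.feed(html_body)
    parser.close()
    return "".join(parser.chunks)


sample = "<p>Vi<!-- x -->ag<!-- y -->ra</p>"
raw_match = "viagra" in sample.lower()
text_match = "viagra" in visible_text(sample).lower()
```

Here the keyword is not found in the raw markup but is found once the comments are discarded.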
Another common ploy involves presenting the text as an image, which is either sent along with the message or loaded from a remote server. This can be foiled by not permitting an e-mail program to load images.
As Bayesian filtering has become popular as a spam-filtering technique, spammers have started using methods to weaken it. To a rough approximation, Bayesian filters rely on word probabilities. If a message contains many words which are only used in spam, and few which are never used in spam, it is likely to be spam. To weaken Bayesian filters, some spammers, alongside the sales pitch, now include lines of irrelevant, random words, in a technique known as Bayesian poisoning. A variant on this tactic may be borrowed from the Usenet abuser known as "Hip crime" -- to include passages from books taken from Project Gutenberg, or nonsense sentences generated with "dissociated press" algorithms. Randomly generated phrases can create spoetry (spam poetry) or spam art.
After these nonsense subject lines were recognized as spam, the next trend in spam subjects started: Biblical passages. A program much like Mark V Shaney is fed Bible passages and chops them up into segments. The reasoning is that this text, such as that of the King James Version, is often very different from the writing style of today and so will confuse both humans and spam filters.
However, many or most Bayesian filtering programs use only the most spam-like and least spam-like words when deciding whether an e-mail is spam. Injected words that are unrelated to spam do not correlate well with spam, so they are rarely among the words the filter considers and usually do not affect the result. They do decrease the filter's effectiveness slightly, which for spammers can make a significant percentage difference in the number of users who actually see their spam.
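The limited effect of poisoning can be illustrated with a simplified scoring sketch. It assumes a filter that keeps only the few words whose spam probability is furthest from a neutral 0.5 and combines them with a naive Bayes product. The word probabilities, the value used for unknown words, and the choice of three words are all hypothetical values chosen for the example, not figures from any real filter.

```python
import math

# Hypothetical per-word spam probabilities, standing in for
# what a trained filter might have learned.
WORD_SPAM_PROB = {
    "viagra": 0.99,
    "pills": 0.95,
    "cheap": 0.90,
    "meeting": 0.05,
    "thee": 0.45,
    "begat": 0.50,
    "river": 0.48,
}
UNKNOWN_PROB = 0.4  # assumed value for words never seen in training


def spam_score(words, top_n=3):
    """Combine the `top_n` most decisive word probabilities
    into a single spam probability."""
    probs = [WORD_SPAM_PROB.get(w.lower(), UNKNOWN_PROB) for w in words]
    # Keep only the words furthest from the neutral value 0.5.
    decisive = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam_like = math.prod(decisive)
    ham_like = math.prod(1.0 - p for p in decisive)
    return spam_like / (spam_like + ham_like)


pitch = ["cheap", "viagra", "pills"]
plain = spam_score(pitch)
# Near-neutral filler never reaches the decisive set.
with_random_words = spam_score(pitch + ["thee", "begat", "river"])
# A strongly legitimate word displaces one spam word.
with_legit_word = spam_score(pitch + ["meeting"])
```

With these assumed values, the near-neutral filler leaves the score unchanged, while a word strongly associated with legitimate mail lowers it slightly without bringing it anywhere near neutral.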