A Quick Introduction to News Research on the Web

News Research Links - Simplified Links Page

Skip to the Search Engine Note or Web Page Reliability

I work for the St. Louis Post-Dispatch as a News Researcher, which is a fancy name for Librarian. Unlike Public Librarians, our main job is to do research for reporters and editors (and artists and photographers and other newspaper employees), rather than help members of the public -- although we do handle some public requests, for a fee and on a limited basis.

Some days I think of myself as a librarian, other days I think of myself as a journalist. I'm really both.

The two main functions of our department (officially the News Research Dept., also still known variously as the Reference Dept., the Library and "The Morgue") are to maintain an archive of the paper and to do research. While all of us do at least some work on both of those functions, our job titles indicate which of them is our "main" job, and I'm mainly a researcher.

The archive is electronic these days, but years ago they used to cut stories out of the paper and file them away in envelopes, in boxes a bit like shoe boxes, but longer and narrower and a little bit taller. Although these are maintained primarily for our own use, we do also handle public requests for old articles -- but our prices are fairly steep, and our turnaround is slow (usually 4-6 weeks) because we have to work it in around our normal duties. On the other hand, if you want a copy of a recent story, stories from our electronic archive back to 1988 are available for $2.95 per article from STLtoday, the Post's electronic sister site, and several vendors like Lexis-Nexis and Factiva. (By the way, in St. Louis, when you say "The Post" you don't mean the Washington Post, you mean the Post-Dispatch.)

What I mostly do is research for reporters -- sometimes in our old clippings files from the pre-electronic days, sometimes in our collection of reference books, sometimes online. I spend probably half my day doing online research of one kind or another.

Many of the resources I use when I do research for reporters are proprietary databases we pay money to access, but I also use a lot that's available free to anyone on the World Wide Web. I've compiled an extensive list of categorized bookmarks, which may be of interest to folks doing various kinds of research. I have deleted from this list a few links from the P-D Intranet and subscription bookmarks our department pays for. You wouldn't be able to use them without a password, anyway.

For anyone who wants to know why I bother with such an extensive list of bookmarked links when I can just use Google, see my note about Search Engines below for details, but basically there's a lot of information that is not indexed by search engine sites, and a lot of what does come up in Google searches (or any other such search) is of dubious reliability. If I get a fact from a press release from a government department, it is reliable enough to put in the newspaper, and there's a trackable source that can be held accountable if it turns out to be wrong. If I quote a "fact" from some college student's web page in the newspaper, I'd better cross my fingers and hope it's right. There are, in fact, some "inspired amateurs" whose sites have earned a reputation for trustworthiness -- the Internet Movie Database, for example, has long been relied upon by journalists, and though it is not error-free, neither is the average copy of the New York Times.

An extensive array of reliable bookmarks is essential to a serious news researcher. Don't get me wrong, I love search engines -- I probably use Google a dozen times a day, and All the Web and Altavista and Teoma and several others several times a week each. But I would rather go directly to a site that has information I can give a reporter with confidence than stumble around the Web hoping to find something useful. I offer a few rules of thumb below about the reliability of pages found through search engines.

Even more than most webpages, this page is obviously a work in progress and will always be under construction. However, this page takes a LOT of work to create and maintain (I don't have Dreamweaver and despise Front Page and am doing my own HTML with a simple editor that just simplifies most of the tag entries for me), so it won't be updated often. The previous version was on the Web unchanged for nearly three years, and this one has four times as many links, so you can see it changed a lot off-line in between updates. I'll try to do better than that, but frankly make no promises. My goal is every two or three months. I will, however, update the simplified version (just a straight list of links, no intro, no category tables) fairly frequently, probably weekly, possibly more or less depending mainly on how much it changes. Whenever I add new sites or delete dead links, I'll try to remember to post a copy to the website.

Some, but not all, of the sites have descriptions or comments. The titles are sometimes the site's own and sometimes either edited or provided by me. Descriptions or parts of descriptions in quotes are from the site itself, either from a longer title I've truncated, from somewhere on the main page, or sometimes from a META tag (descriptive information that doesn't display in the browser but is used by search engines and the like). Otherwise, the descriptions are my personal opinions, not to be confused with or taken for the opinions of the News Research Department or the St. Louis Post-Dispatch.

Back to News Research Links or Skip to Simplified Version

A Note Regarding Search Engines

A question I used to be asked much more often, and that still comes up a lot, is "What's your favorite search engine?" The short answer is "Google" (whose popularity is the main reason I don't get asked this question as much as I used to), but the real answer is more complicated.

First off, you should understand that search engines do not do what most people think they do -- that is, they do not go out searching across the Internet and bring you back links that match your search. A search engine, properly speaking, is a piece of software that searches a database. There are lots of different kinds of search engines, and the popular search sites we usually refer to by that name are a small portion of them. Any software that searches any database is, properly speaking, a search engine.

For instance, if you go to the National Highway Traffic Safety Administration (NHTSA), you'll find that one of the resources available there is a database of customer complaints about automobiles with respect to safety issues, searchable by make, model and year. The software that searches this database is, technically, a search engine, as is the search bar function of Amazon.com or the Internet Movie Database. But of course, that's not how we usually use the term.

The search engine sites we're all familiar with have compiled databases of many, many web pages. They do this in two ways. People send them URLs of web pages they have created, and they have robot programs that investigate links on existing pages they already know about. Thus, if one of these robot programs (usually called "spiders") came upon this page, it would note every page this page links to, and then all the pages those pages link to, etc.
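The link-following behavior of a spider can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page names and the LINKS table are made up, and a small in-memory table stands in for the live Web, where a real spider would have to fetch and parse each page.

```python
from collections import deque

# Hypothetical link graph standing in for the live Web: each page maps to
# the pages it links to. A real spider would fetch and parse each page.
LINKS = {
    "intro.html": ["links.html", "simple.html"],
    "links.html": ["nhtsa.example/complaints", "imdb.example/"],
    "simple.html": ["links.html"],
    "nhtsa.example/complaints": [],
    "imdb.example/": ["intro.html"],
    "orphan.html": [],  # nothing links here, so the spider never finds it
}


def crawl(start_pages, links):
    """Breadth-first walk: note every linked page, then the pages those link to."""
    seen = set(start_pages)
    queue = deque(start_pages)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


discovered = crawl(["intro.html"], LINKS)
print(discovered)
```

Note that "orphan.html" never turns up: a page that nobody links to, and that nobody submits by hand, stays invisible to the spider.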

The search engine proper, that is the actual piece of software that does the search, doesn't search the whole Internet. It can't. It searches this database, this collection of pages that the search engine site has compiled. Most of them have millions of pages -- Google and All the Web have billions -- but they often have only the main page and perhaps a few others of most sites. Moreover, they have no way of getting at the information in online databases like the NHTSA complaint records, because the server creates pages on the fly based on your search, and there's nothing there for them to index. Because of this, it's been estimated that as little as 1% of the information available on the web is actually indexed by the search engine sites.
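As a rough sketch of the idea that the search software looks only at a compiled collection of pages, here is a toy word index in Python. The URLs and page texts are invented for illustration, and real search sites use far more elaborate indexing and ranking than this.

```python
import re
from collections import defaultdict

# Hypothetical pages a spider has already collected (URL -> page text).
COLLECTED_PAGES = {
    "http://example.org/recalls": "Tire recalls and safety complaints by model year",
    "http://example.com/movies": "Movie reviews and cast lists",
    "http://example.net/tires": "Discount tire prices and reviews",
}


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every query word; only indexed pages can match."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


INDEX = build_index(COLLECTED_PAGES)
print(sorted(search(INDEX, "tire reviews")))
print(sorted(search(INDEX, "brake complaints 1999")))
```

The second query comes back empty even if a complaint record matching it exists in some online database, because a page generated on the fly was never in the collection to begin with.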

Although technically the term "search engine" refers to the search software, not the database, the fact is we all refer to the totality of a site like All the Web or AltaVista as a search engine, so I shall do so as well. I've also gone along with the technically incorrect usage of referring to directories as search engines. The difference between a directory (such as Yahoo!) and a search engine (such as Google) is simple yet profound. Search engines attempt to index as much of the web as possible, and leave it to the search software to find the pages you are interested in. Directories, on the other hand, while they can also be searched, are organized into categories, usually in a hierarchical structure. The advantage is that everything in Yahoo! has been seen by a human being, who decided that this site was worthy of inclusion and what category it belonged in. The search engine sites are more comprehensive, but also more likely to link you to useless or unreliable sites (although today's search engine sites have gone a long way toward solving that problem). The directories are precise and reliable, but waiting for humans to review the pages puts them severely behind.

There's not that much difference in how they're generally used, however, so I've lumped all these in together in the category of "General" search engines.

So which search engine (or directory) is best? It depends on what you're looking for. Even counting pages that are actually available for indexing (leaving aside, in other words, the information contained in databases like the NHTSA complaint records), no single search engine database actually contains anything like the totality of the web. A study in March 2002 showed that out of 141 distinct pages returned by a deliberately small search across 10 of the most popular search engines, 71 of the pages only showed up on one search engine, while another 30 showed up on only two. Google is not perfect, All the Web doesn't really search all the web, and a good researcher needs to have several such search tools at his or her disposal.
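The study's figures are easier to appreciate as percentages. The short calculation below uses only the three numbers quoted above. The "three or more" count is inferred by subtraction, on the assumption that each of the 141 pages falls into exactly one of the three groups.

```python
total_pages = 141      # distinct pages returned across the 10 engines
on_one_engine = 71     # pages found by exactly one engine
on_two_engines = 30    # pages found by exactly two engines

# Inferred by subtraction; the study summary above does not state this figure.
on_three_or_more = total_pages - on_one_engine - on_two_engines


def share(count, total=total_pages):
    """Percentage of the total, rounded to one decimal place."""
    return round(100 * count / total, 1)


print(f"one engine only: {share(on_one_engine)}%")
print(f"exactly two:     {share(on_two_engines)}%")
print(f"three or more:   {share(on_three_or_more)}% ({on_three_or_more} pages)")
```

In other words, about half of the pages would have been missed by anyone relying on any single engine other than the one that happened to carry them.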

My favorite search engine is Google, and something like 8 or 9 times out of 10 Google is going to find a page -- not necessarily the precise page you think you're looking for, mind you, but a page that has the information that you're looking for -- and put it on the first page of the search results. (The reliability of the information is another matter, dealt with below.) It's really amazing, especially to those of us who know how hard it is to do what Google's doing, how well it works. The fact that they have only text ads and no banners to slow things down helps immensely -- and has forced their competitors, for the most part, to follow suit, which is wonderful.

Still, Google doesn't always give you what you need, and often doesn't quite give you what you want, so I have several other such "General" search engines listed here.

My "Specialized" and "People" categories go back to the strict technical definition of search engines. They do not search databases of websites (well, most of them don't -- more on that in a moment), but internal specialized information databases, like the NHTSA complaint records database mentioned above, or the more famous Internet Movie Database -- a wonderful compendium of information about motion pictures. Actually, there are some web search sites in the "Specialized" category. An example is Findlaw, a legal directory site that includes a search engine much like Google except that its database of websites is restricted to those involved in legal matters -- and it is more complete in indexing those sites than any generalized search engine site. My first few "Specialized" sites are databases of specialized search sites themselves, which are again Google-like databases of Web pages -- but in this case the web pages are all specialized search sites like IMDB or Findlaw.

Anyone interested in learning more about this topic should check out Search Engine Watch, the first link in my "About Search Engines" category.


Judging the Reliability of Web Pages

OK, none of my bookmarks helped you so you did a search on Google or All the Web and came up with some hits that told you exactly what you needed to know -- how do you know whether or not you can trust the information?

Too many people don't even consider this question. Some just accept anything found on the Web as uncritically as they would information in an encyclopedia or their daily newspaper (two sources which actually shouldn't be accepted uncritically either, but are probably more reliable than the Web taken as a whole). Others are dubious about the Internet as a research tool altogether, regarding the Web as a swamp of misinformation, urban legend and slanted articles parading extremist left-wing or right-wing opinion as "news."

The first thing you need to do is look closely at the URL (Uniform Resource Locator -- the web page "address") in the "Address" or "Location" bar on your browser. For instance, for this page it is "http://home.swbell.net/jsteveb/intro.html". What we're interested in here is the last bit before the first single slash -- or, if there are no slashes after the "http://", the very last bit.
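For anyone who would rather let a program pick out that "last bit," here is a minimal Python sketch using the standard library's URL parser. The function name host_and_suffix is made up for this example.

```python
from urllib.parse import urlsplit


def host_and_suffix(url):
    """Return (host, last dot-separated piece of the host) for a URL."""
    host = urlsplit(url).hostname or ""
    suffix = host.rsplit(".", 1)[-1] if "." in host else ""
    return host, suffix


print(host_and_suffix("http://home.swbell.net/jsteveb/intro.html"))
print(host_and_suffix("http://www.amazon.co.uk"))
```

For this page the host is "home.swbell.net" and the suffix is "net"; for the second address, which has no slashes after the "http://", the suffix is simply the very last piece, "uk".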

(A side note: the "home" in this URL tells you that this is probably the personal home page for an individual subscriber to an Internet Service Provider, many of whom have personal web pages set up this way ("members" is a similar clue). So you can tell right away that my site isn't going to be one of the extremely reliable ones. You're going to want to take anything learned here with a grain of salt. Also watch for "geocities" and "angelfire" sites, which are not only generally personal, but free, meaning anybody and everybody can get one without even having to pay money for the privilege. That doesn't necessarily mean they're worthless -- one of my bookmarks is a geocities page -- but it is a red flag that should urge you to more-than-usual caution.)

Although there are a few new domains like ".biz" and ".info," most of the sites you're going to find on the Internet end in one of the main 3-letter designations .com, .net, .org, .gov or .edu, or in the two-letter code for a country. The U.S. is .us, but that is mostly used for state and local governments, while most sites from Canada and the United Kingdom use .ca and .uk, including commercial sites (the British version of Amazon, for instance, is at http://www.amazon.co.uk).

The most reliable sites are going to be ".gov" sites. These are all from the U.S. Federal Government, which may not always be trustworthy but at least maintains reliable sources of figures and statistics like the census report.

The next most reliable is probably ".org" -- but be careful. Not only can any non-profit organization get a .org domain, some unscrupulous profit-making enterprises have gotten hold of them simply by lying. So while you'll find the Red Cross and the American Cancer Society here, not everybody in this domain group will be as reliable. See below for further details on how to judge sites of organizations unfamiliar to you.

The tricky ones are ".edu" sites. All such sites are (or should be) registered to four-year degree-granting colleges and universities. But not everything on a school website is authorized or endorsed by the school itself. Many of them have personal home pages available for faculty and/or students. If you see an article about Greek Mythology with ".edu" in the URL, it makes a considerable difference whether it's part of the Classics Department or just a college sophomore's essay.

The .net designation was originally intended for companies and organizations directly involved in providing Internet services, but the distinction between it and .com eroded long ago. Anybody with a company name for which some similarly-named company already has the .com version registered is likely to be found at companyname.net instead, and it's not unusual these days for a company to register both at once. Comments below on .com apply equally to .net domains.

Anybody with money can register a .com domain name. You do not need to be certified, licensed, or prove you have the right to it. After the fact, a trademark holder can often get your site taken down, if you try to put up a site called, say "toyota.com," but the fact is that unless it's already taken, initially you can do so unimpeded. So even if you're at, say "mcdonalds.com" it doesn't mean you're actually at the site of the fast-food franchise that features a clown mascot (in this case you would be, but that's beside the point). In the early days of the Web there were a lot of "domain prospectors" who would tie up dozens or even hundreds of domain names and offer to sell them to companies, often for ridiculous amounts. ICANN took care of that by pretty much automatically deciding in favor of trademark holders in most disputes, perhaps leaning too far the other way (taking away domain names from Harry Potter fan sites because Warner wanted to register everything remotely resembling anything Potteresque in preparation for the first movie release, for instance). Still, typing "companyname.com" is not always going to get you to the official site of company X, and you may in fact be surprised at what you turn up that way.

Assuming you really have reached the official website for a company, some information there is more reliable than others. Many public corporations now post their SEC filings and/or annual reports (usually under "investor information" or some such), and this is relatively reliable (a bit more so since the crackdowns on misleading filings by companies like Worldcom). On the other hand, press releases are generally going to be puff pieces touting only the positive things going on in the company in outrageously complimentary language. As you're backing through the site from the random page you've found (see below), check to see how it has been categorized. That may give you some clue as to its trustworthiness.

The vast majority of web pages found with searches on sites like Google will be located on .com sites. They automatically have a lower reliability than the other domain names.

I don't have much to say about the two-letter international designations. Most .us sites are state and local governments, which are relatively reliable, but some small third world countries essentially set themselves up as providers of domain names to raise money, and have even less accountability about what's on their site than, say, geocities or angelfire (2 free home page services), which at least respond to court orders regarding people using the Web for illegal purposes. In these countries' cases, the internet service provider IS usually the government, and as long as the people being ripped off are Americans or Europeans, they're often more concerned with the money they're getting from the people using the domain name than with holding those people accountable. Check a list of these International codes to see where the page you're looking at is registered before deciding how far to trust it. One such list is at http://www.101domain.com/domain_whois_server.php
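The rules of thumb above can be collected into a small lookup, sketched here in Python. This is only a first-impression aid, not a rating system: the wording of each note, the function name first_impression, and the choice of which host-name pieces count as "personal" hints are assumptions made for this example.

```python
from urllib.parse import urlsplit

# First-impression notes paraphrasing the rules of thumb above. The wording
# is this sketch's assumption, not a formal rating scheme.
SUFFIX_NOTES = {
    "gov": "U.S. federal government: most reliable for figures and statistics",
    "org": "usually a non-profit: fairly reliable, but check who is behind it",
    "edu": "college or university: official page or someone's personal page?",
    "com": "anyone with money can register: lower reliability by default",
    "net": "treat like .com",
    "us": "mostly state and local government: relatively reliable",
}

# Pieces of a host name that suggest a personal or free home page.
PERSONAL_LABELS = {"home", "members", "geocities", "angelfire"}


def first_impression(url):
    """Return (suffix note, personal-page flag) for a URL."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    suffix = labels[-1] if len(labels) > 1 else ""
    if suffix in SUFFIX_NOTES:
        note = SUFFIX_NOTES[suffix]
    elif len(suffix) == 2:
        note = "country code: look up where the domain is registered"
    else:
        note = "unfamiliar suffix: investigate further"
    looks_personal = any(label in PERSONAL_LABELS for label in labels)
    return note, looks_personal


for address in ("http://home.swbell.net/jsteveb/intro.html",
                "http://www.amazon.co.uk"):
    print(address, first_impression(address))
```

A flag from a lookup like this is only a reason to look harder. It cannot tell a Classics Department page from a sophomore's essay, which is what the steps that follow are for.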

The next thing you need to do is find the home page for the site that the page you're looking at is part of. Look at the URL again.

Point your cursor to the very end of the URL and click. Now hit your Backspace key until you come to the last "/" in the URL (if it ends in a "/" then backspace to the previous one). Now hit return. Again, doing this on the current page would yield "http://home.swbell.net/jsteveb/". If you still get a page of ambiguous origin (you can't tell who's responsible for it), do it again. And again. You may get error messages like "You are not authorized to view that directory." Just go up and backtrack to the next previous slash.
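The backspacing procedure just described is mechanical enough to write down as a short Python function. This is a sketch with a made-up function name, parent_urls, and it simply drops any query string or fragment rather than trying to interpret it.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """List the URLs reached by trimming back to each earlier slash in turn."""
    parts = urlsplit(url)
    path = parts.path
    parents = []
    while True:
        # Drop a trailing slash first, so we back up to the previous one.
        trimmed = path[:-1] if path.endswith("/") else path
        if "/" not in trimmed:
            break
        path = trimmed[: trimmed.rfind("/") + 1]
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


for candidate in parent_urls("http://home.swbell.net/jsteveb/intro.html"):
    print(candidate)
```

For this page the function lists "http://home.swbell.net/jsteveb/" first and "http://home.swbell.net/" second, the same addresses you would reach by hand. Whether any of them actually loads, or returns a "not authorized" error, is something only trying them will tell you.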

In this case, doing this once will give you my home page, which is as far back as you can usefully go. My page is on the server at "http://home.swbell.net/" but is not directly tied to Southwestern Bell (now a nameless part of SBC Corp.). Similarly, if you find the home page of a student or a teacher on a university site, going back to the main home page for the university probably won't do you much good. It's the home page for the person, company or organization responsible for the page you found with the search engine that you're looking for.

The basic journalistic questions are as good for judging the reliability of sources as they are for gathering information: Who? What? Where? When? Why? How? The What? you've already answered (whatever information you found on the page your search turned up), so let's look at the others.

Who? Who put this information on the World Wide Web? If it is an organization or a company, is it a familiar one? If not, does the home page provide a link to information, like a bio or resume for an individual or a list of officers and mission statement for a company or organization? Is there a link to a way to contact this person (complete address, phone number and e-mail is best -- e-mail only could indicate privacy concerns with an individual, but with a company or organization could be a red flag for a fly-by-night outfit with no fixed place of business)? If there is a phone number, does it work? Can the person on the other end answer your questions?

Where? This will actually not be all that important in most cases, but if the information is highly localized, it might make some difference whether or not the provider is actually familiar with the territory. The contact information mentioned above should be helpful in determining this, if it's there.

When? How old is this information? Is the website updated periodically or is it an old site that hasn't been revamped since Netscape was the hottest browser around? There are subtle ways you can sometimes find out this information, but the most reliable pages should have update dates right on their pages.

How? I'm taking this one out of order to save the best for last. How was the information gathered or compiled? If it presents information coming from a study, does it quote the original authors and publication information? Can you easily determine how this person or organization obtained this information? Hiding or neglecting citations is a BIG red flag -- I don't ever use information that anyone claims comes from "a study" or "research" and doesn't cite sources for.

Why? Why did whoever put this information on the Web do so? The answer will almost never be to simply share information -- and ironically, when it is, it will nearly always be one of those individual sites you have to automatically think of as suspect. Companies and organizations have a point of view they are trying to promote, and the information they provide to the public nearly always coincides with this. That doesn't necessarily mean it's wrong, but it's something to take into account when you judge its reliability.

Finally, I seldom, if ever, take the word of a single source. Maybe the Census Dept., simply because their figures are the official figures even if they're wrong, but generally I try to find two or three different sources that say more-or-less the same thing. NOT exactly the same wording -- this often indicates what I like to call weblication, the proliferation of many web sites copying information from each other without regard to its authenticity. Even if the information is good, five copies of the same paragraph on different web sites is not nearly as comforting as two different reports with essentially the same facts.
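One rough way to spot weblication is to compare the wording of two passages rather than their facts. The Python sketch below does this with the standard library's difflib module. The three sample sentences are invented, the function names are made up for this example, and the 0.8 cutoff is an arbitrary assumption that would need tuning on real text.

```python
from difflib import SequenceMatcher

# Made-up sample passages, for illustration only.
REPORT_A = "The festival drew about 5,000 visitors last year, according to organizers."
REPORT_B = "the festival drew about 5,000 visitors last year, according to organizers."
REPORT_C = "Organizers estimate that roughly 5,000 people attended the event in its most recent year."


def wording_overlap(text_a, text_b):
    """Similarity of two passages' wording, from 0.0 (unrelated) to 1.0 (identical)."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


def looks_copied(text_a, text_b, threshold=0.8):
    """Flag pairs whose wording is suspiciously close. The 0.8 cutoff is arbitrary."""
    return wording_overlap(text_a, text_b) >= threshold


print(looks_copied(REPORT_A, REPORT_B))  # same sentence, differing only in case
print(looks_copied(REPORT_A, REPORT_C))  # same fact, independent wording
```

The first pair is flagged as a likely copy, while the second pair, which agrees on the figure but not on the wording, is the kind of match that actually counts as a second source.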

How many sources you need to be sure you've got the right answer depends to some extent on both the kind of question you're asking and the level of reliability of your first source. If even one independent source confirmed something from the Internet Movie Database, that would be fine with me. If I found something on some college student's home page, I'd be dubious even if half a dozen other students confirmed it, though I'd be leaning in the direction that it was likely to be true (assuming different wording, etc. indicating they weren't all quoting the same source).

Always, always, always try to find at least two sources. Sometimes you can't. Sometimes you really don't need to, as in the Census example -- the official population of the U.S. is what the Census Dept. says it is, and if they're wrong, their figure is STILL the official one. But get in the habit of assuming that the Encyclopedia Britannica and the New York Times can be wrong. As the old newspaper saying goes, "If your mother says she loves you, check it out."


The Research Wizard

Comments? Questions? Drop me a line
this page was last updated: 03/06/2004

All pages in this webspace created by J. Stephen Bolhafner. Contents copyright J. Stephen Bolhafner, except where noted.
