
April 2, 1999

Title: The Structure of Information on Algae


John G. Rueter
Department of Biology
Portland State University
PO Box 751
Portland, OR 97207-0751

For submission to Journal of Phycology

Abstract:

The domain of information on phycology can be approximated by combining the concepts of algae, phytoplankton and cyanobacteria. This information can be searched electronically and its structure analyzed. Information on algae is found in many different journals. It can be divided into categories that correspond to sub-disciplines of the life sciences. These categories form one part of a search strategy; specific search terms and indexing are others. The value of phycological information in both research and education depends on a well-structured domain, which is the responsibility of both individual authors and the Journal of Phycology.


Keywords: algae, search strategies


Introduction:


The information on algae has a "structure" that is related to how this information is stored and how it can be retrieved. As users of that information we need to understand how to access it. As creators of that information we have control, to some degree, over how it is stored and indexed. There are five general ways that information can be organized: by location, alphabetically, by time, in categories or in a hierarchical scheme (Wurtman 1997). In the Journal of Phycology we use four of these (all except location). We keep track of the date of publication, we can search alphabetically by author or from a list of index terms, each Journal issue is divided into categories that are sub-disciplines of the life sciences, and genus and species names (a hierarchical scheme) are also indexed.

The goal of this paper is to describe how to characterize the set of all papers that have to do with algae or phytoplankton and then to determine how we can search for specific topics in that entire set. Once this "domain" of literature on algae has been established, it is much easier to search for information on particular topics. The key to a good search strategy within a domain is to use key terms that are sufficiently broad to identify all the papers of interest but narrow enough to eliminate the rest of the papers in the domain. In other words, good key terms describe the target information but are used relatively infrequently in the rest of the papers. Each term has a frequency of occurrence in the entire domain, i.e. there are a certain number of papers on that topic or with that keyword. The parameter of interest is the IDf (inverse document frequency), which equals the log of the quantity (total number of documents in the domain divided by the number of documents that contain a particular term). Low values of IDf indicate that the term does not discriminate between articles in the domain. For example, the terms "biochemistry" and "biochemical" are used so broadly that they lose their value as search terms: over 34% of the articles in the domain for algae (defined below) contain one of them. The other important aspect of a good search is the use of Boolean operators. The Boolean operators "OR", "AND" and "NOT" can be used in combination to expand, limit or exclude terms from a search, respectively. A good search strategy must use these operators explicitly and not rely on the default logic used by different search engines.
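As a minimal sketch of the IDf calculation described above, the following Python function takes the size of the domain and the number of documents that contain a term. The base-10 logarithm is an assumption (the text says only "log"), and the function name is illustrative. The counts used are those reported for the domain in Table 1.

```python
import math

def inverse_document_frequency(total_docs, docs_with_term):
    """IDf = log(total documents in the domain / documents containing the term).

    Assumption: base-10 logarithm; the text does not specify the base.
    """
    if docs_with_term <= 0 or docs_with_term > total_docs:
        raise ValueError("docs_with_term must be between 1 and total_docs")
    return math.log10(total_docs / docs_with_term)

DOMAIN_SIZE = 51552  # number of articles in the domain (Table 1)

# Document counts for two of the domain terms, copied from Table 1.
idf_alga = inverse_document_frequency(DOMAIN_SIZE, 44362)
idf_phytoplank = inverse_document_frequency(DOMAIN_SIZE, 8505)

# A term found in most of the domain has an IDf near zero and
# discriminates poorly; a rarer term has a higher IDf.
print(round(idf_alga, 3), round(idf_phytoplank, 3))
```

A term that appears in every document gives an IDf of exactly zero, which is the limiting case of a term with no value for narrowing a search.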

The use of electronic databases and search engines is only the first filter in the process of evaluating information. These tools, however, are becoming stronger, easier to use and much faster. For example, the Biosis database now includes abstracts, which can be used to evaluate the results of a search strategy very effectively before reading the articles themselves. Although the technology and archiving of information are handled by librarians and information specialists, control over how the information is structured still rests with us as authors. Articles in the Journal of Phycology contain key words that are used for searching. Authors should be encouraged to verify that search strategies that use their suggested key words return articles with similar information. In particular, all articles in the Journal of Phycology should fall into some easily delimited domain of information on algae.


Materials and Methods:


All of the work and analysis in this paper was done on a desktop computer (either Macintosh or Windows) connected to the Internet. The university has a license to the Biosis database, which includes its own search tools. The search options are described in the "Help" and "Tutorial" menus of Biosis; they combine simple Boolean operators (and, or, not) (Boole 1854) with relational terms such as "adjoins", "same paragraph" and "within the same sentence". There are also special formats, such as "1994.yr.", which specifies the year of publication as 1994, and "Journal of Phycology".jn, which designates a specific journal. The "wild card" in Biosis is the $. The results of the searches were recorded and analyzed using Excel. In one instance an entire file of 3632 references was downloaded to my machine and sorted by journal title.

Results:

Characterization of the domain - Many terms are used to describe algae, such as "algae", "algal", "phytoplankton", "phytoplankter", "cyanobacteria", "cyanobacterial", and others. What are the fewest terms that will lead to a set of documents that contains almost all of the documents on algae? Using the wild card character $ in Biosis, the domain was defined as alga$ OR phytoplank$ OR cyanobact$. This search resulted in 51552 articles (Table 1). Combining these three terms gives eight possible sets (Figure 1): seven within the domain, plus the documents that contain none of the terms. As expected, there is considerable overlap, with several terms occurring together in many documents.
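The effect of the truncated search terms can be imitated on a local list of titles. The sketch below is not the Biosis implementation: it assumes that `$` stands for zero or more trailing letters and translates each term into a regular expression on that basis. The function names and the sample titles are hypothetical.

```python
import re

def stem_to_regex(term):
    """Translate a term such as 'alga$' into a compiled regular expression.

    Assumption: a trailing '$' means "this stem followed by any letters".
    """
    if term.endswith("$"):
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)

DOMAIN_TERMS = ["alga$", "phytoplank$", "cyanobact$"]
DOMAIN_PATTERNS = [stem_to_regex(term) for term in DOMAIN_TERMS]

def in_domain(text):
    """True if any domain term matches, i.e. the terms are combined with OR."""
    return any(pattern.search(text) for pattern in DOMAIN_PATTERNS)

# Hypothetical titles, for illustration only.
titles = [
    "Nitrogen fixation in a cyanobacterial mat",
    "Algal blooms in a eutrophic lake",
    "Iron limitation of diatom growth",
]
for title in titles:
    print(in_domain(title), title)
```

The third title illustrates the kind of article that the domain would miss unless the author supplies one of the three general terms as a key word.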


Table 1. The number of articles found in the entire Biosis database when using each key word or combination of key words. This search was performed on April 2, 1999 and used a version of Biosis that covers 1990 through the 12th week of 1999.

search number   search logic                  documents   % of total domain
1               alga$                         44362
2               cyanobact$                    10260
3               phytoplank$                   8505
4               1 OR 2 OR 3 (the "domain")    51552       100.0
                1 NOT (2 OR 3)                33712       65.4
                2 NOT (1 OR 3)                5897        11.4
                3 NOT (1 OR 2)                1178        2.3
                1 AND 2 NOT 3                 3438        6.7
                2 AND 3 NOT 1                 115         0.2
                1 AND 3 NOT 2                 6402        12.4
                1 AND 2 AND 3                 810         1.6

insert figure 1 - Venn diagram of Table 1
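The seven regions of the Venn diagram partition the domain, so their counts should add up to the total for the domain and to the totals for each individual term. The following Python sketch checks this using the counts from Table 1; the variable and function names are illustrative.

```python
# Counts for the seven regions inside the domain, copied from Table 1.
# Each key is a tuple of membership flags: (alga$, cyanobact$, phytoplank$).
regions = {
    (True, False, False): 33712,
    (False, True, False): 5897,
    (False, False, True): 1178,
    (True, True, False): 3438,
    (False, True, True): 115,
    (True, False, True): 6402,
    (True, True, True): 810,
}

domain_total = sum(regions.values())

def term_total(index):
    """Sum every region in which the term at position `index` is present."""
    return sum(count for flags, count in regions.items() if flags[index])

alga_total = term_total(0)
cyanobact_total = term_total(1)
phytoplank_total = term_total(2)

print(domain_total, alga_total, cyanobact_total, phytoplank_total)

# Percentage of the domain that falls in each region.
for flags, count in regions.items():
    print(flags, count, round(100 * count / domain_total, 1))
```

The same check can be applied to any three-term search to confirm that the individual Boolean searches were entered consistently.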


There is no practical way to check how many articles in the entire Biosis database deal with algae but are not included in this domain. However, we can analyze the articles in the Journal of Phycology to see what was missed. If we assume that all of these articles deal with algae in some manner, then they should all be represented in our domain. If they are not, then either our definition of the domain was not sufficient or the authors did not provide appropriate key words. To simplify this test, I examined all of the articles in the domain and all of the articles in the Journal of Phycology for the year 1998 (Table 2). Eight articles published in the Journal of Phycology in 1998 fell outside of the established domain. Examination of the titles of these articles shows that four were abstracts published for the annual meeting and two were clearly methods papers (of which one was also an abstract), leaving four articles out of 285 (about 1.4%) that were omitted. Examination of the titles of these articles did not identify any additional general key words that should be included to enlarge the domain. This indicates that the authors should have picked at least one key word that would have identified their contribution with the domain of information on algae and phytoplankton. This analysis could be done with other journals, but it would be more difficult to determine what percentage of the articles in other journals should pertain to algae; for example, it cannot be expected that 100% of the articles in the Journal of Plankton Research would be about some type of algae.

Table 2. Search on Biosis for the same domain as in Table 1 but limited to 1998, compared to all of the articles listed for the Journal of Phycology.

search number   search phrase and logic                                number of documents
1               "journal of phycology" in 1998                         285
2               domain (alga$ or phytoplank$ or cyanobact$) in 1998    3989
                1 NOT 2                                                8




The distribution of articles in the domain by journal - Another treatment of these data is to sort all of the articles in the specified domain by journal. In 1998 the Journal of Phycology accounted for only 285 of the 3989 articles published on algae (as defined by the "domain"). In one way this list is rather sobering, because even if you read the top ten journals on algae you would have read only 22% of the total articles. On the other hand, this list shows that there are many journals that publish ten or more articles per year on algae and that they range across a broad spectrum of disciplines in the life and environmental sciences.


Table 3. List of journal titles that had 20 or more articles that met the search criteria (alga$ or phytoplank$ or cyanobact$) in the year 1994. Note that this is a different year than for Tables 1 and 2.

Number of articles in 1994   Journal title
197 Photosynthesis Research
117 Hydrobiologia
105 Journal of Phycology
96 Bulletin of Marine Science
95 Plant Physiology (Rockville)
73 Marine Ecology Progress Series
65 Archiv fuer Hydrobiologie Supplement b and
56 Journal of Applied Physiology (?? or phycology)
49 Limnology and Oceanography
48 Marine Biology (Berlin)
47 Journal of Plankton Research
44 Botanica Marina
44 Molecular Biology of the Cell
42 Journal of Experimental Marine Biology and Ecology
41 Abstracts of the General Meeting of the American Society for Microbiology
41 Bulletin of the Ecological Society of America
40 Memoirs of the Queensland Museum
38 Biologia Plantarum (Prague)
38 Ergebnisse der Limnologie
37 Journal of Biological Chemistry
36 Phycologia
35 Plant Molecular Biology
34 Biological Chemistry Hoppe-Seyler
31 Biologia (Bratislava)
30 Archiv fuer Hydrobiologie
29 Canadian Technical Report of Fisheries and Aquatic Sciences
28 American Journal of Botany
26 Journal of Bacteriology
26 Phytochemistry (Oxford)
25 Biophysical Journal
24 Biochimica et Biophysica Acta
24 Protoplasma
23 Biochemistry
23 Natural Toxins
23 Planta (Heidelberg)
20 Japanese Journal of Phycology
20 Photochemistry and Photobiology
20 Proceedings of the National Academy of Sciences of the United States of Am
20 Review of Palaeobotany and Palynology
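Sorting a downloaded set of references by journal was done here in a spreadsheet, but the same tally can be sketched in a few lines of Python. The record layout (a dictionary with a "journal" field), the function names and the sample records below are all hypothetical; a real export from the database would need to be parsed into this form first.

```python
from collections import Counter

def articles_per_journal(records):
    """Count records per journal title, most frequent journal first."""
    counts = Counter(record["journal"] for record in records)
    return counts.most_common()

def top_share(ranked, n):
    """Fraction of all articles that appear in the n most frequent journals."""
    total = sum(count for _, count in ranked)
    return sum(count for _, count in ranked[:n]) / total

# Hypothetical records, for illustration only.
records = [
    {"journal": "Hydrobiologia"},
    {"journal": "Journal of Phycology"},
    {"journal": "Hydrobiologia"},
    {"journal": "Phycologia"},
    {"journal": "Hydrobiologia"},
    {"journal": "Journal of Phycology"},
]

ranked = articles_per_journal(records)
for journal, count in ranked:
    print(count, journal)
print(top_share(ranked, 1))
```

Applied to the full set of references for a year, `top_share(ranked, 10)` would give the fraction of the domain covered by reading the ten most productive journals.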


Establishing categories within the domain - Once the domain has been established it is important to develop efficient strategies for finding articles on specific topics. This can be done solely through search terms or by further categorization of the domain. The study of phycology is undertaken at many levels, across the sub-disciplines of the life sciences. These categories are not always the same as our curricular tracks or departments, usually for historical and political reasons. The categories in biology as determined by information specialists outside of academia (such as those who work for Yahoo) are shown in Table 3. For most of the general categories in the life sciences there are corresponding categories in the Journal of Phycology sub-heading scheme. Several do not correlate. Botany should really be considered a superheading for the Journal of Phycology, and in the Yahoo index phycology is listed under botany. Exobiology and biophysics are not explicitly covered in the Journal of Phycology. The Techniques section in the Journal of Phycology does not map onto the life science categories and is a problem area for categorization, as we shall see later.

Table 3. Categories in the life sciences from Yahoo and corresponding sub-headings in the Journal of Phycology.

Life science category                               Sub-heading in J. Phycol.
biochemistry, physiology                            Physiology and Biochemistry
cell biology, molecular biology                     Cellular and Molecular Biology
developmental biology                               Developmental Biology
genetics, ecology                                   Population Biology and Ecology
biodiversity, systematics and taxonomy, evolution   Phylogenetics and Taxonomy
structural biology, biotechnology                   Morphology and Applied Phycology
botany                                              (none)
exobiology                                          (none)
biophysics                                          (none)
(none)                                              Techniques

If these categories are valid descriptors of the entire domain, then we should be able to categorize every article. Operationally this means that using the search terms derived from each of the category headings should give all of the journal articles for the year. Some articles may fall under more than one category but none should be missed. The following search was performed in Biosis for the year 1998.

all of the articles in Journal of Phycology for 1998

NOT (ecolog$ or population or physiol$ or biochem$ or cell$ or molecular or evolu$ or phylog$ or taxon$ or morphol$ or develo$ or morphol$ or applied)

The result was 36 articles or abstracts from 1998 that were not found by key words based on the sub-headings. Based on the titles of these articles and abstracts, they were sorted into categories (Table 4). Most of the items in this search were meeting abstracts, which are not categorized under the normal sub-headings in the journal. There were eight abstracts that did not match the journal sub-headings and could not easily be placed in any particular category. There were four papers on physiology that were not caught in the search. Given the structure of information that the journal provides through its sub-headings, these authors should have included key words that would identify their contributions with this category. The highest number of misses, however, was for applied phycology. The titles of these abstracts were easily recognized as applied, yet they were not found by the search. Again, the authors of these abstracts should have self-identified their contributions.
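The logic of this test, finding the items that match none of the category terms, can be sketched in Python as follows. The term list is taken from the search string above (with the repeated morphol$ listed once). The reading of `$` as "stem followed by any letters" is an assumption, and the function names and sample titles are hypothetical.

```python
import re

# Terms from the search string above; a trailing '$' marks truncation.
CATEGORY_TERMS = [
    "ecolog$", "population", "physiol$", "biochem$", "cell$", "molecular",
    "evolu$", "phylog$", "taxon$", "morphol$", "develo$", "applied",
]

def to_pattern(term):
    """Assumed reading of '$': the stem followed by any further letters."""
    if term.endswith("$"):
        return r"\b" + re.escape(term[:-1]) + r"\w*"
    return r"\b" + re.escape(term) + r"\b"

# One alternation joins all terms, which corresponds to combining them with OR.
CATEGORY_RE = re.compile(
    "|".join(to_pattern(term) for term in CATEGORY_TERMS), re.IGNORECASE
)

def uncategorized(texts):
    """Return the texts that match none of the category terms (the NOT step)."""
    return [text for text in texts if not CATEGORY_RE.search(text)]

# Hypothetical titles, for illustration only.
titles = [
    "Photosynthetic physiology of a marine diatom",
    "Cell wall morphology in Chlorella",
    "A new method for counting picoplankton",
]
print(uncategorized(titles))
```

In this illustration only the methods title is returned, which mirrors the difficulty of placing technique papers in the sub-heading scheme.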

Table 4. Categories, based on inspection of the title, for the 36 articles that were missed by the search of the 1998 Journal of Phycology using the search string given above. The table also lists how many items in each category were meeting abstracts, which are not listed in the journal under the same sub-headings.

J. Phycology sub-heading            meeting abstracts   total number
Population Biology and Ecology      0                   0
Physiology and Biochemistry         5                   9
Cellular and Molecular Biology      0                   0
Developmental Biology               0                   0
Phylogenetics and Taxonomy          3                   3
Morphology and Applied Phycology    11                  12
Techniques                          3                   4
undetermined                        8                   8


Indexing and search terms - Indexing in journals is slightly different from electronic searching. For example, for the Journal of Phycology in 1994, the index in the December issue tended to list fewer references for a given topic than a Biosis search on the same term (Table 5). Most of the index values are in the usable range, i.e. a number of references that would be worth looking up, whereas some of the values for the Biosis search are not very helpful (in particular the 39 hits on "light" and the 26 hits on "photosynthesis"). On the other hand, if you wanted to find out who might have mentioned chlorophyll even though it was not a central area of their work, the Biosis list gives 16 hits instead of 7. These differences illustrate that even on a small scale, such as the index for one journal for one year, there are significant differences in the way the information is indexed and retrieved using different methods.


Table 5. Some key terms taken from the index of the last issue of the Journal of Phycology for 1994 compared to the number of hits from Biosis on the same term (when limited to the Journal of Phycology for 1994 also). The fourth column is the total number of articles in all of Biosis (for all years) that are found for that term.

Term                      1994 J. Phycology index   Biosis for 1994
carbon fixation           2                         1
chlorophyll               7                         16
dynamic light regimes     1                         0
fluorescence              2                         6
light                     6                         39
primary productivity      1                         1
photosynthesis            13                        26
photosynthetic pigments   8                         5
photosystem II            3                         2
pigments                  3                         9
quantum yield             2                         3



An example of the use of these search capabilities is to look for the terms or phrases that best describe a particular area of work. Biosis allows the use of "with" to mean within the same sentence. The following search was limited to all articles that met the (alga$ or phytoplank$ or cyanobact$) domain. Comparing the terms "light", "irradiance" and "illumination" reveals how useful each term is. Between 86 and 88% of the articles that use "irradiance" or "illumination" also use "light", but "irradiance" and "illumination" are almost mutually exclusive, with only a very small overlap. Because of their low frequency of occurrence, the terms "irradiance" and "illumination" can be very powerful terms in a search.


Table 6. Search of Biosis limited to the set that meets the (alga$ or phytoplank$ or cyanobact$) criteria. The terms light, irradiance and illumination were compared for their overlap and exclusivity. This search was performed in 1994 on Biosis covering 1990 to 1994.

search term                           articles   comment
light OR irradiance OR illumination   3254
light                                 3180
irradiance                            399
illumination                          201
light AND irradiance                  353        88% of the irradiance articles also use light
light AND illumination                173        86% of the illumination articles also use light
irradiance AND illumination           7          articles that use both terms are rare
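The overlap between terms can be worked out directly from the counts in Table 6. The sketch below assumes that each overlap is expressed as a share of the less frequent term (for example, the articles with both "light" and "irradiance" divided by all "irradiance" articles), and it uses the inclusion-exclusion rule for three sets to infer how many articles use all three terms. The variable and function names are illustrative.

```python
# Counts copied from Table 6.
light = 3180
irradiance = 399
illumination = 201
light_and_irradiance = 353
light_and_illumination = 173
irradiance_and_illumination = 7
any_of_three = 3254  # light OR irradiance OR illumination

def percent_shared(both, term_total):
    """Percentage of the articles with a term that also contain the other term."""
    return 100 * both / term_total

irradiance_with_light = percent_shared(light_and_irradiance, irradiance)
illumination_with_light = percent_shared(light_and_illumination, illumination)

# Inclusion-exclusion for three sets, solved for the triple overlap:
# |A or B or C| = |A| + |B| + |C| - |A and B| - |A and C| - |B and C| + |A and B and C|
all_three = (
    any_of_three
    - (light + irradiance + illumination)
    + (light_and_irradiance + light_and_illumination + irradiance_and_illumination)
)

print(round(irradiance_with_light, 1), round(illumination_with_light, 1), all_three)
```

On these counts the inferred triple overlap equals the number of articles that use both "irradiance" and "illumination", which implies that every one of those articles also uses "light".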



Discussion:

The domain of information on algae seems to be effectively represented by the combination of algae, phytoplankton and cyanobacteria. The small number of articles that were missed with this search strategy can be attributed to the index words chosen by the authors. With near-universal access to digital search technology, and in particular Biosis, authors should be able to determine how they want their articles categorized and make sure that appropriate key terms are included in the title or key word list.

The Journal of Phycology represents almost the entire range of sub-disciplines of the life sciences as applied to the study of algae. The sub-headings that are used in the Journal map onto accepted sub-disciplines with only one gap, which is the articles on techniques. Technique and methodology papers need to include key words that identify the area of phycology in which the technique is being used. This would cross-reference the technique to the sub-disciplines in the life sciences. Again, this is something that is under the author's control and can be handled at the time of submitting the article or presentation abstract. Publication of the abstracts for talks at the national PSA meeting is the most obvious problem for categorization schemes (see Table 4). This may be because novel topics are addressed and talks are given that would not otherwise appear in the Journal. There are, however, a number of abstracts that clearly fit within the established sub-headings and should be identified with key words.

Establishing a domain of knowledge and understanding its structure are very important for the development of phycology. Phycology as an area of research has much to offer the other disciplines. As individual researchers we study a wide range of basic biological questions, often in a context that can be directly compared with work in other fields. Consider some of the classical work on competitive exclusion by Tilman (****) that is represented in ecology textbooks, or how basic research on photosynthesis has been greatly facilitated by looking at cyanobacterial models. The use of this information in other disciplines also increases the value of other research on related questions in phycology. Teaching in phycology can also be enhanced by using this domain. For example, in my course on algal physiology, I explain the structure of the domain to the students and have them use the simple search string (alga$ OR phytoplank$ OR cyanobact$) to limit their initial query, and then we develop search strategies for more specific topics in the course. The students are able to access the literature much more efficiently. They can read a number of abstracts on-line through Biosis before they find the entire article. They can also use Biosis to limit their search to journals that are in our library's collection. Both research and educational uses of this information can be made more valuable by a coherent structure to the domain that is easily accessed.

This paper addressed the categorization of articles within the area of phycology and how that information could be useful to researchers and students. Another aspect that needs to be explored is how the concepts that we are using build on basic concepts in the other sub-disciplines of the life and environmental sciences. For example, what is the prerequisite knowledge required to understand and appreciate a specific paper? Very few papers that I have read in the Journal of Phycology explicitly state what background is required or cite general references on biochemistry, cell biology, etc. This might be difficult to include in the print version of the Journal, or it may not be desirable to include general background references in each paper. Such information could be included in electronic versions of the journal. For example, just as a course might list prerequisites for students, each sub-section of the Journal could include a set of background references that would be assumed knowledge for all readers. If readers needed that background, they could refer to it, and if authors needed to introduce new topics they could provide references for the reader.

Digital information will transform print media. The print journal has many values that need to be preserved. Because the process of publishing is currently synonymous with publishing in print, now is the time to lay the foundation for an information structure that will be just as useful in print as in electronic form, and in which the information in the two forms is congruent. In addition, adding some digital information resources to the Journal can increase the value of all the information for both research and education.



References:

Boole, G. 1854. An Investigation of the Laws of Thought. London: Walton and Moberly.

Cheong, F-C. 1996. Internet Agents: Spiders, Wanderers, Brokers, and Bots. New Riders Publishing. pg. 92.

Tilman

Wurtman, R. S. 1997. Information Architects. New York: Graphis.