Evaluating Web Site Structure: A Set of Techniques : U.S. Bureau of Labor Statistics

Bureau of Labor Statistics > Office of Survey Methods Research > Publications > Browse Research Papers

K. Frederickson-Mele, Michael D. Levi, and Frederick G. Conrad

Evaluating Web Site Structure: A Set of Techniques

K. Frederickson-Mele, Michael D. Levi, and Frederick G. Conrad

Introduction

As the novelty of World Wide Web site development subsides, old lessons are resurfacing. Once again the importance of user centered design is becoming increasingly evident. Given the sheer volume of Web sites and Web users, the opportunities for wasted effort and time are overwhelming. Thus, creating sites that are compatible with the way people organize, remember, and use information is essential. Web site design can be broken into two main components: page design and site structure. Page design is roughly comparable to screen design in other interactive environments, and is concerned with effective and consistent arrangement of screen objects; appropriate use of language; proper utilization of color, typefaces, and graphics; etc. Site structure is concerned with the proper organization of content to facilitate efficient and predictable navigation. Inter-page and intra-page movement is the locus of interactivity in most Web sites (leaving aside forms or scripted applets); thus site structuring largely takes the place of dialog design in other types of interactive software. Many rather good Web style guides are appearing, which can help tremendously in page design. Site structure, however, is much more difficult to generalize, as dialog and navigation are so tightly linked to the details of any given task. Validating a proposed page design is important, validating a proposed site structure is essential.

Background for the Case Study

The Bureau of Labor Statistics (BLS) released an Internet Web site in September, 1995. A primary reason it was successful was because its usability was evaluated repeatedly, beginning with the prototype site (Levi & Conrad, 1996). Encouraged by this experience, the BLS decided to use the new Web technology to develop an improved procedure for distributing internal information to its employees. A small intranet design team was established in August, 1996. The team was given two weeks to design an approach for developing an intranet, and present recommendations to upper management. Their proposal recommended that a prototype be developed, and be subjected to usability testing. The prototype was completed in October. Management approached the usability test team, and requested that usability testing be conducted. The test team, which consisted of the authors, was given two and one half weeks to design the evaluation, conduct the tests, evaluate the results, and present the findings. We met with the prototype's management team to discuss their requirements, and identified the overall organizational issues (rather than the content of the individual pages) as being the most critical. Given this focus on the prototype's high level structure, we decided to use a Card Sort Exercise, an Icon Mix-and-Match test, and a Category Membership Expectations test. The Card Sort exercise was designed to determine what mental hierarchy users construct when given a set of anticipated leaf pages from the intranet site. The Icon Mix-and-Match test was designed to find associations and/or interference between button pictures and the associated button text. The Category Membership Expectations test was designed to elicit users' understanding of a set of categories and their associated labels. The tests were conducted in one afternoon, in two separate sessions (first the Card Sort, then the Icon Mix-and-Match and the Category Membership Expectations test). Seventeen test subjects were identified and recruited. None of the participants were involved in designing or implementing the prototype. They were distributed over as many offices as possible. All were expected to have worked in the Bureau long enough to have a reasonably firm grasp of the organizational structure, and the type of work performed there. Some Web use experience was preferred. Since the objective of this series of tests focused on the prototype's overall design, the tests did not address individual page design.

Conducting the Tests

The Card Sort

Each participant was given one set of randomly ordered colored index cards that contained the individual items (anticipated leaf pages from the intranet site), rubber bands, and blank white index cards. They were asked to arrange the items into logical groupings and place a rubber band around each group. If the banded groups could be further aggregated, they were asked to band those, place a blank white index card on top, and label each grouping with a title that best described its content.

Sample items (anticipated leaf pages):

BLS Phone Directory
Reserve a Consumer Price Index (CPI) Conference Room
Consumer Price Index (CPI) Monthly Production Status
Consumer Price Index (CPI) Work Team on Data Collection
Reserve a Current Employment Survey (CES) Conference Room
Current Employment Survey (CES) Monthly Production Status
Current Employment Survey (CES) Work Team on Data Collection

We performed a hierarchical cluster analysis on the data. The resultant dendrogram aggregated the users' sorting decisions, providing a picture of how the users mentally organized the site's (likely) information. This was not the final word on an optimal site structure. By comparing the results to the proposed site structure, it was possible to see how well it reflected users' mental organization of the same information. In this card sort, the participants could have grouped leaf nodes by function. Some respondents, for example, did group all conference room reservations together, but the majority grouped leaf nodes by the office or program. This largely confirmed the design assumptions of the prototype.

In addition to looking at the composite results, we attempted to determine how or why individual users may have sorted the cards as they did. One pattern in the individual users' piles was they clearly distinguished between agency wide functions and the functions of their particular office or division. We return to this in the section "What Happened as a Result."

The expenses associated with card sorting were the time it took to make the cards (one day), conduct the test (an afternoon), and compile the results (two days). Conducting this test would have been less useful if the test participants had been new employees, and consequently less familiar with the agency's organization. Card sorting can be used anytime in the development process when information needs to be "chunked". However, when used early in the design phase, it helps produce a well organized site from the start; this avoids wasted effort designing individual pages that may become obsolete if the structure is changed later.

The Icon-Mix-and-Match

Research in human-computer interaction has found that the benefit of an icon/label pair is that the two different formats reinforce one another and users can focus on either the picture or the text, whichever is more efficient. To make this possible, people must agree that a particular icon brings to mind the same concept that a particular string of text brings to mind, and that the icon does not bring to mind other concepts. We tested this for the intranet site with an "Icon-Mix-and-Match" test. In this test, participants selected an icon they felt best represented a category. "What's New," for example, might be matched to a tiny picture of a newspaper by a participant choosing from a matrix of pictographic representations. If several other participants made the same choice and did not pick the newspaper to represent something else, then there's a good chance that the wider user population will make that association as well. The test team looked at how participants matched a textual category label to an icon, as well as any possible interference existing between icons and category labels. We established a threshold of 70% agreement for an icon label pair to be successful. Participants were asked to match 16 icons with six categories. Participants were given a table with the icons as column headers and the categories as row stubs. They were instructed to select the best icon that represented the corresponding category, and place an X in the cell. If more than one icon corresponded to that category, or none did, participants could place an X in multiple cells, or in none. The icons and the categories were placed in random order to minimize bias.

The strongest match for any category was 100%, the weakest was 10%. The criterion for selection was a category with a match equal to or greater than 70%. The test team also looked for any icons that matched multiple categories, where this was defined as an icon being assigned to two different categories 40% or more of the time. This never occurred.

Minimal resources were required to conduct the test. We designed the spreadsheet in less than one day, and it took approximately 20 minutes for the participants to complete, and about an hour to tally the results. Since this particular test had icons and possible associated categories, it was not sensitive in determining whether users might have provided other textual category labels for the icons.

Category Membership Expectation

A Category Membership Expectation test is designed to elicit users' understanding of a set of categories and their associated labels, and thus determine how or if an organizational scheme is usable. We looked for thematic agreement among the participants as to what they believed would appear in a certain category, and whether or not it conformed to the intranet designers' view. Responses were tallied and consolidated. Participants were given a form which listed the six prototype categories. They were asked to list the kind of material they would expect to find in those categories when they were on the BLS home page, and when they were on their organization's home page. To preserve context, the categories were listed in the same order as they appeared in the prototype site. The categories were: (the phrases in parentheses explained the category to the reader)

New (what's the latest)
Reference (docs & materials)
Tools (applications)
Services (support functions)
Org View (internal structure)
Map (site menu)

Two categories clearly worked as intended: 'New' and 'Reference'. Most people had clear ideas of what topics they would expect to see in these two categories, (all participants expected to see updated or new pages under "New" and there was agreement among the participants at both the BLS and organizational levels with respect to reference.) However, three categories did not work: 'Tools', 'Services', and 'Map'. There was no agreement as to what should be placed under any of these categories, with only three or four participants listing items that conformed to the intranet design team's definitions. The remaining category, 'Org View', was a partial success: users expected an organization chart, but not a clickable chart that could be used for navigating to sub-sites. Minimal resources were required for this test. We designed the form in less than one day, and it took approximately 30 minutes for the participants to complete, and about three hours to tally the results. This test method should be used early in the development process.

What Happened As a Result

One unexpected result — due in large part to organizational politics, but brought into perspective by the usability tests — had to do with control of the intranet as a whole. The initial expectation of the prototype development team was that the entire site would be internally consistent, with a high-level structure mirrored in program-specific subsite structures. A single organizational model was to be implemented by each office. Development and maintenance would be centrally controlled. The prototype organized material by the originating office. The card sort, in particular, showed us that users did not quite follow this mental model. Instead they distinguished between BLS-level information (such as personnel policies or technology standards, regardless of the source) on the one hand, and program-specific information (such as Consumer Price Index meetings) on the other. It seemed likely that the majority of users would access the BLS-level information, and their home office information, but would rarely if ever access another office's internal material. We emphasized the importance of consistency between the high-level structure and any given office's organization, but also pointed out that cross office consistency was much less significant. This made the intranet management team's political task much easier. No longer would offices be forced into consistency with one another; instead they merely needed to maintain commonality with the central site. Control and content creation was subsequently decentralized. We also strongly recommended that further page level and sub-site level testing be carried out in the future. We noticed many pages within subsites that we considered sub-optimal. However, management has not requested a follow-up study.

Observations and Reflections

When we originally met with the management team, we used an interview script. This assisted us in deciding to focus on the prototype's high level structure. Given the two week time constraint, we used techniques that could be applied rapidly. This meant that we could not collect data from end users visiting the prototype site because that kind of study is relatively slow and labor intensive. All tests were conducted using paper materials which were constructed quickly. The tasks were quick for the participants to complete and were relatively quick to analyze. (The card sort, however, involved a laborious data entry process.) Recruiting the participants was conducted by a source outside of the evaluation team, and as a result, there were some communication problems. For example, we were compelled to disqualify some of the participants because they were part of the intranet effort. Similarly, three participants were recruited for all of the tests even though we intended to use one group for the card sort, and another group for the other two tests. Consequently, we believe that we should have been very specific in our participant profile request. A final issue concerning participants is the breadth of the sample. While the skill mix was diverse with respect to web experience, we could have expanded the participant base to include users from additional offices. After the tests were completed, we presented our findings to the management team. If we were to do this again, we would not use the dendrogram as an explanatory technique. Many individuals found it difficult to interpret. A final point is that our recommendations should have been more explicit. We may have expected readers of the final report to infer too much.

Conclusion

The interrelationship between the tests is what made them especially useful because each asked users to group essentially the same type of information, (categories), but through different instruments. The Card Sort asked participants to group predetermined items into hierarchical categories, while the Category Membership Expectation test asked users to place their own explicit items in the categories. The Icon Mix-and-Match asked participants to match icons with categories. The tests results provided valuable information about the proposed categories. The Card Sort exercise validated the prototype designers' fundamental approach: follow the BLS organizational structure. The Category Membership Expectation test showed that the specific instantiation chosen by the designers was flawed, at least for some of the categories. The Icon Mix-and-Match identified icons that were not interpreted as the designers expected they would be. Our interpretation of the results lead to a high level split between information of general interest to most BLS employees, and information relevant to specific offices. This series of tests did not address page design at all. The evaluators noticed many areas in different pages and sub-sites that we considered sub-optimal (gratuitous animation, for instance). Hence, we strongly recommended that page level and sub-site level usability testing be carried out in the future.