Come summer or winter, pie is an aromatic dish to tickle the tastebuds. Just as pie is enjoyed year-round, psychological tests find a place in people’s personal and professional lives. And just as pies come in varying degrees of palatability, psychological tests are not all created equal.
There are five points to consider about the desirable qualities of psychological tests. Let’s go straight to our main course.
If the pie looks like a trifle, then it lacks face validity
A test measuring empathy should look like it is measuring empathy, not maths literacy. Social conventions and expectations should guide the test developer to give the test an appearance that is true to its purpose. It is true that some projective tests mask themselves well in order to tap hidden or subconscious thoughts, but they require a well-trained psychologist to serve the scores on a comprehensible plate.
Bogus personality tests prey on people’s gullibility, or perhaps misguided common sense, by popularising tests that look more like a pre-schooler’s activity book. For example: ‘Choose a door and we will tell you your personality’. How does a preference for a particular door design relate to our stable patterns of behaviour? These bogus tests leave a bad taste that is difficult to wash out.
If it is missing the crust, then it is deficient in content validity
A good test has sufficient samples of what it is measuring. For example, the measurement of ‘attitude’ commonly comprises affect, behaviour and cognition towards the attitudinal object. So, if a test of attitudes towards the COVID-19 vaccine is built on the ABC model but lacks items measuring one of these core elements, then it lacks content validity. This is why a test blueprint is important: it specifies the domains and the number of items we want the test to have.
Asking subject matter experts can help ensure the test has items representing the various domains it is supposed to measure. The experts can examine the pool of items and suggest overlooked domains. Usually, the experts rate the items in terms of their relevance to the assigned domain. This is akin to asking seasoned pie makers to check our recipes for the ‘must-have’ ingredients.
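As a rough illustration, such expert ratings can be summarised with an item-level content validity index (I-CVI): the proportion of experts who rate an item as relevant. This is a minimal sketch, not a full content-validation procedure; the four-point relevance scale, the threshold, and all ratings below are hypothetical.

```python
def i_cvi(ratings, relevant_threshold=3):
    """Proportion of expert ratings at or above the relevance threshold."""
    relevant = sum(1 for r in ratings if r >= relevant_threshold)
    return relevant / len(ratings)

# Hypothetical ratings from five experts for three attitude items
# (1 = not relevant ... 4 = highly relevant)
expert_ratings = {
    "affect_item_1":    [4, 4, 3, 4, 3],
    "behaviour_item_1": [4, 3, 4, 4, 4],
    "cognition_item_1": [2, 3, 2, 1, 3],  # low agreement: revise or drop
}

for item, ratings in expert_ratings.items():
    print(item, round(i_cvi(ratings), 2))
```

An item with a low I-CVI, like the third one here, is a candidate for rewording or removal before the test goes any further.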
If the crust is indistinguishable from the apple, then its structural validity is questionable
From preparing to serving the pie, the integrity of some ingredients must remain intact. The filling should be visibly different from the crust. In a test, items assigned to one domain (e.g. behaviour) should be perceived by test takers as behaviour-related items. Through analyses like Exploratory Factor Analysis (EFA), it is possible to determine, based on test takers’ responses, where the items belong.
Based on the EFA, test developers may need to revise the test. Items that cross-load (seem to belong to two domains) may have to be removed. Items that land in an unexpected domain may meet the same fate. Items that cross their domain boundaries should end up in the bin, much like a pie with too soggy a crust and too floury a filling.
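The decision rules above can be sketched in code. The loading matrix and the cut-offs below (.40 for a salient loading, .32 for a worrying secondary loading) are illustrative rules of thumb only, not fixed standards, and the items are hypothetical.

```python
def flag_item(intended_factor, loadings, salient=0.40, cross=0.32):
    """Classify an item from its EFA loadings (hypothetical thresholds)."""
    strong = [f for f, l in enumerate(loadings) if abs(l) >= salient]
    if sum(abs(l) >= cross for l in loadings) > 1:
        return "cross-loads"          # belongs to two domains: remove
    if strong and strong[0] != intended_factor:
        return "wrong factor"         # unexpected domain: remove
    return "ok"                       # behaves as intended

# item -> (intended factor index, loadings on factors 0..2)
items = {
    "beh_1": (1, [0.12, 0.68, 0.05]),
    "beh_2": (1, [0.45, 0.50, 0.08]),
    "cog_1": (2, [0.07, 0.62, 0.10]),
}

for name, (intended, lam) in items.items():
    print(name, flag_item(intended, lam))
```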
If the pie’s flavour is 100% sweet, then it is a unidimensional pie
The enjoyment of a pie depends on its success in delivering a multitude of tastes and aromas. The smell of sweet apple and cinnamon enveloped by a buttery crust will successfully stimulate the salivary glands. A pie is not meant to be unidimensional in its taste. If a test is designed as a one-dimensional test (such as the Perceived Stress Scale), then it should appear unidimensional when put under the empirical microscope. A common experience is that a unidimensional test appears to have two dimensions when some items are worded positively and others negatively.
Similarly, a multidimensional test should be shown to have distinguishable dimensions as theorised. For such tests, analyses like confirmatory factor analysis (CFA) should lead us to reject a unidimensional structure and accept a multidimensional one.
The way the scores are interpreted is also directly influenced by the test’s dimensionality. For example, the Brief COPE is meant to measure the strength of different coping strategies. What would a total score mean? The sum of all the strategies?
In contrast, the Multidimensional Scale of Perceived Social Support measures support from friends, family, and significant others. Adding the scores from each source would give a total ‘amount’ of social support perceived by the test takers. In this case, the scores for each dimension and the total score are meaningful. A pie should have a clear taste of cinnamon as well as apple, and butter. As a whole, they create the most enjoyable wholesome pie.
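A minimal sketch of this scoring logic, assuming the MSPSS’s usual twelve-item layout with three four-item subscales (the item-to-subscale mapping follows the published scale; the responses on the 1–7 agreement scale are hypothetical):

```python
# MSPSS subscales and their item numbers
subscales = {
    "significant_other": ["q1", "q2", "q5", "q10"],
    "family":            ["q3", "q4", "q8", "q11"],
    "friends":           ["q6", "q7", "q9", "q12"],
}

# Hypothetical responses to items q1..q12 (1 = very strongly disagree,
# 7 = very strongly agree)
responses = {f"q{i}": r for i, r in enumerate(
    [6, 7, 5, 5, 6, 4, 4, 6, 5, 7, 5, 4], start=1)}

# Each subscale score is meaningful on its own...
subscale_scores = {name: sum(responses[q] for q in items)
                   for name, items in subscales.items()}

# ...and so is the total, as an overall 'amount' of perceived support.
total = sum(subscale_scores.values())

print(subscale_scores, total)
```

With the Brief COPE, by contrast, only the subscale sums would carry a clear meaning; the `total` line would be the questionable step.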
If the pie tastes different to different people, then our pie is lacking invariance
Our pie should taste the same to everyone; otherwise, it would be biased. If we want everyone at work to enjoy our pie, we should be sensitive to the dietary restrictions of our potential eaters and select ingredients that suit almost everyone. Of course, individual differences will also affect how people savour the slices presented to them, but the differences should not be of a significant degree.
Similarly, test items should not be biased against any group of test takers. Differential Item Functioning (DIF), an analysis from the Rasch Measurement Model, can help us identify items that perform differently for different groups. A test measuring common mental disorders (the SRQ-20) includes an item about the frequency of an ‘uncomfortable stomach’. This item may indicate psychosomatic disturbance more accurately for males than for females, for whom an ‘uncomfortable stomach’ can be a normal monthly occurrence. Thus, some researchers suggest different scoring procedures for males and females.
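One way such group-specific scoring could look is sketched below. Skipping a DIF-flagged item for one group is only an illustrative remedy; the item names are paraphrased and the adjustments actually proposed for the SRQ-20 in the literature may well differ.

```python
# Hypothetical DIF decision: skip the stomach item when scoring females
DIF_ITEMS_BY_GROUP = {"female": {"uncomfortable_stomach"}}

def score(responses, group):
    """Sum yes(1)/no(0) responses, skipping items flagged as DIF for the group."""
    skip = DIF_ITEMS_BY_GROUP.get(group, set())
    return sum(v for item, v in responses.items() if item not in skip)

answers = {"headache": 1, "poor_sleep": 1, "uncomfortable_stomach": 1}
print(score(answers, "male"))    # all three items count
print(score(answers, "female"))  # the flagged item is skipped
```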
Thankfully, creating invariant test items induces less headache than trying to cater to people’s various dietary restrictions. People are much more tolerant of different ways of asking questions about attitude, personality, literacy, and other constructs. If this were not the case, we would need different versions of the test, just as we have to prepare different pies for people who cannot tolerate gluten, lactose, nuts, and non-organic ingredients.
The five points made above are meant as appetisers for those who are interested in the issue of test validity. By drawing the gastronomic analogy, the points made are hopefully more palatable and easier to digest. Teaching and supervising experience tells me that statistics and psychometrics are difficult and confusing. Yes, but mastering the intricacies of validity is worth it. We can’t afford to use a poor test and end up with a self-inflicted pie on our face.
Harris Shah Abd Hamid, PhD is a senior lecturer at the University of Malaya.