Hosts Richard Landers and Tara Behrend explore questionnaire design and item selection with Dr. Cort Rudolph (Wayne State University) and Dr. Mindy Shoss (University of Central Florida). The conversation covers strategies for working with archival datasets, the unique challenges of longitudinal research, and the critical importance of measurement quality in industrial-organizational psychology. Dr. Rudolph shares insights from his multi-year COVID-era study of German employees, while Dr. Shoss discusses the value and limitations of large-scale archival data. Both guests emphasize the need to revisit foundational literature and resist repackaging old constructs as novel contributions.
Key Takeaways:
- Archival datasets offer representative samples but often include single-item measures or changing scales across waves
- Longitudinal research requires careful attention to measurement invariance and attrition over time
- Multiverse analysis can strengthen measurement decisions by testing multiple analytical pathways
- Researchers should critically examine whether “new” constructs genuinely add to existing theory
- Good measurement forms the foundation of meaningful organizational research
- Ethical research practice involves testing ideas with existing data before collecting new samples
- Context matters: what counts as counterproductive behavior changes across time and culture
- Scale selection involves balancing theoretical fidelity with practical constraints
- Primary sources often reveal disconnects between original theories and current interpretations
Website: https://thegig.online/
Follow us on LinkedIn: https://www.linkedin.com/company/great-io/
Join our Discord here: https://discord.gg/WTzmBqvpyt
Join The GIG Email List: https://docs.google.com/forms/d/e/1FAIpQLSfVQ4hyF8MA4G9W-ERwVL8_e91a-MUMuhNvxhXmgkSFUDFatg/viewform?embedded=true
00:00 – Welcome and Introduction
00:57 – Archival Data vs Primary Collection
02:43 – COVID Study Origin Story
03:11 – Survey Length and Scale Selection
04:39 – Working with Archival Datasets
06:37 – When to Salvage Research Projects
09:19 – Single Items vs Multi-Item Scales
13:27 – Measurement Invariance Challenges
16:39 – Exploratory vs Confirmatory Analysis
21:45 – AI and Participant Interactions
26:11 – Positively vs Negatively Worded Items
28:46 – Item Response Theory Applications
32:10 – Multiverse Analysis Approach
35:44 – Recommended Papers and Resources
37:19 – Revisiting Foundational Literature
39:19 – Importance of Good Measurement
40:22 – Closing and Resources
Transcript
[Richard Landers] (0:00 – 0:56)
Welcome to the Great IO Get Together. On tonight’s show, quips and queries about the world of work as IO Psychology comes alive. Now please welcome our hosts, Richard and Tara.
Welcome everyone to Great IO Get Together number 35. My name is Richard. This is my co-host Tara.
Today we are exploring chapter nine of our textbook, Research Methods for IO Psychology, and this chapter is all about writing and using good items on questionnaires. So to help us do that, we have Dr. Cort Rudolph, Professor of Psychology at Wayne State University, and Dr. Mindy Shoss of the University of Central Florida. Welcome to the show.
Thanks for having us. So to start us out, both of you work with large archival datasets. I’m thinking, for example, of your 2024 paper in Journal of Applied Psychology on job insecurity and health.
What are some of the biggest challenges that you hit in working with items that other people wrote? How does that shape whether you choose to use an archival dataset or try to collect your own?
[Cort Rudolph] (0:57 – 1:18)
Well, actually that wasn’t an archival dataset. Oh. Yeah, that was data that I collected along with my colleague, Hannes Zacher, from Leipzig University.
And then we brought Mindy on that project because of her expertise in job insecurity. But we certainly have had challenges with using other people’s items in that regard. But that was a primary data collection.
[Richard Landers] (1:18 – 1:22)
Oh, was it explicitly for this project or was it part of a broader effort?
[Cort Rudolph] (1:22 – 2:43)
Yeah. So I’ll make a very long story kind of short, but yeah, that was part of a multi-year effort that was funded beginning in December of 2019 by the Volkswagen Foundation, which is like the philanthropic arm of the car company. We had received funding then to conduct a longitudinal study that was of rather limited scope.
And the idea was to collect data every three months, December 2019, and then March 2020. And then like the week after we collected the March 2020 wave, the world shut down. I don’t know if you guys have heard of COVID.
It was a big deal. We went back to the funder and said, hey, we have this really unique opportunity to collect. We have baseline data and two waves of essentially pre-lockdown, pre-whatever COVID was at that point data.
Would you all be interested in sort of transitioning this into a study of COVID? And they were super generous and quick and got back to us right away. And then so starting the first week of April 2020, and then for, I don’t know, it was like 54 waves thereafter, we had monthly data collection of a panel of about 1,800 German employees.
So sorry to throw a wrench in your gears there initially, but yeah, that was not necessarily archival in the sense that we did choose the scales and we did collect that data.
[Tara Behrend] (2:44 – 3:11)
Well, that actually raises a totally different set of questions for me, which is like, when you know that you have really limited real estate in terms of what you can ask people and you can’t give them a 30 minute survey, that’s a totally different set of constraints and things to think about. So can you talk us through how you approach that? Did you use validated scales?
We have 10 items and they have to do a lot of work, so we got to do our best with the 10 items that we have. How did you think about that?
[Cort Rudolph] (3:11 – 3:54)
A little contrary to that, we actually had a fairly long survey. So the survey itself, I believe probably took people between 30 and 45 minutes at each wave. And so yeah, absolutely.
I’ve definitely dealt with situations before where you have to think very closely about what you’re measuring and how you’re measuring it. And if you had more space, you could obviously include longer measures or multiple measures of the same construct or something like that. But in this particular case, yeah, it was actually, it’s kind of funny, right?
It’s like, we always hear like, oh, you’re so constrained by time and you’re never going to have follow-up. This was sort of a perfect storm research project where we had really good sampling and good retention and a pretty long survey that we were able to collect.
[Tara Behrend] (3:54 – 4:01)
It sounds like a unicorn and also a terrible example for people to start- Yeah, I know. Weird, because like, this has never happened, ever.
[Cort Rudolph] (4:02 – 4:37)
Yeah, I’m sorry. Like I said, this might not be an example of times when things didn’t work out because certainly there have been challenges along the way. And sort of the focus sort of shifts to thinking about things like, oh, how do you deal with attrition?
And how do you deal with measurement invariance across a five-year window? And like, how do you deal with stuff like that versus, I don’t know, thinking about scale choice and stuff like that? So, I mean, yeah, no, definitely there were lots of other challenges along the way.
And I’m not saying that everything was perfect by any means, but it was a bit of a unicorn. I sort of think like this was probably my one shot to actually do something impactful in the field. So, I don’t know.
Let’s say that.
[Mindy Shoss] (4:39 – 6:35)
Most data collections don’t go that way; you’re constrained for space. And I think that is the challenge, right?
When you talk about archival datasets, they’re trying to survey large or representative proportions of the population, and there are a bunch of things that need to be surveyed. So, for example, consider the General Social Survey.
There are tons of modules and tons of topics in there. And so then you’re left with mostly single-item measures, or with repeated cross-sectional datasets, so essentially different sets of people are surveyed at different times.
So, even though the General Social Survey has been asking people for decades, for example, to rate the extent to which they think their job is secure, the question becomes the comparability of those groups. I think, to Richard’s original question, therein lies the opportunity, right? I mean, you get a representative, or at least pseudo-representative, population of workers, probably better than what we can do in a lot of our research.
And I think one thing we try to go for is variability. We’re probably going to get more variability in those kinds of samples than you will working in a particular organization. Like everything, there are strengths and weaknesses, but I personally really like archival data.
I think it’s really worthwhile. I think it’s a good, at a minimum, proof of concept, right? If this data exists out there and we can test an idea with a large sample, I think we ought to do it.
I think we’ve got an ethical obligation to do it before we spend a lot of time, money, and resources, not only our own but our participants’, to then do other surveys and follow it up. I am a big fan of archival data. Situations like Cort’s are great, but for those who don’t get the unicorns all the time, archival data is great.
[Richard Landers] (6:37 – 7:06)
So one of the struggles that we talk about in a few places in the book is resisting the urge to salvage datasets; sometimes projects just don’t go the way that you want them to. And I feel like there’s an inherent tension when looking at archival research of saying, man, this is so close to the question I want to ask. I don’t know.
How do you navigate that? Is there a point where you’re like, you know what, it’s just not worth it? Or how much effort do you put in to try to recover something really interesting out of it?
[Cort Rudolph] (7:07 – 7:53)
There’s a couple of things here. One, people do this all the time, right? So they’ll go through and they’ll sort of mine through the questions that are asked and sort of put together these pseudo-scales and then test a measurement model that approximates something that looks like a construct, right?
I guess the other thing I was going to say, and this is probably more relevant to what we were talking about before, is that if you are using archival datasets with these sort of longer scope, especially like a long-term longitudinal study. So I’m thinking of like MIDUS, like midlife in the U.S. data or something like that. And what happens too is that the scales that are included change over time.
So they’ll measure a construct one way in one study and then they’ll drop an item and add an item or something like that. And so that creates an entirely different set of challenges, which I know doesn’t get to the question you asked. Mindy, do you have thoughts?
I’m sure you do.
[Mindy Shoss] (7:53 – 8:27)
Yeah. I mean, it’s a fine line, when do you say it’s relevant enough or not? I mean, some of it’s just looking at the items and saying, could this fit in a measure, or would it really turn out to look different?
And sometimes you have to do some piloting and some analyses to try to figure out whether, not only would you look at it that way, but would a group of participants respond to these items in a similar way? If so, great. If not, then unfortunately maybe that’s not the best place to go with the research question.
[Cort Rudolph] (8:28 – 8:52)
That’s exactly what I was going to say too, which is if you have a set of items that you think are appropriate, it doesn’t take much to take those items and pilot them in a new sample, do a bit of a validation study of your measure, so to say, from the original archival source and use that as a pilot to make a stronger case for why the items that you’re considering work in the way that you think that they do.
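A minimal sketch of that piloting check in Python, using simulated data: compute Cronbach’s alpha for a set of candidate archival items administered to a fresh pilot sample. The item count, sample size, and response scale here are invented for illustration; in practice you would substitute the candidate items and your own pilot responses.

import numpy as np

def cronbach_alpha(items):
    # items: (n_respondents, n_items) matrix of numeric responses
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated pilot: 200 respondents answer 4 candidate items on a 1-5 scale,
# all driven by one latent trait plus noise.
rng = np.random.default_rng(0)
trait = rng.normal(size=200)
raw = 3 + trait[:, None] + rng.normal(scale=0.8, size=(200, 4))
pilot = np.clip(np.round(raw), 1, 5)

print(f"Cronbach's alpha = {cronbach_alpha(pilot):.2f}")

An alpha in the same neighborhood as the original source, alongside a clean factor structure, makes a stronger case that the items work the way you think they do.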
[Tara Behrend] (8:52 – 8:56)
Have you seen examples of that in the literature of people doing that?
[Cort Rudolph] (8:56 – 9:04)
I’m not sure that I have, but I think- These are things that I tell my students to do and then they don’t. So I don’t know. I don’t have an example offhand.
Do you have any examples of that?
[Mindy Shoss] (9:06 – 9:50)
I’ve seen people do it with employability scales, some of the scales in the International Social Survey Programme, again, because those typically tend to be single items. I think people do it, but right now I’m drawing a blank on where I’ve seen that. It seems like a reasonable thing to do, but it’s always a challenge, right?
Because then there’s always that warning, don’t throw good data after bad or garbage in, garbage out kind of things. I think there’s some just judgment you have to make too. And that really is just being, keeping in mind, what is the actual research question?
I think if you’re chasing the research question, you’ll be okay. I think if you’re sort of chasing the data set, then you’re going to run into trouble.
[Richard Landers] (9:51 – 10:08)
Another big theme that I’m hearing, and one that comes up a lot in the book, is professional judgment. Do you think you’ve become more or less cautious about this kind of thing as you’ve advanced in your careers? Is this something that you thought a lot about as a fresh-out grad student, for example, versus how you’re thinking about it now?
[Mindy Shoss] (10:08 – 11:48)
I think I have, but maybe in a way that’s somewhat unexpected. I had this experience early on in my career, back when we were in St. Louis and I worked with a childcare organization. We surveyed all the different locations and all the different employees, and we went there in person.
I took my students, we had paper-and-pencil surveys. And so I got the really interesting experience of watching people respond to these surveys. And I realized that for a lot of these workers, who are in kind of precarious positions, the recommended practice that I learned in grad school, multi-item measures, was actually really problematic.
So I remember them responding to this job satisfaction scale. If anyone’s seen one of those, it’s: “I’m satisfied with my job.” “I like my job.”
“I don’t like my job.” And their response was, is this a psychological trick? Are you trying to trick us?
Did our management ask you to trick us? Are you trying to find something out? I mean, it provoked this really outsized reaction.
And I thought, wow, this is really interesting. And I never really thought about this when designing surveys. I did everything I was taught to do.
Okay, we need to be able to do alpha, we need confirmatory factor analysis, we need multiple items.
And then I realized, yeah, this is kind of a problem when we take this to real-world environments, particularly with certain worker populations. So I think I’ve become much more sensitive to that throughout my career. And then, again, trying to think of what it is we’re actually trying to measure.
Because when you review journal articles, I can’t tell you how many I review where I read the introduction and think, oh, this is really great, and then I see how they measured the constructs and I’m like, I don’t think that’s what you asked people.
[Cort Rudolph] (11:48 – 13:00)
I’ve had similar sorts of experiences, just to add to what Mindy was saying: this idea of using a think-aloud type of protocol with your items. I feel like that was not something we talked about when I was a grad student. Like actually putting the survey in the hands of the type of person that you want to respond, and asking them about their response process, not just expecting that they’re going to read the item and understand why they’re answering it.
Not just to get at the motivations underlying their responses, but what do they see when they read these items? And what do they think they’re meant to represent? It would be hard to make an argument for construct validity or content validity without knowing exactly what the participant is thinking about when they’re reading that item and providing that response.
So I think that’s extremely important. And I think that’s something that, at least when I was a student, we didn’t really emphasize: that aspect of, I don’t know if you’d call it user experience, from the perspective of your respondent. What are they actually thinking about?
What are they actually doing when they commit that numerical rating to paper or whatever?
[Tara Behrend] (13:00 – 13:31)
I think it’s incredibly important to consider the psychological and social context that exists around the experience of taking a survey, because no amount of statistics in the world will tell you that answer, right? I think that’s why we try to focus on design as a concept instead of analysis, because you can’t discover the kinds of issues that you’re raising, like fear of repercussions, by doing more factor analyses, right? It will never tell you that that’s what’s going on.
[Cort Rudolph] (13:31 – 15:21)
Yeah. I think, you know, as a general statement to your point about long surveys and choosing good items and how we think about the process, I think as students we’re trained to think about complexity too much. Everything in psychology is multidimensional and I’m not going to argue that it’s not, but our theories aren’t written at that level.
And our ability to test them is constrained by those operationalizations, because if you have a complex factor model that has, you know, an overarching hierarchical, bifactor, whatever structure, and 17 specific factors, at what level is the theory written that would allow you to test the relation between X and Y if there are 17 underlying dimensions, right? We don’t have that, unfortunately. And so I think if we’re trying to streamline surveys and test relations that make sense based on our theories, we have to ask ourselves, is that complexity meaningful?
Is there a simpler way to get at the same idea with fewer items, one that’s going to burden participants less and actually let us answer the question a little bit more directly? Speaking of that project we were talking about a few minutes ago, that’s one of the challenges we certainly had there, which is, yeah, we have all these nice multidimensional scales, but the question is, at what level are we really interested in making the arguments here? Often the theory isn’t written at the level of the dimensions, and the dimensions are interesting, but they don’t really help us make better explanations for people’s behavior.
So that’s something I, you know, I think I struggle with and I struggle impressing that upon students too, because I teach a psychometrics course to our PhD students, and we do a scale development project and they all want to develop these, oh, this is a multidimensional measure of X or Y or whatever. And it’s like, no, you have to figure out how to develop a good measure of one thing first, and then explain the complexity of it. And I think that’s sort of where people struggle a little bit is trying to do everything.
[Tara Behrend] (15:21 – 15:32)
And it’s certainly the case that we’re not measuring behavior at that level of granularity. And so we’ve got these like super fancy predictors of really gross, broad strokes of behavior, which doesn’t really align.
[Richard Landers] (15:33 – 15:50)
Absolutely. Maybe to give us some practical examples, how did you navigate this like as a research team for your Dozens of Waves Unicorn Project? Were you having discussions about response processes and how people are thinking or like, how did you tackle it?
[Cort Rudolph] (15:50 – 17:10)
Yeah, so I’ll talk out of both sides of my mouth here for a second, because I think we should be doing that. But on the other hand, this was a pretty quick process, right? We had to build this really quickly.
And so it wasn’t as systematic as I alluded to, right? We didn’t have the opportunity to go and talk to these folks and get a sense of all this. No, it was more about, can we find measures that represent kind of our best guess at the best way of measuring or assessing this particular, you know, any particular idea that we were interested in.
At the same time, during the pandemic, we were seeing almost daily a new measure coming out to assess some aspect of pandemic fatigue or safety behaviors or things like that. And so we were able to adapt those into certain waves of the survey when it made sense. Not every measure was collected at every wave.
Certain things that we thought would be more stable were collected maybe on a yearly basis instead of a monthly basis. So, for example, we have measures of Big Five markers in there, but they weren’t collected at every wave. They were collected at baseline and then in every December survey.
So I mean, I think it’s about making choices not only about what measures you collect, but when you collect them and why you collect them when you do, if that makes sense.
[Richard Landers] (17:10 – 17:49)
Yeah. You know, part of the reason I ask is, I feel like the kind of challenges that you’ve faced, and that people face in longitudinal projects like that, also more closely mimic what people face in practice, where you’re often jumping into an environment where there’s some history of measurement and you don’t really know exactly what happened. And people have been like, oh, this is really important to me now.
And they add things and they remove things. It’s a very similar pattern of problems. And there are a lot of folks, especially in IO, who graduate and then suddenly are part of an organization with a data history that they’re trying to piece together.
It just seems like there’s a lot of very similar kind of themes in trying to navigate the construction of such a long, like a massive longitudinal project.
[Cort Rudolph] (17:49 – 18:43)
Right. Yeah, absolutely. And I think too, working with that type of data is also interesting in those types of applied environments, right?
Because you might come into an organization that thinks they’re very data rich, but they’re probably not. Either they collect the data yearly and it’s never linked together, right, so you can’t tell within-unit or within-individual whether things have changed, because it’s all reported at, you know, an aggregate unit level or something like that, and you can’t link them year to year.
Or it’s living in three or four different systems that don’t talk to each other because of the complexities of integrating those systems together. Right. So, yeah, I mean, the good with doing survey research is that if you’re in control of the entire process, you can build those checks in, you can build the links in, you know, how you’re collecting the data.
It’s much more challenging, obviously, when you deal with, I guess, real data, right? Real world data.
[Richard Landers] (18:44 – 19:41)
You created a follow-up question for me. So how do you tackle it in a project like this, especially now? If you were to do it today (I think this was less of an issue in the timeframe that we’re talking about), the issue of data quality and what people are actually doing has reached a fever pitch. You know, we’ve run a project where, based on the metrics that I’ve calculated, it’s maybe 30 to 40% AI and bot use in panel research.
I don’t know how much of that was a concern for you in the project we were talking about, but I’ve personally been struggling with what’s the best way to tackle these sorts of data quality issues, both traditional and the more emergent things that are starting to pop up now, because it seems like some of that is research design, and some of that is just post hoc checking.
Some of that might be the way items are made. I don’t know. How do you view that problem?
What do you do and what would you recommend to tackle it?
What do you do and what would you recommend to tackle it?
[Mindy Shoss] (19:42 – 21:32)
Yeah, that actually came up in my department today, because the faculty in charge of SONA had a graduate student come to them who said, you know, look, I can get a ChatGPT agent to take my online survey, and basically if you send it through that agent, the agent just alerts the user when the user needs to respond, like when it needs help on a CAPTCHA or doesn’t know how to respond to something like, are you an AI? But otherwise it’s hard; you would never know.
And it takes about the same amount of time as someone reading quickly. So they were trying to figure out what to do, because the traditional indicators you might use for bots aren’t working anymore.
And so that’s a real challenge, I think. I wish I had a great answer for it. Someone might say, well, then just sort of work with organizations and do the survey through an organization.
However, of course, people could still use the internet. There’s no reason they wouldn’t. If you do it in person, then depending on the research topic, you run into a whole bunch of issues.
Like, for example, in doing job insecurity research or precarious work research, the last thing you want to do is ask people in their work setting, which you were allowed to come into because you asked management, something about their job insecurity. They won’t participate. They’re especially not likely to participate if you then want to match supervisor ratings or something to it, which our field has historically viewed as the gold standard or held up on a pedestal.
But you run into a number of really ethical challenges and you don’t want to do that. I don’t know. I mean, again, some of the large scale surveys do in-person or phone-based interviews where they talk to people.
I don’t know if we’re going to end up back there. I’m really curious what you all think about this.
[Cort Rudolph] (21:32 – 22:35)
I mean, it sort of mirrors the exact same issue that we would have in the classroom, right? Which is like if we’re worried about people using AI-based tools to complete their work, then we need to do it in a way that ensures that they don’t have the ability to use these tools, right? So instead of doing online exams or take-home tests, you do them in the classroom, right?
So it kind of feels like at this point, I don’t have a good answer. I just think that it’s something I’ve thought about, but it’s like, at what point do we have to question the quality of data coming from any blind panel, right? It’s very obvious that this sort of stuff happens.
We know that bot-type responses in surveys are to be expected at this point. I’m not sure that any number of white-text items or CAPTCHA-type checks or anything like that are going to help; those are not outpacing the technology to get around them, to be sure. I guess, just as a broad statement, it concerns me.
It concerns me as a researcher. It concerns me as a consumer of research, right? We have to be somewhat worried about that.
[Tara Behrend] (22:36 – 23:12)
I have to think that one way around it is to do research that participants care about. And so they don’t want to cheat or get it done with quickly because they want the research to happen and they’ve been involved in, you know, identifying important research questions. And I think that’s a very normal and natural thing for other fields.
But we have historically not been so great at that. But of course, you’re not going to use an AI bot if you have stakes in the outcome of the research. It means we have to ask different kinds of research questions though, which I don’t know if anyone’s ready for that.
[Richard Landers] (23:12 – 23:24)
One alternative, which I’d love to hear reactions on, is just to give up on the humans and just have the AI be the participants in all of our research. Is that a realistic option to you two?
[Cort Rudolph] (23:27 – 24:24)
I mean, it would make my life easier, I guess. I don’t know. Probably not.
I’m sure. And I mean, like I said earlier, we’re seeing more and more of this every day, right? I feel like I just saw something on Bluesky, or maybe it was in, like, Psych Science, a paper that basically addresses that question, Richard.
It was kind of, are AI participants reasonable proxies for human participants? And I didn’t read the paper, I read the abstract, and I think the general conclusion was, again, probably not.
But that doesn’t mean people aren’t going to be creative with thinking about how to do that. I think the broader question too is not just, is it possible? Because we know that it is.
It’d be more about the acceptability of that type of research to a larger audience, right? And so if it can’t make it past the editor’s desk, if it’s not rewarded in that way, then I don’t think it’s going to become a viable option for many. I don’t think it’ll stop people from doing it and maybe publishing it somewhere else, but I don’t know.
What do you think, Mindy?
[Mindy Shoss] (24:25 – 25:47)
Yeah, I was kind of there. I mean, you know, on one hand, yeah, AI is trained on data that’s out there that in theory was produced by humans. So maybe it could give a range of human-like responses.
However, just because you can doesn’t mean you should. And so, right, what’s the goal of our field? It’s to help people understand and function better in their environments, to create better policies and practices for organizations, to create better institutions surrounding work that ultimately help promote dignity, sustainability, and a variety of positive quality-of-life outcomes.
And to the extent that you’re no longer studying humans, you’re studying a computer mimicking humans, I think that makes those goals harder to reach. You know, I am curious, as so much is done by AI and there’s so much of it we see out there, whether there’s going to be, I don’t know if I want to say backlash, but an emphasis on things that are human, and a true value placed on something that’s human and on feedback that truly comes from real people. So in a sense, I’d say, Richard, maybe instead of going all in on AI, maybe what we are going to look for is more of those qualitative-type studies or phone interviews or field experiments, something where you’re really engaged in the real world.
[Tara Behrend] (25:47 – 26:35)
I certainly hope so. I mean, you’re both correct that the emerging data are pretty skeptical about whether this is even viable. But if we start from the position like, well, maybe it can do a reasonable impression of like the average socially desirable human writing on the internet, it still can’t adopt a persona.
Like if you were trying to understand what a particular kind of person’s experience is like, it can’t do that. And it can’t give you advice about how things could be different in the future to Mindy’s point, right? Like about imagining better futures, like it can just do a very good job of describing the average of everything that is right now, which feels hollow.
I would like to think that our field does share those goals. And if so, like this isn’t the way to get there. I totally agree.
[Richard Landers] (26:35 – 26:39)
If you wouldn’t trust it as a participant, do you trust it to help you write items?
[Cort Rudolph] (26:39 – 26:41)
It does do a good job of writing items.
[Mindy Shoss] (26:41 – 26:49)
Yeah. It also helps reduce the reading level of items. We’ve used that before.
That’s been quite helpful.
[Cort Rudolph] (26:49 – 28:24)
And I just saw something the other day. It’s essentially a content validity checking tool where you provide it with an operational definition of your idea, and then it’ll help map items onto that in a way that’s pretty smart, right? So it’s essentially like an LLM-based Q-sort.
I think there’s a lot of promise to that sort of work too. But if we think backwards a little bit, I mean, I think the fundamental idea of learning how to write good items needs to start from the theory that you’re trying to write the items to test, right? So you can type a definition into ChatGPT and have it write you survey items, and they look good, good enough, right?
And they probably work pretty well because survey items tend to work pretty well when they’re all kind of homogeneous, right? But the question is, are they really a good reflection of what you’re trying to measure? Are they hanging together because they’re all pretty similar?
Or are they hanging together because there truly is something we assume is underlying those items that is driving people’s response processes? And from a training perspective, I think it’s an interesting, maybe an interesting pedagogical experiment to have people write their own items and have a language model write items that are similar or at least based on the same definition and see where they kind of correspond. But that’s sort of my fear is that we’re going to lose ground to these technologies when it comes to our ability to understand how to translate theory into actual good surveys and good survey items and things like that.
I’d like to be proven wrong on that, I guess time will tell. I guarantee you it’s happening everywhere at this point.
[Tara Behrend] (28:24 – 28:36)
Well, it does sound like you’re saying though that it has to happen alongside expert judgment, not just to check, but to really be guiding the process. And so people still need to develop that expertise before they use the tools.
[Cort Rudolph] (28:36 – 28:53)
Yeah. No, I mean, it’s like any tool, right? In the wrong hands, well, there was a metaphor in there somewhere, but like anything we do, if you don’t know exactly what you’re getting at, or if you only have a superficial understanding of why this matters or how it’s potentially going to go awry, then the likelihood that it will is much higher, I think.
[Richard Landers] (28:53 – 29:43)
It’s a really interesting tension to me. There was just a post on an IO psychology social media thing about, can I just use ChatGPT to write my questionnaires? There’s a massive gap between that and, can I use it to help with specific decision-making in specific contexts?
And trying to explain how do you know when you have enough expertise to use something responsibly, that’s become the really challenging part. Because from the perspective of a novice, it doesn’t seem any different. It just makes questions.
It seems fine, whatever. Like most things with AI, it’s like 80% of the way there. It’s mostly right, but there’s some key areas where it’s not.
I’ve really struggled trying to explain to novices how to navigate that. Like, well, just don’t use it for 10 years and you’ll develop the expertise and see what I do. It’s not a good answer.
[Cort Rudolph] (29:44 – 31:13)
I think about it as a tool in a triangulation process. If we think about typical scale development, typical item development, we start with a clear definition of what we’re trying to measure based on whatever theory. We develop the model, the theory of measurement that we want to test.
We develop items that correspond to that theory. We collect data. We test it.
Part of that process of validation is probably collecting information about whether experts believe that the content of the items maps onto the content domain of the construct. We can ask people to do that. We can ask the language model to do that.
If they agree with each other, then we’re probably onto something. I don’t think any tool that we have, by itself, is sufficient. But in correspondence with the other tools that we have available to us, if all the arrows are pointing at least in the same direction, or a similar direction, then that gives us firmer ground to stand on when we make these types of decisions.
By itself, it’s going to get questioned. I was in a SIOP session last year on using AI and meta-analysis. The conclusion there was, this is neat, but we still need humans to make sure that it’s doing a good job.
Like anything, if we can get to the same place by using two different approaches, one that might be aided by a language model, then that’s awesome. It doesn’t make our work any less important. It just builds some credibility into the process that we followed as perfectly imperfect humans making these judgment calls.
[Mindy Shoss] (31:13 – 32:15)
That’s a good point. Perfectly imperfect humans. Because certainly, I think we’ve all had this experience.
You look at measures that exist out there and you go, what are you talking about? I have no idea what they’re trying to measure. It certainly doesn’t seem like it measures the construct.
It’s not like the measurement’s always been perfect. But I think Cort’s point is well taken. It’s also a tool for learning.
These models are developed to be average models, or to have a pretty good handle on language and the words that people use. That has some advantages. If I write items, I’m writing them from my viewpoint, how I look at the world, the words that I use to describe the world, and the things that I’ve been exposed to.
But if you can expand that and have access to a greater range of ways of wording things, then maybe that’s an advantage to including AI as at least part of the process with humans. Of course, with all the caveats that Tara and Richard, you two have noted in several papers now.
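As a rough illustration of the triangulation idea Cort describes above, here is a small Python sketch comparing hypothetical human expert content-validity ratings with hypothetical LLM ratings of the same draft items. All numbers are invented; the point is the convergence check, overall agreement plus flags where the two sources diverge.

import numpy as np

# Hypothetical 1-5 "relevance to the construct definition" ratings for
# eight draft items: the mean across several human expert judges, and a
# single LLM-based rating of the same items. All values are invented.
human = np.array([4.6, 4.2, 3.9, 2.1, 4.8, 3.5, 1.8, 4.4])
llm = np.array([4.5, 4.0, 4.1, 2.4, 4.7, 2.2, 2.0, 4.3])

# Overall convergence between the two sources.
r = np.corrcoef(human, llm)[0, 1]
print(f"human-LLM agreement: r = {r:.2f}")

# Items where the sources diverge get sent back for closer human review
# rather than being kept or dropped automatically.
for i, (h, l) in enumerate(zip(human, llm)):
    if abs(h - l) >= 1.0:
        print(f"item {i}: human = {h}, llm = {l} -> review")

Where the sources agree, you have firmer ground; where they diverge, the item goes back to the humans.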
[Tara Behrend] (32:16 – 32:28)
I think another issue that I worry about is that we might be in the golden age of AI right now. And pretty soon the entire internet will be filled with AI generated nonsense, which means that the models will stop producing anything useful.
[Cort Rudolph] (32:29 – 32:30)
Is it a dead internet theory?
[Tara Behrend] (32:30 – 32:36)
It’s an extension of the dead internet theory. Yeah. We might as well use it while we can.
[Richard Landers] (32:38 – 32:52)
With all the talk about triangulating and perspectives, I can’t help but be reminded of multiverses. Maybe you can tell us a little more about this multiverse concept.
Is that something that could help us with all these questions we’re struggling with here?
[Cort Rudolph] (32:52 – 34:52)
Yeah, I think there’s potential there. Just as a general overview, multiverse analysis is not a single thing, but it’s an approach to understanding decision points, essentially, that would occur throughout the process of conducting research. And so researchers make decisions all the time about which items to include on scales or what types of inclusion or exclusion criteria to assign based on things like job status or various decisions around whether cases are outliers or things like that.
And those are all documentable decisions that might have some bearing on the conclusions that you draw. What a multiverse approach allows you to do is specify, and this sounds very Marvel-esque, but you specify a universe, essentially, of all possible decision points, or a reasonable set of decision points. And you test whether those specific decisions, in isolation and in combination with each other, lead you to very different places in terms of the conclusions that you might reach.
And so that idea of doing multiverse analysis is something that I’ve been dabbling with over the last couple of years. But I think, yeah, there are definite opportunities to use that approach in developing better measures. I mean, there’s nothing to stop you from developing a multiverse approach to, for example, item selection or something like that, where you could take a large set of items and make choices about which items you retain, whether you include, for example, positively and negatively worded items.
And then you would just have to pick a target model, right? Is it a confirmatory factor model? Is it the results of a model that are based off of scale scores that are derived from a confirmatory factor model?
Are you accounting for other factors that exist in the environment while you’re measuring things or not or whatever? I mean, if it’s possible to make a decision around something, it can certainly be baked into a multiverse approach to analysis.
[Tara Behrend] (34:53 – 34:54)
So there’s no Spider-Man?
[Cort Rudolph] (34:55 – 35:08)
I mean, if you could do it with Spider-Man, perfectly fine, yeah. This is not a term that I coined, unfortunately. So, specification curves; there are a couple of different terminology sets that exist in this space.
[Richard Landers] (35:09 – 35:11)
So there’s a universe where there’s no AI?
[Cort Rudolph] (35:13 – 35:17)
I think that’s a quantum explanation, actually.
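To make the multiverse idea concrete, here is a minimal Python sketch on simulated data. Every analytic decision point (which items to keep, whether to drop flagged speeders, how to score the scale) gets a set of defensible options; every combination is run; and the resulting estimates are sorted into a rough specification curve. The decision points here are invented for illustration, not drawn from the study discussed above.

import itertools
import numpy as np

rng = np.random.default_rng(1)

# Simulated survey: 300 respondents, a 6-item predictor scale, an outcome,
# and a flag for suspiciously fast responders.
n = 300
latent = rng.normal(size=n)
items = latent[:, None] + rng.normal(size=(n, 6))
outcome = 0.3 * latent + rng.normal(size=n)
fast = rng.random(n) < 0.1

# Each decision point gets a set of defensible options.
decisions = {
    "items": [list(range(6)), list(range(4))],  # full scale vs. short form
    "drop_fast": [True, False],                 # exclude flagged speeders?
    "score": ["mean", "sum"],                   # how to form scale scores
}

# Run every combination of decisions and record the focal estimate.
results = []
for item_set, drop_fast, score in itertools.product(*decisions.values()):
    keep = ~fast if drop_fast else np.ones(n, dtype=bool)
    x = items[keep][:, item_set]
    x = x.mean(axis=1) if score == "mean" else x.sum(axis=1)
    r = np.corrcoef(x, outcome[keep])[0, 1]
    results.append(((len(item_set), drop_fast, score), r))

# A rough "specification curve": estimates sorted from smallest to largest.
for spec, r in sorted(results, key=lambda t: t[1]):
    print(f"k={spec[0]}, drop_fast={spec[1]}, score={spec[2]}: r = {r:.3f}")

If the estimate barely moves across specifications, it is robust to those choices; if it swings widely, the decisions themselves deserve scrutiny.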
[Richard Landers] (35:19 – 35:42)
To kind of wrap us up here, one of the things we like to ask guests is about any specific papers that they’ve looked at over the last maybe year or so that have really influenced or made an impression on you. And we’re talking about items, measurements, psychometrics, etc. Are there any papers or even authors that you’ve read or would recommend?
I’d love to hear from both of you. You’re putting us on the spot here, Richard.
[Mindy Shoss] (35:44 – 36:05)
Well, I’m saying this because it’s true and not because of present company, but I really liked your Journal of Business and Psychology paper on participant interactions with AI. So I think that’s a great paper that I am now assigning to all of my students as we embark on this new world.
[Tara Behrend] (36:06 – 36:09)
Well, thanks. Your check is in the mail. Well, yeah.
[Mindy Shoss] (36:10 – 37:08)
And I get to go before Cort, so I’ll share it with you, don’t worry. But I guess another one that I always recommend to students and always like is actually this: it’s Klotz and Buckley.
It’s a Journal of Management History paper on the history of counterproductive work behavior. And what’s so interesting, and why I always like this paper, is that it talks about how counterproductive work behavior really was different throughout history. I mean, the types of behaviors we would think about as being counterproductive and as indicators of this construct really have changed.
And I just think it’s so important to think about that. Again, we’re doing research within a certain context and how will these things change or be changed as the future goes on and in different environments. I just think it’s an important intellectual experiment to think about.
[Tara Behrend] (37:08 – 37:18)
I love that point. Disagreeing with your boss is actually really great if you’re trying to get to a better decision, right? Or maybe not.
That’s really important to think about.
[Cort Rudolph] (37:19 – 39:17)
Wish I had a better answer for you. I don’t know. I read a lot and I read a lot of good stuff and I read a lot of bad stuff in various roles that I hold.
And so it’s a bit of a challenging question to answer. So I guess I’ll answer with more of a general statement, which is that I think people need to go back and read a lot of the early works when they’re developing new ideas. It’s really surprising to me the number of times that I read a paper and they’ll cite some keystone theory development paper, or some paper that outlines the ideas of a concept or construct in our field.
And you read what the new paper writes and you read what the old paper says, and there’s a massive disconnect there. So I would encourage people, whenever they see something shiny and new, to think a little bit critically about where that idea came from and whether or not it actually reflects something that is genuinely novel. And I think if we’re talking about measurement, this is an area where people tend to do this, the old wine in new bottles sort of idea, where they’ll repackage an existing idea.
And I could point fingers at certain ideas in our field that are repackaged that way, but I won’t right now. But to the point: if you’re thinking about a new idea, make sure that it’s really novel. And if it’s not, at least try to give some credit to where the idea came from originally, and think about primary sources for those ideas a little bit more critically.
It’s remarkable to me the number of times that we reinvent the wheel in IO psychology. And that’s not necessarily a bad thing; I think it’s part of the structure of how we do our work and the incentives for publishing new and novel things, but it doesn’t necessarily make for a stronger science.
So, yeah, I don’t have a specific paper in mind and I’m not going to call out anybody’s constructs here, but they exist.
[Tara Behrend] (39:19 – 39:30)
It’s so true, though. Wouldn’t it be great if, instead of novelty, we focused on working together to solve the really big puzzles and build things that last? But that’s not the world we live in.
[Cort Rudolph] (39:31 – 39:55)
That’s not what’s incentivized. That’s not what’s rewarded. Unfortunately, if we’re talking about measurement, without good measurement, we have very little to speak to, right?
I mean, we often don’t study very tangible things, right? And so if we can’t measure the intangible well, we have nothing as a field. I mean, we stand on pretty shaky ground if we don’t have good measurements, so.
[Richard Landers] (39:55 – 39:57)
What a happy note to end on.
[Cort Rudolph] (39:58 – 40:00)
I thought it was inspiring.
[Richard Landers] (40:00 – 40:19)
Yes, true and inspiring and a little sad. So, yeah, thank you. Thank you both so much for taking the time.
I will point our watchers, viewers, and listeners once again to your non-archival unicorn paper in Journal of Applied Psychology, which is certainly worth a read. Just thank you both for your time again. Thrilled to have had you here.
[Tara Behrend] (40:19 – 40:22)
Thank you very much. Thank you. Good to have you.
[Richard Landers] (40:22 – 40:35)
That’s it for another gig. To stay in touch, subscribe on YouTube, check out our website at thegig.online, join our LinkedIn group, sign up for our email notification list, and join our Discord. Thanks for joining us and see you next time for another great IO get-together.
