Managing your MTurk Data with Python

I recently prescreened 1,500 workers using questions embedded into 167 individual HITs. 1 This means that my data were spread across 167 CSV documents in 167 different locations. I had no desire to download and merge these files by hand, so I used a few lines of Python to do the work for me. 2

Accessing details about workers who complete a HIT (such as their worker IDs) is easy with Boto. 3 Figuring out how to access their responses is more complicated. The solution is found in this line of code:

  assignment.answers[0][0].fields[0]

"assignment" refers to one of your worker’s assignment objects, which has an "answers" attribute. At answers[0] you will find a ResultSet object, and within this object is the worker’s responses to your questions in the form of QuestionFormAnswer objects. The worker’s response to your first question is located at assignment.answers[0][0], the response to your second question is located at assignment.answers[0][1], and so on. 4 Similarly, replacing "fields[0]" with "qid" will return the labels of the questions responded to by a worker.

Having figured out how to extract respondents' answers, my goal was now to write a brief script to extract data from an arbitrary number of HITs, each containing the same (arbitrary) number of questions. Achieving this goal was more difficult than expected.

I encountered two difficulties:

1) No list of questions presented to workers

You can extract the labels of questions responded to by a worker. However, you cannot extract the list of questions presented to a specific worker, nor can you (using the HIT object) extract the list of questions presented to all workers. 5 The list of questions responded to by a worker and the list of questions presented to the same worker will differ if the worker fails to respond to at least one of your questions — when you're surveying thousands of workers, such discrepancies are bound to appear.

Not knowing the questions presented to a worker wouldn't have been as annoying if it weren't for a second, more obnoxious issue:

2) Ignoring missing data

The assignment object does not record missing values. For example, say a worker answers your second question but neglects to respond to your first. When you extract what should be the worker's first response (i.e., assignment.answers[0][0].fields[0]), you will actually receive the worker's response to your second question, rather than a value indicating missing data for the first question.
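
To illustrate, here is a hypothetical sketch of what you would receive if a worker skipped your first question but answered your second:

  #hypothetical: the worker skipped question 1 and answered question 2
  first = assignment.answers[0][0]

  print(first.qid)        #the label of question 2, not question 1
  print(first.fields[0])  #the worker's answer to question 2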

Depending on how you write your code, looping through a worker's answers (in order to write them to an external file) may require you to match each response to a master list — the list of questions presented to the worker — in order to determine if missing values need to be inserted. However, since the HIT object will not give you this list, you will need to generate this list on your own. 6

My Solution

The previous two issues may not be important to you. Certainly you know what questions you asked your respondents, and you could manually supply this information to your code for each study you run. To avoid this hassle in the future, I instead opted to infer the list of questions presented to all workers by assuming that the longest list of questions responded to by a worker in a HIT would be identical to the list of questions presented to all workers; I assumed that at least one worker would answer all of my questions. Using this worker's list of question labels, I then identified the index values where missing data ought to be inserted into other workers' responses.

My code could be cleaner, but it at least gets the job done:

  from boto.mturk.connection import MTurkConnection

  #access_id and secret_key hold your AWS credentials
  mturk = MTurkConnection(aws_access_key_id=access_id,
        aws_secret_access_key=secret_key,
        host='mechanicalturk.amazonaws.com')


  #returns the number of answers in an assignment
  def question_counter(assignment):
    q_counter = 0
    cont = True

    while cont == True:

      try:
        assignment.answers[0][q_counter].fields[0]
        q_counter += 1
      except IndexError:
        cont = False

    return q_counter


  #returns the list of questions asked in a HIT
  def question_list(hit):
    q_counter = 0
    question_list = []

    assignments = mturk.get_assignments(hit)

    for a in assignments:
      #count the answers in the current assignment
      current_q_counter = question_counter(a)

      #the assignment with the most answers supplies the
      #working list of question labels
      if current_q_counter > q_counter:
        q_counter = current_q_counter
        question_list[:] = []
        for item in range(0, q_counter):
          question_list.append(a.answers[0][item].qid)

    return question_list


  #returns a list of index values corresponding
  #to the location of missing data

  def missing_locations(primary_questions, reduced_questions):
    locations = []
    index = 0

    for item in primary_questions:

      try:
        #a mismatch means the worker skipped this question;
        #record its index and pad the worker's list so the
        #remaining comparisons stay aligned
        if item == reduced_questions[index]:
          pass
        else:
          locations.append(index)
          reduced_questions.insert(0, "")
        index += 1

      #running past the end of the worker's list means the
      #remaining questions were all skipped
      except IndexError:
        locations.append(index)
        index += 1

    return locations



  #returns the answers to each question in the HIT
  def get_answers(hit, questions):

    q_count = len(questions)

    assignments = mturk.get_assignments(hit)

    data = {}

    for a in assignments:

      #check if current assignment includes
      #responses for all questions
      if question_counter(a) == q_count:

        current_counter = 0
        current_data = []
        for answer in range(0, q_count):

          current_data.append(a.answers[0]
                  [current_counter].fields[0])
          current_counter += 1

        data[a.WorkerId] = current_data

      #if at least one response is missing, enter missing
      #data as a blank string
      else:

        #collect the labels of every question this
        #worker answered
        current_questions = []

        for item in range(0, question_counter(a)):
          current_questions.append(a.answers[0]
                       [item].qid)

        #find the index values of the missing answers
        missing = missing_locations(questions,
                      current_questions)

        #add empty space at the missing index values
        current_counter = 0
        current_data = []

        for index in range(0, q_count - len(missing)):

          current_data.append(a.answers[0]
                  [current_counter].fields[0])

          current_counter += 1

        current_counter = 0
        for index in range(0, len(missing)):

          current_data.insert(missing
                    [current_counter],"")
          current_counter += 1


        data[a.WorkerId] = current_data


    return data


  #returns the answers to each question in every HIT
  def get_answers_from_all_hits(hits):
    first = True
    questions = []
    data = {}

    for h in hits:
      if first == True:
        questions = question_list(h)
        first = False
      data.update(get_answers(h,questions))

    return data

With the dictionary returned from the "get_answers_from_all_hits(hits)" function, you can do whatever you would like with your participants' answers, including writing them to an Excel file.
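
As one example, here is a minimal sketch that writes the results to a CSV file (which Excel will open). It assumes "hits" is the same collection of HITs passed to the functions above; the "write_results" function name and file name are my own inventions:

  import csv

  #a minimal sketch: write each worker's responses to a CSV file
  def write_results(hits, filename='prescreen_results.csv'):
    questions = question_list(hits[0])          #column labels
    data = get_answers_from_all_hits(hits)

    with open(filename, 'w') as f:              #add newline='' on Python 3
      writer = csv.writer(f)
      writer.writerow(['WorkerId'] + questions)
      for worker_id, answers in data.items():
        writer.writerow([worker_id] + answers)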


  1. I divided my prescreen into so many HITs in order to avoid MTurk’s recent rate hike. By allowing no more than nine workers to complete each HIT, I can keep MTurk’s fee at 20%, rather than 40%. I use a Python script to create my HITs, though there are several non-programmatic methods for dividing your study into multiple HITs. 

  2. Normally I send my participants to Qualtrics, which keeps all of my data in a single convenient location. However, for prescreening I find that it may be best to make use of MTurk's limited survey tools; not requiring workers to spend time going to an external website (where they can't be sure how much time the study will really take) allows me to reduce workers' time commitments, which in turn reduces costs. 

  3. See this guide for more details. 

  4. Solving this issue highlighted some of the annoyances of using Boto. In Boto, each assignment object has an "answers" attribute. This differs from the official AWS documents, which refer to each assignment object as having an "answer" attribute. Boto stores respondents’ answers in "QuestionFormAnswer" objects, which the AWS documents label as "QuestionFormAnswers." I still haven’t figured out what the ResultSet object equivalent is in the AWS documents. 

  5. The "Question" attribute in the HIT object seemed like a promising source of data, but it appears to only return the HIT's HTML code — and I didn't care to parse through such code to find the question labels. 

  6. I have confirmed that missing data is treated similarly when data is downloaded directly from MTurk's website; if all workers fail to respond to a question, a column for this question will not appear in your CSV file.