Analyses#

In Psifr, analyses take a full dataset in data frame format and calculate some statistic or statistics for each subject. Results are output in data frame format, which is flexible and allows for rich information to be output for each participant. For example, the probability of Nth recall analysis gives the conditional probability of recall of each serial position for each output position during recall.

Creating a Measure class#

Most analyses call a low-level statistics function that takes in list-format data. Here, we’ll use the recall_probability function we made in the Statistics section. To run this function separately for each subject in a dataset, we can create a Measure by inheriting from the TransitionMeasure class. This will handle iterating over subjects, converting the data to list format, calculating the statistic, and outputting the data in a results table.
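Conceptually, the work that the Measure machinery does for us can be sketched as follows. This is a simplified illustration, not psifr's actual implementation; `split_lists` here is a hypothetical helper standing in for the real list-format conversion:

```python
import pandas as pd

def analyze(measure, data):
    """Run an analysis separately for each subject (conceptual sketch)."""
    results = []
    for subject, subject_data in data.groupby('subject'):
        # Convert this subject's rows to list-format pool and recall
        # dictionaries (hypothetical helper standing in for the real
        # conversion machinery).
        pool, recall = measure.split_lists(subject_data)
        # Calculate the statistic for one subject.
        results.append(measure.analyze_subject(subject, pool, recall))
    # Stack the per-subject results into one results table.
    return pd.concat(results, axis=0)
```

Inheriting from TransitionMeasure means we only have to supply the per-subject piece; the iteration and conversion come for free.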

We just need to define an __init__ method to take in any user options and an analyze_subject method to run the analysis for one subject.

Analysis method#

The analyze_subject method must take in subject, pool, and recall inputs. The pool and recall inputs are dictionaries containing information from columns of the data frame, which have been converted into list format. Both pool and recall may contain three different types of information about study pools and recall sequences:

  • items indicates item identifiers. These are used to keep track of which items are available for recall and to match them up with recall sequences. Input position is commonly used for this.

  • label indicates item labels. This is a separate input from items because an analysis sometimes requires other information about the items, such as stimulus category, that is not unique to each item.

  • test indicates values for testing whether a given item or transition between items should be included in the analysis. A separate test function must be supplied to decide, based on the test value, whether to include that transition or recall.

Note that these are just conventions; it’s up to the individual analysis code to decide how (and whether) to use items, label, and test inputs. Each of these entries is pulled from the data frame; the individual analysis determines which columns each entry corresponds to. For example, in lag-CRP analysis, both items and label are pulled from the input column.

In general, the items input is required, while label and test are optional. In the case of recall_probability, we just need to know how many items were studied in each list and how many were recalled. Serial position works well as an identifier here (though note that this code assumes there are no repeated recalls or intrusions).
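To make the list format concrete, here is a small, hypothetical example of what the pool and recall dictionaries might contain for one subject who studied two three-item lists, using serial position as the item identifier. The recall_probability function from the Statistics section is repeated here so the sketch is self-contained:

```python
# Hypothetical list-format inputs for one subject: two study lists of
# three items each, identified by serial position.
pool = {'items': [[1, 2, 3], [1, 2, 3]]}
recall = {'items': [[3, 1], [2]]}

def recall_probability(study_items, recall_items):
    recall_prob = []
    for study, recall in zip(study_items, recall_items):
        recall_prob.append(len(recall) / len(study))
    return recall_prob

# Two of three items recalled in list 1, one of three in list 2.
stat = recall_probability(pool['items'], recall['items'])
```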

Measure initialization#

In our example, the initialization method doesn’t require any user input. We’ll just assume that input position (or serial position) is stored in the 'input' column. The TransitionMeasure class has two required inputs: a column of the data table with identifiers for the pool of studied items, and a column with identifiers for recalled items. Here, we can just use serial position (the input column in standard Psifr format) for both of these inputs.

Here, we just need to specify item identifiers, but other analyses may need additional information. For example, the category_crp analysis uses input to define item identifiers and a category column to indicate stimulus category labels.

Analysis output#

Putting everything together, we can define our Measure class:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from psifr import measures

In [4]: def recall_probability(study_items, recall_items):
   ...:     recall_prob = []
   ...:     for study, recall in zip(study_items, recall_items):
   ...:         recall_prob.append(len(recall) / len(study))
   ...:     return recall_prob
   ...: 

In [5]: class Recall(measures.TransitionMeasure):
   ...:     def __init__(self):
   ...:         super().__init__('input', 'input')
   ...:     def analyze_subject(self, subject, pool, recall):
   ...:         stat = recall_probability(pool['items'], recall['items'])
   ...:         rec = pd.DataFrame({'recall': np.mean(stat)}, index=[subject])
   ...:         rec.index.name = 'subject'
   ...:         return rec
   ...: 

Note that we define a data frame that organizes the statistics for one subject and includes subject as the index. Depending on the analysis, results frames may use a multi-level index with levels such as subject, input position, and output position.
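For instance, an analysis that conditions on output position might build a per-subject results frame with a two-level index, along these lines (the values and level names here are illustrative only):

```python
import pandas as pd

# Hypothetical per-subject result with a two-level (subject, output)
# index, as an analysis conditioned on output position might produce.
rec = pd.DataFrame(
    {'prob': [0.8, 0.5, 0.2]},
    index=pd.MultiIndex.from_product(
        [[1], [1, 2, 3]], names=['subject', 'output']
    ),
)
```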

Adding statistics and measures to Psifr#

If we were writing real code to be added to Psifr, we would add recall_probability to the psifr.stats module, and Recall to the measures module. We would also add a docstring to the recall_probability function, following NumPy’s standard docstring format.
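A NumPy-format docstring for recall_probability might look something like this (a sketch; the exact wording is up to the author):

```python
def recall_probability(study_items, recall_items):
    """
    Probability of recall for each study list.

    Parameters
    ----------
    study_items : list of list
        Identifiers of the items studied in each list.

    recall_items : list of list
        Identifiers of the items recalled from each list.

    Returns
    -------
    list of float
        Fraction of studied items recalled in each list.
    """
    recall_prob = []
    for study, recall in zip(study_items, recall_items):
        recall_prob.append(len(recall) / len(study))
    return recall_prob
```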

As mentioned previously, we would also add a unit test of recall_probability to the tests directory. That test should ideally cover edge cases like repeated recalls and intrusions; doing so would reveal that our current implementation is flawed and must be modified to handle those cases correctly.
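As a sketch, a pytest-style test along these lines exposes the problem with repeats. The function under test is repeated so the sketch is self-contained; note that the repeats case below asserts the current (flawed) behavior, which a corrected implementation would change:

```python
def recall_probability(study_items, recall_items):
    recall_prob = []
    for study, recall in zip(study_items, recall_items):
        recall_prob.append(len(recall) / len(study))
    return recall_prob

def test_recall_probability_basic():
    # Straightforward case: two of three items recalled.
    assert recall_probability([[1, 2, 3]], [[3, 1]]) == [2 / 3]

def test_recall_probability_repeats():
    # Item 2 is recalled twice; only two distinct items were recalled,
    # so the correct probability is 2/3. The naive implementation
    # counts the repeat and reports 3/3 instead.
    result = recall_probability([[1, 2, 3]], [[2, 2, 1]])
    assert result == [1.0]  # current (flawed) behavior, not the desired 2/3
```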

Adding an analysis to Psifr#

Finally, create a high-level function to run the analysis. This function should take in a data frame in Psifr merged format as its first input.

In [6]: def prec(data):
   ...:     measure = Recall()
   ...:     rec = measure.analyze(data)
   ...:     return rec
   ...: 

We can test it on some sample data:

In [7]: from psifr import fr

In [8]: raw = fr.sample_data('Morton2013')

In [9]: data = fr.merge_free_recall(raw)

In [10]: prec(data)
Out[10]: 
    subject    recall
0         1  0.541667
1         2  0.619792
2         3  0.503472
3         4  0.590278
4         5  0.710069
5         6  0.541667
6         7  0.679688
7         8  0.433160
8        10  0.528646
9        11  0.593750
10       12  0.516493
11       15  0.506076
12       16  0.388021
13       18  0.561632
14       20  0.640625
15       22  0.513889
16       23  0.559028
17       24  0.559896
18       25  0.502604
19       26  0.746528
20       27  0.391493
21       28  0.603299
22       29  0.582465
23       30  0.714410
24       31  0.643229
25       32  0.763889
26       33  0.531250
27       34  0.414931
28       35  0.621528
29       36  0.627604
30       37  0.302083
31       38  0.567708
32       40  0.402778
33       41  0.563368
34       42  0.405382
35       43  0.624132
36       44  0.488715
37       45  0.515625
38       46  0.572049
39       47  0.640625

If we were writing real code to be added to Psifr, we could add prec to the psifr.fr module, thus making it available through the high-level fr API. We would add a docstring for prec describing the inputs and outputs in standard NumPy docstring format. Ideally, at the end of this docstring, we’d include a doctest-compatible example of how to run an analysis and the expected output for that example. Finally, we would also add a unit test on some sample data in tests/test_fr.py, thus adding a test of analysis functionality to Psifr’s automated test suite.
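A NumPy-format docstring for prec, with a doctest-compatible Examples section mirroring the session above, might look like this sketch (the function body is omitted here, and the expected output line would be filled in with the actual result when writing the real docstring):

```python
def prec(data):
    """
    Probability of recall for each subject.

    Parameters
    ----------
    data : pandas.DataFrame
        Merged study and recall data in Psifr format.

    Returns
    -------
    pandas.DataFrame
        Mean recall probability for each subject.

    Examples
    --------
    >>> from psifr import fr
    >>> raw = fr.sample_data('Morton2013')
    >>> data = fr.merge_free_recall(raw)
    >>> results = prec(data)
    """
```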