Analyses#
In Psifr, analyses take a full dataset in data frame format and calculate some statistic or statistics for each subject. Results are output in data frame format, which is flexible and allows for rich information to be output for each participant. For example, the probability of Nth recall analysis gives the conditional probability of recall of each serial position for each output position during recall.
Creating a Measure class#
Most analyses call a low-level statistics function that takes in list-format data. Here, we’ll use the recall_probability function we made in the Statistics section. To run this function separately for each subject in a dataset, we can create a Measure by inheriting from the TransitionMeasure class. This will handle iterating over subjects, converting the data to list format, calculating the statistic, and outputting the data in a results table.
We just need to define an __init__ method to take in any user options and an analyze_subject method to run the analysis for one subject.
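To make this division of labor concrete, here is a minimal plain-Python sketch of the pattern. It is not Psifr's implementation: SketchMeasure, to_lists, CountRecalls, and the simplified row-dictionary layout are all hypothetical stand-ins, used only to show where __init__ and analyze_subject fit into the per-subject loop.

```python
from collections import defaultdict


def to_lists(rows, key, list_numbers):
    """Group the values of one column by list number (hypothetical helper)."""
    by_list = {n: [] for n in list_numbers}
    for row in rows:
        by_list[row['list']].append(row[key])
    return [by_list[n] for n in list_numbers]


class SketchMeasure:
    """Hypothetical stand-in for a measure base class, not Psifr's code."""

    def __init__(self, items_key):
        self.items_key = items_key

    def analyze_subject(self, subject, pool, recall):
        raise NotImplementedError

    def analyze(self, rows):
        # Split the rows by subject.
        by_subject = defaultdict(list)
        for row in rows:
            by_subject[row['subject']].append(row)

        results = []
        for subject in sorted(by_subject):
            subject_rows = by_subject[subject]
            list_numbers = sorted({r['list'] for r in subject_rows})
            study = [r for r in subject_rows if r['trial_type'] == 'study']
            recalled = [r for r in subject_rows if r['trial_type'] == 'recall']
            # Convert the requested column to list format, one inner list per list.
            pool = {'items': to_lists(study, self.items_key, list_numbers)}
            recall = {'items': to_lists(recalled, self.items_key, list_numbers)}
            results.append(self.analyze_subject(subject, pool, recall))
        return results


class CountRecalls(SketchMeasure):
    """Example subclass: only the two methods named above are defined."""

    def __init__(self):
        super().__init__('input')

    def analyze_subject(self, subject, pool, recall):
        n_recalled = sum(len(r) for r in recall['items'])
        return {'subject': subject, 'n_recalled': n_recalled}


rows = [
    {'subject': 1, 'list': 1, 'trial_type': 'study', 'input': 1},
    {'subject': 1, 'list': 1, 'trial_type': 'study', 'input': 2},
    {'subject': 1, 'list': 1, 'trial_type': 'recall', 'input': 2},
    {'subject': 2, 'list': 1, 'trial_type': 'study', 'input': 1},
    {'subject': 2, 'list': 1, 'trial_type': 'study', 'input': 2},
    {'subject': 2, 'list': 1, 'trial_type': 'recall', 'input': 1},
    {'subject': 2, 'list': 1, 'trial_type': 'recall', 'input': 2},
]
results = CountRecalls().analyze(rows)
```

The subclass never touches the subject loop or the list conversion; it only states which column to use and what to compute for one subject.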
Analysis method#
The analyze_subject method must take in subject, pool, and recall inputs. The pool and recall inputs are dictionaries containing information from columns of the data frame, which have been converted into list format. Both pool and recall may contain three different types of information about study pools and recall sequences:
items indicate item identifiers. This is used to keep track of which items are available for recall and to match them up with recall sequences. Input position is commonly used for this.
label indicates item labels. This is a separate input from items because sometimes an analysis requires other information about the items, such as stimulus category, which is not unique to that item.
test indicates values for testing whether a given item or transition between items should be included in the analysis. A separate test function must be supplied to check whether or not to include that transition or recall, based on its test value.
Note that these are just conventions; it’s up to the individual analysis code to decide how (and whether) to use items, label, and test inputs. Each of these entries is pulled from the data frame; the individual analysis determines which columns each entry corresponds to. For example, in lag-CRP analysis, both items and label are pulled from the input column.
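As a rough illustration of list format, the dictionaries passed to analyze_subject for one subject who studied two three-item lists might look like the following. All values, including the category labels, are made up for this sketch; the actual contents depend on which columns the analysis requests.

```python
# Hypothetical inputs for one subject; each entry holds one inner list
# per study list.
pool = {
    'items': [[1, 2, 3], [1, 2, 3]],  # serial position of each studied item
    'label': [['animal', 'tool', 'fruit'], ['fruit', 'animal', 'tool']],
}
recall = {
    'items': [[3, 1], [2, 3, 1]],  # serial positions in output order
    'label': [['fruit', 'animal'], ['animal', 'tool', 'fruit']],
}

# Item identifiers link each recall back to its studied item, so the
# labels of recalled items can be recovered from the pool.
looked_up = []
for pool_items, pool_labels, recall_items in zip(
    pool['items'], pool['label'], recall['items']
):
    label_of = dict(zip(pool_items, pool_labels))
    looked_up.append([label_of[item] for item in recall_items])

print(looked_up == recall['label'])  # True
```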
In general, items inputs are required, but labels and tests are optional. In the case of recall_probability, we just need to know how many study items there were in each list and how many items were recalled. Serial position works well in this case (though note that this code assumes that there are no repeated recalls or intrusions).
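The effect of that assumption is easy to see on a small made-up example. The function below is the one defined in this tutorial; the study and recall lists are invented for illustration.

```python
def recall_probability(study_items, recall_items):
    """Number of recalls divided by number of study items, for each list."""
    recall_prob = []
    for study, recall in zip(study_items, recall_items):
        recall_prob.append(len(recall) / len(study))
    return recall_prob


study = [[1, 2, 3, 4], [1, 2, 3, 4]]
clean = [[4, 1], [2, 3, 1]]
with_repeat = [[4, 1, 4], [2, 3, 1]]  # item 4 is recalled twice in list 1

print(recall_probability(study, clean))        # [0.5, 0.75]
print(recall_probability(study, with_repeat))  # [0.75, 0.75]
```

Because the function only counts recall events, the repeated recall raises the first list's value from 0.5 to 0.75 even though the same two items were remembered.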
Measure initialization#
In our example, the initialization method doesn’t require any user input. We’ll just assume that input position (or serial position) is stored in the 'input' column. The TransitionMeasure class has two required inputs: a column of the data table with identifiers for the pool of studied items, and a column with identifiers for recalled items. Here, we can just use serial position (the input column in standard Psifr format) for both of these inputs.
Here, we just need to specify item identifiers, but other analyses may need additional information. For example, the category_crp analysis uses input to define item identifiers and a category column to indicate stimulus category labels.
Analysis output#
Putting everything together, we can define our Measure class:
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from psifr import measures

In [4]: def recall_probability(study_items, recall_items):
   ...:     recall_prob = []
   ...:     for study, recall in zip(study_items, recall_items):
   ...:         recall_prob.append(len(recall) / len(study))
   ...:     return recall_prob
   ...:

In [5]: class Recall(measures.TransitionMeasure):
   ...:     def __init__(self):
   ...:         super().__init__('input', 'input')
   ...:     def analyze_subject(self, subject, pool, recall):
   ...:         stat = recall_probability(pool['items'], recall['items'])
   ...:         rec = pd.DataFrame({'recall': np.mean(stat)}, index=[subject])
   ...:         rec.index.name = 'subject'
   ...:         return rec
   ...:
Note that we define a data frame that organizes the statistics for one subject, and includes subject as the index for the data frame. Analysis data frames may contain multiple indices, such as subject, input position, and output position, depending on the analysis.
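For an analysis that reports one value per serial position, the per-subject data frame could instead carry a two-level index. The following is a sketch with made-up values; the column name recall and the level name input are illustrative choices here, not the output of a specific Psifr analysis.

```python
import pandas as pd

subject = 1
recall_by_position = [1.0, 0.5, 0.0]  # made-up values for three serial positions

# One row per (subject, input position) combination.
index = pd.MultiIndex.from_product(
    [[subject], [1, 2, 3]], names=['subject', 'input']
)
rec = pd.DataFrame({'recall': recall_by_position}, index=index)
print(rec)
```

Returning the index levels with consistent names for every subject lets the per-subject tables be stacked into a single results table.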
Adding statistics and measures to Psifr#
If we were writing real code to be added to Psifr, we would add recall_probability to the psifr.stats module, and Recall to the measures module. We would also add a docstring to the recall_probability function, following NumPy’s standard docstring format.
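A docstring in that format might look roughly like the sketch below. The wording of each section is a suggestion rather than text from Psifr, and the example values are made up.

```python
def recall_probability(study_items, recall_items):
    """
    Proportion of recalls relative to list length, for each list.

    Parameters
    ----------
    study_items : list of list
        Identifiers of the items studied in each list.

    recall_items : list of list
        Identifiers of the items recalled in each list, in output order.

    Returns
    -------
    recall_prob : list of float
        Number of recalls divided by number of study items, for each list.

    Notes
    -----
    Repeated recalls and intrusions are counted like correct recalls.

    Examples
    --------
    >>> recall_probability([[1, 2, 3, 4]], [[4, 1]])
    [0.5]
    """
    recall_prob = []
    for study, recall in zip(study_items, recall_items):
        recall_prob.append(len(recall) / len(study))
    return recall_prob
```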
As mentioned previously, we would also add a unit test of recall_probability to the tests directory. That test would ideally handle edge cases like repeats and intrusions. It would then show that our current implementation is flawed and needs to be modified to handle these edge cases correctly.
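One way such a test and a corresponding fix might look is sketched below. recall_probability_unique is a hypothetical revision, not part of Psifr: it counts each studied item at most once and ignores recalls that do not match a studied item. Representing an intrusion as None is an assumption of this example. The test function is written in the plain-assert style that pytest collects.

```python
def recall_probability_unique(study_items, recall_items):
    """Proportion of studied items recalled at least once, for each list."""
    recall_prob = []
    for study, recall in zip(study_items, recall_items):
        # Intersecting with the study list drops intrusions; using sets
        # counts a repeated item only once.
        recalled = set(recall) & set(study)
        recall_prob.append(len(recalled) / len(study))
    return recall_prob


def test_recall_probability_edge_cases():
    study = [[1, 2, 3, 4], [1, 2, 3, 4]]
    recall = [[4, 1, 4], [2, None, 3]]  # one repeat, one intrusion
    assert recall_probability_unique(study, recall) == [0.5, 0.5]


test_recall_probability_edge_cases()
```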
Adding an analysis to Psifr#
Finally, create a high-level function to run the analysis. This function should take in a data frame in Psifr merged format as its first input.
In [6]: def prec(data):
   ...:     measure = Recall()
   ...:     rec = measure.analyze(data)
   ...:     return rec
   ...:
We can test it on some sample data:
In [7]: from psifr import fr
In [8]: raw = fr.sample_data('Morton2013')
In [9]: data = fr.merge_free_recall(raw)
In [10]: prec(data)
Out[10]:
subject recall
0 1 0.541667
1 2 0.619792
2 3 0.503472
3 4 0.590278
4 5 0.710069
5 6 0.541667
6 7 0.679688
7 8 0.433160
8 10 0.528646
9 11 0.593750
10 12 0.516493
11 15 0.506076
12 16 0.388021
13 18 0.561632
14 20 0.640625
15 22 0.513889
16 23 0.559028
17 24 0.559896
18 25 0.502604
19 26 0.746528
20 27 0.391493
21 28 0.603299
22 29 0.582465
23 30 0.714410
24 31 0.643229
25 32 0.763889
26 33 0.531250
27 34 0.414931
28 35 0.621528
29 36 0.627604
30 37 0.302083
31 38 0.567708
32 40 0.402778
33 41 0.563368
34 42 0.405382
35 43 0.624132
36 44 0.488715
37 45 0.515625
38 46 0.572049
39 47 0.640625
If we were writing real code to be added to Psifr, we could add prec to the psifr.fr module, thus making it available through the high-level fr API. We would add a docstring for prec describing the inputs and outputs in standard NumPy docstring format. Ideally, at the end of this docstring, we’d include a doctest-compatible example of how to run an analysis and the expected output for that example. Finally, we would also add a unit test on some sample data in tests/test_fr.py, thus adding a test of analysis functionality to Psifr’s automated test suite.