Psifr documentation¶
In free recall, participants study a list of items and then name all of the items they can remember in any order they choose. Many sophisticated analyses have been developed to analyze data from free recall experiments, but these analyses are often complicated and difficult to implement.
Psifr leverages the Pandas data analysis package to make precise and flexible analysis of free recall data faster and easier.
See the code repository for version release notes.
Installation¶
You can install the latest stable version of Psifr using pip:
pip install psifr
You can also install the development version directly from the code repository on GitHub:
pip install git+https://github.com/mortonne/psifr
User guide¶
Importing data¶
In Psifr, free recall data are imported in the form of a “long” format table. Each row corresponds to one study or recall event. Study events include any time an item was presented to the participant. Recall events correspond to any recall attempt; this includes repeats of items that were already recalled and intrusions of items that were not present in the study list.
This type of information is well represented in a CSV spreadsheet, though any file format supported by pandas may be used for input. To import from a CSV, use pandas. For example:
import pandas as pd
data = pd.read_csv("my_data.csv")
Trial information¶
The basic information that must be included for each event is the following:
- subject
Some code (numeric or string) indicating individual participants. Must be unique for a given experiment. For example, sub-101.
- list
Numeric code indicating individual lists. Must be unique within subject.
- trial_type
String indicating whether each event is a study event or a recall event.
- position
Integer indicating position within a given phase of the list. For study events, this corresponds to input position (also referred to as serial position). For recall events, this corresponds to output position.
- item
Individual thing being recalled, such as a word. May be specified with text (e.g., pumpkin, Jack Nicholson) or a numeric code (682, 121). Either way, the text or number must be unique to that item. Text is easier to read and does not require any additional information for interpretation and is therefore preferred if available.
Example¶
subject | list | trial_type | position | item |
---|---|---|---|---|
1 | 1 | study | 1 | absence |
1 | 1 | study | 2 | hollow |
1 | 1 | study | 3 | pupil |
1 | 1 | recall | 1 | pupil |
1 | 1 | recall | 2 | absence |
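As a rough sketch of how such a table might be checked before analysis, the snippet below writes the example as CSV text, parses it with the standard library, and verifies that the five required fields are present. The helper name `missing_fields` is hypothetical and is not part of Psifr.

```python
import csv
import io

REQUIRED = ["subject", "list", "trial_type", "position", "item"]

# The example table, written as CSV text.
CSV_TEXT = """subject,list,trial_type,position,item
1,1,study,1,absence
1,1,study,2,hollow
1,1,study,3,pupil
1,1,recall,1,pupil
1,1,recall,2,absence
"""


def missing_fields(fieldnames):
    """Return the required fields that are absent from a header row."""
    return [name for name in REQUIRED if name not in fieldnames]


reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = list(reader)

# An empty list means every required field is present.
print(missing_fields(reader.fieldnames))
# Each event should be labeled either "study" or "recall".
print(sorted({row["trial_type"] for row in rows}))
```

The same check carries over to a pandas DataFrame by passing its column names instead of the CSV header.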
Additional information¶
Additional fields may be included in the data to indicate other aspects of the experiment, such as presentation time, stimulus category, experimental session, distraction length, etc. All of these fields can then be used for analysis in Psifr.
Scoring data¶
After importing free recall data, we have a DataFrame with a row for each study event and a row for each recall event. Next, we need to score the data by matching study events with recall events.
Scoring list recall¶
First, let’s create a simple sample dataset with two lists:
In [1]: import pandas as pd
In [2]: data = pd.DataFrame({
...: 'subject': [
...: 1, 1, 1, 1, 1, 1,
...: 1, 1, 1, 1, 1, 1,
...: ],
...: 'list': [
...: 1, 1, 1, 1, 1, 1,
...: 2, 2, 2, 2, 2, 2,
...: ],
...: 'trial_type': [
...: 'study', 'study', 'study', 'recall', 'recall', 'recall',
...: 'study', 'study', 'study', 'recall', 'recall', 'recall',
...: ],
...: 'position': [
...: 1, 2, 3, 1, 2, 3,
...: 1, 2, 3, 1, 2, 3,
...: ],
...: 'item': [
...: 'absence', 'hollow', 'pupil', 'pupil', 'absence', 'empty',
...: 'fountain', 'piano', 'pillow', 'pillow', 'fountain', 'pillow',
...: ],
...: })
...:
In [3]: data
Out[3]:
subject list trial_type position item
0 1 1 study 1 absence
1 1 1 study 2 hollow
2 1 1 study 3 pupil
3 1 1 recall 1 pupil
4 1 1 recall 2 absence
5 1 1 recall 3 empty
6 1 2 study 1 fountain
7 1 2 study 2 piano
8 1 2 study 3 pillow
9 1 2 recall 1 pillow
10 1 2 recall 2 fountain
11 1 2 recall 3 pillow
Next, we’ll merge together the study and recall events by matching up corresponding events:
In [4]: from psifr import fr
In [5]: merged = fr.merge_free_recall(data)
In [6]: merged
Out[6]:
subject list item input output study recall repeat intrusion
0 1 1 absence 1.0 2.0 True True 0 False
1 1 1 hollow 2.0 NaN True False 0 False
2 1 1 pupil 3.0 1.0 True True 0 False
3 1 1 empty NaN 3.0 False True 0 True
4 1 2 fountain 1.0 2.0 True True 0 False
5 1 2 piano 2.0 NaN True False 0 False
6 1 2 pillow 3.0 1.0 True True 0 False
7 1 2 pillow 3.0 3.0 False True 1 False
For each item, there is one row for each unique combination of input and output position. For example, if an item is presented once in the list, but is recalled multiple times, there is one row for each of the recall attempts. Repeated recalls are indicated by the repeat column, which is greater than zero for recalls of an item after the first. Unique study events are indicated by the study column; this excludes intrusions and repeated recalls.
Items that were not recalled have the recall column set to False. Because they were not recalled, they have no defined output position, so output is set to NaN. Finally, intrusions have an output position but no input position because they did not appear in the list. There is an intrusion field for convenience to label these recall attempts.
merge_free_recall() can also handle additional attributes beyond the standard ones, such as codes indicating stimulus category or list condition. See Working with custom columns for details.
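To make the scoring rules described above concrete, here is a minimal pure-Python sketch that scores a single list. It is not Psifr's implementation: the function name `score_list` is hypothetical, study items are assumed to be unique within the list, and `None` stands in for the NaN values shown in the merged table.

```python
def score_list(study_items, recall_items):
    """Merge one list's study and recall events into scored rows.

    Hypothetical helper for illustration; positions are 1-based.
    """
    rows = []
    for input_pos, item in enumerate(study_items, start=1):
        outputs = [i for i, r in enumerate(recall_items, start=1) if r == item]
        if not outputs:
            # Studied but never recalled: no output position.
            rows.append(dict(item=item, input=input_pos, output=None,
                             study=True, recall=False, repeat=0,
                             intrusion=False))
        for n, output_pos in enumerate(outputs):
            # The first recall is paired with the study event;
            # later recalls of the same item are repeats.
            rows.append(dict(item=item, input=input_pos, output=output_pos,
                             study=(n == 0), recall=True, repeat=n,
                             intrusion=False))
    # Recalls of items that were never studied are intrusions.
    seen = {}
    for output_pos, item in enumerate(recall_items, start=1):
        if item not in study_items:
            n = seen.get(item, 0)
            rows.append(dict(item=item, input=None, output=output_pos,
                             study=False, recall=True, repeat=n,
                             intrusion=True))
            seen[item] = n + 1
    return rows


for row in score_list(["fountain", "piano", "pillow"],
                      ["pillow", "fountain", "pillow"]):
    print(row)
```

Applied to the second sample list, this yields one row each for fountain and piano and two rows for pillow, the second of which has repeat set to 1 and study set to False.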
Filtering and sorting¶
Now that we have a merged DataFrame, we can use pandas methods to quickly get different views of the data. For some analyses, we may want to organize in terms of the study list by removing repeats and intrusions. Because our data are in a DataFrame, we can use the DataFrame.query method:
In [7]: merged.query('study')
Out[7]:
subject list item input output study recall repeat intrusion
0 1 1 absence 1.0 2.0 True True 0 False
1 1 1 hollow 2.0 NaN True False 0 False
2 1 1 pupil 3.0 1.0 True True 0 False
4 1 2 fountain 1.0 2.0 True True 0 False
5 1 2 piano 2.0 NaN True False 0 False
6 1 2 pillow 3.0 1.0 True True 0 False
Alternatively, we may also want to get just the recall events, sorted by output position instead of input position:
In [8]: merged.query('recall').sort_values(['list', 'output'])
Out[8]:
subject list item input output study recall repeat intrusion
2 1 1 pupil 3.0 1.0 True True 0 False
0 1 1 absence 1.0 2.0 True True 0 False
3 1 1 empty NaN 3.0 False True 0 True
6 1 2 pillow 3.0 1.0 True True 0 False
4 1 2 fountain 1.0 2.0 True True 0 False
7 1 2 pillow 3.0 3.0 False True 1 False
Note that we first sort by list, then output position, to keep the lists together.
Recall performance¶
First, load some sample data and create a merged DataFrame:
In [1]: from psifr import fr
In [2]: df = fr.sample_data('Morton2013')
In [3]: data = fr.merge_free_recall(df)
Raster plot¶
Raster plots can give you a quick overview of a whole dataset. We’ll look at all of the first subject’s recalls. This will plot every individual recall, colored by the serial position of the recalled item in the list. Items near the end of the list are shown in yellow, and items near the beginning of the list are shown in purple. Intrusions of items not on the list are shown in red.
In [4]: subj = fr.filter_data(data, 1)
In [5]: g = fr.plot_raster(subj).add_legend()
Serial position curve¶
We can calculate average recall for each serial position using spc() and plot using plot_spc().
In [6]: recall = fr.spc(data)
In [7]: g = fr.plot_spc(recall)
Using the same plotting function, we can plot the curve for each individual subject:
In [8]: g = fr.plot_spc(recall, col='subject', col_wrap=5)
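Conceptually, a serial position curve is the proportion of study events at each input position that were later recalled. The sketch below computes that proportion by hand for the small two-list example from the scoring section. The function name `serial_position_curve` is hypothetical, and this is only an illustration of the quantity, not how spc() is implemented.

```python
from collections import defaultdict

# (input position, recalled) for every study event in the two-list example.
study_events = [
    (1, True), (2, False), (3, True),  # list 1: absence, hollow, pupil
    (1, True), (2, False), (3, True),  # list 2: fountain, piano, pillow
]


def serial_position_curve(events):
    """Proportion of study events recalled at each input position."""
    recalled = defaultdict(int)
    presented = defaultdict(int)
    for position, was_recalled in events:
        presented[position] += 1
        recalled[position] += int(was_recalled)
    return {pos: recalled[pos] / presented[pos] for pos in sorted(presented)}


print(serial_position_curve(study_events))  # {1: 1.0, 2: 0.0, 3: 1.0}
```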
Probability of Nth recall¶
We can also split up recalls by output position to test, for example, how likely participants were to initiate recall with the last item on the list.
In [9]: prob = fr.pnr(data)
In [10]: prob
Out[10]:
prob actual possible
subject output input
1 1 1 0.000000 0 48
2 0.020833 1 48
3 0.000000 0 48
4 0.000000 0 48
5 0.000000 0 48
... ... ... ...
47 24 20 NaN 0 0
21 NaN 0 0
22 NaN 0 0
23 NaN 0 0
24 NaN 0 0
[23040 rows x 3 columns]
This gives us the probability of recall by output position ('output') and serial or input position ('input'). This is a lot to look at all at once, so it may be useful to plot just the first three output positions.
We can plot the curves using plot_spc(), which takes an optional hue input to specify a variable to use to split the data into curves of different colors.
In [11]: pfr = prob.query('output <= 3')
In [12]: g = fr.plot_spc(pfr, hue='output').add_legend()
This plot shows what items tend to be recalled early in the recall sequence.
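The actual and possible columns in the table above can be understood through a small counting sketch. It assumes that an item counts as possible at a given output position if it has not yet been recalled in that list, and that each sequence contains only valid, non-repeated recalls. The function name `count_pnr` is hypothetical here, and the code is an illustration rather than Psifr's implementation.

```python
def count_pnr(list_length, recall_sequences):
    """Count actual and possible recalls for each (output, input) pair.

    Each sequence holds the serial positions of valid, non-repeated
    recalls in output order. Results are indexed [output - 1][input - 1].
    """
    actual = [[0] * list_length for _ in range(list_length)]
    possible = [[0] * list_length for _ in range(list_length)]
    for recalls in recall_sequences:
        remaining = set(range(1, list_length + 1))
        for output, position in enumerate(recalls):
            # Every item not yet recalled could have been recalled here.
            for p in remaining:
                possible[output][p - 1] += 1
            actual[output][position - 1] += 1
            remaining.discard(position)
    return actual, possible


# Two hypothetical three-item lists, both started with the last item.
actual, possible = count_pnr(3, [[3, 1], [3, 2, 1]])

# Probability of first recall for input position 3.
print(actual[0][2] / possible[0][2])  # 1.0
```

Pairs with a possible count of zero have no defined probability, which is consistent with the NaN entries at late output positions in the table.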
Recall order¶
A key advantage of free recall is that it provides information not only about what items are recalled, but also the order in which they are recalled. A number of analyses have been developed to characterize different influences on recall order, such as the temporal order in which the items were presented at study, the category of the items themselves, or the semantic similarity between pairs of items.
Each conditional response probability (CRP) analysis involves calculating the probability of some type of transition event. For the lag-CRP analysis, transition events of interest are the different lags between serial positions of items recalled adjacent to one another. Similar analyses focus not on the serial position in which items are presented, but the properties of the items themselves. A semantic-CRP analysis calculates the probability of transitions between items in different semantic relatedness bins. A special case of this analysis is when item pairs are placed into one of two bins, depending on whether they are in the same stimulus category or not. In Psifr, this is referred to as a category-CRP analysis.
Lag-CRP¶
In all CRP analyses, transition probabilities are calculated conditional on a given transition being available. For example, in a six-item list, if the items 6, 1, and 4 have been recalled, then possible items that could have been recalled next are 2, 3, or 5; therefore, possible lags at that point in the recall sequence are -2, -1, or +1. The number of actual transitions observed for each lag is divided by the number of times that lag was possible, to obtain the CRP for each lag.
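The six-item example above can be worked through in a few lines of plain Python. This is a minimal sketch that assumes the recall sequence contains only valid serial positions with no repeats or intrusions. The helper names `possible_lags` and `count_lags` are hypothetical and are not Psifr functions.

```python
from collections import Counter


def possible_lags(list_length, recalled):
    """Lags open for the next recall, given serial positions recalled so far."""
    current = recalled[-1]
    remaining = [p for p in range(1, list_length + 1) if p not in recalled]
    return sorted(p - current for p in remaining)


def count_lags(list_length, recalls):
    """Tally actual and possible lags over one recall sequence.

    Assumes recalls holds valid serial positions with no repeats
    or intrusions.
    """
    actual, possible = Counter(), Counter()
    for i in range(1, len(recalls)):
        possible.update(possible_lags(list_length, recalls[:i]))
        actual[recalls[i] - recalls[i - 1]] += 1
    return actual, possible


# After recalling items 6, 1, and 4, items 2, 3, and 5 remain.
print(possible_lags(6, [6, 1, 4]))  # [-2, -1, 1]

# CRP for each lag that was possible at least once in this sequence.
actual, possible = count_lags(6, [6, 1, 4])
crp = {lag: actual[lag] / possible[lag] for lag in sorted(possible)}
print(crp)
```

In a full analysis the counts would be summed over all lists for a subject before dividing.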
First, load some sample data and create a merged DataFrame:
In [1]: import pandas as pd
   ...: from psifr import fr
In [2]: df = fr.sample_data('Morton2013')
In [3]: data = fr.merge_free_recall(df, study_keys=['category'])
Next, call lag_crp() to calculate conditional response probability as a function of lag.
In [4]: crp = fr.lag_crp(data)
In [5]: crp
Out[5]:
prob actual possible
subject lag
1 -23.0 0.020833 1 48
-22.0 0.035714 3 84
-21.0 0.026316 3 114
-20.0 0.024000 3 125
-19.0 0.014388 2 139
... ... ... ...
47 19.0 0.061224 3 49
20.0 0.055556 2 36
21.0 0.045455 1 22
22.0 0.071429 1 14
23.0 0.000000 0 6
[1880 rows x 3 columns]
The results show the count of times a given transition actually happened in the observed recall sequences (actual) and the number of times a transition could have occurred (possible). Finally, the prob column gives the estimated probability of a given transition occurring, calculated by dividing the actual count by the possible count.
Use plot_lag_crp() to display the results:
In [6]: g = fr.plot_lag_crp(crp)
The peaks at small lags (e.g., +1 and -1) indicate that the recall sequences show evidence of a temporal contiguity effect; that is, items presented near to one another in the list are more likely to be recalled successively than items that are distant from one another in the list.
Lag rank¶
We can summarize the tendency to group together nearby items using a lag rank analysis. For each recall, this determines the absolute lag of all remaining items available for recall and then calculates their percentile rank. Then the rank of the actual transition made is taken, scaled to vary between 0 (furthest item chosen) and 1 (nearest item chosen). Chance clustering will be 0.5; clustering above that value is evidence of a temporal contiguity effect.
In [7]: ranks = fr.lag_rank(data)
In [8]: ranks
Out[8]:
rank
subject
1 0.610953
2 0.635676
3 0.612607
4 0.667090
5 0.643923
... ...
43 0.554024
44 0.561005
45 0.598151
46 0.652748
47 0.621245
[40 rows x 1 columns]
In [9]: ranks.agg(['mean', 'sem'])
Out[9]:
rank
mean 0.624699
sem 0.006732
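The rank for a single transition, as described at the start of this section, can be sketched as follows. This is one plausible way to compute it, assuming tied absolute lags share their average rank and that the rank is undefined when only one item remains. The function name and signature are hypothetical and differ from the library's lag_rank(), which operates on a whole DataFrame.

```python
def lag_rank(prev, curr, remaining):
    """Rank of the chosen transition: 1 = nearest item, 0 = furthest.

    prev, curr: serial positions of the previous and current recall.
    remaining: serial positions still available (must include curr).
    """
    actual = abs(curr - prev)
    possible = [abs(p - prev) for p in remaining]
    if len(possible) < 2:
        return None  # undefined when there is only one option
    below = sum(lag < actual for lag in possible)
    tied = sum(lag == actual for lag in possible)
    rank = below + (tied + 1) / 2  # 1-based rank, averaging ties
    percentile = (rank - 1) / (len(possible) - 1)
    return 1 - percentile


print(lag_rank(1, 2, [2, 3, 4, 5]))  # 1.0, nearest item chosen
print(lag_rank(1, 5, [2, 3, 4, 5]))  # 0.0, furthest item chosen
print(lag_rank(4, 5, [2, 3, 5]))     # 0.75, tied with item 3 for nearest
```

Averaging these per-transition ranks within a subject gives a single summary score of the kind shown in the table above.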
Category CRP¶
If there are multiple categories or conditions of trials in a list, we can test whether participants tend to successively recall items from the same category. The category-CRP estimates the probability of successively recalling two items from the same category.
In [10]: cat_crp = fr.category_crp(data, category_key='category')
In [11]: cat_crp
Out[11]:
prob actual possible
subject
1 0.801147 419 523
2 0.733456 399 544
3 0.763158 377 494
4 0.814882 449 551
5 0.877273 579 660
... ... ... ...
43 0.809187 458 566
44 0.744376 364 489
45 0.763780 388 508
46 0.763573 436 571
47 0.806907 514 637
[40 rows x 3 columns]
In [12]: cat_crp[['prob']].agg(['mean', 'sem'])
Out[12]:
prob
mean 0.782693
sem 0.006262
The expected probability due to chance depends on the number of categories in the list. In this case, there are three categories, so a category CRP of 0.33 would be predicted if recalls were sampled randomly from the list.
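The actual and possible counts behind a category CRP can be illustrated with a short sketch. It assumes a transition counts as possible whenever at least one not-yet-recalled item shares the category of the previous recall, and that the sequence contains no repeats or intrusions. The function name `count_category` is hypothetical here, and the code is not Psifr's implementation.

```python
def count_category(categories, recalls):
    """Count actual and possible within-category transitions for one list.

    categories: category label for each serial position (index 0 is
        position 1).
    recalls: serial positions of valid, non-repeated recalls in order.
    """
    actual = possible = 0
    remaining = set(range(1, len(categories) + 1)) - set(recalls[:1])
    for prev, curr in zip(recalls, recalls[1:]):
        prev_category = categories[prev - 1]
        # Possible if any remaining item matches the previous category.
        if any(categories[p - 1] == prev_category for p in remaining):
            possible += 1
        if categories[curr - 1] == prev_category:
            actual += 1
        remaining.discard(curr)
    return actual, possible


categories = ["a", "a", "b", "b"]
print(count_category(categories, [1, 2, 3, 4]))  # (2, 2): fully clustered
print(count_category(categories, [1, 3, 2, 4]))  # (0, 2): no clustering
```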
Restricting analysis to specific items¶
Sometimes you may want to focus an analysis on a subset of recalls. For example, in order to exclude the period of high clustering commonly observed at the start of recall, lag-CRP analyses are sometimes restricted to transitions after the first three output positions.
You can restrict the recalls included in a transition analysis using the optional item_query argument. This is built on the Pandas query/eval system, which makes it possible to select rows of a DataFrame using a query string. This string can refer to any column in the data. Any items for which the expression evaluates to True will be included in the analysis.
For example, we can use the item_query argument to exclude any items recalled in the first three output positions from analysis. Note that, because non-recalled items have no output position, we need to include them explicitly using output > 3 or not recall.
In [13]: crp_op3 = fr.lag_crp(data, item_query='output > 3 or not recall')
In [14]: g = fr.plot_lag_crp(crp_op3)
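The need for the `or not recall` clause follows from how comparisons with NaN behave: any ordering comparison involving NaN is False, so `output > 3` alone would drop every item that was never recalled. The plain-Python analogue below uses hypothetical rows to show the difference; pandas query strings apply the same comparison rule.

```python
nan = float("nan")

# Hypothetical items from one list; output is NaN if never recalled.
items = [
    {"input": 1, "output": 2.0, "recall": True},
    {"input": 2, "output": nan, "recall": False},
    {"input": 3, "output": 1.0, "recall": True},
    {"input": 4, "output": 4.0, "recall": True},
    {"input": 5, "output": 5.0, "recall": True},
]

# NaN > 3 is False, so the never-recalled item is silently dropped.
only_output = [row["input"] for row in items if row["output"] > 3]

# Adding "or not recall" keeps it among the included items.
with_unrecalled = [
    row["input"] for row in items
    if row["output"] > 3 or not row["recall"]
]

print(only_output)      # [4, 5]
print(with_unrecalled)  # [2, 4, 5]
```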
Restricting analysis to specific transitions¶
In other cases, you may want to focus an analysis on a subset of transitions based on some criteria. For example, if a list contains items from different categories, it is a good idea to take this into account when measuring temporal clustering using a lag-CRP analysis. One approach is to separately analyze within- and across-category transitions.
Transitions can be selected for inclusion using the optional test_key and test inputs. The test_key indicates a column of the data to use for testing transitions; for example, here we will use the category column. The test input should be a function that takes in the test value of the previous recall and the current recall and returns True or False to indicate whether that transition should be included. Here, we will use a lambda (anonymous) function to define the test.
In [15]: crp_within = fr.lag_crp(data, test_key='category', test=lambda x, y: x == y)
In [16]: crp_across = fr.lag_crp(data, test_key='category', test=lambda x, y: x != y)
In [17]: crp_combined = pd.concat([crp_within, crp_across], keys=['within', 'across'], axis=0)
In [18]: crp_combined.index.set_names('transition', level=0, inplace=True)
In [19]: g = fr.plot_lag_crp(crp_combined, hue='transition').add_legend()
The within curve shows the lag-CRP for transitions between items of the same category, while the across curve shows transitions between items of different categories.
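To illustrate what a transition test does to the counts, here is a simplified pure-Python sketch for one list. It rests on an assumption that is not confirmed by the text above: that the test is applied both to the actual transition and to each candidate item when counting possible lags. The function name `count_lags_tested` is hypothetical.

```python
from collections import Counter


def count_lags_tested(categories, recalls, test):
    """Count actual and possible lags for transitions that pass a test.

    categories: category label per serial position (index 0 is position 1).
    recalls: serial positions of valid, non-repeated recalls in order.
    test: function of (previous value, candidate value) returning a bool.
    """
    actual, possible = Counter(), Counter()
    remaining = set(range(1, len(categories) + 1)) - set(recalls[:1])
    for prev, curr in zip(recalls, recalls[1:]):
        prev_category = categories[prev - 1]
        if test(prev_category, categories[curr - 1]):
            actual[curr - prev] += 1
            # Assumption: only candidates passing the test are possible.
            for p in remaining:
                if test(prev_category, categories[p - 1]):
                    possible[p - prev] += 1
        remaining.discard(curr)
    return actual, possible


categories = ["a", "a", "b", "b"]
recalls = [1, 2, 3, 4]
within = count_lags_tested(categories, recalls, lambda x, y: x == y)
across = count_lags_tested(categories, recalls, lambda x, y: x != y)
print(within)
print(across)
```

In this toy sequence, the two within-category transitions both have lag +1, while the single across-category transition (item 2 to item 3) could have gone to either item 3 or item 4.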
Comparing conditions¶
When analyzing a dataset, it’s often important to compare different experimental conditions. Psifr is built on the Pandas DataFrame, which has powerful ways of splitting data and applying operations to it. This makes it possible to analyze and plot different conditions using very little code.
Working with custom columns¶
First, load some sample data and create a merged DataFrame:
In [1]: from psifr import fr
In [2]: df = fr.sample_data('Morton2013')
In [3]: data = fr.merge_free_recall(
...: df, study_keys=['category'], list_keys=['list_type']
...: )
...:
In [4]: data.head()
Out[4]:
subject list item input ... repeat intrusion list_type category
0 1 1 TOWEL 1.0 ... 0 False pure obj
1 1 1 LADLE 2.0 ... 0 False pure obj
2 1 1 THERMOS 3.0 ... 0 False pure obj
3 1 1 LEGO 4.0 ... 0 False pure obj
4 1 1 BACKPACK 5.0 ... 0 False pure obj
[5 rows x 11 columns]
The merge_free_recall() function only includes columns from the raw data if they are one of the standard columns or if they’ve explicitly been included using study_keys, recall_keys, or list_keys. list_keys apply to all events in a list, while study_keys and recall_keys are relevant only for study and recall events, respectively.
We’ve included a list key here, to indicate that the list_type field should be included for all study and recall events in each list, even intrusions. The category field will be included for all study events and all valid recalls. Intrusions will have an undefined category.
Analysis by condition¶
Now we can run any analysis separately for the different conditions. We’ll use the serial position curve analysis as an example.
In [5]: spc = data.groupby('list_type').apply(fr.spc)
In [6]: spc.head()
Out[6]:
recall
list_type subject input
mixed 1 1.0 0.500000
2.0 0.466667
3.0 0.600000
4.0 0.300000
5.0 0.333333
The spc DataFrame has separate groups with the results for each list_type.
Warning
When using groupby with order-based analyses like lag_crp(), make sure all recalls in all recall sequences for a given list have the same label. Otherwise, you will be breaking up recall sequences, which could result in an invalid analysis.
Plotting by condition¶
We can then plot a separate curve for each condition. All plotting functions take optional hue, col, col_wrap, and row inputs that can be used to divide up data when plotting. See the Seaborn documentation for details. Most inputs to seaborn.relplot() are supported.
For example, we can plot two curves for the different list types:
In [7]: g = fr.plot_spc(spc, hue='list_type').add_legend()
We can also plot the curves in different axes using the col option:
In [8]: g = fr.plot_spc(spc, col='list_type')
We can also plot all combinations of two conditions:
In [9]: spc_split = data.groupby(['list_type', 'category']).apply(fr.spc)
In [10]: g = fr.plot_spc(spc_split, col='list_type', row='category')
Plotting by subject¶
All analyses can be plotted separately by subject. A nice way to do this is using the col and col_wrap optional inputs, to make a grid of plots with 6 columns per row:
In [11]: g = fr.plot_spc(
....: spc, hue='list_type', col='subject', col_wrap=6, height=2
....: ).add_legend()
....:
Tutorials¶
See the psifr-notebooks project for a set of Jupyter notebooks with sample code. These examples go more in depth into the options available for each analysis and how they can be used for advanced analyses such as conditionalizing CRP analysis on specific transitions.
API reference¶
Free recall analysis¶
Managing data¶
- Merge standard free recall events.
- Merge study and recall events together for each list.
- Filter data to get a subset of trials.
- Reset list index in a DataFrame.
- Convert free recall data from one phase to split format.
Recall probability¶
- Serial position curve.
- Probability of recall by serial position and output position.
Transition probability¶
- Lag-CRP for multiple subjects.
- Conditional response probability of within-category transitions.
- Conditional response probability by distance bin.
Transition rank¶
- Calculate rank of the absolute lags in free recall lists.
- Calculate rank of transition distances in free recall lists.
Plotting¶
- Plot recalls in a raster plot.
- Plot a serial position curve.
- Plot conditional response probability by lag.
- Plot response probability by distance bin.
- Plot points as a swarm plus mean with error bars.
Measures¶
Transition measure base class¶
- Measure of free recall dataset with multiple subjects.
- Get relevant fields and split by list.
- Analyze a free recall dataset with multiple subjects.
- Analyze a single subject.
Transition measures¶
- Measure recall probability by input and output position.
- Measure conditional response probability by lag.
- Measure lag rank of transitions.
- Measure conditional response probability by category transition.
- Measure conditional response probability by distance.
- Measure transition rank by distance.
Transitions¶
Counting transitions¶
- Count actual and possible serial position lags.
- Count within-category transitions.
- Count transitions within distance bins.
Ranking transitions¶
- Get percentile rank of a score compared to possible scores.
- Calculate rank of absolute lag for free recall lists.
- Calculate percentile rank of transition distances.
Iterating over transitions¶
- Iterate over transitions with masking.
Outputs¶
Counting recalls by serial position and output position¶
- Count actual and possible recalls for each output position.
Iterating over output positions¶
- Iterate over valid outputs.
Development¶
Transitions¶
Psifr has a core set of tools for analyzing transitions in free recall data. These tools focus on measuring what transitions actually occurred, and which transitions were possible given the order in which participants recalled items.
Actual and possible transitions¶
Calculating a conditional response probability involves two parts: the frequency with which a given event actually occurred in the data and the frequency with which it could have occurred. The frequency of possible events is calculated conditional on the recalls that have been made leading up to each transition. For example, a transition between item \(i\) and item \(j\) is not considered “possible” in a CRP analysis if item \(i\) was never recalled. The transition is also not considered “possible” if, when item \(i\) is recalled, item \(j\) has already been recalled previously.
Repeated recall events are typically excluded from the counts of both actual and possible transition events. That is, the transition event frequencies are conditional on the transition not being either to or from a repeated item.
Calculating a CRP measure involves tallying how many transitions of a given type were made during a free recall test. For example, one common measure is the serial position lag between items. For a list of length \(N\), possible lags are in the range \([-N+1, N-1]\). Because repeats are excluded, a lag of zero is never possible. The count of actual and possible transitions for each lag is calculated first, and then the CRP for each lag is calculated as the actual count divided by the possible count.
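In symbols, writing \(A(l)\) for the number of actual transitions at lag \(l\) and \(P(l)\) for the number of times that lag was possible (notation introduced here only for illustration), the calculation described above is:

\[
\mathrm{CRP}(l) = \frac{A(l)}{P(l)}, \qquad l \in \{-N+1, \ldots, -1, 1, \ldots, N-1\}
\]

The CRP is undefined for any lag with \(P(l) = 0\).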
The transitions masker¶
The psifr.transitions.transitions_masker() is a generator that makes it simple to iterate over transitions while “masking” out events such as intrusions of items not on the list and repeats of items that have already been recalled.
On each step of the iterator, the previous, current, and possible items are yielded. The previous item is the item being transitioned from. The current item is the item being transitioned to. The possible items are an array of all items that were valid to be recalled next, given the recall sequence up to (but not including) the current item. The current item is therefore itself one of the possible items.
In [1]: from psifr.transitions import transitions_masker
In [2]: pool = [1, 2, 3, 4, 5, 6]
In [3]: recs = [6, 2, 3, 6, 1, 4]
In [4]: masker = transitions_masker(pool_items=pool, recall_items=recs,
...: pool_output=pool, recall_output=recs)
...:
In [5]: for prev, curr, poss in masker:
...: print(prev, curr, poss)
...:
6 2 [1 2 3 4 5]
2 3 [1 3 4 5]
1 4 [4 5]
Only valid transitions are yielded, so the code for a specific analysis only needs to calculate the transition measure of interest and count the number of actual and possible transitions in each bin of interest.
Four inputs are required:
- pool_items
List of identifiers for all items available for recall. Identifiers can be anything that is unique to each item in the list (e.g., serial position, a string representation of the item, an index in the stimulus pool).
- recall_items
List of identifiers for the sequence of recalls, in order. Valid recalls must match an item in pool_items. Other items are considered intrusions.
- pool_output
Output codes for each item in the pool. This should be whatever you need to calculate your transition measure.
- recall_output
Output codes for each recall in the sequence of recalls.
By using different values for these four inputs and defining different transition measures, a wide range of analyses can be implemented.
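The masking behavior described in this section can be approximated in a few lines of plain Python. The sketch below is a simplified stand-in, not the library's implementation: the name `simple_masker` is hypothetical, it takes only the item identifiers, and it yields plain lists rather than arrays. It treats any recall that is not currently available (an intrusion or a repeat) as masked, and skips transitions to or from a masked recall.

```python
def simple_masker(pool_items, recall_items):
    """Yield (previous, current, possible) for each valid transition.

    Simplified illustration: output codes are the item identifiers.
    """
    remaining = list(pool_items)
    prev = None  # previous valid recall, or None after a masked event
    for item in recall_items:
        if item not in remaining:
            # Intrusion or repeat: mask it and break the chain.
            prev = None
            continue
        if prev is not None:
            yield prev, item, list(remaining)
        remaining.remove(item)
        prev = item


pool = [1, 2, 3, 4, 5, 6]
recs = [6, 2, 3, 6, 1, 4]
for prev, curr, poss in simple_masker(pool, recs):
    print(prev, curr, poss)
```

For the same pool and recall sequence as the example above, this yields the same three transitions: 6 to 2, 2 to 3, and 1 to 4. The repeat of item 6 removes both the transition into it and the transition out of it.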