Psifr documentation

In free recall, participants study a list of items and then name all of the items they can remember, in any order they choose. Many sophisticated analyses have been developed for free recall data, but they are often complicated and difficult to implement.

Psifr leverages the Pandas data analysis package to make precise and flexible analysis of free recall data faster and easier.

See the code repository for version release notes.

Installation

You can install the latest stable version of Psifr using pip:

pip install psifr

You can also install the development version directly from the code repository on GitHub:

pip install git+https://github.com/mortonne/psifr

User guide

Importing data

In Psifr, free recall data are imported in the form of a “long” format table. Each row corresponds to one study or recall event. Study events include any time an item was presented to the participant. Recall events correspond to any recall attempt; this includes repeats of items that were already recalled and intrusions of items that were not present in the study list.

This type of information is well represented in a CSV spreadsheet, though any file format supported by pandas may be used for input. To import from a CSV, use pandas.read_csv(). For example:

import pandas as pd
data = pd.read_csv("my_data.csv")

Trial information

The basic information that must be included for each event is the following:

subject

Some code (numeric or string) indicating individual participants. Must be unique for a given experiment. For example, sub-101.

list

Numeric code indicating individual lists. Must be unique within subject.

trial_type

String indicating whether each event is a study event or a recall event.

position

Integer indicating position within a given phase of the list. For study events, this corresponds to input position (also referred to as serial position). For recall events, this corresponds to output position.

item

Individual thing being recalled, such as a word. May be specified with text (e.g., pumpkin, Jack Nicholson) or a numeric code (682, 121). Either way, the text or number must be unique to that item. Text is easier to read and requires no additional information to interpret, so it is preferred when available.

Example

Sample data

subject  list  trial_type  position  item
1        1     study       1         absence
1        1     study       2         hollow
1        1     study       3         pupil
1        1     recall      1         pupil
1        1     recall      2         absence

Additional information

Additional fields may be included in the data to indicate other aspects of the experiment, such as presentation time, stimulus category, experimental session, distraction length, etc. All of these fields can then be used for analysis in Psifr.
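
As a sketch, a long-format table with an extra category column can be built directly in pandas. The column names follow the scheme above; the items and category labels here are made up for illustration:

```python
import pandas as pd

# minimal long-format free recall table with an extra 'category' column
events = pd.DataFrame({
    'subject': [1, 1, 1, 1],
    'list': [1, 1, 1, 1],
    'trial_type': ['study', 'study', 'recall', 'recall'],
    'position': [1, 2, 1, 2],
    'item': ['absence', 'hollow', 'hollow', 'absence'],
    'category': ['noun', 'noun', 'noun', 'noun'],
})
```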

Scoring data

After importing free recall data, we have a DataFrame with a row for each study event and a row for each recall event. Next, we need to score the data by matching study events with recall events.

Scoring list recall

First, let’s create a simple sample dataset with two lists. We can use the table_from_lists() convenience function to create a sample dataset with a given set of study lists and recalls:

In [1]: from psifr import fr

In [2]: list_subject = [1, 1]

In [3]: study_lists = [['absence', 'hollow', 'pupil'], ['fountain', 'piano', 'pillow']]

In [4]: recall_lists = [['pupil', 'absence', 'empty'], ['pillow', 'pupil', 'pillow']]

In [5]: data = fr.table_from_lists(list_subject, study_lists, recall_lists)

In [6]: data
Out[6]: 
    subject  list trial_type  position      item
0         1     1      study         1   absence
1         1     1      study         2    hollow
2         1     1      study         3     pupil
3         1     1     recall         1     pupil
4         1     1     recall         2   absence
5         1     1     recall         3     empty
6         1     2      study         1  fountain
7         1     2      study         2     piano
8         1     2      study         3    pillow
9         1     2     recall         1    pillow
10        1     2     recall         2     pupil
11        1     2     recall         3    pillow

Next, we’ll merge together the study and recall events by matching up corresponding events:

In [7]: merged = fr.merge_free_recall(data)

In [8]: merged
Out[8]: 
   subject  list      item  input  ...  repeat  intrusion  prior_list  prior_input
0        1     1   absence    1.0  ...       0      False         NaN          NaN
1        1     1    hollow    2.0  ...       0      False         NaN          NaN
2        1     1     pupil    3.0  ...       0      False         NaN          NaN
3        1     1     empty    NaN  ...       0       True         NaN          NaN
4        1     2  fountain    1.0  ...       0      False         NaN          NaN
5        1     2     piano    2.0  ...       0      False         NaN          NaN
6        1     2    pillow    3.0  ...       0      False         NaN          NaN
7        1     2    pillow    3.0  ...       1      False         NaN          NaN
8        1     2     pupil    NaN  ...       0       True         1.0          3.0

[9 rows x 11 columns]

For each item, there is one row for each unique combination of input and output position. For example, if an item is presented once in the list, but is recalled multiple times, there is one row for each of the recall attempts. Repeated recalls are indicated by the repeat column, which is greater than zero for recalls of an item after the first. Unique study events are indicated by the study column; this excludes intrusions and repeated recalls.

Items that were not recalled have the recall column set to False. Because they were not recalled, they have no defined output position, so output is set to NaN. Finally, intrusions have an output position but no input position because they did not appear in the list. There is an intrusion field for convenience to label these recall attempts. The prior_list and prior_input fields give information about prior-list intrusions (PLIs). The prior_list field gives the list where the item appeared and prior_input indicates the position in which it was presented on that list.
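
These fields make it easy to tally recall errors with ordinary pandas operations. A minimal sketch, using a hand-built stand-in for a merged table (the rows are made up; only the relevant columns are included):

```python
import pandas as pd

# stand-in for a merged DataFrame with the fields described above
merged = pd.DataFrame({
    'subject': [1, 1, 1, 1],
    'list': [1, 1, 1, 1],
    'item': ['absence', 'hollow', 'empty', 'absence'],
    'recall': [True, True, True, True],
    'repeat': [0, 0, 0, 1],
    'intrusion': [False, False, True, False],
})
n_intrusions = merged['intrusion'].sum()       # recalls of unstudied items
n_repeats = (merged['repeat'] > 0).sum()       # recalls after the first
```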

merge_free_recall() can also handle additional attributes beyond the standard ones, such as codes indicating stimulus category or list condition. See Working with custom columns for details.

Filtering and sorting

Now that we have a merged DataFrame, we can use Pandas methods to quickly get different views of the data. For some analyses, we may want to organize in terms of the study list by removing repeats and intrusions. Because our data are in a DataFrame, we can use the query() method:

In [9]: merged.query('study')
Out[9]: 
   subject  list      item  input  ...  repeat  intrusion  prior_list  prior_input
0        1     1   absence    1.0  ...       0      False         NaN          NaN
1        1     1    hollow    2.0  ...       0      False         NaN          NaN
2        1     1     pupil    3.0  ...       0      False         NaN          NaN
4        1     2  fountain    1.0  ...       0      False         NaN          NaN
5        1     2     piano    2.0  ...       0      False         NaN          NaN
6        1     2    pillow    3.0  ...       0      False         NaN          NaN

[6 rows x 11 columns]

Alternatively, we may also want to get just the recall events, sorted by output position instead of input position:

In [10]: merged.query('recall').sort_values(['list', 'output'])
Out[10]: 
   subject  list     item  input  ...  repeat  intrusion  prior_list  prior_input
2        1     1    pupil    3.0  ...       0      False         NaN          NaN
0        1     1  absence    1.0  ...       0      False         NaN          NaN
3        1     1    empty    NaN  ...       0       True         NaN          NaN
6        1     2   pillow    3.0  ...       0      False         NaN          NaN
8        1     2    pupil    NaN  ...       0       True         1.0          3.0
7        1     2   pillow    3.0  ...       1      False         NaN          NaN

[6 rows x 11 columns]

Note that we first sort by list, then output position, to keep the lists together.

Recall performance

First, load some sample data and create a merged DataFrame:

In [1]: from psifr import fr

In [2]: df = fr.sample_data('Morton2013')

In [3]: data = fr.merge_free_recall(df)
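
Before plotting, it can help to check overall performance. Study events in the merged table carry a boolean recall column, so the proportion of studied items recalled per subject is one groupby away. A sketch with made-up rows in place of the sample data:

```python
import pandas as pd

# stand-in for a merged DataFrame (study and recall are standard columns)
merged = pd.DataFrame({
    'subject': [1, 1, 1, 2, 2, 2],
    'study': [True, True, True, True, True, True],
    'recall': [True, False, True, True, True, False],
})
# proportion of studied items recalled, per subject
p_recall = merged.query('study').groupby('subject')['recall'].mean()
```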

Raster plot

Raster plots can give you a quick overview of a whole dataset [RKT16]. We’ll look at all of the first subject’s recalls. This will plot every individual recall, colored by the serial position of the recalled item in the list. Items near the end of the list are shown in yellow, and items near the beginning of the list are shown in purple. Intrusions of items not on the list are shown in red.

In [4]: subj = fr.filter_data(data, 1)

In [5]: g = fr.plot_raster(subj).add_legend()
_images/raster_subject.svg

Serial position curve

We can calculate average recall for each serial position [Mur62] using spc() and plot using plot_spc().

In [6]: recall = fr.spc(data)

In [7]: g = fr.plot_spc(recall)
_images/spc.svg

Using the same plotting function, we can plot the curve for each individual subject:

In [8]: g = fr.plot_spc(recall, col='subject', col_wrap=5)
_images/spc_indiv.svg

Probability of Nth recall

We can also split up recalls by output position, to test, for example, how likely participants were to initiate recall with the last item on the list.

In [9]: prob = fr.pnr(data)

In [10]: prob
Out[10]: 
                          prob  actual  possible
subject output input                            
1       1      1      0.000000       0        48
               2      0.020833       1        48
               3      0.000000       0        48
               4      0.000000       0        48
               5      0.000000       0        48
...                        ...     ...       ...
47      24     20          NaN       0         0
               21          NaN       0         0
               22          NaN       0         0
               23          NaN       0         0
               24          NaN       0         0

[23040 rows x 3 columns]

This gives us the probability of recall by output position ('output') and serial or input position ('input'). This is a lot to look at all at once, so it may be useful to plot just the first three output positions. We can plot the curves using plot_spc(), which takes an optional hue input to specify a variable to use to split the data into curves of different colors.

In [11]: pfr = prob.query('output <= 3')

In [12]: g = fr.plot_spc(pfr, hue='output').add_legend()
_images/pnr.svg

This plot shows what items tend to be recalled early in the recall sequence.

Prior-list intrusions

Participants will sometimes accidentally recall items from prior lists; these recalls are known as prior-list intrusions (PLIs). To better understand how prior-list intrusions are happening, you can look at how many lists back those items were originally presented.

First, you need to choose a maximum list lag that you will consider. This determines which lists will be included in the analysis. For example, if you have a maximum lag of 3, then the first 3 lists will be excluded from the analysis. This ensures that each included list can potentially have intrusions of each possible list lag.

In [13]: pli = fr.pli_list_lag(data, max_lag=3)

In [14]: pli
Out[14]: 
                  count  per_list      prob
subject list_lag                           
1       1             7  0.155556  0.259259
        2             5  0.111111  0.185185
        3             0  0.000000  0.000000
2       1             9  0.200000  0.191489
        2             2  0.044444  0.042553
...                 ...       ...       ...
46      2             1  0.022222  0.100000
        3             0  0.000000  0.000000
47      1             5  0.111111  0.277778
        2             1  0.022222  0.055556
        3             0  0.000000  0.000000

[120 rows x 3 columns]

In [15]: pli.groupby('list_lag').agg(['mean', 'sem'])
Out[15]: 
         count            per_list                prob          
          mean       sem      mean       sem      mean       sem
list_lag                                                        
1         5.55  0.547664  0.123333  0.012170  0.210631  0.014726
2         1.35  0.230801  0.030000  0.005129  0.043458  0.007032
3         0.75  0.174496  0.016667  0.003878  0.023385  0.005602

The analysis returns a raw count of intrusions at each lag (count), the count divided by the number of included lists (per_list), and the probability of a given intrusion coming from a given lag (prob). In the sample dataset, recently presented items (i.e., with lower list lag) are more likely to be intruded.

Recall order

A key advantage of free recall is that it provides information not only about which items are recalled, but also the order in which they are recalled. A number of analyses have been developed to characterize different influences on recall order, such as the temporal order in which the items were presented at study, the category of the items themselves, or the semantic similarity between pairs of items.

Each conditional response probability (CRP) analysis involves calculating the probability of some type of transition event. For the lag-CRP analysis, transition events of interest are the different lags between serial positions of items recalled adjacent to one another. Similar analyses focus not on the serial position in which items are presented, but the properties of the items themselves. A semantic-CRP analysis calculates the probability of transitions between items in different semantic relatedness bins. A special case of this analysis is when item pairs are placed into one of two bins, depending on whether they are in the same stimulus category or not. In Psifr, this is referred to as a category-CRP analysis.

Lag-CRP

In all CRP analyses, transition probabilities are calculated conditional on a given transition being available [Kah96]. For example, in a six-item list, if the items 6, 1, and 4 have been recalled, then possible items that could have been recalled next are 2, 3, or 5; therefore, possible lags at that point in the recall sequence are -2, -1, or +1. The number of actual transitions observed for each lag is divided by the number of times that lag was possible, to obtain the CRP for each lag.
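
Under simplifying assumptions (each position recalled at most once, no intrusions), the counting logic can be sketched in plain Python. The function name is hypothetical; lag_crp() handles the general case:

```python
from collections import Counter

def lag_counts(list_length, recalls):
    """Tally actual and possible lags over one recall sequence.

    Simplified sketch: assumes each serial position is recalled at most
    once and that there are no intrusions.
    """
    actual, possible = Counter(), Counter()
    remaining = set(range(1, list_length + 1))
    prev = None
    for pos in recalls:
        if prev is not None:
            # every not-yet-recalled item defines a possible lag
            for item in remaining:
                possible[item - prev] += 1
            actual[pos - prev] += 1
        remaining.discard(pos)
        prev = pos
    return actual, possible
```

For the six-item example above, after recalling 6, 1, and 4, the remaining items 2, 3, and 5 yield possible lags of -2, -1, and +1 for the next transition.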

First, load some sample data and create a merged DataFrame:

In [1]: from psifr import fr

In [2]: df = fr.sample_data('Morton2013')

In [3]: data = fr.merge_free_recall(df, study_keys=['category'])

Next, call lag_crp() to calculate conditional response probability as a function of lag.

In [4]: crp = fr.lag_crp(data)

In [5]: crp
Out[5]: 
                   prob  actual  possible
subject lag                              
1       -23.0  0.020833       1        48
        -22.0  0.035714       3        84
        -21.0  0.026316       3       114
        -20.0  0.024000       3       125
        -19.0  0.014388       2       139
...                 ...     ...       ...
47       19.0  0.061224       3        49
         20.0  0.055556       2        36
         21.0  0.045455       1        22
         22.0  0.071429       1        14
         23.0  0.000000       0         6

[1880 rows x 3 columns]

The results show the count of times a given transition actually happened in the observed recall sequences (actual) and the number of times a transition could have occurred (possible). Finally, the prob column gives the estimated probability of a given transition occurring, calculated by dividing the actual count by the possible count.

Use plot_lag_crp() to display the results:

In [6]: g = fr.plot_lag_crp(crp)
_images/lag_crp.svg

The peaks at small lags (e.g., +1 and -1) indicate that the recall sequences show evidence of a temporal contiguity effect; that is, items presented near to one another in the list are more likely to be recalled successively than items that are distant from one another in the list.

Compound lag-CRP

The compound lag-CRP was developed to measure how temporal clustering changes as a result of prior clustering during recall [LK14]. That study found evidence that temporal clustering is greater immediately after transitions with short lags than after transitions with long lags. This analysis calculates conditional response probability by lag, but with the additional condition of the lag of the previous transition.

In [7]: crp = fr.lag_crp_compound(data)

In [8]: crp
Out[8]: 
                          prob  actual  possible
subject previous current                        
1       -23.0    -23.0     NaN       0         0
                 -22.0     NaN       0         0
                 -21.0     NaN       0         0
                 -20.0     NaN       0         0
                 -19.0     NaN       0         0
...                        ...     ...       ...
47       23.0     19.0     NaN       0         0
                  20.0     NaN       0         0
                  21.0     NaN       0         0
                  22.0     NaN       0         0
                  23.0     NaN       0         0

[88360 rows x 3 columns]

The results show conditional response probabilities as in the standard lag-CRP analysis, but with two lag columns: previous (the lag of the prior transition) and current (the lag of the current transition).

This is a lot of information, and the sample size for many bins is very small. Following [LK14], we can apply bins to the lag of the previous transition to increase the sample size in each bin. We first sum the actual and possible transition counts, and then calculate the probability of each of the new bins.

In [9]: binned = crp.reset_index()

In [10]: binned.loc[binned['previous'].abs() > 3, 'Previous'] = '|Lag|>3'

In [11]: binned.loc[binned['previous'] == 1, 'Previous'] = 'Lag=+1'

In [12]: binned.loc[binned['previous'] == -1, 'Previous'] = 'Lag=-1'

In [13]: summed = binned.groupby(['subject', 'Previous', 'current'])[['actual', 'possible']].sum()

In [14]: summed['prob'] = summed['actual'] / summed['possible']

In [15]: summed
Out[15]: 
                          actual  possible      prob
subject Previous current                            
1       Lag=+1   -23.0         0         2  0.000000
                 -22.0         0         2  0.000000
                 -21.0         0         4  0.000000
                 -20.0         0         6  0.000000
                 -19.0         1         7  0.142857
...                          ...       ...       ...
47      |Lag|>3   19.0         1        30  0.033333
                  20.0         2        19  0.105263
                  21.0         1        14  0.071429
                  22.0         0         7  0.000000
                  23.0         0         2  0.000000

[5640 rows x 3 columns]

We can then plot the compound lag-CRP using the standard plot_lag_crp() plotting function.

In [16]: g = fr.plot_lag_crp(summed, lag_key='current', hue='Previous').add_legend()
_images/lag_crp_compound.svg

Note that some lags are considered impossible as they would require a repeat of a previously recalled item (e.g., a +1 lag followed by a -1 lag is not possible). For both of the adjacent conditions (+1 and -1), the lag-CRP is sharper compared to the long-lag condition (|lag| > 3). This suggests that there is compound temporal clustering.

Lag rank

We can summarize the tendency to group together nearby items using a lag rank analysis [PNK09]. For each recall, this takes the absolute lag of each item still available for recall and calculates its percentile rank. The rank of the actual transition made is then scaled to vary between 0 (furthest item chosen) and 1 (nearest item chosen). Chance clustering will be 0.5; clustering above that value is evidence of a temporal contiguity effect.
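
One common formulation of the per-transition rank can be sketched as below. This is an assumption about the computation, not psifr's exact code, and the function name is hypothetical; see lag_rank() for the implemented version:

```python
def percentile_rank(actual_lag, possible_lags):
    """Rank the actual transition among possible ones by absolute lag.

    Returns 1 when the nearest available item was chosen and 0 when the
    furthest was chosen; ties (e.g., lags -2 and +2) get the average rank.
    Assumes at least two items were available for recall.
    """
    abs_lags = sorted(abs(lag) for lag in possible_lags)
    tied = [i for i, d in enumerate(abs_lags) if d == abs(actual_lag)]
    mean_rank = sum(tied) / len(tied)
    return 1 - mean_rank / (len(abs_lags) - 1)
```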

In [17]: ranks = fr.lag_rank(data)

In [18]: ranks
Out[18]: 
             rank
subject          
1        0.610953
2        0.635676
3        0.612607
4        0.667090
5        0.643923
...           ...
43       0.554024
44       0.561005
45       0.598151
46       0.652748
47       0.621245

[40 rows x 1 columns]

In [19]: ranks.agg(['mean', 'sem'])
Out[19]: 
          rank
mean  0.624699
sem   0.006732

Category CRP

If there are multiple categories or conditions of trials in a list, we can test whether participants tend to successively recall items from the same category. The category-CRP estimates the probability of successively recalling two items from the same category [PNK09].

In [20]: cat_crp = fr.category_crp(data, category_key='category')

In [21]: cat_crp
Out[21]: 
             prob  actual  possible
subject                            
1        0.801147     419       523
2        0.733456     399       544
3        0.763158     377       494
4        0.814882     449       551
5        0.877273     579       660
...           ...     ...       ...
43       0.809187     458       566
44       0.744376     364       489
45       0.763780     388       508
46       0.763573     436       571
47       0.806907     514       637

[40 rows x 3 columns]

In [22]: cat_crp[['prob']].agg(['mean', 'sem'])
Out[22]: 
          prob
mean  0.782693
sem   0.006262

The expected probability due to chance depends on the number of categories in the list. In this case, there are three categories, so a category CRP of 0.33 would be predicted if recalls were sampled randomly from the list.

Category clustering

A number of measures have been developed to quantify category clustering relative to the level expected by chance, under certain assumptions. Two such measures are list-based clustering (LBC) [SBW+02] and the adjusted ratio of clustering (ARC) [RTB71].

These measures can be calculated using the category_clustering() function.

In [23]: clust = fr.category_clustering(data, category_key='category')

In [24]: clust.agg(['mean', 'sem'])
Out[24]: 
           lbc       arc
mean  2.409398  0.608763
sem   0.127651  0.016809

Both measures are defined such that positive values indicate above-chance clustering. ARC scores have a maximum of 1, while the upper bound of LBC scores depends on the number of categories and the number of items per category in the study list.
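
As a reference point, the ARC formula for a single recall sequence is commonly given as the observed number of same-category transitions relative to chance, scaled so that 1 is perfect clustering. A sketch (a simplified single-sequence version, with a hypothetical function name; category_clustering() is the implemented analysis):

```python
from collections import Counter

def arc(categories):
    """Adjusted ratio of clustering for one sequence of recall categories.

    Sketch of the Roenker et al. (1971) formula: (R - E(R)) / (maxR - E(R)),
    where R counts adjacent same-category recalls, E(R) is the chance
    expectation, and maxR is the maximum possible number of repetitions.
    """
    n = len(categories)
    k = len(set(categories))
    # observed repetitions: adjacent recalls from the same category
    observed = sum(a == b for a, b in zip(categories, categories[1:]))
    counts = Counter(categories)
    expected = sum(m * m for m in counts.values()) / n - 1
    max_rep = n - k
    return (observed - expected) / (max_rep - expected)
```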

Distance CRP

While the category CRP examines clustering based on semantic similarity at a coarse level (i.e., whether two items are in the same category or not), recall may also depend on more nuanced semantic relationships.

Models of semantic knowledge allow the semantic distance between pairs of items to be quantified. If you have such a model defined for your stimulus pool, you can use the distance CRP analysis to examine how semantic distance affects recall transitions [HK02, MP16].

You must first define distances between pairs of items. Here, we use correlation distances based on the wiki2USE model.

In [25]: items, distances = fr.sample_distances('Morton2013')

We also need a column giving the index of each item in the distances matrix. We use pool_index() to create a new column, item_index, containing each item's index in the pool on which the distances matrix is based.

In [26]: data['item_index'] = fr.pool_index(data['item'], items)

Finally, we must define distance bins. Here, we use 10 bins with equally spaced distance percentiles. Note that, when calculating distance percentiles, we use the squareform() function to get only the non-diagonal entries.

In [27]: import numpy as np
   ....: from scipy.spatial.distance import squareform

In [28]: edges = np.percentile(squareform(distances), np.linspace(1, 99, 10))

We can now calculate conditional response probability as a function of distance bin, to examine how response probability varies with semantic distance.

In [29]: dist_crp = fr.distance_crp(data, 'item_index', distances, edges)

In [30]: dist_crp
Out[30]: 
                             bin      prob  actual  possible
subject center                                              
1       0.467532  (0.352, 0.583]  0.085456     151      1767
        0.617748  (0.583, 0.653]  0.067916      87      1281
        0.673656  (0.653, 0.695]  0.062500      65      1040
        0.711075  (0.695, 0.727]  0.051836      48       926
        0.742069  (0.727, 0.757]  0.050633      44       869
...                          ...       ...     ...       ...
47      0.742069  (0.727, 0.757]  0.062822      61       971
        0.770867  (0.757, 0.785]  0.030682      27       880
        0.800404  (0.785, 0.816]  0.040749      37       908
        0.834473  (0.816, 0.853]  0.046651      39       836
        0.897275  (0.853, 0.941]  0.028868      25       866

[360 rows x 4 columns]

Use plot_distance_crp() to display the results:

In [31]: g = fr.plot_distance_crp(dist_crp).set(ylim=(0, 0.1))
_images/distance_crp.svg

Conditional response probability decreases with increasing semantic distance, suggesting that recall order was influenced by the semantic similarity between items. Of course, a complete analysis should address potential confounds such as the category structure of the list. See the Restricting analysis to specific items section for an example of restricting analysis based on category.

Distance rank

Similarly to the lag rank analysis of temporal clustering, we can summarize distance-based clustering (such as semantic clustering) with a single rank measure [PNK09]. The distance rank varies from 0 (the most-distant item is always recalled) to 1 (the closest item is always recalled), with chance clustering corresponding to 0.5.

In [32]: dist_rank = fr.distance_rank(data, 'item_index', distances)

In [33]: dist_rank.agg(['mean', 'sem'])
Out[33]: 
          rank
mean  0.625932
sem   0.003466

Distance rank shifted

As with the compound lag-CRP, we can also examine how recalls before the just-previous one may predict subsequent recalls. To examine whether distances relative to earlier items are predictive of the next recall, we can use a shifted distance rank analysis [MP16].

Here, to account for the category structure of the list, we will only include within-category transitions (see the Restricting analysis to specific items section for details).

In [34]: ranks = fr.distance_rank_shifted(
   ....:     data, 'item_index', distances, 4, test_key='category', test=lambda x, y: x == y
   ....: )
   ....: 

In [35]: ranks
Out[35]: 
                   rank
subject shift          
1       -4     0.518617
        -3     0.492103
        -2     0.516063
        -1     0.579198
2       -4     0.463931
...                 ...
46      -1     0.581420
47      -4     0.504383
        -3     0.526840
        -2     0.504953
        -1     0.586689

[160 rows x 1 columns]

The distance rank is returned for each shift. The -1 shift is the same as the standard distance rank analysis. We can visualize how distance rank changes with shift using seaborn.relplot().

In [36]: import seaborn as sns
   ....: g = sns.relplot(
   ....:     data=ranks.reset_index(), x='shift', y='rank', kind='line', height=3
   ....: ).set(xlabel='Output lag', ylabel='Distance rank', xticks=[-4, -3, -2, -1])
   ....: 
_images/distance_rank_shifted.svg

Restricting analysis to specific items

Sometimes you may want to focus an analysis on a subset of recalls. For example, in order to exclude the period of high clustering commonly observed at the start of recall, lag-CRP analyses are sometimes restricted to transitions after the first three output positions.

You can restrict the recalls included in a transition analysis using the optional item_query argument. This is built on the Pandas query/eval system, which makes it possible to select rows of a DataFrame using a query string. This string can refer to any column in the data. Any items for which the expression evaluates to True will be included in the analysis.

For example, we can use the item_query argument to exclude any items recalled in the first three output positions from analysis. Note that, because non-recalled items have no output position, we need to include them explicitly using output > 3 or not recall.

In [37]: crp_op3 = fr.lag_crp(data, item_query='output > 3 or not recall')

In [38]: g = fr.plot_lag_crp(crp_op3)
_images/lag_crp_op3.svg

Restricting analysis to specific transitions

In other cases, you may want to focus an analysis on a subset of transitions based on some criteria. For example, if a list contains items from different categories, it is a good idea to take this into account when measuring temporal clustering using a lag-CRP analysis [MP17, PEK11]. One approach is to separately analyze within- and across-category transitions.

Transitions can be selected for inclusion using the optional test_key and test inputs. The test_key indicates a column of the data to use for testing transitions; for example, here we will use the category column. The test input should be a function that takes in the test value of the previous recall and the current recall and returns True or False to indicate whether that transition should be included. Here, we will use a lambda (anonymous) function to define the test.

In [39]: crp_within = fr.lag_crp(data, test_key='category', test=lambda x, y: x == y)

In [40]: crp_across = fr.lag_crp(data, test_key='category', test=lambda x, y: x != y)

In [41]: import pandas as pd
   ....: crp_combined = pd.concat([crp_within, crp_across], keys=['within', 'across'], axis=0)

In [42]: crp_combined.index.set_names('transition', level=0, inplace=True)

In [43]: g = fr.plot_lag_crp(crp_combined, hue='transition').add_legend()
_images/lag_crp_cat.svg

The within curve shows the lag-CRP for transitions between items of the same category, while the across curve shows transitions between items of different categories.

Comparing conditions

When analyzing a dataset, it’s often important to compare different experimental conditions. Psifr is built on the Pandas DataFrame, which has powerful ways of splitting data and applying operations to it. This makes it possible to analyze and plot different conditions using very little code.

Working with custom columns

First, load some sample data and create a merged DataFrame:

In [1]: from psifr import fr

In [2]: df = fr.sample_data('Morton2013')

In [3]: data = fr.merge_free_recall(
   ...:     df, study_keys=['category'], list_keys=['list_type']
   ...: )
   ...: 

In [4]: data.head()
Out[4]: 
   subject  list      item  input  ...  list_type  category  prior_list  prior_input
0        1     1     TOWEL    1.0  ...       pure       obj         NaN          NaN
1        1     1     LADLE    2.0  ...       pure       obj         NaN          NaN
2        1     1   THERMOS    3.0  ...       pure       obj         NaN          NaN
3        1     1      LEGO    4.0  ...       pure       obj         NaN          NaN
4        1     1  BACKPACK    5.0  ...       pure       obj         NaN          NaN

[5 rows x 13 columns]

The merge_free_recall() function only includes columns from the raw data if they are one of the standard columns or if they've explicitly been included using study_keys, recall_keys, or list_keys. list_keys apply to all events in a list, while study_keys and recall_keys are relevant only for study and recall events, respectively.

We’ve included a list key here, to indicate that the list_type field should be included for all study and recall events in each list, even intrusions. The category field will be included for all study events and all valid recalls. Intrusions will have an undefined category.

Analysis by condition

Now we can run any analysis separately for the different conditions. We’ll use the serial position curve analysis as an example.

In [5]: spc = data.groupby('list_type').apply(fr.spc)

In [6]: spc.head()
Out[6]: 
                           recall
list_type subject input          
mixed     1       1.0    0.500000
                  2.0    0.466667
                  3.0    0.600000
                  4.0    0.300000
                  5.0    0.333333

The spc DataFrame has separate groups with the results for each list_type.

Warning

When using groupby with order-based analyses like lag_crp(), make sure all recalls in all recall sequences for a given list have the same label. Otherwise, you will be breaking up recall sequences, which could result in an invalid analysis.
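To see why, consider a toy example in plain Python (the recall sequence and labels here are invented for illustration, and the lags helper is not part of Psifr):

```python
# One list's recall sequence as serial positions, with a hypothetical
# per-recall label (e.g., item category).
recalls = [1, 4, 2, 5]
labels = ["a", "a", "b", "b"]

def lags(seq):
    """Serial position lags between successive recalls."""
    return [j - i for i, j in zip(seq, seq[1:])]

# The intact sequence contains three transitions.
print(lags(recalls))  # [3, -2, 3]

# Grouping by a per-recall label first splits the sequence into
# [1, 4] and [2, 5]: the 4 -> 2 transition is silently lost.
by_label = {}
for pos, lab in zip(recalls, labels):
    by_label.setdefault(lab, []).append(pos)
print({lab: lags(seq) for lab, seq in by_label.items()})
# {'a': [3], 'b': [3]}
```

Because the label varies within the recall sequence, grouping by it discards transitions that cross label boundaries, which is why order-based analyses require a constant label per list.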

Plotting by condition

We can then plot a separate curve for each condition. All plotting functions take optional hue, col, col_wrap, and row inputs that can be used to divide up data when plotting. Most inputs to seaborn.relplot() are supported.

For example, we can plot two curves for the different list types:

In [7]: g = fr.plot_spc(spc, hue='list_type').add_legend()

We can also plot the curves in different axes using the col option:

In [8]: g = fr.plot_spc(spc, col='list_type')

We can also plot all combinations of two conditions:

In [9]: spc_split = data.groupby(['list_type', 'category']).apply(fr.spc)

In [10]: g = fr.plot_spc(spc_split, col='list_type', row='category')

Plotting by subject

All analyses can be plotted separately by subject. A nice way to do this is to use the col and col_wrap optional inputs to make a grid of plots with six columns per row:

In [11]: g = fr.plot_spc(
   ....:     spc, hue='list_type', col='subject', col_wrap=6, height=2
   ....: ).add_legend()
   ....: 

Tutorials

See the psifr-notebooks project for a set of Jupyter notebooks with sample code. These examples go into more depth on the options available for each analysis and show how they can be used for advanced analyses, such as conditionalizing a CRP analysis on specific transitions.

API reference

Free recall analysis

Managing data

table_from_lists(subjects, study, recall[, ...])

Create table format data from list format data.

check_data(df)

Run checks on free recall data.

merge_free_recall(data, **kwargs)

Score free recall data by matching up study and recall events.

merge_lists(study, recall[, merge_keys, ...])

Merge study and recall events together for each list.

filter_data(data[, subjects, lists, ...])

Filter data to get a subset of trials.

reset_list(df)

Reset list index in a DataFrame.

split_lists(frame, phase[, keys, names, ...])

Convert free recall data from one phase to split format.

pool_index(trial_items, pool_items_list)

Get the index of each item in the full pool.

block_index(list_labels)

Get index of each block in a list.

Recall probability

spc(df)

Serial position curve.

pnr(df[, item_query, test_key, test])

Probability of recall by serial position and output position.

Intrusions

pli_list_lag(df, max_lag)

List lag of prior-list intrusions.

Transition probability

lag_crp(df[, lag_key, count_unique, ...])

Lag-CRP for multiple subjects.

category_crp(df, category_key[, item_query, ...])

Conditional response probability of within-category transitions.

distance_crp(df, index_key, distances, edges)

Conditional response probability by distance bin.

Transition rank

lag_rank(df[, item_query, test_key, test])

Calculate rank of the absolute lags in free recall lists.

distance_rank(df, index_key, distances[, ...])

Calculate rank of transition distances in free recall lists.

distance_rank_shifted(df, index_key, ...[, ...])

Rank of transition distances relative to earlier items.

Clustering

category_clustering(df, category_key)

Category clustering of recall sequences.

Plotting

plot_raster(df[, hue, palette, marker, ...])

Plot recalls in a raster plot.

plot_spc(recall, **facet_kws)

Plot a serial position curve.

plot_lag_crp(recall[, max_lag, lag_key, split])

Plot conditional response probability by lag.

plot_distance_crp(crp[, min_samples])

Plot response probability by distance bin.

plot_swarm_error(data[, x, y, swarm_color, ...])

Plot points as a swarm plus mean with error bars.

Measures

Transition measure base class

TransitionMeasure(items_key, label_key[, ...])

Measure of free recall dataset with multiple subjects.

TransitionMeasure.split_lists(data, phase[, ...])

Get relevant fields and split by list.

TransitionMeasure.analyze(data)

Analyze a free recall dataset with multiple subjects.

TransitionMeasure.analyze_subject(subject, ...)

Analyze a single subject.

Transition measures

TransitionOutputs(list_length[, item_query, ...])

Measure recall probability by input and output position.

TransitionLag(list_length[, lag_key, ...])

Measure conditional response probability by lag.

TransitionLagRank([item_query, test_key, test])

Measure lag rank of transitions.

TransitionCategory(category_key[, ...])

Measure conditional response probability by category transition.

TransitionDistance(index_key, distances, edges)

Measure conditional response probability by distance.

TransitionDistanceRank(index_key, distances)

Measure transition rank by distance.

Transitions

Counting transitions

count_lags(list_length, pool_items, recall_items)

Count actual and possible serial position lags.

count_category(pool_items, recall_items, ...)

Count within-category transitions.

count_distance(distances, edges, pool_items, ...)

Count transitions within distance bins.

Ranking transitions

percentile_rank(actual, possible)

Get percentile rank of a score compared to possible scores.

rank_lags(pool_items, recall_items[, ...])

Calculate rank of absolute lag for free recall lists.

rank_distance(distances, pool_items, ...[, ...])

Calculate percentile rank of transition distances.

Iterating over transitions

transitions_masker(pool_items, recall_items, ...)

Iterate over transitions with masking.

Outputs

Counting recalls by serial position and output position

count_outputs(list_length, pool_items, ...)

Count actual and possible recalls for each output position.

Iterating over output positions

outputs_masker(pool_items, recall_items, ...)

Iterate over valid outputs.

Development

Transitions

Psifr has a core set of tools for analyzing transitions in free recall data. These tools focus on measuring what transitions actually occurred, and which transitions were possible given the order in which participants recalled items.

Actual and possible transitions

Calculating a conditional response probability involves two parts: the frequency at which a given event actually occurred in the data and the frequency at which a given event could have occurred. The frequency of possible events is calculated conditional on the recalls that have been made leading up to each transition. For example, a transition between item \(i\) and item \(j\) is not considered “possible” in a CRP analysis if item \(i\) was never recalled. The transition is also not considered “possible” if, when item \(i\) is recalled, item \(j\) has already been recalled previously.

Repeated recall events are typically excluded from the counts of both actual and possible transition events. That is, the transition event frequencies are conditional on the transition not being either to or from a repeated item.

Calculating a CRP measure involves tallying how many transitions of a given type were made during a free recall test. For example, one common measure is the serial position lag between items. For a list of length \(N\), possible lags are in the range \([-N+1, N-1]\). Because repeats are excluded, a lag of zero is never possible. The count of actual and possible transitions for each lag is calculated first, and then the CRP for each lag is calculated as the actual count divided by the possible count.
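The tallying logic can be sketched in pure Python. This is a simplified illustration, not Psifr's implementation: it assumes all recalls are correct and unique (no intrusions or repeats), which Psifr handles via masking.

```python
from collections import Counter

def lag_crp(recalls, list_length):
    """Lag-CRP for one list; recalls are serial positions (1-based).

    Simplified sketch: assumes every recall is a correct, non-repeated
    item from the list.
    """
    actual = Counter()
    possible = Counter()
    recalled = set()
    for prev, curr in zip(recalls, recalls[1:]):
        recalled.add(prev)
        # Lags that could have occurred from prev, given which items
        # were still available to recall. A lag of zero is never
        # possible because prev itself has been recalled.
        for item in range(1, list_length + 1):
            if item not in recalled:
                possible[item - prev] += 1
        # The lag that actually occurred.
        actual[curr - prev] += 1
    return {lag: actual[lag] / n for lag, n in possible.items() if n}

crp = lag_crp([1, 2, 4], list_length=4)
print(crp)  # {1: 0.5, 2: 0.5, 3: 0.0}
```

For the sequence 1, 2, 4 in a 4-item list, the transition 1 → 2 could have gone to lags 1, 2, or 3, and the transition 2 → 4 to lags 1 or 2, so lag +1 and lag +2 each have CRP 1/2 while lag +3 has CRP 0.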

The transitions masker

The psifr.transitions.transitions_masker() function is a generator that makes it simple to iterate over transitions while “masking” out events such as intrusions of items not on the list and repeats of items that have already been recalled.

On each step of the iterator, the output position and the previous, current, and possible items are yielded. The previous item is the item being transitioned from. The current item is the item being transitioned to. The possible items are an array of all items that were valid to be recalled next, given the recall sequence up to that point (not including the current item).

In [1]: from psifr.transitions import transitions_masker

In [2]: pool = [1, 2, 3, 4, 5, 6]

In [3]: recs = [6, 2, 3, 6, 1, 4]

In [4]: masker = transitions_masker(pool_items=pool, recall_items=recs,
   ...:                             pool_output=pool, recall_output=recs)
   ...: 

In [5]: for op, prev, curr, poss in masker:
   ...:     print(op, prev, curr, poss)
   ...: 
1 6 2 [1 2 3 4 5]
2 2 3 [1 3 4 5]
5 1 4 [4 5]

Only valid transitions are yielded, so the code for a specific analysis only needs to calculate the transition measure of interest and count the number of actual and possible transitions in each bin of interest.

Four inputs are required:

pool_items

List of identifiers for all items available for recall. Identifiers can be anything that is unique to each item in the list (e.g., serial position, a string representation of the item, an index in the stimulus pool).

recall_items

List of identifiers for the sequence of recalls, in order. Valid recalls must match an item in pool_items. Other items are considered intrusions.

pool_output

Output codes for each item in the pool. This should be whatever you need to calculate your transition measure.

recall_output

Output codes for each recall in the sequence of recalls.

By using different values for these four inputs and defining different transition measures, a wide range of analyses can be implemented.
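The core masking logic can be sketched in a few lines of pure Python. This is a simplified stand-in for psifr.transitions.transitions_masker(), omitting the separate pool_output/recall_output codes; it reproduces the behavior of the example above for that case:

```python
def simple_masker(pool_items, recall_items):
    """Simplified sketch of transition masking.

    Skips any transition that starts or ends at an intrusion or a
    repeated recall; yields (output position, previous, current,
    possible items) for each valid transition.
    """
    remaining = list(pool_items)
    for op in range(len(recall_items) - 1):
        prev, curr = recall_items[op], recall_items[op + 1]
        # prev is a valid new recall only if it is still available.
        valid_prev = prev in remaining
        if valid_prev:
            remaining.remove(prev)
        # Yield only if curr is also still available (i.e., not an
        # intrusion or a repeat).
        if valid_prev and curr in remaining:
            yield op + 1, prev, curr, list(remaining)

transitions = list(simple_masker([1, 2, 3, 4, 5, 6], [6, 2, 3, 6, 1, 4]))
```

Run on the pool and recall sequence from the example above, this yields the same three transitions: the repeat of item 6 at output position 4 masks out both the transition into it and the transition out of it.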
