Skip to content

run_all

run_all(cross_validation='simple', subject='UTS02', feature='eng1000', n_train_stories=[1, 3, 5], test_story='wheretheressmoke', n_repeats=5, ndelays=5, interpolation='lanczos', ridge_implementation='ridge_huth', do_shuffle=False, use_cache=True, keep_train_stories_in_mem=True, seed=123, alphas=np.logspace(1, 3, 10), nboots=50, chunklen=40, nchunks=125, singcutoff=1e-10, single_alpha=False, use_corr=True, run_folder_name='')

Runs encoding models n_repeat times and saves results data/runs to disk.

The following outputs are saved: 0. params.json : the parameters for the run data/runs/date-id/params.json 1. scores_mean.npy : the mean scores across folds data/runs/date-id/predictor/subject/n_training_stories/shuffle/scores_mean.npy 2. scores.npy : the score for each separate fold data/runs/date-id/predictor/subject/n_training_stories/shuffle/fold/scores.npy 3. weights.npy : the model weights for each separate fold data/runs/date-id/predictor/subject/n_training_stories/shuffle/fold/weights.npy 4. best_alphas.npy : the best alphas for each voxel in that fold data/runs/date-id/predictor/subject/n_training_stories/shuffle/fold/best_alphas.npy

Parameters:

Name Type Description Default

cross_validation

(loocv, simple)

loocv uses leave-one-out cross-validation for n_stories. The stories are determined by the order of the stories parameter or its default value in config.yaml. simple computes the regression for a train/test split containing n_stories within each repeat. Stories are sampled randomly for each repeat.

"loocv"

subject

str or list of str

Subject identifier. Can be one or a list of : {"all", "UTS01", "UTS02", "UTS03", "UTS04", "UTS05", "UTS06", "UTS07", "UTS08"}

"UTS02"

feature

(all, envelope, eng1000)

Which predictor to run. all will run separate encoding models for both predictors (default). envelope will run the encoding model with the audio envelope as predictor. eng1000 will run the encoding model with the word embeddings of the stories as predictor.

"all"

n_train_stories

int or list of int

Number o of training stories for the encoding model. If a list is given, the encoding model will be fitted with each number separately.

[1, 3, 5]

test_story

str or None

The story to use as the test set. If None, it will be sampled randomly.

"wheretheressmoke"

ndelays

int

By how many TR's features are delayed to model the HRF. For ndelays=5, the features of the predictor are shifted by one TR and concatinated to themselves for five times.

5

interpolation

(lanczos, average)

Whether to use lanczos interpolation or just average the words within a TR. Only applies if predictor=eng1000.

"lanczos"

ridge_implementation

str

Which ridge regression implementation to use. ridgeCV will use scikit-learn's RidgeCV. ridge_huth will use the ridge regression implementation from Lebel et al.

'ridge_huth'

do_shuffle

bool

Whether or not to run model fits with predictors shuffled (as a control). A separate subfolder ('shuffled') will be created in the run folder with these results.

False

use_cache

bool

Whether features are cached and reused. Only applies if predictor=envelope.

True

keep_train_stories_in_mem

bool

Whether stories are kept in memory after first loading. Unless when using all stories turning this off will reduce the memory footprint, but increase the time is spent loading data. Only works if cross_validation='simple'.

True

seed

Optional[int]

Seed determining sampling of stories

123

alphas

ndarray

Array of alpha values to optimize over.

logspace(1, 3, 10)

nboots

int

The number of bootstrap samples to run. 15 to 30 works well. Only active for ridge_huth="ridge_huth".

50

chunklen

int

On each sample, the training data is broken into chunks of this length. This should be a few times longer than your delay/STRF. e.g. for a STRF with 3 delays, I use chunks of length 10. Only active for ridge_huth="ridge_huth".

40

nchunks

int

The number of training chunks held out to test ridge parameters for each bootstrap sample. The product of nchunks and chunklen is the total number of training samples held out for each sample, and this product should be about 20 percent of the total length of the training data.

125

singcutoff

float

The first step in ridge regression is computing the singular value decomposition (SVD) of the stimulus Rstim. If Rstim is not full rank, some singular values will be approximately equal to zero and the corresponding singular vectors will be noise. These singular values/vectors should be removed both for speed (the fewer multiplications the better!) and accuracy. Any singular values less than singcutoff will be removed. Only active for ridge_huth="ridge_huth".

1e-10

single_alpha

boolean

Whether to use a single alpha for all responses. Good foridentification/decoding. Only active for ridge_huth="ridge_huth".

False

use_corr

boolean

If True, this function will use correlation as its metric of model fit. If False, this function will instead use variance explained (R-squared) as its metric of model fit. For ridge regression this can make a big difference -- highly regularized solutions will have very small norms and will thus explain very little variance while still leading to high correlations, as correlation is scale-free while R**2 is not. Only active for ridge_huth="ridge_huth".

True

run_folder_name

str

The name of the folder in the runs directory (as specificed in encoders.utils.load_config()['RUNS_DIR']) to save the results in. If it doesn't exist, it is created on the fly.

''
Source code in src/encoders/run_all.py
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
def run_all(
    cross_validation: str = "simple",
    subject: Union[str, list[str]] = "UTS02",
    feature: Union[str, list[str]] = "eng1000",
    n_train_stories: Union[int, list[int]] = [1, 3, 5],
    test_story: Optional[str] = "wheretheressmoke",
    n_repeats: int = 5,
    ndelays: int = 5,
    interpolation: str = "lanczos",
    ridge_implementation: str = "ridge_huth",
    do_shuffle: bool = False,
    use_cache: bool = True,
    keep_train_stories_in_mem: bool = True,
    seed: Optional[int] = 123,
    alphas: np.ndarray = np.logspace(1, 3, 10),
    nboots: int = 50,
    chunklen: int = 40,
    nchunks: int = 125,
    singcutoff: float = 1e-10,
    single_alpha: bool = False,
    use_corr: bool = True,
    run_folder_name: str = "",
):
    """Runs encoding models n_repeat times and saves results data/runs to disk.

    The following outputs are saved:
    0. params.json : the parameters for the run
        `data/runs/date-id/params.json`
    1. scores_mean.npy : the mean scores across folds
        `data/runs/date-id/predictor/subject/n_training_stories/shuffle/scores_mean.npy`
    2. scores.npy : the score for each separate fold
        `data/runs/date-id/predictor/subject/n_training_stories/shuffle/fold/scores.npy`
    3. weights.npy : the model weights for each separate fold
        `data/runs/date-id/predictor/subject/n_training_stories/shuffle/fold/weights.npy`
    4. best_alphas.npy : the best alphas for each voxel in that fold
        `data/runs/date-id/predictor/subject/n_training_stories/shuffle/fold/best_alphas.npy`


    Parameters
    ----------
    cross_validation : {"loocv", "simple"}, default="simple"
        `loocv` uses leave-one-out cross-validation for n_stories. The stories are
         determined by the order of the `stories` parameter or its default value in
         `config.yaml`.
        `simple` computes the regression for a train/test split containing n_stories
         within each repeat.
        Stories are sampled randomly for each repeat.
    subject : str or list of str, default="UTS02"
        Subject identifier.
        Can be one or a list of : {`"all"`, `"UTS01"`, `"UTS02"`, `"UTS03"`, `"UTS04"`,
         `"UTS05"`, `"UTS06"`, `"UTS07"`, `"UTS08"`}
    feature : {"all", "envelope", "eng1000"}, default="eng1000"
        Which predictor to run.
        `all` will run separate encoding models for both predictors (default).
        `envelope` will run the encoding model with the audio envelope as predictor.
        `eng1000` will run the encoding model with the word embeddings of the stories
         as predictor.
    n_train_stories : int or list of int
        Number o of training stories for the encoding model. If a list is given, the
         encoding model will be fitted with each number separately.
    test_story : str or None, default="wheretheressmoke"
        The story to use as the test set. If `None`, it will be sampled randomly.
    ndelays : int, default=5
        By how many TR's features are delayed to model the HRF. For `ndelays=5`, the
         features of the predictor are shifted by one TR and concatinated to themselves
         for five times.
    interpolation : {"lanczos", "average"}, default="lanczos"
        Whether to use lanczos interpolation or just average the words within a TR.
        Only applies if `predictor=eng1000`.
    ridge_implementation: {"ridgeCV", "ridge_huth"}, default="ridge_huth"
        Which ridge regression implementation to use.
        `ridgeCV` will use scikit-learn's RidgeCV.
        `ridge_huth` will use the ridge regression implementation from Lebel et al.
    do_shuffle: book, default=False
        Whether or not to run model fits with predictors shuffled (as a control).
        A separate subfolder ('shuffled') will be created in the run folder with
        these results.
    use_cache: bool, default=True
        Whether features are cached and reused.
        Only applies if `predictor=envelope`.
    keep_train_stories_in_mem: bool, default=True
        Whether stories are kept in memory after first loading. Unless when using all
        stories turning this off will reduce the memory footprint, but increase the
        time is spent loading data. Only works if `cross_validation='simple'`.
    seed: int | None, default=123
        Seed determining sampling of stories
    alphas : np.ndarray
        Array of alpha values to optimize over.
    nboots : int
        The number of bootstrap samples to run. 15 to 30 works well.
        Only active for ridge_huth="ridge_huth".
    chunklen : int
        On each sample, the training data is broken into chunks of this length. This
        should be a few times longer than your delay/STRF. e.g. for a STRF with 3
        delays, I use chunks of length 10.
        Only active for ridge_huth="ridge_huth".
    nchunks : int
        The number of training chunks held out to test ridge parameters for each
        bootstrap sample. The product of nchunks and chunklen is the total number of
        training samples held out for each sample, and this product should be about 20
        percent of the total length of the training data.
    singcutoff : float
        The first step in ridge regression is computing the singular value decomposition
        (SVD) of the stimulus Rstim. If Rstim is not full rank, some singular values
        will be approximately equal to zero and the corresponding singular vectors will
        be noise. These singular values/vectors should be removed both for speed (the
        fewer multiplications the better!) and accuracy. Any singular values less than
        singcutoff will be removed.
        Only active for ridge_huth="ridge_huth".
    single_alpha : boolean
        Whether to use a single alpha for all responses. Good
        foridentification/decoding.
        Only active for ridge_huth="ridge_huth".
    use_corr : boolean
        If True, this function will use correlation as its metric of model fit. If
        False, this function will instead use variance explained (R-squared) as its
        metric of model fit. For ridge regression this can make a big difference --
        highly regularized solutions will have very small norms and will thus explain
        very little variance while still leading to high correlations, as correlation
        is scale-free while R**2 is not.
        Only active for ridge_huth="ridge_huth".
    run_folder_name: str, optional
        The name of the folder in the runs directory (as specificed in
        `encoders.utils.load_config()['RUNS_DIR']`) to save the results in.
        If it doesn't exist, it is created on the fly.
    """

    # put arguments in right format
    if isinstance(subject, str):
        subjects = [subject]
    else:
        subjects = subject
    if "all" in subjects:
        subjects = [
            "UTS01",
            "UTS02",
            "UTS03",
            "UTS04",
            "UTS05",
            "UTS06",
            "UTS07",
            "UTS08",
        ]

    if isinstance(n_train_stories, int):
        n_train_stories_list = [n_train_stories]
    else:
        n_train_stories_list = n_train_stories

    # pick the right pool of stories, depending on ridge implementation
    if ridge_implementation == "ridge_huth":
        stories = load_config()["STORIES"].copy()
    elif ridge_implementation == "ridgeCV":
        stories = load_config()["STORIES_2"].copy()

    if not isinstance(stories, list):
        raise ValueError(f"Config parameter invalid: STORIES: {stories}")

    if isinstance(feature, str):
        features = [feature]
    else:
        features = feature
    if "all" in features:
        features = ["envelope", "eng1000"]

    # toggle the shuffle swithch
    shuffle_opts = [False]
    if do_shuffle:
        shuffle_opts = [False, True]

    # if run folder name not given, create one
    if not run_folder_name:
        run_folder_name = create_run_folder_name()
    else:
        run_folder_name = run_folder_name

    run_folder = os.path.join(RUNS_DIR, run_folder_name)
    Path(run_folder).mkdir(parents=True, exist_ok=True)

    # log all parameters
    config = {
        "cross_validation": cross_validation,
        "subject_arg": subject,  # command line arguments
        "subjects": subjects,  # resolved predictors
        "feature_arg": feature,
        "features": features,
        "n_train_stories": n_train_stories,
        "stories": stories,
        "test_story": test_story,
        "n_repeats": n_repeats,
        "ndelays": ndelays,
        "interpolation": interpolation,
        "ridge_implementation": ridge_implementation,
        "do_shuffle": shuffle_opts,
        "use_cache": use_cache,
        "keep_train_stories_in_mem": keep_train_stories_in_mem,
        "seed": seed,
        "tr_len": TR_LEN,
        "alphas": list(alphas),
        "nboots": nboots,
        "chunklen": chunklen,
        "nchunks": nchunks,
        "singcutoff": singcutoff,
        "single_alpha": single_alpha,
        "use_corr": use_corr,
        "run_folder_name": run_folder_name,
    }

    log.info(f"Running experiment with the following parameters:\n{json.dumps(config)}")

    # update results file
    params_path = os.path.join(run_folder, "params.json")
    with open(params_path, "w") as f_out:
        json.dump(config, f_out, indent=4)
    log.info(f"Written parameters to {params_path}")

    # aggregate overall max correlations and continously update in json
    results_max_agg = defaultdict(partial(defaultdict, partial(defaultdict, dict)))
    # enables instantiating hierarchy of dicts without manually creating them at each
    # level.

    # get a list of 3-element tuples, with all posible combinations
    combinations = list(product(features, subjects, shuffle_opts))

    results_max_path = os.path.join(run_folder, "results_max.json")

    # run the regression pipeline
    for combination_tuple in combinations:
        current_feature, current_subject, shuffle = combination_tuple

        shuffle_str = "shuffled" if shuffle else "not_shuffled"

        # just for UTS03, nans are not an issue
        if (current_subject == "UTS03") & (ridge_implementation == "ridgeCV"):
            stories = load_config()["STORIES"].copy()
            # config is wrong for UTS03

        for current_n_train_stories in n_train_stories_list:
            output_dir = os.path.join(
                run_folder,
                current_subject,
                current_feature,
                str(current_n_train_stories),
                shuffle_str,
            )
            Path(output_dir).mkdir(parents=True, exist_ok=True)

            if cross_validation == "loocv":
                all_scores, all_weights, best_alphas = crossval_loocv(
                    feature=current_feature,
                    stories=stories,
                    n_train_stories=current_n_train_stories,
                    subject=current_subject,
                    tr_len=TR_LEN,
                    ndelays=ndelays,
                    interpolation=interpolation,
                    ridge_implementation=ridge_implementation,
                    use_cache=use_cache,
                    shuffle=shuffle,
                    alphas=alphas,
                    nboots=nboots,
                    chunklen=chunklen,
                    nchunks=nchunks,
                    singcutoff=singcutoff,
                    single_alpha=single_alpha,
                    use_corr=use_corr,
                )
            elif cross_validation == "simple":
                all_scores, all_weights, best_alphas = crossval_simple(
                    feature=current_feature,
                    stories=stories,
                    n_train_stories=current_n_train_stories,
                    test_story=test_story,
                    subject=current_subject,
                    tr_len=TR_LEN,
                    ndelays=ndelays,
                    interpolation=interpolation,
                    ridge_implementation=ridge_implementation,
                    use_cache=use_cache,
                    shuffle=shuffle,
                    n_repeats=n_repeats,
                    seed=seed,
                    keep_train_stories_in_mem=keep_train_stories_in_mem,
                    alphas=alphas,
                    nboots=nboots,
                    chunklen=chunklen,
                    nchunks=nchunks,
                    singcutoff=singcutoff,
                    single_alpha=single_alpha,
                    use_corr=use_corr,
                )
            else:
                raise ValueError(f"Invalid cross validation method: {cross_validation}")

            # aggregate scores
            mean_scores = np.mean(all_scores, axis=0)
            sem_scores = sem(
                [np.mean(repeat_scores, axis=0) for repeat_scores in all_scores], axis=0
            )

            log.info(
                f"Mean correlation (averages across splits) (r): {mean_scores.mean()}"
            )
            log.info(
                f"Max  correlation (averaged across splits) (r): {mean_scores.max()}"
            )

            np.save(os.path.join(output_dir, "scores_mean.npy"), mean_scores)
            np.save(os.path.join(output_dir, "scores_sem.npy"), sem_scores)

            results_max_agg[current_feature][current_subject][current_n_train_stories][
                shuffle_str
            ] = mean_scores.max()

            # update results file
            with open(results_max_path, "w") as f_out:
                json.dump(results_max_agg, f_out, indent=4)