Wednesday, December 16, 2015

Data dredge

As I am about to embark on my quest to find correlations in the data, I am chastened by fears of data dredging. So I propose we do randomized out-of-sample testing. Toward this end, we will split the data into two subsets of approximately equal size.

The following query gives us the identifiers for subjects who completed both parts of the survey and for whom at least some fingering data were recorded:
select s.response_id
from well_known_subject s
inner join
  (select distinct subject
   from finger
   where fingers != '') f
  on s.response_id = f.subject
We save this query as the "complete_response_id" view. There are 191 such response_ids.
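For reference, here is a minimal sketch of that view definition; the post does not show the DDL, but the syntax is standard:
create view complete_response_id as
select s.response_id
from well_known_subject s
inner join
  (select distinct subject
   from finger
   where fingers != '') f
  on s.response_id = f.subject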
So we load the "exploratory_response_id" table like so:
insert into exploratory_response_id
select response_id
from complete_response_id
order by random()
limit 96
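Before the insert can run, the target table must exist. A minimal sketch, assuming response_id is a text identifier (the post does not show the column type):
create table exploratory_response_id (
  response_id text primary key
)
Materializing the exploratory subset as a table, rather than a view, is what freezes the split: a view over "order by random()" would draw a different sample on every query.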
The 95 response_ids not included in this table make up the "validation_response_id" view:
select c.response_id
from complete_response_id c
where not exists
  (select e.response_id
   from exploratory_response_id e
   where e.response_id = c.response_id)
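A quick sanity check, assuming the objects above are in place, confirms that the two subsets partition the 191 complete responses with no overlap:
select
  (select count(*) from exploratory_response_id) as n_exploratory, -- expect 96
  (select count(*) from validation_response_id) as n_validation, -- expect 95
  (select count(*)
   from exploratory_response_id e
   inner join validation_response_id v
     on e.response_id = v.response_id) as n_overlap -- expect 0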
The actual (scrubbed) profile data will remain in the "subject" table. We will create views ("exploratory_subject" and "validation_subject") to provide access to the appropriate rows. Both will build on the subject_latexable view of the subject data, which exposes camel-case column names, so there is no need to remap the column names in R. A sketch follows.
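Here is one way exploratory_subject might be defined; the quoted camel-case join column ("responseId") is an assumption, since the post does not list subject_latexable's columns, and validation_subject would be analogous:
create view exploratory_subject as
select l.* -- camel-case columns from subject_latexable pass through unchanged
from subject_latexable l
inner join exploratory_response_id e
  on l."responseId" = e.response_id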
