Monday, June 15, 2015

A fool's errand in four phases

After the survey, I am leaning toward an initial follow-up study that involves manual annotation (similar to what is requested for "Invention No. 1" in the survey, but a bit more complex) of one or more selections from Bach's Well-Tempered Clavier (WTC).

This is because I have discovered that high-quality machine-readable (MuseScore) editions of the WTC and the Goldberg Variations are available. Also, the pianists can do this work anywhere and therefore need not be local.

Oh, and building the input interface for this is a much lower-risk proposition than building the sensor/computer-vision system, and the manual system is seen as a necessary component of the "semi-automated" system we also described in the Provost's Award proposal.

So the plan is 1) survey, 2) MDC4 (the manual data collector, supporting four voices), 3) SADC (the semi-automated data collector), 4) FADC (the fully automated data collector).

The fully automated system is still planned, but actual work on it probably won't start in earnest until school ramps up in the fall. The next development milestone will be a robust manual data collector that copes with up to four voices. (The current system only deals with two voices and has at least one problem with pieces that contain multiple lines. It is only just barely able to support the music in the survey.)

The survey has been updated to randomize the order of presenting exercises B, C, D, F, and G. Then, based on whether the subject's age is odd or even, it presents either Czerny's fingered context or a completely unfingered context for exercise A, and it does the same for E.
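
The odd/even logic is simple enough to sketch in Python. The exercise labels and the age-parity rule are as described above, but the function name, the 'czerny'/'unfingered' tags, the fixed positions of A and E, and the assumption that the other exercises always show Czerny's fingerings are all illustrative placeholders:

```python
import random

def exercise_plan(age):
    """Sketch of the survey's ordering and context logic.

    B, C, D, F, and G are presented in random order. A and E keep
    fixed positions, and each is shown with Czerny's fingered context
    (odd age) or a completely unfingered context (even age).
    """
    middle = ['B', 'C', 'D', 'F', 'G']
    random.shuffle(middle)
    context = 'czerny' if age % 2 == 1 else 'unfingered'
    # Assuming the shuffled exercises always carry Czerny's fingerings.
    return ([('A', context)] +
            [(ex, 'czerny') for ex in middle] +
            [('E', context)])
```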

I chose A and E for the experiment because they had the highest variability reported by Parncutt et al. However, I probably should have normalized by the length of the fingered fragment. A and E are the longest fragments at 8 notes each.
  • A: 10 fingerings / 8 notes = 1.25 fingering/note 
  • B: 5 fingerings / 4 notes = 1.25 
  • C: 9 / 5 = 1.8 
  • D: 8 / 7 = 1.143 
  • E: 18 / 8 = 2.25 
  • F: 5 / 6 = 0.833 
  • G: 9 / 7 = 1.286 
But this ratio doesn't really measure the consensus I see: A and E clearly have the widest disagreement, yet A's ratio is no higher than B's, and C's is higher than A's. I need to get my head around this.

Agreement, expressed as the mode count (subjects choosing the most popular fingering) over the total fingering count:
  • A: 8/28 = 0.2857 
  • B: 17/28  = 0.6071
  • C: 10/28 = 0.3571
  • D: 15/28 = 0.5357
  • E: 4/28 = 0.1429 
  • F: 23/28 = 0.8214
  • G: 14/28  = 0.5000
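
For the record, the computation behind those numbers is trivial. A minimal sketch, assuming each subject's fingering arrives as a string of finger numbers:

```python
from collections import Counter

def modal_agreement(fingerings):
    """Fraction of subjects whose fingering matches the most popular one.

    `fingerings` is one fingering string per subject, e.g. '12312345'.
    """
    mode, mode_count = Counter(fingerings).most_common(1)[0]
    return mode_count / len(fingerings)
```
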
This seems like one avenue to a Kappa score. Or maybe a weighted Kappa score, as the fingers do indeed have a natural order. Or should we think of every note as being "categorized" with a particular (weighted) finger? This probably amounts to the same thing.
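
For a pair of raters, at least, this exists off the shelf: scikit-learn's cohen_kappa_score takes a weights argument that treats the finger labels as ordered, so disagreeing by one finger costs less than disagreeing by four. Averaging it over all rater pairs is one rough stand-in for a weighted multi-rater kappa; the per-note sequences below are invented:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# One row per subject: the finger (1-5) assigned to each note of the
# fragment. Invented data for illustration.
raters = [
    [1, 2, 3, 1, 2, 3, 4, 5],
    [1, 2, 3, 1, 2, 3, 4, 5],
    [2, 3, 4, 1, 2, 3, 4, 5],
]

# 'linear' weights scale each disagreement by how far apart the fingers are.
kappas = [cohen_kappa_score(a, b, weights='linear')
          for a, b in combinations(raters, 2)]
print(sum(kappas) / len(kappas))
```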

I think the average edit distance from the mode (most popular fingering) gets us a pretty good measure of overall similarity for a bunch of fingerings. The greater the consensus on a single fingering, the more edit distances of zero we have. We can normalize the edit distance by dividing by the note count. This should give us a number between 0 (perfect agreement) and 1 (no agreement).

But what about the case, like piece A, where there are two popular--and quite dissimilar--fingerings? Is this showing agreement or disagreement? Using our normalized edit distance approach, this is going to register as disagreement. Is this fair? Should we be calculating all of the distances between all of the fingerings instead? Yes, I think so.
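
Here is a sketch of both measures, assuming the standard dynamic-programming Levenshtein distance; the helper names are mine:

```python
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def mean_distance_from_mode(fingerings, note_count):
    """Average edit distance from the most popular fingering, per note."""
    mode = Counter(fingerings).most_common(1)[0][0]
    total = sum(levenshtein(f, mode) for f in fingerings)
    return total / (len(fingerings) * note_count)

def mean_pairwise_distance(fingerings, note_count):
    """Average edit distance over all pairs of fingerings, per note."""
    pairs = list(combinations(fingerings, 2))
    total = sum(levenshtein(a, b) for a, b in pairs)
    return total / (len(pairs) * note_count)
```

On a piece like A, with two popular but dissimilar fingerings, mean_distance_from_mode punishes the entire second camp, while mean_pairwise_distance gives the agreement within each camp its due.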

This probably just boils down to Fleiss's Kappa. But can this be weighted? It seems as though this should be possible.
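
The unweighted version, at least, is off the shelf: treat each note position as an item rated by all 28 subjects, with fingers 1-5 as the categories. A sketch with statsmodels follows; the ratings matrix is invented, and as far as I know statsmodels's fleiss_kappa has no weighted variant, so the natural ordering of the fingers is ignored here:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows are items (note positions), columns are raters (subjects);
# entries are finger numbers 1-5. Invented data for illustration.
ratings = np.array([
    [1, 1, 2, 1],   # note 1, as fingered by four subjects
    [2, 2, 3, 2],   # note 2
    [3, 3, 4, 3],   # note 3
])

# aggregate_raters converts this into an items-by-categories count table.
table, _categories = aggregate_raters(ratings)
print(fleiss_kappa(table, method='fleiss'))
```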

I have also added a mechanism in the survey to measure how much time the user spends on each fingering exercise, though it is still unclear what happens to this data when the user backtracks and visits the exercise more than once. We apparently need to keep the BACK button active to cope with the inconsistent ways that different browsers handle our error message for incomplete fingerings. (Safari takes you to the next screen. Chrome leaves you on the current screen, sometimes after printing a confusing message about the BACK button.)

1 comment:

  1. Barbara advises that I think a bit more about the randomization. Here goes...

    There may be a natural order for the exercises (A-G)--assuming Czerny was presenting them in an order that makes pedagogical sense. The motivation for randomizing is to disperse any influence that fingering one piece would exert on the fingering of another. But why is that important to do? Why not just let this influence exert itself? If such an effect is significant, is it not better to have it applied in a uniform way? And if it is subtle, wouldn't testing it repeatedly help us detect it? Also, we are not randomizing the order in which A and E are presented. I think showing them in a fixed order might be the way to go, and we can just stick A and E (with their coin toss on recommended fingerings) back into the original order.
