Monday, February 18, 2019

Some staffs are more equal than others

After talking the matter over with Barbara, I think the problem we are having with phrase agreement is actually quite interesting. It boils down to differences in how Justin and Anne are marking the lower staff. Right-hand agreement is still quite solid.

Here are the summary numbers for the first batch of revisions Justin gave me compared to Anne's (over sections 2.1, 3.1, 4.1, 5.1, and 6.1). "Staff 1" is the upper staff. "Staff 2" is the lower. We are looking at the "kappa" score. A kappa score of 1.0 reflects perfect agreement. Anything over 0.8 is fantastic. No one is going to argue with agreement like that. But many people (like me and Barbara) will be suspicious of anything below 0.67.

Staff 1 kappa: 0.879 (agreement: 1255/1262; One: 31, Other: 28)
Staff 2 kappa: 0.410 (agreement: 914/945; One: 33, Other: 21)
Overall kappa: 0.655 (agreement: 2169/2207; One: 64, Other: 49)

Anne is "One," and Justin is "Other." These are the complete phrase counts annotated by each of you. Note that Justin has 12 fewer phrases marked on the lower staff than Anne, but only 3 fewer on the upper staff.
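For anyone who wants to check the arithmetic, here is a minimal sketch that reproduces the figures above, assuming each note on a staff is labeled 1 if an annotator ends a phrase there and 0 otherwise, and treating "kappa" as Cohen's kappa computed from the summary counts. The function is illustrative, not part of our tooling.

def cohen_kappa(n_agree, n_total, marks_a, marks_b):
    """Cohen's kappa from agreement counts and per-annotator phrase-mark totals."""
    p_o = n_agree / n_total                          # observed agreement
    p_a, p_b = marks_a / n_total, marks_b / n_total  # each annotator's marking rate
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)          # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

print(round(cohen_kappa(1255, 1262, 31, 28), 3))  # Staff 1: 0.879
print(round(cohen_kappa(914, 945, 33, 21), 3))    # Staff 2: 0.41
print(round(cohen_kappa(2169, 2207, 64, 49), 3))  # Overall: 0.655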

It strikes me that all of this may be trying to tell us that phrasing in these left-hand (accompanying) voices is inherently more ambiguous than in the conventionally "melodic" voices--or that two distinct principles are being applied here. It seems quite plausible that a bass line may have its own agenda, over which an independent melody can form its own thoughts. Justin seems to see the situation more like that, while Anne seems to have a stronger sense that accompanying lines reflect the phrasing of the lines they accompany.

I am not saying either of you is right or wrong, but we need to see if we can find a way to get you to agree. I am also curious to see if there is any discussion in the (music theory) literature of the interplay between melodic and accompanying phrases.

Also, we did not see this disagreement in Sonatina 1. Why would that be?

I attach a zip file with the phrasing each of you originally submitted (except for 1.1 and 1.2, which have Anne's corrections). Please follow the "Comparing Annotations" procedure at the bottom of this blog post to review the data in abcDE.

I have removed the sub-phrase and motive marks, so we don't have to struggle to differentiate those little vertical lines. I would be interested in hearing Anne and Justin offer a rationale for their own (and maybe each other's) segmentation, with reference to specific examples.

We can of course sidestep this issue for the time being by ignoring the lower staff, but I am not quite ready to do that yet.

Monday, January 21, 2019

Clementi corpus building

The goal is to mark phrases in the Clementi Op. 36 sonatinas, as in this PDF file.

Please use this web interface (abcDE) to enter the phrase marks. Click the cog button on the right and fill in the fields. You are the authority and the transcriber. After you set yourself up, you will need to reload the page in your browser.

I have divided the sonatinas into 27 sections. (Anything between double bar lines got its own section.)

Here are all the section files to be annotated:
11.abc 21.abc 31.abc 35.abc 44.abc 53.abc 62.abc
12.abc 22.abc 32.abc 41.abc 45.abc 54.abc 63.abc
13.abc 23.abc 33.abc 42.abc 51.abc 55.abc 64.abc
14.abc 24.abc 34.abc 43.abc 52.abc 61.abc
(The first number is the sonatina number. The second is the section number. So 24.abc has the fourth section of the second sonatina.)
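If you prefer code to prose, here is a throwaway illustration of the naming scheme (just an example, not part of the project):

def decode_section(filename):
    """Decode a section file name, e.g. '24.abc' -> (sonatina 2, section 4)."""
    stem = filename.split(".")[0]
    return int(stem[0]), int(stem[1])

print(decode_section("24.abc"))  # (2, 4)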

To open a section file, click the globe button. It should show you a URL like this:
https://dvdrndlph.github.io/didactyl/corpora/clementi/11.abc
To select a different file, change the number. Click OK.

To mark phrases, select the last note included in the phrase, and type a period ("."). This will display a vertical bar after the note to mark the phrase.

The browser should remember your work, but to be safe (and to share your work with me), please click the eyeball button, copy all of the displayed text, and save it in a text file. Or just paste the text into an email and send it to me. That might be easiest.

You should also mark sub-phrases (";") and motives (",").

I am primarily interested in the complete phrase markings right now, so you can ignore the other marks. But it wouldn't hurt to add the more granular markings while you are in the neighborhood. Of course, we are actually trying to identify how a piece is chunked with respect to fingering, not musicality. But one study suggests these are the same thing.

Regarding annotation guidelines, this is what we have:
The primary task is to demarcate phrases in the score. Mark the notes that end complete musical thoughts, typically supported by the presence of a cadence.

Each voice in a piano score may have its own independent phrasing. However, when a lower voice is accompanying an upper voice, the lower will typically end a phrase around the same time as the upper, coordinating to create the sense of cadence. Pay special attention when deviating from this general rule.

You should also mark sub-phrases and motives. Motives are the smallest definable ideas in music. Sub-phrases are more developed but remain incomplete, perhaps like clauses in a sentence.

Note that the phrase mark subsumes the sub-phrase and motive marks, and the sub-phrase mark subsumes the motive mark. That is, a sub-phrase boundary implicitly terminates a motive, and a phrase boundary implicitly terminates a sub-phrase and a motive. So at most one mark should follow any given note.

Differentiating sub-phrases and phrases can be difficult. When you have serious doubts about this distinction, prefer short phrases over long ones.

Always annotate the last phrases in the piece on both staffs. If a piece ends in the middle of a phrase on either staff, we need to change how the larger work has been divided into sections. [Let Dave know.]
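To make the mark hierarchy concrete, here is a tiny illustrative table (mine, not anything abcDE uses internally) of the boundary types each mark implies:

# One mark per note; a "stronger" mark implies all of the weaker boundaries.
BOUNDARIES_IMPLIED = {
    ".": {"phrase", "sub-phrase", "motive"},  # phrase mark
    ";": {"sub-phrase", "motive"},            # sub-phrase mark
    ",": {"motive"},                          # motive mark
}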

Comparing Annotations

I may email you a zip file, so you can compare each other's annotations. Please follow these steps:
  1. Unzip the contents.
  2. Open abcDE.
  3. Click the cog button.
  4. Set Restore Data to Never.
  5. Click Close.
  6. Refresh your browser window.
  7. Open the individual files (through the "Choose File" button).
To see Anne's annotations, select 1 from the Sequence spinner. To see Justin's annotations, select 2.

When you are done, do the following; otherwise, abcDE will no longer restore fingerings that you have entered previously in the browser:
  1. Click the cog button.
  2. Set Restore Data to Always.
  3. Click Close.
  4. Refresh your browser window.
Note that relying on the browser to remember your work is fraught with peril. Whenever you have work you want to save for later, you need to do this:

  1. Make sure the Annotated checkbox is not checked.
  2. Click the eyeball button.
  3. Select all of the text.
  4. Copy it to a text file using TextEdit (Mac) or Notepad (Windows).
  5. Save that file.
This is awkward, but I don't have an easier way to do this at the moment.

Thursday, January 10, 2019

Clustering of editorial advice

A possible enhancement

Again, say we have the following fingerings in our gold standard:
  1. 1234 131 (1 annotator vote) .1
  2. 1xx4 1x1 (3 votes) .3
  3. 12xx x3x (2 votes) .2
  4. 1x1x 1xx (4 votes) .4
How much credit should a model get for suggesting 1234 131? Or 1214 131? That is, how likely is it that the user will accept the advice?

Do we just sum over all the matches? So 1234 131 would get .1 + .3 + .2 = .6? And 1214 131 would get .3 + .2 + .4 = .9? But don't we have more evidence that 1234 131 is a good fingering?
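To make the "sum over all the matches" idea concrete, here is a rough sketch, treating x as a wildcard and using the normalized vote weights listed above (the names are illustrative, not from the Didactyl code):

GOLD = [("1234 131", 0.1), ("1xx4 1x1", 0.3), ("12xx x3x", 0.2), ("1x1x 1xx", 0.4)]

def matches(suggestion, pattern):
    """True if the suggestion agrees with every non-wildcard character."""
    return all(p in ("x", s) for s, p in zip(suggestion, pattern))

def simple_credit(suggestion):
    """Sum the weights of all gold sequences the suggestion is consistent with."""
    return sum(w for pattern, w in GOLD if matches(suggestion, pattern))

print(round(simple_credit("1234 131"), 3))  # 0.1 + 0.3 + 0.2 = 0.6
print(round(simple_credit("1214 131"), 3))  # 0.3 + 0.2 + 0.4 = 0.9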

We proposed amplifying each fingering sequence based on how many actual annotations it contains (a rough code sketch follows the list below):
  1. 1234 131 (7 notes x 1 annotator = 7 votes) .189
  2. 1xx4 1x1 (4 x 3 = 12 votes) .324
  3. 12xx x3x (3 x 2 = 6 votes) .162
  4. 1x1x 1xx (3 x 4 = 12 votes) .324
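Here is the amplification itself as a sketch (again illustrative; the vote counts come from the list above):

GOLD_VOTES = {"1234 131": 1, "1xx4 1x1": 3, "12xx x3x": 2, "1x1x 1xx": 4}

def specified_digits(pattern):
    """Number of notes the sequence actually fingers (non-wildcard digits)."""
    return sum(ch.isdigit() for ch in pattern)

amplified = {p: specified_digits(p) * v for p, v in GOLD_VOTES.items()}
total = sum(amplified.values())  # 7 + 12 + 6 + 12 = 37
weights = {p: round(votes / total, 3) for p, votes in amplified.items()}
print(weights)  # {'1234 131': 0.189, '1xx4 1x1': 0.324, '12xx x3x': 0.162, '1x1x 1xx': 0.324}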
Or should we reduce the contribution of a sequence if it is shared with multiple sequence sets ("clusters")? Otherwise, we are over-representing the signal from less discriminating voices and in general over-stating the likelihood that a user will be satisfied.

So 1234 131 would get .189 + .324/2 + .162/2 = .189 + .162 + .081 = .432. And 1214 131 would have .162 + .081 + .324 = .567.

My confidence in someone who seems to agree with me is diminished every time I see this person agree with someone I disagree with. So maybe we divide the amplified weight by 1 plus the number of wrongsters we detect agreeing with an editor we see agreeing with the system suggestion.

So 1234 131 would get .189 + .324/2 + .162/2 = .189 + .162 + .081 = .432. And 1214 131 would have .324/2 + .162/2 + .324 = .567. (In this example, that happens to come out the same as the cluster-sharing division above.)

Or, actually, we should probably consider the amplified weight of the offending fingerings, not just how many of them there are.

Conversely(?), what if I agree with someone and then see this person disagree with someone I also agree with? Is agreement transitive? Does A=B and B=C imply A=C? Here the situation cannot arise (unless I need more coffee): the system suggestion never includes wildcards, and the only way to disagree is by failing to match a non-wildcard element. So if I (the system) agree with two people, those two people must agree with each other.


Incomplete editorial fingerings

Say we have the following fingerings in our gold standard:
  1. 1234 131 (1 annotator vote) .1
  2. 1xx4 1x1 (3 votes) .3
  3. 12xx x3x (2 votes) .2
  4. 1x1x 1xx (4 votes) .4
How much credit should a model get for suggesting 1234 131? Or 1214 131? That is, how likely is it that the user will accept the advice? How confident are we that the advice is likely to be good?

Do we just sum over all the matches? So 1234 131 would get .1 + .3 + .2 = .6? And 1214 131 would get the same: .2 + .4 = .6? But don't we have more evidence that 1234 131 is a good fingering?

We could amplify each fingering sequence based on how many actual annotations it has:
  1. 1234 131 (7 notes x 1 annotator = 7 votes) .189
  2. 1xx4 1x1 (4 x 3 = 12 votes) .324
  3. 12xx x3x (3 x 2 = 6 votes) .162
  4. 1x1x 1xx (3 x 4 = 12 votes) .324
Or I am thinking the more specific sequences should be amplified by less specific sequences that do not contradict them:
  1. 1234 131 (1 + 3 + 2 = 6 votes) .353
  2. 1xx4 1x1 (3 + 2 = 5 votes) .294
  3. 12xx x3x (2 votes) .118
  4. 1x1x 1xx (4 votes) .235
Or a combination, like this:
  1. 1234 131 (7 + 12 + 6 = 25 votes) .4
  2. 1xx4 1x1 (12 + 6 = 18 votes) .3
  3. 12xx x3x (6 votes) .1
  4. 1x1x 1xx (12 votes) .2
Shouldn't #3 and #4 reinforce each other somehow?

Also, shouldn't the amplification run both ways? If a complete annotation agrees with a partial one, what does that tell us? An editor, who is pitching advice to a general audience in what we assume is a minimally idiosyncratic way, is telling us that these milestone fingerings are the most important.

Given this, I am leaning back toward the simple summing approach I mentioned first, but amplifying by digit. So 1234 131 would get .189 + .324 + .162 = .675, and 1214 131 would get .162 + .324 = .486. This at least passes the smell test.

It strikes me that the sparseness of editorial scores may actually be a blessing. All of my travails with edit distances seem moot for editorial scores, or at least less pertinent. The editor implicitly tells us which specific annotations are most important and which are free to vary. Using Hamming distance here is less controversial: a digit either matches or it doesn't. This is clearly justified if we assume the editor is being explicit in the advice that actually appears.
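A rough sketch of that comparison (illustrative, not the actual evaluation code), assuming we score only the positions the editor actually annotated and let the blanks vary freely:

def editorial_mismatches(suggestion, editorial):
    """Count disagreements on the editor's annotated (non-wildcard) positions."""
    return sum(1 for s, e in zip(suggestion, editorial)
               if e not in ("x", " ") and s != e)

print(editorial_mismatches("1214 131", "1xx4 1x1"))  # 0: agrees on every annotated digit
print(editorial_mismatches("1234 131", "1x1x 1xx"))  # 1: disagrees at the third note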

Aren't we really striving to model the behavior of a good editor more than that of a good pianist?

Tuesday, January 8, 2019

The semantics of editorial fingering

Dear Anne and Justin:

Fingering annotations are typically sparse in editorial scores. This is the case even in pedagogical works intended for beginners. Editors understandably do not want to clutter their scores with unnecessary information. However, this state of affairs makes it difficult to leverage editorial scores as sources of fingering data suitable for training and evaluating computational models.

This raises a number of questions for me.

What is the intent of the editor? Is it to provide complete guidance in a compact format, as appears to be the case in beginner scores? (I remember being puzzled and a little irritated by the missing annotations. Why did I have to interpolate when I don't know what I am doing?!) Or is it to convey major transitions only (hand repositionings) and leave other "minor" decisions to the performer? Or is it to provide advice only in areas of special difficulty and to leave the rest to the performer's discretion? Or is it a combination of these intents, with the emphasis varying over the length of a piece?

How much do you think two pianists would agree when transforming a typical sparsely annotated score into a completely annotated score? Would this vary by editorial intent? (We have data we could use to tease out some answers here, I think. In the WTC corpus, we should have complete human data overlapping sparse editorial data that agree on the notes marked by the editor.)

Are there rules you apply to "fill in the blanks" in editorial scores? Are such rules discussed or codified in the pedagogical literature? Would doing so constitute a potential contribution to the pedagogical literature? Is this something you would like to pursue on its own merits?

Some time ago, I tried to brainstorm on this topic. ("MDC"--Manual Data Collector--is what I originally called the abcDE editor.)

Filling in the blanks would definitely be in order to augment our training data for machine learning. But the existence of blanks is also cramping my style in validating my latest novel evaluation metric. (It involves clustering fingering advice according to how "close" the individual fingering suggestions are to each other. This idea of closeness, already somewhat controversial, is even more strained when the suggestions are riddled with blanks.)

Thursday, July 19, 2018

2018-07-19 status

Done

Administrivia

  1. Booked trip to ISMIR 2018 in Paris.

Model Building

  1. Finalized "Corrected Parncutt" implementation for everything but cyclic patterns.
  2. Confirmed inconsistencies in published Parncutt results:
    • Small-Span and Large-Span penalties are conflated.
    • Small-Span penalty definition is inconsistent.
    • Position-Change-Count and Position-Change-Size penalties are incorrect.
    • Penalty totals are incorrect.
    • Explanatory example has confusing/incorrect costs.
  3. Completed full regression test of code base.
  4. Met with Alex Demos and agreed to co-author a paper for Musicae Scientiae on Parncutt, Corrected Parncutt, Improved Parncutt, and how to tell them apart. Will do a dry run of some of this material in a late-breaking paper at ISMIR.

Doing

  1. Implementing support for, and clarifying definition of, cyclic pattern constraint in Parncutt. (Should also do this for Sayegh and Hart.)
  2. Double-checking pruning mechanism in Parncutt.
  3. Writing up findings on "Corrected Parncutt" model, initially for ISMIR submission.
  4. Adding mechanism to learn weights for "Improved Parncutt" rules from training data.

Struggling

  1. How does one compare two ranked lists of sequences to a third and claim one of the two is more similar to the third in a statistically significant way? That is the big question.

In Scope

  1. Implementing crude automatic segmenter.
  2. Developing staccato/legato classifier.
  3. Demonstrating improved Parncutt via #1 and #2.
  4. Debugging Sayegh model, which produces results inconsistent with training data.
  5. Developing better test cases for Sayegh.
  6. Updating abcDE to support manual segmentation.
  7. Completing and polishing abcD for entire Beringer corpus.
  8. Defining initial benchmark corpora and evaluation methodology.
  9. Implementing convenience methods for reporting benchmark results.
  10. Moving Beringer corpus to MySQL database.
  11. Enhancing Parncutt, following published techniques and pushing beyond them.
  12. Enhancing Hart and Sayegh to return top n solutions.
  13. Re-weighting Parncutt rules using machine learning and TensorFlow. (This seems like a good fit.)
  14. Adding support to abcDE for annotating phrase segmentation.
  15. Debugging Dactylize 88-key circuit.
  16. Collecting fingering data from JB performances in Elizabethtown.
  17. Completing Dactylize II circuit.
  18. Developing method to align performance data with symbolic data. I think this is going to be essential if we are to use Dactylize data moving forward and a key part of its proof of concept. I plan to have something for this at the ISMIR demo session (September 22 deadline).
  19. Defining procedure for sanity test of production automatic data collector (including Beringer data).
  20. Defining corpora for Dactylize data collection (WTC, Beringer, ??).
  21. Implementing end-to-end machine learning experiment, using Beringer abcD data.
  22. Submitting papers to TISMIR. Ideas: a follow-up demo paper describing Dactylize data collected; a full-length paper describing application of evaluation method to models developed; a full-length description of enhanced and/or novel models, demo of method to align collected performance data with symbolic score.

Friday, June 22, 2018

2018-06-22 status

Done

Model Building

  1. Re-implemented Parncutt cost functions for both hands.
  2. Identified possible inconsistencies in published Parncutt model description.
  3. Added feature to track more granular cost details in Dactyler models.
  4. Tracked individual rule costs to facilitate analysis of Parncutt results.
  5. Completed successful regression test for hacked-up Didactyl code. The APIs they are a-changin'.

Doing

  1. Testing Parncutt cost functions.
  2. Reproducing original Parncutt results.
  3. Adding mechanism to learn weights for Parncutt rules from training data.

In Scope

  1. Implementing crude automatic segmenter.
  2. Developing staccato/legato classifier.
  3. Demonstrating improved Parncutt via #1 and #2.
  4. Debugging Sayegh model, which produces results inconsistent with training data.
  5. Developing better test cases for Sayegh.
  6. Updating abcDE to support manual segmentation.
  7. Completing and polishing abcD for entire Beringer corpus.
  8. Defining initial benchmark corpora and evaluation methodology.
  9. Implementing convenience methods for reporting benchmark results.
  10. Moving Beringer corpus to MySQL database.
  11. Enhancing Parncutt, following published techniques and pushing beyond them.
  12. Enhancing Hart and Sayegh to return top n solutions.
  13. Re-weighting Parncutt rules using machine learning and TensorFlow. (This seems like a good fit.)
  14. Adding support to abcDE for annotating phrase segmentation.
  15. Debugging Dactylize 88-key circuit.
  16. Collecting fingering data from JB performances in Elizabethtown.
  17. Completing Dactylize II circuit.
  18. Developing method to align performance data with symbolic data. I think this is going to be essential if we are to use Dactylize data moving forward and a key part of its proof of concept. I plan to have something for this at the ISMIR demo session (September 22 deadline).
  19. Defining procedure for sanity test of production automatic data collector (including Beringer data).
  20. Defining corpora for Dactylize data collection (WTC, Beringer, ??).
  21. Implementing end-to-end machine learning experiment, using Beringer abcD data.
  22. Submitting papers to TISMIR. Ideas: a follow-up demo paper describing Dactylize data collected; a full-length paper describing application of evaluation method to models developed; a full-length description of enhanced and/or novel models, demo of method to align collected performance data with symbolic score.