Diachronic collostructional analysis meets the noun phrase:
Studying many a noun in COHA
Martin Hilpert (firstname.lastname@example.org)
For readers interested in adopting some of the techniques discussed in the chapter, this webpage provides supplementary information and links to other sources.
1. Further resources on collostructional analysis
A resource page for the collostructional methods in general is found here. You can download the coll.analysis script from there, along with some instructions and sample data files.
The raw data for the collostructional analysis in the chapter is in this file.
2. Some more information on the VNC algorithm
The discussion of VNC in the chapter is rather brief; the following offers a little more information. The original paper that presents the method is found here.
The frequency data that informs the analysis in the chapter is found in this file.
A visualization of the VNC process is shown in the following figures.
Figure 1. VNC dendrogram for the many a noun construction
The figure shows several things at once: The descending line is the frequency development of the many a noun construction. The overlaid dendrogram shows the sequence in which periods were merged. The dendrogram can be re-created with the frequency values (in this file) and the R workspace that is available on the companion webpage for the Gries and Hilpert chapter on periodization / VNC in the Rethinking handbook.
It can be seen, for instance, that the 1990s and 2000s are merged first and that the 1810s are merged relatively late in the process. Finally, the graph shows five horizontal grey lines that represent the mean frequencies of the chosen corpus periods. The division into five periods is not arbitrary; it is suggested by the following scree plot (Figure 2):
Figure 2. Scree plot corresponding to the above VNC analysis
The scree plot shows that distinguishing up to five clusters captures increasingly large amounts of the frequency variation; further distinctions improve on that only marginally. This plot can also be re-created with the R workspace from the companion webpage.
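The merging procedure behind a VNC dendrogram and the values behind a scree plot can be sketched in a few lines of Python. This is a minimal illustration, not the R workspace mentioned above: it assumes that the distance between two neighboring clusters is the standard deviation of their pooled frequency values, and the period labels (t1 to t6) and frequencies are invented toy values, not the COHA data. The function names vnc and scree are hypothetical.

```python
from statistics import pstdev


def vnc(labels, values):
    """Agglomerative clustering in which only temporally adjacent clusters may merge.

    Assumption: the distance between neighbors is the (population) standard
    deviation of their pooled values. Returns one (left, right, height)
    tuple per merge, in the order in which the merges happen.
    """
    clusters = [([lab], [val]) for lab, val in zip(labels, values)]
    merges = []
    while len(clusters) > 1:
        # Compare each cluster only with its right-hand neighbor.
        dists = [
            pstdev(clusters[i][1] + clusters[i + 1][1])
            for i in range(len(clusters) - 1)
        ]
        i = min(range(len(dists)), key=dists.__getitem__)
        (llab, lval), (rlab, rval) = clusters[i], clusters[i + 1]
        merges.append((llab, rlab, dists[i]))
        # Replace the two neighbors with one cluster holding all their values.
        clusters[i:i + 2] = [(llab + rlab, lval + rval)]
    return merges


def scree(merges):
    """Merge heights ordered from the last merge to the first."""
    return [height for _, _, height in reversed(merges)]


# Invented toy data (hypothetical periods and frequencies).
periods = ["t1", "t2", "t3", "t4", "t5", "t6"]
freqs = [50.0, 48.0, 30.0, 29.0, 12.0, 9.0]

merges = vnc(periods, freqs)
for left, right, height in merges:
    print("+".join(left), "|", "+".join(right), round(height, 2))
print([round(h, 2) for h in scree(merges)])
```

With these toy values, the two largest heights belong to the last two merges and the remaining heights are much smaller, which is the kind of drop that would suggest a three-group solution in a scree plot. Other distance measures could be substituted for the standard deviation in the marked line.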
3. What results can be expected when no substantial change is going on?
When the collocational shifts of a lexically relatively stable construction are analyzed with VNC, the results will show that there is not much structure (i.e. principled heterogeneity) in the data. Here is a graph of the collocate frequencies of many nouns over time; as is apparent, the construction does not show clear linear trends for most of its collocates.
Figure 3. Absolute collocate frequencies of nouns in the phrase many nouns over time
This kind of data results in a VNC dendrogram that does not allow meaningful periodizations into, say, four or five groups of periods. Consider the dendrogram and the scree plot below, which are based on the collocate frequencies of many nouns across the COHA decades. The data are best split into three groups (which is where the “elbow” of the scree plot is), but the first of these groups consists of only one period, so that is not a satisfactory solution either. The only reliable piece of information in the graphs is the split between the 1950s and the 1960s.
Figure 4. VNC dendrogram on the basis of the collocate frequencies of many nouns
Figure 5. Scree plot corresponding to the VNC analysis of many nouns
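When each period is represented by a whole vector of collocate frequencies rather than by a single value, the same neighbor-only merging can be applied with a distance between vectors. The sketch below is an illustration under stated assumptions, not a reconstruction of the analysis behind Figures 4 and 5: it uses 1 minus the Pearson correlation as the distance and sums the counts of merged periods, both of which are choices made here for simplicity. The period labels and counts are invented, and the function names pearson and vnc_profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def vnc_profiles(labels, profiles):
    """Neighbor-only clustering of periods described by collocate count vectors.

    Assumptions: distance = 1 - Pearson correlation of neighboring profiles;
    a merged cluster is represented by the summed counts of its periods.
    """
    clusters = [([lab], list(p)) for lab, p in zip(labels, profiles)]
    merges = []
    while len(clusters) > 1:
        dists = [
            1 - pearson(clusters[i][1], clusters[i + 1][1])
            for i in range(len(clusters) - 1)
        ]
        i = min(range(len(dists)), key=dists.__getitem__)
        (llab, lvec), (rlab, rvec) = clusters[i], clusters[i + 1]
        merges.append((llab, rlab, dists[i]))
        pooled = [a + b for a, b in zip(lvec, rvec)]
        clusters[i:i + 2] = [(llab + rlab, pooled)]
    return merges


# Invented toy counts: four periods (rows) by four collocate nouns (columns).
periods = ["t1", "t2", "t3", "t4"]
profiles = [
    [10, 8, 3, 1],
    [10, 8, 3, 2],
    [2, 3, 9, 12],
    [1, 4, 10, 11],
]

profile_merges = vnc_profiles(periods, profiles)
for left, right, height in profile_merges:
    print("+".join(left), "|", "+".join(right), round(height, 3))
```

In this toy example the only large merge height separates t1 and t2 from t3 and t4, so a single split is the only structure the procedure finds, which is comparable in kind to the situation described above for many nouns.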
4. What is the role of different genres?
Patterns such as many a heart and many a businessman are genre-specific collocations. The COHA holds the sizes of its genres constant after the 1860s, which allows the researcher to track in which genres the many a noun construction is primarily used over time. As is apparent from the graph below, in which the darker bars represent earlier decades (1860s–1990s), the construction clearly shifts from primarily literary usage to an indicator of journalistic style.
Figure 6. Frequencies of the many a noun construction in different genres