Corpus Linguistics

What is it for?

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language territory.

In language learning, a corpus works as a database of the words taught in each class. Once the class texts are collected into a corpus, the teacher can identify which words deserve the most focus. Instead of rereading every article taught in each session before the final exam, the teacher can go straight to the corpus database and find the words that the students really need for their field of study.


Meet My Student, Sally

She has been fed a series of reading materials: articles, fairy tales, and more.

We have read a lot

Sally and I have read 20 children's short stories, most of which are classic stories from which every child is expected to learn key new words.

What a surprise!

How can I know which words she is most likely to memorize? Which words are repeated most often in the short stories she has read so far? What vocabulary list should I make and check with her?


So, I put all the texts into Microsoft Word and save each story as a .txt file.

More Tools

Then, I import them into word concordance software (the same one that dictionary makers use). This tool generates a word list with rank and frequency for me.
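For readers who do not have a concordance tool at hand, the rank-and-frequency word list can be approximated in a few lines of Python. This is only a rough sketch, not what the concordance software does internally: the function names `tokenize` and `word_list` are made up for illustration, the two sample sentences are placeholders rather than Sally's actual stories, and the tokenizer simply treats runs of letters (with optional apostrophes) as words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (letters, inner apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def word_list(texts):
    """Return (rank, word, frequency) rows, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


# Placeholder texts; in practice each string would be one story's .txt content.
stories = [
    "The little pig built a house of straw.",
    "The wolf blew the house down.",
]

for rank, word, freq in word_list(stories):
    print(rank, word, freq)
```

To run it on real files, each .txt story could be read with `open(path, encoding="utf-8").read()` and the resulting strings passed to `word_list`.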



Simple Database in Excel

The simple database is here. The sample covers words in the S group only.

View the Database


Power BI

Now, Power BI creates a report on the word counts, frequency, and popularity of the vocabulary she has read so far across all 20 stories.

View Report in Full Scale

And here it is

Now you know the most frequent words and the most challenging words for Sally. We can practice with her on the S group, which is the group that contains the most unique words extracted from the reading materials. Let her read. Then, we collect data. Finally, we report the words that she has to master as of now.

The sample here is for the S group only. These words appear only once in the stories, and each has more than 10 letters. So, these words are the most challenging for Sally to memorize and reuse in the future.
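The selection rule described above (words that appear once and have more than 10 letters) is easy to express as a filter. The sketch below is an illustration under assumptions: it reads "S group" as words beginning with the letter S, the function name `challenging_words` is hypothetical, and the counts are invented sample values, not figures from Sally's report.

```python
def challenging_words(counts, first_letter="s", min_letters=11, max_freq=1):
    """Pick rare, long words from a {word: frequency} mapping.

    Assumes the "S group" means words starting with `first_letter`.
    "More than 10 letters" is expressed as at least 11 letters.
    """
    return sorted(
        word
        for word, freq in counts.items()
        if word.startswith(first_letter)
        and len(word) >= min_letters
        and freq <= max_freq
    )


# Invented sample counts for illustration only.
sample_counts = {
    "the": 40,
    "said": 12,
    "stepmother": 1,     # 10 letters: too short
    "sorrowfully": 1,    # 11 letters, appears once: kept
    "straightaway": 1,   # 12 letters, appears once: kept
    "schoolmaster": 2,   # appears twice: dropped
    "grandmother": 1,    # not in the S group: dropped
}

print(challenging_words(sample_counts))
```

The thresholds are parameters, so the same filter could be reused for another letter group or a different length cutoff.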

Thus, the teacher stays focused on the right words for Sally. No guesswork anymore.