[logaling-commit] logaling/logaling-command [master] Add edict dictionary importer

null+****@clear*****
Tue Jan 24 18:09:15 JST 2012


SHIMADA Koji	2012-01-24 18:09:15 +0900 (Tue, 24 Jan 2012)

  New Revision: d107fb8e042d0f13daa370856bc7cbb3aa13519e

  Merged 6e6479f: Merge pull request #22 from logaling/import-edict

  Log:
    Add edict dictionary importer

  Added files:
    lib/logaling/external_glossaries/edict.rb

  Added: lib/logaling/external_glossaries/edict.rb (+50 -0) 100644
===================================================================
--- /dev/null
+++ lib/logaling/external_glossaries/edict.rb    2012-01-24 18:09:15 +0900 (6e2fd0d)
@@ -0,0 +1,50 @@
+# Copyright (C) 2012  Koji SHIMADA <koji.****@enish*****>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+require 'open-uri'
+require 'zlib'
+require 'stringio'
+
+module Logaling
+  class Edict < ExternalGlossary
+    description     'The EDICT Dictionary File (http://www.csse.monash.edu.au/~jwb/edict.html)'
+    source_language 'ja'
+    target_language 'en'
+    output_format   'csv'
+
+    private
+    def convert_to_csv(csv)
+      puts "downloading edict file..."
+      url = 'http://ftp.monash.edu.au/pub/nihongo/edict.gz'
+      Zlib::GzipReader.open(open(url)) do |gz|
+        puts "importing edict file..."
+
+        lines = StringIO.new(gz.read).each_line
+
+        lines.next # skip header
+
+        preprocessed_lines = lines.map do |line|
+          line.encode("UTF-8", "EUC-JP").chomp
+        end
+
+        preprocessed_lines.each do |line|
+          source, target = line.split('/', 2)
+          source = source.strip
+          csv << [source, target]
+        end
+      end
+    end
+  end
+end
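
The per-line handling in convert_to_csv above can be tried without downloading the dictionary. The following is a minimal sketch, not part of the commit: split_edict_line is a hypothetical helper name, the sample entry is made up, and drop(1) stands in for the header skip that the commit does with lines.next.

```ruby
require 'stringio'

# Hypothetical helper (not in the commit): split one decoded EDICT line
# into a [source, target] pair at the first '/', as the importer loop does.
def split_edict_line(line)
  source, target = line.split('/', 2)
  [source.strip, target]
end

# Made-up sample data: one header line and one entry ("cat"),
# converted to raw EUC-JP bytes to imitate the content of edict.gz.
utf8_sample = "header line/ignored/\n\u732B [\u306D\u3053] /(n) cat/\n"
raw = utf8_sample.encode("EUC-JP").force_encoding(Encoding::BINARY)

# Skip the header, decode each line from EUC-JP to UTF-8, then split it.
rows = StringIO.new(raw).each_line.drop(1).map do |line|
  split_edict_line(line.encode("UTF-8", "EUC-JP").chomp)
end

rows.each { |source, target| puts "#{source} => #{target}" }
```

Because the split is limited to two parts, everything after the first '/' stays in the target column, including the trailing '/' and any further '/'-separated glosses.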
