null+****@clear*****
Tue Jan 24 18:09:15 JST 2012
SHIMADA Koji	2012-01-24 18:09:15 +0900 (Tue, 24 Jan 2012)

  New Revision: d107fb8e042d0f13daa370856bc7cbb3aa13519e

  Merged 6e6479f: Merge pull request #22 from logaling/import-edict

  Log:
    Add edict dictionary importer

  Added files:
    lib/logaling/external_glossaries/edict.rb

  Added: lib/logaling/external_glossaries/edict.rb (+50 -0) 100644
===================================================================
--- /dev/null
+++ lib/logaling/external_glossaries/edict.rb    2012-01-24 18:09:15 +0900 (6e2fd0d)
@@ -0,0 +1,50 @@
+# Copyright (C) 2012 Koji SHIMADA <koji.****@enish*****>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+require 'open-uri'
+require 'zlib'
+require 'stringio'
+
+module Logaling
+  class Edict < ExternalGlossary
+    description 'The EDICT Dictionary File (http://www.csse.monash.edu.au/~jwb/edict.html)'
+    source_language 'ja'
+    target_language 'en'
+    output_format 'csv'
+
+    private
+    def convert_to_csv(csv)
+      puts "downloading edict file..."
+      url = 'http://ftp.monash.edu.au/pub/nihongo/edict.gz'
+      Zlib::GzipReader.open(open(url)) do |gz|
+        puts "importing edict file..."
+
+        lines = StringIO.new(gz.read).each_line
+
+        lines.next # skip header
+
+        preprocessed_lines = lines.map do |line|
+          line.encode("UTF-8", "EUC-JP").chomp
+        end
+
+        preprocessed_lines.each do |line|
+          source, target = line.split('/', 2)
+          source = source.strip
+          csv << [source, target]
+        end
+      end
+    end
+  end
+end
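
For readers following the per-line handling in convert_to_csv, the sketch below isolates that step: transcode one EUC-JP line to UTF-8, then split on the first '/' into a source term and a target string. The helper name parse_edict_line and the sample entry are hypothetical and not part of this commit; the sample line is only assumed to resemble a typical EDICT entry.

```ruby
# Hypothetical helper (not in the commit): mirrors the two blocks inside
# convert_to_csv for a single raw line read from the EDICT file.
def parse_edict_line(raw_line)
  # Interpret the bytes as EUC-JP and transcode to UTF-8, dropping the newline.
  line = raw_line.encode("UTF-8", "EUC-JP").chomp
  # Split only on the first '/', so later slashes stay inside the target.
  source, target = line.split('/', 2)
  [source.strip, target]
end

# Illustrative sample entry, built from a UTF-8 literal and converted to EUC-JP
# to stand in for a line of the downloaded file.
sample = "日本語 [にほんご] /(n) Japanese (language)/\n".encode("EUC-JP")
pair = parse_edict_line(sample)
```

Because the split limit is 2, the target column keeps any remaining '/' separators, including the trailing one, exactly as they appear in the line.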